LoyceV (OP)
Legendary
Offline
Activity: 3430
Merit: 17344
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
August 01, 2020, 09:05:46 AM Last edit: April 29, 2024, 06:54:09 AM by LoyceV Merited by pooya87 (8), Welsh (8), bitmover (4), hosseinimr93 (2), marlboroza (2), BTCW (2), vapourminer (1), seoincorporation (1), ABCbits (1), NotATether (1), friends1980 (1), MrFreeDragon (1), naufragus (1) |
|
Background
To follow up on List of all Bitcoin addresses with a balance and this post, I made a list of all Bitcoin addresses that have ever been used.

The data
See alladdresses.loyce.club (new location)
I now have the resources (RAM, CPU power and disk space) and code to show unique addresses in their original order. Each address is only shown once. I have 2 large files:

1. All Bitcoin addresses ever used, in chronological order, without duplicates. Sample: all_Bitcoin_addresses_ever_used_in_order_of_first_appearance.txt.gz (Warning: 33 GB):
1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa
12c6DSiU4Rq3P4ZxziKxzrL5LmMBrzjrJX
1HLoD9E4SDFFPDiYfNYnkBLQ85Y51J3Zb1
.......
3GFfFQAFgXKiA1qqUK6rqBpEpG4vZDos6t
3Mbtv47gZ2eN6Fy7owpgHHwSLYHS42P56P
38JyF2RQknBUMETyRT2yGndDJFYSp6hJNg

2. All Bitcoin addresses ever used, sorted by address, without duplicates. Sample: all_Bitcoin_addresses_ever_used_sorted.txt.gz (Warning: 29 GB):
1111111111111111111114oLvT2
111111111111111111112BEH2ro
111111111111111111112xT3273
.......
s-ffd80dee5966fb23c1a483b28f6bfcbc
s-fff5d0faa9628c188e97661f0e185fce
s-ffff291613d413b4ac128df96a462294

Updates
Updates happen on Tuesday! Sorting a list that doesn't fit in the server's RAM is slow. Therefore I only do weekly updates (for now). Check the file date here to see how old it is. If an update fails, please post here.
In between updates, I create daily updates: alladdresses.loyce.club/daily_updates/. These txt-files contain unique addresses (for that day) in order of appearance.
I won't keep older snapshots.

Bandwidth
This server should have enough bandwidth to support all my blockchain data projects. If things get crazy, I may have to resort to using torrents.

Credits
Blockchair Database Dumps has a staggering amount of data, easily accessible (at 10 kB/s (or recently 100 kB/s)) with daily updates. All data presented in this topic comes from Blockchair.

No spam please. Self-moderated against spam. Discussion and questions are welcome.
Q&A
Can you please clarify, what is the type of these d- and s- addresses?
This is how Blockchair.com shows OP_RETURN. From the main page the search field doesn't show them, but you can replace a Bitcoin address in the URL to find them: https://blockchair.com/bitcoin/address/d-d0d953f2e7043342540a1407243e49fe.

Tips and tricks
Some suggestions for Linux/VPS users:
wget http://alladdresses.loyce.club/addresses_sorted.txt.gz -O - | gunzip > addresses_sorted.txt
This doesn't save the .gz but extracts it while downloading.
comm -12 <(sort list.txt) addresses_sorted.txt
This outputs all Bitcoin addresses from "list.txt" that have ever been funded.
comm -12 <(sort list.txt) addresses_sorted.txt > output.txt
This does the same, but writes to output.txt instead of console. This search is fast, even with millions of addresses in list.txt, it's mainly limited by how fast your computer can read from disk.
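The comm lookup from the tips above can be tried on a small scale first. The following is a minimal sketch, not the author's setup: the file names, the temporary directory and the placeholder address starting with 1NotInTheList are made up for illustration. It assumes the big list is sorted byte-wise, which is why LC_ALL=C is set so that sort and comm agree on the ordering. It writes the sorted candidates to a file instead of using <( ), so it also runs in shells without process substitution.

```shell
#!/bin/sh
# Sketch: check a small candidate list against a sorted list of used addresses.
# All file names and sample data are invented for illustration.
set -e
export LC_ALL=C   # byte-wise ordering; sort and comm must agree on the collation

work=$(mktemp -d)

# Stand-in for the (much larger) sorted download.
printf '%s\n' \
  '1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa' \
  '12c6DSiU4Rq3P4ZxziKxzrL5LmMBrzjrJX' \
  '1HLoD9E4SDFFPDiYfNYnkBLQ85Y51J3Zb1' | sort > "$work/addresses_sorted.txt"

# Candidates to look up: unsorted, including one placeholder that is not in the list.
printf '%s\n' \
  '1HLoD9E4SDFFPDiYfNYnkBLQ85Y51J3Zb1' \
  '1NotInTheListxxxxxxxxxxxxxxxxxxxxx' \
  '1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa' > "$work/list.txt"

sort -u "$work/list.txt" > "$work/list_sorted.txt"

# -12 suppresses lines unique to file 1 and file 2, leaving only the common lines.
comm -12 "$work/list_sorted.txt" "$work/addresses_sorted.txt" > "$work/output.txt"
cat "$work/output.txt"
```

If comm reports that an input is not in sorted order on the real file, the two files were probably sorted under different locales; re-sorting the smaller one under the same locale as the big list is the cheaper fix.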
Related topics
Bitcoin block data available in CSV format
List of all Bitcoin addresses with a balance
List of all Bitcoin addresses ever used
[~500 GB] Bitcoin block data: inputs, outputs and transactions
[800 GB] Ethereum data
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3430
Merit: 17344
Thick-Skinned Gang Leader and Golden Feather 2021
|
Some interesting (?) statistics (updated until blockchair_bitcoin_outputs_20200719.tsv.gz)

Total address count: 1,484,589,749
1... address count: 1,039,899,708
3... address count: 343,485,961
bc1q... address count: 55,006,904
...-... (with a "dash") address count: 46,197,161

Unique address count: 693,180,830
1... address count: 470,943,308
3... address count: 167,941,821
bc1q... address count: 39,137,878
...-... (with a "dash") weird address count: 15,157,808

Addresses with most receiving transactions
This is the Top 100, the number in front of the address shows how many transactions it has received:
4467608 1HckjUpRGcrrRAtFaaCAUaGjsPx9oYmLaZ
1900428 1NxaBCFQwejSZbQfWcYNwgqML5wWoE3rK4
1601193 1dice8EMZmqKvrGE4Qc9bUFf9PX3xaYDp
1527471 1FoWyxwPXuj4C6abqwhjDWdz6D4PZgYRjA
1204787 1LuckyR1fFHEsXYyx5QK4UFzv3PEAepPMK
1105406 1dice97ECuByXAvqXpaYzSaQuPVvrtmz6
1021575 3CD1QW6fjgTwKq3Pj97nty28WZAVkziNom
1009836 1G47mSr3oANXMafVrR8UC4pzV7FEAzo3r9
929737 3JXRVxhrk2o9f4w3cQchBLwUeegJBj6BEp
872274 1J37CY8hcdUXQ1KfBhMCsUVafa8XjDsdCn
859422 3422VtS7UtCvXYxoXMVp6eZupR252z85oC
841967 168o1kqNquEJeR9vosUB5fw4eAwcVAgh8P
832807 1P9RQEr2XeE3PEb44ZE35sfZRRW1JHU8qx
782811 1VayNert3x1KzbpzMGt2qdqrAThiRovi8
689574 37Tm3Qz8Zw2VJrheUUhArDAoq58S6YrS3g
676674 1DUb2YYbQA1jjaNYzVXLZ7ZioEhLXtbUru
663458 bc1qwqdg6squsna38e46795at95yu9atm8azzmyvckulcc7kytlcckxswvvzej
631610 17kb7c9ndg7ioSuzMWEHWECdEVUegNkcGc
595853 1dice9wcMu5hLF4g81u8nioL5mmSHTApw
580565 1Po1oWkD2LmodfkBYiAktwh76vkF93LKnh
573787 1LAnF8h3qMGx3TSwNUHVneBZUEpwE4gu3D
520889 1NDyJtNTjmwk5xPNhjgAMu4HDHigtobu1s
505956 13vHWR3iLsHeYwT42RnuKYNBoVPrKKZgRv
448252 1Fi9J5TeaWPHdU5cTJ4e9jr3V58SrWtUuT
437634 1dice7fUkz5h4z2wPc1wLMPWgB5mDwKDx
406471 1MPxhNkSzeTNTHSZAibMaS8HS1esmUL1ne
395663 1dice7W2AicHosf5EL3GFDUVga7TgtPFn
394249 1LuckyY9fRzcJre7aou7ZhWVXktxjjBb9S
389038 1D5bPm1YAdn9WvAAixht7PbACU3TtkqtJJ
376310 17A16QmavnUfCW11DAApiJxp7ARnxN5pGX
364311 3HNSiAq7wFDaPsYDcUxNSRMD78qVcYKicw
363898 3MfN5to5K5be2RupWE8rjJHQ6V9L8ypWeh
357641 3HRZjedwF2AJejNTtgznWnas4E6froNP5r
354691 1LuckyG4tMMZf64j6ea7JhCz7sDpk6vdcS
346986 366Dgw4pi3rnvu5zizVWZF6nijWxZWc6RA
341430 1dice6YgEVBf88erBFra9BHf6ZMoyvG88
326839 d-d0d953f2e7043342540a1407243e49fe
325099 38jMiiZs2C5n5MPkyc5pSA7wwW6H4p6hPa
293567 38ENmTr2AD1avJrmmi9iM7PfS6nZVmuMKf
289070 d-0e9deef32abfc454392d21725f9defef
285507 1N52wHoVR79PMDishab2XmRHsbekCdGquK
282321 3PUuiYu5cFMsagkffArrKZzQFtWdHttU3x
280691 367f4YWz1VCFaqBqwbTrzwi2b1h2U3w1AF
280107 1FoxBitjXcBeZUS4eDzPZ7b124q3N7QJK7
262539 d-73fd8c31c9fc1d084f44b301bb7adb6a
262317 1Fi57hAqyYYwaQVdA7a9qSKfiukBbt31G3
253795 1K2SXgApmo9uZoyahvsbSanpVWbzZWVVMF
252344 1dice5wwEZT2u6ESAdUGG6MHgCpbQqZiy
251282 3JnFBLxDCutY3bZEZsPTkHAaUA1bxmEMX2
250862 1diceDCd27Cc22HV3qPNZKwGnZ8QwhLTc
247797 352zT3Ts9piSDhZpBsDoZMvdtDmJioQNBo
246472 12JYmnfYU2ghzjwUAspzJsSnmJtK9bZPYR
243955 1x6YnuBVeeE65dQRZztRWgUPwyBjHCA5g
240428 3A4U175prUGEn3B1gUDkz32u8fnF9Nx3Ly
232303 357d4rAjQhDPaWhZrBAFY7aizVPkNSq2DH
230290 18rdKmjrg1EawxgiVT3ikLExj6GWS2MNCk
229128 3JjPf13Rd8g6WAyvg8yiPnrsdjJt1NP4FC
226837 1HWqsgnSd12Gv8SpoUMi1Cj8hp79BTSpW7
226259 1changemCPo732F6oYUyhbyGtFcNVjprq
224451 138o15eFWEEPv2ayKW2CZCgVvv5ZaZvomP
224217 d-752ed0099932a96fbc0a854a4d3a300f
219697 bc1qnsupj8eqya02nm8v6tmk93zslu2e2z8chlmcej
219174 s-e3b0c44298fc1c149afbf4c8996fb924
215870 1Kr6QSydW9bFQG1mXiPNNu6WpJGmUa9i1g
215691 37p9pUugydmoLpQyFLLqGAgjWmUFERa1Pq
215520 19iVyH1qUxgywY8LJSbpV4VavjZmyuEyxV
212059 1dice7EYzJag7SxkdKXLr8Jn14WUb3Cf1
209001 1F89hmmrtonJfAQNAqDmeDadcw7AsZcvXG
207701 1NDpZ2wyFekVezssSXv2tmQgmxcoHMUJ7u
207697 1Bd5wrFxHYRkk4UCFttcPNMYzqJnQKfXUE
207524 15fXdTyFL1p53qQ8NkrjBqPUbPWvWmZ3G9
207499 14719bzrTyMvEPcr7ouv9R8utncL9fKJyf
207424 18uvwkMJsg9cxFEd1QDFgQpoeXWmmSnqSs
207385 1J4yuJFqozxLWTvnExR4Xxe9W4B89kaukY
207376 1Bqm5MDo82m1FTxV3qYNUUEKnESPRhk9jd
207256 1HVpyjYEPwQhvRQ3dL8tGe9kiydti616sX
207228 17NKcZNXqAbxWsTwB1UJHjc9mQG3yjGALA
207218 1HjDauL2kth6KJUz5vX198Nvp1xN1hgYRb
207187 13h1DP2Boo9TAsenphroACxhNy7pGxDYXd
207138 1MSzmVTBaaSpKDARK3VGvP8v7aCtwZ9zbw
207053 1GoK6fv4tZKXFiWL9NuHiwcwsi8JAFiwGK
207006 13HFqPr9Ceh2aBvcjxNdUycHuFG7PReGH4
206834 1L4EThM6x3Rd2PjNbs1U136FpMq4Gmo3fJ
206826 14ChPPM8rPYJeHnw6kMVUDnNNKx1KnjYW4
206808 1AdN2my8NxvGcisPGYeQTAKdWJuUzNkQxG
206760 1DpsR91YmHUDTtiuH1pPCuG3RqAkmg6YKB
206707 1PeohaRGaTF8cSzDqP1yYfzDah66xiriEQ
206664 1JmcV7G3r8k7ev2EkS84MmsvxGyhiRGP84
206572 1HZHBnH2FbHNWieMxAh4xBPfgfuxW15UPt
206469 18czPiA9PcCs7rFTBZnhvNAWuh1pEZRpGJ
206346 12Cf6nCcRtKERh9cQm3Z29c9MWvQuFSxvT
206344 1MPerpQzTABa1K2eXQxsQTDSZtDQHWf6vk
206247 1dice1e6pdhLzzWQq7yMidf6j8eAg7pkY
206243 18XSLnBZ8ydMUkaifU6sQBMJzmm7JvDeUp
205690 bc1quq29mutxkgxmjfdr7ayj3zd9ad0ld5mrhh89l2
203334 3QQB6AWxaga6wTs6Xwq8FYppgrGinGu15f
201993 3M92sq9ssFaNbEwF47uteVKJsbw125juS7
199135 1AScRhqdXMrJyxNmjEapMZi1PLFsqmLquG
196271 18p9Ftp3m4435tdpZTvoBsm3yjUgkvTF2b
193271 33fDiKKhr2F2uRv2jJzdKT3ECuK3wzCq5d
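Statistics of this kind can be reproduced with standard tools once there is a file with one address per received output. The sketch below is an illustration only, not the pipeline used for the numbers above: the input file, the helper name count_by_type and the sample data are hypothetical, and the real input would first have to be cut out of the Blockchair TSV dumps.

```shell
#!/bin/sh
# Sketch: per-type counts and a "most received" ranking from a list with
# one address per received output. Sample data and names are invented.
set -e
export LC_ALL=C
work=$(mktemp -d)

printf '%s\n' \
  '1dice8EMZmqKvrGE4Qc9bUFf9PX3xaYDp' \
  '3CD1QW6fjgTwKq3Pj97nty28WZAVkziNom' \
  '1dice8EMZmqKvrGE4Qc9bUFf9PX3xaYDp' \
  'bc1qnsupj8eqya02nm8v6tmk93zslu2e2z8chlmcej' \
  'd-d0d953f2e7043342540a1407243e49fe' \
  '3CD1QW6fjgTwKq3Pj97nty28WZAVkziNom' \
  '1dice8EMZmqKvrGE4Qc9bUFf9PX3xaYDp' > "$work/outputs.txt"

# Count addresses per type, keyed on the prefixes used in the post.
count_by_type() {
  awk '
    /^bc1q/ { n["bc1q"]++; next }
    /-/     { n["dash"]++; next }
    /^1/    { n["1"]++;    next }
    /^3/    { n["3"]++;    next }
    END { printf "1=%d 3=%d bc1q=%d dash=%d\n", n["1"], n["3"], n["bc1q"], n["dash"] }
  ' "$@"
}

# "Total" counts every line; "unique" counts each address once.
total=$(count_by_type "$work/outputs.txt")
sort -u "$work/outputs.txt" > "$work/unique.txt"
unique=$(count_by_type "$work/unique.txt")
echo "total:  $total"
echo "unique: $unique"

# Ranking: group duplicates, count them, sort by count (highest first).
top=$(sort "$work/outputs.txt" | uniq -c | sort -rn | head -n 2 | awk '{ print $1, $2 }')
echo "$top"
```

On the full data set the sort steps are the expensive part, so the same caveats about RAM and temporary disk space discussed later in the thread apply.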
|
|
|
|
MrFreeDragon
|
|
August 17, 2020, 05:56:46 PM |
|
Very interesting statistics, thank you! -snip- Addresses with most receiving transactions This is the Top 100, the number in front of the address shows how many transactions it has received: -snip- 326839 d-d0d953f2e7043342540a1407243e49fe ... 289070 d-0e9deef32abfc454392d21725f9defef ... 262539 d-73fd8c31c9fc1d084f44b301bb7adb6a ... 224217 d-752ed0099932a96fbc0a854a4d3a300f ... 219174 s-e3b0c44298fc1c149afbf4c8996fb924 -snip-
Can you please clarify, what is the type of these d- and s- addresses?
|
|
|
|
Casdinyard
|
|
August 18, 2020, 09:12:54 AM |
|
~
Can you also scrape all the Bitcoin addresses used here on the forum, and the users who posted them? Yes, some users will have used the same wallet because they are just alts of someone (although it takes a lot of investigation to prove that). I think it would help with labeling users and alt accounts throughout the entire forum, and would make it easier to detect which accounts are linked to each other and which are breaking campaign rules or even forum rules (enrolling many accounts in a single bounty or sig campaign).
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3430
Merit: 17344
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
August 18, 2020, 09:43:57 AM |
|
Can you also scrape all the Bitcoin addresses used here on the forum, and the users who posted them?
I actually can. I found this regexp on Stack Overflow:
egrep --regexp="^[13][a-km-zA-HJ-NP-Z1-9]{25,34}$" filename
With some slight changes it stops matching parts of Eth-addresses:
egrep -w --regexp="[13][a-km-zA-HJ-NP-Z1-9]{25,34}" *
I could run this code on 53 million archived posts, but the main problem will be excluding quotes. That's annoying and slow to do, and if I don't exclude them, it will completely mess up the data. On the other hand, quotes may still contain information that was deleted by the user who posted it. Even without quotes, users still post Bitcoin addresses that aren't theirs, for instance when providing evidence on a scammer.

I think it would help labeling the users and alt accounts throughout the entire forum, and would make it easier to detect which accounts are linked to each other
A smart user would simply use different addresses. An even smarter user would use different wallets, so they don't create a blockchain trail when they make a payment.

As a quick test, 51 out of 9999 posts contain at least one Bitcoin address (starting with 1 or 3, ignoring Bech32). For now I won't continue this search. If I ever do, I'll move this discussion to Reputation.
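To see why the -w variant of the regexp matters, the two forms can be compared on a few lines of sample text. This is a small sketch under stated assumptions: the sample posts and the Ethereum-style hex string are invented, grep -E is used in place of egrep, and -o is added so only the matched text is printed. Note that the pattern only checks the shape of an address, not its checksum.

```shell
#!/bin/sh
# Sketch: extract legacy-looking Bitcoin addresses from text, with and
# without word matching. The sample posts are invented for illustration.
set -e
work=$(mktemp -d)

cat > "$work/posts.txt" <<'EOF'
Please send the payment to 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa, thanks!
My ETH address: 0x1a2b3c4d5e6f7a8b9c1d2e3f4a5b6c7d8e9f1a2b
Escrow is 3J98t1WpEZ73CNmQviecrnyiWrnqRhWNLy for this deal.
EOF

pattern='[13][a-km-zA-HJ-NP-Z1-9]{25,34}'

# Without -w the pattern also matches a 35-character slice of the hex string.
loose=$(grep -oE "$pattern" "$work/posts.txt" || true)

# With -w a match must be delimited by non-word characters on both sides,
# so nothing inside the 42-character hex string qualifies.
strict=$(grep -owE "$pattern" "$work/posts.txt" || true)

echo "loose matches:"
echo "$loose"
echo "strict matches:"
echo "$strict"
```

Excluding quoted text, which the post names as the main problem, is not addressed here; it depends on how the archived posts mark quotes.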
|
|
|
|
Casdinyard
|
|
August 18, 2020, 11:01:17 AM |
|
Can you also scrape all the Bitcoin addresses used here on the forum, and the users who posted them?
I actually can. I found this regexp on Stack Overflow:
egrep --regexp="^[13][a-km-zA-HJ-NP-Z1-9]{25,34}$" filename
With some slight changes it stops matching parts of Eth-addresses:
egrep -w --regexp="[13][a-km-zA-HJ-NP-Z1-9]{25,34}" *
I could run this code on 53 million archived posts, but the main problem will be excluding quotes. That's annoying and slow to do, and if I don't exclude them, it will completely mess up the data. On the other hand, quotes may still contain information that was deleted by the user who posted it. Even without quotes, users still post Bitcoin addresses that aren't theirs, for instance when providing evidence on a scammer.

I think it would be possible if and only if you scraped the following boards:
- Services
- Bounties
- Marketplace in general (both BTC and Alt)
- And Marketplaces of all local boards if applicable/available
With that, addresses posted as evidence of a scam wouldn't be a problem. And yes, it would be hard, especially if threads or posts were deleted. But that shouldn't matter as long as the list simply serves as a reference of which users have used or mentioned which addresses throughout their post history.

I think it would help labeling the users and alt accounts throughout the entire forum, and would make it easier to detect which accounts are linked to each other
A smart user would simply use different addresses. An even smarter user would use different wallets, so they don't create a blockchain trail when they make a payment. As a quick test, 51 out of 9999 posts contain at least one Bitcoin address (starting with 1 or 3, ignoring Bech32). For now I won't continue this search. If I ever do, I'll move this discussion to Reputation.

I'm looking forward to making it happen. I may have already mentioned my project to build an app (a BPIP ripoff); such data would be helpful for it. I'm still in the planning stage, deciding what to do first, and with all the data you've already scraped, I could do less scraping myself and instead build an API that just looks up your data.
|
|
|
|
TryNinja
Legendary
Offline
Activity: 2954
Merit: 7378
|
|
August 18, 2020, 11:31:28 AM |
|
I could run this code on 53 million archived posts, but the main problem will be excluding quotes. That's annoying and slow to do, and if I don't exclude them, it will completely mess up the data. On the other hand, quotes may still contain information that was deleted by the user who posted it. Even without quotes, users still post Bitcoin addresses that aren't theirs, for instance when providing evidence on a scammer. This is planned for my post archive. I had done that but only with ETH addresses and the 15m posts you sent me + the new scraped one. I plan to scan all old posts + new ones for ETH and BTC addresses after everything is working fine (new bot + full database with the whole post archive).
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3430
Merit: 17344
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
August 18, 2020, 11:36:26 AM |
|
This is planned for my post archive. I had done that but only with ETH addresses and the 15m posts you sent me + the new scraped one.
Great, saves me the trouble. Can I request a CSV of all the results? That makes it so much easier to use all data than getting them per address through your site. Just something with (at least) "address,userID,msgID" would be great for further analysis.

I'm still in the planning stage, deciding what to do first, and with all the data you've already scraped, I could do less scraping myself and instead build an API that just looks up your data.
I can get you a copy of all archived posts like I gave TryNinja if it helps. It beats scraping the forum again, although I didn't keep track of board names per topic.
|
|
|
|
TryNinja
Legendary
Offline
Activity: 2954
Merit: 7378
|
|
August 18, 2020, 11:41:10 AM |
|
Great, saves me the trouble Can I request a CSV of all the results? That makes it so much easier to use all data than getting them per address through your site. Just something with (at least) "address,userID,msgID" would be great for further analysis. Of course. Once in the database, it's pretty easy to export them to the format I want.
|
|
|
|
BTCW
Copper Member
Full Member
Offline
Activity: 193
Merit: 244
Click "+Merit" top-right corner
|
|
August 19, 2020, 11:03:18 PM Last edit: August 20, 2020, 08:12:25 AM by BTCW Merited by LoyceV (6), MrFreeDragon (2) |
|
This is a wonderful initiative!
A comment: Sorting a very large list with little RAM is not necessarily a problem! Try:
mkdir tmp
cat unsorted.txt | sort -u -S 65% -T tmp > sorted.txt
rm -r tmp
-S will tell your machine to use at most 65% CPU; this is some sort of optimum, according to my experience
-T puts temporary files in a directory (here named "tmp") and not in RAM; if you have an SSD, the speed isn't too shabby
I have sorted huge lists (>80 GB) on budget laptops using these two arguments. Worth a shot!
If you want better hosting, PM me.
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3430
Merit: 17344
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
August 20, 2020, 09:08:20 AM |
|
cat unsorted.txt | sort -u -S 65% -T tmp > sorted.txt
I'm already using "sort", which uses /tmp by default. I'll try "sort -u" though, it might need less temporary storage than "sort | uniq". The next update is scheduled for tomorrow, I'll see how it performs.

-S will tell your machine to use at most 65% CPU
I think you mean RAM, not CPU. This VM has only 256 MB, so I'll let "sort" figure it out on its own.

-T puts temporary files in a directory (here named "tmp") and not in RAM; if you have an SSD, the speed isn't too shabby
That's default behaviour. It doesn't have an SSD though, and I'm using "cputool" to keep server load low. I'm okay without daily updates on this, I wouldn't want users to download this large file on a daily basis anyway.

I have sorted huge lists (>80 GB) on budget laptops using these two arguments. Worth a shot! If you want better hosting, PM me.
Since last year, I'm using an AWS server donated by suchmoon for loyce.club. However, since AWS charges $0.15/GB, I'm not comfortable hosting very large files on suchmoon's server. When I tested sorting data on AWS, it started throttling disk IO after a while, which made it very slow. I've also tested a pay-by-the-hour-VPS, and obviously it was a lot faster.

There's one thing on my wish list though: a method to show only unique addresses in order of appearance (without sorting them). It can be done with awk '!a[$0]++', but this requires a lot of memory and doesn't use temporary files.
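The awk one-liner mentioned at the end of this post works because an awk array entry springs into existence with value 0 the first time it is read. A tiny demonstration on invented data (the file names and the addrA/addrB/addrC placeholders are assumptions for illustration):

```shell
#!/bin/sh
# Sketch: remove duplicate lines while keeping the order of first appearance.
# The placeholder "addresses" are invented for illustration.
set -e
work=$(mktemp -d)

printf '%s\n' addrC addrA addrC addrB addrA > "$work/input.txt"

# seen[$0]++ evaluates to 0 (false) the first time a line is met and to a
# positive number afterwards; the leading ! therefore prints first sightings only.
awk '!seen[$0]++' "$work/input.txt" > "$work/output.txt"

cat "$work/output.txt"
```

Every distinct line stays in the array until awk exits, which is the memory cost described above: the array grows with the number of unique addresses, and awk has no way to spill it to disk.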
|
|
|
|
NotATether
Legendary
Offline
Activity: 1722
Merit: 7240
In memory of o_e_l_e_o
|
|
August 20, 2020, 12:40:56 PM |
|
-S will tell your machine to use at most 65% CPU
I think you mean RAM, not CPU. This VM has only 256 MB, so I'll let "sort" figure it out on its own.
That is correct, the argument to -S is the amount of memory for sort(1) to use for its main buffer (manpage source). With a percentage it should calculate the amount of memory to reserve. But I think even a 256MB buffer is too small for the size of the dataset you're sorting, it will hit the disk too much.

-T puts temporary files in a directory (here named "tmp") and not in RAM; if you have an SSD, the speed isn't too shabby
That's default behaviour. It doesn't have an SSD though, and I'm using "cputool" to keep server load low. I'm okay without daily updates on this, I wouldn't want users to download this large file on a daily basis anyway.
I have sorted huge lists (>80 GB) on budget laptops using these two arguments. Worth a shot! If you want better hosting, PM me.
Since last year, I'm using an AWS server donated by suchmoon for loyce.club. However, since AWS charges $0.15/GB, I'm not comfortable hosting very large files on suchmoon's server. When I tested sorting data on AWS, it started throttling disk IO after a while, which made it very slow. I've also tested a pay-by-the-hour-VPS, and obviously it was a lot faster.
That's strange because all AWS servers have an SSD configured as the boot disk. If you are sorting in a VM, then all that sorting is done in a virtual hard disk, so not only are you moving memory into temporary host SSD space, it's being moved inside a virtual disk file inside said SSD and that puts extra strain on your hypervisor's emulated disk controller. So, it's emulating all the disk controller calls that read and write data from the disk, updates disk cache and its other jobs while sort(1) moves data between its memory buffer in RAM and the hard disk (which is actually a file on your host).
And it's doing that for the entire 31GB of addresses, and the algorithm sort uses needs an O(n log(n)) space, which I calculate to be 310GB for your data. All this while running emulated disk writes and reads. On top of that there are the hardware-accelerated reads and writes that the host does for the VM to its disk file. That explains the poor performance while sorting. You'll have better disk performance if you sort outside of a VM.
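The sort options discussed in the last few posts can be exercised safely on a toy file before pointing them at the real data. A minimal sketch, assuming GNU sort (other implementations may treat -S differently); the 64M buffer, the directory layout and the sample lines are arbitrary choices for illustration:

```shell
#!/bin/sh
# Sketch: external sort with a capped memory buffer and an explicit
# directory for temporary files. Sizes and paths are illustrative only.
set -e
export LC_ALL=C
work=$(mktemp -d)
mkdir "$work/tmp"

printf '%s\n' b a c a b > "$work/unsorted.txt"

# -u  drop duplicate lines while sorting
# -S  size of the in-memory buffer (RAM, not CPU); also accepts a percentage
# -T  where sort writes its temporary runs once the buffer is full
sort -u -S 64M -T "$work/tmp" "$work/unsorted.txt" > "$work/sorted.txt"

cat "$work/sorted.txt"
```

Pointing -T at the fastest disk with enough free space is usually the setting that matters most for inputs far larger than RAM.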
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3430
Merit: 17344
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
August 20, 2020, 02:17:52 PM |
|
That's strange because all AWS servers have an SSD configured as the boot disk. I guess it wasn't clear that alladdresses.loyce.club:20319 doesn't run at AWS. It uses HDD. And it's doing that for the entire 31GB of addresses, and the algorithm sort uses needs an O(n log(n)) space, which I calculate to be 310GB for your data. It takes many hours while keeping server load low, but it really isn't a problem. If someone has enough RAM to experiment, I'd love to see the result of this (on the 31 GB file):
|
|
|
|
NotATether
Legendary
Offline
Activity: 1722
Merit: 7240
In memory of o_e_l_e_o
|
|
August 20, 2020, 09:50:01 PM |
|
@LoyceV how large is the uncompressed addresses.txt.gz? It is at least 200GB and counting and it's still extracting legacy addresses. I'm worried I may run out of disk space before it's all extracted. I have a 1TB quota. If you know how big is the uncompressed unique_addresses.txt.gz while you're at it that will be useful to know.
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3430
Merit: 17344
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
August 21, 2020, 08:54:28 AM Last edit: November 28, 2020, 02:33:15 PM by LoyceV |
|
@LoyceV how large is the uncompressed addresses.txt.gz? It gets around 50% larger, Bitcoin addresses don't compress very well.
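When disk space is tight, the uncompressed size can be measured by streaming the archive through a byte counter, so nothing is written to disk. The sketch below uses an invented sample file; on the real download the same two commands apply, they just take a long time. Counting the stream is preferred here over reading the size recorded in the archive, because the gzip trailer stores that size in a 32-bit field and cannot represent files above 4 GiB.

```shell
#!/bin/sh
# Sketch: measure compressed vs. uncompressed size without extracting to disk.
# The sample file is generated here and is invented for illustration.
set -e
work=$(mktemp -d)

# 1000 lines of 11 bytes each (10 characters plus a newline).
awk 'BEGIN { for (i = 0; i < 1000; i++) printf "addr%06d\n", i }' \
  | gzip > "$work/sample.txt.gz"

compressed=$(wc -c < "$work/sample.txt.gz" | tr -d ' ')
# gzip -dc decompresses to stdout; wc -c counts the bytes as they pass by.
uncompressed=$(gzip -dc "$work/sample.txt.gz" | wc -c | tr -d ' ')

echo "compressed:   $compressed bytes"
echo "uncompressed: $uncompressed bytes"
```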
|
|
|
|
NotATether
Legendary
Offline
Activity: 1722
Merit: 7240
In memory of o_e_l_e_o
|
|
August 21, 2020, 09:59:52 AM |
|
If someone has enough RAM to experiment, I'd love to see the result of this (on the 31 GB file):
I suggest instead of the awk one-liner you look at gz-sort, it is a small Linux program that sorts gzip-compressed files on disk while using a very small memory buffer, as low as 4 megabytes. You sort the file using
gz-sort -u addresses.txt.gz addresses_sorted.txt.gz
The -u switch removes duplicate lines from the sorted output, and you can increase the buffer size to give it a larger buffer for transporting stuff, but this isn't necessary. I used -S 1G to give it a 1 gigabyte buffer and it took around 7 hours to complete so not much shorter than the advertised completion time, 9 or 10 hours. So this program will run well in your VM, the RAM factor isn't important.
You need to compile it yourself using make but it has minimal dependencies, only zlib and GNU headers.
I used it to find the smallest address in the dump using
zcat addresses_sorted.txt.gz | head -n 55405 | uniq
This prints 1111111111111111111114oLvT2. This address was used 55405 times (!)
Here are some of the other smallest addresses:
1111111111111111111114oLvT2
111111111111111111112BEH2ro
111111111111111111112xT3273
1111111111111111111141MmnWZ
111111111111111111114ysyUW1
1111111111111111111184AqYnc
11111111111111111111BZbvjr
11111111111111111111CJawggc
11111111111111111111HV1eYjP
11111111111111111111HeBAGj
11111111111111111111QekFQw
11111111111111111111UpYBrS
11111111111111111111g4hiWR
11111111111111111111jGyPM8
11111111111111111111o9FmEC
11111111111111111111ufYVpS
111111111111111111121xzjPWX1
111111111111111111128gzo7iT
11111111111111111112AmVxQeF
11111111111111111112Fr3DURyz
11111111111111111112GvNtZ1K
11111111111111111112VUYD4wA
1111111111111111111313xyAwW
111111111111111111137vGPgFbT
11111111111111111113aT9ZSLG
111111111111111111168xDACCG
11111111111111111116B8w87yU
Maybe you can also make a list of addresses sorted by balance, now that you have an efficient way to deduplicate them.
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3430
Merit: 17344
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
August 21, 2020, 11:29:22 AM Last edit: November 28, 2020, 02:37:12 PM by LoyceV |
|
If someone has enough RAM to experiment, I'd love to see the result of this (on the 31 GB file): I suggest instead of the awk one-liner you look at gz-sort, it is a small linux program that sorts gzip-compressed files on disk while using a very small memory buffer, as low as 4 megabytes. I checked, but it does what I'm doing already. The awk-command removes duplicate lines without sorting the lines. I'd like to do it, but I can't run it. This prints 1111111111111111111114oLvT2. This address was used 55405 times (!) I'd be interested to see which real address is the shortest. The 111111111-addresses are all burn addresses. I'm not entirely sure what determines address length, but from what I've seen, shorter addresses are much harder to find. I've been looking for short addresses created from mini-private-keys, and they were quite rare. To find a real short address, it needs to have sent funds too. Maybe you can also make a list of addresses sorted by balance See List of all Bitcoin addresses with a balance.
|
|
|
|
naufragus
Newbie
Offline
Activity: 29
Merit: 50
|
I actually can I found this regexp on Stackoverflow: egrep --regexp="^[13][a-km-zA-HJ-NP-Z1-9]{25,34}$" filename With some slight changes it stops matching parts of Eth-addresses: egrep -w --regexp="[13][a-km-zA-HJ-NP-Z1-9]{25,34}" *
I have compiled these from various sources and use them to automatically set my blockchain explorer options based on user input, and also keep them in my .zshrc:

#cryptocurrency greps
#btc1 and btc2 combined
alias btcgrep="grep -Ee '\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b' -e '\bbc(0([ac-hj-np-z02-9]{39}|[ac-hj-np-z02-9]{59})|1[ac-hj-np-z02-9]{8,87})\b'"
#legacy addresses only
alias btcgrep1="grep -E '\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b'"
#http://mokagio.github.io/tech-journal/2014/11/21/regex-bitcoin.html
#bech32 v1 and v0 addresses
alias btcgrep2="grep -E '\bbc(0([ac-hj-np-z02-9]{39}|[ac-hj-np-z02-9]{59})|1[ac-hj-np-z02-9]{8,87})\b'"
#https://stackoverflow.com/questions/21683680/regex-to-match-bitcoin-addresses
#bech32 addresses only
alias btcgrep3="grep -E '\bbc1[ac-hj-np-zAC-HJ-NP-Z02-9]{11,71}\b'"
#both legacy and bech32
alias btcgrep4="grep -E '\b([13][a-km-zA-HJ-NP-Z1-9]{25,34}|bc1[ac-hj-np-zAC-HJ-NP-Z02-9]{11,71})\b'"
#http://mokagio.github.io/tech-journal/2014/11/21/regex-bitcoin.html
#private keys
alias btcgrep5="grep -E '\b[5KL][1-9A-HJ-NP-Za-km-z]{50,51}\b'"
#word boundary: '\b'
#https://bitcoin.stackexchange.com/questions/56737/how-can-i-find-a-bitcoin-private-key-that-i-saved-in-a-text-file
#transaction hashes
alias btcgrep6="grep -E '\b[a-fA-F0-9]{64}\b'"
#https://stackoverflow.com/questions/46255833/bitcoin-block-and-transaction-regex
#https://bitcoin.stackexchange.com/questions/70261/recognize-bitcoin-address-from-block-hash-and-transaction-hash
#block hashes
alias btcgrep7="grep -E '\b[0]{8}[a-fA-F0-9]{56}\b'"
#https://stackoverflow.com/questions/46255833/bitcoin-block-and-transaction-regex
#ethereum address hash
#test for 'plausibility'
alias ethgrep="grep -E '\b(0x)?[0-9a-fA-F]{40}\b'"
#https://ethereum.stackexchange.com/questions/1374/how-can-i-check-if-an-ethereum-address-is-valid
#ethereum transaction hash
alias ethgrep2="grep -E '\b(0x)?([A-Fa-f0-9]{64})\b'"
#parentheses are not necessary
#https://ethereum.stackexchange.com/questions/34285/what-is-the-regex-to-validate-an-ethereum-transaction-hash/34286

Flag -w is 'word boundary' and can also be set within the regex with '\b' at the ends.
Very good work on compiling those addresses, mate!
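Aliases like the ones above are normally only expanded in interactive shells, so inside a script the same patterns are easier to use as functions. The sketch below wraps the pattern from btcgrep4 in a function; the name btc_addresses and the sample text are invented here, -o is added so that only the matching text is printed, and \b relies on a grep that supports it as an extension (GNU grep does).

```shell
#!/bin/sh
# Sketch: the "both legacy and bech32" pattern as a shell function, so it
# can be used in scripts. Function name and sample input are hypothetical.
set -e

btc_addresses() {
  # Prints each match on its own line; never fails just because nothing matched.
  grep -oE '\b([13][a-km-zA-HJ-NP-Z1-9]{25,34}|bc1[ac-hj-np-zAC-HJ-NP-Z02-9]{11,71})\b' "$@" || true
}

found=$(printf '%s\n' \
  'pay bc1qnsupj8eqya02nm8v6tmk93zslu2e2z8chlmcej now' \
  'or use 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa instead' \
  'no address on this line' | btc_addresses)

echo "$found"
```

As with the aliases, a match only means the text has the right shape; validating the checksum needs a proper address decoder.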
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3430
Merit: 17344
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
August 24, 2020, 11:39:02 AM Last edit: August 24, 2020, 05:16:35 PM by LoyceV |
|
If someone has enough RAM to experiment, I'd love to see the result of this (on the 31 GB file):
This looks very promising:
cat -n input.txt | sort -uk2 | sort -nk1 | cut -f2- > output.txt
I'll be testing it soon.

Some results:
The awk-thing uses just over 1 GB memory for 10 million addresses. So for 1.5 billion addresses, a 256 GB server should be enough. At AWS, that would cost a few dollars per hour.
I've tested with the first 10 million lines, and can confirm both give the same result:
head -n 10000000 addresses.txt | awk '!a[$0]++' | md5sum
head -n 10000000 addresses.txt | nl | sort -uk2 | sort -nk1 | cut -f2 | md5sum
As expected, awk is faster.
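The pipeline above follows a number, sort, restore pattern: tag every line with its position, deduplicate on the address, then sort back by position and strip the tag. A small sketch that checks it against the awk version on invented data; it assumes a sort that keeps the first line of each run of equal keys under -u, as GNU sort does:

```shell
#!/bin/sh
# Sketch: order-preserving deduplication with sort instead of an in-memory
# awk array, verified against the awk one-liner. Sample data is invented.
set -e
export LC_ALL=C
work=$(mktemp -d)

printf '%s\n' addrC addrA addrC addrB addrA addrD > "$work/input.txt"

# Reference result: in-memory deduplication.
awk '!seen[$0]++' "$work/input.txt" > "$work/by_awk.txt"

# 1. cat -n      prefix each line with its line number and a tab
# 2. sort -uk2   keep one line per address (the earliest, with GNU sort)
# 3. sort -nk1   restore the original order using the line numbers
# 4. cut -f2-    drop the line numbers again
cat -n "$work/input.txt" | sort -uk2 | sort -nk1 | cut -f2- > "$work/by_sort.txt"

# cmp exits non-zero if the two results differ.
cmp "$work/by_awk.txt" "$work/by_sort.txt"
cat "$work/by_sort.txt"
```

The trade-off is the one described in the post: awk is faster but holds every unique address in RAM, while the sort-based version can spill to temporary files at the cost of two full sorts.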
|
|
|
|
seoincorporation
Legendary
Offline
Activity: 3276
Merit: 3066
|
|
August 24, 2020, 11:31:39 PM |
|
This is an awesome contribution to the community. Some weeks ago I saw a user asking for a list like this to run a brute-force attack... Some users use their address as a password, which is why a list like this is a great tool. Thanks again to LoyceV for making it for us.
|
|
|
|
|