1  Bitcoin / Project Development / Re: List of all Bitcoin addresses ever used - currently available on temp location on: January 21, 2021, 12:41:01 PM
I can think of another option that might work: if I use the sorted list to get the new addresses, I can get those out of the daily update while keeping the chronological order. This way I only have to deal with two 20 MB files which is easy. After this, all I have to do is add them to the total file.

I found a bit of time to write this. Testing it now..

The first results of yesterday's testing look promising. I should go back and double check if the outputs are correct, but they seem to be.

I created a VM with 2 cores/2 threads (so no hyperthreading, or whatever AMD's equivalent is called) of my Ryzen 3600 and 512 MB of RAM (just because Ubuntu Server, for which I had an ISO handy, wouldn't boot with 256 MB). To make the numbers mean anything, I first benchmarked your current setup:

Code:
time sort -mu <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) | gzip > test2.txt.gz
Loycev: real    51m26.730s
JustHereReading: real 40m13.684s

My cores were pushed to only ~50%, so unsurprisingly pigz yielded an improvement in my setup. I was, however, a little surprised by how large the improvement was.
Code:
time sort -mu <(pigz -dc addresses_sorted.txt.gz) <(sort -u daily-file-long.txt) | pigz > output.txt.gz
real    14m29.865s

And now... for the main event:
Code:
time gunzip -c addresses_sorted.txt.gz | python3 add_new_adresses_sorted.py | gzip > output.txt.gz
real    39m42.574s
The script ran slightly faster than your current setup. In that time it sorted (and compressed) the first list in addition to creating a text file that can be appended to the second list. Unfortunately I overwrote the results of your current setup, so I didn't verify the output (yet).
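The script itself isn't posted in the thread, so purely as a reference, here is a minimal sketch of what a merge like this could look like. The file names daily.txt and new_in_order.txt are made up, and addresses are assumed to be sorted in plain byte order (i.e. like GNU sort with LC_ALL=C):
Code:
#!/usr/bin/env python3
# Hypothetical sketch of a merge script used as:
#   gunzip -c addresses_sorted.txt.gz | python3 merge_sketch.py | gzip > output.txt.gz
# It streams the sorted master list from stdin, merges in the daily addresses,
# and writes the genuinely new ones to a side file in chronological order.
import sys

DAILY_FILE = "daily.txt"        # placeholder: one address per line, chronological
NEW_OUT = "new_in_order.txt"    # placeholder: new addresses, in order of first appearance

# Load the (small) daily file, keeping first-appearance order and dropping duplicates.
seen = set()
daily_in_order = []
with open(DAILY_FILE) as f:
    for line in f:
        addr = line.rstrip("\n")
        if addr and addr not in seen:
            seen.add(addr)
            daily_in_order.append(addr)

daily_sorted = sorted(daily_in_order)   # byte order, like GNU sort with LC_ALL=C
already_known = set()                   # daily addresses found in the master list
i, k = 0, len(daily_sorted)
out = sys.stdout

for line in sys.stdin:                  # the big sorted list, streamed line by line
    addr = line.rstrip("\n")
    while i < k and daily_sorted[i] < addr:
        out.write(daily_sorted[i] + "\n")   # new address, slot it in
        i += 1
    if i < k and daily_sorted[i] == addr:
        already_known.add(addr)             # not new after all
        i += 1
    out.write(addr + "\n")

while i < k:                            # daily addresses sorting after the last line
    out.write(daily_sorted[i] + "\n")
    i += 1

# Addresses not present in the master list, in chronological order,
# ready to be appended to the first-appearance list.
with open(NEW_OUT, "w") as f:
    for addr in daily_in_order:
        if addr not in already_known:
            f.write(addr + "\n")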
2  Bitcoin / Project Development / Re: List of all Bitcoin addresses ever used - currently available on temp location on: January 20, 2021, 06:20:10 PM
I can think of another option that might work: if I use the sorted list to get the new addresses, I can get those out of the daily update while keeping the chronological order. This way I only have to deal with two 20 MB files which is easy. After this, all I have to do is add them to the total file.

I found a bit of time to write this. Testing it now..

Just to check with you, I was sorta wrong here:
Given two sorted lists:
n = 1 5 10 11 12 13 14 15 16 19 20
k = 3 6 18

We can read n from disk line by line and compare it to the current position in k.

1 < 3, write 1 to new file.
5 > 3, write 3 to file.
5 < 6, write 5 to file.
10 > 6, write 6 to file.
10 < 18, write 10 to file.
11 < 18, write 11 to file.
....
16 < 18, write 16 to file.
19 > 18, write 18 to file.
19 & nothing left in k, write 19 to file.
20 & nothing left in k, write 20 to file.

That's n + k instead of n * k, right?

Since we're sorting as strings it would actually be:
n = 1 10 11 12 13 14 15 16 19 20 5
k = 18 3 6

The whole list would then become:
all = 1 10 11 12 13 14 15 16 18 19 20 3 5 6

Correct?
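(Quick sanity check: Python's default string comparison gives exactly this lexicographic order for the toy lists above.)
Code:
# Lexicographic (string) order, as used when the addresses are sorted as text.
n = ["1", "5", "10", "11", "12", "13", "14", "15", "16", "19", "20"]
k = ["3", "6", "18"]

print(sorted(n))      # ['1', '10', '11', '12', '13', '14', '15', '16', '19', '20', '5']
print(sorted(k))      # ['18', '3', '6']
print(sorted(n + k))  # ['1', '10', ..., '16', '18', '19', '20', '3', '5', '6']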
3  Bitcoin / Project Development / Re: List of all Bitcoin addresses ever used - currently available on temp location on: January 16, 2021, 09:20:19 AM
Really curious how that test works out. I do hope it does a little bit more than just concatenate the files without sorting them.
It merges all lines from both sorted files in sorted order. After several tests (on my old desktop with HDD), these are the relevant results:
Code:
Old process:
time cat <(gunzip -c addresses_sorted.txt.gz) daily_updates/*.txt | sort -uS80% | gzip > test1.txt.gz
real    90m2.883s

Faster new process:
time sort -mu <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) | gzip > test2.txt.gz
real    51m26.730s
The output is the same.
Interestingly, when I tell sort -m to use up to 40% of my RAM, it actually uses that (even though it doesn't need it), which slows it down by 7 minutes.
Most CPU time is spent compressing the new gzip file.
That's a significant improvement. You could give pigz a try, see: https://unix.stackexchange.com/a/88739/314660. I'm not sure what the drawbacks would be; I've never tried pigz myself.

Quote
I think you wrote that you'd need about 256 GB of RAM for that operation, right? Sorry... can't help you out there. However, a Bloom filter might be nice to implement if you have a 'bit' of RAM (a lot less than 256 GB).
That's going over my head, and probably far too complicated for something this simple.
Honestly, the Bloom filter was a silly suggestion. It will probably not be a big improvement (if any) compared to your current code.

I use Blockchair's daily outputs to update this, not the daily list of addresses.
See: http://blockdata.loyce.club/alladdresses/daily_updates/ for old daily files.
Thanks! Hoping to do some experimenting soon (if I have the time...)
4  Bitcoin / Project Development / Re: List of all Bitcoin addresses ever used - currently available on temp location on: January 12, 2021, 09:25:32 PM
Yes. In fact, just 2 days ago (on another forum) I was pointed at the existence of "sort -mu":
Code:
       -m, --merge
              merge already sorted files; do not sort
This does exactly what you described. I haven't tested it yet, but I assume it's much faster than "regular" sort.
Update: I'm testing this now.

Really curious how that test works out. I do hope it does a little bit more than just concatenate the files without sorting them.

I do see that for the other list it might be a bit more difficult...

It can be done with awk '!a[$0]++', but I don't have that kind of RAM. I'm not sure how efficient this is for large datasets either; it might also run into the problem of having to read 30 quadrillion bytes. Either way, I can't test it for lack of RAM.

I think you wrote that you'd need about 256 GB of RAM for that operation, right? Sorry... can't help you out there. However, a Bloom filter might be nice to implement if you have a 'bit' of RAM (a lot less than 256 GB).
Some quick math on the false-positive rate per filter size:
  • 1 GB: 1 in 13
  • 2 GB: 1 in ~170
  • 3 GB: 1 in ~2,200
  • 4 GB: 1 in ~28,000
  • 5 GB: 1 in ~365,000
  • 6 GB: 1 in ~4,700,000
  • 7 GB: 1 in ~61,000,000
  • 8 GB: 1 in ~800,000,000


Of course this would add some hashing overhead, but that should still be far cheaper than looping over your 1.5 billion addresses. Unfortunately you'd still have to double-check any positives, because they might be false.
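(Those numbers follow from the standard Bloom filter false-positive formula with roughly 1.5 billion stored addresses and an optimal number of hash functions; a quick Python check, taking 1 GB as 10^9 bytes:)
Code:
# Sanity check of the table above: Bloom filter false-positive rate for
# n = 1.5 billion stored addresses, assuming the optimal number of hash
# functions, in which case p = exp(-(m/n) * (ln 2)^2) ~= 0.6185 ** (m/n).
import math

n = 1_500_000_000                          # addresses stored in the filter
for gb in range(1, 9):
    m = gb * 8_000_000_000                 # filter size in bits (1 GB = 10^9 bytes)
    p = math.exp(-(m / n) * math.log(2) ** 2)
    print(f"{gb} GB: 1 in {1 / p:,.0f} false positives")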

I can think of another option that might work: if I use the sorted list to get the new addresses, I can get those out of the daily update while keeping the chronological order. This way I only have to deal with two 20 MB files which is easy. After this, all I have to do is add them to the total file.
This would definitely work and was the solution I originally proposed:
The 'All addresses ever used, without duplicates, in order of first appearance' list could be created in pretty much the same way.
This would be faster than the Bloom filter if there's more than 1 new address that's already in the list.

By the way, I just checked out (but not downloaded) the daily file on Blockchair. It's close to 1 GB (compressed), but you mentioned 20 MB for new addresses on numerous occasions, so I guess there's a lot of cleaning to do there. Could I maybe get one of your (old) daily files? I should be able to throw some code together that makes this work fairly quickly.
5  Bitcoin / Project Development / Re: List of all Bitcoin addresses ever used - currently available on temp location on: January 12, 2021, 01:14:39 PM
We then read the big list line by line while simultaneously running through the list of new addresses and comparing the values in O(n + k). In this case we can directly write the new file to disk line by line; only the list of new addresses is kept in memory.
The problem with this is that running through a 20 MB list takes a lot of time if you need to do it 1.5 billion times. Keeping the 20 MB in memory isn't the problem; reading 30 quadrillion bytes from RAM would still take much longer than my current system.

(...)

I might be utterly mistaken, but hear me out:

Given two sorted lists:
n = 1 5 10 11 12 13 14 15 16 19 20
k = 3 6 18

We can read n from disk line by line and compare it to the current position in k.

1 < 3, write 1 to new file.
5 > 3, write 3 to file.
5 < 6, write 5 to file.
10 > 6, write 6 to file.
10 < 18, write 10 to file.
11 < 18, write 11 to file.
....
16 < 18, write 16 to file.
19 > 18, write 18 to file.
19 & nothing left in k, write 19 to file.
20 & nothing left in k, write 20 to file.

That's n + k instead of n * k, right?
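A minimal sketch of that comparison loop in Python, using the toy lists above (streaming the big list from a file line by line works the same way):
Code:
# Two-pointer merge of a big sorted list n with a small sorted list k,
# writing each value exactly once: O(n + k) comparisons instead of O(n * k).
def merge_sorted(n, k, write):
    j = 0
    for value in n:                       # the big list, read "line by line"
        while j < len(k) and k[j] < value:
            write(k[j])                   # k[j] sorts before the current value
            j += 1
        if j < len(k) and k[j] == value:
            j += 1                        # duplicate: keep only one copy
        write(value)
    while j < len(k):                     # nothing left in n
        write(k[j])
        j += 1

n = [1, 5, 10, 11, 12, 13, 14, 15, 16, 19, 20]
k = [3, 6, 18]
merge_sorted(n, k, print)                 # 1 3 5 6 10 11 ... 16 18 19 20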
6  Bitcoin / Project Development / Re: List of all Bitcoin addresses ever used - currently available on temp location on: January 12, 2021, 12:20:12 PM
First of all, great project!



(...)
Quote
The longer the list, the longer it will take to sort one additional line.
At some point a database might beat raw text sorting, but for now I'm good with this :)
Using a database will not solve this problem. There are some things a DB can do to shave a factor off the O(n²) sorting work, but the growth is still quadratic.

You make the argument that your input size is small enough that quadratic complexity is acceptable, and you may have a point.
Going with these two versions:
(...)
Since I got no response to my question above, I'll go with 2 versions:
  • All addresses ever used, without duplicates, in order of first appearance.
  • All addresses ever used, without duplicates, sorted.
The first file feels nostalgic, the second file will be very convenient to match addresses with a list of your own.

I don't see how sorting would have to be quadratic for either of these lists, though.

All addresses ever used, without duplicates, sorted.
  • We already have a list with all the addresses ever used sorted by address (length n).
  • We have a list of (potentially) new addresses (length k).
  • We sort the list of new items in O(k log k).
  • We check for duplicates in the new addresses in O(k).
  • We then read the big list line by line while simultaneously running through the list of new addresses and comparing the values in O(n + k). In this case we can directly write the new file to disk line by line; only the list of new addresses is kept in memory.

Resulting in O(n + k log k + 2k). In this particular case one might even argue that n > k log k + 2k, so the whole thing is O(2n) = O(n). However, it's late here and I don't like to argue.

You only need enough memory to keep the new addresses in memory and enough disk space to keep both the new and old version on disk at the same time.

The 'All addresses ever used, without duplicates, in order of first appearance' list could be created in pretty much the same way.

I'll see if I can whip some code together.
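As a rough sketch of that idea in Python (file names are placeholders), the standard library's heapq.merge can handle the streaming part, so only the new addresses need to fit in memory:
Code:
# Rough sketch: update the sorted list with a batch of new addresses.
# Only the small batch is held in memory; the big file is streamed.
import heapq

def update_sorted(big_path, new_path, out_path):
    with open(new_path) as f:                                            # O(k) read
        new = sorted({line.rstrip("\n") for line in f if line.strip()})  # O(k log k)

    with open(big_path) as big, open(out_path, "w") as out:
        previous = None
        merged = heapq.merge((line.rstrip("\n") for line in big), new)   # O(n + k)
        for addr in merged:
            if addr != previous:          # duplicates end up adjacent; keep one copy
                out.write(addr + "\n")
                previous = addr

# update_sorted("addresses_sorted.txt", "daily_new.txt", "addresses_sorted_updated.txt")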


File hosting
Have you considered releasing the big files as torrents with a webseed? This would allow downloaders to still download from your server and then (hopefully) continue to seed for a while, taking some strain off your server.

You might even release them in an RSS feed so that some contributors could automatically add them to their torrent clients and start downloading at, e.g., max 1 Mb/s while uploading at >1 Mb/s; this would quickly let the files spread over the peers and further move downloads away from your server.

