I can think of another option that might work: if I use the sorted list to get the new addresses, I can get those out of the daily update while keeping the chronological order. This way I only have to deal with two 20 MB files, which is easy. After this, all I have to do is add them to the total file.
I found a bit of time to write this. Testing it now...

The first results of yesterday's testing look promising. I should go back and double-check that the outputs are correct, but they seem to be. I created a VM with 2 cores/2 threads (so no hyperthreading, or whatever AMD's equivalent is called) of my Ryzen 3600 and 512 MB of RAM (only because Ubuntu Server, for which I had an ISO handy, wouldn't boot with 256 MB).

To make the numbers mean anything, I first benchmarked your current setup:

time sort -mu <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) | gzip > test2.txt.gz

Loycev: real 51m26.730s
JustHereReading: real 40m13.684s

My cores were pushed to ~50%, so unsurprisingly pigz yielded an improvement in my setup. However, I was a little surprised by the amount of improvement:

time sort -mu <(pigz -dc addresses_sorted.txt.gz) <(sort -u daily-file-long.txt) | pigz > output.txt.gz

real 14m29.865s
And now... for the main event:

time gunzip -c addresses_sorted.txt.gz | python3 add_new_adresses_sorted.py | gzip > output.txt.gz

real 39m42.574s

The script ran slightly faster than your current setup. In that time it sorted (and compressed) the first list in addition to creating a text file that can be appended to the second list. Unfortunately I overwrote the results of your current setup, so I haven't verified the output (yet).
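The script itself isn't posted in the thread, so the following is only a guess at its shape: a minimal sketch that reads the sorted master list from stdin, merges in a sorted, de-duplicated daily file, writes the merged sorted list to stdout, and writes the addresses that weren't already in the master list to a separate file. The file names daily-file-long.txt and new_addresses.txt are assumptions, and whether the real script handles the chronological ordering here or in a second pass isn't clear from the thread.

#!/usr/bin/env python3
# Hypothetical sketch, not the actual add_new_adresses_sorted.py from the thread.
# Assumes both inputs are sorted, de-duplicated, and end every line with a newline.
import sys

DAILY_FILE = "daily-file-long.txt"   # assumed: sorted, de-duplicated daily addresses
NEW_OUT = "new_addresses.txt"        # assumed: addresses not yet in the master list

def main():
    out = sys.stdout
    with open(DAILY_FILE) as daily, open(NEW_OUT, "w") as new_out:
        new = daily.readline()
        for line in sys.stdin:                 # master list, already sorted
            # flush daily addresses that sort before the current master line: they are new
            while new and new < line:
                out.write(new)
                new_out.write(new)
                new = daily.readline()
            if new == line:                    # already known, skip the duplicate
                new = daily.readline()
            out.write(line)
        while new:                             # daily addresses sorting after the last master line
            out.write(new)
            new_out.write(new)
            new = daily.readline()

if __name__ == "__main__":
    main()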
I can think of another option that might work: if I use the sorted list to get the new addresses, I can get those out of the daily update while keeping the chronological order. This way I only have to deal with two 20 MB files, which is easy. After this, all I have to do is add them to the total file.
I found a bit of time to write this. Testing it now...

Just to check with you, I was sorta wrong here:

Given two sorted lists:
n = 1 5 10 11 12 13 14 15 16 19 20
k = 3 6 18
We can read n from disk line by line and compare it to the current position in k.
1 < 3, write 1 to new file.
5 > 3, write 3 to file.
5 < 6, write 5 to file.
10 > 6, write 6 to file.
10 < 18, write 10 to file.
11 < 18, write 11 to file.
...
16 < 18, write 16 to file.
19 > 18, write 18 to file.
19 & nothing left in k, write 19 to file.
20 & nothing left in k, write 20 to file.
That's n + k instead of n * k, right?
Since we're sorting as strings, it would actually be:
n = 1 10 11 12 13 14 15 16 19 20 5
k = 18 3 6

The whole list would then become:
all = 1 10 11 12 13 14 15 16 18 19 20 3 5 6

Correct?
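For what it's worth, Python's plain string sort reproduces that order (code-point comparison, the same order GNU sort gives with LC_ALL=C); just an illustration, not part of the pipeline:

# Sorting the example values as strings (lexicographic order), not as numbers.
n = ["1", "5", "10", "11", "12", "13", "14", "15", "16", "19", "20"]
k = ["3", "6", "18"]

print(sorted(n))      # ['1', '10', '11', '12', '13', '14', '15', '16', '19', '20', '5']
print(sorted(k))      # ['18', '3', '6']
print(sorted(n + k))  # ['1', '10', '11', '12', '13', '14', '15', '16', '18', '19', '20', '3', '5', '6']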
Really curious how that test works out. I do hope it does a little bit more than just merge the files without sorting them.

It merges all lines from both sorted files in sorted order. After several tests (on my old desktop with HDD), these are the relevant results:

Old process:
time cat <(gunzip -c addresses_sorted.txt.gz) daily_updates/*.txt | sort -uS80% | gzip > test1.txt.gz
real 90m2.883s
Faster new process:
time sort -mu <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) | gzip > test2.txt.gz
real 51m26.730s

The output is the same. Interestingly, when I tell sort -m to use up to 40% of my RAM, it actually uses that (even though it doesn't need it), which slows it down by 7 minutes. Most CPU time is spent compressing the new gzip file.

That's a significant improvement. You could give pigz a try, see: https://unix.stackexchange.com/a/88739/314660. I'm not sure what the drawbacks would be; I've never tried pigz myself.

I think you wrote that you'd need about 256 GB of RAM for that operation, right? Sorry... can't help you out there. However, a bloom filter might be nice to implement if you have a 'bit' of RAM (a lot less than 256 GB).

That's going over my head, and probably far too complicated for something this simple.

Honestly, the bloom filter was a silly suggestion. It will probably not be a big improvement (if any) compared to your current code.

Thanks! Hoping to do some experimenting soon (if I have the time...)
Yes. In fact, just 2 days ago (on another forum) I was pointed at the existence of "sort -mu":
-m, --merge    merge already sorted files; do not sort
This does exactly what you described. I haven't tested it yet, but I assume it's much faster than "regular" sort. Update: I'm testing this now.

Really curious how that test works out. I do hope it does a little bit more than just merge the files without sorting them.

I do see that for the other list it might be a bit more difficult... It can be done by awk '!a[$0]++', but I don't have that kind of RAM.

I'm not sure how efficient this is for large datasets; it might also run into the problem of having to read 30 quadrillion bytes. Either way, I can't test it due to lack of RAM.

I think you wrote that you'd need about 256 GB of RAM for that operation, right? Sorry... can't help you out there. However, a bloom filter might be nice to implement if you have a 'bit' of RAM (a lot less than 256 GB). Some quick math on the false positive rate:
1 GB: 1 in 13
2 GB: 1 in ~170
3 GB: 1 in ~2,200
4 GB: 1 in ~28,000
5 GB: 1 in ~365,000
6 GB: 1 in ~4,700,000
7 GB: 1 in ~61,000,000
8 GB: 1 in ~800,000,000
Of course this would require some hashing overhead, but this should greatly outweigh looping over your 1.5 billion addresses. Unfortunately you'd still have to double-check any positives, because they might be false. (A small sketch of where these numbers come from follows at the end of this post.)

I can think of another option that might work: if I use the sorted list to get the new addresses, I can get those out of the daily update while keeping the chronological order. This way I only have to deal with two 20 MB files, which is easy. After this, all I have to do is add them to the total file.
This would definitely work and was the solution I originally proposed: The 'All addresses ever used, without duplicates, in order of first appearance' list could be created in pretty much the same way.
This would be faster than the bloom filter if there's more than 1 new address that's already in the list.
By the way, I just checked out (but didn't download) the daily file on blockchair. It's close to 1 GB (compressed), but you mentioned 20 MB for new addresses on numerous occasions. I guess there's a lot of cleaning to do there. Could I maybe get one of your (old) daily files? I should be able to throw together some code that makes this work fairly quickly.
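Following up on the false positive figures above: a back-of-the-envelope reconstruction (my own, not code from the thread), assuming an optimally tuned Bloom filter and counting 1 GB as 8×10^9 bits:

# Optimal Bloom filter false positive rate: with m bits, n items and
# k = (m/n) * ln 2 hash functions, p is roughly 0.6185 ** (m/n).
n = 1_500_000_000                  # ~1.5 billion addresses
for gb in range(1, 9):
    m = gb * 8_000_000_000         # filter size in bits (1 GB counted as 8e9 bits)
    p = 0.6185 ** (m / n)
    print(f"{gb} GB: 1 in {1 / p:,.0f}")

This lands close to the quoted table: about 1 in 13 for 1 GB, roughly 1 in 170 for 2 GB, and so on up to roughly 1 in 800 million for 8 GB.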
We then read the big list line by line while simultaneously running through the list of new addresses and comparing the values in O(n + k). In this case we can directly write the new file to disk line by line; only the list of new addresses is kept in memory.

The problem with this is that running through a 20 MB list takes a lot of time if you need to do it 1.5 billion times. Keeping the 20 MB in memory isn't the problem; reading 30 quadrillion bytes from RAM still takes much longer than my current system.

(...)

I might be utterly mistaken, but hear me out. Given two sorted lists:
n = 1 5 10 11 12 13 14 15 16 19 20
k = 3 6 18

We can read n from disk line by line and compare it to the current position in k.
1 < 3, write 1 to new file.
5 > 3, write 3 to file.
5 < 6, write 5 to file.
10 > 6, write 6 to file.
10 < 18, write 10 to file.
11 < 18, write 11 to file.
...
16 < 18, write 16 to file.
19 > 18, write 18 to file.
19 & nothing left in k, write 19 to file.
20 & nothing left in k, write 20 to file.

That's n + k instead of n * k, right?
First of all, great project!
(...) The longer the list, the longer it will take to sort one additional line. At some point a database might beat raw text sorting, but for now I'm good with this.

Using a database will not solve this problem. There are some things a DB can do to make sorting go from O(n²) to O(n²/n), but this is still exponential growth. You make the argument that your input size is sufficiently small such that having exponential complexity is okay, and you may have a point.

Going with these two versions:

(...) Since I got no response to my question above, I'll go with 2 versions:
- All addresses ever used, without duplicates, in order of first appearance.
- All addresses ever used, without duplicates, sorted.
The first file feels nostalgic; the second file will be very convenient to match addresses with a list of your own.

I don't see how sorting would be exponential for any of these lists.

All addresses ever used, without duplicates, sorted:
- We already have a list with all the addresses ever used, sorted by address (length n).
- We have a list of (potentially) new addresses (length k).
- We sort the list of new items in O(k log k).
- We check for duplicates in the new addresses in O(k).
- We then read the big list line by line while simultaneously running through the list of new addresses and comparing the values in O(n + k). In this case we can directly write the new file to disk line by line; only the list of new addresses is kept in memory.
Resulting in O(n + k log k + 2k). In this particular case one might even argue that n > k log k + 2k, therefore O(2n) = O(n). However, it's late here and I don't like to argue. (A quick numeric sanity check of that claim follows at the end of this post.) You only need enough memory to keep the new addresses in memory and enough disk space to keep both the new and old version on disk at the same time.

The 'All addresses ever used, without duplicates, in order of first appearance' list could be created in pretty much the same way. I'll see if I can whip some code together.

File hosting
Have you considered releasing the big files as torrents with a webseed? This would allow downloaders to still download from your server and then (hopefully) continue to seed for a while, taking some strain off your server. You might even release it in an RSS feed so that some contributors could automatically add it to their torrent clients and start downloading with e.g. max 1 Mb/s and uploading with >1 Mb/s; this would quickly allow the files to spread over the peers and further move downloads away from your server.
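Coming back to the n > k log k + 2k claim: a rough sanity check. The daily k is my own assumption, taken from the ~20 MB daily files mentioned earlier at roughly 35 bytes per address; it is not a figure from the thread.

# Rough check that n dominates k*log2(k) + 2k for this workload.
import math

n = 1_500_000_000        # addresses in the big sorted list
k = 600_000              # assumed daily new addresses (~20 MB at ~35 bytes each)
cost_k = k * math.log2(k) + 2 * k
print(f"k*log2(k) + 2k = {cost_k:,.0f}")   # about 12.7 million
print(f"n = {n:,}")                        # 1.5 billion, more than 100x larger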