Bitcoin Forum
Author Topic: List of all Bitcoin addresses ever used - weekly updates work again!  (Read 4137 times)
This is a self-moderated topic. If you do not want to be moderated by the person who started this topic, create a new topic. (3 posts by 1+ user deleted.)
LoyceV (OP)
Legendary
Activity: 3528
Merit: 17819
Thick-Skinned Gang Leader and Golden Feather 2021
January 07, 2021, 04:08:03 PM
Last edit: January 10, 2021, 10:43:22 AM by LoyceV
 #41

Due to another VPS that decided to run off with my prepayment (Lol: for 2 weeks), this data is currently unavailable. I'm not sure yet where to move, if it takes too long I'll upload the data elsewhere (but in that case without regular backups).

Update:
I've uploaded the latest version to a temporary location: blockdata.loyce.club/alladdresses/.

brainless
Member
Activity: 350
Merit: 35
January 12, 2021, 08:48:26 AM
 #42

Due to another VPS that decided to run off with my prepayment (Lol: for 2 weeks), this data is currently unavailable. I'm not sure yet where to move, if it takes too long I'll upload the data elsewhere (but in that case without regular backups).

Update:
I've uploaded the latest version to a temporary location: blockdata.loyce.club/alladdresses/.

Daily updates also need to be posted there, if possible.
Thanks!

LoyceV (OP)
Legendary
Activity: 3528
Merit: 17819
Thick-Skinned Gang Leader and Golden Feather 2021
January 12, 2021, 09:22:49 AM
Last edit: January 12, 2021, 12:24:26 PM by LoyceV
 #43

Daily updates also need to be posted there, if possible
This VPS is currently downloading other data from Blockchair, which only allows one connection at a time. I expect this to take another month (at 100 kB/s); after that I can enable daily updates (txt-files with unique addresses for that day) again.

I haven't decided yet how and where to do regular updates to the 20 GB files (this is quite resource intensive).

JustHereReading
Newbie
Activity: 6
Merit: 15
January 12, 2021, 12:20:12 PM
Merited by LoyceV (4), PrimeNumber7 (1)
 #44

First of all, great project!



(...)
Quote
The longer the list, the longer it will take to sort one additional line.
At some point a database might beat raw text sorting, but for now I'm good with this Smiley
Using a database will not solve this problem. There are some things a DB can do to make sorting go from O(n²) to O(n²/k), but the growth is still quadratic.

You make the argument that your input size is sufficiently small that quadratic complexity is okay, and you may have a point.
Going with these two versions:
(...)
Since I got no response to my question above, I'll go with 2 versions:
  • All addresses ever used, without duplicates, in order of first appearance.
  • All addresses ever used, without duplicates, sorted.
The first file feels nostalgic, the second file will be very convenient to match addresses with a list of your own.

I don't see how sorting would need to be quadratic for either of these lists.

All addresses ever used, without duplicates, sorted.
  • We already have a list with all the addresses ever used sorted by address (length n).
  • We have a list of (potentially) new addresses (length k).
  • We sort the list of new items in O(k log k).
  • We check for duplicates in the new addresses in O(k).
  • We then read the big list line by line while simultaneously running through the list of new addresses and comparing the values in O(n + k). In this case we can directly write the new file to disk line by line; only the list of new addresses is kept in memory.

Resulting in O(n + k log k + 2k). In this particular case one might even argue that n > k log k + 2k, and therefore O(2n) = O(n). However, it's late here and I don't want to argue.

You only need enough memory to keep the new addresses in memory and enough disk space to keep both the new and old version on disk at the same time.

The 'All addresses ever used, without duplicates, in order of first appearance' list could be created in pretty much the same way.

I'll see if I can whip some code together.
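In the meantime, a minimal sketch of the sorted-list update with GNU coreutils (file names are placeholders, untested):
Code:
# addresses_sorted.txt : the existing sorted list (n lines)
# new_raw.txt          : newly seen addresses (k lines, unsorted, may contain duplicates)
sort -u new_raw.txt > new_sorted.txt                        # O(k log k): sort and deduplicate the new addresses
sort -mu addresses_sorted.txt new_sorted.txt > merged.txt   # O(n + k): merge the two sorted files, dropping duplicates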


File hosting
Have you considered releasing the big files as torrents with a webseed? This will allow downloaders to still download from your server and then (hopefully) continue to seed for a while, taking some strain off your server.

You might even release it as an RSS feed, so that some contributors could automatically add it to their torrent clients and start downloading at e.g. max 1 Mb/s while uploading at >1 Mb/s; this would quickly let the files spread over the peers and further move downloads away from your server.


LoyceV (OP)
Legendary
Activity: 3528
Merit: 17819
Thick-Skinned Gang Leader and Golden Feather 2021
January 12, 2021, 12:44:18 PM
 #45

We then read the big list line by line while simultaneously running through the list of new addresses and comparing the values in O(n + k). In this case we can directly write the new file to disk line by line; only the list of new addresses is kept in memory.
The problem with this is that running through a 20 MB list takes a lot of time if you need to do it 1.5 billion times. Keeping the 20 MB in memory isn't the problem; reading 30 quadrillion bytes from RAM still takes much longer than my current approach does.

I may be able to improve on the sorted list by merging lists, and I may be able to improve on everything by keeping big temp files instead of only compressed files (but as always I need some time to do this).

Quote
Have you considered releasing the big files as torrents with a webseed? This will allow downloaders to still download from your server and then (hopefully) continue to seed for a while, taking some strain off your server.
No, so far download bandwidth hasn't been a problem. Only a few people have been crazy enough to download these files. If this ever goes viral it would be a great solution though.

JustHereReading
Newbie
Activity: 6
Merit: 15
January 12, 2021, 01:14:39 PM
Merited by Quickseller (2)
 #46

We then read the big list line by line while simultaneously running through the list of new addresses and comparing the values in O(n + k). In this case we can directly write the new file to disk line by line; only the list of new addresses is kept in memory.
The problem with this is that running through a 20 MB list takes a lot of time if you need to do it 1.5 billion times. Keeping the 20 MB in memory isn't the problem; reading 30 quadrillion bytes from RAM still takes much longer than my current approach does.

(...)

I might be utterly mistaken, but hear me out:

Given two sorted lists:
n = 1 5 10 11 12 13 14 15 16 19 20
k = 3 6 18

We can read n from disk line by line and compare it to the current position in k.

1 < 3, write 1 to new file.
5 > 3, write 3 to file.
5 < 6, write 5 to file.
10 > 6, write 6 to file.
10 < 18, write 10 to file.
11 < 18, write 11 to file.
....
16 < 18, write 16 to file.
19 > 18, write 18 to file.
19 & nothing left in k, write 19 to file.
20 & nothing left in k, write 20 to file.

That's O(n + k) instead of O(n × k), right?
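A rough awk sketch of that comparison loop (hypothetical file names; the small list k.txt is kept in memory, the big list n.txt is streamed line by line):
Code:
# Merge two sorted, duplicate-free lists: awk -f merge.awk k.txt n.txt > merged.txt
FNR == NR { k[++klen] = $0; next }      # load the small list (k) into memory
{
    while (i < klen && k[i+1] < $0)     # flush k-entries that sort before the current n-line
        print k[++i]
    if (i < klen && k[i+1] == $0)       # same value in both lists: keep it only once
        i++
    print                               # write the current n-line
}
END { while (i < klen) print k[++i] }   # flush whatever is left of k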
NotATether
Legendary
Activity: 1820
Merit: 7476
Top Crypto Casino
January 12, 2021, 04:47:56 PM
 #47

Due to another VPS that decided to run off with my prepayment (Lol: for 2 weeks), this data is currently unavailable. I'm not sure yet where to move, if it takes too long I'll upload the data elsewhere (but in that case without regular backups).

Update:
I've uploaded the latest version to a temporary location: blockdata.loyce.club/alladdresses/.

I don't remember if I offered you this before but I can host this data for you if it's not too big (I can throw up to 300GB for this project). I can also set up an rsync cron job to pull updates from your temporary location too.

LoyceV (OP)
Legendary
Activity: 3528
Merit: 17819
Thick-Skinned Gang Leader and Golden Feather 2021
January 12, 2021, 06:34:56 PM
Last edit: January 12, 2021, 07:16:20 PM by LoyceV
 #48

We can read n from disk line by line and compare it to the current position in k.
Yes. In fact, just 2 days ago (on another forum) I was pointed at the existence of "sort -mu":
Code:
      -m, --merge
              merge already sorted files; do not sort
This does exactly what you described. I haven't tested it yet, but I assume it's much faster than "regular" sort.
Update: I'm testing this now.

However, the bigger problem remains: updating 1.5 billion unique addresses in chronological order. Those lists are unsorted, so for example:
Existing long list with 12 years of data:
Code:
5
3
7
2
9
New daily list:
Code:
4
3
The end result should be:
Code:
5
3
7
2
9
4
It can be done by awk '!a[$0]++', but I don't have that kind of RAM. I'm not sure how efficient this is for large datasets, it might also run into the problem of having to read 30 quadrillion bytes. Either way, I can't test it due to lack of RAM.
I ended up with sort -uk2 | sort -nk1 | cut -f2.
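For reference, the idea behind that one-liner, spelled out with line numbering added first by nl (placeholder file names, GNU coreutils assumed, untested here):
Code:
cat daily_update.txt |      # addresses in order of appearance, with duplicates
  nl |                      # prefix each line with its line number
  sort -uk2 |               # sort by address, keeping only the first occurrence of each
  sort -nk1 |               # restore the original order using the line numbers
  cut -f2 > daily_unique_in_order.txt   # drop the numbers: unique addresses, in order of first appearance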
I can think of another option that might work: if I use the sorted list to get the new addresses, I can get those out of the daily update while keeping the chronological order. This way I only have to deal with two 20 MB files which is easy. After this, all I have to do is add them to the total file.



I don't remember if I offered you this before but I can host this data for you if it's not too big
You did (more or less):
If network bandwidth is a problem I'm able to host this on my hardware if you like.
So I guess you missed my reply too:
I'm more in need for more disk space for sorting this data, but I haven't decided yet where to host it.

(I can throw up to 300GB for this project). I can also set up an rsync cron job to pull updates from your temporary location too.
It is a good offer Smiley Currently, disk space isn't the problem. I am looking for a webhost that allows me to abuse the disk for a few hours continuously once in a while. Most VPS providers aren't that happy when I do that, and my (sponsored) AWS server starts throttling I/O when I do this.
I'm (again) short on time to test everything, but after some discussion on another forum I created a RamNode account. This looks promising so far. If I can pull that off by automating everything, it's not that expensive to use it a couple hours per month only.

JustHereReading
Newbie
Activity: 6
Merit: 15
January 12, 2021, 09:25:32 PM
Merited by LoyceV (4)
 #49

Yes. In fact, just 2 days ago (on another forum) I was pointed at the existence of "sort -mu":
Code:
      -m, --merge
              merge already sorted files; do not sort
This does exactly what you described. I haven't tested it yet, but I assume it's much faster than "regular" sort.
Update: I'm testing this now.

Really curious how that test works out. I do hope it does a little bit more than just concatenate the files without sorting them.

I do see that for the other list it might be a bit more difficult...

It can be done by awk '!a[$0]++', but I don't have that kind of RAM. I'm not sure how efficient this is for large datasets, it might also run into the problem of having to read 30 quadrillion bytes. Either way, I can't test it due to lack of RAM.

I think you wrote that you'd need about 256GB of RAM for that operation, right? Sorry... can't help you out there. However a bloomfilter might be nice to implement if you have a 'bit' of RAM (a lot less than 256GB).
Some quick math:
1GB: 1 in 13 false positives
2GB: 1 in ~170
3GB: 1 in ~2,200
4GB: 1 in ~28,000
5GB: 1 in ~365,000
6GB: 1 in ~4,700,000
7GB: 1 in ~61,000,000
8GB: 1 in ~800,000,000


Of course this would add some hashing overhead, but that should still be far cheaper than looping over your 1.5 billion addresses. Unfortunately you'd still have to double-check any positives, because they might be false.
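For reference, those numbers follow from the usual Bloom-filter estimate with an optimally chosen number of hash functions k: for n entries and a filter of m bits, the false-positive rate is roughly p ≈ (1 − e^{−kn/m})^k, minimized at k = (m/n)·ln 2, which gives p ≈ 0.6185^{m/n}. With n ≈ 1.5 billion addresses, 8 GB ≈ 6.4·10^10 bits gives m/n ≈ 43 and p ≈ 1.2·10^−9, i.e. about 1 in 800 million.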

I can think of another option that might work: if I use the sorted list to get the new addresses, I can get those out of the daily update while keeping the chronological order. This way I only have to deal with two 20 MB files which is easy. After this, all I have to do is add them to the total file.
This would definitely work and was the solution I originally proposed:
The 'All addresses ever used, without duplicates, in order of first appearance' list could be created in pretty much the same way.
This would be faster than the bloom filter if there's more than 1 new address that's already in the list.

By the way, I just checked out (but not downloaded) the daily file on blockchair. It's close to 1GB (compressed), but you mentioned 20MB for new addresses on numerous occasions. I guess there's a lot of cleaning to do there. Could I maybe get one of your (old) daily files? I should be able to throw some code together that makes this work fairly quickly.
brainless
Member
Activity: 350
Merit: 35
January 13, 2021, 05:55:06 PM
 #50

You're discussing how to sort, remove duplicates, and make the list available both raw and sorted.
My system is an i3-6100 with 16 GB DDR4 RAM, and on it I manage to sort and deduplicate the raw 19 GB file within 1 hour; the daily data is just a few minutes of work.
Let me explain.

Stage 1
Simply do:
sort raw.txt > sorted.txt
split -l 50000000 sorted.txt (it will split the file into chunks named xaa, xab, ...)
Next, remove duplicates with perl. It's fast and can load roughly a 3 GB file, but we make it faster by working on 50-million-line chunks:
perl -ne 'print unless $_{$_}++' xaa > part1.txt
Second chunk:
perl -ne 'print unless $_{$_}++' xab > part2.txt
...and so on; all chunks are done within 1 hour.

Now combine all parts:
cat part*.txt > full-sorted.txt
or, to keep the sorted order explicit, list them in order (part1.txt ... part10.txt):
cat part1.txt part2.txt part3.txt > full-sorted.txt

Stage 2
For the second group (everything from 21 Dec 2020 onwards), take all daily update files, combine them, sort them and remove duplicates.
Name the result new-group.txt.

The command is:
join new-group.txt full-sorted.txt > filter.txt

Here filter.txt contains the addresses common to both files (new-group.txt and full-sorted.txt).
Now remove filter.txt from new-group.txt to get only the genuinely new addresses:

awk 'FNR==NR{ a[$1]; next } !($1 in a)' filter.txt new-group.txt > pure-new-addresses.txt

Stage 3
If you still need everything in one file:

Combine pure-new-addresses.txt and full-sorted.txt:
cat pure-new-addresses.txt full-sorted.txt > pre-full-sorted.txt
sort pre-full-sorted.txt > new-full-addresses.txt


It's recommended to keep one file as it was last created on 21 Dec 2020, and from then on perform only stage 2 on the new daily files; that way you get only the new addresses that don't appear in the first 19 GB file.

I hope this explains all the points and helps you and the community. If you need any further info, ask me; I'm happy to provide whatever I have.

NotATether
Legendary
Activity: 1820
Merit: 7476
Top Crypto Casino
January 13, 2021, 06:30:35 PM
 #51

(I can throw up to 300GB for this project). I can also set up an rsync cron job to pull updates from your temporary location too.
It is a good offer Smiley Currently, disk space isn't the problem. I am looking for a webhost that allows me to abuse the disk for a few hours continuously once in a while. Most VPS providers aren't that happy when I do that, and my (sponsored) AWS server starts throttling I/O when I do this.
I'm (again) short on time to test everything, but after some discussion on another forum I created a RamNode account. This looks promising so far. If I can pull that off by automating everything, it's not that expensive to use it a couple hours per month only.

I have a server on RAID0 with 882MB/s read and 191MB/s write, so copying this stuff to a different place on the same disk would only take a few minutes for a 30GB dataset.

AWS VPSes run on shared hardware, so that's probably why you're getting throttled. There are dedicated servers on AWS where you're in total control of the hardware and they don't throttle you. But I'm glad the RamNode account worked out for you. Let me know if you need help writing the automation.

Vod
Legendary
Activity: 3920
Merit: 3167
Licking my boob since 1970
January 13, 2021, 07:18:26 PM
 #52

It can be done by awk '!a[$0]++', but I don't have that kind of RAM. I'm not sure how efficient this is for large datasets, it might also run into the problem of having to read 30 quadrillion bytes. Either way, I can't test it due to lack of RAM... it's not that expensive to use it a couple hours per month only.

You are on AWS, right?   Why not have your sponsor upgrade your instance to a higher class for a few hours?  That's the beauty of on-demand processing. Smiley

EC2?   Those are dedicated resources, not shared.

LoyceV (OP)
Legendary
Activity: 3528
Merit: 17819
Thick-Skinned Gang Leader and Golden Feather 2021
January 14, 2021, 06:18:06 PM
 #53

This post is the result of some trial and error. I also noticed blockdata.loyce.club/ gets terribly slow once in a while, which made it useless to use RamNode for this data.

Really curious how that test works out. I do hope it does a little bit more than just concatenate the files without sorting them.
It merges all lines from both sorted files in sorted order. After several tests (on my old desktop with HDD), these are the relevant results:
Code:
Old process:
time cat <(gunzip -c addresses_sorted.txt.gz) daily_updates/*.txt | sort -uS80% | gzip > test1.txt.gz
real    90m2.883s

Faster new process:
time sort -mu <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) | gzip > test2.txt.gz
real    51m26.730s
The output is the same.
Interestingly, when I tell sort -m to use up to 40% of my RAM, it actually uses that (even though it doesn't need it), which slows it down by 7 minutes.
Most CPU time is spent compressing the new gzip file.

Quote
I think you wrote that you'd need about 256GB of RAM for that operation, right? Sorry... can't help you out there. However a bloomfilter might be nice to implement if you have a 'bit' of RAM (a lot less than 256GB).
That's going over my head, and probably far too complicated for something this simple.

Quote
I can think of another option that might work: if I use the sorted list to get the new addresses, I can get those out of the daily update while keeping the chronological order. This way I only have to deal with two 20 MB files which is easy. After this, all I have to do is add them to the total file.
This would definitely work and was the solution I originally proposed:
The 'All addresses ever used, without duplicates, in order of first appearance' list could be created in pretty much the same way.
This would be faster than the bloom filter if there's more than 1 new address that's already in the list.
I'll try:
Code:
Old code:
time cat <(gunzip -c addresses_in_order_of_first_appearance.txt.gz) daily_updates/*.txt | nl | sort -uk2 -S40% | sort -nk1 -S40% | cut -f2 | gzip > newchronological.txt.gz
real    194m24.456s

New:
time comm -13 <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) > newaddresses.txt
real    8m4.045s
time cat daily_updates/*.txt | nl | sort -uk2 -S40% | sort -nk1 -S40% | cut -f2 > all_daily_addresses_chronological_order.txt
real    1m14.593s
cat all_daily_addresses_chronological_order.txt newaddresses.txt | nl -nln | sort -k2 -S80% > test.txt
real    0m36.948s

I discovered uniq -f1 on Stack Exchange: it skips the first field (the line number) when comparing, and with -d it only prints lines whose remaining fields occur more than once:
Code:
cat test.txt | uniq -df1 | sort -nk1 -S80% | cut -f2 > test2.txt
real    0m7.721s

Code:
Combined:
time cat <(cat <(cat ../daily_updates/*.txt | nl | sort -uk2 -S40% | sort -nk1 -S40% | cut -f2) <(comm -13 <(gunzip -c ../addresses_sorted.txt.gz) <(sort -u ../daily_updates/*.txt)) | nl -nln | sort -k2 -S80%) | uniq -df1 | sort -nk1 -S80% | cut -f2 > newaddresses_chronological.txt
real    9m45.163s
Even more combined:
time cat <(gunzip -c addresses_in_order_of_first_appearance.txt.gz) <(cat <(cat <(cat daily_updates/*.txt | nl | sort -uk2 -S40% | sort -nk1 -S40% | cut -f2) <(comm -13 <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt)) | nl -nln | sort -k2 -S80%) | uniq -df1 | sort -nk1 -S80% | cut -f2) > new.alladdresses_chronological.txt
real    19m34.926s
This can significantly improve performance, especially if I keep uncompressed files for faster access. But it's wrong: I have 3 different output files from 3 different methods.

Quote
By the way, I just checked out (but not downloaded) the daily file on blockchair. It's close to 1GB (compressed), but you mentioned 20MB for new addresses on numerous occasions. I guess there's a lot of cleaning to do there. Could I maybe get one of your (old) daily files?
I use Blockchair's daily outputs to update this, not the daily list of addresses.
See: http://blockdata.loyce.club/alladdresses/daily_updates/ for old daily files.



split -l 50000000 sorted ( it will split filesstarting with name xaa next xab....)
I don't see any benefit in splitting files for processing.



I have a server on RAID0 with 882MB/s read and 191MB/s write, so copying this stuff to a different place on the same disk will take about 40 seconds or so for a 30GB dataset.
Dedicated? Cheesy That's the dream Shocked But even then, sorting data means reading and writing the same data several times.

Quote
AWS VPS's run on shared hardware so that's probably why you're getting throttled. There are dedicated servers on AWS you can get where you're in total control over the hardware and they don't throttle you and stuff.
AWS dedicated is totally out of my price range (for this side project that already got out of hand). I wasn't planning on spending a lot of money on this long-term, but if I can find a very affordable solution I can just keep adding servers to my collection.
So far it's looking good on speed improvements; especially getting rid of the sort command that ran on disk helps a lot.



You are on AWS, right?   Why not have your sponsor upgrade your instance to a higher class for a few hours?  That's the beauty of on-demand processing. Smiley
I don't want to be demanding, and AWS charges (my sponsor) $0.09 per GB. That's okay for HTML, but not for large files. My List of all Bitcoin addresses with a balance alone transferred 450 GB since the start of this year. That would be $1000 per year on AWS, while it costs only a fraction elsewhere. I love how reliable AWS is, it just always works, but that's not necessary for my blockdata.
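(For reference: 450 GB in two weeks extrapolates to roughly 11-12 TB per year, which at $0.09 per GB is indeed on the order of $1000 per year.)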

Vod
Legendary
Activity: 3920
Merit: 3167
Licking my boob since 1970
January 14, 2021, 06:42:40 PM
 #54

I don't want to be demanding, and AWS charges (my sponsor) $0.09 per GB. That's okay for HTML, but not for large files. My List of all Bitcoin addresses with a balance alone transferred 450 GB since the start of this year. That would be $1000 per year on AWS, while it costs only a fraction elsewhere. I love how reliable AWS is, it just always works, but that's not necessary for my blockdata.

Full price storage for AWS:
First 50 TB / Month   $0.023 per GB
Next 450 TB / Month   $0.022 per GB
Over 500 TB / Month   $0.021 per GB
You can then reduce these costs by up to 72% if you commit to a certain spend.

Data transfer out of AWS:
Up to 1 GB / Month      $0.00 per GB
Next 9.999 TB / Month   $0.09 per GB (About $500 a year)

Consider that your data is all alone on your VPS too.  If you were on AWS, you could transfer your data to other AWS clients (like me) for $0.01 per GB.  Smiley

Also, you can get a server with 256GB of RAM and 32 processors for $1.50 per hour.  You attach your storage to the VPS, run your queries for however long it takes, then terminate the instance and move your storage back to your existing lower powered system.

Now that you have an established site and use case for AWS services, I can get you $1,000 in AWS credits for your own account, if you are interested.  I'm in training to be certified as a cloud consultant.


LoyceV (OP)
Legendary
Activity: 3528
Merit: 17819
Thick-Skinned Gang Leader and Golden Feather 2021
January 15, 2021, 08:48:57 AM
 #55

Next 9.999 TB / Month   $0.09 per GB (About $500 a year)
Today's counter is at 50 GB (in 15 days); that would put me at $1000 per year if I had to pay $0.09 per GB. At the current rate I'll hit this VPS's data limit by the end of the month, and so far traffic keeps going up. My current limit is 1 TB/month, and for $0.00067 per GB I can double that.

Quote
Also, you can get a server with 256GB of RAM and 32 processors for $1.50 per hour.  You attach your storage to the VPS, run your queries for however long it takes, then terminate the instance and move your storage back to your existing lower powered system.

Now that you have an established site and case use for AWS services, I can get you $1,000 in AWS credits for your own account, if you are interested.  I'm in training to be certified as a cloud consultant.
The offer is good, but AWS wants my credit card, which I don't want to link to this. I only use hosting that accepts crypto.

Vod
Legendary
Activity: 3920
Merit: 3167
Licking my boob since 1970
January 15, 2021, 06:09:41 PM
 #56

The offer is good, but AWS wants my credit card, which I don't want to link to this. I only use hosting that accepts crypto.

Hello!  (waving)   You can have full control of an account minus billing.  I can pay the bill and accept crypto.

JustHereReading
Newbie
Activity: 6
Merit: 15
January 16, 2021, 09:20:19 AM
 #57

Really curious how that test works out. I do hope it does a little bit more than just concatenate the files without sorting them.
It merges all lines from both sorted files in sorted order. After several tests (on my old desktop with HDD), these are the relevant results:
Code:
Old process:
time cat <(gunzip -c addresses_sorted.txt.gz) daily_updates/*.txt | sort -uS80% | gzip > test1.txt.gz
real    90m2.883s

Faster new process:
time sort -mu <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) | gzip > test2.txt.gz
real    51m26.730s
The output is the same.
Interestingly, when I tell sort -m to use up to 40% of my RAM, it actually uses that (even though it doesn't need it), which slows it down by 7 minutes.
Most CPU time is spent compressing the new gzip file.
That's a significant improvement. You could give pigz a try, see: https://unix.stackexchange.com/a/88739/314660. I'm not sure what the drawbacks would be; I've never tried pigz myself.
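It should be a drop-in replacement in the pipeline above, e.g. (untested sketch; -p caps the number of compression threads):
Code:
sort -mu <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) | pigz -p 4 > test2.txt.gz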

Quote
I think you wrote that you'd need about 256GB of RAM for that operation, right? Sorry... can't help you out there. However a bloomfilter might be nice to implement if you have a 'bit' of RAM (a lot less than 256GB).
That's going over my head, and probably far too complicated for something this simple.
Honestly, the bloomfilter was a silly suggestion. It will probably not be a big improvement (if any) compared to your current code.

I use Blockchair's daily outputs to update this, not the daily list of addresses.
See: http://blockdata.loyce.club/alladdresses/daily_updates/ for old daily files.
Thanks! Hoping to do some experimenting soon (if I have the time...)
LoyceV (OP)
Legendary
Activity: 3528
Merit: 17819
Thick-Skinned Gang Leader and Golden Feather 2021
January 16, 2021, 04:56:49 PM
 #58

You can have full control of an account minus billing.  I can pay the bill and accept crypto.
It's really not worth it for this project. I prefer to pay a small amount once a year, and once the server reaches its data limit, it just shuts down until the next month starts.

You could give pigz a try, see: https://unix.stackexchange.com/a/88739/314660. I'm not sure what the drawbacks would be; I've never tried pigz myself.
Parallel compression is only useful when server load isn't a restriction. For now I'll stick with standard gzip.

MrFreeDragon
Sr. Member
Activity: 443
Merit: 350
January 18, 2021, 11:42:25 PM
 #59

Hi! Is it possible to link the public key for every bitcoin address in your database? (Of course, only for those addresses whose public key has been exposed.)

LoyceV (OP)
Legendary
Activity: 3528
Merit: 17819
Thick-Skinned Gang Leader and Golden Feather 2021
January 19, 2021, 09:53:27 PM
 #60

Is it possible to link the public key for every bitcoin address in your database?
If I can get the data, I can add it. I'm no expert on this; can I use anything from the inputs (maybe spending_signature_hex?) to get this data?
