Print Page - LoyceV's small Linux commands for handling big data

Title: LoyceV's small Linux commands for handling big data
Post by: LoyceV on April 19, 2022, 10:44:38 AM

I like simple (Linux) commands that process a lot of data. A console is much more powerful than a GUI.
I've posted them in various topics, but from now on will collect them here. Use them at your own risk.

Warning
Don't just copy paste anything (https://www.cyberciti.biz/faq/understanding-bash-fork-bomb/) you find online into a console! Try to understand it first.

Self-moderated
No spam please. Questions are of course okay. Adding content is appreciated.

Overview

Get pubkeys out of Bitcoin block data (https://bitcointalk.org/index.php?topic=5395212.msg59906483#msg59906483)

Title: Re: LoyceV's small Linux commands for handling big data
Post by: LoyceV on April 19, 2022, 11:08:31 AM

Get pubkeys out of Bitcoin block data (https://bitcointalk.org/index.php?topic=5307550.0) (this was requested here (https://bitcointalk.org/index.php?topic=5254914.msg59902358#msg59902358)).

Note: this list is not meant for verbatim copy/pasting, it's my own notes of what I did (more or less).

Get outputs data (currently 148 GB)

Code:

wget -r blockdata.loyce.club/outputs/

Get all addresses with pubkey

Code:

for day in `ls outputs/*gz`; do echo $day; gunzip -c $day | grep -v is_from_coinbase | grep -v pubkeyhash | grep pubkey | cut -f 1,2,4-11 >> output.txt; done

(output.txt includes more columns than strictly needed)

Get currently funded Bitcoin addresses and their balance (1 GB)

Code:

wget addresses.loyce.club/blockchair_bitcoin_addresses_and_balance_LATEST.tsv.gz

Get all unique addresses that are in both lists

Code:

comm -12 <(cat output.txt | cut -f6 | sort -u -S40%) <(gunzip -c blockchair_bitcoin_addresses_and_balance_LATEST.tsv.gz | grep -v balance | cut -f1 | grep "^1" | sort -S40% ) > list

Get list of balances, addresses and pubkeys

Code:

gunzip -c blockchair_bitcoin_addresses_and_balance_LATEST.tsv.gz | grep "^1" | sort -rS60% | uniq -w30 -d > address_and_balance
cat <(cat output.txt | cut -f 6,8) list | sort -rS40% | uniq -w30 -d > address_and_pubkey
for addy in `cat list`; do balance="`grep $addy address_and_balance | cut -f2`"; pubkey="`grep $addy address_and_pubkey | cut -f2`"; echo "$balance $addy $pubkey"; done | sort -nr > balance_addy_pubkey.txt

(the last for-loop is quick and dirty, slow and inefficient, but considering there's not that much data, I didn't bother improving it)

Title: Re: LoyceV's small Linux commands for handling big data
Post by: DeepComplex on April 25, 2022, 10:51:10 PM

Hi LoyceV,

Can you do something similar to what you did for the outputs? Can you process the inputs data dumps by extracting the following columns from them: recipient, type=pubkeyhash only, spending_witness. Once done, remove duplicate entries and cross-reference with the "List of all funded Bitcoin addresses" keeping only the recipients with a positive balance. Compare the recipients with balance_addy_pubkey.txt (http://balance_addy_pubkey.txt) and the difference will be the results I'm looking for. The spending_witness will contain the pubkey of those pubkeyhash addresses (truncating the 1st 148 characters will leave only the pukey remaining).

The instructions would be useful too.

Regards,

Code:

53225	24cf2dedab3c7898ec0f0532e177f15f41984b0c820202bcb143e43eec0c25d2	1	2010-04-27 00:08:11	4106000000	0.4106	15Z5YJaaNSxeynvr6uW6jQZLwq3n1Hu6RX	pubkeyhash	76a91431f19a7d0379f56cb3be0761c21f1f0c9553a47f88ac	0	-1	53241	8084468e05c6faa4029fbfd7b9b9d33e2274e9f40634207bcf86197fc6f83af5	1	2010-04-27 02:45:48	0.4106	4294967295	483045022069e95f67cc6fed7db01885d76e3294e443d9833316228bbd057f0f3a3bdd51630221009ef89fa8c34f37a245333c67d56feaef27b26651c4a30b6499e1bc386337ca8f0141047a51392bace353f4c3788c9c090ef4f635ec211159ec3b9f1bb7da7679517e126e98e0012bcb4d2b023c479afaaa1ad703ea1b24e1910e2cdad38744ba7aab8a		9457	4.494264120370371

Title: Re: LoyceV's small Linux commands for handling big data
Post by: LoyceV on April 26, 2022, 08:56:45 AM

Quote from: DeepComplex on April 25, 2022, 10:51:10 PM

Can you process the inputs data dumps by extracting the following columns from them: recipient, type=pubkeyhash only, spending_witness.

I did this:

Code:

for day in `inputs/*gz`; do echo $day; gunzip -c $day | cut -f7,8,19 | grep -v spending_witness | grep pubkeyhash | grep -vP "\t$" >> output.txt; done

It's still running (and will take many hours to complete), but up until 2014, there is no output at all. I'll let it finish to be sure.
Update: as expected: no output.

Quote

Code:

53225	24cf2dedab3c7898ec0f0532e177f15f41984b0c820202bcb143e43eec0c25d2	1	2010-04-27 00:08:11	4106000000	0.4106	15Z5YJaaNSxeynvr6uW6jQZLwq3n1Hu6RX	pubkeyhash	76a91431f19a7d0379f56cb3be0761c21f1f0c9553a47f88ac	0	-1	53241	8084468e05c6faa4029fbfd7b9b9d33e2274e9f40634207bcf86197fc6f83af5	1	2010-04-27 02:45:48	0.4106	4294967295	483045022069e95f67cc6fed7db01885d76e3294e443d9833316228bbd057f0f3a3bdd51630221009ef89fa8c34f37a245333c67d56feaef27b26651c4a30b6499e1bc386337ca8f0141047a51392bace353f4c3788c9c090ef4f635ec211159ec3b9f1bb7da7679517e126e98e0012bcb4d2b023c479afaaa1ad703ea1b24e1910e2cdad38744ba7aab8a		9457	4.494264120370371

Are you looking for "483045022069e95f67cc6fed7db01885d76e3294e443d9833316228bbd057f0f3a3bdd516302210 09ef89fa8c34f37a245333c67d56feaef27b26651c4a30b6499e1bc386337ca8f0141047a51392b ace353f4c3788c9c090ef4f635ec211159ec3b9f1bb7da7679517e126e98e0012bcb4d2b023c479 afaaa1ad703ea1b24e1910e2cdad38744ba7aab8a"? That's not spending_witness, it's spending_signature_hex. For pubkeyhash entries, spending_witness is empty (there are double tabs in the .tsv file without any data in between, so without using a spreadsheet it's easy to overlook).

Title: Re: LoyceV's small Linux commands for handling big data
Post by: DeepComplex on April 26, 2022, 01:08:57 PM

Hi LoyceV,

Yes, I'm looking for spending_signature_hex. Sorry about the mixup.

Kindly let it process all the blocks up to the current one.

Thanks again

Title: Re: LoyceV's small Linux commands for handling big data
Post by: LoyceV on April 26, 2022, 04:33:44 PM

Quote from: DeepComplex on April 26, 2022, 01:08:57 PM

Yes, I'm looking for spending_signature_hex. Sorry about the mixup.

I'm currently running this:

Code:

for day in `ls inputs/*gz`; do echo $day; gunzip -c $day | cut -f7,8,18 | grep -v spending_signature_hex | grep pubkeyhash | grep -vP "\t$" >> output2.txt; done

~~If it doesn't run out of disk space, I'll continue from there tomorrow.~~ It won't fit for sure :( That's going to be a problem, because I need a few times more disk space to sort the data. 1 TB might not even be enough.

This is a bit smaller:

Code:

for day in `ls /var/www/blockdata.loyce.club/public_html/inputs/*gz`; do echo $day; gunzip -c $day | cut -f7,8,18 | grep -v spending_signature_hex | grep pubkeyhash | grep -vP "\t$" | cut -f1,3 >> output2.txt; done

MrFreeDragon (https://bitcointalk.org/index.php?topic=5265993.msg56210642#msg56210642) already described what we're looking for here.

I'll try this:

Code:

for day in `ls /var/www/blockdata.loyce.club/public_html/inputs/*gz`; do echo $day; gunzip -c $day | cut -f7,8,18 | grep -v spending_signature_hex | grep pubkeyhash | grep -vP "\t$" | cut -f1,3 > /dev/shm/tmp.file; paste <(cat /dev/shm/tmp.file | cut -f1) <(cat /dev/shm/tmp.file | cut -f2 | cut -c149-) | sort -u -S40% >> output2.txt; rm /dev/shm/tmp.file; done

Explanation: it truncates spending_signature_hex to only the pubkey (I used a detour, but it works without reducing performance much). This reduces the size, and I can remove duplicates for each day already. That should reduce the output size.
I'll let it run overnight.
Update: This produces different results:

Code:

1zxhKVZtMBt8kf7km2shn2mkR4NLHGigT       0405eec604993048314294f7c1f9b45c3ed8424ef940426336153831f8813228a788f845e1df353c2021174573e33f2fab05d94e1dd5e5449832ec83ac3d5db17e
1zxhKVZtMBt8kf7km2shn2mkR4NLHGigT       05eec604993048314294f7c1f9b45c3ed8424ef940426336153831f8813228a788f845e1df353c2021174573e33f2fab05d94e1dd5e5449832ec83ac3d5db17e

Somehow the 148 characters isn't a constant.
Another example with full data:

Code:

block_id	transaction_hash	index	time	value	value_usd	recipient	type	script_hex	is_from_coinbase	is_spendable	spending_block_id	spending_transaction_hash	spending_index	spending_time	spending_value_usd	spending_sequence	spending_signature_hex	spending_witness	lifespan	cdd
355251	e73b8e08f86b6130ed00531cf3fb9fff4718f87d33bee879ff6119bd6b88f142	0	2015-05-06 18:42:33	300000	0.7111	14XvKaYnFEH2SioW2oRpgCKaMsaFwmFT7C	pubkeyhash	76a91426c15da2beb1fb4a6db5ba4c993a806d22cd10d688ac	0	-1	356405	12c440ae9cd07a530d5a942e56cfae0f0b3797bd3c579366ecd0a7fa8a2d24d0	11	2015-05-14 15:51:23	0.7307	4294967295	47304402204a20d3044915706057c2a8a2a2ec7ba7e6ceb5735ab172329565df9bce70ea68022022f81e2d8b041c9eb3526168ab6ae0a6d375b1696abd86a0721cd64c7951be8101410421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4		680930	0.023643402777777777
355939	077b2b53f3dc4ab060ad0e5752dd7ab4d036085d58d6e61f1bf335a8e9385c1a	0	2015-05-11 15:34:00	40000	0.097	14XvKaYnFEH2SioW2oRpgCKaMsaFwmFT7C	pubkeyhash	76a91426c15da2beb1fb4a6db5ba4c993a806d22cd10d688ac	0	-1	356405	12c440ae9cd07a530d5a942e56cfae0f0b3797bd3c579366ecd0a7fa8a2d24d0	12	2015-05-14 15:51:23	0.0974	4294967295	4730440220066b7baba05a4dca623e7719a7ddca253c0ac9dd9b11c8a4d78a480c2a7b9fbf02203cc0f7f64a919ed44ca5b05cc766f708697e091fee13b9fe3be84b5a12c656c501410421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4		260243	0.0012048287037037036
355948	8d43ed844cc99bfdb0faa1fa0f7803af32e21457053aba385380d712393773f5	0	2015-05-11 16:35:42	28300	0.0687	14XvKaYnFEH2SioW2oRpgCKaMsaFwmFT7C	pubkeyhash	76a91426c15da2beb1fb4a6db5ba4c993a806d22cd10d688ac	0	-1	356405	12c440ae9cd07a530d5a942e56cfae0f0b3797bd3c579366ecd0a7fa8a2d24d0	13	2015-05-14 15:51:23	0.0689	4294967295	4830450221008bf415b6c4bc7118a1d93ef8f6c63b0801d9abe2e41e390670acf9677ee58e5602200da3df76f11ae04758c947a975f84dd7dba990e00c146b451dc4fa514c6cb52d01410421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4		256541	0.0008402905439814814
356093	65c62d27591ae513e9b5ddb1630b083e9f804c0bc89052c5c17e9ee3cdde29a6	0	2015-05-12 14:16:22	10000	0.0242	14XvKaYnFEH2SioW2oRpgCKaMsaFwmFT7C	pubkeyhash	76a91426c15da2beb1fb4a6db5ba4c993a806d22cd10d688ac	0	-1	356405	12c440ae9cd07a530d5a942e56cfae0f0b3797bd3c579366ecd0a7fa8a2d24d0	14	2015-05-14 15:51:23	0.0244	4294967295	483045022100a83ca95b6b3153c5fce971c1eebbeebc892ba6c297157c326a8359c9b408ce1902201904060ce4e1fbd455403546232779dc9ca7bfe3582d3055270f27f245575d0901410421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4		178501	0.00020659837962962966
356247	9873b5b567b7947fdb271ed37240da6ee45085f9804565c82a02f8a3a376833d	0	2015-05-13 14:46:45	12000	0.0292	14XvKaYnFEH2SioW2oRpgCKaMsaFwmFT7C	pubkeyhash	76a91426c15da2beb1fb4a6db5ba4c993a806d22cd10d688ac	0	-1	356405	12c440ae9cd07a530d5a942e56cfae0f0b3797bd3c579366ecd0a7fa8a2d24d0	15	2015-05-14 15:51:23	0.0292	4294967295	47304402202a6531662b10fdc9d4b4e7ada1a4db2cdba6e0d22155ebbcedb8545a9ddfd1170220760a431d5eac0f32e8f6fc5812cc9ea5c1530eac37f11768288185bb87cba3ea01410421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4		90278	0.00012538611111111112
356375	a6afeb2893da616d75c39a4eeaed51c91fef715930f356988665a32028cad057	204	2015-05-14 10:33:58	99600	0.2426	14XvKaYnFEH2SioW2oRpgCKaMsaFwmFT7C	pubkeyhash	76a91426c15da2beb1fb4a6db5ba4c993a806d22cd10d688ac	0	-1	356405	12c440ae9cd07a530d5a942e56cfae0f0b3797bd3c579366ecd0a7fa8a2d24d0	16	2015-05-14 15:51:23	0.2426	4294967295	47304402202ad9c30164315b7ae2a3d280f629a014d316f4177e4dbde6167bd83c628a4d050220308bba5e628de5a47606d1e0ecd939332f7036a6da7eb6458e2bd6e690bd553d01410421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4		19045	0.00021954652777777774

Any idea why? The pubkey doesn't seem to have a fixed length either, so counting characters from the end won't work.

Title: Re: LoyceV's small Linux commands for handling big data
Post by: DeepComplex on April 26, 2022, 06:55:53 PM

Hi, I have no idea why it is doing that.

Is it possible to keep with one of the lines? I'd probably guess that the address will be empty anyways or might have to do some manual on the data and the final comparison for a positive balance.

Regards,

Title: Re: LoyceV's small Linux commands for handling big data
Post by: LoyceV on April 28, 2022, 08:27:52 AM

Quote from: DeepComplex on April 26, 2022, 06:55:53 PM

Hi, I have no idea why it is doing that.

Is it possible to keep with one of the lines?

I'd need to know which one to keep, and even better: how to decide that from the raw data. The current output is 81 GB (815,497,912 lines), which is too large to sort on this server (only 93 GB disk space remaining). Sorting is needed to remove all duplicates, which is needed to reduce the file size before checking for addresses with a balance.

Title: Re: LoyceV's small Linux commands for handling big data
Post by: DeepComplex on April 28, 2022, 04:48:19 PM

Keep the records that are 131 and 66 characters long. 131 chars are for the uncompress pubkey and 66 chars for the compressed pubkey. The uncompressed pubkey prefix is 04 and the compressed pukey is 02 or 03.

Title: Re: LoyceV's small Linux commands for handling big data
Post by: LoyceV on April 28, 2022, 05:40:46 PM

Quote from: DeepComplex on April 28, 2022, 04:48:19 PM

This one is 130 characters:

Code:

0405eec604993048314294f7c1f9b45c3ed8424ef940426336153831f8813228a788f845e1df353c2021174573e33f2fab05d94e1dd5e5449832ec83ac3d5db17e

The compressed ones are indeed only half the size. But I meant the ones where the length varies by only 2, which is probably caused by truncating the first 148 characters.
Could it be the 148 characters isn't always the same?
Compare those 2:

Code:

483045022100a83ca95b6b3153c5fce971c1eebbeebc892ba6c297157c326a8359c9b408ce1902201904060ce4e1fbd455403546232779dc9ca7bfe3582d3055270f27f245575d0901410421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4
47304402202a6531662b10fdc9d4b4e7ada1a4db2cdba6e0d22155ebbcedb8545a9ddfd1170220760a431d5eac0f32e8f6fc5812cc9ea5c1530eac37f11768288185bb87cba3ea01410421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4

Title: Re: LoyceV's small Linux commands for handling big data
Post by: DeepComplex on April 28, 2022, 10:32:44 PM

It seems like it's not always 148 chars that can be truncated.

Title: Re: LoyceV's small Linux commands for handling big data
Post by: LoyceV on April 29, 2022, 07:45:46 AM

Quote from: DeepComplex on April 28, 2022, 10:32:44 PM

It seems like it's not always 148 chars that can be truncated.

It seems like it :) Combined with the fact that only a part of the spending_witness data is the pubkey, and that the pubkey length itself can vary, I don't know how to proceed. If you can figure it out, I'll continue this, but I don't have the time to search for it myself now.

Title: Re: LoyceV's small Linux commands for handling big data
Post by: PawGo on April 29, 2022, 02:51:15 PM

Quote from: LoyceV on April 28, 2022, 05:40:46 PM

Quote from: DeepComplex on April 28, 2022, 04:48:19 PM

This one is 130 characters:

Code:

0405eec604993048314294f7c1f9b45c3ed8424ef940426336153831f8813228a788f845e1df353c2021174573e33f2fab05d94e1dd5e5449832ec83ac3d5db17e

Code:

483045022100a83ca95b6b3153c5fce971c1eebbeebc892ba6c297157c326a8359c9b408ce1902201904060ce4e1fbd455403546232779dc9ca7bfe3582d3055270f27f245575d0901410421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4
47304402202a6531662b10fdc9d4b4e7ada1a4db2cdba6e0d22155ebbcedb8545a9ddfd1170220760a431d5eac0f32e8f6fc5812cc9ea5c1530eac37f11768288185bb87cba3ea01410421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4

Hmmm
It is not like leading zero removed somewhere during the process?
I though pubkeys should always have 65 or 33 bytes (including the one for flag).

Title: Re: LoyceV's small Linux commands for handling big data
Post by: LoyceV on April 29, 2022, 04:29:34 PM

Quote from: PawGo on April 29, 2022, 02:51:15 PM

It is not like leading zero removed somewhere during the process?
I though pubkeys should always have 65 or 33 bytes (including the one for flag).

I'm not sure. If there are leading zeros when the pubkey is "shorter", I may be able to include them by simply counting from the right instead of from the left.

Title: Re: LoyceV's small Linux commands for handling big data
Post by: DeepComplex on April 29, 2022, 05:11:04 PM

That is a good idea. You'll have to search for both the uncompressed and compressed keys in the list.

Quote from: LoyceV on April 29, 2022, 04:29:34 PM

Quote from: PawGo on April 29, 2022, 02:51:15 PM

It is not like leading zero removed somewhere during the process?
I though pubkeys should always have 65 or 33 bytes (including the one for flag).

I'm not sure. If there are leading zeros when the pubkey is "shorter", I may be able to include them by simply counting from the right instead of from the left.

Title: Re: LoyceV's small Linux commands for handling big data
Post by: DeepComplex on May 01, 2022, 11:55:52 PM

Hi LoyceV,

Any updates?

Regards,

Title: Re: LoyceV's small Linux commands for handling big data
Post by: LoyceV on May 02, 2022, 06:35:27 AM

Quote from: DeepComplex on May 01, 2022, 11:55:52 PM

Any updates?

Nope:

Quote from: LoyceV on April 29, 2022, 07:45:46 AM

I don't know how to proceed. If you can figure it out, I'll continue this, but I don't have the time to search for it myself now.

Title: Re: LoyceV's small Linux commands for handling big data
Post by: iceland2k14 on May 08, 2022, 07:50:36 AM

@LoyceV The Values in Sigscript (Contains R,S, Pubkey) is not fixed but they have defined structure. One piece of the Structure is As shown by @MrFreeDragon in this Link https://pastebin.com/Q55PyUgB (https://pastebin.com/Q55PyUgB)
But even in this structure the length is not always 0x21 or 0x20 or 0x41. it varies and therefore the length of R and S and Pubkey will vary. You will need to use dynamic sizing variables to extract them. Perhaps use a Awk script or Python. That might be easier. Don't know if the Bash Shell can do all of it.

The basic way to decode and extract the variable size of the data can be taken by following code below...

Code:

def get_rs(sig):
    rlen = int(sig[2:4], 16)
    r = sig[4:4+rlen*2]
#    slen = int(sig[6+rlen*2:8+rlen*2], 16)
    s = sig[8+rlen*2:]
    return r, s
    
def split_sig_pieces(script):
    sigLen = int(script[2:4], 16)
    sig = script[2+2:2+sigLen*2]
    r, s = get_rs(sig[4:])
    pubLen = int(script[4+sigLen*2:4+sigLen*2+2], 16)
    pub = script[4+sigLen*2+2:]
#    assert(len(pub) == pubLen*2)
    return r, s, pub

r, s, pub = split_sig_pieces(script)

Code:

script:  8b4830450221008bf415b6c4bc7118a1d93ef8f6c63b0801d9abe2e41e390670acf9677ee58e5602200da3df76f11ae04758c947a975f84dd7dba990e00c146b451dc4fa514c6cb52d01410421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4
R:  008bf415b6c4bc7118a1d93ef8f6c63b0801d9abe2e41e390670acf9677ee58e56
S:  0da3df76f11ae04758c947a975f84dd7dba990e00c146b451dc4fa514c6cb52d
pub:  0421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4

This way you can not only extract all the Pubkeys but can also extract all the R & S values of the Signature, if needed.

Title: Re: LoyceV's small Linux commands for handling big data
Post by: LoyceV on May 15, 2022, 02:37:18 PM

@iceland2k14: thanks, but it feels like I'm in over my head. I've abandoned the pubkey project (at least for now).

It looks like I'm going to need to learn using a database.
Let's say I have a list like this:

Code:

sender recipient value fee
0xae0bf57678cf8151ff95889078e944a7696e18d5 0x930509a276601ca55d508cb5983c2c0d699fd7e9 1 39890258223793255
0x8c0fcd139568055e92a2b96c48ac85fa076c6c6a 0x202f1cbc8a208ee6dece54bb8837950b89e704b6 0 316430894723098310
0x1c7e19f5283aa41a496c1f351b36e96dbaad507f 0x7e75aefd78dbfd7e0846cf608151563164fbb7b2 0 42016257624091770
0xeca2e2d894d19778939bd4dfc34d2a3c45e96456 0xeca2e2d894d19778939bd4dfc34d2a3c45e96456 0 7521743554059000
0x26bce6ecb5b10138e4bf14ac0ffcc8727fef3b2e 0x26bce6ecb5b10138e4bf14ac0ffcc8727fef3b2e 0 7521743554059000
0x57845987c8c859d52931ee248d8d84ab10532407 0xd9e1ce17f2641f24ae83637ab66a2cca9c378b9f 63984998281303121920 9298601644341820
0x6604ac53a82cd784525e5f90652c4d6e6b2252af 0x8a0f69b5f97d5c5a2573314e91ef9d7f46ba6da1 0 32647788000000000
0x23d9e4be4d1d2b2a43a51cc66da725f0bd25ec43 0x95172ccbe8344fecd73d0a30f54123652981bd6f 0 17067300000000000
0x6e90ae41af1dea6f0006aa7752d9db2cf5e6a49f 0xd9e1ce17f2641f24ae83637ab66a2cca9c378b9f 55455427699986136143 38403951353497129

Want I want, is a database that only contains Addresses and their Value. To get this, Value and Fee get subtracted from Sender's balance, and Value gets added to Recipient's balance. The input data will be around 200 GB, including many duplicate addresses.

Given that I know nothing about databases, how would I start doing this? Is it going to be a problem if the database is larger than my RAM? If needed, I can (easily) split this list up into 2 lists: one with Sender, Value and Fee, and the other with Recipient and Value.

@TryNinja: Considering the performance you managed to get on ninjastic.space, I think you're the right person to ask :) Allow me to notify you :)

To make it easier to understand what I need, I can turn the above table into this:

Code:

0x930509a276601ca55d508cb5983c2c0d699fd7e9 1
0xd9e1ce17f2641f24ae83637ab66a2cca9c378b9f 63984998281303121920
0xd9e1ce17f2641f24ae83637ab66a2cca9c378b9f 55455427699986136143

0xae0bf57678cf8151ff95889078e944a7696e18d5 -1
0x57845987c8c859d52931ee248d8d84ab10532407 -63984998281303121920
0x6e90ae41af1dea6f0006aa7752d9db2cf5e6a49f -55455427699986136143

0xae0bf57678cf8151ff95889078e944a7696e18d5 -39890258223793255
0x8c0fcd139568055e92a2b96c48ac85fa076c6c6a -316430894723098310
0x1c7e19f5283aa41a496c1f351b36e96dbaad507f -42016257624091770
0xeca2e2d894d19778939bd4dfc34d2a3c45e96456 -7521743554059000
0x26bce6ecb5b10138e4bf14ac0ffcc8727fef3b2e -7521743554059000
0x57845987c8c859d52931ee248d8d84ab10532407 -9298601644341820
0x6604ac53a82cd784525e5f90652c4d6e6b2252af -32647788000000000
0x23d9e4be4d1d2b2a43a51cc66da725f0bd25ec43 -17067300000000000
0x6e90ae41af1dea6f0006aa7752d9db2cf5e6a49f -38403951353497129

Sorting gives this:

Code:

0x1c7e19f5283aa41a496c1f351b36e96dbaad507f -42016257624091770
0x23d9e4be4d1d2b2a43a51cc66da725f0bd25ec43 -17067300000000000
0x26bce6ecb5b10138e4bf14ac0ffcc8727fef3b2e -7521743554059000
0x57845987c8c859d52931ee248d8d84ab10532407 -63984998281303121920
0x57845987c8c859d52931ee248d8d84ab10532407 -9298601644341820
0x6604ac53a82cd784525e5f90652c4d6e6b2252af -32647788000000000
0x6e90ae41af1dea6f0006aa7752d9db2cf5e6a49f -38403951353497129
0x6e90ae41af1dea6f0006aa7752d9db2cf5e6a49f -55455427699986136143
0x8c0fcd139568055e92a2b96c48ac85fa076c6c6a -316430894723098310
0x930509a276601ca55d508cb5983c2c0d699fd7e9 1
0xae0bf57678cf8151ff95889078e944a7696e18d5 -1
0xae0bf57678cf8151ff95889078e944a7696e18d5 -39890258223793255
0xd9e1ce17f2641f24ae83637ab66a2cca9c378b9f 55455427699986136143
0xd9e1ce17f2641f24ae83637ab66a2cca9c378b9f 63984998281303121920
0xeca2e2d894d19778939bd4dfc34d2a3c45e96456 -7521743554059000

Sorting is going to be slow for the full data set, and probably takes about 500 GB tmp space, but it's doable if it helps.

Main question: how do I put this in .db format?

Title: Re: LoyceV's small Linux commands for handling big data
Post by: PawGo on May 15, 2022, 05:07:00 PM

I do not really understand what do you mean as a "database". Do you think about any particular implementation? What do you mean by ".db format"?
Why not to launch mysql or maybe better postgresql server? Loading file like that to database table is trivial.
RAM has nothing to do with that I think. I mean - it helps, but is not a blocking constraint.

Title: Re: LoyceV's small Linux commands for handling big data
Post by: LoyceV on May 15, 2022, 05:14:56 PM

Quote from: PawGo on May 15, 2022, 05:07:00 PM

I do not really understand what do you mean as a "database". Do you think about any particular implementation? What do you mean by ".db format"?

That confirms that I know nothing about databases :(

Quote

Why not to launch mysql or maybe better postgresql server? Loading file like that to database table is trivial.
RAM has nothing to do with that I think. I mean - it helps, but is not a blocking constraint.

I've heard of mysql, but not PostgreSQL.

"Trivial" sounds great :D But I have no idea how :P Google shows this (https://stackoverflow.com/questions/18223665/postgresql-query-from-bash-script-as-database-user-postgres), if that's the right track I can try it.
Any idea how to handle duplicate addresses: 1 address with 2 balances that have to be added together?

Title: Re: LoyceV's small Linux commands for handling big data
Post by: PawGo on May 15, 2022, 05:31:03 PM

Mysql is also good, they give https://www.mysql.com/products/workbench/
I will not force you to install MS SQLServer or any monster from Oracle.
Let's say you decide to use postgresql. Then you receive a very nice client - pgAdmin https://www.pgadmin.org/ Using tool like that will be very helpful for you.

Then you may for example: create table (tx id, address, balanceChange),
txid could be our primary key (unique), you should also create index on recipient, as you will launch search using that field.
load data:
https://sunitc.dev/2021/07/17/load-csv-data-into-a-database-table-mysql-postgres-etc/
(create index after you load data, otherwise loading will take ages)

Then you may very easily check balance change (delta) for each recipient:

Code:

select address, sum(balance) from tableName group by address

It should give you list of addresses with their balance changes.

Just try to list all the possible use cases, think what do you need, how you want to use it - to build a correct data model. It may be the most difficult task - just not to duplicate data, etc
https://www.guru99.com/relational-data-model-dbms.html
https://en.wikipedia.org/wiki/Database_normalization

But if you start, sky is the limit ;) it will be much easier than playing with text files.

Title: Re: LoyceV's small Linux commands for handling big data
Post by: LoyceV on May 15, 2022, 05:59:46 PM

Thanks! I only now realize I can just add all balances for each address, and sum them later when needed.

The drawback of such a versatile database is that I have a lot of catching up to do. Thanks for the links, I'll see if I can get something working tomorrow :)

Update: I don't need a database anymore, I'll stick to what I know: clear text :)

Title: Re: LoyceV's small Linux commands for handling big data
Post by: DeepComplex on May 21, 2022, 01:34:40 PM

I also prefer the clear tex model.

Title: Re: LoyceV's small Linux commands for handling big data
Post by: LoyceV on August 20, 2022, 11:30:30 AM

Question

I have a file (300 MB) with 8.9 million Bitcoin addresses. I also have a directory (67 GB) with all Bitcoin addresses. I want to know which address from the file is in the direcory more than once.

I use this:

Code:

grep -hf addresses.txt alladdys/* | sort | uniq -d > output.txt

For small lists of addresses, this works fine! However, with the 300 MB list, grep takes 94% of my 16 GB RAM, and there doesn't seem to be any progress. I didn't expect grep would use this much memory for a 300 MB file.
What would be a better solution?

Title: Re: LoyceV's small Linux commands for handling big data
Post by: seoincorporation on August 20, 2022, 04:38:37 PM

Quote from: LoyceV on May 15, 2022, 02:37:18 PM

...
Main question: how do I put this in .db format?

You don't have to put it in a DB format at all because you can import text files to a data base. The tric is to use tabs and not space betweet the address and the balance.

Code:

echo "hello word" | sed -e 's/ /\t/g'

Once you have changed that, then you can load it in to a table with:

Code:

LOAD DATA INFILE '/tmp/addys.txt' INTO TABLE AddresTable;

Source: https://stackoverflow.com/questions/13579810/how-to-import-data-from-text-file-to-mysql-database

Title: Re: LoyceV's small Linux commands for handling big data
Post by: PawGo on August 21, 2022, 10:23:55 AM

Quote from: LoyceV on August 20, 2022, 11:30:30 AM

Hi,
I do not understand what do you mean by "directory" - is it a file with addresses? Or directory on hdd where each file has name like an address? Then how you may have the same address twice?
I have prepared a small program for you:
https://github.com/PawelGorny/Loyce60787783
It reads into memory list of addresses and then reads "directory" file with addresses - if address exists in memory, is marked, if the same address in hit for the second time, is removed from memory and saved to file. If you want to calculate how many times address was hit, the change is needed.

Title: Re: LoyceV's small Linux commands for handling big data
Post by: LoyceV on August 21, 2022, 11:46:18 AM

Quote from: PawGo on August 21, 2022, 10:23:55 AM

I do not understand what do you mean by "directory" - is it a file with addresses? Or directory on hdd where each file has name like an address? Then how you may have the same address twice?

It's a directory with files. Each file has all Bitcoin addresses that were used that day, some of them more than once.

To my surprise, my grep script actually completed! It used up all RAM and added some swap, and after 24 hours of high load, it's done :D

Quote

I have prepared a small program for you:
https://github.com/PawelGorny/Loyce60787783
It reads into memory list of addresses and then reads "directory" file with addresses - if address exists in memory, is marked, if the same address in hit for the second time, is removed from memory and saved to file.

Thanks for this! I tried to test it, but I don't really want to install java on the server just for this. I am curious how this would perform though.

Title: Re: LoyceV's small Linux commands for handling big data
Post by: PawGo on August 21, 2022, 11:54:37 AM

Quote from: LoyceV on August 21, 2022, 11:46:18 AM

Quote from: PawGo on August 21, 2022, 10:23:55 AM

I do not understand what do you mean by "directory" - is it a file with addresses? Or directory on hdd where each file has name like an address? Then how you may have the same address twice?

It's a directory with files. Each file has all Bitcoin addresses that were used that day, some of them more than once.

Ok, I understand now. That way I may change program to process all the files from the given directory, not only the one file (daily snapshot). The question would be if you look for double hits per day/per file, or totally in any of files.

If you change your mind, give it a try, maybe it will use less resources. I do not know how much memory will take 8.7mln addresses.

Title: Re: LoyceV's small Linux commands for handling big data
Post by: LoyceV on August 21, 2022, 12:11:39 PM

Quote from: PawGo on August 21, 2022, 11:54:37 AM

The question would be if you look for double hits per day/per file, or totally in any of files.

The total :) For now, I got it covered.
I can try something else too: if I use the list of addresses that are used more than once, it's only 3 GB (instead of 67), and I can search against that list. That was slower in my initial test, but that didn't cause memory problems, and I have to do it less often so it may pay off.

I'm looking for all addresses funded with 1, 2, 4, ...., 8196 mBTC in one transaction, that don't have more transactions than 1 funding and (possibly) one spending. I want to count how many of those chips exist on each day. It could be a good measure for privacy.

Title: Re: LoyceV's small Linux commands for handling big data
Post by: citb0in on August 30, 2022, 03:23:16 PM

Hello all and thanks to LoyceV providing this great ressource of information. For a certain query I'd like to have a file containing all addresses which

either

are funded (=positive balance)

or

had an output in the past (=sent some coins to someone)

Is it possible somehow to generate such a big file with this data which I could use for a query? Alternatively, I don't mind having two separate files: one that already exists <blockchair_bitcoin_addresses_and_balance_LATEST.tsv> and one additional which contains all addresses with outputs. I could run my query agains both of those, that would certainly do the job.

I'm grateful for any helpful information. Thank you so much!

Title: Re: LoyceV's small Linux commands for handling big data
Post by: LoyceV on August 30, 2022, 04:29:20 PM

Quote from: citb0in on August 30, 2022, 03:23:16 PM

are funded (=positive balance)

See List of all Bitcoin addresses with a balance (https://bitcointalk.org/index.php?topic=5254914.0).

Quote

had an output in the past (=sent some coins to someone)

Interesting, I don't have such a list, but it should be quite easy to get from outputs (http://blockdata.loyce.club/outputs/blockchair_bitcoin_outputs_20111111.tsv.gz).

Quote

Is it possible somehow to generate such a big file with this data which I could use for a query?

Are you sure you're not looking for "and" instead of "or", so all addresses that sent funds before, and still hold a balance?

Quote

I could run my query agains both of those, that would certainly do the job.

Adding them together and removing duplicates is easy.

To be clear:

Quote from: LoyceV on February 18, 2021, 10:15:43 AM

outputs (http://blockdata.loyce.club/outputs/blockchair_bitcoin_outputs_20111111.tsv.gz):

Code:

block_id	transaction_hash	index	time	value	value_usd	recipient	type	script_hex	is_from_coinbase	is_spendable
152764	498ec7f88857d23bd19370252da439876960dad296a64534987ad54081a9cc39	0	2011-11-11 00:05:55	5001400010	143.5402	17aA19GvhzMHsq8xPwSXAPZutyr6kuzLEB	pubkey	4104a8dd3f118a6122c2f8c6be3261670f7af76568cac8ff0ed95e5ef63238018e69b223dbdef9f9d5e94a053ff1afc390e230844c0b71f3648405807cd668979958ac	1	1
152764	5cc203b9389f0f7eb50669eba04ac32d666892fc95f07c25603da8a6ed9316ae	0	2011-11-11 00:05:55	1199480	0.0344	1KPxwAbFVoDimPrVECF2zgiyfX9jGW9TCy	pubkeyhash	76a914c9ca1b452087cdc6b89754c1090928d7a67ef23988ac	0	-1

Would 17aA19GvhzMHsq8xPwSXAPZutyr6kuzLEB and 1KPxwAbFVoDimPrVECF2zgiyfX9jGW9TCy be what you're looking for?

Title: Re: LoyceV's small Linux commands for handling big data
Post by: citb0in on August 30, 2022, 04:43:39 PM

Not really. I am interested in addresses like that:

address, balance, outputs

1aDdressExampLeFundedxXxx, 123456, 789
bc1qnotfundedbutspent0utput, 0, 3

Addresses with balance=0 AND outputs=0 should not be listed. Only those matching this condition

if balance>0 OR (balance=0 AND outputs>0)

Title: Re: LoyceV's small Linux commands for handling big data
Post by: LoyceV on August 30, 2022, 05:40:10 PM

Quote from: citb0in on August 30, 2022, 04:43:39 PM

Addresses with balance=0 AND outputs=0 should not be listed.

No balance and no outputs, that means the address is unused. Those aren't in any of the data dumps.

Quote

Only those matching this condition
if balance>0

That list I have :)

Quote

OR (balance=0 AND outputs>0)

I'm confused: why would the 2 addresses I gave above not qualify for this?

Title: Re: LoyceV's small Linux commands for handling big data
Post by: seoincorporation on August 30, 2022, 05:50:49 PM

Hello LoyceV, i have been working in the Address to HASH160 conversion and i made some scripts that i would like to add to your Linux Commands.

Script to get all the HASH160 from the addyBalance.tsv file from address starting with 1.

Code:

for a in $(cat addyBalance.tsv | cut -f1 | sed '/^b/d' | sed '/^1/d')
do
python3 -c "import binascii, hashlib, base58; hash160 = binascii.hexlify(base58.b58decode_check(b'$a')).decode()[2:]; print(hash160)";
done

Script to get all the HASH160 from the addyBalance.tsv file from address starting with bc1.

Code:

for a in $(cat addyBalance.tsv | cut -f1 | sed '/.\{70\}/d' | sed '/^3/d' | sed '/^1/d')
do
python3 -c "import bech32; hash1 = bech32.decode(\"bc\", \"$a\"); hash2 = bytes(hash1[1]); print(hash2.hex())";
done

Script to get all the HASH160 from the addyBalance.tsv file from address starting with 1.

Run:

You can print the HASH with:

Code:

sh addy.sh

Or oyu can send it to a file:

Code:

sh addy.sh > a.txt

The script prints an error because the first word in the file is 'Addres', but it works fine:

Code:

$ sh addy.sh
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/windows/.local/lib/python3.10/site-packages/base58/__init__.py", line 157, in b58decode_check
    raise ValueError("Invalid checksum")
ValueError: Invalid checksum
23e522dfc6656a8fda3d47b4fa53f7585ac758cd
cec49f4d16b05fe163e41b15856732d985d36277
d401a46b6d847399a45878b1f25f499aad959830
4017a4219d12f11e4649d2ae9eef5bd4c9bf0d80
c8ca150ee82589d47f69b8dcd7cad684d88283f1
288b23e4a5886136a544159550a8e99f2e5672ab
cd49d5f5215aaaa0fdbf1bd2e454250edf8a54e2
cafacdc40cf8d3daa60aa479774ccd9085b4c512
b91e28f4f8f6fced313112c0c72407d85ecec39a
4a782fe173a0b6718d39667b420d9c8b07e94262
9518af9ff9c31a47f94a36b95dce267e5edcd82d

And I made a small script for single address too:

sh bc.sh bc1qBitcoinAddress

Code:

python3 -c "import bech32; hash1 = bech32.decode(\"bc\", \"$1\"); hash2 = bytes(hash1[1]); print(hash2.hex())"

sh 1.sh 1BitcoinAddress

Code:

python3 -c "import binascii, hashlib, base58; hash160 = binascii.hexlify(base58.b58decode_check(b'$1')).decode()[2:]; print(hash160)"

You will need the Python dependencies to run this script.

Code:

pip install base58 bech32 binascii hashlib

Title: Re: LoyceV's small Linux commands for handling big data
Post by: citb0in on August 31, 2022, 06:19:59 AM

@seoincorporation: thanks for the scripts you provided, but: this should be very time-consuming and slow. Imagine you would run your script against the file of LoyceV which contains all funded addresses (1.8 GB file size). It would take weeks (?) until it gets finished ? what do you think, any ways for optimization ?

Title: Re: LoyceV's small Linux commands for handling big data
Post by: seoincorporation on September 01, 2022, 03:55:28 AM

Quote from: citb0in on August 31, 2022, 06:19:59 AM

Code:

$ cat addyBalance.tsv | cut -f1 | sed '/.\{70\}/d' | sed '/^3/d' | sed '/^1/d' |wc -l
11407170

I know the data is big 1.1 Million addys starting with 1, but i don't think it would take weeks.

I replace cat with head -n 10000, and with the time command i get:

Code:

real	0m37.877s
user	0m31.956s
sys	0m5.951s

So, 10,000 on 40 seconds, that's 4,000 seconds for 1 million, that's 66 minutes, or a little more than 1 hour.

I think it should be faster if you do it all with Python and not calling python from Bash as i did.

Bitcoin Forum

Bitcoin => Project Development => Topic started by: LoyceV on April 19, 2022, 10:44:38 AM