DeathAndTaxes (OP)
Donator
Legendary
Offline
Activity: 1218
Merit: 1079
Gerald Davis
|
|
June 10, 2014, 04:51:41 PM Last edit: June 21, 2014, 03:09:20 AM by DeathAndTaxes |
|
Is there a relatively easy way to extract the UTXO from Bitcoin Core? { "height" : 305134, "bestblock" : "00000000000000003f0bb243b026fa904946f262eebf0684614ebb2622e4e554", "transactions" : 3304525, "txouts" : 11337577, "bytes_serialized" : 393434666, "hash_serialized" : "4132c6a9bdb7f2492348ba9cf8ca1c1fb954f619805de0c3cd01847f969ece65", "total_amount" : 12878214.79102854 } I would like the set of these 11,337,577 outputs so they can be parsed and dumped into a database for analysis. Any links to information or relevant code on how Bitcoin-Core stores the UTXO? I understand I could use a blockchain parser to build the UTXO and that is probably happening next (as I want to see how it has changed over time) but a dump would be a good start. Even just a dump of the index (tx_out_hash & tx_out_index) would be great.
|
|
|
|
piotr_n
Legendary
Offline
Activity: 2055
Merit: 1359
aka tonikt
|
|
June 10, 2014, 04:59:42 PM |
|
I am not sure about this specific implementation, but normally UTXO database is just a database: keys => records
Keys are usually txid+vout, while records contain at least value+pk_script, though usually also things like block_height, etc.
In case of Bitcon Core, it should be the same, so I'd say: use any LevelDB engine to load the DB and browse through it - that should be exactly what you need. The format of the records should not be hard to figure out.
|
Check out gocoin - my original project of full bitcoin node & cold wallet written in Go.PGP fingerprint: AB9E A551 E262 A87A 13BB 9059 1BE7 B545 CDF3 FD0E
|
|
|
DeathAndTaxes (OP)
Donator
Legendary
Offline
Activity: 1218
Merit: 1079
Gerald Davis
|
|
June 10, 2014, 05:09:28 PM |
|
On a related note what is the "bytes serialized" representing. 393434666 / 11337577 = ~35 bytes per output. The tx hash (32 bytes) and output index (1-2 bytes) would be 33 to 34 bytes right there. Is it just referring to the index? Or is it just referring the raw output data (value & script) not the lookup values? It can't be the full set as that would be closer to
To validate an the input of an arbitrary tx isn't the following needed? tx_out_hash (32 bytes) tx_out_index (4 bytes) value (8 bytes) script_len (variant - usually 1 byte) script (variable - 25 bytes for P2PKH outputs which make up 90%+ of UXTO) That is 70 bytes minimum, so for 11,337,577 outputs the UXTO would be a minimum of 794MB right? What am I missing?
|
|
|
|
DeathAndTaxes (OP)
Donator
Legendary
Offline
Activity: 1218
Merit: 1079
Gerald Davis
|
|
June 10, 2014, 05:11:34 PM |
|
I am not sure about this specific implementation, but normally UTXO database is just a database: keys => records
Keys are usually txid+vout, while records contain at least value+pk_script, though usually also things like block_height, etc.
In case of Bitcon Core, it should be the same, so I'd say: use any LevelDB engine to load the DB and browse through it - that should be exactly what you need. The format of the records should not be hard to figure out.
My post above was posted while you were typing but that makes a lot of sense. It is very likely the UXTO reported is just the key (hash & index) that would coincide with the reported size. So the question becomes which file is the UXTO index stored in? On edit: chainstate subdirectory [v0.8 and above] A LevelDB database with a compact representation of all currently unspent transaction outputs and some metadata about the transactions they are from. The data here is necessary for validating new incoming blocks and transactions. It can theoretically be rebuilt from the block data (see the -reindex command line option), but this takes a rather long time. Without it, you could still theoretically do validation indeed, but it would mean a full scan through the blocks (7 GB as of may 2013) for every output being spent.
Ok so here is where to start. Lets see if I can decode it. Thanks. Would be nice if the format structure was reported somewhere other than just in the code.
|
|
|
|
piotr_n
Legendary
Offline
Activity: 2055
Merit: 1359
aka tonikt
|
|
June 10, 2014, 05:15:59 PM |
|
I would not count the record sizes like this. LevelDB uses compression that may be especially efficient on the keys part.
There is definitely the 32-bytes of TxID per record. And the output index must be 4 bytes - that's in the block chain protocol, so no sane person would dare to make it shorter in the DB.
I guess someone will soon tell you what is the actual format of the index and the records, but studying it with some database browser I'm sure you can easily figure it out yourself.
If you need some specific stats, I have UTXO db in my own format/engine - gladly to extract something for you.
|
Check out gocoin - my original project of full bitcoin node & cold wallet written in Go.PGP fingerprint: AB9E A551 E262 A87A 13BB 9059 1BE7 B545 CDF3 FD0E
|
|
|
kjj
Legendary
Offline
Activity: 1302
Merit: 1026
|
|
June 10, 2014, 05:51:07 PM |
|
On a related note what is the "bytes serialized" representing. 393434666 / 11337577 = ~35 bytes per output. The tx hash (32 bytes) and output index (1-2 bytes) would be 33 to 34 bytes right there. Is it just referring to the index? Or is it just referring the raw output data (value & script) not the lookup values? It can't be the full set as that would be closer to
To validate an the input of an arbitrary tx isn't the following needed? tx_out_hash (32 bytes) tx_out_index (4 bytes) value (8 bytes) script_len (variant - usually 1 byte) script (variable - 25 bytes for P2PKH outputs which make up 90%+ of UXTO) That is 70 bytes minimum, so for 11,337,577 outputs the UXTO would be a minimum of 794MB right? What am I missing?
Look in txdb.cpp, CCoinsViewDB::GetStats. stats.nSerializedSize += 32 + slValue.size();
I didn't follow the objects around, but at first glance it looks like the txid and sequence number (in varint form). That works out pretty well with your 35 bytes per output calculation.
|
17Np17BSrpnHCZ2pgtiMNnhjnsWJ2TMqq8 I routinely ignore posters with paid advertising in their sigs. You should too.
|
|
|
DeathAndTaxes (OP)
Donator
Legendary
Offline
Activity: 1218
Merit: 1079
Gerald Davis
|
|
June 10, 2014, 11:49:53 PM Last edit: June 11, 2014, 12:09:31 AM by DeathAndTaxes |
|
So a lot of smashing into walls and I figured some of it out. The key is txid + 0x63 ("c") to indicate coins? The value represents the multiple unspent outputs for that txid From the source code (coins.h) /** pruned version of CTransaction: only retains metadata and unspent transaction outputs * * Serialized format: * - VARINT(nVersion) * - VARINT(nCode) * - unspentness bitvector, for vout[2] and further; least significant byte first * - the non-spent CTxOuts (via CTxOutCompressor) * - VARINT(nHeight) * * The nCode value consists of: * - bit 1: IsCoinBase() * - bit 2: vout[0] is not spent * - bit 4: vout[1] is not spent * - The higher bits encode N, the number of non-zero bytes in the following bitvector. * - In case both bit 2 and bit 4 are unset, they encode N-1, as there must be at * least one non-spent output). * * Example: 0104835800816115944e077fe7c803cfa57f29b36bf87c1d358bb85e * <><><--------------------------------------------><----> * | \ | / * version code vout[1] height * * - version = 1 * - code = 4 (vout[1] is not spent, and 0 non-zero bytes of bitvector follow) * - unspentness bitvector: as 0 non-zero bytes follow, it has length 0 * - vout[1]: 835800816115944e077fe7c803cfa57f29b36bf87c1d35 * * 8358: compact amount representation for 60000000000 (600 BTC) * * 00: special txout type pay-to-pubkey-hash * * 816115944e077fe7c803cfa57f29b36bf87c1d35: address uint160 * - height = 203998 * * * Example: 0109044086ef97d5790061b01caab50f1b8e9c50a5057eb43c2d9563a4eebbd123008c988f1a4a4de2161e0f50aac7f17e7f9555caa486af3b * <><><--><--------------------------------------------------><----------------------------------------------><----> * / \ \ | | / * version code unspentness vout[4] vout[16] height * * - version = 1 * - code = 9 (coinbase, neither vout[0] or vout[1] are unspent, * 2 (1, +1 because both bit 2 and bit 4 are unset) non-zero bitvector bytes follow) * - unspentness bitvector: bits 2 (0x04) and 14 (0x4000) are set, so vout[2+2] and vout[14+2] are unspent * - vout[4]: 86ef97d5790061b01caab50f1b8e9c50a5057eb43c2d9563a4ee * * 86ef97d579: compact amount representation for 234925952 (2.35 BTC) * * 00: special txout type pay-to-pubkey-hash * * 61b01caab50f1b8e9c50a5057eb43c2d9563a4ee: address uint160 * - vout[16]: bbd123008c988f1a4a4de2161e0f50aac7f17e7f9555caa4 * * bbd123: compact amount representation for 110397 (0.001 BTC) * * 00: special txout type pay-to-pubkey-hash * * 8c988f1a4a4de2161e0f50aac7f17e7f9555caa4: address uint160 * - height = 120891 */ Using this I have been able to locate some tx and decode some portions of the vouts. The height and value are still a mystery though due to this "compact format". Any ideas? 0x86af3b = 120,891 0x8bb85e = 203,998 0x86ef97d579 = 234,925,952 0x8358 = 60,000,000,000
|
|
|
|
gmaxwell
Moderator
Legendary
Offline
Activity: 4270
Merit: 8805
|
|
June 11, 2014, 12:37:58 AM |
|
Using this I have been able to locate some tx and decode some portions of the vouts. The height and value are still a mystery though due to this "compact format". Any ideas? 0x86af3b = 120,891 0x8bb85e = 203,998 0x86ef97d579 = 234,925,952 0x8358 = 60,000,000,000
Oh come on, just follow the calls.. not so hard to find the ctxoutcompressor and the call to decompress amount: // Amount compression: // * If the amount is 0, output 0 // * first, divide the amount (in base units) by the largest power of 10 possible; call the exponent e (e is max 9) // * if e<9, the last digit of the resulting number cannot be 0; store it as d, and drop it (divide by 10) // * call the result n // * output 1 + 10*(9*n + d - 1) + e // * if e==9, we only know the resulting number is not zero, so output 1 + 10*(n - 1) + 9 // (this is decodable, as d is in [1-9] and e is in [0-9])
uint64_t CTxOutCompressor::CompressAmount(uint64_t n) { if (n == 0) return 0; int e = 0; while (((n % 10) == 0) && e < 9) { n /= 10; e++; } if (e < 9) { int d = (n % 10); assert(d >= 1 && d <= 9); n /= 10; return 1 + (n*9 + d - 1)*10 + e; } else { return 1 + (n - 1)*10 + 9; } }
uint64_t CTxOutCompressor::DecompressAmount(uint64_t x) { // x = 0 OR x = 1+10*(9*n + d - 1) + e OR x = 1+10*(n - 1) + 9 if (x == 0) return 0; x--; // x = 10*(9*n + d - 1) + e int e = x % 10; x /= 10; uint64_t n = 0; if (e < 9) { // x = 9*n + d - 1 int d = (x % 9) + 1; x /= 9; // x = n n = x*10 + d; } else { n = x+1; } while (e) { n *= 10; e--; } return n; }
Because there are millions of txouts the format works fairly hard to save every byte (the leveldb compression does nothing useful for txouts at all).
|
|
|
|
DeathAndTaxes (OP)
Donator
Legendary
Offline
Activity: 1218
Merit: 1079
Gerald Davis
|
|
June 11, 2014, 12:45:15 AM |
|
Thanks I located that after posting but I am still not decoding correctly so either it is an endian issue or I am missing something else.
i.e. from the example decoding 0x86af3b to 120,891 or encoding 120,891 to 0x86af3b.
|
|
|
|
DeathAndTaxes (OP)
Donator
Legendary
Offline
Activity: 1218
Merit: 1079
Gerald Davis
|
|
June 11, 2014, 01:54:41 AM |
|
Another part of the mystery solved. block heights are recorded as MSB base-128 variable length integers (not the same as the varint in bitcoin protocol). // Variable-length integers: bytes are a MSB base-128 encoding of the number. // The high bit in each byte signifies whether another digit follows. To make // the encoding is one-to-one, one is subtracted from all but the last digit. // Thus, the byte sequence a[] with length len, where all but the last byte // has bit 128 set, encodes the number: // // (a[len-1] & 0x7F) + sum(i=1..len-1, 128^i*((a[len-i-1] & 0x7F)+1)) // // Properties: // * Very small (0-127: 1 byte, 128-16511: 2 bytes, 16512-2113663: 3 bytes) // * Every integer has exactly one encoding // * Encoding does not depend on size of original integer type // * No redundancy: every (infinite) byte sequence corresponds to a list // of encoded integers. // // 0: [0x00] 256: [0x81 0x00] // 1: [0x01] 16383: [0xFE 0x7F] // 127: [0x7F] 16384: [0xFF 0x00] // 128: [0x80 0x00] 16511: [0x80 0xFF 0x7F] // 255: [0x80 0x7F] 65535: [0x82 0xFD 0x7F] // 2^32: [0x8E 0xFE 0xFE 0xFF 0x00]
0x86af3b [134] [175] [59] 7*128^2+48*128^1+59*128^0 = 120,891 0x8bb85e [139] [184] [94] 12*128^2 + 57*128^1 + 94*128^0 = 203,998
|
|
|
|
DeathAndTaxes (OP)
Donator
Legendary
Offline
Activity: 1218
Merit: 1079
Gerald Davis
|
|
June 11, 2014, 02:17:15 AM |
|
Got it finally.
For btc values you must first decode it from MSB base-128 format and then decode it using the Decompressor.
value: 0x86ef97d579 = 234,925,952
86 ef 95 d5 79 [134][239][151][213][79]
134*128^4 + 239*128^3 + 151*128^2 + 213*128^1 + 79*128^0 = 2,114,333,561
Passing 2,114,333,561 through the decompressor function produces 234,925,952.
|
|
|
|
piotr_n
Legendary
Offline
Activity: 2055
Merit: 1359
aka tonikt
|
|
June 11, 2014, 01:19:11 PM |
|
That would actually be something worth documenting. Preferably on the wiki.
|
Check out gocoin - my original project of full bitcoin node & cold wallet written in Go.PGP fingerprint: AB9E A551 E262 A87A 13BB 9059 1BE7 B545 CDF3 FD0E
|
|
|
DeathAndTaxes (OP)
Donator
Legendary
Offline
Activity: 1218
Merit: 1079
Gerald Davis
|
|
June 12, 2014, 01:53:24 AM |
|
Ok one last thing.
When iterating through all keypairs in chainstate there is one obviously non-tx pair. Any ideas?
Key: 0x42 Value: 0x61E663DEB7E063C6A8293CB3244BD6A1ECA9DB867F58161E0000000000000000
|
|
|
|
dexX7
Legendary
Offline
Activity: 1106
Merit: 1026
|
|
June 12, 2014, 03:50:23 AM |
|
Ok one last thing.
When iterating through all keypairs in chainstate there is one obviously non-tx pair. Any ideas?
Key: 0x42 Value: 0x61E663DEB7E063C6A8293CB3244BD6A1ECA9DB867F58161E0000000000000000
0x42 is the character 'B' and what's defined there is the last (newest?) block hash. In a more familiar form: 'B': 00000000000000001e16587f86dba9eca1d64b24b33c29a8c663e0b7de63e661
|
|
|
|
DeathAndTaxes (OP)
Donator
Legendary
Offline
Activity: 1218
Merit: 1079
Gerald Davis
|
|
June 12, 2014, 09:53:56 PM |
|
Ok one last thing.
When iterating through all keypairs in chainstate there is one obviously non-tx pair. Any ideas?
Key: 0x42 Value: 0x61E663DEB7E063C6A8293CB3244BD6A1ECA9DB867F58161E0000000000000000
0x42 is the character 'B' and what's defined there is the last (newest?) block hash. In a more familiar form: 'B': 00000000000000001e16587f86dba9eca1d64b24b33c29a8c663e0b7de63e661Thanks. That makes sense. It is the blockhash the chainstate currently represents.
|
|
|
|
DeathAndTaxes (OP)
Donator
Legendary
Offline
Activity: 1218
Merit: 1079
Gerald Davis
|
|
June 12, 2014, 10:50:27 PM |
|
Ok one more decoding mystery Example tx: http://webbtc.com/tx/14015bd586c0c7a28979ca294b114441f23bfc97be17cd6077b9e12e2709fec3E:\Repos\UxtoParser\UxtoParser.Demo\bin\Debug>UxtoParser.Demo.exe tx 14015bd586c0c7a28979ca294b114441f23bfc97be17cd6077b9e12e2709fec3 Key: 63C3FE09272EE1B97760CD17BE97FC3BF24144114B29CA7989A2C7C086D55B0114
Value: 010C013600946CB2E08075BCBAF157E47BCB67EB2B2339D24268800F514104E6DA9C60084B43D28266243C636BCDAF4D8F17B5954E078D2DECE7D4659E0DEE34 19A40B939C24AC813C692A323CA5207A6FB387FFE28E48F706C95DBF46648F210226CB0561011D9045F6371CB09086BA7148D9942328BCF1DD78CB6EDB35CCDD A921022EAC137AB02D826DF0AF54E92A352945C9892DF6CD77F1A7C390FC82C8B0EDEA53AE8F9D01
So I start decoding like this. 010C013600946CB2E08075BCBAF157E47BCB67EB2B2339D24268800F514104E6DA9C60084B43D28266243C636BCDAF4D8F17B5954E078D2DECE7D4659E0DEE341 9A40B939C24AC813C692A323CA5207A6FB387FFE28E48F706C95DBF46648F210226CB0561011D90 45F6371CB09086BA7148D9942328BCF1DD78CB6EDB35CCDDA921022EAC137AB02D826DF0AF54E92 A352945C9892DF6CD77F1A7C390FC82C8B0EDEA53AE8F9D01 Version: 0x01 = Ver 1 Code: 0x0C01 = First byte (bit 1 = 0 = not coinbase, bit 2 = 0 = vout[0] is already spent, bit 3= 1 = vout[1] is unspent, bit 4-8 = 0x0001 = one byte follows. Second byte = 0x000000001. This is a bitmap of the rest of the unspent outputs (from vout[2]+). Bit 1 is set. 1+1 = 2 so vout[2] is unspent. Summary: 2 unspent outputs (vout[1] & vout[2], neither are a coinbase output). Vout[1] |36|00|946CB2E08075BCBAF157E47BCB67EB2B2339D242| Value: MSB-128: 0x36 -> VarInt: 54 -> 6000 Code: 0x00 = P2PH PkHash: 946cb2e08075bcbaf157e47bcb67eb2b2339d242 Vout[2] |68|800F|514104E6DA9C60084B43D28266243C636BCDAF4D8F17B5954E078D2DECE7D4659E0DEE3419A40B939C24AC813C692A323CA5207A6FB387FFE28E48F706C95DBF46648F210226CB0561011D9045F6371CB09086BA7148D9942328BCF1DD78CB6EDB35CCDDA921022EAC137AB02D826DF0AF54E92A352945C9892DF6CD77F1A7C390FC82C8B0EDEA53AE| Value: MSB-128: 0x68 -> VarInt: 104 -> 12000 Code: 0x800f. First byte is not 0x01 to 0x05 so the code represents the length of an arbitrary script. How to decode length? Script: 514104e6da9c60084b43d28266243c636bcdaf4d8f17b5954e078d2dece7d4659e0dee3419a40b9 39c24ac813c692a323ca5207a6fb387ffe28e48f706c95dbf46648f2102 26cb0561011d9045f6371cb09086ba7148d9942328bcf1dd78cb6edb35ccdda921022eac137ab02 d826df0af54e92a352945c9892df6cd77f1a7c390fc82c8b0edea53ae Block: MSB-128: 0x8f9d01 -> 265985 Of I am able to decode the entire UXTO record except the "code" element on the second vout[0]. Everything else matches the data from the blockchain. For standard tx types the code will be one byte and in the range 0x01-0x05 otherwise it somehow? represents the length of the script and can be one or two bytes. The source code isn't particularly descriptive (script.h). * Other scripts up to 121 bytes require 1 byte + script length. Above * that, scripts up to 16505 bytes require 2 bytes + script length.
Decoding 0x800f using MSB-128 results in 143 but the actual script is 137 bytes. Any ideas?
|
|
|
|
DeathAndTaxes (OP)
Donator
Legendary
Offline
Activity: 1218
Merit: 1079
Gerald Davis
|
|
June 12, 2014, 11:22:45 PM |
|
Answer to my own question. In script.h static const unsigned int nSpecialScripts = 6;
...
void Serialize(Stream &s, int nType, int nVersion) const {
if (Compress(compr)) { s << CFlatData(&compr[0], &compr[compr.size()]); return; } unsigned int nSize = script.size() + nSpecialScripts; ... } This is then encoded in MSB-128 (varint) format and prepended to the script when serialized for the UXTO (chainstate). The thing is this seems to be a bug (of course one that would be very hard to fix now) as there aren't 6 special transactions there are 5. Only Pay2PubKeyHash (0x00), Pay2ScriptHash(0x01), and Pay2PubKey (0x02, 0x3, 0x04) are compressed. Also since it is zero based unless a script with length of zero is valid one would only need to offset the actual script length by 4. Right? Anyways not exactly intuitive (or I am just tired).
|
|
|
|
gmaxwell
Moderator
Legendary
Offline
Activity: 4270
Merit: 8805
|
|
June 13, 2014, 12:23:44 AM |
|
That would actually be something worth documenting. Preferably on the wiki.
This is really a deep software internal and may change from version to version. You really should not be writing external software that does anything with it— and if you do, well— you get to keep the pieces. It's documented in the source (serialize.h) I'd worry that putting it on the wiki would make it sound like there was some commitment at all to preserve it.
|
|
|
|
DeathAndTaxes (OP)
Donator
Legendary
Offline
Activity: 1218
Merit: 1079
Gerald Davis
|
|
June 13, 2014, 12:41:45 AM |
|
That would actually be something worth documenting. Preferably on the wiki.
This is really a deep software internal and may change from version to version. You really should not be writing external software that does anything with it— and if you do, well— you get to keep the pieces. It's documented in the source (serialize.h) I'd worry that putting it on the wiki would make it sound like there was some commitment at all to preserve it. Makes sense. I agree wiki is probably the wrong place. Maybe expanding the inline documentation in the source code. I don't really think this kind of code is useful to most applications but there is a lot of interesting stats which can be pulled from the UXTO. Distribution of output age, distribution of outputs by script type, distribution of output values. One could build the UXTO from the raw blocks but since the chainstate has that data (albeit in an somewhat opaque format) it seems a smarter place to pull that out.
|
|
|
|
gmaxwell
Moderator
Legendary
Offline
Activity: 4270
Merit: 8805
|
|
June 13, 2014, 08:46:41 AM |
|
Makes sense. I agree wiki is probably the wrong place. Maybe expanding the inline documentation in the source code. I don't really think this kind of code is useful to most applications but there is a lot of interesting stats which can be pulled from the UXTO. Distribution of output age, distribution of outputs by script type, distribution of output values. One could build the UXTO from the raw blocks but since the chainstate has that data (albeit in an somewhat opaque format) it seems a smarter place to pull that out.
FWIW, the gettxoutsetinfo rpc iterates over the utxo set, in the past when I've wanted this data I've just added some instrumentation in that function to dump it out. Even if you're not very experienced with C++ it shouldn't be to hard to emulate the rest of the code and print it out... this might be easier (and also more reliable across versions) than trying to read the data.
|
|
|
|
|