Bitcoin Forum

Bitcoin => Development & Technical Discussion => Topic started by: piotr_n on August 16, 2016, 10:24:56 AM



Title: Questions about hash_serialized, returned by gettxoutsetinfo
Post by: piotr_n on August 16, 2016, 10:24:56 AM
Recently I read somewhere that the way hash_serialized is calculated has changed, so I just want to ask about the latest version (not yet released):

How exactly is it calculated?
Specifically, can someone please confirm that I need to have the UTXO records sorted in order to calculate it.

How are they sorted?

Also, are there any plans to make this hash more open and verifiable, for the purpose of setting up fresh nodes from snapshots of the UTXO database?
I'd imagine setting up a node with a snapshot of the coinstate db, without the need to build it up from the blockchain.
That'd save shitloads of time, bandwidth and disk space.


Title: Re: Questions about hash_serialized, returned by gettxoutsetinfo
Post by: achow101 on August 16, 2016, 12:54:45 PM
I'm not entirely sure how it is calculated, but the functions for getting all of that data are in the code at https://github.com/bitcoin/bitcoin/blob/37d83bb0a980996338d9bc9dbdbf0175eeaba9a2/src/rpc/blockchain.cpp#L633


Title: Re: Questions about hash_serialized, returned by gettxoutsetinfo
Post by: piotr_n on August 16, 2016, 01:44:20 PM
I'm not entirely sure how it is calculated, but the functions for getting all of that data are in the code at https://github.com/bitcoin/bitcoin/blob/37d83bb0a980996338d9bc9dbdbf0175eeaba9a2/src/rpc/blockchain.cpp#L633
thanks. I actually saw that after I sent the post - sorry :)

it seems that since 0.12.1 the "ss << key;" line was added.

obviously, looking into this code, it's just a hash of the serialized data, and the records must come in sorted order.
I'm not sure how they are sorted, but I'd guess the order comes from leveldb, so probably sorted by txid - either little- or big-endian..
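
for my own understanding, the traversal seems to boil down to something like this - just a toy, self-contained sketch of my reading of the GetUTXOStats() code in the linked blockchain.cpp. The real code uses CHashWriter (double-SHA256) over Bitcoin-serialized records; here std::map stands in for leveldb's sorted cursor and FNV-1a stands in for the real hash:

Code:
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Stand-in for Bitcoin Core's CHashWriter (which uses double-SHA256);
// FNV-1a is used here only to keep the sketch self-contained.
struct ToyHasher {
    uint64_t h = 1469598103934665603ULL;   // FNV-1a offset basis
    void feed(const void* data, size_t len) {
        const uint8_t* p = static_cast<const uint8_t*>(data);
        for (size_t i = 0; i < len; ++i) { h ^= p[i]; h *= 1099511628211ULL; }
    }
};

int main() {
    // key = txid, value = (toy) serialized unspent outputs of that tx
    std::map<std::string, std::vector<uint8_t>> utxo = {
        {"txid_aaa", {0x01, 0x02}},
        {"txid_bbb", {0x03}},
    };

    ToyHasher ss;
    // std::map iterates in key order, mimicking leveldb's sorted cursor;
    // each record is fed in as key then value ("ss << key; ss << outputs;")
    for (const auto& rec : utxo) {
        ss.feed(rec.first.data(), rec.first.size());
        ss.feed(rec.second.data(), rec.second.size());
    }
    std::printf("toy hash_serialized: %016llx\n", (unsigned long long)ss.h);
    return 0;
}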

shame it uses a sorted order, as that makes it very hard to calculate this value with a different UTXO db engine (one which does not keep its records sorted).
I'd rather calculate a separate hash of each "key+record(s)" and xor all the hashes together to get the final value.
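
something along these lines (again a toy sketch with the same FNV-1a stand-in for a real hash; the point is only that xor makes the fold order-independent, so no sorted cursor is needed):

Code:
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Same FNV-1a stand-in for a real hash as in the sketch above.
static uint64_t toy_hash(const std::string& key, const std::vector<uint8_t>& val) {
    uint64_t h = 1469598103934665603ULL;
    auto mix = [&h](uint8_t b) { h ^= b; h *= 1099511628211ULL; };
    for (unsigned char c : key) mix(c);
    for (uint8_t b : val) mix(b);
    return h;
}

int main() {
    // records deliberately NOT sorted - any db iteration order works
    std::vector<std::pair<std::string, std::vector<uint8_t>>> utxo = {
        {"txid_bbb", {0x03}},
        {"txid_aaa", {0x01, 0x02}},
    };

    uint64_t checksum = 0;
    for (const auto& rec : utxo)
        checksum ^= toy_hash(rec.first, rec.second);  // xor is commutative

    std::printf("toy xor checksum: %016llx\n", (unsigned long long)checksum);
    return 0;
}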


Title: Re: Questions about hash_serialized, returned by gettxoutsetinfo
Post by: piotr_n on August 16, 2016, 02:11:19 PM
I imagine the idea behind hashing it like this was to be able to verify a disk snapshot of the database, using a single command line and some common existing tools.

which keeps me wondering where the code is that stores the snapshot on disk, or imports it into the client..


Title: Re: Questions about hash_serialized, returned by gettxoutsetinfo
Post by: gmaxwell on August 17, 2016, 09:15:46 AM
Its only purpose is Bitcoin Core-specific software testing: it's effectively free to compute while returning those other statistics, and it allows rapid isolation of suspected database corruption or inconsistency.

The structure of the hashing is not well suited to other applications.

and xor all the hashes together to get the final value.
Congrats, you win today's failed cryptography trophy. :) That kind of structure is trivially vulnerable to second-preimage attacks using Wagner's algorithm for solutions to the subset sum problem.  Order-independent accumulators are a tricky subject; the only ones that I'm aware of that have any real argument for security have huge hashes and are very slow.

The data hashed here is also _highly_ implementation-specific and subject to change in any new version without warning.


Title: Re: Questions about hash_serialized, returned by gettxoutsetinfo
Post by: piotr_n on August 17, 2016, 09:27:17 AM
Congrats, you win today's failed cryptography trophy. :) That kind of structure is trivially vulnerable to second-preimage attacks using Wagner's algorithm for solutions to the subset sum problem.  

Sorry then.
I've been learning cryptography by reverse engineering your code ;)

Is it also so "trivial" when the final checksum is not 256 bits, but e.g. 65536 or 8388608 bits?
To get there, I'd xor each 256/512-bit hash into a different offset of the final checksum.
The offset would depend on some part of the input record (e.g. the TXID).
Then, for convenience, I'd calculate and distribute a shorter hash (of the big "final checksum").

EDIT:
And, for extra security, I'd also feed the number of input records into the final hash.
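
at a toy scale, the whole idea would look something like this (FNV-1a again standing in for a real 256/512-bit hash, and a 65536-bit accumulator as in the example above):

Code:
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// FNV-1a standing in for a real 256/512-bit hash, to keep this runnable.
static uint64_t toy_hash(const uint8_t* p, size_t len) {
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; ++i) { h ^= p[i]; h *= 1099511628211ULL; }
    return h;
}

int main() {
    std::vector<std::pair<std::string, std::vector<uint8_t>>> utxo = {
        {"txid_bbb", {0x03}},              // again in no particular order
        {"txid_aaa", {0x01, 0x02}},
    };

    // the big checksum: 65536 bits = 1024 x 64-bit words
    std::vector<uint64_t> acc(1024, 0);
    uint64_t count = 0;

    for (const auto& rec : utxo) {
        // the offset depends on part of the input record (here: the txid)
        size_t slot = toy_hash(reinterpret_cast<const uint8_t*>(rec.first.data()),
                               rec.first.size()) % acc.size();
        // hash of "key+record(s)", xor-ed into that slot (order-independent)
        std::vector<uint8_t> buf(rec.first.begin(), rec.first.end());
        buf.insert(buf.end(), rec.second.begin(), rec.second.end());
        acc[slot] ^= toy_hash(buf.data(), buf.size());
        ++count;
    }

    // the shorter, distributable hash: big checksum plus the record count
    acc.push_back(count);
    uint64_t final_hash = toy_hash(reinterpret_cast<const uint8_t*>(acc.data()),
                                   acc.size() * sizeof(uint64_t));
    std::printf("toy final checksum: %016llx\n", (unsigned long long)final_hash);
    return 0;
}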