Bitcoin Forum

Bitcoin => Development & Technical Discussion => Topic started by: jratcliff63367 on July 01, 2013, 07:46:23 PM

Title: A bitcoin blockchain parser in a few (thousand) lines of C++
Post by: jratcliff63367 on July 01, 2013, 07:46:23 PM

I just wrote a bitcoin blockchain parser in a tiny code snippet. This is implemented as just two source files, one header and one CPP, without any external dependencies. This code snippet does not use any memory allocation, templates, containers, STL, Boost, or anything more complicated than 'fopen, fread, fclose'.

I wrote this mostly as a learning tool so I could understand the blockchain format myself.

While writing it I ran into a couple of minor issues. One is that sometimes a blockchain data file will just run out of data, with the rest of the file containing zeros. I don't know if this is normal or expected, but I treat it as an end-of-file condition.

The other is that some blocks contain less data than is indicated by the block length; meaning that after all of the transactions are read in, the file pointer has not advanced as far as the 'block length' which was indicated. I am going to assume this is normal and expected?

This code snippet parses my copy of the blockchain (9.2gb) in roughly 95 seconds; which I figured was pretty good; though I don't know what I have to compare to.

At any rate, if anyone finds this code snippet useful or just wants to better understand the data layout of the bitcoin blockchain you can find it here:

http://codesuppository.blogspot.com/2013/07/a-bitcoin-blockchain-parser-as-single.html

Feedback, bugfixes, suggestions, all welcome.

Thanks,

John

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: etotheipi on July 02, 2013, 05:16:45 PM

Quote from: jratcliff63367 on July 01, 2013, 07:46:23 PM

Being someone who wrote a blockchain parser from scratch (which is what Armory does on every load/rescan), I can tell you that everything about that is normal (except for the second point below). And 95 seconds is pretty good. It depends what else you're doing while you're scanning. In my case, my code scans in about 150-240 seconds depending on HDD caching, but it's also identifying all the transaction scripts in order to search for coins that are related to my wallet (so I additionally parse and identify individual scripts and then do a lookup in a set<Address160> structure to see if it's relevant).

The zeros at the end of files are due to pre-allocation. Since Bitcoin-Qt version 0.8, they now always allocate blk*.dat files in chunks of 16 MB to reduce disk thrashing. i.e. if you keep appending to a file over and over again, the OS is forced to keep reallocating that space, or fragmenting it. By preallocating with zero-padding, the number of disk reallocations is cut down dramatically.
I'm not sure what would cause the numBytes field to be incorrect. How often do you see it? Do you have an example?
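A minimal sketch of walking the records in one blk*.dat file, with the zero padding treated as end-of-data. It assumes each record is a 4-byte magic value (F9 BE B4 D9 on the main network), a 4-byte little-endian length, and then the block itself. The names BlockSpan and scanBlockFile are made up for this example, and the file is assumed to be already loaded into memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Location of one raw block inside a blk*.dat image (hypothetical helper type).
struct BlockSpan {
    std::size_t offset;    // where the block data starts, just past the 8-byte record header
    std::uint32_t length;  // block length as stored in the record header
};

// Read a 32-bit little-endian integer.
static std::uint32_t readLE32(const std::uint8_t* p) {
    return std::uint32_t(p[0]) | (std::uint32_t(p[1]) << 8) |
           (std::uint32_t(p[2]) << 16) | (std::uint32_t(p[3]) << 24);
}

// Walk the records of a blk*.dat image held in memory.
std::vector<BlockSpan> scanBlockFile(const std::vector<std::uint8_t>& file) {
    const std::uint32_t kMainNetMagic = 0xD9B4BEF9u;  // bytes F9 BE B4 D9 on disk
    std::vector<BlockSpan> blocks;
    std::size_t pos = 0;
    while (file.size() - pos >= 8) {
        if (readLE32(&file[pos]) != kMainNetMagic) break;  // zero padding: stop here
        std::uint32_t length = readLE32(&file[pos + 4]);
        if (length > file.size() - pos - 8) break;         // truncated record
        blocks.push_back({pos + 8, length});
        pos += 8 + std::size_t(length);  // advance by the stored length
    }
    return blocks;
}
```

Advancing by the stored length, rather than by however many bytes the transaction parser consumed, keeps the scan in sync even if a block appears to hold less data than its length field says.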

I think doing exactly what you did is one of the best ways to learn the ins and outs of Bitcoin. If you want to do more, I recommend you pick some addresses and try adding code to watch for that address and accumulate its balance and spendable UTXOs.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: piotr_n on July 02, 2013, 06:56:30 PM

sorry, what do you mean by "blockchain parser"?
of course it doesn't need any external dependencies, since it doesn't even seem to bother about needing sha256 at any place.
but myself, I cannot imagine a bitcoin blockchain parser that would not need to use sha256

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: jratcliff63367 on July 03, 2013, 03:07:43 PM

Thanks for the feedback, I agree this is the best way to learn about the blockchain.

By 'parser' I mean something that can scan every file in the block-chain on disk and return each block in a data structure suitable for further processing.

That is all my code snippet does, nothing more and nothing less. It's a first starting point anyone would need to do if they wanted to crack the walnut of data inside the block-chain files.

The next step is to actually interpret the data in those blocks and convert them into specific transactions. That requires actually parsing the input and output scripts. To do that, I have to implement the script virtual machine. Now, I could do that by borrowing someone else's source code, or I could just write it for myself as a learning exercise.

My personal goal is to extract all of the transaction data. I don't want to write a full client, I don't want to submit or even validate transactions. I suppose the question is can I parse out the transactions without also having to do the full validation? If I have to, then so be it, if not, that's fine too.

My personal motivation in doing this is that I want a way to track the embedded 'value' in all outstanding bitcoin transactions by comparing the value of the coin in US dollars at the time it was last transferred. I'm particularly interested in large quantities of BTC which haven't moved for years but suddenly 'come to life'.

Perhaps someone has already written such an analysis tool, but I haven't seen one yet.

At any rate, I'm mostly doing this to satisfy my own personal curiosity and if someone else finds the source code useful and/or educational, that's a bonus.

So, I guess my follow up question is this. What part of the scripts represent the actual transaction; i.e. 'This amount from this BTC address to this BTC address.' Is that in both the input and output script? Do I have to run the full cryptography to get that answer, or is that just part of the basic information when you parse the script?

Thanks,

John

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: TierNolan on July 03, 2013, 04:23:34 PM

Quote from: jratcliff63367 on July 03, 2013, 03:07:43 PM

I suppose the question is can I parse out the transactions without also having to do the full validation? If I have to, then so be it, if not, that's fine too.

I don't think the reference client actually writes blocks to disk unless they have been validated.

One problem is that all blocks are written to the disk when validated, so you may have some blocks from orphan chains on the disk. You would need to exclude them from your analysis.

The easiest way would be to just take the longest chain in terms of blocks without worrying about POW.

If you ignore the last 100 blocks in the main chain, then you should get a single clean chain without having to worry about orphans.
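A minimal sketch of the "longest chain in terms of blocks" idea, assuming each parsed block is reduced to its own hash and its previous-block hash. Hashes are plain strings here only to keep the example readable; BlockLink and longestChain are hypothetical names, and proof-of-work is deliberately ignored, as suggested above.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Minimal view of a parsed block: its own hash and its parent's hash.
struct BlockLink {
    std::string hash;
    std::string prevHash;
};

// Return the hashes of the longest chain, oldest block first.
// Assumes the links contain no cycles, which holds for real block hashes.
std::vector<std::string> longestChain(const std::vector<BlockLink>& blocks) {
    std::map<std::string, std::string> parent;  // hash -> prevHash
    for (const BlockLink& b : blocks) parent[b.hash] = b.prevHash;

    std::map<std::string, std::size_t> height;  // hash -> number of known ancestors
    std::string bestTip;
    std::size_t bestHeight = 0;
    for (const BlockLink& b : blocks) {
        // Walk back until we hit a block of known height or leave the set.
        std::vector<std::string> path;
        std::string cur = b.hash;
        while (parent.count(cur) && !height.count(cur)) {
            path.push_back(cur);
            cur = parent[cur];
        }
        std::size_t h = height.count(cur) ? height[cur] + 1 : 0;
        // Assign heights from the oldest block on the path up to b itself.
        for (std::size_t i = path.size(); i-- > 0;) height[path[i]] = h++;
        if (bestTip.empty() || height[b.hash] > bestHeight) {
            bestTip = b.hash;
            bestHeight = height[b.hash];
        }
    }

    // Follow parent links from the best tip back to the first known block.
    std::vector<std::string> chain;
    for (std::string cur = bestTip; parent.count(cur); cur = parent[cur])
        chain.push_back(cur);
    return std::vector<std::string>(chain.rbegin(), chain.rend());
}
```

Blocks that are not on the returned chain are the orphans to exclude. When two tips have the same height this sketch keeps whichever it saw first, which is one reason to ignore the most recent blocks.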

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: ProfMac on July 03, 2013, 04:33:57 PM

I also do projects like this to learn. I figure when someone says I don't understand something, a piece of from scratch working code is a powerful answer.

My next step would be to find the code snippets in the reference client that can do the parsing, and comprehend them with my new found knowledge, then write a utility that uses the reference client code when possible. I always do that sort of thing, I usually get frustrated mid-way through.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: etotheipi on July 03, 2013, 04:39:11 PM

Worrying about block validation is a whole new world you don't want to get into. For now you can simply leverage the fact that if the block is in the blk*.dat files it is valid along with all the transactions in it. As TierNolan said though, you'll probably have a few orphans in there, so you probably need your scanning code to either compute longest chain (which is a good exercise in itself) or just not assume that 100% of the blocks in those files are in the main chain.

For a first exercise, you don't need a full script evaluation engine. Just program in enough to recognize common TxOut scripts and ignore the non-std scripts. Although in terms of computing balances, it's kind of arbitrary... You can store every TxOut in a map<OutPoint, RawTxOut> and remove them as you see TxIns that spend them. When you're done, you should have a map containing all unspent TxOuts on the network.
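The map<OutPoint, RawTxOut> idea can be sketched roughly as follows. The struct layouts are simplified assumptions (a real TxOut also carries its script, and the hash would be 32 raw bytes rather than a string), and applyTx is a hypothetical name.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// (transaction hash, output index) identifies exactly one TxOut.
using OutPoint = std::pair<std::string, std::uint32_t>;

struct TxIn  { OutPoint prevOut; };
struct TxOut { std::int64_t valueSatoshi; };
struct Tx {
    std::string hash;
    bool isCoinbase;  // coinbase inputs do not reference a real TxOut
    std::vector<TxIn> inputs;
    std::vector<TxOut> outputs;
};

using UtxoSet = std::map<OutPoint, TxOut>;

// Apply one transaction: remove the TxOuts it spends, add the ones it creates.
void applyTx(UtxoSet& utxo, const Tx& tx) {
    if (!tx.isCoinbase)
        for (const TxIn& in : tx.inputs) utxo.erase(in.prevOut);
    for (std::uint32_t i = 0; i < tx.outputs.size(); ++i)
        utxo[OutPoint(tx.hash, i)] = tx.outputs[i];
}
```

Calling applyTx for every transaction in main-chain order leaves the map holding only the unspent TxOuts.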

Also for your analysis of looking for dormant coins being spent, look up "bitcoin days destroyed"

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: bluemeanie1 on July 04, 2013, 03:01:50 AM

I've been doing Block Chain processing as well in Java.

http://i.imagebanana.com/img/g11ev2ol/Selection_018email.png

this diagram shows the famous Hal Finney transaction (block #170), which spends an output from block #9. This output is later spent in block #181, then #182, and then spent in both #183 and #221.

I'm getting my data from the Bitcoind REST APIs though, it's a bit slower, but once the primary ETL is done then you don't have to worry much about it.

the data set is far too large for the typical end user of Bitcoin though.

............. another cool one. looks like a miner collected a bunch of coins into one account and then spent it to another.

http://i.imagebanana.com/img/9e42925q/Selection_019.png

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: jratcliff63367 on July 04, 2013, 02:35:50 PM

Does anyone know of a better explanation of how script processing works than I have found online?

Or could someone post a pseudo-code style description here?

My question is this.

There are (n1) input scripts and (n2) output scripts.

Does n1 always equal n2?

Does each input match up with each output?

In other words, in what order are the scripts executed and do they share the same stack?

For example, does input one script push values on the stack which are then popped by output script one?

It's not entirely clear to me just from reading the online documentation the order of execution and how the stack is shared between scripts.

Thanks,

John

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: TierNolan on July 04, 2013, 03:32:52 PM

Quote from: jratcliff63367 on July 04, 2013, 02:35:50 PM

There are (n1) input scripts and (n2) output scripts.

Right

Quote

Does n1 always equal n2?

Quote

Does each input match up with each output?

This works for certain kinds of multi-sig. You can say that your signature is conditional on the matching output being unchanged (but you don't care about any of the other outputs).

Quote

In other words, in what order are the scripts executed and do they share the same stack?

They are all independent. However, some of the hashing steps for signatures blank out (i.e. include/exclude) some outputs.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: etotheipi on July 04, 2013, 03:56:53 PM

The TxIns and TxOuts are independent. Think of the network as containing a bunch of floating [unspent coins]. Each of those unspent TxOuts is like a bill (like a $20 bill). The TxIn is simply a reference to a particular bill with a signature proving you are authorized to spend it. If you want to send someone 10 BTC, you need to collect enough bills (say, a 4 BTC bill/TxOut, a 2 BTC bill/TxOut, and a 7 BTC bill/TxOut). Those three bills add up to 13 BTC, so the transaction gets two outputs: one 10 BTC output, and one output sending the remainder back to yourself (3 BTC).

You can think of transactions as eating unspent TxOuts, and producing new TxOuts (and you are authorizing the spending of the old TxOuts by the signatures in the TxIns). A transaction takes TxOuts of some ownership, and converts them into TxOuts of a new ownership. You can split them or merge them in any quantity, as long as you have at least one TxIn (corresponding to a previous TxOut being eaten) and one new TxOut.

Here's a random picture I made a long time ago that attempts to illustrate it.

https://dl.dropboxusercontent.com/u/1139081/BitcoinImg/tx_illustr.png

The dotted lines for the TxIns illustrate that they are really kind of "transient" data. They're simply for the network to see that you were authorized to move those coins, but in the far future, the old spent TxOuts, and the TxIns that moved them can all be pruned off.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: jratcliff63367 on July 05, 2013, 06:42:14 PM

Ok, I think I'm making progress understanding this; a bit.

I now understand that each block has a certain number of transactions and each transaction has a certain number of inputs and outputs (potentially representing where btc is coming from and where it is going to).

Could you guys help clarify some things in regards to simply parsing the raw blockchain data to get a meaningful transaction history out?

(1) Can you confirm that it is true that a program which only wants to navigate the transaction history of the blockchain does not actually have to parse either the input or the output scripts? This is true, correct? Those scripts are only used for validating transactions but since I'm reading the blockchain off of my hard drive, they represent all properly validated transactions. Correct?

(2) All of the information needed to understand the blockchain (i.e. see the same kind of information shown on blockchain.info) can be found simply by parsing the blockchain data files (blkNNNNN.dat) files?

(3) When looking at the block data; there is a timestamp for the block; but I don't see a timestamp for individual transactions. Is this correct?

(4) When processing an input it specifies an input transaction hash and a transaction index number; how does one go from that hash/index and figure out a specific output transaction in a previous block?

(5) On BlockChain info, if you just look at Block #1 https://blockchain.info/block/00000000839a8e6886ab5951d76f411475428afc90947ee320161bbf18eb6048 It specifies both a hash and a previous block hash. I'm a little confused on that point; there is no 'previous' block to block #1. Also, there is no 'hash' field on block-chain data structure; only a previous hash field.

(6) On the same block #1 page, it shows an output address '12c6DSiU4Rq3P4ZxziKxzrL5LmMBrzjrJX'. Where does that value come from? Was it in the script? I don't see it anywhere in the data structures? Also, this address isn't a hex value; is there some published algorithm that converts a binary hash into this ASCII string format?

Thanks,

John

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: TierNolan on July 05, 2013, 06:53:37 PM

Quote from: jratcliff63367 on July 05, 2013, 06:42:14 PM

(1) Can you confirm that it is true that a program which only wants to navigate the transaction history of the blockchain does not actually have to parse either the input or the output scripts? This is true, correct? Those scripts are only used for validating transactions but since I'm reading the blockchain off of my hard drive, they represent all properly validated transactions. Correct?

(2) All of the information needed to understand the blockchain (i.e. see the same kind of information shown on blockchain.info) can be found simply by parsing the blockchain data files (blkNNNNN.dat) files?

Right

Quote

(3) When looking at the block data; there is a timestamp for the block; but I don't see a timestamp for individual transactions. Is this correct?

Transactions don't have timestamps. They are "locked" to a time when they are included in a block. If they had a timestamp, then they would only work for a particular block, but miners decide which blocks to place transactions into.

They do have a "locktime". Transactions cannot be added into a block until that time has arrived (or that block number, if the value is below 500000000).
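A small sketch of how the locktime field can be interpreted. The threshold of 500,000,000 is the one used by the reference client to tell block heights from Unix timestamps; LockKind and classifyLockTime are names invented for this example.

```cpp
#include <cstdint>

enum class LockKind { None, BlockHeight, UnixTime };

// Decide what a transaction's lock_time value means.
LockKind classifyLockTime(std::uint32_t lockTime) {
    const std::uint32_t kThreshold = 500000000u;  // below: block height, otherwise: Unix time
    if (lockTime == 0) return LockKind::None;     // 0 means "not locked"
    return lockTime < kThreshold ? LockKind::BlockHeight : LockKind::UnixTime;
}
```

Most transactions in the chain simply carry a locktime of 0, so a parser can usually read the field and move on.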

Quote

(4) When processing an input it specifies an input transaction hash and a transaction index number; how does one go from that hash/index and figure out a specific output transaction in a previous block?

You need to create a map of hash(transaction) -> transaction (possibly file pointer).

You have to hash the entire transaction as it appears in the file. That is sha256(sha256(transaction)). Sometimes people get the big-endian/little-endian the wrong way around.
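A rough sketch of such an index, assuming the double SHA-256 has already been computed elsewhere and stored as 32 raw bytes. TxLocation and its fields are hypothetical. The helper shows the endianness point: block explorers print the hash bytes in the reverse of the order in which they are stored and referenced inside the files.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

using Hash256 = std::array<std::uint8_t, 32>;  // hash bytes in file order

// Where a transaction lives on disk (hypothetical layout for this sketch).
struct TxLocation {
    std::uint32_t fileIndex;   // which blkNNNNN.dat file
    std::uint64_t fileOffset;  // byte offset of the transaction in that file
};

// hash(transaction) -> location; std::array compares lexicographically,
// so it can be used directly as a map key.
using TxIndex = std::map<Hash256, TxLocation>;

// Hex string as shown by block explorers: the stored bytes in reverse order.
std::string toDisplayHex(const Hash256& h) {
    static const char* kDigits = "0123456789abcdef";
    std::string out;
    out.reserve(64);
    for (std::size_t i = h.size(); i-- > 0;) {
        out.push_back(kDigits[h[i] >> 4]);
        out.push_back(kDigits[h[i] & 0x0F]);
    }
    return out;
}
```

The previous-transaction hash inside a TxIn is in file order, so it can be used as a lookup key without reversing; the reversal is only needed when comparing against what a web site displays.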

Quote

(5) On BlockChain info, if you just look at Block #1 https://blockchain.info/block/00000000839a8e6886ab5951d76f411475428afc90947ee320161bbf18eb6048 It specifies both a hash and a previous block hash. I'm a little confused on that point; there is no 'previous' block to block #1. Also, there is no 'hash' field on block-chain data structure; only a previous hash field.

Yes there is, blocks start from block 0, there is no previous to block zero though.

However, the genesis block is presumably not in the file, since it is hard-coded.
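On the "no hash field" part of the question: a block's own hash is not stored in the block. It is the double SHA-256 of the 80-byte block header, which is why only the previous hash appears in the data structure. A minimal sketch of reading that header, with the hashing itself left out and the names BlockHeader and parseBlockHeader chosen for this example:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// The 80-byte block header, in file order.
struct BlockHeader {
    std::uint32_t version;
    std::array<std::uint8_t, 32> prevBlockHash;  // hash of the previous block's header
    std::array<std::uint8_t, 32> merkleRoot;     // commits to the block's transactions
    std::uint32_t timestamp;                     // Unix time; the only timestamp available
    std::uint32_t bits;                          // compact difficulty target
    std::uint32_t nonce;
};

// Read a 32-bit little-endian integer.
static std::uint32_t le32(const std::uint8_t* p) {
    return std::uint32_t(p[0]) | (std::uint32_t(p[1]) << 8) |
           (std::uint32_t(p[2]) << 16) | (std::uint32_t(p[3]) << 24);
}

// Parse a header from a pointer to at least 80 readable bytes.
BlockHeader parseBlockHeader(const std::uint8_t* p) {
    BlockHeader h;
    h.version = le32(p);
    std::copy(p + 4, p + 36, h.prevBlockHash.begin());
    std::copy(p + 36, p + 68, h.merkleRoot.begin());
    h.timestamp = le32(p + 68);
    h.bits = le32(p + 72);
    h.nonce = le32(p + 76);
    return h;
}
```

Hashing those same 80 bytes twice with SHA-256 gives the value that the next block stores as its previous hash.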

Quote

(6) On the same block #1 page, it shows an output address '12c6DSiU4Rq3P4ZxziKxzrL5LmMBrzjrJX'. Where does that value come from? Was it in the script? I don't see it anywhere in the data structures? Also, this address isn't a hex value; is there some published algorithm that converts a binary hash into this ASCII string format?

They calculate that from the script. There are a few standard (https://en.bitcoin.it/wiki/Script#Standard_Transaction_to_Bitcoin_address) transactions.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: ProfMac on July 05, 2013, 06:53:44 PM

Quote from: jratcliff63367 on July 05, 2013, 06:42:14 PM

Ok, I think I'm making progress understanding this; a bit.

(5) On BlockChain info, if you just look at Block #1 https://blockchain.info/block/00000000839a8e6886ab5951d76f411475428afc90947ee320161bbf18eb6048 It specifies both a hash and a previous block hash. I'm a little confused on that point; there is no 'previous' block to block #1. Also, there is no 'hash' field on block-chain data structure; only a previous hash field.

Block #0 exists. Your link has a field for "Previous Block" on the right hand side.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: xavier on July 05, 2013, 07:07:33 PM

This is pretty cool! Thanks for doing this. I'll check it out over the weekend.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: jratcliff63367 on July 05, 2013, 07:20:14 PM

Ok, thanks, that helps quite a bit.

I'm really surprised that transactions don't have a time stamp. You mean there is no way to know 'when' a transaction occurred?

So, it sounds like I don't need to parse the scripts to know the transactions but if I want to know the source/destination addresses of the public keys I *do* have to parse the scripts?

I guess I'm still a little confused on that point. I started writing a script parser, but the scripts I try to run against it are all kind of incomplete. The input scripts generally just push data on the stack but have no actual instructions to 'do' anything.

The output scripts tend to execute an instruction such as 'OP_CHECKSIG' which requires that two items are on the stack, but when I get here there is usually only one item on the stack; and I'm unclear on what was supposed to have been executed prior to it.

I'm not really understanding I guess how the input scripts match up to the output scripts and how current state of the stack is retained; I mean do you like run an input script; which pushes some stuff on the stack for the output script, or vice versa? And, if so, how do you know which input scripts match which output scripts?

For example, the block #1 output script pushes 65 bytes of data on the stack and then tries to execute OP_CHECKSIG which immediately fails in my VM and the reference code for bitcoin-qt because it requires there be at least two arguments on the stack.

Regarding my other question about the 'public-key' part of bitcoin addresses, I found that it's using this 'Elliptic Curve Digital Signature Algorithm' thing. What I can't find is a reference to how one converts the binary data for the key into the ASCII representation; this would be useful for debugging.

Sorry for the annoying questions but I will try to make up for it by posting some articles and open-source code snippets if I ever get this transaction analysis tool working.

Does anyone know of a C/C++ program someone else has already written that does what I'm trying to do here; just walk the block-chain and track transactions? All of these code bases have sooo...many dependencies that you dive into cryptographic hell trying to decipher them.

Thanks,

John

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: etotheipi on July 05, 2013, 07:26:38 PM

You shouldn't be surprised there's no timestamps on transactions - that's the whole point of Bitcoin which is for a whole network of thousands of independent nodes to all agree on the ordering of transactions. In essence, a transaction doesn't even really exist until it is included in a block. Therefore, the block timestamp is when the TX became "real". Everyone has different clocks and receives the tx in different orders and times. The point of Bitcoin is for this network with no central authority to agree on which ones actually happened.

As for C++ code, go look at the Armory code base, which has code that does exactly what you're looking for in the cppForSWIG directory. BlockUtils cppForSWIG. Should be a function called parseEntireBlockchain or something like that.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: piotr_n on July 05, 2013, 07:28:13 PM

Quote from: jratcliff63367 on July 05, 2013, 07:20:14 PM

That's what the output scripts are.
The spending scripts pull this data from the stack and verify against signatures.

What is put on the stack is the public key that you need to convert into 25 bytes of a "version+hash+chksum" - then you base58 encode it and that is the address.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: TierNolan on July 05, 2013, 07:32:43 PM

Quote from: jratcliff63367 on July 05, 2013, 07:20:14 PM

I guess I'm still a little confused on that point. I started writing a script parser, but the scripts I try to run against it are all kind of incomplete. The input scripts generally just push data on the stack but have no actual instructions to 'do' anything.

The output scripts tend to execute an instruction such as 'OP_CHECKSIG' which requires that two items are on the stack, but when I get here there is usually only one item on the stack; and I'm unclear on what was supposed to have been executed prior to it.

That is intentional.

Each transaction has half of the script.

You run the script for the input and then run the script for the output.

The standard address scripts are:

Output: OP_DUP OP_HASH160 <pubKeyHash> OP_EQUALVERIFY OP_CHECKSIG
Input: scriptSig: <sig> <pubKey>

The input is run and then the output.

So, the output says "Anyone who can setup the stack so that this script verifies can spend this coin".

The full instructions are

OP_DUP: Copy the top of the stack
OP_HASH160: Hash the top of the stack with SHA-256 and then RIPEMD-160
<pub-key-hash>: Push that value onto the stack
OP_EQUALVERIFY: Make sure the top 2 values on the stack are the same (and pop both)
OP_CHECKSIG: Check the signature

If you run the input, it pushes 2 values onto the stack

<sig> <public key>

This sets up a successful run for the output

OP_DUP: <sig> <public key> <public key>
OP_HASH160: <sig> <public key> RIPEMD-160(SHA-256(<public key>))
<pub-key-hash>: <sig> <public key> RIPEMD-160(SHA-256(<public key>)) <pub-key-hash>
OP_EQUALVERIFY: <sig> <public key> (Also checked that the hash of <public key> was equal to <pub-key-hash>)
OP_CHECKSIG: checks that sig is a valid signature using the public key

Quote

I'm not really understanding I guess how the input scripts match up to the output scripts and how current state of the stack is retained; I mean do you like run an input script; which pushes some stuff on the stack for the output script, or vice versa? And, if so, how do you know which input scripts match which output scripts?

Right, each is half.

Quote

Regarding my other question about the 'public-key' part of bitcoin addresses, I found that it's using this 'Elliptic Curve Digital Signature Algorithm' thing. What I can't find is a reference to how one converts the binary data for the key into the ASCII representation; this would be useful for debugging.

Addresses are encoded using Base 58 (https://en.bitcoin.it/wiki/Base58Check_encoding) encoding. Normal keys etc are normally shown using hex encoding.
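A dependency-free sketch of the Base58 step on its own. It assumes you already have the 25 bytes described earlier in the thread (version + hash + checksum); computing the checksum needs double SHA-256 and is not shown. base58Encode is a name chosen for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Encode bytes as Base58 using the Bitcoin alphabet (no 0, O, I or l).
std::string base58Encode(const std::vector<std::uint8_t>& data) {
    static const char* kAlphabet =
        "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

    // Each leading zero byte is written as a literal '1'.
    std::size_t zeros = 0;
    while (zeros < data.size() && data[zeros] == 0) ++zeros;

    // Treat the rest as one big-endian number and convert it to base 58.
    // digits holds the result least-significant digit first.
    std::vector<std::uint8_t> digits;
    for (std::size_t i = zeros; i < data.size(); ++i) {
        unsigned carry = data[i];
        for (std::size_t j = 0; j < digits.size(); ++j) {
            carry += unsigned(digits[j]) * 256;  // number = number * 256 + byte
            digits[j] = std::uint8_t(carry % 58);
            carry /= 58;
        }
        while (carry > 0) {
            digits.push_back(std::uint8_t(carry % 58));
            carry /= 58;
        }
    }

    std::string out(zeros, '1');
    for (std::size_t i = digits.size(); i-- > 0;) out.push_back(kAlphabet[digits[i]]);
    return out;
}
```

The leading-zero rule is why main-network addresses start with '1': their version byte is 0x00.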

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: jratcliff63367 on July 05, 2013, 07:54:06 PM

>>You run the script for the input and then run the script for the output.

Ok, I had previously asked in this thread how the input and output scripts related to each other and I did not get a clear answer.

In the transaction header there are (n1) input scripts and (n2) output scripts; how do I know which input script to run on the virtual-machine prior to running a particular output script?

Thanks,

John

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: piotr_n on July 05, 2013, 07:55:40 PM

Quote from: jratcliff63367 on July 05, 2013, 07:54:06 PM

In the transaction header there are (n1) input scripts and (n2) output scripts; how do I know which input script to run on the virtual-machine prior to running a particular output script?

You take output script from a different transaction - the one pointed by txid+vout

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: etotheipi on July 05, 2013, 08:01:36 PM

Quote from: jratcliff63367 on July 05, 2013, 07:54:06 PM

In a given transaction they are unrelated. For each TxIn you look at its OutPoint and go fetch the TxOut it references from a previous transaction in the block database. You use that TxOut. The TxOut in this Tx will be used as inputs in a future transaction (when the recipient spends the money)

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: TierNolan on July 05, 2013, 08:03:56 PM

Quote from: jratcliff63367 on July 05, 2013, 07:54:06 PM

In the transaction header there are (n1) input scripts and (n2) output scripts; how do I know which input script to run on the virtual-machine prior to running a particular output script?

No, you run the input script from the spending transaction and then the output script from the source transaction.

If transaction 123456789 input 4 references transaction 987654321, output 7, then those are the 2 scripts you have to use together.

Also, if they are standard transactions, you don't even need to bother, the address will be in a constant location.
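A sketch of what "constant location" can look like in practice, covering only two templates: the pay-to-address script shown earlier in the thread, and the bare public key followed by OP_CHECKSIG seen in the block #1 output. The opcode byte values are the standard ones; ScriptMatch and classifyOutputScript are names made up for this example, and anything else is simply reported as unknown.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class ScriptType { PayToPubKeyHash, PayToPubKey, Unknown };

struct ScriptMatch {
    ScriptType type;
    std::size_t dataOffset;  // offset of the hash or key inside the script
    std::size_t dataLength;  // 20 for a key hash, 65 for an uncompressed key
};

// Recognize two common output script templates by their fixed byte layout.
ScriptMatch classifyOutputScript(const std::vector<std::uint8_t>& s) {
    const std::uint8_t OP_DUP = 0x76, OP_HASH160 = 0xA9,
                       OP_EQUALVERIFY = 0x88, OP_CHECKSIG = 0xAC;

    // OP_DUP OP_HASH160 <push 20 bytes> <pubKeyHash> OP_EQUALVERIFY OP_CHECKSIG
    if (s.size() == 25 && s[0] == OP_DUP && s[1] == OP_HASH160 && s[2] == 20 &&
        s[23] == OP_EQUALVERIFY && s[24] == OP_CHECKSIG)
        return {ScriptType::PayToPubKeyHash, 3, 20};

    // <push 65 bytes> <pubKey> OP_CHECKSIG
    if (s.size() == 67 && s[0] == 65 && s[66] == OP_CHECKSIG)
        return {ScriptType::PayToPubKey, 1, 65};

    return {ScriptType::Unknown, 0, 0};
}
```

For the first template the 20 bytes at offset 3 are already the hash that goes into the address; for the second, the key still has to be hashed before it can be shown as an address.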

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: piotr_n on July 05, 2013, 08:08:20 PM

so here is where you come to the point where you understand that you need to build and keep a database of unspent txouts - a big one.
and also you will need sha256, each time before putting a record into it :)

EDIT: let me know how many lines of code your parser will have, at the end :)

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: jratcliff63367 on July 05, 2013, 08:43:16 PM

Yeah, I'm starting to no longer have fun doing this. The relationship between the inputs and outputs is about as clear as mud. Well, let me be clear, I get that each output needs an input, but hell if I know how to figure out which outputs refer to which inputs. Just a bit ago, I wanted to figure out how to convert the binary representation of a public-key to ASCII and even that had a lot of dependencies; even the 'no library dependency version' depended on some network thing called 'htonl' and on an encryption routine as well. That's a lot of code dragged in just to convert some binary to ASCII.

I still do not believe that this needs to be that difficult, but everyone seems to start by including massive libraries like CryptoPP and boost, networking code, etc. so it's not much fun trying to dig through this stuff.

I like to write code with as few dependencies as humanly possible and often in what I call 'snippet' form; one header file and one CPP which demonstrates how to do just a single thing. But I can't find anything like that, instead it's these massive libraries which build on top of each other in layers so that breaking it out into 'snippet' form is a major task.

I also think that the code snippet I started with, which reads each block into memory, is valuable and useful, but converting that raw data into paired input/output transactions seems to be a bit of a mystery to me so far.

[*Edit*]

I may not give up just yet. I have some time this weekend and I finally was able to get BitcoinArmory 'cppForSwig' to build and run on my machine and parse the bitcoin blockchain on my hard-drive. I had to change a few things to get it to work; but it is working now. Hopefully I will be able to unravel the mystery of inputs-outputs and transactions by stepping through this code.

It does use CryptoPP but perhaps I can strip out just the (hopefully) handful of routines which are actually used.

John

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: piotr_n on July 05, 2013, 08:48:44 PM

Oh, don't be a pussy.
This is exactly where the fun starts.

It all can be done - maybe easier than you imagine ATM.
As someone had said here, you can just keep the unspent outputs in memory, in maps. Hashmaps are standard in C++.
And sha256 function is a simple code that you can include in your sources.
You don't really need any dependencies, if you like writing code :)

And no, output does not need an input - only input needs an output.
So you only need to keep track of unspent outputs, in order to be able to 'spend' them by processing further inputs in the block chain.
It's not a really complex problem to solve.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: DeathAndTaxes on July 05, 2013, 09:05:34 PM

Quote from: jratcliff63367 on July 05, 2013, 08:43:16 PM

They don't. That might be why you are confused.

The only relationship between inputs and outputs in a tx is that the sum of the inputs must be equal to or greater than the sum of the outputs. The tx fee is the difference between the sum of the inputs and the sum of the outputs.

So some rules to get you started.

1) The inputs of all txs (except coinbase = generation) are the output of prior transactions.
2) There is a 1:1 relationship between the INPUT of a given transaction and the output of a PRIOR transaction.
3) In a given transaction there is no relationship between the inputs and outputs.
4) For a tx to be valid the sum of the inputs must be equal to or greater than the sum of the outputs.
5) The difference between inputs and outputs is the tx fee.
6) Since a tx can have n inputs and m outputs we refer to a specific output by both the tx hash AND the index.
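Rules 4 and 5 can be sketched in a few lines. Values are kept as integer satoshi (1 BTC = 100,000,000 satoshi) to avoid rounding, and the input values are assumed to have been looked up already from the prior transactions' outputs. txFee is a hypothetical helper name; coinbase transactions, which have no real inputs, are outside its scope.

```cpp
#include <cstdint>
#include <vector>

// Fee of a non-coinbase transaction in satoshi, or -1 if the outputs
// exceed the inputs (which would make the transaction invalid).
std::int64_t txFee(const std::vector<std::int64_t>& inputValues,
                   const std::vector<std::int64_t>& outputValues) {
    std::int64_t in = 0, out = 0;
    for (std::int64_t v : inputValues) in += v;    // values of the prior outputs being spent
    for (std::int64_t v : outputValues) out += v;  // values of the new outputs
    return in >= out ? in - out : -1;
}
```

For the transaction examined later in this post (one 1,000 BTC input, outputs of 300 BTC and 699.9995 BTC) this gives 50,000 satoshi, i.e. a fee of 0.0005 BTC.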

Quote

but converting that raw data into paired input/output transactions seems to be a bit of a mystery to me so far.

It will forever remain a mystery no matter how many libraries you use until you shake that flawed model.

Lets look at a random transaction.

http://blockexplorer.com/tx/a6d9c176ecb041c2184327b8375981127f3632758a7a8e61b041343efc3bcb6e

In raw form

Quote

{
  "hash":"a6d9c176ecb041c2184327b8375981127f3632758a7a8e61b041343efc3bcb6e",
  "ver":1,
  "vin_sz":1,
  "vout_sz":2,
  "lock_time":0,
  "size":257,
  "in":[
   {
   "prev_out":{
   "hash":"b5045e7daad205d1a204b544414af74fe66b67052838851514146eae5423e325",
   "n":0
   },
   "scriptSig":"304402200e3d4711092794574e9b2be11728cc7e44a63525613f75ebc71375f0a6dd080d02202ef1123328b3ecddddb0bed77960adccac5bbe317dfb0ce149eeee76498c19b101 04a36b5d3b4caa05aec80752f2e2805e4401fbdbe21be1011dc60c358c5fc4d3bedd1e03161fb4b3a021c3764da57fee0d73570f3570f1b3dd92a1b06aae968846"
   }
  ],
  "out":[
   {
   "value":"300.00000000",
   "scriptPubKey":"OP_DUP OP_HASH160 0331e5256416bc11ecf9088091f8424819553a10 OP_EQUALVERIFY OP_CHECKSIG"
   },
   {
   "value":"699.99950000",
   "scriptPubKey":"OP_DUP OP_HASH160 4186719d739ae983d8c75a0cb82958e94b7ae81e OP_EQUALVERIFY OP_CHECKSIG"
   }
  ]
}

Now the tx hash is a6d9c176ecb041c2184327b8375981127f3632758a7a8e61b041343efc3bcb6e
The # of inputs is 1 (vin_sz).
The # of outputs is 2 (vout_sz).

The single input of the transaction is referenced here:

Quote

"in":[
   {
   "prev_out":{
   "hash":"b5045e7daad205d1a204b544414af74fe66b67052838851514146eae5423e325",
   "n":0
   },
   "scriptSig":"304402200e3d4711092794574e9b2be11728cc7e44a63525613f75ebc71375f0a6dd080d02202ef1123328b3ecddddb0bed77960adccac5bbe317dfb0ce149eeee76498c19b101 04a36b5d3b4caa05aec80752f2e2805e4401fbdbe21be1011dc60c358c5fc4d3bedd1e03161fb4b3a021c3764da57fee0d73570f3570f1b3dd92a1b06aae968846"
   }

The INPUT for this transaction is the OUTPUT of a PREVIOUS transaction.
Specifically it is the tx with hash b5045e7daad205d1a204b544414af74fe66b67052838851514146eae5423e325. However a tx can have multiple outputs, so how do we know which one of those outputs this input refers to? Simple: the tx output index is provided.

Inputs of transactions are outputs of prior transactions.

In this case the input we are looking for is specifically the output #0 of transaction hash b5045e7daad205d1a204b544414af74fe66b67052838851514146eae5423e325

So let's look at that transaction. The output #0 of this transaction is 1,000 BTC.
http://blockexplorer.com/tx/b5045e7daad205d1a204b544414af74fe66b67052838851514146eae5423e325

So back to our current transaction (hash ending cb6e). There is only 1 input. The value of the transaction is the sum of the inputs. In this case a single input of 1,000 BTC.

Now looking at the output section we can see the two outputs
"out":[
   {
   "value":"300.00000000",
   "scriptPubKey":"OP_DUP OP_HASH160 0331e5256416bc11ecf9088091f8424819553a10 OP_EQUALVERIFY OP_CHECKSIG"
   },
   {
   "value":"699.99950000",
   "scriptPubKey":"OP_DUP OP_HASH160 4186719d739ae983d8c75a0cb82958e94b7ae81e OP_EQUALVERIFY OP_CHECKSIG"
   }

Output #0 is 300 BTC and output #1 is 699.9995 BTC.
The sum of the outputs is 999.9995 BTC
The sum of the inputs is 1,000.0000 BTC
The tx fee to miners is the difference or 0.0005 BTC.
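
To redo this arithmetic in code without floating-point rounding, one option is to convert the decimal strings to integer satoshis first (1 BTC = 100,000,000 satoshis). The helper below is a hypothetical sketch; btcToSatoshi is not a function from any existing library, and it does not guard against overflow on absurdly large inputs.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Convert a decimal BTC string such as "699.99950000" to satoshis
// without going through floating point.
// Hypothetical helper; returns -1 on malformed input.
int64_t btcToSatoshi(const std::string &s) {
    int64_t whole = 0, frac = 0;
    int fracDigits = 0;
    bool seenDot = false, seenDigit = false;
    for (size_t i = 0; i < s.size(); ++i) {
        char c = s[i];
        if (c == '.' && !seenDot) { seenDot = true; continue; }
        if (c < '0' || c > '9') return -1;       // anything but digits and one dot
        seenDigit = true;
        if (!seenDot) {
            whole = whole * 10 + (c - '0');
        } else if (fracDigits < 8) {
            frac = frac * 10 + (c - '0');
            ++fracDigits;
        } else {
            return -1;                           // more than 8 decimal places
        }
    }
    if (!seenDigit) return -1;
    while (fracDigits++ < 8) frac *= 10;         // pad ".0005" up to 8 places
    return whole * 100000000LL + frac;
}
```

With the values above, btcToSatoshi("1000") minus the sum of btcToSatoshi("300.00000000") and btcToSatoshi("699.99950000") comes to 50,000 satoshis, i.e. the 0.0005 BTC fee.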

If you want to decode further which addresses a transaction is sent to, you will need more code.

Let's look at the second output.

Quote

OP_DUP OP_HASH160 4186719d739ae983d8c75a0cb82958e94b7ae81e OP_EQUALVERIFY OP_CHECKSIG

So what is 4186719d739ae983d8c75a0cb82958e94b7ae81e? It is the RIPEMD160 hash of the SHA256 hash of the public key (that is what OP_HASH160 computes). However Bitcoin addresses also include a version byte and a checksum and are converted to Base58.

BTW the public key hash 4186719d739ae983d8c75a0cb82958e94b7ae81e is Bitcoin address 16yTynjmSe5bsRGykDaaCL5bm2pxiEfcqP.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: 2112 on July 05, 2013, 10:28:06 PM

Quote from: jratcliff63367 on July 05, 2013, 07:20:14 PM

Does anyone know of a C/C++ program someone else has already written that does what I'm trying to do here; just walk the block-chain and track transactions? All of these code bases have sooo...many dependencies that you dive into cryptographic hell trying to decipher them.

Yes, user znort987 did just that. But then he couldn't stand the pressure, deleted the key posts and quit participating in the forum.

His thread starts here:

https://bitcointalk.org/index.php?topic=88584.0

And the source code is still available:

https://github.com/znort987/blockparser

You may want to check his thread and his post to understand a bit of what had happened and avoid any possibility of future stress or nervous breakdown.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: jratcliff63367 on July 05, 2013, 10:48:39 PM

Thanks, I looked at that before. At first it seems close, but it has dependencies on openssl, some google hash map template code, and only builds and runs on unix. I'm working on windows personally.

I suspect I'm really close to having this all figured out. It's not that I don't get the basic concepts of the transactions etc. I'm just unclear how some of these hashes are computed and how they relate to one another.

John

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: TierNolan on July 05, 2013, 10:56:03 PM

Quote from: jratcliff63367 on July 05, 2013, 10:48:39 PM

Thanks, I looked at that before. At first it seems close, but it has dependencies on openssl, some google hash map template code, and only builds and runs on unix. I'm working on windows personally.

You need to be able to do sha256, if you want to work out hashes. You could probably find code for just that.

OpenSSL is also used for all the other crypto stuff, but you don't need that.

You also need the base58 encoder to convert public keys (after hashing with SHA-256 and then RIPEMD-160) into bitcoin text addresses.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: jratcliff63367 on July 06, 2013, 07:17:46 PM

Ok, I believe I am unblocked now on finishing up my parser/transaction tool. The piece that I was missing was the transaction hash.

Each input refers to a 'transactionHash' however there is no transaction hash stored in the blockchain .dat files.

After some google searching I discovered that the transactionHash is computed by running SHA256 twice (a double SHA256) over the raw transaction bytes. Through the magic of SHA256 no two transactions are presumed to ever produce identical hashes. I don't need to know what 'block' the transaction is in, as the transaction data is a complete standalone data load unto itself, unique in the world of bitcoin.

This was the missing piece of the puzzle for me because I didn't realize that I had to compute this hash when I parsed the input file, at least, if I ever had any hope of binding inputs to the correct transactions.
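
One detail that is easy to trip over when binding inputs to transactions: as far as I know the hash is SHA256 applied twice to the raw transaction bytes, and block explorers print that digest with the byte order reversed, while the 32 bytes stored in each input are in the raw digest order. A small sketch of the display conversion, with hashToDisplayHex as a made-up name and the digest assumed to be computed elsewhere:

```cpp
#include <cstdint>
#include <string>

// Format a 32-byte digest the way block explorers print it: lowercase hex,
// with the byte order reversed relative to the raw digest.
// Hypothetical helper; the digest itself must come from a SHA-256 routine.
std::string hashToDisplayHex(const uint8_t digest[32]) {
    static const char hex[] = "0123456789abcdef";
    std::string out;
    out.reserve(64);
    for (int i = 31; i >= 0; --i) {          // last digest byte is printed first
        out.push_back(hex[digest[i] >> 4]);
        out.push_back(hex[digest[i] & 0x0f]);
    }
    return out;
}
```

Comparing the raw 32 bytes of an input's previous-transaction field against the raw digest avoids the conversion entirely; the reversed hex form is only needed when matching against what a website shows.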

A couple of general questions. Are transactions always written to the last block in the block chain? In other words, old blocks never get re-written? I would assume this is the case otherwise the blockchain would become fragmented and need garbage collection, etc.

Thanks,

John

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: DeathAndTaxes on July 06, 2013, 07:30:50 PM

Older blocks are never rewritten but it is possible there are 2 or more competing chains.

Example the last block is block height 123

One miner solves a block with a height of 124 (it references the block hash 123 as the prior blockhash). We will call this block 124a.
At roughly the same time a different miner solves a different block with a block height 124. We will call this block 124b.

At this point the network is not in consensus. Some nodes believe block 124a is the next "valid" block and some believe 124b is.

Eventually some node will solve block 125. If the block is built on the 124a chain then block 124b becomes orphaned. If it is built on the 124b chain then block 124a is orphaned.

If your node had received block 124a before it became orphaned it will be contained in your block history. When block 124b orphans block 124a it will also be contained in your block history. You can detect this because both blocks will have the block hash for block 123 as the prior block hash.

In your block parser you can build a simplified system to remove orphans by working backwards. bitcoind will tell you the hash of the last block of the longest chain. Working from that block you can locate the prior block hash field in the header and then locate the corresponding block, and work all the way back to the genesis block. Any other blocks are orphans and for the purpose of parsing the historical blockchain can be pruned.

Simple version:
It is possible at any point in time there are multiple conflicting blockchains. The "longest"* chain is considered the consensus record.
* longest in this instance doesn't mean just highest block height (as that can be spoofed) it means the chain with highest aggregate difficulty.
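
A minimal sketch of the backwards walk, assuming the parser has already built an index from each block hash to its prior block hash and that the tip hash was obtained from bitcoind. PrevIndex, mainChain and isOrphan are hypothetical names, and the hashes are held as strings purely for readability.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical index built while parsing: block hash -> prior block hash.
typedef std::map<std::string, std::string> PrevIndex;

// Walk back from the tip and collect every block on the main chain.
// The walk stops at the first hash that is not in the index (the genesis
// block's prior hash is never a key) or if a hash repeats.
std::set<std::string> mainChain(const PrevIndex &prevOf, const std::string &tip) {
    std::set<std::string> chain;
    std::string cur = tip;
    PrevIndex::const_iterator it;
    while ((it = prevOf.find(cur)) != prevOf.end() && chain.insert(cur).second) {
        cur = it->second;
    }
    return chain;
}

// Any parsed block that is not on the main chain is treated as an orphan.
bool isOrphan(const std::set<std::string> &chain, const std::string &hash) {
    return chain.count(hash) == 0;
}
```

In the 124a/124b example, if block 125 was built on 124a, the walk from 125 visits 124a, 123 and so on, and 124b is never reached, so it is reported as an orphan.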

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: ProfMac on July 06, 2013, 08:59:40 PM

Quote from: DeathAndTaxes on July 06, 2013, 07:30:50 PM

Simple version:
It is possible at any point in time there are multiple conflicting blockchains. The "longest"* chain is considered the consensus record.
* longest in this instance doesn't mean just highest block height (as that can be spoofed) it means the chain with highest aggregate difficulty.

I want to produce just such a spoof in the bytecoin blockchain. Someone mined blocks at the rate of 1 every minute or two at difficulty 8,000, and then quit mining when the difficulty climbed to 32,000. Since that transition to higher difficulty on about May 5 (two months) only about 250 blocks have been mined. This past week, 1 single block was mined.

I thought it might be interesting to do some experiments on that block chain, to go back to the next to last block at difficulty 8,000 and produce a fork. I would attempt to repair the chain to the extent possible. It should be possible to mine new blocks that honored previous payouts (or possibly not, in the case of the grief-miner)

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: TierNolan on July 06, 2013, 09:24:26 PM

Quote from: ProfMac on July 06, 2013, 08:59:40 PM

I thought it might be interesting to do some experiments on that block chain, to go back to the next to last block at difficulty 8,000 and produce a fork. I would attempt to repair the chain to the extent possible. It should be possible to mine new blocks that honored previous payouts (or possibly not, in the case of the grief-miner)

That is a chain fork, since the new "official" chain would have lower POW. If you are modifying the client, you could just put a blacklist check for the first high difficulty block.

Presumably, the miner traded his coins before departing?

It is a curse of alt coins that hashing is so variable.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: ProfMac on July 06, 2013, 10:28:24 PM

Quote from: TierNolan on July 06, 2013, 09:24:26 PM

Quote from: ProfMac on July 06, 2013, 08:59:40 PM

I don't know if he traded them or not. If he did, I would want to have the miner's block payout to whatever address has coins derived from his. That will take some analysis, which is how I ended up in this thread. I have mixed feelings about not replacing his coins.

I also wanted to choose what timestamps to use on the forked blocks, instead of "now." I haven't read the client's source.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: TierNolan on July 06, 2013, 10:31:32 PM

Quote from: ProfMac on July 06, 2013, 10:28:24 PM

If you are changing the client anyway, then you could just force a difficulty change.

That is what some coins do after they lose lots of hashing power.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: ProfMac on July 06, 2013, 11:54:15 PM

Quote from: TierNolan on July 06, 2013, 10:31:32 PM

If you are changing the client anyway, then you could just force a difficulty change.

That is what some coins do after they lose lots of hashing power.

That would let us get back to hashing without forking the chain. I think a chain fork is a very big step.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: jratcliff63367 on July 07, 2013, 12:00:27 AM

For those of you interested in seeing the progress of this project; I just uploaded my current work. The report of what's changed and what this update contains is located at this link on my personal coding blog.

http://codesuppository.blogspot.com/2013/07/work-in-progress-on-bitcoin-blockchain.html

This afternoon I started focusing on getting my 'parser' to display the same data which shows up on the block-explorer website; and I made good progress on that.

I was just getting ready to make it print out the public-key in ASCII when I found that doing that is a giant pain in the ass.

You need to have access to a 'big-number' library. I refactored the version that is in 'cbitcoin' but it's still not producing a valid ASCII string for a given binary input address. I'm not sure why, but it's late in the day and that's enough for now.

In keeping with my personal coding style/preference all of the code is in 'snippet' form.

That means it's got a single header file which presents a minimal public API and then a single CPP with the implementation of that API. There are no external dependencies.

So far, I've got the main BlockChain parser in snippet form, also hashes for SHA256 and RIPEMD160, and a snippet that is supposed to convert a bitcoin address from ASCII to binary and from binary to ASCII, but it's not yet fully working. I have the start of a bitcoin scripting engine; but it doesn't parse much yet.

Thanks for all the help, I don't know when I'll find time to keep working on this, but so far it's been fun.

John

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: TierNolan on July 07, 2013, 12:21:45 AM

Quote from: ProfMac on July 06, 2013, 11:54:15 PM

That would let us get back to hashing without forking the chain. I think a chain fork is a very big step.

Hmm, this is kind of off-topic for the thread. In the end, a non-fork option would mean that you need to match the POW done.

Quote from: jratcliff63367 on July 07, 2013, 12:00:27 AM

I was just getting ready to make it print out the public-key in ASCII when I found that doing that is a giant pain in the ass.

You could print out in hex, as a start.

Quote

You need to have access to a 'big-number' library. I refactored the version that is in 'cbitcoin' but it's still not producing a valid ASCII string for a given binary input address. I'm not sure why, but it's late in the day and that's enough for now.

It is always worth checking that you have the endian-ness right. Also, there are version numbers (https://en.bitcoin.it/wiki/Base_58_Encoding) too.
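
For what it's worth, the Base58 step can be done without a big-number library by treating the input as one big-endian number and doing long division by 58 over the byte array. The sketch below is an assumption-laden illustration (base58Encode is a made-up name, not the cbitcoin routine) and covers only the encoding: the version byte and the 4-byte checksum from the double SHA-256 have to be put in place before calling it.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Base58-encode a byte string without a big-number library: treat the input
// as one big-endian number and repeatedly divide it by 58 in place.
std::string base58Encode(const std::vector<uint8_t> &input) {
    static const char *alphabet =
        "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";
    std::vector<uint8_t> num(input);   // working copy, most significant byte first
    std::string out;

    size_t zeros = 0;                  // leading zero bytes become leading '1's
    while (zeros < num.size() && num[zeros] == 0) ++zeros;

    size_t start = zeros;
    while (start < num.size()) {
        unsigned rem = 0;
        for (size_t i = start; i < num.size(); ++i) {   // long division by 58
            unsigned cur = rem * 256 + num[i];
            num[i] = static_cast<uint8_t>(cur / 58);
            rem = cur % 58;
        }
        out.push_back(alphabet[rem]);  // digits come out least significant first
        while (start < num.size() && num[start] == 0) ++start;
    }
    out.append(zeros, '1');
    return std::string(out.rbegin(), out.rend());
}
```

The endianness point applies here directly: feeding the bytes in reverse order gives a string that looks plausible but decodes to a different number, and the leading-zero handling is what produces the leading '1' of a version-0 address.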

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: piotr_n on July 07, 2013, 12:32:39 AM

printing the address from txout is not something that you can really do in a reliable and ultimate way.
the output scripts can be anything. most of them follow the usual patterns, but they don't have to really follow any limited number of patterns - in reality there is no such thing as output address of a bitcoin transaction, there is only an output script - if you can hexdump this one, for some people it would be as good as printing some address.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: jratcliff63367 on July 07, 2013, 01:04:31 AM

Interesting, so how come on the block-explorer website it shows public-key information for all of the blocks?

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: piotr_n on July 07, 2013, 01:05:42 AM

Quote from: jratcliff63367 on July 07, 2013, 01:04:31 AM

Interesting, so how come on the block-explorer website it shows public-key information for all of the blocks?

Sorry, I don't understand this question.

Title: Convert a public-key signature (65 bytes) to bitcoin ASCII address.
Post by: jratcliff63367 on July 07, 2013, 03:00:46 PM

Wow, that was shockingly complicated.

For anyone in the future who ever has to/wants to convert a public key signature (65 byte binary embedded in most bitcoin scripts) to the ASCII representation of a bitcoin address, I have written a reference implementation that does this in a single CPP file.

I still plan to do some code cleanup on this, but it does work.

Basically, think of this as a reference implementation of what is described on this page https://en.bitcoin.it/wiki/Technical_background_of_version_1_Bitcoin_addresses

Here is the header file with quite a bit of corresponding documentation.

https://code.google.com/p/blockchain/source/browse/trunk/BitcoinAddress.h

Here is the implementation. I still plan to do some code cleanup on this, but the main point is that you can go from the 65 byte public-key signature to the bitcoin ASCII address with no external dependencies to any other code.

https://code.google.com/p/blockchain/source/browse/trunk/BitcoinAddress.cpp

I hope someone else finds this useful as it took me quite a bit of work to pull the necessary code to do this from multiple sources.

As I said, I still plan to do some cleanup on this code; mostly cleaning up the 'bignum' methods which are a little sloppy right now; it does completely unnecessary memory allocations and I want to remove all of that. The hash routines are pretty clean already and, hopefully, work well on lots of platforms.

I've only built this code with Microsoft Visual Studio. If there are compile errors with GCC, please let me know and I will revise it accordingly.

Seriously, if you dump this one CPP/H into a project and it fails to compile, I would like to know.

Thanks,

John

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: piotr_n on July 07, 2013, 03:19:58 PM

Good job.
It wasn't that hard after all, was it?

In many cases you will also want to decode the address from txout, while you do not know the public key yet.
I said before that it isn't really possible to represent each possible out script as an address, but these days most (like 99%) of the output scripts follow a standard pattern, that is:

Code:

0x76 0xa9 0x14 <20-bytes-of-addr's-RIPEMD160> 0x88 0xAC

So from the 20 bytes you can already figure out address where the money is being sent to, yet without having its public key (that only comes later, if this output is being spent).
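
Checking for that pattern takes only a few lines. The sketch below is a hypothetical helper (extractPubKeyHash is a made-up name); the opcode names in the comment are the ones the byte values 0x76, 0xa9, 0x88 and 0xac stand for, and 0x14 is a push of 20 bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// If `script` is the standard 25-byte output script
//   OP_DUP OP_HASH160 <push 20 bytes> OP_EQUALVERIFY OP_CHECKSIG
// copy the 20-byte hash into `hash160` and return true.
// Any other script returns false and leaves `hash160` untouched.
bool extractPubKeyHash(const uint8_t *script, size_t len, uint8_t hash160[20]) {
    if (len != 25) return false;
    if (script[0] != 0x76 || script[1] != 0xa9 || script[2] != 0x14) return false;
    if (script[23] != 0x88 || script[24] != 0xac) return false;
    std::memcpy(hash160, script + 3, 20);
    return true;
}
```

Scripts that do not match should be kept as raw bytes (or hex-dumped) rather than forced into an address.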

I once made a tool that is able to decode a raw tx data and dump the output addresses.
It needs openssl to link, but maybe you will find the code itself educational - some people prefer reading source code, over reading a wiki. :)

Code:

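/*
gcc bctrans.c -o bctrans.exe -lcrypto -I /local/ssl/include -L /local/ssl/lib

Written against the OpenSSL 1.0.x API: BIGNUM is used as a stack object and
BN_init() is called, neither of which is available from OpenSSL 1.1.0 on.
*/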

#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* strcmp, memcmp, memcpy */
#include <ctype.h>    /* tolower */
#include <openssl/sha.h>
#include <openssl/bn.h>


static unsigned char addr_version = 0x00;
static FILE *f = NULL;

BIGNUM *bn58, dv, mo;
BN_CTX *ctx;


/* SHA256(data, len, out) is already provided by <openssl/sha.h>,
   so no local wrapper macro with the same name is needed. */


void readfile(unsigned char *p, int len) {
	if (!f) {
		int c, i;
		char s[3];
		while (len>0) {
			for (i=0;i<2;) {
				c = getchar();
				if (c==EOF) {
					fprintf(stderr, "File too short\n");
					exit(1);
				}
				c = tolower(c);
				if ((c>='0' && c<='9') || (c>='a' && c<='f'))  s[i++] = (char)c;
			}
			s[2] = 0;
			c = (int)strtol(s, NULL, 16);
			*p = (unsigned char)c;
			p++;
			len--;
		}
	} else {
		if (fread(p, 1, len, f)!=(size_t)len) {
			fprintf(stderr, "File too short\n");
			fclose(f);
			exit(1);
		}
	}
}

unsigned long long getle(unsigned char *p, int bytes) {
	unsigned long long res=0;
	while (bytes--) {
		res|= ((unsigned long long)(p[bytes]))<<(8*bytes);
	}
	return res;
}

unsigned long long getvl() {
	unsigned char b[8];
	readfile(b, 1);
	switch (*b) {
		case 0xfd:
			readfile(b, 2);
			return getle(b, 2);
		case 0xfe:
			readfile(b, 4);
			return getle(b, 4);
		case 0xff:
			readfile(b, 8);
			return getle(b, 8);
	}
	return *b;
}


void prhash(unsigned char *p, unsigned int l) {
	while (l--) printf("%02x", p[l]);
}


void hexdump(unsigned char *p, unsigned int l) {
	while (l--) printf("%02x", *p++);
}


void printbtcaddr(unsigned char *p) {
	static const char *chrs = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";
	unsigned char mod;
	char out[64];
	int i=0, j=0;
	BIGNUM *bn = BN_bin2bn(p, 25, NULL);
	while (!BN_is_zero(bn)) {
		BN_div(&dv, &mo, bn, bn58, ctx);
		if (BN_bn2bin(&mo, &mod)==0)  mod = 0;
		out[i++] = chrs[mod];
		BN_copy(bn, &dv);
	}
	BN_free(bn);
	while ((*p)==0) {
		putchar('1');
		p++;
	}
	while (i--)  putchar(out[i]);
}


int main(int argc, char * argv[]) {
	static unsigned char buf[0x10000];
	unsigned long long i, sl, txcnt, v;
	unsigned va, vb;
	int x;
	long fpos;
	char *fname = NULL;

	for (x=1; x<argc; x++) {
		if (strcmp(argv[x], "-t")==0) {
			addr_version = 0x6f; // testnet
		} else {
			fname = argv[x];
		}
	}
	
	if (!fname) {
		printf("Enter transactions hexdump data:\n");
	} else {
		f = fopen(fname, "rb");
		if (!f) {
			fprintf(stderr, "File %s not found\n", fname);
			return 1;
		}
	}

	readfile(buf, 4);
	printf("Version: %llu\n", getle(buf, 4));

	txcnt = getvl();
	printf("TX IN cnt: %llu\n", txcnt);
	for (i=0; i<txcnt; i++) {
		readfile(buf, 36);
		sl = getvl();

		printf("%6d) : ", (int)i);
		prhash(buf, 32);
		printf(" Idx=%2llu  sl=%llu", getle(buf+32, 4), sl);
		if (sl > sizeof(buf)) {
			fprintf(stderr, "Script too long\n");
			return 1;
		}
		readfile(buf, (int)sl);
		readfile(buf, 4);

		printf(" seq=%x\n", (unsigned)getle(buf, 4));
	}

	txcnt = getvl();
	printf("TX OUT cnt: %llu\n", txcnt);

	ctx = BN_CTX_new();
	bn58 = BN_bin2bn((const unsigned char *)"\x3a", 1, NULL);
	BN_init(&dv);
	BN_init(&mo);

	for (i=0; i<txcnt; i++) {
		readfile(buf, 8);
		sl = getvl();
		v = getle(buf, 8);
		va = (unsigned)(v/100000000LL);
		vb = (unsigned)(v%100000000LL);
		printf("%6d) : %7u.%08u BTC", (int)i, va, vb);
		if (sl > sizeof(buf)) {
			fprintf(stderr, "Script too long\n");
			return 1;
		}
		readfile(buf, (int)sl);
		if (sl!=25 || memcmp(buf, "\x76\xa9\x14", 3) || buf[23]!=0x88 || buf[24]!=0xac) {
			printf("  WARNING! Unexpected output script:\n"); hexdump(buf, sl);
		} else {
			unsigned char dat[25];
			unsigned char sha[SHA256_DIGEST_LENGTH];
			unsigned char sha2[SHA256_DIGEST_LENGTH];
			dat[0] = addr_version; // version
			memcpy(dat+1, buf+3, 20);
			SHA256(dat, 21, sha);
			SHA256(sha, SHA256_DIGEST_LENGTH, sha2);
			//printf("  chsum:%02x%02x%02x%02x", sha2[0], sha2[1], sha2[2], sha2[3]);
			memcpy(dat+21, sha2, 4); 
			printf("  to address "); printbtcaddr(dat);
		}
		putchar('\n');
	}

	BN_free(bn58);
	BN_CTX_free(ctx);

	readfile(buf, 4);
	printf("Lock Time: %llu\n", getle(buf, 4));

	if (f) {
		fpos = ftell(f);
		fseek(f, 0, SEEK_END);
		if (fpos!=ftell(f)) {
			printf("WARNING!!! File too long. Only %ld bytes expected (%ld too many)\n",
				fpos, ftell(f)-fpos);
		} else {
			printf("File size checked OK - %ld bytes\n", fpos);
		}
		fclose(f);
	}

	return 0;
}

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: jratcliff63367 on July 07, 2013, 04:33:55 PM

Thanks, I prefer source too.

On further thought I have decided I'm going to retitle this post as a block-chain parser in a couple thousand lines of code; but it will still be just one CPP file when I'm done.

Actually it was kind of hard, it took me hours just to gather together the code to do SHA-256, RIPEMD-160, and Base58Check without dragging in massive libraries and dependencies.

I'm going to make a new Google code project called 'bitcoin-snippets' which will contain just the source to do each one of these specific operations.

Thanks,

John

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: etotheipi on July 07, 2013, 04:55:20 PM

Quote from: jratcliff63367 on July 07, 2013, 04:33:55 PM

Don't forget this is a crypto currency. There is no way around having a general purpose crypto library available if you're going to do any coding in Bitcoin. Just like you have standard headers in C++ for various data structures and algorithms, you are going to need these crypto algorithms available. You're wasting your time reimplementing it, because it's all actually very simple, standard crypto operations (just combined in a creative way). Once you have those standard operations available and you understand what they do, much of this will be much simpler. Hashing is the bread and butter of Bitcoin, just get the libraries and do it.

And no one ever said bitcoin programming was easy, but you're taking the right size steps and you will get there with some patience. What you are doing is exactly how I started two years ago and I think it's a fantastic way to learn. You just have to appreciate that there's a lot to learn and you're going to spend a ton of time confused and digging for clarity. That's part of the fun :-)

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: piotr_n on July 07, 2013, 05:14:44 PM

It's true.
Depends how far you want to go, you would eventually also need ECDSA - which is a much bigger piece of code than the hashing funcs and the big numbers..
So if you really care to have a small code, you will eventually prefer to link it against an external crypto lib, instead of turning your block parser into a crypto lib.
Probably openssl is best for your purposes, since this is the most common library having everything that a blockchain parser needs.
And you can just download a built lib, for any platform/compiler you need.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: jratcliff63367 on July 07, 2013, 05:31:04 PM

Well, I'm not trying to write a full client, just something that can extract all of the transactions.

For that, as far as I can tell, I just need the two hash routines. In fact I think I'm getting pretty close to finishing this thing up.

Also, to be clear my coding website is largely educational. I'm writing this code to educate myself and releasing it open source to educate others.
Thanks,

John

Quote from: etotheipi on July 07, 2013, 04:55:20 PM

Quote from: jratcliff63367 on July 07, 2013, 04:33:55 PM

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: jratcliff63367 on July 07, 2013, 11:26:46 PM

I did one more cleanup pass/revision before the end of today.

There is now a code snippet which does just base58 encoding. It's based on the source code in 'cbitcoin' but I removed all memory allocation.

I also revised the BitcoinAddress snippet to go from public-key to bitcoin binary address, and from a bitcoin binary address to ASCII and from ASCII back to binary (checked).

Finally, my original snippet folds in all of this code to parse the bulk of the blockchain. I still haven't yet assembled all of the individual transactions, I would do that the next time I find some time to spend on this effort.

All that said, I think these individual snippets, which are pretty well documented, will hopefully prove useful educational tools for future developers who won't have to go through the same trouble I went through figuring out how all of these pieces fit together.

Like a previous poster said, I prefer looking at source code as documentation than a Wiki page; so hopefully others will find this useful too.

http://codesuppository.blogspot.com/2013/07/bitcoin-code-snippets.html

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: etotheipi on July 08, 2013, 02:53:35 AM

Quote from: jratcliff63367 on July 07, 2013, 05:31:04 PM

Don't get caught up going down tangent paths. Your issues with hashing sound much like someone who's never seen a logarithm before. Reimplementing the log() function in C++ is not very useful nor educational, unless your goal is compiler optimizations or embedded software. It's more important to understand the purpose and properties of the logarithm (hashing), than to distract those you are trying to educate with unnecessary details. If I told you I didn't know what a logarithm was but I needed it for my financial software, and I was working on implementing my own version of logarithm, you'd probably have the same reaction: "go look up what a logarithm does and why it's important, and then just #include <cmath> and focus on which properties of it are needed by the financial software and why."

Hashing is one of the most fundamental and important atoms of cryptographic systems. Especially Bitcoin. And implementing it from scratch is not likely to give you any insight into how it works or how it fits into Bitcoin. What you need to understand is that it's a function that takes any input, completely mangles it, and produces a seemingly-random X-byte output. It must always produce the same answer for identical input, but changing any bit should look equivalent to producing a new random value. It is intentionally non-invertible (i.e. SHA256 can take in any size string and always spits out 32 bytes), and each output bit should have no correlation with any input bit.

In the context of Bitcoin transactions, its goal is to produce a unique 32-byte identifier for a transaction, regardless of how many kB it is. If any byte of the transaction is changed, the hash will look like a different 32-byte random number. In the context of headers, this "randomness" is used to create a sort of lottery for miners. Since we know the hash function should produce essentially-uniform randomness, we know that for a given difficulty and hashrate on the network, it should take approx 10 min on average for someone to find a lucky hash.
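
As a rough illustration of the lottery: under the usual approximation that a single hash wins with probability about 1 / (difficulty * 2^32), the average time to find a block can be estimated as below. expectedBlockSeconds is a made-up name, and the formula is the common rule of thumb rather than an exact expression.

```cpp
#include <cmath>

// Rough estimate of the average time to find a block.
// Assumption: one hash succeeds with probability of about
// 1 / (difficulty * 2^32), so about difficulty * 2^32 hashes are
// needed on average.
double expectedBlockSeconds(double difficulty, double hashesPerSecond) {
    return difficulty * std::ldexp(1.0, 32) / hashesPerSecond;
}
```

The difficulty is retargeted so that this figure stays near 600 seconds for whatever the network hashrate has recently been.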

Similarly, if you go any further with this, you're going to need an ECDSA library in addition to hashing. While elliptic curves are mathematically neat, you would be best to not reimplement it yourself unless you are working on speed-optimizing it. If you think publickey->address was hard, try privatekey->publickey! (no, don't try it, just get a library and understand the properties of it).

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: jratcliff63367 on July 08, 2013, 03:35:09 AM

Yeah, I know all about hashes. My issue isn't about any particular hash algorithm, my issue is having to download and build a massive library just to get access to the *one* I need.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: jratcliff63367 on July 08, 2013, 04:59:01 AM

Let me reform the question using your analogy.

Let's say I'm working on a project and I need a log function. Now, let's suppose that the log function (like SHA256) is not part of the C99 standard and built into the set of default libraries available in all compatible C++ compilers.

So, what if the only way to get access to a 'log' function is that I have to download some math library comprising 100,000 lines of source code with dependencies on streams, network code, a specific compiler, a complex build and configuration requirement, BOOST, the STL, exception handling, and god knows what else.

Because I needed access to 'log'.

Given the choice between dragging that into my code base or just copying only the 'log' function source code I would choose the much simpler solution.

People don't want to learn about algorithms by downloading 100,000 lines of source code libraries.

That's my position, and when I check with many of my colleagues in the computer game industry (where I come from) that tends to be our preference.

I'm a senior software engineer and I have been working on computer games and multi-platform development since 1979.

I just really prefer small code bases with minimal dependencies that are compiler and operating system agnostic.

John

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: ProfMac on July 08, 2013, 05:28:33 AM

Quote from: jratcliff63367 on July 08, 2013, 04:59:01 AM

I'm a senior software engineer and I have been working on computer games and multi-platform development since 1979.

I just really prefer small code bases with minimal dependencies that are compiler and operating system agnostic.

John

+1

Make it as simple as possible, but not simpler.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: CIYAM on July 08, 2013, 05:36:04 AM

Quote from: jratcliff63367 on July 08, 2013, 04:59:01 AM

I just really prefer small code bases with minimal dependencies that are compiler and operating system agnostic.

I do agree with the "minimal" approach although the STL and exceptions are *standard* C++ (if your compiler doesn't support these then it is not a C++ compiler at all but probably one of those horrible non-standard EC++ compilers from the 90's).

BTW if you want a fairly minimal SHA256 (one source and one header) then there is one in the CIYAM project here: https://github.com/ciyam/ciyam/blob/master/src/sha256.cpp and here: https://github.com/ciyam/ciyam/blob/master/src/sha256.h.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: jgarzik on July 08, 2013, 05:53:24 AM

Quote from: jratcliff63367 on July 08, 2013, 04:59:01 AM

There is little engineering upside to reimplementing highly complex cryptographic algorithms on your own, given how much less engineering review and cryptanalysis your own codebase will receive compared with an existing crypto lib.

It might be personally satisfying, but it makes little sense unless you are truly an expert crypto mathematician.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: jratcliff63367 on July 08, 2013, 12:33:38 PM

Me communicate not so good.

I haven't reimplemented *any* cryptographic algorithms. All I did was copy the two cryptographic routines I needed, and maybe slightly refactored the interface to them.

I just want the code I need and not the code I don't need and in a form which will compile without dragging in dependencies on hundreds of other source files and libraries (i.e. STL, BOOST, networking code, etc.)

To repeat, I haven't reimplemented any cryptographic algorithm, I've just tried to organize the source so I can easily build and understand it.

As I said before, if SHA256 were part of the C99 standard and built into every C/C++ compiler, that would be fine. But, it's not.

And I get that you guys have no problem building things like 'openssl' just because you need access to one or two routines; but that's not really my coding style. By pulling out just the two routines I need into a standalone code snippet I am not 'reimplementing' anything, I'm just cleaning up an existing implementation to make it more usable and accessible and easier to understand.

Running CLOC on OpenSSL reports a total of 1,589 files: 946 C files, 169 Perl scripts, 262 C/C++ header files, 76 makefiles, 13 assembly files, 61 shell scripts, and a total of 234,537 lines of C code and 69,175 lines of code in header files.

In contrast, the code for *just* SHA256 and RIPEMD160, the only two hash routines needed to parse the blockchain (which was my project goal), is 750 lines.

If you want to add a quarter million line source code project to your little utility program just so you can access 750 lines of code, go for it. But, as I said, that's not my style.

John

P.S. For what it's worth, cryptopp doesn't seem as bad. It's 270 source files and 'only' 56,000 lines of source.

Title: Re: A bitcoin blockchain parser in a few hundred lines of C++
Post by: jratcliff63367 on July 08, 2013, 02:51:36 PM

Yes, I use the STL a lot, mostly because it's 'standard'. However, I also try to avoid using it if I don't have to. I won't go and write my own container class when the same container class already exists in the STL but, on the other hand, if I can get away without using a container class at all, that's fine too. I really try to keep it out of public APIs if at all possible. I believe it just makes the code harder to follow, so I typically use it for internal implementation code but keep the public-facing API as simple as possible.

The problems with the STL abound though, and they have been discussed a great deal online. At my professional work, we have an internal standard not to use the STL. Instead we have our own customized templates which we have highly optimized.

Our code here has to run on the following platforms:

Win32/Win64
Linux (many flavors)
Wii
Wii-U
Xbox-360
Xbox-720
PS3
PS4
Android
IOS

It's pretty crazy having to target high performance physics code on all of those platforms at the same time.

We also have a requirement that we capture every single memory allocation ever performed and never use any exception handling anywhere in the code base.

Game engines have some pretty high performance requirements, and the STL and exception handling usually don't fit this pattern.
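
As a rough illustration of what "capture every allocation, never throw" can look like in practice, here is a minimal sketch. The Allocator and CountingAllocator names are hypothetical and are not taken from any engine or library mentioned in this thread; it simply routes allocations through one interface, counts them, and reports failure with a null pointer instead of an exception.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical interface: all allocations in the code base go through it.
class Allocator {
public:
    virtual ~Allocator() {}
    virtual void* allocate(std::size_t bytes) = 0;
    virtual void deallocate(void* p) = 0;
};

// Hypothetical implementation that counts allocations on top of malloc/free.
class CountingAllocator : public Allocator {
public:
    CountingAllocator() : mLive(0), mTotal(0) {}

    // Failure is reported as a null pointer, never as an exception.
    virtual void* allocate(std::size_t bytes) {
        void* p = std::malloc(bytes);
        if (p) { ++mLive; ++mTotal; }
        return p;
    }

    virtual void deallocate(void* p) {
        if (!p) return;
        assert(mLive > 0);  // a free without a matching allocation is a bug
        std::free(p);
        --mLive;
    }

    std::size_t liveCount() const { return mLive; }    // currently outstanding
    std::size_t totalCount() const { return mTotal; }  // ever allocated

private:
    std::size_t mLive;
    std::size_t mTotal;
};
```

Code that takes an Allocator* instead of calling new directly can then be audited for leaks by checking liveCount() at shutdown.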

That said, for my open source work, I always use the STL if I need a container.

John

Title: Re: A bitcoin blockchain parser in a few (thousand) lines of C++
Post by: Vinz87 on March 23, 2014, 07:22:29 PM

is this topic alive? i'm trying to compile your software with gcc 4.5 on OS X Mavericks, but I get these errors:

Code:

$ gcc *.cpp -o blockchain.out
BlockChain.cpp:2619:11: warning: enumeration value 'AM_LAST' not handled in switch [-Wswitch]
        switch ( m )
                 ^
1 warning generated.
main.cpp:320:116: warning: format specifies type 'int' but the argument has type 'float' [-Wformat]
  ...%d bitcoin addresses with a balance of more than %d bitcoins.\r\n", tcount, mMinBalance );
                                                      ~~                         ^~~~~~~~~~~
                                                      %f
main.cpp:334:109: warning: format specifies type 'int' but the argument has type 'float' [-Wformat]
  ...oldest bitcoin addresses with a balance of more than %d bitcoins.\r\n", tcount, mMinBalance );
                                                          ~~                         ^~~~~~~~~~~
                                                          %f
main.cpp:348:124: warning: format specifies type 'int' but the argument has type 'float' [-Wformat]
  ...older than %d days with a balance of more than %d bitcoins.\r\n", zdays, mMinBalance );
                                                    ~~                        ^~~~~~~~~~~
                                                    %f
main.cpp:418:17: warning: enumeration value 'SR_LAST' not handled in switch [-Wswitch]
                                                        switch ( mStatResolution )
                                                                 ^
main.cpp:394:12: warning: enumeration values 'CM_NONE' and 'CM_EXIT' not handled in switch [-Wswitch]
                switch ( mMode )
                         ^
5 warnings generated.
Undefined symbols for architecture x86_64:
  "std::__1::__vector_base_common<true>::__throw_length_error() const", referenced from:
      void std::__1::vector<BlockChainAddresses::StatAddress, std::__1::allocator<BlockChainAddresses::StatAddress> >::__push_back_slow_path<BlockChainAddresses::StatAddress const>(BlockChainAddresses::StatAddress const&) in BlockChainAddresses-837c2d.o
  "std::terminate()", referenced from:
      ___clang_call_terminate in BlockChain-69c2db.o
      ___clang_call_terminate in BlockChainAddresses-837c2d.o
      ___clang_call_terminate in main-b5bd23.o
  "vtable for __cxxabiv1::__class_type_info", referenced from:
      typeinfo for BlockChain in BlockChain-69c2db.o
      typeinfo for QuickSortPointers in BlockChain-69c2db.o
      typeinfo for BlockChainAddresses in BlockChainAddresses-837c2d.o
  NOTE: a missing vtable usually means the first non-inline virtual member function has no definition.
  "vtable for __cxxabiv1::__si_class_type_info", referenced from:
      typeinfo for BlockChainImpl in BlockChain-69c2db.o
      typeinfo for SortByAge in BlockChain-69c2db.o
      typeinfo for SortByBalance in BlockChain-69c2db.o
      typeinfo for BlockChainAddressesImpl in BlockChainAddresses-837c2d.o
  NOTE: a missing vtable usually means the first non-inline virtual member function has no definition.
  "operator delete[](void*)", referenced from:
      SimpleHash<FileLocation, 4194304u, 40000000u>::init() in BlockChain-69c2db.o
      BlockChainImpl::~BlockChainImpl() in BlockChain-69c2db.o
      BitcoinTransactionFactory::printOldest(unsigned int, float) in BlockChain-69c2db.o
      BitcoinTransactionFactory::printTopBalances(unsigned int, float) in BlockChain-69c2db.o
      SimpleHash<BlockHeader, 65536u, 500000u>::init() in BlockChain-69c2db.o
      BitcoinTransactionFactory::saveStatistics(bool, float) in BlockChain-69c2db.o
      BitcoinTransactionFactory::saveAddressesOverTime() in BlockChain-69c2db.o
      ...
  "operator delete(void*)", referenced from:
      createBlockChain(char const*) in BlockChain-69c2db.o
      BlockChainImpl::~BlockChainImpl() in BlockChain-69c2db.o
      createBlockChainAddresses(char const*) in BlockChainAddresses-837c2d.o
      BlockChainAddressesImpl::~BlockChainAddressesImpl() in BlockChainAddresses-837c2d.o
      std::__1::__vector_base<BlockChainAddresses::StatAddress, std::__1::allocator<BlockChainAddresses::StatAddress> >::~__vector_base() in BlockChainAddresses-837c2d.o
      std::__1::__split_buffer<BlockChainAddresses::StatAddress, std::__1::allocator<BlockChainAddresses::StatAddress>&>::~__split_buffer() in BlockChainAddresses-837c2d.o
      BlockChainAddresses::~BlockChainAddresses() in BlockChainAddresses-837c2d.o
      ...
  "operator new[](unsigned long)", referenced from:
      BlockChainImpl::buildBlockChain() in BlockChain-69c2db.o
      SimpleHash<FileLocation, 4194304u, 40000000u>::init() in BlockChain-69c2db.o
      BitcoinTransactionFactory::printOldest(unsigned int, float) in BlockChain-69c2db.o
      BitcoinTransactionFactory::printTopBalances(unsigned int, float) in BlockChain-69c2db.o
      SimpleHash<BlockHeader, 65536u, 500000u>::init() in BlockChain-69c2db.o
      BitcoinTransactionFactory::saveStatistics(bool, float) in BlockChain-69c2db.o
      BitcoinTransactionFactory::saveAddressesOverTime() in BlockChain-69c2db.o
      ...
  "operator new(unsigned long)", referenced from:
      createBlockChain(char const*) in BlockChain-69c2db.o
      createBlockChainAddresses(char const*) in BlockChainAddresses-837c2d.o
      std::__1::__split_buffer<BlockChainAddresses::StatAddress, std::__1::allocator<BlockChainAddresses::StatAddress>&>::__split_buffer(unsigned long, unsigned long, std::__1::allocator<BlockChainAddresses::StatAddress>&) in BlockChainAddresses-837c2d.o
  "___cxa_begin_catch", referenced from:
      ___clang_call_terminate in BlockChain-69c2db.o
      ___clang_call_terminate in BlockChainAddresses-837c2d.o
      ___clang_call_terminate in main-b5bd23.o
  "___cxa_pure_virtual", referenced from:
      vtable for QuickSortPointers in BlockChain-69c2db.o
      vtable for BlockChain in BlockChain-69c2db.o
      vtable for BlockChainAddresses in BlockChainAddresses-837c2d.o
  "___gxx_personality_v0", referenced from:
      createBlockChain(char const*) in BlockChain-69c2db.o
      BlockChainImpl::BlockChainImpl(char const*) in BlockChain-69c2db.o
      BlockChainImpl::~BlockChainImpl() in BlockChain-69c2db.o
      SimpleHash<FileLocation, 4194304u, 40000000u>::init() in BlockChain-69c2db.o
      BlockChainImpl::~BlockChainImpl() in BlockChain-69c2db.o
      SimpleHash<BlockHeader, 65536u, 500000u>::init() in BlockChain-69c2db.o
      BitcoinTransactionFactory::gatherStatistics(unsigned int, unsigned int, bool) in BlockChain-69c2db.o
      ...
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Title: Re: A bitcoin blockchain parser in a few (thousand) lines of C++
Post by: cassini on March 24, 2014, 08:18:21 AM

Quote from: Vinz87 on March 23, 2014, 07:22:29 PM

is this topic alive? i'm trying to compile your software with gcc

Have you tried clang++?
See http://stackoverflow.com/questions/16660854/compiling-with-clang-with-c11-enabled-fails
If this solves some errors but not all of them, then try
http://stackoverflow.com/questions/16352833/linking-with-clang-on-os-x-generates-lots-of-symbol-not-found-errors
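
The -Wformat warnings in the build log are a separate issue from the link errors: the format string uses %d where a float is passed. A minimal sketch of the fix follows; formatBalanceLine is a hypothetical helper and the message text is only modelled on the log, not copied from main.cpp. As for the undefined symbols (operator new, std::terminate, ___gxx_personality_v0), those are what one would expect when C++ objects are linked through the C driver, so building with g++ or clang++ instead of gcc is the usual first thing to try.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper; the name and the message text are illustrative only.
std::string formatBalanceLine(unsigned count, float minBalance) {
    char buf[128];
    // %u matches the unsigned count; %f matches the float argument,
    // which is promoted to double when passed through varargs.
    int n = std::snprintf(buf, sizeof(buf),
                          "%u addresses with a balance of more than %.2f bitcoins.",
                          count, minBalance);
    assert(n >= 0);  // a negative return would indicate an encoding error
    return std::string(buf);
}
```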

Title: Re: A bitcoin blockchain parser in a few (thousand) lines of C++
Post by: jgarzik on March 24, 2014, 02:55:52 PM

If you don't mind C (versus C++), picocoin's "blkstats" utility parses the blockchain in under 3 minutes.

https://github.com/jgarzik/picocoin/blob/master/src/blkstats.c
https://bitcointalk.org/index.php?topic=128055.0
https://github.com/jgarzik/picocoin/

Title: Re: A bitcoin blockchain parser in a few (thousand) lines of C++
Post by: Prospero on June 20, 2014, 11:16:20 PM

Quote from: jgarzik on March 24, 2014, 02:55:52 PM

Can this be used to compute the bitcoin rich list (list of all addresses with balance greater than x)?

Title: Re: A bitcoin blockchain parser in a few (thousand) lines of C++
Post by: micax1 on November 13, 2014, 12:35:24 AM

Quote from: jratcliff63367 on July 01, 2013, 07:46:23 PM

Hi! Your tool is definitely useful! However, since the blockchain has become huge (~30 GB), with more than 52 million addresses and 200+ million inputs and outputs, the tool no longer works. Moreover, it is not possible to recompile it by changing just:

#define MAX_BITCOIN_ADDRESSES 48000000 // 48 million unique addresses.
#define MAX_TOTAL_TRANSACTIONS 90000000 // 90 million transactions.
#define MAX_TOTAL_INPUTS 268000000 // 268 million inputs.
#define MAX_TOTAL_OUTPUTS 268000000 // 268 million outputs.
#define MAX_TOTAL_BLOCKS 600000 // 600,000 blocks.

because it leads to Warning 16: warning C4307: '*' : integral constant overflow

These are the highest numbers with which the tool still compiles; however, it won't let you parse the blockchain afterwards.

It would be much appreciated if you could recompile it to be compatible with the current blockchain size, or give some tips on how to do it. )))
Thank you.
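
A C4307 warning on '*' is what one would expect if an entry count is multiplied by a per-entry size in 32-bit arithmetic and the product no longer fits. The sketch below is only an illustration under that assumption: kEntryBytes and the function names are hypothetical, and the actual expression in the parser may differ. It shows the wrap-around and the usual remedy of widening to 64 bits before multiplying. A table that large also cannot be allocated in a 32-bit process, so a 64-bit build would be needed as well.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-entry size in bytes; the parser's real structures may differ.
const uint32_t kEntryBytes = 40;

// 32-bit arithmetic: the product wraps modulo 2^32 once it exceeds 4 GiB.
uint32_t tableBytes32(uint32_t entries) {
    return entries * kEntryBytes;
}

// Widen one operand first so the multiplication is done in 64 bits.
uint64_t tableBytes64(uint32_t entries) {
    return static_cast<uint64_t>(entries) * kEntryBytes;
}

// Narrow back to 32 bits only after checking that the value fits.
uint32_t narrowChecked(uint64_t bytes) {
    assert(bytes <= UINT32_MAX);
    return static_cast<uint32_t>(bytes);
}
```

With 268,000,000 entries of 40 bytes the true size is 10,720,000,000 bytes, well past what a 32-bit value can hold.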

Title: Re: A bitcoin blockchain parser in a few (thousand) lines of C++
Post by: jgarzik on November 16, 2014, 11:07:07 PM

Quote from: Prospero on June 20, 2014, 11:16:20 PM

Quote from: jgarzik on March 24, 2014, 02:55:52 PM

Can this be used to compute the bitcoin rich list (list of all addresses with balance greater than x)?

Yes.

Title: Re: A bitcoin blockchain parser in a few (thousand) lines of C++
Post by: micax1 on November 18, 2014, 01:48:38 PM

Quote from: jgarzik on November 16, 2014, 11:07:07 PM

Quote from: Prospero on June 20, 2014, 11:16:20 PM

Quote from: jgarzik on March 24, 2014, 02:55:52 PM

Can this be used to compute the bitcoin rich list (list of all addresses with balance greater than x)?

Yes.

I could not find any docs on how to do it. Can you please help? For example, I want to get every address from the blockchain and sort them by current balance, highest first.
Thank you!

Title: Re: A bitcoin blockchain parser in a few (thousand) lines of C++
Post by: micax1 on November 29, 2014, 10:28:10 PM

Quote from: micax1 on November 13, 2014, 12:35:24 AM

Good news: duatiugame refactored the code and it's working again:
https://code.google.com/p/blockchain/source/checkout

#define MAX_BITCOIN_ADDRESSES 60000000 // 60 million unique addresses.
#define MAX_TOTAL_TRANSACTIONS 70000000 // 70 million transactions.
#define MAX_TOTAL_INPUTS 250000000 // 250 million inputs.
#define MAX_TOTAL_OUTPUTS 250000000 // 250 million outputs
#define MAX_TOTAL_BLOCKS 600000 // 600,000 blocks.

Title: Re: A bitcoin blockchain parser in a few (thousand) lines of C++
Post by: micax1 on January 20, 2015, 08:28:17 PM

https://code.google.com/p/blockchain/source/checkout

It was recently refactored and now works great! Thank you!

Title: Re: A bitcoin blockchain parser in a few (thousand) lines of C++
Post by: DieJohnny on January 21, 2015, 04:55:55 AM

can this parser be used to do an audit of bitcoin total coins?

How do we know that there are exactly the number of coins in the blockchain as have been mined since the beginning?

Title: Re: A bitcoin blockchain parser in a few (thousand) lines of C++
Post by: amaclin on January 21, 2015, 08:04:21 AM

Quote from: DieJohnny on January 21, 2015, 04:55:55 AM

can this parser be used to do an audit of bitcoin total coins?

How do we know that there are exactly the number of coins in the blockchain as have been mined since the beginning?

Very simple: sum all outputs that have not yet been spent.
The second question: how many bitcoins were lost?
The answer: the sum of all unspent outputs which are provably unspendable.
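
The audit described above can be sketched as follows. This is a simplified illustration, not code from the parser: OutPoint, TxOut, Tx, UtxoSet and the helper functions are all hypothetical, a coinbase is modelled as a transaction with no inputs, and "provably unspendable" is approximated by one well-known case only, an output script that begins with OP_RETURN (0x6a).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified types for illustration.
struct OutPoint { std::string txid; uint32_t index; };
struct TxOut { uint64_t satoshis; std::vector<uint8_t> script; };
struct Tx {
    std::string txid;
    std::vector<OutPoint> inputs;   // empty for a coinbase in this sketch
    std::vector<TxOut> outputs;
};
typedef std::map<std::pair<std::string, uint32_t>, TxOut> UtxoSet;

// Convenience constructor: an output whose script is a single opcode.
TxOut makeOut(uint64_t satoshis, uint8_t firstOpcode) {
    TxOut o;
    o.satoshis = satoshis;
    o.script.push_back(firstOpcode);
    return o;
}

// Apply one transaction in chain order: erase what it spends, add what it creates.
void applyTx(UtxoSet& utxo, const Tx& tx) {
    for (std::size_t i = 0; i < tx.inputs.size(); ++i)
        utxo.erase(std::make_pair(tx.inputs[i].txid, tx.inputs[i].index));
    for (std::size_t i = 0; i < tx.outputs.size(); ++i)
        utxo[std::make_pair(tx.txid, static_cast<uint32_t>(i))] = tx.outputs[i];
}

// One provable case only: the script starts with OP_RETURN (0x6a).
bool isProvablyUnspendable(const TxOut& o) {
    return !o.script.empty() && o.script[0] == 0x6a;
}

// Sum of every output still in the set.
uint64_t totalUnspent(const UtxoSet& utxo) {
    uint64_t sum = 0;
    for (UtxoSet::const_iterator it = utxo.begin(); it != utxo.end(); ++it)
        sum += it->second.satoshis;
    return sum;
}

// Sum of the unspent outputs that can be shown to be unspendable.
uint64_t totalProvablyUnspendable(const UtxoSet& utxo) {
    uint64_t sum = 0;
    for (UtxoSet::const_iterator it = utxo.begin(); it != utxo.end(); ++it)
        if (isProvablyUnspendable(it->second)) sum += it->second.satoshis;
    return sum;
}
```

Feeding every transaction of every block through applyTx in order and then calling totalUnspent gives the figure to compare against the expected block-reward schedule.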