Two days of debugging for a single '&'

Debugging code is both tedious and contagious. You are held in suspense between your curiosity, stubbornness and immense boredom. It is one of these needle in a haystack tasks that can last for days and during the endeavor you feel awfully unproductive. However, there are often some lessons to be learnt, and sharing debugging stories is for code gurus like sharing fishing stories for fishermen.

A famous story is the one from Fortran that caused a satellite to crash (Mariner Venus) due to '-' for '~' error... This story is in C++ and about a missing '&', and like the satellite story it also begins with a crash.

I have had my customized bitcoin server (based on libcoin) running on different mac and windows desktop machines running for weeks with no issues, it survived reorganizations and the recent date based protocol changes (BIP0016/BIP0030) silently, and seemed pretty stable. I hence installed it in its production environment a Ubuntu machine at Linode. It had compiled and functioned several times before on linux, except that sometimes running on small virtualized machines I had seen crashes due to memory issues. Nothing to be really alarmed about, and the instance running on an amazon large instance had never crashed. I hence compiled to code and started it on Linode. Blockchain download... a reorganization and crash! Reason was bad_alloc, so most likely the memory issue, and afterall the Linode instance had only 1Gig of RAM. In my custom server I also keep some rather large data structures in memory, so I disabled these during chain download and tried again. Reorganize, and crash! Again due to bad_alloc.

Annoyingly, following the crashes the databases got corrupted as well. I still believed that the crashes were due to low memory, but I wanted to make sure in the future that I could at least recover gracefully from database corruptions. So I tried to figure out what state the database was left in.

In the reorganize method of the BlockChain class (basically similar to the reorganize method in bitcoin/bitcoin) there are a lot of nice tests to check if things fails - if blocks can't be read from disk or if disconnect fails. If this is the case the series of accumulated database transactions are aborted (TxnAbort) leaving the database in the same state as before, and ensuring that even thought the reorganization failed (and will fail again) the database is left in the same state and the same error will reappear on restart and can hence easily be debugged.
In my case it was a memory exception thrown in the loop where all the txes are pushed on a stack to be resurrected after the chain reorganization. The exception is caught, but it is not caught before in the message processing method (ProcessMessage in bitcoin, handleMessage in libcoin). Here it results in an error message, but the program continues and the database transaction is never aborted (TxnAbort). The failed reorganize will repeat it self and the database transaction log fills up till 500 uncommitted transactions and then the BDB makes the program die. However, at some point in between, at least in my code, the best height was sat in the database and that caused the corruption.

I fixed the issue by inserting a try catch around everything in reorganize up to the TxnAbort ensuring correct behavior even in case of a memory exception.

Instead of rebuilding the database (from scratch download) I tried to rebuild the database: Forcing a reorganize 10 steps back and cleaning the blockchainindex for all non main chain blocks. However, that immediately caused another reorganize crash due to a bad_alloc ! This time the reason was clearly not a filed up memory. I debugged the code and the same loop as before was the culprit:

Code:

    // Queue memory transactions to resurrect
    BOOST_FOREACH(const Transaction& tx, block.getTransactions())
        if (!tx.isCoinBase()) {
            vResurrect.push_back(tx);
        }

The Block::getTransactions() was defined like:

Code:

    const TransactionList getTransactions() const { return _transactions; }

libcoin encapsulates all classes, and hence looping over a (const version) of transactions in a block is done this way. I have other part of the code (in Block) where:

Code:

    BOOST_FOREACH(const Transaction& tx, _transactions)
        // do something with tx

is called multiple times with no problem, yet the code crashed there and it was tx that was corrupted! I changed the loop to a normal for loop using block.getNumTransactions() and block.getTransaction(int) - now the code ran perfectly up to block 171091 - which every bitcoinerd will recognize as the first block of March the 15th ! BIP0030... My BIP0030 code is:

Code:

        if (pindex->nTime > _chain.timeStamp(Chain::BIP0030))
            BOOST_FOREACH(const Transaction& tx, block.getTransactions()) {
                TxIndex txindexOld;
                if (ReadTxIndex(tx.getHash(), txindexOld))
                    BOOST_FOREACH(const DiskTxPos &pos, txindexOld.getSpents())
                    if (pos.isNull())
                        return false;
            }

Hey, again the BOOST_FOREACH loop! That construct was apparently somehow polluted - a google for 'BOOST_FOREACH const reference' gave this like:

http://www.richelbilderbeek.nl/CppBOOST_FOREACHExample1.htm

Note from Richel's page:

Code:

//BAD: Compiles and crashes program
  //std::clog << __func__ << '\n'; BOOST_FOREACH(const std::string& s, f()) { t+=s; }

So you cannot use a function returning a copy of a vector in BOOST_FOREACH, however a const reference to it works, but will spew a compiler warning. Changing Block::getTransactions() to :

Code:

    const TransactionList& getTransactions() const { return _transactions; }
//                       ^ note the single &

made everything work smoothly again. So the culprit was a missing '&' or more correctly that BOOST_FOREACH that didn't behave correctly. However, it does behave correctly when compiled on visual studio and Xcode. Hence I never saw that error before...

What is the lesson learnt then ? I think there are two:
* Use either exceptions OR false/true functions in a program not both! Somethimes it is better to crash than to keep trying.
* Use macros (BOOST_FOREACH) with great care, the behavior is often hard to predict and also hard to debug. I will over time change all BOOST_FOREACH in libcoin to normal for / while loops.

Thanks for reading this - I hope you found it useful!

Cheers,

Michael