Title: Two days of debugging for a single '&'
Post by: libcoin on April 13, 2012, 08:38:23 AM

Debugging code is both tedious and contagious. You are held in suspense between your curiosity, stubbornness and immense boredom. It is one of those needle-in-a-haystack tasks that can last for days, and during the endeavor you feel awfully unproductive. However, there are often some lessons to be learnt, and sharing debugging stories is for code gurus like sharing fishing stories for fishermen.
A famous story is the one from Fortran that caused a satellite to crash (Mariner Venus) due to a '-' for '~' error... This story is in C++ and about a missing '&', and like the satellite story it also begins with a crash.

I have had my customized bitcoin server (based on libcoin) running on different Mac and Windows desktop machines for weeks with no issues. It survived reorganizations and the recent date-based protocol changes (BIP0016/BIP0030) silently, and seemed pretty stable. I hence installed it in its production environment, an Ubuntu machine at Linode. It had compiled and functioned several times before on Linux, except that sometimes, running on small virtualized machines, I had seen crashes due to memory issues. Nothing to be really alarmed about, and the instance running on an Amazon large instance had never crashed.

I hence compiled the code and started it on Linode. Blockchain download... a reorganization and crash! The reason was bad_alloc, so most likely the memory issue; after all, the Linode instance had only 1 GB of RAM. In my custom server I also keep some rather large data structures in memory, so I disabled these during chain download and tried again. Reorganize, and crash! Again due to bad_alloc. Annoyingly, following the crashes the databases got corrupted as well.

I still believed that the crashes were due to low memory, but I wanted to make sure that in the future I could at least recover gracefully from database corruption. So I tried to figure out what state the database was left in. In the reorganize method of the BlockChain class (basically similar to the reorganize method in bitcoin/bitcoin) there are a lot of nice tests to check if things fail: if blocks can't be read from disk or if disconnect fails.
If this is the case, the series of accumulated database transactions is aborted (TxnAbort), leaving the database in the same state as before. Even though the reorganization failed (and will fail again), the same error will reappear on restart and can hence easily be debugged.

In my case it was a memory exception thrown in the loop where all the txes are pushed on a stack to be resurrected after the chain reorganization. The exception is caught, but not until it reaches the message processing method (ProcessMessage in bitcoin, handleMessage in libcoin). There it results in an error message, but the program continues and the database transaction is never aborted (TxnAbort). The failed reorganize will repeat itself and the database transaction log fills up to 500 uncommitted transactions, and then BDB makes the program die. However, at some point in between, at least in my code, the best height was set in the database, and that caused the corruption. I fixed the issue by inserting a try/catch around everything in reorganize up to the TxnAbort, ensuring correct behavior even in case of a memory exception.

Instead of rebuilding the database from scratch (a full download) I tried to repair it: forcing a reorganize 10 steps back and cleaning the blockchainindex of all non-main-chain blocks. However, that immediately caused another reorganize crash due to a bad_alloc! This time the reason was clearly not filled-up memory. I debugged the code and the same loop as before was the culprit:

Code:
// Queue memory transactions to resurrect

Code:
const TransactionList getTransactions() const { return _transactions; }

Code:
BOOST_FOREACH(const Transaction& tx, _transactions)

is called multiple times with no problem, yet the code crashed there and it was tx that was corrupted!
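The try/catch fix described above (wrapping everything in reorganize up to the TxnAbort) might look roughly like the sketch below. FakeDb, its members and this reorganize signature are made up for illustration; they are not the libcoin or Berkeley DB API. The only point is that an exception thrown mid-reorganize aborts the pending database transaction instead of escaping to the message handler with the transaction still open.

```cpp
#include <cassert>
#include <exception>
#include <new>

// Hypothetical stand-in for the database; NOT the libcoin or BDB interface.
struct FakeDb {
    int committedHeight = 100;  // state that survives a restart
    int pendingHeight = 100;    // state inside the open transaction
    bool inTxn = false;

    void txnBegin()  { inTxn = true; pendingHeight = committedHeight; }
    void txnCommit() { committedHeight = pendingHeight; inTxn = false; }
    void txnAbort()  { pendingHeight = committedHeight; inTxn = false; }
};

// Returns true on success, false if the reorganization failed and was
// rolled back. simulateBadAlloc mimics the bad_alloc from the story.
bool reorganize(FakeDb& db, int newHeight, bool simulateBadAlloc) {
    db.txnBegin();
    try {
        // Work that touches the database inside the transaction.
        db.pendingHeight = newHeight;

        // Stand-in for the loop that queues transactions to resurrect,
        // which is where the memory exception was thrown.
        if (simulateBadAlloc) {
            throw std::bad_alloc();
        }
    } catch (const std::exception&) {
        // Abort here, so the caller never sees a half-applied reorganize.
        db.txnAbort();
        return false;
    }
    db.txnCommit();
    return true;
}
```

With a FakeDb at height 100, reorganize(db, 105, true) returns false and leaves committedHeight at 100 with no open transaction, while reorganize(db, 105, false) commits height 105. A guard object that aborts in its destructor would be another way to get the same guarantee.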
I changed the loop to a normal for loop using block.getNumTransactions() and block.getTransaction(int). Now the code ran perfectly up to block 171091, which every bitcoinerd will recognize as the first block of March the 15th! BIP0030... My BIP0030 code is:

Code:
if (pindex->nTime > _chain.timeStamp(Chain::BIP0030))

http://www.richelbilderbeek.nl/CppBOOST_FOREACHExample1.htm

Note from Richel's page:

Code:
//BAD: Compiles and crashes program

Code:
const TransactionList& getTransactions() const { return _transactions; }

What is the lesson learnt then? I think there are two:
* Use either exceptions OR false/true functions in a program, not both! Sometimes it is better to crash than to keep trying.
* Use macros (BOOST_FOREACH) with great care; the behavior is often hard to predict and also hard to debug. I will over time change all BOOST_FOREACH in libcoin to normal for / while loops.

Thanks for reading this - I hope you found it useful!

Cheers,
Michael

Title: Re: Two days of debugging for a single '&'
Post by: CIYAM on April 13, 2012, 08:52:40 AM

Macros are always evil and I have never really seen the attraction of BOOST_FOREACH myself, especially now that we have "auto" types and can write our loops as follows:
for( auto i = container.begin( ); i != container.end( ); ++i ) { ... }

[now it will be time for others to chime in with the "C++ is the root of all evil and should never be used" type stuff we get on this forum]

Title: Re: Two days of debugging for a single '&'
Post by: jim618 on April 13, 2012, 10:09:03 AM

I think the worst debugging nightmare I had was when I was working ages ago at an electronics company.
We used to simulate the behaviour of circuits with a language called Spice. You would hook up the wiring in software and then simulate the signals coming in and see what happened. I had one circuit model in a Spice file that worked fine and another that just crashed. After a few days I had whittled it down to two seemingly identical files, one of which worked fine and one of which crashed.

In the end I discovered what the problem was by...
printing the two files out on the cheapest paper we had.
putting the two pieces of paper one on top of each other.
holding them up to the light to see if there were any differences.

I noticed the two sheets of paper were identical but slightly misaligned. One had a carriage return at the start of the file, which was enough to crash the Spice interpreter.

Grrrrhhhhhh!

Title: Re: Two days of debugging for a single '&'
Post by: apetersson on April 13, 2012, 12:53:11 PM

What is the lesson learnt then ? I think there are

i think this shows us that in any complex codebase at some point you can't simply rule out errors just by a human looking at it. therefore you need to rely on as much automated error checking as possible. maybe it's different for you, but in my experience this tells me:
for very simple problems or problems with very specific performance requirements you have to be pragmatic and drop some of those restrictions.

i am very happy that there is a very good implementation of Bitcoin in Java. Any developer who agrees with the issues i raised should consider using BitcoinJ instead of a C++ based Bitcoind. It is not perfect but at the moment the safest choice.