I've been testing and debugging the patches against the 0.3.23 version. I think I've got most of the problems figured out but there is still a critical problem that's made me lose out on 3 blocks so far.
First, the fast getwork parsing doesn't work if the client doesn't send the id without quotes. This will allow it to send the id with or without quotes:
ThreadRPCServer3 in rpc.cpp:
if ((strRequest.find("\"getwork\"")!=std::string::npos) && strRequest.find("[]")!=std::string::npos)
std::string id;
size_t p = strRequest.find("\"id\":");
if (p != std::string::npos)
{
size_t ep = strRequest.find(" ", p+5), e;
if((e=strRequest.find(",", p+5))<ep) ep = e;
if((e=strRequest.find("}", p+5))<ep) ep = e;
id = strRequest.substr(p+5, ep-p-5);
}
Second, the locking problems seen are mostly due to the mutex being non-recursive. Changing mGetWork to recursive_mutex instead of mutex, along with the iterators for it, should fix the problem.
Third, the CheckWork done in getwork assigns a value that should be returned but to a variable in the wrong scope. This has been addressed earlier.
Some of the locking seems to be off. An example would be:
CommitTransaction in main.cpp:
// Broadcast
if (!wtxNew.AcceptToMemoryPool())
{
// This must not fail. The transaction has already been signed and recorded.
printf("CommitTransaction() : Error: Transaction not valid");
return false;
}
wtxNew.RelayWalletTransaction();
+ SyncGetWork(4);
}
- SyncGetWork(4);
The comment for SyncGetWork says cs_main must be held but before being called, but that's not the case with the patch before the above change.
Another one would be:
AddToBlockIndex in main.cpp:
if (pindexNew == pindexBest)
{
// Notify UI to display prev block's coinbase if it was ours
static uint256 hashPrevBestCoinBase;
- CRITICAL_BLOCK(cs_mapWallet)
+ CRITICAL_BLOCK(cs_mapWallet) {
vWalletUpdated.push_back(hashPrevBestCoinBase);
+ hashPrevBestCoinBase = vtx[0].GetHash();
+ }
- hashPrevBestCoinBase = vtx[0].GetHash();
}
This one may or may not matter depending on how many threads call it (I believe only one thread runs this function but I'm not 100% sure). However, this related to crashes I've been seeing right after writing out the block and before sending out the proof of work. The call to vtx[0].GetHash causes a crash in the Serialize function due to the assertion that nSize >= 0. I believe it's due to vtx[0] being corrupted but the event is rare at such high difficulties. I don't see the problem with the test network but it happens on the live one.
One thought I had is that the live one has more transactions and appending enough of them would trigger the crash. If this is the case, one cause might be the changes JoelKatz has made that'd allow vtx to get updated elsewhere (from another client calling getwork?). Another one may be vtx is a vector and whenever it resizes, the structures related to each object inside vtx isn't correctly copied over (I didn't see any copy constructors). I was able to get a block through when I commented the above out so that might be a sign that I'm on the right track but I'd like to see it get a few more blocks before I can rule out that it wasn't just a random event (initially, crashes on the test network were random and I've yet to see any on it since the fixes mentioned earlier).
These high difficulty levels are making it really troublesome to debug locking/race conditions. It'd be nice if one of the big pools try out various things to see if they also experience the crashes (and possibly end up not getting credit for the block) and try out various fixes to see if it fixes the problem. I haven't seen anyone mention these crashes so for all I know, it's just some compiler/machine issue. But then again, it doesn't seem like anyone's tried it with the fixes that doesn't lock up.