I've seen enough errors to believe that hardware is a perfectly reasonable possible source of error in this case.
I am fully in agreement that hardware could possibly be the cause. Perhaps what I'm really trying to say could be summarized like this.
- Because this software is in beta, it too could be the cause; in fact, that alone makes it very likely. I have had it crash for no apparent reason (usually complaining about its own database files after an improper shutdown). This could be happening to everybody, and we may just not know it.
- RAID is not a solution to the specific conjecture offered. If this were a hardware error under identical circumstances, RAID would have given no benefit, just because of what RAID is and isn't.
- Software issues that could contribute to this include the following: misuse of stray pointers, accessing freed memory, threading-related issues, buffer overruns. Or, it could be hardware.
- A potential tool to help rule out software issues might be to distribute this blockchain verification code and have others run it. I'd run it. Who knows. Maybe my copy of the block chain will have a similar kind of corruption in a different block. If it did, then there's likely a software gremlin lurking.
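To make the last point concrete, here is a minimal sketch of the kind of check others could run against their own copies. It is not the verification code from the original post. It relies only on two facts about the format: a block header is 80 bytes, and bytes 4 through 35 of each header hold the double SHA-256 hash of the previous header. The function names and the `make_fake_chain` helper are hypothetical, and the fake headers are dummy test data with no proof of work.

```python
import hashlib

HEADER_SIZE = 80  # a Bitcoin block header is 80 bytes


def block_hash(header: bytes) -> bytes:
    """Double SHA-256 of a header, in the byte order stored in the next header."""
    return hashlib.sha256(hashlib.sha256(header).digest()).digest()


def find_broken_links(headers):
    """Return every index i whose previous-hash field does not match header i-1."""
    broken = []
    for i in range(1, len(headers)):
        prev_field = headers[i][4:36]  # follows the 4-byte version field
        if prev_field != block_hash(headers[i - 1]):
            broken.append(i)
    return broken


def make_fake_chain(n):
    """Build n linked dummy headers (hypothetical test data, not real blocks)."""
    chain = []
    prev = bytes(32)
    for i in range(n):
        header = (
            (1).to_bytes(4, "little")   # version
            + prev                      # previous block hash
            + bytes(32)                 # dummy merkle root
            + i.to_bytes(4, "little")   # dummy timestamp
            + bytes(8)                  # dummy bits and nonce
        )
        assert len(header) == HEADER_SIZE
        chain.append(header)
        prev = block_hash(header)
    return chain
```

Note that damage inside header i shows up as a broken link at index i+1, because the next header no longer matches the recomputed hash. If several people ran something like this and reported different broken indices, that would point toward software rather than one person's hardware.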
The key is the nature of the corruption, which was described as a bit error. Honestly, I did not read the initial post closely enough and thought a bit had been flipped in a hash value, but on rereading I see the error is a single set bit across four bytes that should all be zero. That could have been an _overwrite_ of a 32-bit memory location, since zero is an extremely common value and single-bit values are fairly common. There is no clear connection between the two.
A hash bit error, in contrast, would increase my suspicion of hardware. Otherwise, software must have read the value, flipped a specific bit (or ORed it against a mask), and written it back. The crypto code might do that, but other types of software errors are unlikely to produce that effect. But that debate can be tabled until hash bit errors or other obvious bit errors are reported.
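The distinction can be sketched as a rough heuristic over one 32-bit word. This is only an illustration of the reasoning above, not a diagnostic tool: the function name and the labels are made up, and the "bit-flip" label means "hard to explain as a plain overwrite", not proof of a hardware fault.

```python
def classify_word_damage(expected: int, actual: int) -> str:
    """Roughly classify how a 32-bit word differs from its expected value."""
    diff = (expected ^ actual) & 0xFFFFFFFF
    changed_bits = bin(diff).count("1")
    if changed_bits == 0:
        return "intact"
    if changed_bits > 1:
        return "multi-bit"
    if expected == 0:
        # A zero word that became a power of two: a flipped bit and a stray
        # 32-bit store of a small common value look exactly the same.
        return "ambiguous"
    # One bit changed inside a non-zero value (for example, part of a hash):
    # software would have to read, modify, and write back to produce this.
    return "bit-flip"
```

For example, `classify_word_damage(0, 0x00010000)` returns `"ambiguous"`, while one changed bit inside a hash fragment such as `0x9c085ae1` returns `"bit-flip"`.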
I seriously doubt that anyone would lose money as a result of this kind of problem, but it may be worth investing in a light, continuous background validation process and some sort of reporting to the end user, such as a message box with instructions for reporting corrupted blocks. If that reveals a significant problem (beta testing would probably be enough to tell), then the remedy would be an error correction method that replaces corrupted blocks from the network (and more instrumentation to try to isolate any software causes).
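A rough sketch of what such a background pass might look like, checking a few blocks at a time so it stays light. Everything here is an assumption for illustration: the blocks are held in a plain list, the expected digests come from some trusted source, and `fetch_block` is a hypothetical callback standing in for "ask the network for block i". The client's real storage and networking are not modeled.

```python
import hashlib


def validate_batch(blocks, expected_digests, start, batch_size, fetch_block=None):
    """Check one small batch of stored blocks.

    Returns (next_start, corrupted_indices). Corrupted blocks are replaced
    in place only if fetch_block returns data matching the expected digest.
    """
    corrupted = []
    end = min(start + batch_size, len(blocks))
    for i in range(start, end):
        if hashlib.sha256(blocks[i]).digest() != expected_digests[i]:
            corrupted.append(i)  # report these to the user
            if fetch_block is not None:
                replacement = fetch_block(i)
                if hashlib.sha256(replacement).digest() == expected_digests[i]:
                    blocks[i] = replacement
    # Wrap around so repeated calls keep cycling through the whole store.
    next_start = end if end < len(blocks) else 0
    return next_start, corrupted
```

A caller would invoke this periodically from an idle timer, feed `next_start` back in on the next call, and show the reporting message only when `corrupted_indices` is non-empty.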