What happens if you keep the backup on an SSD too and that SSD also gets corrupted?
Will this lead to the entire forum getting wiped out?
What happens if I write my password down on two pieces of paper and then set both papers on fire, will I lose my password? Please don't ask such silly questions in a Theymos thread for the sake of spamming your signature.
The bitcointalk.org and bitcoin.it databases were stored on a RAID 1+0 array: two RAID 1 arrays of 2 SSDs each, joined via RAID 0 (so 4 SSDs total, all the same model). We noticed yesterday that there were some minor file system errors on the bitcoin.it VM, but we took it for a fluke because there were no ongoing problems and the RAID controller reported no disk issues. A few hours later, the bitcointalk.org file system also started experiencing errors. When this was noticed, the bitcointalk.org database files were immediately moved elsewhere, but the RAID array deteriorated rapidly, and most of the database files ended up being too badly corrupted to be used. So a separate OS was set up on a different RAID array, and the database was restored using a daily backup.
My guess is that both of the SSDs in one of the RAID-1 sub-arrays started running out of spare sectors at around the same time. bitcoin.it runs on the same array, and it's been running low on memory for a few weeks, so its use of swap may have been what accelerated the deterioration of these SSDs. The RAID controller still reports no issues with the disks, but I don't see what else could cause this to happen to two distinct VMs. I guess the RAID controller doesn't know how to get the SMART data from these drives. (The drives are fairly old SSDs, so maybe they don't even support SMART.)
There is an interesting aspect of using SSDs in a mirrored RAID configuration that can only tolerate the failure of one drive: you are writing identical data to identical drives, so you should expect them to fail in identical ways, killing both drives at once.
Physical hard drives are more subject to mechanical tolerances that vary between samples. The disk coatings are not deposited equally, the bearings are not made molecularly identical, the windings in the heads and the motors aren't perfect matches. Even under a high-intensity, identical load, it would be virtually impossible for two such drives to fail at the same time.
SSDs are different. Some drives have firmware that deliberately bricks the drive, or turns it read-only, after a certain number of writes. While the actual memory cells may fail differently between drives, they wear at a predictable statistical rate, and the reserve of extra sectors (usually 2-5% of the drive's capacity) will be exhausted at nearly identical times given identical write patterns.
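To make that concrete, here is a toy back-of-envelope simulation (Python, not a model of any real drive's firmware): give every flash block an endurance drawn around the drive's rated write-cycle limit, and call the drive dead once the worn-out blocks exceed the spare reserve. With an identical write pattern, the only thing separating two same-model mirrored SSDs is per-unit variance in that limit, which is tiny compared to the spread you get between mechanically-toleranced parts.

```python
import random
from statistics import NormalDist

def passes_until_reserve_gone(rated_cycles=3000, block_spread=150,
                              reserve_pct=0.03, unit_variance=0.0, seed=0):
    # This particular drive's mean block endurance, perturbed by per-unit
    # manufacturing variance (tiny for two same-model SSDs).
    rng = random.Random(seed)
    mean_cycles = rated_cycles * (1.0 + rng.uniform(-unit_variance, unit_variance))
    block_endurance = NormalDist(mu=mean_cycles, sigma=block_spread)
    # The spare reserve is gone once reserve_pct of blocks have worn out,
    # i.e. at the write-pass count where the endurance CDF reaches reserve_pct.
    return block_endurance.inv_cdf(reserve_pct)

# Two same-model SSDs in a RAID 1 mirror, ~0.5% unit-to-unit variance:
print([round(passes_until_reserve_gone(unit_variance=0.005, seed=s)) for s in (1, 2)])
# Parts with ~10% unit-to-unit spread (closer to mechanical tolerances):
print([round(passes_until_reserve_gone(unit_variance=0.10, seed=s)) for s in (1, 2)])
```

The first pair of failure points lands nearly on top of each other; the second pair is spread far apart, which is the whole problem with mirroring identical SSDs.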
SSDs push the limit of what can be stored on silicon, so they have many layers of error correction to go along with the wear leveling. It is quite hard to get bad data back out of a well-designed Tier 1 drive.
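As a tiny illustration of the principle (real drives use far stronger BCH/LDPC codes over whole pages, so treat this purely as a sketch): a Hamming(7,4) code stores 4 data bits in 7 cells, and a single flipped cell still reads back as the original data.

```python
# Hamming(7,4): 4 data bits, 3 parity bits, corrects any single-bit flip.
def hamming74_encode(d):                 # d = [d1, d2, d3, d4]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]   # codeword positions 1..7

def hamming74_decode(c):
    # The syndrome points at the flipped position (0 means no error detected).
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3
    if pos:
        c = c[:]
        c[pos - 1] ^= 1                  # correct the single bad bit
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
stored = hamming74_encode(data)
stored[4] ^= 1                           # one worn-out cell flips a bit
assert hamming74_decode(stored) == data  # the read still comes back clean
```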
However, failure of SSDs can be random and capricious, especially the OCZ/Patriot/Crucial bottom-tier drives that just shit themselves for no reason. As it is unlikely the forum completely used up the wear life on its drives, it is more likely random crap-drive failure or RAID controller failure.
For reference, look at the tests here:
http://techreport.com/review/27436/the-ssd-endurance-experiment-two-freaking-petabytes - they designed a test specifically to kill SSDs with writes and left it running 24/7 for over a year on six drives. You can see that the SMART wear indicators on most drives track the amount of drive life left; some drives lock up once the reserve is used, and some keep going into write-error territory. Only one drive died unexpectedly. SMART analysis of the forum's drives, particularly the reallocated sector count and wear leveling count, would likely show whether this failure was unexpected.
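If anyone wants to pull those two attributes, here is a rough sketch (Python, assuming smartmontools 7.0+ for its JSON output; the attribute names and IDs vary by vendor, and drives hidden behind a hardware RAID controller usually need an extra -d option such as megaraid,N to be reachable at all):

```python
import json
import subprocess

# SMART attribute IDs of interest: 5 = Reallocated Sector Count,
# 177 = Wear Leveling Count (Samsung's numbering; other vendors use
# different IDs/names for their wear indicator).
WATCHED_IDS = {5, 177}

def smart_wear_report(device="/dev/sda", extra_args=()):
    # smartmontools 7.0+ can emit JSON; check=False because smartctl uses
    # its exit code as a status bitmask rather than plain success/failure.
    out = subprocess.run(["smartctl", "--json", "-A", device, *extra_args],
                         capture_output=True, text=True, check=False)
    data = json.loads(out.stdout)
    for attr in data.get("ata_smart_attributes", {}).get("table", []):
        if attr["id"] in WATCHED_IDS:
            print(f'{attr["name"]:<24} normalized={attr["value"]:>3} '
                  f'raw={attr["raw"]["value"]}')

if __name__ == "__main__":
    smart_wear_report("/dev/sda")
    # Behind a MegaRAID-style controller, something like:
    # smart_wear_report("/dev/sda", extra_args=("-d", "megaraid,0"))
```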
Outside of the SSDs themselves failing, most hardware RAID controllers are pretty dumb (and if you are spending less than ~$400 you are not even getting hardware RAID). They just write the same data to two drives. There is no error correction or checksumming, so they are useless for fixing corruption, and are actually less tolerant of it than a single drive would be. If you want to see scary, look up the "RAID 5 write hole": basically, there is no way for these RAID setups to tolerate power loss, which makes an on-controller battery backup super important.
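The write hole itself fits in a few lines: a stripe is data plus XOR parity, a crash lands between the data write and the parity write, and a later rebuild quietly hands back garbage. A toy Python version (nothing to do with any real controller's code):

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# A consistent RAID 5 stripe across three disks: two data blocks plus parity.
d0, d1 = b"AAAA", b"BBBB"
parity = xor(d0, d1)

# A write updates disk 0 ... and power is lost before the parity is rewritten.
d0 = b"CCCC"
# parity = xor(d0, d1)   # <- this step never happens

# Later, disk 1 dies. The controller "reconstructs" it from disk 0 and the
# stale parity, with no way to notice that the stripe was inconsistent.
rebuilt_d1 = xor(d0, parity)
print("expected", d1, "but rebuilt", rebuilt_d1)   # silently wrong data
```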
Also, consumer hardware just plain HAS errors, and hardware RAID is not written to deal with them:
http://www.zdnet.com/article/has-raid5-stopped-working/

Also majorly important is that the system must be running ECC RAM, and the RAID controller must also have ECC RAM if it has cache slots.
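The argument in the ZDNet piece is basically arithmetic: consumer drives are commonly specced at about one unrecoverable read error (URE) per 10^14 bits, a degraded RAID 5 rebuild has to read every surviving bit, and a single URE aborts the rebuild on a dumb controller. Rough numbers:

```python
# Back-of-envelope odds of hitting a URE while rebuilding a degraded array,
# assuming the commonly quoted consumer-drive spec of 1 error per 1e14 bits.
def p_rebuild_hits_ure(array_tb, ure_rate_bits=1e14):
    bits_to_read = array_tb * 1e12 * 8          # decimal TB -> bits
    return 1 - (1 - 1 / ure_rate_bits) ** bits_to_read

for tb in (2, 6, 12):
    print(f"{tb:>2} TB of surviving data -> "
          f"{p_rebuild_hits_ure(tb):.0%} chance of a URE during rebuild")
```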
I am getting much more behind ZFS RAID. It is very impressive. It is software RAID run by the OS, and that is no longer a bad thing: the CPU and OS have hundreds of times more processing power and RAM than RAID cards. There is a journal that can be replayed after power loss, and commits are made to disk in a way that never leaves the data corrupted. Everything on the disks is checksummed and self-healing. The OS can talk directly to the drives and their SMART status to understand drive state. You can run it on standard SATA controllers, and when your motherboard burns up, you can mount the drives in any other motherboard on any other controller. You don't need hot-spares and lengthy rebuilds; you can run RAIDZ3 (three drives' worth of parity), which can survive three simultaneous drive failures without losing the array.
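The self-healing part is the key difference from a dumb mirror, and the idea fits in a toy sketch (Python, a conceptual illustration only, not ZFS's actual on-disk format): checksums live in the parent block rather than next to the data, so on a mismatch the pool knows which mirror copy is bad and rewrites it from the good one.

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Two mirror "disks" plus checksums held in the parent block (not next to the data).
mirror = [{0: b"forum database page"}, {0: b"forum database page"}]
parent_checksums = {0: checksum(b"forum database page")}

# Silent bit rot on one side of the mirror; the disk itself reports no error.
mirror[0][0] = b"forum databose page"

def self_healing_read(block: int) -> bytes:
    for disk in mirror:
        data = disk[block]
        if checksum(data) == parent_checksums[block]:
            # Found a good copy: rewrite any mirror copy that fails its checksum.
            for other in mirror:
                if checksum(other[block]) != parent_checksums[block]:
                    other[block] = data
            return data
    raise IOError(f"block {block}: no copy matches its parent checksum")

print(self_healing_read(0))             # returns the good data, heals disk 0
assert mirror[0][0] == mirror[1][0]     # the mirror is consistent again
```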
Then mirror the whole machine automatically with Highly Available Storage (HAST) and CARP in BSD. Backups are now just for when the whole place burns to the ground.