Recent downtime and data loss

deepceleron

Legendary

Offline

Activity: 1512
Merit: 1036

Re: Recent downtime and data loss

January 23, 2015, 02:39:30 PM
Last edit: January 23, 2015, 05:57:09 PM by deepceleron

#41

Quote from: stellar69 on January 23, 2015, 12:37:28 PM

What happens if you keep the backup on a ssd too and that ssd also gets corrupted?
Will this lead to the entire forum getting wiped out?

What happens if I write my password down on two pieces of paper and then set both papers on fire, will I lose my password? Please don't ask such silly questions in a Theymos thread for the sake of spamming your signature.

Quote from: theymos on January 23, 2015, 06:03:44 AM

The bitcointalk.org and bitcoin.it databases were stored on a RAID 1+0 array: two RAID 1 arrays of 2 SSDs each, joined via RAID 0 (so 4 SSDs total, all the same model). We noticed yesterday that there were some minor file system errors on the bitcoin.it VM, but we took it for a fluke because there were no ongoing problems and the RAID controller reported no disk issues. A few hours later, the bitcointalk.org file system also started experiencing errors. When this was noticed, the bitcointalk.org database files were immediately moved elsewhere, but the RAID array deteriorated rapidly, and most of the database files ended up being too badly corrupted to be used. So a separate OS was set up on a different RAID array, and the database was restored using a daily backup.

My guess is that both of the SSDs in one of the RAID-1 sub-arrays started running out of spare sectors at around the same time. bitcoin.it runs on the same array, and it's been running low on memory for a few weeks, so its use of swap may have been what accelerated the deterioration of these SSDs. The RAID controller still reports no issues with the disks, but I don't see what else could cause this to happen to two distinct VMs. I guess the RAID controller doesn't know how to get the SMART data from these drives. (The drives are fairly old SSDs, so maybe they don't even support SMART.)

There is an interesting aspect of RAID arrays when using SSDs in a mirrored configuration that can only tolerate the failure of one drive - you are writing identical data to identical drives, and should expect them to fail in identical ways, killing two drives at once.
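To make the failure arithmetic concrete, here is a minimal Python sketch of the 4-SSD RAID 1+0 layout described in the quoted post. The disk names and the `array_survives` helper are made up for illustration. It only models which combinations of dead drives lose data and says nothing about how a real controller behaves.

```python
from itertools import combinations

# Hypothetical disk names: two RAID 1 mirrors striped together (RAID 1+0).
MIRRORS = [("ssd0", "ssd1"), ("ssd2", "ssd3")]


def array_survives(failed):
    """The stripe survives only if every mirror still has one working disk."""
    failed = set(failed)
    return all(any(disk not in failed for disk in mirror) for mirror in MIRRORS)


def fatal_pairs():
    """List every two-disk failure that destroys the array."""
    disks = [disk for mirror in MIRRORS for disk in mirror]
    return [pair for pair in combinations(disks, 2) if not array_survives(pair)]


if __name__ == "__main__":
    print(array_survives(["ssd0"]))          # one dead disk
    print(array_survives(["ssd0", "ssd2"]))  # one dead disk in each mirror
    print(array_survives(["ssd0", "ssd1"]))  # both halves of one mirror
    print(fatal_pairs())
```

Only the double failures that hit both halves of the same mirror are fatal, and those are exactly the pairs of drives that receive identical writes.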

Physical hard drives are more subject to mechanical tolerances that vary between samples. The disk coatings are not deposited equally, the bearings are not made molecularly identical, the windings in the heads and the motors aren't perfect matches. Given an identical high-intensity load on two drives, we would expect it to be virtually impossible for them to fail at the same time.

SSDs are different. Some drives have firmware that specifically bricks the drive or turns it into a read-only drive after a certain number of writes. While the actual memory cells may fail differently between the drives, they wear at a predictable statistical rate and there is a reserve of usually 2-5% of drive space of extra sectors that will finally be exhausted at nearly identical times given identical write patterns.
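A rough way to see why the reserve runs out at nearly the same time on both drives is a toy simulation. Everything here is an assumption for illustration: the block count, the 3% reserve, the endurance figures, and the idea of perfect wear leveling. It is not a model of any particular drive.

```python
import random


def cycles_until_reserve_exhausted(seed, blocks=20_000, reserve_fraction=0.03,
                                   mean_endurance=3000.0, spread=300.0):
    """Return the wear-leveled write cycle at which spare blocks run out.

    Every block gets its own randomly drawn endurance, so individual cells
    fail at different times, but the drive only dies once more blocks have
    failed than the reserve can replace.
    """
    rng = random.Random(seed)
    endurance = sorted(rng.gauss(mean_endurance, spread) for _ in range(blocks))
    reserve_blocks = int(blocks * reserve_fraction)
    # Perfect wear leveling: all blocks see the same cycle count, so the
    # drive is out of spares when the (reserve_blocks + 1)-th block fails.
    return endurance[reserve_blocks]


if __name__ == "__main__":
    drive_a = cycles_until_reserve_exhausted(seed=1)
    drive_b = cycles_until_reserve_exhausted(seed=2)
    gap = abs(drive_a - drive_b) / drive_a
    print(f"drive A: {drive_a:.0f} cycles, drive B: {drive_b:.0f} cycles")
    print(f"relative gap: {gap:.2%}")
```

The two drives get different random cells, yet the point where the reserve is used up depends on the bulk statistics of thousands of blocks, so it lands at almost the same cycle count on both.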

SSDs push the limit of what can be stored on silicon, so they have many layers of error correction to go along with the wear leveling. It is quite hard to get bad data back out of a well-designed Tier 1 drive.

However, failure of SSDs can be random and capricious, especially the OCZ/Patriot/Crucial bottom-tier drives that just shit themselves for no reason. As it is unlikely the forum completely used up the wear life of its drives, random crap-drive failure or RAID controller failure is more likely.

For reference, look at the tests here: http://techreport.com/review/27436/the-ssd-endurance-experiment-two-freaking-petabytes - they designed a test specifically to kill SSDs with writes and left it running 24/7 for over a year on six drives. You can see that the SMART wear indicators on most drives indicate the amount of drive life left, some lock up after the reserve is used, and some keep going into write error territory. Only one drive unexpectedly died. SMART analysis of the forum's drives should show the reallocated sector count and wear leveling count, which would tell us whether the failure was unexpected.
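If the drives do expose SMART, checking those two counters can be scripted. A minimal sketch, assuming text in the table layout printed by `smartctl -A` from smartmontools: the sample below is made up, attribute names vary by vendor, and `parse_watched` and `near_threshold` are hypothetical helpers.

```python
WATCHED = {"Reallocated_Sector_Ct", "Wear_Leveling_Count"}

# Made-up sample in the layout printed by `smartctl -A`; not real drive data.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       14210
177 Wear_Leveling_Count     0x0013   007   007   005    Pre-fail  Always       -       2790
"""


def parse_watched(text):
    """Return {attribute_name: (normalized_value, threshold, raw)} for WATCHED."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header or unrelated line
        name = fields[1]
        if name in WATCHED:
            found[name] = (int(fields[3]), int(fields[5]), fields[9])
    return found


def near_threshold(attrs, margin=5):
    """Names whose normalized value is within `margin` of the failure threshold."""
    return sorted(name for name, (value, thresh, _raw) in attrs.items()
                  if value - thresh <= margin)


if __name__ == "__main__":
    attrs = parse_watched(SAMPLE)
    print(attrs)
    print(near_threshold(attrs))
```

In practice the text would come from running `smartctl -A` against each drive, which only works if the controller passes SMART through.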

Outside of the SSDs themselves failing, most hardware RAID controllers are pretty dumb (and if you are spending less than ~$400 you are not even getting hardware RAID). They just write the same data to two drives. There is no error correcting or checksumming, so they are useless for fixing corruption, and are actually less tolerant than a single drive would be. If you want to see scary, look up "RAID 5 write hole": basically there is no way for these RAIDs to tolerate power loss mid-write, making on-controller battery backup super important.
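The "no checksumming" point is easy to show in a few lines. This is only a sketch of the idea, not how any controller or filesystem is implemented. The function names are hypothetical and SHA-256 is used purely because it ships with Python.

```python
import hashlib


def read_plain_mirror(copy_a, copy_b):
    """A plain mirror can see that copies differ, but not which one is right."""
    if copy_a == copy_b:
        return copy_a
    raise ValueError("mirror copies differ and there is no way to pick one")


def read_checksummed_mirror(copy_a, copy_b, expected_sha256):
    """With a stored checksum, return whichever copy still matches it."""
    for copy in (copy_a, copy_b):
        if hashlib.sha256(copy).hexdigest() == expected_sha256:
            return copy
    raise ValueError("both copies fail the checksum")


if __name__ == "__main__":
    good = b"forum post #41"
    checksum = hashlib.sha256(good).hexdigest()  # stored when the data was written
    bad = b"forum post #4\x00"                   # silently corrupted copy

    print(read_checksummed_mirror(bad, good, checksum))
    try:
        read_plain_mirror(bad, good)
    except ValueError as err:
        print("plain mirror:", err)
```

Note that the plain-mirror function is generous: it at least compares both copies, whereas a mirror that reads from only one drive would not notice the mismatch at all.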

Also, consumer hardware just plain HAS errors, and hardware RAID is not written to deal with them: http://www.zdnet.com/article/has-raid5-stopped-working/

Also majorly important is that the system must be running ECC RAM, and the RAID controller must also have ECC RAM if it has cache slots.

I am getting much more behind ZFS RAID. It is very impressive. It is software RAID run by the OS. This is no longer a bad thing. The CPU and OS have hundreds of times more processing power and RAM than RAID cards. There is a journal written that can be replayed upon power loss, and commits are made to the disk in a way that data will never be corrupted. Everything on the disks is checksummed and self-healing. The OS can talk directly to the drives and read their SMART status to understand drive state. You can run RAID on standard SATA controllers, and when your motherboard burns up, mount the drives in any other motherboard and on any other controller. You don't need to have hot-spares and lengthy rebuilds; you can have RAIDZ3 - three drives' worth of parity - so the array survives three drive failures and it would take a fourth to take it out.
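For anyone sizing a pool, the parity trade-off is simple arithmetic. A rough sketch, assuming a single RAIDZ group of equal-size disks and ignoring metadata and padding overhead (so real usable space comes out lower). `raidz_summary` is a made-up helper, not part of any ZFS tool.

```python
def raidz_summary(disks, parity, disk_tb):
    """Rough RAIDZ numbers; parity is 1, 2 or 3 (RAIDZ1/2/3)."""
    if parity not in (1, 2, 3):
        raise ValueError("RAIDZ parity level must be 1, 2 or 3")
    if disks <= parity:
        raise ValueError("need more disks than parity devices")
    return {
        "usable_tb": (disks - parity) * disk_tb,
        "failures_survived": parity,
        "failures_to_lose_pool": parity + 1,
    }


if __name__ == "__main__":
    # Hypothetical pool: eight 4 TB disks at each parity level.
    for level in (1, 2, 3):
        print(f"RAIDZ{level}:", raidz_summary(disks=8, parity=level, disk_tb=4.0))
```

Each extra parity level costs one disk of capacity and buys one more survivable failure.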

Then mirror the whole machine automatically with Highly Available Storage (HAST) and CARP in BSD. Backups are now just for when the whole place burns to the ground.

kolloh

Legendary

Offline

Activity: 1736
Merit: 1023

Re: Recent downtime and data loss

January 23, 2015, 03:01:05 PM

#42

Thanks for posting the detailed information about what happened. From a technical perspective, it is interesting reading about how things are configured.

deepceleron, I'm actually planning to use a RAIDZ2 pool for a database server that I'm working on as it looks like a very nice solution. RAIDZ2 only allows two disks to fail but I am liking what I read about ZFS thus far. The LZ4 compression also looks to be pretty handy.

BADecker

Legendary

Offline

Activity: 3990
Merit: 1386

Re: Recent downtime and data loss

January 23, 2015, 04:27:36 PM

#43

Quote from: deepceleron on January 23, 2015, 02:39:30 PM

...

I am getting much more behind ZFS RAID. It is very impressive. It is software RAID run by the OS. This is no longer a bad thing. The CPU and OS have hundreds of times more processing power and RAM than RAID cards. There is a journal written that can be replayed upon power loss, and commits are made to the disk in a way that data will never be corrupted. Everything on the disks is checksummed and self-healing. The OS can talk directly to the drives and read their SMART status to understand drive state. You can run RAID on standard SATA controllers, and when your motherboard burns up, mount the drives in any other motherboard and on any other controller. You don't need to have hot-spares and lengthy rebuilds; you can have RAIDZ3 - three drives' worth of parity - so the array survives three drive failures and it would take a fourth to take it out.

Then mirror the whole machine automatically with Highly Available Storage (HAST) and CARP in BSD. Backups are now just for when the whole place burns to the ground.

With an operation like BitcoinTalk, or any other well-funded operation, this might be doable. But can the little guy (gal) afford this for their home computer? What exists for the little people?

fairglu

Legendary

Offline

Activity: 1100
Merit: 1032

Re: Recent downtime and data loss

January 23, 2015, 04:43:49 PM

#44

Quote from: deepceleron on January 23, 2015, 02:39:30 PM

Backups are now just for when the whole place burns to the ground.

Do not underestimate the Plain Old Bugs (tm), Plain Old Human Errors (tm) and the Drunk or Drugged SysAdmin (r).

Much more common than destruction by fire :P

Buffer Overflow

Legendary

Offline

Activity: 1652
Merit: 1016

Re: Recent downtime and data loss

January 23, 2015, 04:53:57 PM

#45

Quote from: BADecker on January 23, 2015, 04:27:36 PM

Quote from: deepceleron on January 23, 2015, 02:39:30 PM

With an operation like BitcoinTalk, or any other well-funded operation, this might be doable. But can the little guy (gal) afford this for their home computer? What exists for the little people?

If something was mission critical, it wouldn't be running on a home computer in the first place.
But since you asked: a couple of disks, software RAID, Linux, and a good backup schedule. Job done.

PGP: 0x702ABE1B3213C7CC4BFB63A70C921A615B4A1D72

BADecker

Legendary

Offline

Activity: 3990
Merit: 1386

Re: Recent downtime and data loss

January 23, 2015, 05:00:39 PM

#46

Quote from: Buffer Overflow on January 23, 2015, 04:53:57 PM

Quote from: BADecker on January 23, 2015, 04:27:36 PM

Quote from: deepceleron on January 23, 2015, 02:39:30 PM

With an operation like BitcoinTalk, or any other well-funded operation, this might be doable. But can the little guy (gal) afford this for their home computer? What exists for the little people?

If something was mission critical, it wouldn't be running on a home computer in the first place.

The whole internet is important to me. My home computer is important to me. Without my home computer, I wouldn't be able to access the Internet as easily.

I use my home computer for other things. I empathize with people in their love/hate relationships with their own computers. Yet mine is mission critical to me.

Buffer Overflow

Legendary

Offline

Activity: 1652
Merit: 1016

Re: Recent downtime and data loss

January 23, 2015, 05:03:00 PM

#47

Quote from: BADecker on January 23, 2015, 05:00:39 PM

Quote from: Buffer Overflow on January 23, 2015, 04:53:57 PM

Quote from: BADecker on January 23, 2015, 04:27:36 PM

Quote from: deepceleron on January 23, 2015, 02:39:30 PM

With an operation like BitcoinTalk, or any other well-funded operation, this might be doable. But can the little guy (gal) afford this for their home computer? What exists for the little people?

If something was mission critical, it wouldn't be running on a home computer in the first place.

Backup software. Plenty about.

PGP: 0x702ABE1B3213C7CC4BFB63A70C921A615B4A1D72

BADecker

Legendary

Offline

Activity: 3990
Merit: 1386

Re: Recent downtime and data loss

January 23, 2015, 05:11:37 PM

#48

Quote from: Buffer Overflow on January 23, 2015, 05:03:00 PM

Quote from: BADecker on January 23, 2015, 05:00:39 PM

Quote from: Buffer Overflow on January 23, 2015, 04:53:57 PM

Quote from: BADecker on January 23, 2015, 04:27:36 PM

Quote from: deepceleron on January 23, 2015, 02:39:30 PM

With an operation like BitcoinTalk, or any other well-funded operation, this might be doable. But can the little guy (gal) afford this for their home computer? What exists for the little people?

If something was mission critical, it wouldn't be running on a home computer in the first place.

Backup software. Plenty about.

Thank you. ;)

egghead123

Legendary

Offline

Activity: 1330
Merit: 1000

Re: Recent downtime and data loss

January 23, 2015, 06:10:19 PM

#49

Quote from: theymos on January 23, 2015, 06:03:44 AM

Due to a failure of the RAID array that bitcointalk.org was running on, the database became corrupted. It was necessary to move the OS and restore the database using a daily backup. About 8 hours of data was lost (anything after about Jan 21 21:44 UTC).

I've been busy getting the forum back online, so I haven't investigated this much yet, but it may be possible for me to manually restore lost posts/PMs by searching the recovered database files for keywords that you remember. If you lost a post or PM that you feel is absolutely irreplaceable, PM me and I'll see if I can recover it.

Everyone's drafts were also all lost. These are not backed up because they're automatically deleted after 14 days anyway.

Search is temporarily disabled because I need to regenerate the search index before it will be usable again.

There will be some periodic downtime over the next few days (a few hours in total) as we get everything reconfigured/settled. A few things might be broken. Tell me if you see any bugs.

If you paid to remove a proxyban and this was not recognized due to the downtime, email the txid to the pbbugs email address and I'll whitelist you right away.

This week's ad stats were lost. The current ads will be up for an extra long time to make up for the downtime and the lost stats.

Sorry for the inconvenience!

Technical details:

The bitcointalk.org and bitcoin.it databases were stored on a RAID 1+0 array: two RAID 1 arrays of 2 SSDs each, joined via RAID 0 (so 4 SSDs total, all the same model). We noticed yesterday that there were some minor file system errors on the bitcoin.it VM, but we took it for a fluke because there were no ongoing problems and the RAID controller reported no disk issues. A few hours later, the bitcointalk.org file system also started experiencing errors. When this was noticed, the bitcointalk.org database files were immediately moved elsewhere, but the RAID array deteriorated rapidly, and most of the database files ended up being too badly corrupted to be used. So a separate OS was set up on a different RAID array, and the database was restored using a daily backup.

My guess is that both of the SSDs in one of the RAID-1 sub-arrays started running out of spare sectors at around the same time. bitcoin.it runs on the same array, and it's been running low on memory for a few weeks, so its use of swap may have been what accelerated the deterioration of these SSDs. The RAID controller still reports no issues with the disks, but I don't see what else could cause this to happen to two distinct VMs. I guess the RAID controller doesn't know how to get the SMART data from these drives. (The drives are fairly old SSDs, so maybe they don't even support SMART.)

I plan on doing more investigation later to make sure that this doesn't happen again. I will probably also set up MySQL replication (or something) to prevent so much data loss in case something similar does happen again.

On the bright side, the backup worked fairly smoothly. This is the first time I've had to use one of the daily backups for real restoration.

Thanks for your time and efforts. Get an XBC wallet address and I will send you a big donation.

johnyj

Legendary

Offline

Activity: 1988
Merit: 1012

Beyond Imagination

Re: Recent downtime and data loss

January 23, 2015, 09:38:40 PM

#50

This totally defeats the purpose of running RAID 10. Such a high failure rate is already worse than conventional HDDs, and I don't think RAID 0 is needed for SSDs; they are already fast enough. Using two RAID 1 arrays to back each other up would be a better solution. Anyway, this is very strange: RAID 1 should give enough warning before a total failure.

Bizmark13

Sr. Member

Offline

Activity: 462
Merit: 250

Re: Recent downtime and data loss

January 24, 2015, 01:17:15 AM

#51

The forum went down right after I hit the preview button and just as I realized that my post was missing a [/QUOTE] tag. For a second or two, I got BBcode confused with HTML and thought there was a possibility that I broke the forum.

Anyway, to those who aren't sure if they have posts that are deleted or not (particularly to those who post a lot and might not remember how many posts they made prior to the data loss), go through your browser history and you'll get an idea of which posts you need to re-post.

jacktheking

Legendary

Offline

Activity: 1484
Merit: 1001

Personal Text Space Not For Sale

Re: Recent downtime and data loss

January 24, 2015, 01:31:52 AM

#52

Looks like I lost four posts. My signature campaign requires me to have 165 posts, and I remember reaching it. When the forum came back online, I had 161 posts. Anyway, good to hear the forum is back online.

Rishblitz

Full Member

Offline

Activity: 308
Merit: 100

I'm nothing without GOD

Re: Recent downtime and data loss

January 24, 2015, 02:41:32 AM

#53

That sucks, but at least it's working again.

Stratobitz

Legendary

Offline

Activity: 1022
Merit: 1010

Re: Recent downtime and data loss

January 24, 2015, 08:08:17 AM

#54

Nice to see things back up and running. I run an 8 disk Raid 0 SSD Array as well. Certainly safer than spinning drives and knock on wood no failures yet. But backups are a must.

Thanks for your hard work getting it back and running so quickly.

Cheers!

Strato

Wendigo

Legendary

Offline

Activity: 2604
Merit: 1036

Re: Recent downtime and data loss

January 24, 2015, 08:26:13 AM

#55

Quote from: Stratobitz on January 24, 2015, 08:08:17 AM

Doesn't raid 0 provide no recovery if 1 of the ssd's fails? At least you can salvage some data from a HDD.

dogie

Legendary

Offline

Activity: 1666
Merit: 1185

dogiecoin.com

Re: Recent downtime and data loss

January 24, 2015, 08:50:01 AM
Last edit: January 24, 2015, 12:46:40 PM by dogie

#56

Quote from: Wendigo on January 24, 2015, 08:26:13 AM

Quote from: Stratobitz on January 24, 2015, 08:08:17 AM

Doesn't raid 0 provide no recovery if 1 of the ssd's fails? At least you can salvage some data from a HDD.

The idea is to never have to do that, so at some point you don't care. As long as you have an image of the array, and losing a few hours of changes is acceptable, you don't have to go salvaging. Or you can RAID 1 that RAID 0 array.
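To put a number on how exposed a wide stripe is, here is a back-of-the-envelope sketch. The 2% per-disk failure chance is a made-up figure, and the formula assumes drives fail independently, which may be optimistic for identical SSDs seeing identical writes.

```python
def raid0_loss_probability(disks, per_disk_failure_prob):
    """Chance that at least one disk in a stripe fails (independent failures)."""
    if not 0.0 <= per_disk_failure_prob <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return 1.0 - (1.0 - per_disk_failure_prob) ** disks


if __name__ == "__main__":
    # Hypothetical 2% chance that any one disk fails over some period.
    for n in (1, 2, 4, 8):
        print(n, "disks:", f"{raid0_loss_probability(n, 0.02):.1%}")
```

Since one dead disk takes the whole stripe with it, the risk grows with every disk added, which is why the image or mirror behind it matters.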

Stratobitz

Legendary

Offline

Activity: 1022
Merit: 1010

Re: Recent downtime and data loss

January 24, 2015, 08:58:53 AM

#57

Quote from: Wendigo on January 24, 2015, 08:26:13 AM

Quote from: Stratobitz on January 24, 2015, 08:08:17 AM

Doesn't raid 0 provide no recovery if 1 of the ssd's fails? At least you can salvage some data from a HDD.

Yes that is correct. But I do work in video production, and I work with a lot of 4k, 5k and 6k lossless media. The 8 disk SSD array allows me to read/write 5.7GB/s (gigabytes).

So my setup for working: ( 8 ) 960 Crucial M500 SSDs - and these mirror to a Raid 5 Array overnight (7200rpm Enterprise Drives).

Dogie is correct, you can't rely on this setup for safely storing important data. But my project files are kept on a smaller drive which is safe raided (Raid 1 SSD) and only the raw media and transcodes (of which I have LTOs and HDD copies of) are stored on this raid while the project is being worked on.

The system is also on a dual 10GbE NIC, so I can push or pull media files from our server RAID at roughly 2 GB per second.

This may sound extreme; but many of my projects can run 3,4,5+ TB in size. So if I need to load up or load off a large project, this setup makes an enormous difference.
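For a sense of scale, the transfer times work out as follows. This is plain arithmetic on the figures quoted in this post, using decimal units (1 TB = 1000 GB) and assuming the rate is sustained for the whole copy. `transfer_minutes` is just an illustrative helper.

```python
def transfer_minutes(size_tb, rate_gb_per_s):
    """Minutes to move size_tb terabytes at a sustained rate (decimal units)."""
    if rate_gb_per_s <= 0:
        raise ValueError("rate must be positive")
    seconds = size_tb * 1000.0 / rate_gb_per_s
    return seconds / 60.0


if __name__ == "__main__":
    project_tb = 5  # upper end of the project sizes mentioned above
    print(f"over the network at 2 GB/s: {transfer_minutes(project_tb, 2.0):.1f} min")
    print(f"on the local array at 5.7 GB/s: {transfer_minutes(project_tb, 5.7):.1f} min")
```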

Strato

Muhammed Zakir

Hero Member

Offline

Activity: 560
Merit: 509

I prefer Zakir over Muhammed when mentioning me!

Re: Recent downtime and data loss

January 24, 2015, 10:27:34 AM

#58

Can Nginx be used? Wouldn't it increase the performance? Any drawbacks when using it?

~~MZ~~

Mitchell

Staff
Legendary

Offline

Activity: 4130
Merit: 2337

Verified awesomeness ✔

Re: Recent downtime and data loss

January 24, 2015, 11:32:06 AM

#59

Quote from: Muhammed Zakir on January 24, 2015, 10:27:34 AM

Can Nginx be used? Wouldn't it increase the performance? Any drawbacks when using it?

~~MZ~~

Nginx is currently used for the forum. I noticed that yesterday when I got a 502 error.
