theymos (OP)
Administrator
Legendary
Offline
Activity: 5404
Merit: 13498
|
|
January 23, 2015, 06:03:44 AM |
|
Due to a failure of the RAID array that bitcointalk.org was running on, the database became corrupted. It was necessary to move the OS and restore the database using a daily backup. About 8 hours of data was lost (anything after about Jan 21 21:44 UTC).
I've been busy getting the forum back online, so I haven't investigated this much yet, but it may be possible for me to manually restore lost posts/PMs by searching the recovered database files for keywords that you remember. If you lost a post or PM that you feel is absolutely irreplaceable, PM me and I'll see if I can recover it.
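For readers curious what a keyword search over damaged database files could look like, here is a minimal Python sketch. It is only an illustration under assumptions, not the procedure actually used on the forum: the function name `find_keyword` and its parameters are hypothetical, and it simply treats a file as raw bytes and reports the text around each match.

```python
import mmap


def find_keyword(path, keyword, context=60):
    """Return (byte_offset, snippet) pairs for each occurrence of keyword.

    The file is treated as raw bytes, so this works on files that are too
    corrupted to be opened by the database itself. Note that mmap cannot
    map an empty file, so callers should skip zero-length files.
    """
    needle = keyword.encode("utf-8")
    hits = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            while pos != -1:
                start = max(0, pos - context)
                end = min(len(mm), pos + len(needle) + context)
                # Undecodable bytes are replaced so the snippet stays printable.
                snippet = mm[start:end].decode("utf-8", errors="replace")
                hits.append((pos, snippet))
                pos = mm.find(needle, pos + 1)
    return hits
```

A distinctive phrase remembered from the lost post makes a much better keyword than a common word, since every match has to be inspected by hand.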
Everyone's drafts were also all lost. These are not backed up because they're automatically deleted after 14 days anyway.
Search is temporarily disabled because I need to regenerate the search index before it will be usable again.
There will be some periodic downtime over the next few days (a few hours in total) as we get everything reconfigured/settled. A few things might be broken. Tell me if you see any bugs.
If you paid to remove a proxyban and this was not recognized due to the downtime, email the txid to the pbbugs email address and I'll whitelist you right away.
This week's ad stats were lost. The current ads will be up for an extra long time to make up for the downtime and the lost stats.
Sorry for the inconvenience!
Technical details:
The bitcointalk.org and bitcoin.it databases were stored on a RAID 1+0 array: two RAID 1 arrays of 2 SSDs each, joined via RAID 0 (so 4 SSDs total, all the same model). We noticed yesterday that there were some minor file system errors on the bitcoin.it VM, but we took it for a fluke because there were no ongoing problems and the RAID controller reported no disk issues. A few hours later, the bitcointalk.org file system also started experiencing errors. When this was noticed, the bitcointalk.org database files were immediately moved elsewhere, but the RAID array deteriorated rapidly, and most of the database files ended up being too badly corrupted to be used. So a separate OS was set up on a different RAID array, and the database was restored using a daily backup.
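As an illustration of the RAID 1+0 layout described above, the following Python sketch enumerates which disk failures such an array can survive. The disk names are hypothetical, and the model only covers complete disk failures, not the silent corruption seen here.

```python
from itertools import combinations

# Hypothetical disk names. The layout follows the description above:
# two RAID 1 mirrors of 2 SSDs each, striped together with RAID 0.
MIRRORS = [("ssd0", "ssd1"), ("ssd2", "ssd3")]


def array_survives(failed):
    """RAID 1+0 survives as long as every mirror keeps one working disk."""
    failed = set(failed)
    return all(any(d not in failed for d in mirror) for mirror in MIRRORS)


def fatal_failure_sets(k):
    """List every combination of k failed disks that destroys the array."""
    disks = [d for mirror in MIRRORS for d in mirror]
    return [set(c) for c in combinations(disks, k) if not array_survives(c)]


if __name__ == "__main__":
    print(fatal_failure_sets(1))
    print(fatal_failure_sets(2))
```

Under this model any single disk can fail safely, but 2 of the 6 possible two-disk failures are fatal: the ones where both disks belong to the same mirror. Because RAID 0 stripes data across both mirrors, losing one mirror damages everything stored on the array.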
My guess is that both of the SSDs in one of the RAID-1 sub-arrays started running out of spare sectors at around the same time. bitcoin.it runs on the same array, and it's been running low on memory for a few weeks, so its use of swap may have been what accelerated the deterioration of these SSDs. The RAID controller still reports no issues with the disks, but I don't see what else could cause this to happen to two distinct VMs. I guess the RAID controller doesn't know how to get the SMART data from these drives. (The drives are fairly old SSDs, so maybe they don't even support SMART.)
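If the drives do expose SMART data, wear can be checked independently of the RAID controller. The sketch below is one hedged way to do it in Python: it parses the attribute table printed by `smartctl -A` (from smartmontools) and flags wear-related attributes whose normalized value is close to the vendor threshold. The attribute names in `WEAR_ATTRIBUTES` vary by vendor and are an assumption, the sample output is made up for illustration and does not come from the forum's drives, and disks behind a hardware RAID controller may need smartctl's `-d` device-type option to be reachable at all.

```python
# Wear-related attribute names differ between SSD vendors; this set is an
# assumption and should be adjusted to the drive model in question.
WEAR_ATTRIBUTES = {
    "Reallocated_Sector_Ct",
    "Wear_Leveling_Count",
    "Media_Wearout_Indicator",
    "Available_Reservd_Space",
}

# Made-up example of `smartctl -A` output, for illustration only.
SAMPLE_OUTPUT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   012   012   010    Pre-fail  Always       -       1843
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       24510
177 Wear_Leveling_Count     0x0013   100   100   005    Pre-fail  Always       -       3
"""


def parse_smart_attributes(text):
    """Parse the attribute table into {name: {value, worst, thresh, raw}}."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if (len(fields) >= 10 and fields[0].isdigit()
                and all(f.isdigit() for f in fields[3:6])):
            attrs[fields[1]] = {
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "raw": fields[9],
            }
    return attrs


def wear_warnings(attrs, margin=10):
    """Return wear attributes whose normalized value is near its threshold.

    Normalized values count down toward the threshold as the drive ages.
    """
    return sorted(
        name for name, a in attrs.items()
        if name in WEAR_ATTRIBUTES and a["value"] <= a["thresh"] + margin
    )


if __name__ == "__main__":
    print(wear_warnings(parse_smart_attributes(SAMPLE_OUTPUT)))
```

Running a check like this from a scheduled job, and alerting when the list is non-empty, gives a warning that does not depend on the controller noticing the problem.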
I plan on doing more investigation later to make sure that this doesn't happen again. I will probably also set up MySQL replication (or something) to prevent so much data loss in case something similar does happen again.
On the bright side, the backup worked fairly smoothly. This is the first time I've had to use one of the daily backups for real restoration.
|
1NXYoJ5xU91Jp83XfVMHwwTUyZFK64BoAD
|
|
|
grendel25
Legendary
Offline
Activity: 2296
Merit: 1031
|
|
January 23, 2015, 06:12:31 AM |
|
Good job getting it back up. I'm used to a config where you just hot swap drives that are about to fail. Planning any changes after this experience?
|
|
|
|
theymos (OP)
Administrator
Legendary
Offline
Activity: 5404
Merit: 13498
|
|
January 23, 2015, 06:21:27 AM |
|
Good job getting it back up. I'm used to a config where you just hot swap drives that are about to fail. Planning any changes after this experience?
That's what I expected to happen, but the RAID controller didn't notice that anything was wrong. I still need to figure out why.
|
|
|
|
Welsh
Staff
Legendary
Offline
Activity: 3318
Merit: 4116
|
|
January 23, 2015, 06:22:20 AM |
|
It's good to be back nonetheless, keep us updated with the investigation. Minimal damage was done, and it was back up in a pretty speedy fashion (considering the nature of the downtime), well done to you and the team.
|
|
|
|
Quickseller
Copper Member
Legendary
Offline
Activity: 2996
Merit: 2374
|
|
January 23, 2015, 06:42:44 AM |
|
On Reddit there was a discussion as to why we are not using something like Amazon AWS for hosting.
Is this because we get free internet from PIA, or are there other drawbacks to using AWS versus our current setup?
|
|
|
|
CanaryInTheMine
Donator
Legendary
Offline
Activity: 2352
Merit: 1060
between a rock and a block!
|
|
January 23, 2015, 06:42:57 AM |
|
Some SSD health monitoring might help... SSDs deteriorate over time... Glad you got it restored!
|
|
|
|
redsn0w
Legendary
Offline
Activity: 1778
Merit: 1043
#Free market
|
|
January 23, 2015, 07:04:26 AM |
|
Thanks theymos for the information and good luck.
|
|
|
|
Wendigo
Legendary
Offline
Activity: 2604
Merit: 1036
|
|
January 23, 2015, 07:06:08 AM |
|
Glad to see the forum is back up and running after that downtime.
|
|
|
|
3btc
|
|
January 23, 2015, 07:07:18 AM |
|
So awesome that bitcointalk is back! *yay* I hope you don't have too big a sleep deficit now
|
|
|
|
smoothie
Legendary
Offline
Activity: 2492
Merit: 1491
LEALANA Bitcoin Grim Reaper
|
|
January 23, 2015, 07:15:07 AM |
|
Was this the longest down time in the past few years?
I can't remember a longer one being an avid poster etc...
|
| . ★☆ WWW.LEALANA.COM My PGP fingerprint is A764D833. History of Monero development Visualization ★☆ . LEALANA BITCOIN GRIM REAPER SILVER COINS. |
|
|
|
Cyrus
Ninja
Administrator
Legendary
Offline
Activity: 3990
Merit: 3212
|
|
January 23, 2015, 07:26:32 AM Last edit: January 23, 2015, 07:48:13 AM by Cyrus |
|
Was this the longest down time in the past few years?
This was the longest I've experienced: https://bitcointalk.org/index.php?topic=306936
PS: It's good to be back!
|
|
|
|
Deadstock
|
|
January 23, 2015, 07:28:22 AM |
|
I was so bored with BCT down all day at work
|
|
|
|
IamCANADIAN013
|
|
January 23, 2015, 07:29:56 AM |
|
Thank you for your hard work getting the forum back up theymos, much appreciated!
Gotta admit, I was starting to worry with it being down for so long.
|
|
|
|
fairglu
Legendary
Offline
Activity: 1100
Merit: 1032
|
|
January 23, 2015, 07:45:07 AM |
|
That's what I expected to happen, but the RAID controller didn't notice that anything was wrong. I still need to figure out why.
IME the weak point of RAID is usually the controller: it's the non-redundant part of a redundant array :/ (be it because it's plainly corrupting data, doing unnecessary I/O and wearing out the disks... or just failing to report errors)
|
|
|
|
haploid23
Legendary
Offline
Activity: 812
Merit: 1002
|
|
January 23, 2015, 07:53:15 AM |
|
What SSDs are you running these on? Some SSDs are pure garbage in reliability, and OCZ is notorious for this. You can't really go wrong with Intel ones, although there was one series of Intel that had some random bricking.
Also you mentioned that these are fairly old SSDs. Just noting that SSDs have a lifetime and do "expire", not by age but by how much is written to them. I don't know much about server configuration, but if there is intensive writing on the SSDs, especially ones with MLC chips, they wear out much sooner.
|
|
|
|
smoothie
Legendary
Offline
Activity: 2492
Merit: 1491
LEALANA Bitcoin Grim Reaper
|
|
January 23, 2015, 07:54:43 AM |
|
Glad to see the forum is up and running. Thanks Theymos.
|
|
|
|
haploid23
Legendary
Offline
Activity: 812
Merit: 1002
|
|
January 23, 2015, 07:55:42 AM |
|
I think I lost a few PMs, but nothing crucial.
|
|
|
|
johnyj
Legendary
Offline
Activity: 1988
Merit: 1012
Beyond Imagination
|
|
January 23, 2015, 08:00:07 AM |
|
I've had several SSDs fail on a server. I think SSDs are not good at handling a large amount of I/O on a server. My traditional hard drive RAID never failed on the same server.
|
|
|
|
haploid23
Legendary
Offline
Activity: 812
Merit: 1002
|
|
January 23, 2015, 08:12:58 AM |
|
not good at handling large amount of IO for server.
That's what I was thinking too. Same reason why you shouldn't defrag an SSD on a normal desktop.
|
|
|
|
twister
|
|
January 23, 2015, 08:14:44 AM |
|
SSDs are not reliable but then again, HDDs aren't reliable either. I lost all my posts from yesterday.
|
|
|
|
|