Bitcoin Forum
Topic: Recent downtime and data loss  (Read 6077 times)
theymos (OP)
Administrator
Legendary
*
Offline Offline

Activity: 5376
Merit: 13348


View Profile
January 23, 2015, 06:03:44 AM
 #1

Due to a failure of the RAID array that bitcointalk.org was running on, the database became corrupted. It was necessary to move the OS and restore the database using a daily backup. About 8 hours of data was lost (anything after about Jan 21 21:44 UTC).

I've been busy getting the forum back online, so I haven't investigated this much yet, but it may be possible for me to manually restore lost posts/PMs by searching the recovered database files for keywords that you remember. If you lost a post or PM that you feel is absolutely irreplaceable, PM me and I'll see if I can recover it.

Everyone's drafts were also all lost. These are not backed up because they're automatically deleted after 14 days anyway.

Search is temporarily disabled because I need to regenerate the search index before it will be usable again.

There will be some periodic downtime over the next few days (a few hours in total) as we get everything reconfigured/settled. A few things might be broken. Tell me if you see any bugs.

If you paid to remove a proxyban and this was not recognized due to the downtime, email the txid to the pbbugs email address and I'll whitelist you right away.

This week's ad stats were lost. The current ads will be up for an extra long time to make up for the downtime and the lost stats.

Sorry for the inconvenience!

Technical details:

The bitcointalk.org and bitcoin.it databases were stored on a RAID 1+0 array: two RAID 1 arrays of 2 SSDs each, joined via RAID 0 (so 4 SSDs total, all the same model). We noticed yesterday that there were some minor file system errors on the bitcoin.it VM, but we took it for a fluke because there were no ongoing problems and the RAID controller reported no disk issues. A few hours later, the bitcointalk.org file system also started experiencing errors. When this was noticed, the bitcointalk.org database files were immediately moved elsewhere, but the RAID array deteriorated rapidly, and most of the database files ended up being too badly corrupted to be used. So a separate OS was set up on a different RAID array, and the database was restored using a daily backup.
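
To make the layout concrete, here is a minimal Python sketch of which disk failures a RAID 1+0 array like this can survive. The disk labels (ssd0 to ssd3) are hypothetical, and the model only captures whole-disk loss, not the silent corruption described above.

```python
from itertools import combinations

# Hypothetical labels for the four SSDs: two RAID 1 mirror pairs striped via RAID 0.
MIRROR_PAIRS = [("ssd0", "ssd1"), ("ssd2", "ssd3")]


def array_survives(failed):
    """RAID 1+0 stays readable only if every mirror pair keeps at least one working disk."""
    failed = set(failed)
    return all(any(disk not in failed for disk in pair) for pair in MIRROR_PAIRS)


def fatal_two_disk_failures():
    """List every two-disk failure combination that destroys the array."""
    disks = [disk for pair in MIRROR_PAIRS for disk in pair]
    return [combo for combo in combinations(disks, 2) if not array_survives(combo)]


if __name__ == "__main__":
    print(array_survives(["ssd0"]))          # one disk lost, its mirror partner covers it
    print(array_survives(["ssd0", "ssd2"]))  # one disk lost from each pair
    print(array_survives(["ssd0", "ssd1"]))  # both halves of one mirror lost
    print(fatal_two_disk_failures())
```

Of the six possible two-disk failures, only the two that take out both halves of the same mirror are fatal, which is the case suspected here.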

My guess is that both of the SSDs in one of the RAID-1 sub-arrays started running out of spare sectors at around the same time. bitcoin.it runs on the same array, and it's been running low on memory for a few weeks, so its use of swap may have been what accelerated the deterioration of these SSDs. The RAID controller still reports no issues with the disks, but I don't see what else could cause this to happen to two distinct VMs. I guess the RAID controller doesn't know how to get the SMART data from these drives. (The drives are fairly old SSDs, so maybe they don't even support SMART.)
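
The reason two mirrored SSDs could wear out together can be sketched roughly as follows: RAID 1 sends every write to both disks in a pair, so identical drives accumulate nearly identical wear. This is a simplified model under stated assumptions. The endurance budget is an illustrative number, the block-to-pair mapping ignores real stripe sizes, and real drives do their own wear levelling.

```python
# Hypothetical model: each SSD tolerates a fixed number of block writes before
# it runs out of spare sectors. The number is illustrative, not a real rating.
ENDURANCE_WRITES = 1000

MIRROR_PAIRS = [("ssd0", "ssd1"), ("ssd2", "ssd3")]


def simulate_writes(block_numbers):
    """Return per-disk write counts for a sequence of logical block writes.

    RAID 0 picks the mirror pair from the block number; RAID 1 then
    writes the block to both disks in that pair.
    """
    wear = {disk: 0 for pair in MIRROR_PAIRS for disk in pair}
    for block in block_numbers:
        pair = MIRROR_PAIRS[block % len(MIRROR_PAIRS)]
        for disk in pair:
            wear[disk] += 1
    return wear


def worn_out(wear):
    """Disks whose write count has reached the assumed endurance budget."""
    return sorted(disk for disk, count in wear.items() if count >= ENDURANCE_WRITES)


if __name__ == "__main__":
    # A skewed workload (for example heavy swap on a few blocks) hits one pair harder.
    workload = [0, 2, 4] * 400 + [1] * 100
    wear = simulate_writes(workload)
    print(wear)
    print(worn_out(wear))
```

In this model both disks of the busy pair cross the budget at the same moment, so the mirror offers no grace period once the first one starts failing.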

I plan on doing more investigation later to make sure that this doesn't happen again. I will probably also set up MySQL replication (or something) to prevent so much data loss in case something similar does happen again.
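
As a rough illustration of why replication would shrink the loss window, the sketch below compares the data lost when restoring from a daily backup with the data lost when a replica is only slightly behind. The failure time is inferred from the "about 8 hours" figure above, and the 5-second replica lag is a purely hypothetical value.

```python
from datetime import datetime, timedelta, timezone


def data_loss_window(last_good_copy, failure_time):
    """Everything written between the last good copy and the failure is lost."""
    return failure_time - last_good_copy


if __name__ == "__main__":
    # From the post: data after about Jan 21 21:44 UTC was lost, roughly 8 hours' worth.
    last_backup = datetime(2015, 1, 21, 21, 44, tzinfo=timezone.utc)
    # Assumption: the failure time is inferred as backup time + 8 hours.
    failure = last_backup + timedelta(hours=8)
    print(data_loss_window(last_backup, failure))

    # Worst case for a daily backup: failure just before the next backup runs.
    print(data_loss_window(last_backup, last_backup + timedelta(hours=24)))

    # Hypothetical replica that lags 5 seconds behind the primary.
    replica_lag = timedelta(seconds=5)
    print(data_loss_window(failure - replica_lag, failure))
```

With daily backups the window can be anywhere up to 24 hours. With asynchronous replication it is bounded by however far the replica lags at the moment of failure.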

On the bright side, the backup worked fairly smoothly. This is the first time I've had to use one of the daily backups for real restoration.

1NXYoJ5xU91Jp83XfVMHwwTUyZFK64BoAD
grendel25
Legendary
*
Offline Offline

Activity: 2296
Merit: 1031



View Profile
January 23, 2015, 06:12:31 AM
 #2

Good job getting it back up.  I'm used to a config where you just hot swap drives that are about to fail.  Planning any changes after this experience?

theymos (OP)
Administrator
Legendary
*
Offline Offline

Activity: 5376
Merit: 13348


View Profile
January 23, 2015, 06:21:27 AM
 #3

Quote from: grendel25
Good job getting it back up.  I'm used to a config where you just hot swap drives that are about to fail.  Planning any changes after this experience?

That's what I expected to happen, but the RAID controller didn't notice that anything was wrong. I still need to figure out why.

1NXYoJ5xU91Jp83XfVMHwwTUyZFK64BoAD
Welsh
Staff
Legendary
*
Offline Offline

Activity: 3304
Merit: 4115


View Profile
January 23, 2015, 06:22:20 AM
 #4

It's good to be back nonetheless. Keep us updated with the investigation. Minimal damage was done, and it was back up in a pretty speedy fashion (considering the nature of the downtime). Well done to you and the team.
Quickseller
Copper Member
Legendary
*
Offline Offline

Activity: 2982
Merit: 2371


View Profile
January 23, 2015, 06:42:44 AM
 #5

On Reddit there was a discussion as to why we are not using something like Amazon AWS for hosting.

Is this because we get free internet from PIA, or are there other drawbacks to using AWS versus our current setup?

CanaryInTheMine
Donator
Legendary
*
Offline Offline

Activity: 2352
Merit: 1060


between a rock and a block!


View Profile
January 23, 2015, 06:42:57 AM
 #6

Some SSD health monitoring might help... SSDs deteriorate over time...
Glad you got it restored!
redsn0w
Legendary
*
Offline Offline

Activity: 1778
Merit: 1043


#Free market


View Profile
January 23, 2015, 07:04:26 AM
 #7

Thanks theymos for the information and good luck.
Wendigo
Legendary
*
Offline Offline

Activity: 2604
Merit: 1036



View Profile
January 23, 2015, 07:06:08 AM
 #8

Glad to see the forum is back up and running after that downtime.
3btc
Full Member
***
Offline Offline

Activity: 224
Merit: 101



View Profile
January 23, 2015, 07:07:18 AM
 #9

So awesome that bitcointalk is back!  :) *yay*

I hope you don't have too big a sleep deficit now  ;)

smoothie
Legendary
*
Offline Offline

Activity: 2492
Merit: 1474


LEALANA Bitcoin Grim Reaper


View Profile
January 23, 2015, 07:15:07 AM
 #10

Was this the longest down time in the past few years?

I can't remember a longer one being an avid poster etc...

Cyrus
Ninja
Administrator
Legendary
*
Online Online

Activity: 3934
Merit: 3146



View Profile
January 23, 2015, 07:26:32 AM
Last edit: January 23, 2015, 07:48:13 AM by Cyrus
 #11

Quote from: smoothie
Was this the longest down time in the past few years?

This was the longest I've experienced: https://bitcointalk.org/index.php?topic=306936

PS: It's good to be back!

Deadstock
Full Member
***
Offline Offline

Activity: 156
Merit: 100


View Profile
January 23, 2015, 07:28:22 AM
 #12

I was so bored with BCT down all day at work  ;D
IamCANADIAN013
Hero Member
*****
Offline Offline

Activity: 714
Merit: 503



View Profile
January 23, 2015, 07:29:56 AM
 #13

Thank you for your hard work getting the forum back up theymos, much appreciated!

Gotta admit, I was starting to worry with it being down for so long.
fairglu
Legendary
*
Offline Offline

Activity: 1100
Merit: 1032


View Profile WWW
January 23, 2015, 07:45:07 AM
 #14

Quote from: theymos
That's what I expected to happen, but the RAID controller didn't notice that anything was wrong. I still need to figure out why.

IME the weak point of RAID is usually the controller: it's the non-redundant part of a redundant array :/
(be it because it's plain corrupting data, doing unnecessary I/O and wearing out the disks... or just failing to report errors)

haploid23
Legendary
*
Offline Offline

Activity: 812
Merit: 1002



View Profile WWW
January 23, 2015, 07:53:15 AM
 #15

What SSDs are you running these on? Some SSDs are pure garbage in reliability, and OCZ is notorious for this. You can't really go wrong with Intel ones, although there was one series of Intel that had some random bricking.

Also, you mentioned that these are fairly old SSDs. Just noting that SSDs have a lifetime and do "expire", not by age but by how much has been written to them. I don't know much about server configuration, but if there is intensive writing on the SSDs, especially ones with MLC chips, they wear out much sooner.
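
As a back-of-the-envelope illustration of wear by writes rather than by age, the sketch below estimates remaining lifetime from an endurance rating and a daily write volume. All figures are hypothetical, not specs for any real drive, and the function name is made up for this example.

```python
def estimated_lifetime_days(rated_tb_written, tb_already_written, tb_written_per_day):
    """Days left until the drive reaches its rated write endurance.

    All arguments are in terabytes (per day for the last one). The rating
    is whatever endurance figure the manufacturer publishes for the model.
    """
    if tb_written_per_day <= 0:
        raise ValueError("tb_written_per_day must be positive")
    remaining_tb = max(rated_tb_written - tb_already_written, 0)
    return remaining_tb / tb_written_per_day


if __name__ == "__main__":
    # Hypothetical drive rated for 100 TB of writes with 40 TB already used.
    print(estimated_lifetime_days(100, 40, 0.05))  # light desktop-style load
    print(estimated_lifetime_days(100, 40, 0.5))   # heavy database or swap load
```

The same drive lasts ten times longer under the lighter load, which is why write volume matters more than calendar age.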

smoothie
Legendary
*
Offline Offline

Activity: 2492
Merit: 1474


LEALANA Bitcoin Grim Reaper


View Profile
January 23, 2015, 07:54:43 AM
 #16

Glad to see the forum is up and running. Thanks Theymos.

haploid23
Legendary
*
Offline Offline

Activity: 812
Merit: 1002



View Profile WWW
January 23, 2015, 07:55:42 AM
 #17

I think I lost a few PM's, but nothing crucial.

johnyj
Legendary
*
Offline Offline

Activity: 1988
Merit: 1012


Beyond Imagination


View Profile
January 23, 2015, 08:00:07 AM
 #18

I've had several SSDs break on a server. I think SSDs are not good at handling a large amount of IO for a server. My traditional hard drive RAID never failed on the same server.

haploid23
Legendary
*
Offline Offline

Activity: 812
Merit: 1002



View Profile WWW
January 23, 2015, 08:12:58 AM
 #19

Quote from: johnyj
not good at handling large amount of IO for server.

That's what I was thinking too. Same reason why you shouldn't defrag an SSD on a normal desktop.

twister
Hero Member
*****
Offline Offline

Activity: 672
Merit: 502



View Profile WWW
January 23, 2015, 08:14:44 AM
 #20

SSDs are not reliable but then again, HDDs aren't reliable either.
I lost all my posts from yesterday.  :-\

 
