Bitcoin Forum
Author Topic: Recent downtime and data loss  (Read 5814 times)
theymos
Administrator
Legendary
*
Offline Offline

Activity: 3388
Merit: 5538


View Profile
January 23, 2015, 06:03:44 AM
 #1

Due to a failure of the RAID array that bitcointalk.org was running on, the database became corrupted. It was necessary to move the OS and restore the database using a daily backup. About 8 hours of data was lost (anything after about Jan 21 21:44 UTC).

I've been busy getting the forum back online, so I haven't investigated this much yet, but it may be possible for me to manually restore lost posts/PMs by searching the recovered database files for keywords that you remember. If you lost a post or PM that you feel is absolutely irreplaceable, PM me and I'll see if I can recover it.

Everyone's drafts were also all lost. These are not backed up because they're automatically deleted after 14 days anyway.

Search is temporarily disabled because I need to regenerate the search index before it will be usable again.

There will be some periodic downtime over the next few days (a few hours in total) as we get everything reconfigured/settled. A few things might be broken. Tell me if you see any bugs.

If you paid to remove a proxyban and this was not recognized due to the downtime, email the txid to the pbbugs email address and I'll whitelist you right away.

This week's ad stats were lost. The current ads will be up for an extra long time to make up for the downtime and the lost stats.

Sorry for the inconvenience!

Technical details:

The bitcointalk.org and bitcoin.it databases were stored on a RAID 1+0 array: two RAID 1 arrays of 2 SSDs each, joined via RAID 0 (so 4 SSDs total, all the same model). We noticed yesterday that there were some minor file system errors on the bitcoin.it VM, but we took it for a fluke because there were no ongoing problems and the RAID controller reported no disk issues. A few hours later, the bitcointalk.org file system also started experiencing errors. When this was noticed, the bitcointalk.org database files were immediately moved elsewhere, but the RAID array deteriorated rapidly, and most of the database files ended up being too badly corrupted to be used. So a separate OS was set up on a different RAID array, and the database was restored using a daily backup.
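
To make the layout concrete, here is a minimal sketch of which disk failures a RAID 1+0 array of this shape can absorb. The disk names are made up for illustration; only the structure (two mirrors of two SSDs, striped together) comes from the description above.

```python
from itertools import combinations

# Layout from the post: two RAID 1 mirrors of 2 SSDs each, joined via RAID 0.
# The disk names are hypothetical.
MIRRORS = [("ssd0", "ssd1"), ("ssd2", "ssd3")]


def array_survives(failed, mirrors=MIRRORS):
    """RAID 1+0 keeps its data only if every mirror still has a healthy disk."""
    failed = set(failed)
    return all(any(disk not in failed for disk in mirror) for mirror in mirrors)


def fatal_pairs(mirrors=MIRRORS):
    """List the two-disk failures that take the whole array down."""
    disks = [disk for mirror in mirrors for disk in mirror]
    return [pair for pair in combinations(disks, 2)
            if not array_survives(pair, mirrors)]


if __name__ == "__main__":
    print(array_survives({"ssd0"}))          # True: one disk lost
    print(array_survives({"ssd0", "ssd2"}))  # True: one disk from each mirror
    print(array_survives({"ssd0", "ssd1"}))  # False: both disks of one mirror
    print(fatal_pairs())                     # the 2 fatal pairs out of 6 possible
```

Of the six possible two-disk failures, only the two that hit both halves of the same mirror are fatal, and because RAID 0 stripes data across both mirrors, losing one mirror damages everything stored on the array.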

My guess is that both of the SSDs in one of the RAID-1 sub-arrays started running out of spare sectors at around the same time. bitcoin.it runs on the same array, and it's been running low on memory for a few weeks, so its use of swap may have been what accelerated the deterioration of these SSDs. The RAID controller still reports no issues with the disks, but I don't see what else could cause this to happen to two distinct VMs. I guess the RAID controller doesn't know how to get the SMART data from these drives. (The drives are fairly old SSDs, so maybe they don't even support SMART.)
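
If the drives do expose SMART data, one way to check for exhausted spare sectors without relying on the RAID controller is to read the attribute table directly. The sketch below parses text in the table layout printed by `smartctl -A`; the sample report and all of its values are made up, and behind a hardware RAID controller smartctl may need a controller-specific device type option to reach the individual disks at all.

```python
# Hypothetical sample in the table layout printed by `smartctl -A`.
# Every value here is invented for illustration.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   004   004   010    Pre-fail  Always   FAILING_NOW 2040
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       38211
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       57
"""


def failing_attributes(report):
    """Return (name, value, threshold) for attributes at or below their threshold."""
    failing = []
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, value, thresh = fields[1], int(fields[3]), int(fields[5])
        # A threshold of 0 can never be crossed, so those rows are skipped.
        if thresh > 0 and value <= thresh:
            failing.append((name, value, thresh))
    return failing


if __name__ == "__main__":
    print(failing_attributes(SAMPLE))  # [('Reallocated_Sector_Ct', 4, 10)]
```

Running something like this from cron against each physical disk would give a warning that does not depend on what the controller chooses to report.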

I plan on doing more investigation later to make sure that this doesn't happen again. I will probably also set up MySQL replication (or something) to prevent so much data loss in case something similar does happen again.
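
As a rough illustration of why replication would shrink the loss window: the data lost in a failure is whatever was written after the last good copy. The sketch below uses the backup time given above; the failure time is an assumption chosen to match the "about 8 hours" figure, and the 30-second replica lag is purely hypothetical.

```python
from datetime import datetime, timedelta, timezone

# From the post: the restored backup covers data up to about Jan 21 21:44 UTC.
LAST_BACKUP = datetime(2015, 1, 21, 21, 44, tzinfo=timezone.utc)


def data_lost(failure_time, last_copy_time):
    """Everything written after the last good copy is lost (never negative)."""
    return max(failure_time - last_copy_time, timedelta(0))


if __name__ == "__main__":
    # Assumed failure time, picked to match the "about 8 hours" in the post.
    failure = LAST_BACKUP + timedelta(hours=8)

    # Daily backup: the last good copy is the backup itself.
    print(data_lost(failure, LAST_BACKUP))  # 8:00:00

    # Hypothetical replica that was 30 seconds behind when the primary failed.
    replica_state = failure - timedelta(seconds=30)
    print(data_lost(failure, replica_state))  # 0:00:30
```

With daily backups alone the worst case is a full day of lost data; with a replica the loss is bounded by however far it lags behind the primary.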

On the bright side, the backup worked fairly smoothly. This is the first time I've had to use one of the daily backups for real restoration.

1NXYoJ5xU91Jp83XfVMHwwTUyZFK64BoAD
grendel25
Legendary
*
Offline Offline

Activity: 1414
Merit: 1024



View Profile
January 23, 2015, 06:12:31 AM
 #2

Good job getting it back up.  I'm used to a config where you just hot swap drives that are about to fail.  Planning any changes after this experience?

theymos
Administrator
Legendary
*
Offline Offline

Activity: 3388
Merit: 5538


View Profile
January 23, 2015, 06:21:27 AM
 #3

Quote from: grendel25 on January 23, 2015, 06:12:31 AM
Good job getting it back up.  I'm used to a config where you just hot swap drives that are about to fail.  Planning any changes after this experience?

That's what I expected to happen, but the RAID controller didn't notice that anything was wrong. I still need to figure out why.

1NXYoJ5xU91Jp83XfVMHwwTUyZFK64BoAD
Welsh
Staff
Legendary
*
Offline Offline

Activity: 1596
Merit: 1446



View Profile
January 23, 2015, 06:22:20 AM
 #4

It's good to be back nonetheless; keep us updated with the investigation. Minimal damage was done, and it was back up pretty quickly (considering the nature of the downtime). Well done to you and the team.

Quickseller
Copper Member
Legendary
*
Offline Offline

Activity: 1792
Merit: 1564


View Profile WWW
January 23, 2015, 06:42:44 AM
 #5

On reddit there was a discussion as to why we are not using something like Amazon AWS for hosting.

Is this because we get free internet from PIA, or are there other drawbacks to using AWS versus our current setup?

NOTBanned from displaying signatures until May 20, 2022, 11:26:45 PM
Don’t Plagiarize, it’s dishonest and you *will* get caught
CanaryInTheMine
Donator
Legendary
*
Offline Offline

Activity: 2044
Merit: 1005


between a rock and a block!


View Profile
January 23, 2015, 06:42:57 AM
 #6

Some SSD health monitoring might help... SSDs deteriorate over time...
Glad you got it restored!
redsn0w
Legendary
*
Offline Offline

Activity: 1610
Merit: 1024


#Free market


View Profile
January 23, 2015, 07:04:26 AM
 #7

Thanks theymos for the information and good luck.
Wendigo
Legendary
*
Offline Offline

Activity: 2016
Merit: 1029


View Profile
January 23, 2015, 07:06:08 AM
 #8

Glad to see the forum is back up and running after that downtime.
3btc
Full Member
***
Offline Offline

Activity: 182
Merit: 100



View Profile
January 23, 2015, 07:07:18 AM
 #9

So awesome that bitcointalk is back! :) *yay*

I hope you don't have too big a sleep deficit now ;)
smoothie
Legendary
*
Offline Offline

Activity: 2128
Merit: 1016


LEALANA Monero Physical Silver Coins


View Profile
January 23, 2015, 07:15:07 AM
 #10

Was this the longest down time in the past few years?

I can't remember a longer one being an avid poster etc...

Cyrus
Ninja
Administrator
Legendary
*
Offline Offline

Activity: 2254
Merit: 1118



View Profile
January 23, 2015, 07:26:32 AM
Last edit: January 23, 2015, 07:48:13 AM by Cyrus
 #11

Quote from: smoothie on January 23, 2015, 07:15:07 AM
Was this the longest down time in the past few years?

This was the longest I've experienced: https://bitcointalk.org/index.php?topic=306936

PS: It's good to be back!

Deadstock
Full Member
***
Offline Offline

Activity: 157
Merit: 100


View Profile
January 23, 2015, 07:28:22 AM
 #12

I was so bored with BCT down all day at work ;D
IamCANADIAN013
Hero Member
*****
Offline Offline

Activity: 714
Merit: 502



View Profile
January 23, 2015, 07:29:56 AM
 #13

Thank you for your hard work getting the forum back up theymos, much appreciated!

Gotta admit, I was starting to worry with it being down for so long.
fairglu
Legendary
*
Offline Offline

Activity: 1096
Merit: 1025


View Profile WWW
January 23, 2015, 07:45:07 AM
 #14

Quote from: theymos on January 23, 2015, 06:21:27 AM
That's what I expected to happen, but the RAID controller didn't notice that anything was wrong. I still need to figure out why.

IME the weak point of RAID is usually the controller: it's the non-redundant part of a redundant array :/
(be it because it's plain corrupting data, doing unnecessary I/O and wearing the disks... or just failing to report errors)

haploid23
Legendary
*
Offline Offline

Activity: 812
Merit: 1002



View Profile WWW
January 23, 2015, 07:53:15 AM
 #15

What SSDs are you running these on? Some SSDs are pure garbage in reliability, and OCZ is notorious for this. You can't really go wrong with Intel ones, although there was one series of Intel that had some random bricking.

Also you mentioned that these are fairly old SSDs. Just noting that SSDs have a lifetime and do "expire", not by age but by how much is written to them. I don't know much about server configuration, but if there is intensive writing on the SSDs, especially MLC chips, they wear out much sooner.
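
As a back-of-the-envelope sketch of that point, lifetime can be estimated from a drive's rated write endurance (usually quoted as TBW, terabytes written) and the daily write volume. All numbers below are hypothetical; nothing in this thread says what the actual drives are rated for or how much the servers write.

```python
def years_until_worn_out(rated_tbw, gb_written_per_day):
    """Rough estimate: rated terabytes written / terabytes written per year."""
    tb_per_year = gb_written_per_day * 365 / 1000
    return rated_tbw / tb_per_year


if __name__ == "__main__":
    # Hypothetical drive rated for 73 TBW.
    print(years_until_worn_out(73, 20))   # light desktop-style load
    print(years_until_worn_out(73, 200))  # heavy database/swap load
```

Ten times the daily writes means a tenth of the lifetime, so a busy database plus a VM that is swapping can burn through a drive far faster than a desktop would.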

smoothie
Legendary
*
Offline Offline

Activity: 2128
Merit: 1016


LEALANA Monero Physical Silver Coins


View Profile
January 23, 2015, 07:54:43 AM
 #16

Glad to see the forum is up and running. Thanks Theymos.

haploid23
Legendary
*
Offline Offline

Activity: 812
Merit: 1002



View Profile WWW
January 23, 2015, 07:55:42 AM
 #17

I think I lost a few PM's, but nothing crucial.

johnyj
Legendary
*
Offline Offline

Activity: 1848
Merit: 1000


Beyond Imagination


View Profile
January 23, 2015, 08:00:07 AM
 #18

I've had several SSDs fail on a server. I think SSDs are not good at handling large amounts of I/O on a server. My traditional hard drive RAID never failed on the same server.

haploid23
Legendary
*
Offline Offline

Activity: 812
Merit: 1002



View Profile WWW
January 23, 2015, 08:12:58 AM
 #19

Quote from: johnyj on January 23, 2015, 08:00:07 AM
not good at handling large amount of IO for server.

That's what I was thinking too. Same reason why you shouldn't defrag an SSD on a normal desktop.

twister
Hero Member
*****
Offline Offline

Activity: 672
Merit: 501



View Profile WWW
January 23, 2015, 08:14:44 AM
 #20

SSDs are not reliable but then again, HDDs aren't reliable either.
I lost all my posts from yesterday. :-\

 
