Bitcoin Forum

Other => Meta => Topic started by: theymos on January 23, 2015, 06:03:44 AM



Title: Recent downtime and data loss
Post by: theymos on January 23, 2015, 06:03:44 AM
Due to a failure of the RAID array that bitcointalk.org was running on, the database became corrupted. It was necessary to move the OS and restore the database using a daily backup. About 8 hours of data was lost (anything after about Jan 21 21:44 UTC).

I've been busy getting the forum back online, so I haven't investigated this much yet, but it may be possible for me to manually restore lost posts/PMs by searching the recovered database files for keywords that you remember. If you lost a post or PM that you feel is absolutely irreplaceable, PM me and I'll see if I can recover it.

Everyone's drafts were also all lost. These are not backed up because they're automatically deleted after 14 days anyway.

Search is temporarily disabled because I need to regenerate the search index before it will be usable again.

There will be some periodic downtime over the next few days (a few hours in total) as we get everything reconfigured/settled. A few things might be broken. Tell me if you see any bugs.

If you paid to remove a proxyban and this was not recognized due to the downtime, email the txid to the pbbugs email address and I'll whitelist you right away.

This week's ad stats were lost. The current ads will be up for an extra long time to make up for the downtime and the lost stats.

Sorry for the inconvenience!

Technical details:

The bitcointalk.org and bitcoin.it databases were stored on a RAID 1+0 array: two RAID 1 arrays of 2 SSDs each, joined via RAID 0 (so 4 SSDs total, all the same model). We noticed yesterday that there were some minor file system errors on the bitcoin.it VM, but we took it for a fluke because there were no ongoing problems and the RAID controller reported no disk issues. A few hours later, the bitcointalk.org file system also started experiencing errors. When this was noticed, the bitcointalk.org database files were immediately moved elsewhere, but the RAID array deteriorated rapidly, and most of the database files ended up being too badly corrupted to be used. So a separate OS was set up on a different RAID array, and the database was restored using a daily backup.

My guess is that both of the SSDs in one of the RAID-1 sub-arrays started running out of spare sectors at around the same time. bitcoin.it runs on the same array, and it's been running low on memory for a few weeks, so its use of swap may have been what accelerated the deterioration of these SSDs. The RAID controller still reports no issues with the disks, but I don't see what else could cause this to happen to two distinct VMs. I guess the RAID controller doesn't know how to get the SMART data from these drives. (The drives are fairly old SSDs, so maybe they don't even support SMART.)
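To make the failure mode above concrete, here is a minimal Python sketch (not the forum's actual tooling) that enumerates drive failures in a 4-disk RAID 1+0 layout like the one described. The disk names and the `array_survives` helper are hypothetical; the only assumption is that the array stays readable as long as every RAID 1 pair still has at least one healthy disk.

```python
from itertools import combinations

# Hypothetical labels for the layout described above:
# two RAID 1 mirrors of two SSDs each, striped together with RAID 0.
MIRRORS = [("ssd0", "ssd1"), ("ssd2", "ssd3")]
ALL_DISKS = [disk for pair in MIRRORS for disk in pair]


def array_survives(failed):
    """The stripe is readable only if every mirror has a healthy disk left."""
    failed = set(failed)
    return all(any(disk not in failed for disk in pair) for pair in MIRRORS)


def fatal_combinations(n_failures):
    """List every set of n_failures disks whose loss takes out the array."""
    return [combo for combo in combinations(ALL_DISKS, n_failures)
            if not array_survives(combo)]


if __name__ == "__main__":
    for n in range(1, len(ALL_DISKS) + 1):
        total = len(list(combinations(ALL_DISKS, n)))
        fatal = fatal_combinations(n)
        print(f"{n} failed disk(s): {len(fatal)} of {total} combinations are fatal")
```

Any single failure is survivable in this model, but two of the six possible double failures are fatal: exactly the case guessed at above, where both SSDs of one mirror wear out together.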

I plan on doing more investigation later to make sure that this doesn't happen again. I will probably also set up MySQL replication (or something) to prevent so much data loss in case something similar does happen again.
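As a rough illustration of why replication would shrink the loss window, the sketch below compares the roughly 8 hours lost with the daily backup against a replica that lags only slightly behind. The failure time is an estimate derived from the "about 8 hours" figure above, and the 30-second replica lag is purely hypothetical.

```python
from datetime import datetime, timedelta, timezone

# From the post: the restored backup covers data up to about this time.
LAST_BACKUP = datetime(2015, 1, 21, 21, 44, tzinfo=timezone.utc)
# Assumption: "about 8 hours" of loss puts the failure roughly here.
ASSUMED_FAILURE = LAST_BACKUP + timedelta(hours=8)


def data_loss_window(last_good_copy, failure_time):
    """Everything written between the last good copy and the failure is lost."""
    return max(failure_time - last_good_copy, timedelta(0))


if __name__ == "__main__":
    backup_loss = data_loss_window(LAST_BACKUP, ASSUMED_FAILURE)
    # Hypothetical replica that was 30 seconds behind when the array failed.
    replica_copy = ASSUMED_FAILURE - timedelta(seconds=30)
    replica_loss = data_loss_window(replica_copy, ASSUMED_FAILURE)
    print("loss with daily backup only:", backup_loss)
    print("loss with a lagging replica:", replica_loss)
```

A replica narrows the window but does not replace backups: a bad write or an accidental delete is replicated just as faithfully as a good one.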

On the bright side, the backup worked fairly smoothly. This is the first time I've had to use one of the daily backups for real restoration.


Title: Re: Recent downtime and data loss
Post by: grendel25 on January 23, 2015, 06:12:31 AM
Good job getting it back up.  I'm used to a config where you just hot swap drives that are about to fail.  Planning any changes after this experience?


Title: Re: Recent downtime and data loss
Post by: theymos on January 23, 2015, 06:21:27 AM
Good job getting it back up.  I'm used to a config where you just hot swap drives that are about to fail.  Planning any changes after this experience?

That's what I expected to happen, but the RAID controller didn't notice that anything was wrong. I still need to figure out why.


Title: Re: Recent downtime and data loss
Post by: Welsh on January 23, 2015, 06:22:20 AM
It's good to be back nonetheless, keep us updated with the investigation. Minimal damage was done, and it was back up in a pretty speedy fashion (considering the nature of the downtime), well done to you and the team.


Title: Re: Recent downtime and data loss
Post by: Quickseller on January 23, 2015, 06:42:44 AM
On reddit there was a discussion as to why we are not using something like Amazon AWS for hosting.

Is this because we get free internet from PIA, or are there other drawbacks to using AWS versus our current setup?


Title: Re: Recent downtime and data loss
Post by: CanaryInTheMine on January 23, 2015, 06:42:57 AM
Some SSD drive health monitoring might help... SSD drives deteriorate over time...
Glad you got it restored!


Title: Re: Recent downtime and data loss
Post by: redsn0w on January 23, 2015, 07:04:26 AM
Thanks theymos for the information and good luck.


Title: Re: Recent downtime and data loss
Post by: Wendigo on January 23, 2015, 07:06:08 AM
Glad to see the forum is back up and running after that downtime.


Title: Re: Recent downtime and data loss
Post by: 3btc on January 23, 2015, 07:07:18 AM
So awesome that bitcointalk is back!  :) *yay*

I hope you don't have a too big sleep deficiency now  ;)


Title: Re: Recent downtime and data loss
Post by: smoothie on January 23, 2015, 07:15:07 AM
Was this the longest down time in the past few years?

I can't remember a longer one being an avid poster etc...


Title: Re: Recent downtime and data loss
Post by: Cyrus on January 23, 2015, 07:26:32 AM
Was this the longest down time in the past few years?

This was the longest I've experienced: https://bitcointalk.org/index.php?topic=306936

PS: It's good to be back!


Title: Re: Recent downtime and data loss
Post by: Deadstock on January 23, 2015, 07:28:22 AM
I was so bored with BCT down all day at work  ;D


Title: Re: Recent downtime and data loss
Post by: IamCANADIAN013 on January 23, 2015, 07:29:56 AM
Thank you for your hard work getting the forum back up theymos, much appreciated!

Gotta admit, I was starting to worry with it being down for so long.


Title: Re: Recent downtime and data loss
Post by: fairglu on January 23, 2015, 07:45:07 AM
That's what I expected to happen, but the RAID controller didn't notice that anything was wrong. I still need to figure out why.

IME the weak point of RAID is usually the controller: it's the non-redundant part of a redundant array :/
(be it because it's plain corrupting data, doing unnecessary I/O and wearing the disks... or just failing to report errors)


Title: Re: Recent downtime and data loss
Post by: haploid23 on January 23, 2015, 07:53:15 AM
What SSDs are you running these on? Some SSDs are pure garbage in reliability, and OCZ is notorious for this. You can't really go wrong with Intel ones, although there was one series of Intel that had some random bricking.

Also you mentioned that these are fairly old SSDs. Just noting that SSDs have a lifetime and do "expire", not by age but by how much is written to them. I don't know much about server configuration, but if there is some intensive writing on the SSDs, especially MLC chips, they wear out much sooner.


Title: Re: Recent downtime and data loss
Post by: smoothie on January 23, 2015, 07:54:43 AM
Glad to see the forum is up and running. Thanks Theymos.


Title: Re: Recent downtime and data loss
Post by: haploid23 on January 23, 2015, 07:55:42 AM
I think I lost a few PMs, but nothing crucial.


Title: Re: Recent downtime and data loss
Post by: johnyj on January 23, 2015, 08:00:07 AM
I've had several SSDs break on a server. I think SSDs are not good at handling a large amount of I/O on a server. My traditional hard drive RAID never failed on the same server.


Title: Re: Recent downtime and data loss
Post by: haploid23 on January 23, 2015, 08:12:58 AM
not good at handling a large amount of I/O on a server.

That's what I was thinking too. Same reason why you shouldn't defrag an SSD on a normal desktop.


Title: Re: Recent downtime and data loss
Post by: twister on January 23, 2015, 08:14:44 AM
SSDs are not reliable but then again, HDDs aren't reliable either.
I lost all my posts from yesterday.  :-\


Title: Re: Recent downtime and data loss
Post by: Buffer Overflow on January 23, 2015, 08:36:00 AM
Good work getting site back up.

I still prefer the old magnetic drives. I know they are slower, but they don't have all this cell wear problem. Also extra precautions need to be taken with SSDs if using as an encrypted drive, such as not using the TRIM command.


Title: Re: Recent downtime and data loss
Post by: railzand on January 23, 2015, 08:39:09 AM
It really was rather blissful. Could you make it a regular thing?


Title: Re: Recent downtime and data loss
Post by: One Six on January 23, 2015, 09:28:26 AM
I had no life for 12+ hours when your website went down. I'm glad it's back up.


Title: Re: Recent downtime and data loss
Post by: TookDk on January 23, 2015, 09:29:13 AM
Thank you theymos for getting the forum back online so quick.
Much appreciated  :)


Title: Re: Recent downtime and data loss
Post by: EFS on January 23, 2015, 09:29:31 AM
Yesterday, internet was meaningless to me.


Title: Re: Recent downtime and data loss
Post by: BADecker on January 23, 2015, 09:31:08 AM
Well, well, well! We are seeing a new addiction that we hadn't realized existed before.    :D


Title: Re: Recent downtime and data loss
Post by: TookDk on January 23, 2015, 09:35:16 AM
Yesterday, internet was meaningless to me.

This is how the internet looked yesterday:
http://www.writers-network.com/avatars/black-background-300x240.jpg


Title: Re: Recent downtime and data loss
Post by: BadBear on January 23, 2015, 09:43:01 AM
Good work getting site back up.

I still prefer the old magnetic drives. I know they are slower, but they don't have all this cell wear problem. Also extra precautions need to be taken with SSDs if using as an encrypted drive, such as not using the TRIM command.


That's what backups are for. If you don't have backups then you probably don't really care about that data anyway. Platter drives have their place, so do SSDs.



Title: Re: Recent downtime and data loss
Post by: stellar69 on January 23, 2015, 10:32:44 AM
I may have found a bug/error. I sent a PM to theymos so it can be corrected.
The error was 502:Bad Gateway Error (I got it while accessing a particular thread).


Title: Re: Recent downtime and data loss
Post by: Remember remember the 5th of November on January 23, 2015, 11:13:44 AM
I've said this before, SSDs are unreliable. If the thing cannot be used for 10 years without any wear, it's as useless as...incomparable.


Title: Re: Recent downtime and data loss
Post by: dsyahputera on January 23, 2015, 11:22:28 AM
Is the search feature one of the problems? It's disabled now :-\


Title: Re: Recent downtime and data loss
Post by: unsoindovo on January 23, 2015, 11:22:37 AM
Good work getting site back up.

I still prefer the old magnetic drives. I know they are slower, but they don't have all this cell wear problem. Also extra precautions need to be taken with SSDs if using as an encrypted drive, such as not using the TRIM command.


That's what backups are for. If you don't have backups then you probably don't really care about that data anyway. Platter drives have their place, so do SSDs.


sure!!!

RAID levels above 0, RAID 1+0 included, are no replacement for backups!!!

You always need to back up your data.

Especially when a hot backup is possible, as it is with a DB.


Title: Re: Recent downtime and data loss
Post by: sgk on January 23, 2015, 11:46:45 AM
I was so bored with BCT down all day at work  ;D

Ditto.

I was hitting the refresh button every 5 minutes to check if it's back online.
Had to work all day at the office because the forum was offline   :D


Title: Re: Recent downtime and data loss
Post by: Yuki1988 on January 23, 2015, 11:58:57 AM
Is the search feature one of the problems? It's disabled now :-\

You need to read theymos' post more carefully. :)
Search is temporarily disabled because I need to regenerate the search index before it will be usable again.


Title: Re: Recent downtime and data loss
Post by: bornil267645 on January 23, 2015, 12:00:39 PM
I was on the road when this happened. When I came back to check my posts, I saw that 9 of them had vanished. But then I saw this post. Glad the site is back online.


Title: Re: Recent downtime and data loss
Post by: dogie on January 23, 2015, 12:09:15 PM
Some SSD drive health monitoring might help... SSD drives deteriorate over time...
Glad you got it restored!

Whether you can see true SMART stats from the drives, or whether they're masked, depends on the controller and the age of the firmware. My SSD RAID array (gen 1 X25-Ms) gives me garbage SMART readings which I know are false.


Title: Re: Recent downtime and data loss
Post by: stellar69 on January 23, 2015, 12:37:28 PM
What happens if you keep the backup on an SSD too and that SSD also gets corrupted?
Will this lead to the entire forum getting wiped out?


Title: Re: Recent downtime and data loss
Post by: dogie on January 23, 2015, 01:54:07 PM
What happens if you keep the backup on an SSD too and that SSD also gets corrupted?
Will this lead to the entire forum getting wiped out?

Backup drives have no need to be quick so they're probably high capacity non SSD drives in a raid 1. And because they're much larger capacity than the active SSD raid array, they can contain multiple, historical backups of the same databases. At least, that's how I'd do it on a budget. Probably some periodical backups moved elsewhere as well.


Title: Re: Recent downtime and data loss
Post by: matt4054 on January 23, 2015, 02:17:01 PM
Thanks for bringing the forum back online and posting all the gory details about it :)


Title: Re: Recent downtime and data loss
Post by: alani123 on January 23, 2015, 02:20:06 PM
I can't really see this being a conspiracy tbh. Unless damning evidence is presented I'll keep believing that it was some legit downtime due to the issues theymos talked about. The fact that this has never happened to the forum before makes me trust him a bit more about this.


Title: Re: Recent downtime and data loss
Post by: deepceleron on January 23, 2015, 02:39:30 PM
What happens if you keep the backup on an SSD too and that SSD also gets corrupted?
Will this lead to the entire forum getting wiped out?

What happens if I write my password down on two pieces of paper and then set both papers on fire, will I lose my password? Please don't ask such silly questions in a Theymos thread for the sake of spamming your signature.

The bitcointalk.org and bitcoin.it databases were stored on a RAID 1+0 array: two RAID 1 arrays of 2 SSDs each, joined via RAID 0 (so 4 SSDs total, all the same model). We noticed yesterday that there were some minor file system errors on the bitcoin.it VM, but we took it for a fluke because there were no ongoing problems and the RAID controller reported no disk issues. A few hours later, the bitcointalk.org file system also started experiencing errors. When this was noticed, the bitcointalk.org database files were immediately moved elsewhere, but the RAID array deteriorated rapidly, and most of the database files ended up being too badly corrupted to be used. So a separate OS was set up on a different RAID array, and the database was restored using a daily backup.

My guess is that both of the SSDs in one of the RAID-1 sub-arrays started running out of spare sectors at around the same time. bitcoin.it runs on the same array, and it's been running low on memory for a few weeks, so its use of swap may have been what accelerated the deterioration of these SSDs. The RAID controller still reports no issues with the disks, but I don't see what else could cause this to happen to two distinct VMs. I guess the RAID controller doesn't know how to get the SMART data from these drives. (The drives are fairly old SSDs, so maybe they don't even support SMART.)

There is an interesting aspect of RAID arrays when using SSDs in a mirrored configuration that can only tolerate the failure of one drive - you are writing identical data to identical drives, and should expect them to fail in identical ways, killing two drives at once.

Physical hard drives are more subject to mechanical tolerances that vary between samples. The disk coatings are not deposited equally, the bearings are not made molecularly identical, the windings in the heads and the motors aren't perfect matches. We would expect given a high intensity and identical load to two drives that it would be virtually impossible for them to suffer failure at the same time.

SSDs are different. Some drives have firmware that specifically bricks the drive or turns it into a read-only drive after a certain number of writes. While the actual memory cells may fail differently between the drives, they wear at a predictable statistical rate and there is a reserve of usually 2-5% of drive space of extra sectors that will finally be exhausted at nearly identical times given identical write patterns.

SSDs push the limit of what can be stored on silicon, so they have many layers of error correction to go along with the wear leveling. It is quite hard to get bad data back out of a well-designed Tier 1 drive.

However, failure of SSDs can be random and capricious, especially the OCZ/Patriot/Crucial bottom-tier drives that just shit themselves for no reason. As it is unlikely the forum completely used up the drive wear life on its drives, it is more likely random crap-drive failure or RAID controller failure.

For reference, look at the tests here: http://techreport.com/review/27436/the-ssd-endurance-experiment-two-freaking-petabytes - they designed a test specifically to kill SSDs with writes and left it running 24/7 for over a year on six drives. You can see that the SMART wear indicators on most drives indicate the amount of drive life left, some lock up after the reserve is used, and some keep going into write error territory. Only one drive unexpectedly died. SMART analysis of the forum's drives will likely show the reallocated sector count and wear leveling count, which would tell us whether the failure was unexpected.
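As a sketch of what checking those two indicators could look like, the Python below parses attribute rows in the column layout printed by `smartctl -A` and flags any whose normalized value sits close to its failure threshold. The sample rows are made up for illustration (they are not readings from the forum's drives), the `near_threshold` helper and its margin are hypothetical, and attribute names and IDs vary by vendor, so treat this as a starting point only. Drives behind a hardware RAID controller may also need controller-specific options before smartctl can reach them at all.

```python
# Hypothetical sample in the column layout printed by `smartctl -A`;
# the values are invented for illustration.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
177 Wear_Leveling_Count     0x0013   012   012   010    Pre-fail  Always       -       2950
"""


def parse_attributes(text):
    """Parse attribute rows into dicts; header and other lines are skipped."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        rows.append({
            "id": int(parts[0]),
            "name": parts[1],
            "value": int(parts[3]),   # normalized value, higher is healthier
            "thresh": int(parts[5]),  # vendor-defined failure threshold
            "raw": parts[9],
        })
    return rows


def near_threshold(rows, margin=10):
    """Names of attributes whose normalized value is within margin of the threshold."""
    return [row["name"] for row in rows if row["value"] - row["thresh"] <= margin]


if __name__ == "__main__":
    print(near_threshold(parse_attributes(SAMPLE)))
```

In the made-up sample, the reallocated sector count is still healthy while the wear leveling count has almost reached its threshold, which is the pattern you would expect from a drive that wore out predictably and did not die at random.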

Outside of the SSDs themselves failing, most hardware RAID controllers are pretty dumb (and if you are spending less than ~$400 you are not even getting hardware RAID). They just write the same data to two drives. There is no error correcting or checksumming, so they are useless for fixing corruption, and are actually less tolerant than a single drive would be. If you want to see scary, look up the "RAID 5 write hole": basically, there is no way for these RAIDs to tolerate power loss, which makes on-controller battery backup super important.
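The point about checksumming can be shown with a toy model. This is not how any particular RAID controller or filesystem is implemented; the function names are hypothetical and SHA-256 is an arbitrary choice of checksum. It only illustrates that a mirror without checksums can at best notice that its two copies differ, while a stored checksum lets the reader pick the copy that is still good.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Checksum recorded at write time (SHA-256 chosen arbitrarily here)."""
    return hashlib.sha256(block).hexdigest()


def read_plain_mirror(copy_a: bytes, copy_b: bytes) -> bytes:
    """Without checksums a mismatch can be noticed, but not resolved."""
    if copy_a != copy_b:
        raise ValueError("copies differ and nothing says which one is right")
    return copy_a


def read_checksummed_mirror(copy_a: bytes, copy_b: bytes, expected: str) -> bytes:
    """Return the first copy whose checksum matches the stored one."""
    for copy in (copy_a, copy_b):
        if checksum(copy) == expected:
            return copy
    raise ValueError("both copies are corrupt")


if __name__ == "__main__":
    good = b"forum post"
    bad = b"forum pest"  # simulated silent corruption on one disk
    stored = checksum(good)
    print(read_checksummed_mirror(bad, good, stored))
```

With the checksum stored separately from the data, the corrupted copy is skipped and the good one is returned; a real self-healing system would then also rewrite the bad copy.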

Also, consumer hardware just plain HAS errors, and hardware RAID is not written to deal with them: http://www.zdnet.com/article/has-raid5-stopped-working/

Also majorly important is that the system must be running ECC RAM, and the RAID controller must also have ECC RAM if it has cache slots.

I am getting much more behind ZFS RAID. It is very impressive. It is software RAID run by the OS. This is no longer a bad thing. The CPU and OS have hundreds of times more processing power and RAM than RAID cards. There is a journal written that can be replayed upon power loss, and commits are made to the disk in a way that data will never be corrupted. Everything on the disks is checksummed and self-healing with error correction. The OS can talk directly to the drives and their SMART status to understand drive state. You can run RAID on standard SATA controllers, and when your motherboard burns up, mount the drives in any other motherboard and on any other controller. You don't need to have hot-spares and lengthy rebuilds, you can have RAIDZ3 - three extra drives of parity - so the array survives three drive failures and it would take a fourth to take out your disk array.

Then mirror the whole machine automatically with Highly Available Storage (HAST) and CARP in BSD. Backups are now just for when the whole place burns to the ground.
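A rough sketch of the RAIDZ arithmetic, for anyone sizing a pool: a RAIDZ vdev with parity level p (1, 2 or 3) gives up about p disks' worth of space and keeps working with up to p failed disks. The `raidz_summary` helper is hypothetical, and the usable figure is an approximation that ignores padding and metadata overhead.

```python
def raidz_summary(n_disks, disk_tb, parity):
    """Approximate capacity and fault tolerance of a single RAIDZ vdev."""
    if parity not in (1, 2, 3):
        raise ValueError("RAIDZ parity level must be 1, 2 or 3")
    if n_disks <= parity:
        raise ValueError("need more disks than parity devices")
    return {
        # Roughly (n - p) disks hold data; real pools lose a bit more to overhead.
        "usable_tb": (n_disks - parity) * disk_tb,
        # The vdev stays readable with this many failed disks...
        "tolerated_failures": parity,
        # ...and is lost when one more fails.
        "failures_to_lose_pool": parity + 1,
    }


if __name__ == "__main__":
    print(raidz_summary(n_disks=8, disk_tb=2.0, parity=3))
```

For the hypothetical 8 x 2 TB RAIDZ3 example this gives roughly 10 TB usable, tolerates three failed disks, and loses the pool on the fourth.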


Title: Re: Recent downtime and data loss
Post by: kolloh on January 23, 2015, 03:01:05 PM
Thanks for posting the detailed information about what happened. From a technical perspective, it is interesting reading about how things are configured.

deepceleron, I'm actually planning to use a RAIDZ2 pool for a database server that I'm working on as it looks like a very nice solution. RAIDZ2 only allows two disks to fail but I am liking what I read about ZFS thus far. The LZ4 compression also looks to be pretty handy.


Title: Re: Recent downtime and data loss
Post by: BADecker on January 23, 2015, 04:27:36 PM

...

I am getting much more behind ZFS RAID. It is very impressive. It is software RAID run by the OS. This is no longer a bad thing. The CPU and OS have hundreds of times more processing power and RAM than RAID cards. There is a journal written that can be replayed upon power loss, and commits are made to the disk in a way that data will never be corrupted. Everything on the disks is checksummed and self-healing with error correction. The OS can talk directly to the drives and their SMART status to understand drive state. You can run RAID on standard SATA controllers, and when your motherboard burns up, mount the drives in any other motherboard and on any other controller. You don't need to have hot-spares and lengthy rebuilds, you can have RAIDZ3 - three extra drives of parity - so the array survives three drive failures and it would take a fourth to take out your disk array.

Then mirror the whole machine automatically with Highly Available Storage (HAST) and CARP in BSD. Backups are now just for when the whole place burns to the ground.

With an operation like BitcoinTalk, or any other well-funded operation, this might be doable. But can the little guy (gal) afford this for their home computer? What exists for the little people?

:)


Title: Re: Recent downtime and data loss
Post by: fairglu on January 23, 2015, 04:43:49 PM
Backups are now just for when the whole place burns to the ground.

Do not underestimate the Plain Old Bugs (tm), Plain Old Human Errors (tm) and the Drunk or Drugged SysAdmin (r)

Much more common than destruction by fire :P


Title: Re: Recent downtime and data loss
Post by: Buffer Overflow on January 23, 2015, 04:53:57 PM


With an operation like BitcoinTalk, or any other well-funded operation, this might be doable. But can the little guy (gal) afford this for their home computer? What exists for the little people?

:)

If something was mission critical, it wouldn't be running on a home computer in the first place.
But since you asked: a couple of disks, software RAID, Linux, and a good backup schedule. Job done.


Title: Re: Recent downtime and data loss
Post by: BADecker on January 23, 2015, 05:00:39 PM


If something was mission critical, it wouldn't be running on a home computer in the first place.

The whole internet is important to me. My home computer is important to me. Without my home computer, I wouldn't be able to access the Internet as easily.

I use my home computer for other things. I empathize with people in their love/hate relationships with their own computers. Yet mine is mission critical to me.

:)


Title: Re: Recent downtime and data loss
Post by: Buffer Overflow on January 23, 2015, 05:03:00 PM


The whole internet is important to me. My home computer is important to me. Without my home computer, I wouldn't be able to access the Internet as easily.

I use my home computer for other things. I empathize with people in their love/hate relationships with their own computers. Yet mine is mission critical to me.

:)

Backup software. Plenty about.


Title: Re: Recent downtime and data loss
Post by: BADecker on January 23, 2015, 05:11:37 PM


Backup software. Plenty about.

Thank you.    ;)


Title: Re: Recent downtime and data loss
Post by: egghead123 on January 23, 2015, 06:10:19 PM



Thanks for your time and efforts. Get an XBC wallet address and I will send you a big donation.


Title: Re: Recent downtime and data loss
Post by: johnyj on January 23, 2015, 09:38:40 PM
This totally defeats the purpose of running RAID 10. Such a high failure rate is already higher than that of conventional HDDs, and I don't think RAID 0 is needed for SSDs; they are already fast enough. Using two RAID 1 arrays to back each other up is a better solution. Anyway, this is very strange: RAID 1 should give enough warning before a total failure.


Title: Re: Recent downtime and data loss
Post by: Bizmark13 on January 24, 2015, 01:17:15 AM
The forum went down right after I hit the preview button and just as I realized that my post was missing a [/QUOTE] tag. For a second or two, I got BBcode confused with HTML and thought there was a possibility that I broke the forum.

Anyway, to those who aren't sure if they have posts that are deleted or not (particularly to those who post a lot and might not remember how many posts they made prior to the data loss), go through your browser history and you'll get an idea of which posts you need to re-post.


Title: Re: Recent downtime and data loss
Post by: jacktheking on January 24, 2015, 01:31:52 AM
Looks like I lost four posts. My signature campaign requires me to have 165 posts. I remember I reached it. When the forum came back online, I had 161 posts. Anyway, good to hear the forum is back online.


Title: Re: Recent downtime and data loss
Post by: Rishblitz on January 24, 2015, 02:41:32 AM
That sucks, but at least it's working again.


Title: Re: Recent downtime and data loss
Post by: Stratobitz on January 24, 2015, 08:08:17 AM
Nice to see things back up and running. I run an 8 disk Raid 0 SSD Array as well. Certainly safer than spinning drives and knock on wood no failures yet. But backups are a must.

Thanks for your hard work getting it back and running so quickly.

Cheers!

Strato


Title: Re: Recent downtime and data loss
Post by: Wendigo on January 24, 2015, 08:26:13 AM
Nice to see things back up and running. I run an 8 disk Raid 0 SSD Array as well. Certainly safer than spinning drives and knock on wood no failures yet. But backups are a must.

Thanks for your hard work getting it back and running so quickly.

Cheers!

Strato

Doesn't RAID 0 provide no recovery if one of the SSDs fails? At least you can salvage some data from an HDD.


Title: Re: Recent downtime and data loss
Post by: dogie on January 24, 2015, 08:50:01 AM
Nice to see things back up and running. I run an 8 disk Raid 0 SSD Array as well. Certainly safer than spinning drives and knock on wood no failures yet. But backups are a must.

Thanks for your hard work getting it back and running so quickly.

Cheers!

Strato

Doesn't RAID 0 provide no recovery if one of the SSDs fails? At least you can salvage some data from an HDD.

The idea is to never have to do that, so at some point you don't care. As long as you have an image of the array and losing a few hours of changes is acceptable, you don't have to go salvaging. Or you can RAID 1 that RAID 0 array.


Title: Re: Recent downtime and data loss
Post by: Stratobitz on January 24, 2015, 08:58:53 AM
Nice to see things back up and running. I run an 8 disk Raid 0 SSD Array as well. Certainly safer than spinning drives and knock on wood no failures yet. But backups are a must.

Thanks for your hard work getting it back and running so quickly.

Cheers!

Strato

Doesn't RAID 0 provide no recovery if one of the SSDs fails? At least you can salvage some data from an HDD.

Yes that is correct. But I do work in video production, and I work with a lot of 4k, 5k and 6k lossless media. The 8 disk SSD array allows me to read/write 5.7GB/s (gigabytes).

So my setup for working: (8) 960 GB Crucial M500 SSDs, and these mirror to a RAID 5 array overnight (7200 rpm enterprise drives).

Dogie is correct, you can't rely on this setup for safely storing important data. But my project files are kept on a smaller drive which is safely RAIDed (RAID 1 SSD), and only the raw media and transcodes (of which I have LTO and HDD copies) are stored on this RAID while the project is being worked on.

The system is also on a dual 10GbE NIC, so I can push or pull media files from our server RAID at roughly 2 GB per second.

This may sound extreme, but many of my projects can run 3, 4, 5+ TB in size. So if I need to load up or load off a large project, this setup makes an enormous difference.

Strato
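
For a sense of scale, the figures quoted in this post can be turned into rough transfer times. This is only back-of-the-envelope arithmetic using the numbers as stated (5.7 GB/s for the local array, roughly 2 GB/s over the network, 8 disks); it assumes decimal units and sustained rates, which real transfers may not reach.

```python
def transfer_minutes(size_tb, rate_gb_per_s):
    """Minutes needed to move size_tb terabytes at rate_gb_per_s (1 TB = 1000 GB)."""
    return size_tb * 1000 / rate_gb_per_s / 60


ARRAY_RATE = 5.7    # GB/s, local 8-disk RAID 0 figure quoted in the post
NETWORK_RATE = 2.0  # GB/s, rough figure quoted for the dual 10GbE link
NUM_DISKS = 8

if __name__ == "__main__":
    # Even split of the quoted array throughput across the 8 striped disks.
    print(f"per-disk share: {ARRAY_RATE / NUM_DISKS:.2f} GB/s")
    for size_tb in (3, 4, 5):
        net = transfer_minutes(size_tb, NETWORK_RATE)
        local = transfer_minutes(size_tb, ARRAY_RATE)
        print(f"{size_tb} TB: {net:.0f} min over the network, {local:.0f} min at local array speed")
```

At the quoted network rate, a 3 TB project works out to about 25 minutes, which is why the link speed matters for projects of this size.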


Title: Re: Recent downtime and data loss
Post by: Muhammed Zakir on January 24, 2015, 10:27:34 AM
Can Nginx be used? Wouldn't it increase the performance? Any drawbacks when using it?

   ~~MZ~~


Title: Re: Recent downtime and data loss
Post by: Mitchell on January 24, 2015, 11:32:06 AM
Can Nginx be used? Wouldn't it increase the performance? Any drawbacks when using it?

   ~~MZ~~
Nginx is currently used for the forum. I noticed that yesterday when I got a 502 error.


Title: Re: Recent downtime and data loss
Post by: blablaace on January 25, 2015, 12:15:47 AM
What happened again?


Title: Re: Recent downtime and data loss
Post by: Superhitech on January 25, 2015, 12:20:13 AM
What happened again?

I want to know too, bitcointalk gave me an error saying "SMF cannot connect, try again later"

The forum is just recently back up.


Title: Re: Recent downtime and data loss
Post by: mitzie on January 25, 2015, 12:42:01 AM
What happened again?

I want to know too, bitcointalk gave me an error saying "SMF cannot connect, try again later"

The forum is just recently back up.

Theymos warned that the forum will be periodically down as everything is reconfigured.

So no worries ;)


Title: Re: Recent downtime and data loss
Post by: Ingramtg on January 25, 2015, 02:59:53 AM
What happened again?

I want to know too, bitcointalk gave me an error saying "SMF cannot connect, try again later"

The forum is just recently back up.
Yes, I had the same question a few hours ago. I could not open the forum, but now it seems to work fine. Maybe it was just maintenance.


Title: Re: Recent downtime and data loss
Post by: Superhitech on January 25, 2015, 03:10:49 AM
What happened again?

I want to know too, bitcointalk gave me an error saying "SMF cannot connect, try again later"

The forum is just recently back up.

Theymos warned that the forum will be periodically down as everything is reconfigured.

So no worries ;)

Ahhh, I see, thanks.

Are there more scheduled maintenance times, or is this the last one?


Title: Re: Recent downtime and data loss
Post by: Muhammed Zakir on January 25, 2015, 03:52:52 AM
Theymos warned that the forum will be periodically down as everything is reconfigured.

So no worries ;)

Ahhh, I see, thanks.

Are there more scheduled maintenance times, or is this the last one?

It's not the last, it may last for a few days.


 =snip=

There will be some periodic downtime over the next few days (a few hours in total) as we get everything reconfigured/settled. A few things might be broken. Tell me if you see any bugs.

   ~~MZ~~


Title: Re: Recent downtime and data loss
Post by: Superhitech on January 25, 2015, 04:51:00 AM
Theymos warned that the forum will be periodically down as everything is reconfigured.

So no worries ;)

Ahhh, I see, thanks.

Are there more scheduled maintenance times, or is this the last one?

It's not the last, it may last for a few days.


 =snip=

There will be some periodic downtime over the next few days (a few hours in total) as we get everything reconfigured/settled. A few things might be broken. Tell me if you see any bugs.

   ~~MZ~~

Thanks, are there exact dates for it? I jump every time the forum is down, makes me think something is wrong with my computer lol.


Title: Re: Recent downtime and data loss
Post by: irfan_pak10 on January 25, 2015, 04:59:47 AM
Recent error that I received:
"Connection Problems

Sorry, SMF was unable to connect to the database. This may be caused by the server being busy. Please try again later."


Title: Re: Recent downtime and data loss
Post by: redsn0w on January 25, 2015, 07:36:15 AM
I hope all the issues have been resolved. However, is there any ETA for the new "forum"? About 1 week remains until February.


Title: Re: Recent downtime and data loss
Post by: dserrano5 on January 25, 2015, 08:42:13 AM
Thanks, are there exact dates for it?

Yeah it will work 100% beginning Jan 27th. Ah, how do I know that since the OP says nothing, you ask? Well I just pulled it out of my ass. Now GO READ THE OP!


Title: Re: Recent downtime and data loss
Post by: (Lithium) on January 25, 2015, 08:43:40 AM
Thanks, are there exact dates for it?

Yeah it will work 100% beginning Jan 27th. Ah, how do I know that since the OP says nothing, you ask? Well I just pulled it out of my ass. Now GO READ THE OP!

two days from now?

Any confirmation that the launch won't be delayed or something?


Title: Re: Recent downtime and data loss
Post by: Grand_Voyageur on January 25, 2015, 08:54:21 AM
Recent error that i received
"Connection Problems

Sorry, SMF was unable to connect to the database. This may be caused by the server being busy. Please try again later."

Everybody got it. It's nothing special, since theymos already warned us that this could occur while the forum recovers from its latest DB incident (https://bitcointalk.org/index.php?topic=932315.msg10232624#msg10232624). By the way, you could have avoided posting if you had read the previous posts here.

What happened again?

I want to know too, bitcointalk gave me an error saying "SMF cannot connect, try again later"

The forum is just recently back up.

Theymos warned that the forum will be periodically down as everything is reconfigured.

So no worries ;)
Theymos warned that the forum will be periodically down as everything is reconfigured.

So no worries ;)

Ahhh, I see, thanks.

Are there more scheduled maintenance times, or is this the last one?

It's not the last, it may last for a few days.


 =snip=

There will be some periodic downtime over the next few days (a few hours in total) as we get everything reconfigured/settled. A few things might be broken. Tell me if you see any bugs.

   ~~MZ~~


Title: Re: Recent downtime and data loss
Post by: anonimus on January 25, 2015, 03:18:53 PM
The forum was down around 20 hours ago too.

It gave a Connection Problem error and I was not able to access the forum.


Title: Re: Recent downtime and data loss
Post by: Mt. Gox on January 25, 2015, 09:54:13 PM
the forum was down around 20 hours before too

Giving Connection Problem and not able to access the forum

Same here. It was down about a day ago. Theymos is currently making some changes to the forum and fixing things up so this sort of downtime is to be expected though:

There will be some periodic downtime over the next few days (a few hours in total) as we get everything reconfigured/settled. A few things might be broken. Tell me if you see any bugs.


Title: Re: Recent downtime and data loss
Post by: stellar69 on January 25, 2015, 09:56:11 PM
the forum was down around 20 hours before too

Giving Connection Problem and not able to access the forum
Theymos already said that there will be some intentional downtimes to correct a few things.
Don't worry.


Title: Re: Recent downtime and data loss
Post by: Exther2 on January 26, 2015, 01:15:24 PM
So the search function is still offline? Can it be repaired soon? (As it's something of real importance, it shouldn't take as long as the avatar repair took: "∞".)


Title: Re: Recent downtime and data loss
Post by: Welsh on January 26, 2015, 01:17:27 PM
So the search function is still offline? can it be repaired soon (and as it's something of real importance it shouldn't take as long as avatars repair took "∞")
Just use Google to search for now. Google is much better than the search function on here. Although, the search function can be useful at times.

use:
Code:
site:bitcointalk.org <search>  


Title: Re: Recent downtime and data loss
Post by: Muhammed Zakir on January 26, 2015, 01:50:27 PM
So the search function is still offline? can it be repaired soon (and as it's something of real importance it shouldn't take as long as avatars repair took "∞")
Just use Google to search for now. Google is much better than the search function on here. Although, the search function can be useful at times.

use:
Code:
site:bitcointalk.org <search>  

If you use a common keyword when searching, you may get many results. So filter the results by going to the advanced tools and setting dates.

   ~~MZ~~


Title: Re: Recent downtime and data loss
Post by: Welsh on January 26, 2015, 01:58:08 PM
If you use a common keyword when searching, you may get many results. So filter the results by going to the advanced tools and setting dates.

   ~~MZ~~
Yes, you can use others as well, some examples would be:
Code: (Example)
 inurl:
 intitle:
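
If you want to build such searches automatically, the operators above can be combined into a search URL. This is just a small sketch: the function name and its parameters are made up for illustration, and it only assembles the query string.

```python
from urllib.parse import urlencode


def forum_search_url(keywords, in_title=None, site="bitcointalk.org"):
    """Build a Google search URL restricted to one site.

    Optionally requires a word in the page title via the intitle: operator.
    """
    terms = [f"site:{site}"]
    if in_title:
        terms.append(f"intitle:{in_title}")
    terms.append(keywords)
    # urlencode percent-encodes ':' and turns spaces into '+'.
    return "https://www.google.com/search?" + urlencode({"q": " ".join(terms)})


if __name__ == "__main__":
    print(forum_search_url("raid failure"))
    print(forum_search_url("raid failure", in_title="downtime"))
```

Opening the printed URL in a browser is the same as typing the query into Google by hand.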



Title: Re: Recent downtime and data loss
Post by: unamis76 on January 26, 2015, 02:47:28 PM
Any other downtimes planned? Pretty boring getting here and seeing this offline all of a sudden :D


Title: Re: Recent downtime and data loss
Post by: theymos on February 01, 2015, 11:46:43 AM
Searching is enabled again now. I also made several improvements to search. It should be substantially faster now, and maybe also more accurate. (SMF was extremely buggy in this area -- it's surprising that search was even usable before.)


Title: Re: Recent downtime and data loss
Post by: unamis76 on February 01, 2015, 12:00:35 PM
Searching is enabled again now. I also made several improvements to search. It should be substantially faster now, and maybe also more accurate. (SMF was extremely buggy in this area -- it's surprising that search was even usable before.)

Thank you, the search is very handy :) Keep up the good work!


Title: Re: Recent downtime and data loss
Post by: Muhammed Zakir on February 01, 2015, 12:20:46 PM
Searching is enabled again now. I also made several improvements to search. It should be substantially faster now, and maybe also more accurate. (SMF was extremely buggy in this area -- it's surprising that search was even usable before.)

Thanks for implementing it again! :)

   ~~MZ~~


Title: Re: Recent downtime and data loss
Post by: Madness on February 01, 2015, 06:56:11 PM
Searching is enabled again now. I also made several improvements to search. It should be substantially faster now, and maybe also more accurate. (SMF was extremely buggy in this area -- it's surprising that search was even usable before.)

Thanks for adding the search function back, mate. That will reduce duplicate topics a little bit.
Btw, I'm pretty sure that the forum was down for several hours and came back online just minutes ago. What was the reason for that? I was getting a Connection Problem error when I tried to open the forum.

~ Madness


Title: Re: Recent downtime and data loss
Post by: jbrnt on February 01, 2015, 07:02:02 PM
Are these random downtimes for backups, or are there still problems with the database?


Title: Re: Recent downtime and data loss
Post by: blablaace on February 01, 2015, 07:04:55 PM
I think that this was the longest downtime since this started..


Title: Re: Recent downtime and data loss
Post by: haploid23 on February 01, 2015, 07:06:51 PM
These downtimes are for upgrades, not data loss and recovery, right?


Title: Re: Recent downtime and data loss
Post by: CptTripps on February 01, 2015, 07:13:27 PM
I have a forum that has been running on SMF since before it was called SMF. (PutterTalk.com) I can certainly appreciate that error page!  Great job getting search re-enabled.


Title: Re: Recent downtime and data loss
Post by: unamis76 on February 01, 2015, 07:18:29 PM
Another odd, non-planned outage... Seems that everything is now back and working though, search included :)


Title: Re: Recent downtime and data loss
Post by: dserrano5 on February 01, 2015, 07:41:50 PM
Another odd, non-planned outage... Seems that everything is now back and working though, search included :)

It was planned—read the OP.


Title: Re: Recent downtime and data loss
Post by: Superhitech on February 01, 2015, 07:45:19 PM
Yay, search is back. I was tired of using KEYWORD site:bitcointalk.org on google.

New forum software coming soon? It's February already, hope to see it soon. :)


Title: Re: Recent downtime and data loss
Post by: jsmit332 on February 01, 2015, 07:46:58 PM
Yay, search is back. I was tired of using KEYWORD site:bitcointalk.org on google.

New forum software coming soon? It's February already, hope to see it soon. :)

I would love to see new forum software. But Theymos is probably busy.


Title: Re: Recent downtime and data loss
Post by: theymos on February 01, 2015, 07:58:20 PM
Most of the recent downtime is caused by MariaDB very rarely hanging on a random query and preventing all other queries from completing. If this sort of thing happens when everyone is sleeping/away (especially on weekends), then there's downtime. Someone on the MariaDB IRC says that it might be a bug in the version of MariaDB that I upgraded to as part of the server change.
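
A hang like this could in principle be caught by a small watchdog that looks for statements that have been running far too long. The sketch below is an assumption about how one might do that, not something that is running on the forum: it takes rows shaped like the output of `SHOW FULL PROCESSLIST` (columns `Id`, `Command`, `Time`, `Info`) and flags long-running queries. Fetching the rows from the server and deciding whether to act on them is left out, and the 300-second threshold is arbitrary.

```python
def find_stuck_queries(processlist_rows, max_seconds=300):
    """Return rows for statements that have been running longer than max_seconds.

    Each row is a dict shaped like one row of SHOW FULL PROCESSLIST
    (keys: Id, Command, Time, Info). Idle connections (Command == "Sleep")
    are ignored, since a long Time there just means an idle client.
    """
    return [row for row in processlist_rows
            if row["Command"] == "Query" and row["Time"] > max_seconds]


if __name__ == "__main__":
    # Made-up sample rows for illustration; real rows would come from the server.
    sample = [
        {"Id": 11, "Command": "Sleep", "Time": 4000, "Info": None},
        {"Id": 12, "Command": "Query", "Time": 2, "Info": "SELECT ..."},
        {"Id": 13, "Command": "Query", "Time": 1900, "Info": "SELECT ..."},
    ]
    for row in find_stuck_queries(sample):
        # KILL <id> is the statement an operator could run by hand for this thread.
        print(f"thread {row['Id']} has run for {row['Time']}s; candidate for KILL {row['Id']}")
```

Whether killing the stuck thread would actually clear this particular hang is unknown; the sketch only shows how the condition could be detected so that someone gets alerted.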


Title: Re: Recent downtime and data loss
Post by: Bizmark13 on February 02, 2015, 04:22:19 AM
Forum was down for several hours, went back up, and then went down again for a couple hours earlier today.

I think that this was the longest downtime since this started..

I'm not 100% sure but I think the first one after the initial downtime was a bit longer. Then again, I slept through most of it so maybe my timing is a bit off.

Yay, search is back. I was tired of using KEYWORD site:bitcointalk.org on google.

New forum software coming soon? It's February already, hope to see it soon. :)

I'm the opposite of you. I've always preferred to use "site:bitcointalk.org" on Google for searching the forums.