Over the years, I've configured and happily used Linux Software RAID on numerous servers. It has proven to be amazingly resilient and quite stable.
But a few years ago (2003), when I was building my newest server (which conveniently lives in a colocation facility about 4 miles from where I do), I opted to drop in a 3ware card. They had a good reputation in the Linux world, and I figured I might as well move up in the world.
Well guess what died recently?
Right. That server suddenly became unresponsive about two weeks ago.
Due to some access complications, I wasn't able to visit it until this evening. I un-racked the machine, opened it up, and inspected things. All the cables were still plugged in and the card was firmly seated. Hmm.
When I rolled the crash cart over to put a keyboard and monitor on it, I found that the RAID array was simply gone. No trace. I poked around in the 3ware BIOS a bit and couldn't figure out what was going on.
I brought the machine home and decided to chuck the card. It'd failed in its single mission: keeping a redundant copy of my data on both disks. I plugged the two disks directly into the motherboard and stuck in my little Debian installation USB stick (just made it tonight). It's easier than finding a CD-ROM drive I can plug in.
Partway through the configuration process, I noticed the primary drive acting very bursty. Then I heard the clicking noises. We all know what it means when a hard disk starts to click, right?
Now it was all making sense. One of the two drives flaked out and that caused the RAID controller to shit itself and blow away the array.
Nice.
Let's just say that I'll be going back to Software RAID from now on. The machine is rebuilt (minus the bad disk) and I'll put it back in the rack tomorrow morning.
Thanks to rsnapshot, I never lost any data. I had current off-site backups. In two locations. Doesn't everyone?
Let's just say I've been burned a few times in the past.
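For the curious, the guts of that setup are just cron driving rsnapshot. Something along these lines (a sketch only; the schedule, paths, and host name here are placeholders, not my actual config):

    # crontab on the backup box
    30 3 * * *    /usr/bin/rsnapshot daily
    0  4 * * 7    /usr/bin/rsnapshot weekly

    # matching /etc/rsnapshot.conf excerpt (fields are TAB-separated)
    # snapshot_root   /backups/
    # interval        daily    7
    # interval        weekly   4
    # backup          root@server.example.com:/home/    server/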
Anyway, soon I can finally migrate the data for this site and several others off my old (going on 6 years old) server in Ohio (happily running Software RAID).
In retrospect, I was adding complexity and a new point of failure to a system that had always worked fine in the past. I've learned my lesson.
Posted by jzawodn at March 08, 2007 10:01 PM
That's odd; I run 15 servers with various 3ware cards, with 4 to 16 drives, and have never had a problem with a drive killing an array or a server.
Clicking == Maxtor. Do yourself a favor, go Seagate.
It is odd alright.
I'm sticking with what works for me until it stops working for me. :-)
Maybe I'm just stating the obvious, but my problem with a 3ware hardware RAID is that I didn't realise for quite a while that one disk had failed. The only way to catch it is to watch closely during reboot.
Who does that on remote servers?
I was lucky to find out by accident before the second drive died, and now I have a cron job that checks the RAID status every day...
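The check is nothing fancy; something like this (a sketch, assuming 3ware's tw_cli utility is installed and the array is on controller c0; the status keywords can differ between firmware versions):

    #!/bin/sh
    # daily 3ware array check: mail the output if anything looks unhealthy
    STATUS=$(tw_cli info c0 2>&1)
    echo "$STATUS" | grep -Eq 'DEGRADED|REBUILDING|INOPERABLE|ERROR' && \
        echo "$STATUS" | mail -s "RAID trouble on $(hostname)" admin@example.com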
*Shudder* this is why I hate the hardware side of this business. It's so much easier when you have the budget to leave this to others. I guess at least with software raid it is so much easier to update to a newer version ...
I'm not sure I understood you correctly. One of the disks failed, and the controller disabled the whole array? Or did the first disk fail without being noticed by you until the second disk failed, too?
In the latter case I wouldn't blame the controller, but the (missing) monitoring of the RAID health. 3ware provides excellent tools to monitor the state of the array, which should let you know when you need to replace a disk.
I never had problems with the 3ware controllers, but of course there is always the chance of the RAID controller itself failing. Software RAID does work and is reliable, but in my experience rebuilding an array slows server performance to a crawl, making it nearly unusable (which is exactly what a RAID should prevent).
I'm really bummed - not least because I recommended the cards to you.
However, as of late I've been severely disappointed with 3ware and AMCC (their new parent). The newest version of the 3DM management utilities is nearly impossible to monitor properly, and it still has severe issues with 2.6 kernels.
That said, I've run well over a hundred servers with 3ware cards (everything from the 7000 series on up), and never seen this happen. One bad disk I could understand, it happens, but I've never seen something like this.
Currently I have two boxes colo'd together at 365Main, doing cross-backups over a gbit crossover cable. But I need to build a decent fileserver for offsite backups.
*sigh* data redundancy is such a bitch. I wish they'd just make disks that didn't fucking die.
I'm starting to become less of a fan of RAID solutions and more of a fan of just copying the data often. Less complex, so there's less that can go wrong. I think the Google paper on hard drives somehow supported this view, but I forget why...
We have 3ware cards in over 30 servers. For the most part they work and work very well. Could have been a fluke. I have heard the older cards are not as reliable.
It's not as rare as you'd think... we had a hardware-based 6-disk RAID 1 array with 'a certain managed hosting provider' ;). It went offline, and after a lot of talking to them, it transpired that the RAID controller lost it when a drive failed and managed to scramble all the disks.
Jeremy, I feel your pain. Enjoy those backups.. we didn't have those.
[In my defense, this happened on my first 6 hours of my first day and I hadn't even been given account credentials yet]
Absolutely have multiple backups. Personally, I always endeavor to have three copies, one on the server's second hard drive, one at a third-party location, and one in my office. I've also learned the hard way that even backup copies can be corrupt.
I had one hardware RAID controller that was periodically corrupting data on the array. The vendor claimed the controller card was good, because the little "failure" LED they put on the card was not lit. I tried to explain to them that the failure detection could have failed too (sometime in the past), but they couldn't understand that their "failure detection" was just another point of failure.
Software RAID, forever. Software RAID also lets you RAID hardware that you can't get a hardware controller for, like external Firewire drives. Hot-plug RAID5, for cheap.
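For reference, a two-disk mirror under Linux software RAID is only a few commands (a sketch; the device names are examples, and the create step destroys whatever is on those partitions):

    # create the mirror and put a filesystem on it
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    mkfs.ext3 /dev/md0

    # watch sync progress and ongoing health
    cat /proc/mdstat

    # swap out a failed member later
    mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
    mdadm /dev/md0 --add /dev/sdb1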
Don't be so sure that all is well with the other drive. At work we've deployed hundreds, maybe thousands, of 3ware controllers in 2-drive RAID 1 configurations. With that setup the catastrophic failure scenario is always the same: one drive errors out and gets removed from the array, then the other follows, which causes the controller to "shit itself."
The most important thing when using hardware RAID is to make sure that you've installed the vendor's RAID monitoring software, configured it to alert you when something fails, and verified that it actually works.
And with desktop-class drives it's a good idea to choose a model with TLER (time-limited error recovery) so that a bad sector doesn't turn into a dropped drive.
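One way to cover both the monitoring and the alerting is smartmontools, which can see drives behind a 3ware card through the controller. A sketch only; the device name, port numbers, and address below are assumptions, so check them against your own setup:

    # /etc/smartd.conf excerpt
    # -a monitors all attributes, -m mails alerts,
    # -M test sends a test message when smartd starts (proves the alert path works)
    /dev/twa0 -d 3ware,0 -a -m admin@example.com -M test
    /dev/twa0 -d 3ware,1 -a -m admin@example.com

    # a plain SATA drive hanging off the motherboard
    /dev/sda -a -m admin@example.com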
Did you get this hooked into s3? If so, care to share the how-to?
Very informative. Jeremy, keep writing this kind of stuff; we need it.
I currently have a "situation" - nay, two - with 3ware RAID 5 setups.
One array is currently on-line and working, but in a state such that I can't replace the disk that has failed. Or rather, I can, but the 3ware controller refuses to rebuild because of an error with another disk in the array (I think the SMART pending-reallocate count is nonzero). The array is "working", but Linux throws up an error when one particular file is accessed.
So I'm going to have to back up half a terabyte, rebuild the array, and copy it back. (I think).
Then there's the problem that this array, and another one, are constructed of Maxtor 250Gb disks from what I hope was a faulty batch. They're just out of warranty, and about 50% are showing unhealthy SMART stats (which I can see with smartctl, but the 3ware software doesn't seem to notice). And guess what? The Maxtor 250Gb block count is slightly bigger than the other makes of 250Gb disk, so I'll have to replace them with 300Gb Seagates and waste 1/6 of the capacity (and pray that the rebuild doesn't kill the array).
I just hope that since Seagate bought Maxtor, it's Seagate quality that triumphs. (As for click-fail as mentioned above, I always thought that was IBM Deskstars? Trouble is, no big disk you buy has been long-term tested. By the time it's been in service for a couple of years it's obsolete, and usually you can't buy another one of the same model even if you wanted to.)
Software RAID keeps looking better, as does the thought of forgetting about RAID 5 and just mirroring disk pairs. If only motherboards came with 16 SATA ports, or someone made a cheap non-RAID lots-of-SATA-ports card for Linux users.
"are constructed of Maxtor 250Gb disks from what I hope was a faulty batch."
Based on my experience every batch is a faulty batch.
More and more I am convinced that they are secretly owned by Lucas UK. Both have had over 20 years to improve their products, both still live up to their bad reputations.
Another reason for RAID pains: disks that implement S.M.A.R.T. so badly that it's worse than useless (just lulls you into a false sense of security).
I've just been investigating a disk that wasn't in a RAID array (thank heaven), just a humble Windows box. It was taking ages to boot, the disk making scritch-scratch error-recovery noises. It's a Hitachi Deskstar 7K250. I thought IBM fixed that scritch-scratch problem ages ago before Hitachi bought the business?!
It wasn't utterly dead, I had to data-erase it anyway, so I fired up Knoppix and investigated with smartctl. Um. No pending reallocates, no reallocated blocks, no sign of any trouble at all. Little wonder no red screen of dead disk from the BIOS. If I hadn't heard that tell-tale noise ....
Then I tried a read test: dd from the disk /dev/sda to /dev/null. Unsurprisingly, scritch-scratch and errors.
Then dd from /dev/zero to /dev/sda. No errors. No funny noises. Presumably it has now reallocated the bad blocks, but smartctl still shows zero reallocated sectors and events!
Finally, another dd from the disk to /dev/null. Now no noises, no errors. The disk appears perfect. I could sell it as perfect to some sucker if I were that way minded.
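For anyone who wants to repeat the experiment, the whole sequence was roughly this (a sketch; /dev/sda is the suspect disk, and the zero-fill step destroys every byte on it):

    smartctl -a /dev/sda                  # SMART report: no pending or reallocated sectors
    dd if=/dev/sda of=/dev/null bs=1M     # read test: scritch-scratch noises and read errors
    dd if=/dev/zero of=/dev/sda bs=1M     # zero-fill: the drive quietly remaps the bad blocks
    dd if=/dev/sda of=/dev/null bs=1M     # second read test: clean
    smartctl -a /dev/sda                  # still claims zero reallocated sectors and events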
I can't help thinking that manufacturers are deliberately hiding nasty things under the carpet to keep down the number of warranty returns they have to process. If a clueless user reinstalled Windows on this disk, he'd also have caused the bad blocks to be reallocated, everything would work right, and he'd probably curse malware, a virus, or Microsoft.
I feel ill. This may explain Google's recent paper where half their failed disks didn't give any advance warning through SMART. Anyone know if Seagate or WD "Enterprise" SATA are any better?
Thanks for these stories. Useful to read; I'm going with software RAID now and will check the status reports on the disks. The S.M.A.R.T. problem is worrying, though.
I have good experiences with Seagate and Hitachi SATA drives, and bad experiences with Maxtor (IDE and SATA). I stopped using them some years ago.
I've seen something similar, but it can happen to any RAID array. One disk starts to go bad but isn't completely bad. Typically drives fail completely and are unusable, but if they fail slowly, the bad data gets replicated across the array, which pretty much hoses everything.
I'm not a huge fan of commodity RAID cards for just this very reason. When I worked for IBM I saw a LOT of RAID failures, and most of the time they were firmware related. It's a little-known fact that both the drives and the controller have firmware that needs to be kept up to date. It's also important that the drives and the controller know what to expect from one another (think timeouts, soft errors, and bus resets here), and quite honestly things like this are not covered in the stock firmware on one of those commodity cards. It's hard enough for a big company that knows which drives are "supported" to keep the bugs out of their controllers; imagine being much smaller and supporting every IDE drive made.
I'm not a big fan of software RAID on Linux or of cheapie white-box hardware. Linux file systems are generally fragile enough without throwing software RAID and questionable hardware into the mix. Not to mention the performance hit under heavy I/O without a dedicated hardware controller handling all the striping and parity operations.
The exception to the rule on software RAID is XLV under XFS. XFS is the file system straight out of SGI's IRIX OS. If you take a close look at SGI hardware, you will notice that all non-Fibre Channel storage is JBOD. The reason for this is the performance you can get out of an XLV on XFS. Unfortunately, SGI did not put XLV support into the Linux port of XFS.
Your best bet for RAID on Linux is either to stick with software RAID, which works "OK", or to move to an enterprise-grade server with an integrated RAID controller and supported disks. Dell makes some remarkably cheap hardware with built-in controllers. For slightly more than you have in your beige-box hardware, you could get a brand new Dell with a service contract and a controller matched to its disks that will happily hammer away for years. Not to mention all sorts of goodies like DRAC cards for remote access, so you can tell what's going on with your server without a crash cart and a drive to that distant data center.
Hello there,
I've been searching for a way to tell whether a disk failure has occurred when a RAID failure is experienced.
Is there a way to do that?
Any comment is appreciated.
Thanks,
Mithat
My 4-5 year old Linux software RAID 1 went bad last week... A very humble system, but it took a week to finally isolate that one disk really was bad. A very intermittent, obscure problem.
Now that I've cooled down, I realize this inexpensive system did exactly what it was supposed to do. And after reading your tale, I'm less inclined to attempt the "hardware improvement."
Finding a replacement disk (and spares) was a problem, though.