Over the years, I've seen too many posts on the MySQL mailing list from eager users of free software on cheap hardware who want 24x7x365 availability for their databases. Inevitably, the question gets replies from a few folks who respond with something like:
Here's what I did... I setup replication and wrote some Perl scripts to notice when there's a problem. They'll switch everything to the slave. The code is ugly but it works for me. :-)
I always cringe when reading those responses. I shouldn't really complain, though. I've been guilty of providing terse replies once in a while. But usually I just ignore 'em because I don't have the time or experience to really do the question justice.
Today I read a response from Michael Conlen that finally comes close to explaining why you're probably asking for something you don't need, why it's not cheap, and what you really need to be thinking about.
Since it was posted to a public list, I don't mind quoting it here (with a few spelling fixes).
First get an acceptable outage rate. Your only going to get so many nines, and your budget depends on how many. The system will fail at some point, no matter what, even if it's only for a few seconds. That's reality. Figure out what kinds of failures you can tolerate based on how many 9's you get and what kinds you have to design around. From there you can figure out a budget. 99.999% uptime is 5 minutes and 15 seconds per year of total downtime. 99.99% is 52.56 minutes and so on. At some point something will happen, and I've never seen anyone offer more than 5 9's, and IBM charges a lot for that. Then, figure out everything that could cause an outage, figure out how to work around them and give them a budget. Watch how many 9's come off that requirement.
If you have to use MySQL I'd ditch PC hardware and go with some nice Sun kit if you haven't already, or maybe a IBM mainframe. Sun's Ex8xx line should let you do just about anything without taking it down (like change the memory while it's running). Then I'd get a bunch of them. Then I'd recode the application to handle the multiple writes to multiple servers and keep everything atomic, then test the hell out of it. There's a lot of issues to consider in there, and you probably want someone with a graduate degree in computer science to look over the design for you. (anything this critical and I get someone smarter than me to double check my designs and implementations). It may be best to just build it in to the driver so the apps are consistent.
On the other hand, if you have all this money, look at some of the commercial solutions. This is probably heresy on this list, but hey, it's about the best solution for the needs right? Sybase or DB2 would be my first choices depending on the hardware platform (Sun or Mainframe). The systems are setup to handle failover of the master server. I know for Sun you want to be looking at Sun Clustering technology, a nice SAN and a couple of nice servers. You write to one server, but when it fails the backup server starts accepting the write operations as if it were the master. There's a general rule with software engineering that says "if you can buy 80% of what you want, your better off doing that than trying to engineer 100%"
Think about the networking. two data paths everywhere there's one. Two switches, two NIC cards for each interface, each going to a different switch.
Depending on where your "clients" are you need to look at your datacenter. Is your database server feeding data to clients outside your building? If so you probably want a few servers in a few different datacenters. At least something like one on the east coast and one on the west coast in the US, or the equivalent in your country, both of whom have different uplinks to the Internet. Get portable IP addresses and do your own BGP. That way if a WAN link fails the IP addresses will show up on the other WAN link even though it's from a different provider.
This is just a quick run down of immediate issues in a 24x7x365, it's not exhaustive. Think about every cable, every cord, every component, from a processor to a memory chip and think about what happens when you pull it out or unplug it, then make it redundant.
Like the title of this entry says, High Availability is NOT Cheap.
Now, I know what you're thinking. These folks who are asking for 24x7x365 don't really need what they're asking for. A response like this is not helpful.
Re-read the first three sentences of the reply again.
Posted by jzawodn at June 21, 2003 09:12 AM