Over the years, I've seen too many posts on the MySQL mailing list from eager users of free software on cheap hardware who want 24x7x365 availability for their databases. Inevitably, the question draws replies from a few folks who say something like:

Here's what I did... I set up replication and wrote some Perl scripts to notice when there's a problem. They'll switch everything to the slave. The code is ugly but it works for me. :-)
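
(For the curious, the scripts in question usually boil down to a loop like the one below. This is just a rough sketch of the pattern, written in Python rather than Perl; the hostnames and the promote_replica() stub are made up, and a real failover also has to deal with split-brain, replication lag, and clients still pointed at the old address.)

    #!/usr/bin/env python
    # A naive "notice when the master is down, switch to the slave" monitor.
    # Rough sketch only: hosts and the promotion step are hypothetical.
    import socket
    import time

    MASTER = ("db-master.example.com", 3306)    # hypothetical hostnames
    REPLICA = ("db-replica.example.com", 3306)
    CHECK_INTERVAL = 5                          # seconds between checks
    FAILURES_BEFORE_FAILOVER = 3                # avoid flapping on one blip

    def is_reachable(host, port, timeout=2.0):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def promote_replica():
        """Placeholder for the hard part: moving a virtual IP or updating
        DNS, stopping replication, and making sure the old master cannot
        come back and quietly accept writes (split-brain)."""
        print("promoting %s:%d to master (stub)" % REPLICA)

    def main():
        failures = 0
        while True:
            if is_reachable(*MASTER):
                failures = 0
            else:
                failures += 1
                if failures >= FAILURES_BEFORE_FAILOVER:
                    promote_replica()
                    break
            time.sleep(CHECK_INTERVAL)

    if __name__ == "__main__":
        main()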

I always cringe when reading those responses. I shouldn't really complain, though. I've been guilty of providing terse replies once in a while. But usually I just ignore 'em because I don't have the time or experience to really do the question justice.

Today I read a response from Michael Conlen that finally comes close to explaining why you're probably asking for something you don't need, why it's not cheap, and what you really need to be thinking about.

Since it was posted to a public list, I don't mind quoting it here (with a few spelling fixes).

First, get an acceptable outage rate. You're only going to get so many nines, and your budget depends on how many. The system will fail at some point, no matter what, even if it's only for a few seconds. That's reality. Figure out what kinds of failures you can tolerate based on how many 9's you get and what kinds you have to design around. From there you can figure out a budget. 99.999% uptime is 5 minutes and 15 seconds per year of total downtime. 99.99% is 52.56 minutes, and so on. At some point something will happen, and I've never seen anyone offer more than 5 9's, and IBM charges a lot for that. Then figure out everything that could cause an outage, figure out how to work around each one, and give each a budget. Watch how many 9's come off that requirement.
If you have to use MySQL, I'd ditch PC hardware and go with some nice Sun kit if you haven't already, or maybe an IBM mainframe. Sun's Ex8xx line should let you do just about anything without taking it down (like change the memory while it's running). Then I'd get a bunch of them. Then I'd recode the application to handle the multiple writes to multiple servers and keep everything atomic, then test the hell out of it. There are a lot of issues to consider in there, and you probably want someone with a graduate degree in computer science to look over the design for you. (Anything this critical and I get someone smarter than me to double-check my designs and implementations.) It may be best to just build it into the driver so the apps are consistent.
On the other hand, if you have all this money, look at some of the commercial solutions. This is probably heresy on this list, but hey, it's about the best solution for the needs, right? Sybase or DB2 would be my first choices, depending on the hardware platform (Sun or mainframe). Those systems are set up to handle failover of the master server. I know for Sun you want to be looking at Sun Clustering technology, a nice SAN, and a couple of nice servers. You write to one server, but when it fails the backup server starts accepting the write operations as if it were the master. There's a general rule in software engineering that says "if you can buy 80% of what you want, you're better off doing that than trying to engineer 100%."
Think about the networking. Two data paths everywhere there's one. Two switches, two NICs for each interface, each going to a different switch.
Depending on where your "clients" are, you need to look at your datacenter. Is your database server feeding data to clients outside your building? If so, you probably want a few servers in a few different datacenters. At least something like one on the east coast and one on the west coast in the US, or the equivalent in your country, each with a different uplink to the Internet. Get portable IP addresses and do your own BGP. That way, if a WAN link fails, the IP addresses will show up on the other WAN link even though it's from a different provider.
This is just a quick rundown of immediate issues in a 24x7x365 setup; it's not exhaustive. Think about every cable, every cord, every component, from a processor to a memory chip, and think about what happens when you pull it out or unplug it. Then make it redundant.

Well said.

Like the title of this entry says, High Availability is NOT Cheap.

Now, I know what you're thinking: these folks who are asking for 24x7x365 don't really need what they're asking for, so a response like this is not helpful.

Re-read the first three sentences of the reply.
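
If you want to put numbers on those outage rates yourself, the arithmetic is simple. Here's a quick back-of-the-envelope sketch (mine, not from the reply) that reproduces the figures quoted above:

    # Convert an availability target into a yearly downtime budget.
    MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600

    for availability in (0.999, 0.9999, 0.99999):
        downtime = MINUTES_PER_YEAR * (1 - availability)
        print("%.3f%% uptime allows %6.2f minutes of downtime per year"
              % (availability * 100, downtime))

    # 99.900% uptime allows 525.60 minutes of downtime per year
    # 99.990% uptime allows  52.56 minutes of downtime per year
    # 99.999% uptime allows   5.26 minutes of downtime per year (~5m 15s)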

Posted by jzawodn at June 21, 2003 09:12 AM

Reader Comments
# Ben Meadowcroft said:

Reminds me of an experience I heard of a while back. A consultant was brought in by a CEO to evaluate the 24x7x365 availability the CIO was saying could be done (power struggles, eh!). Anyway, the story goes that during the morning meeting everything was going OK, with the CIO happily stating that the mission-critical system was robust enough to handle anything. The consultant asked for a 10-minute break, during which he headed to his car... and came back up with a chain saw.

Curious looks were passed around the room. When the meeting reconvened, the consultant looked at the CIO and asked where the server room was. After receiving the answer, he asked how the system would handle having one of its servers chainsawed in half. At this the CIO furiously turned to the CEO and stated that this was a ridiculous test and a waste of a server. The consultant replied that this would be a valuable "trophy" test, and that if the CIO was really confident the system would keep running, then the cost of the server would be easily covered. Turning to the CEO, the consultant asked if this would be OK; the CEO smiled and said sure, go ahead.

With a defeated look on his face, the CIO confessed that he wasn't sure the system would survive... Moral of the story: 24x7x365 is really hard to do! Would your system stand up to the test if someone took a chainsaw to one of the servers?

on June 21, 2003 11:24 AM
# dws said:

"High Availability" is one of those things that is often approached from the "don't know what you don't know" side. Someone has been lucky enough to keep a database server running for some length of time, perhaps with simple replication, and they truly believe that it's not that difficult to be highly available. These are the people to watch out for. The folks who ask questions are at least teachable.

on June 21, 2003 11:36 AM
# justin said:

I disagree with your comment about "Someone has been lucky enough to keep a database server running for some length of time, perhaps with simple replication". Having gone through this struggle with my previous company, I know that it is not that hard to implement if you have dump trucks full of money:

- BGP fiber
- Dual firewalls
- Dual load balancers
- And the required networking gear
- Cluster of available machines to serve requests, taking into account that some will fail.

You should have a fairly stable environment, which should withstand the chainsaw test. This solution was not overly difficult to implement; it just costs some real money.

on June 22, 2003 10:35 AM
# eLGie said:

So what's the big deal anyhow with high availability? Smells and sounds all too commercial to me. I host my own and do a good job of it. My users understand that my server is on when I want it on and off when I want it off. I sometimes have dreadfully long page loads too! This still doesn't stop people from using my sites or otherwise coming back. All this high availability crap is FUD from hosting companies trying to stop average people from hosting their own sites, or to monopolize hosting.

on June 22, 2003 07:20 PM
# justin said:

Actually, you are incorrect. Hosting companies are not the only people who require 24/7 operations. The company that I used to work for hosted an e-commerce site internally (we had dedicated fiber to our building), where people would download software (a very popular digital imaging tool) and pay via credit card. We needed a solution where we could accept orders and allow users to securely download the software.

We wanted to accept orders without interruption. For every minute (60 seconds) that users/customers were unable to pay for and download the software, we lost $5,000 US. So it was in our best interests to spend the bucks and make fairly sure we had all bases covered.

If you are just running some mom-and-pop hosting shop, then I guess it is acceptable to have a service that is down sometimes. But I would definitely not have you hosting anything that I was expecting a return on.

on June 22, 2003 08:25 PM
# BDKR said:

I'd like to know how you can make a cluster owner-proof. I came in this morning to find that both databases had gone south, or so it seemed. It turned out to be something much simpler: a loose cable on the primary! When the bozo owner decided that the primary was down, he tried connecting to the backup, but failed to remember that the primary db always uses a virtual IP address. When he couldn't get things to work, he started rebooting the box over and over again in frustration! Note: he is a Windows user and has no idea how to shut down a Linux box other than just hitting the power button.

It was a great Monday morning!

As for the part about ditching PC equipment, isn't that what Google is using? Perhaps I misunderstood something there.

on June 23, 2003 09:22 AM
# justin said:

There are many different types of load balancing and clustering techniques. You seem to be referring to two different technologies.

Basically, you have a "cluster" of computers that are going to perform a specific task; you can think of them as resources. Think of a cluster as many systems configured in exactly the same manner, thus increasing the overall capacity. For example, say you have a cluster of computers named computer01 to computer10 (10 in total).

Then you have a software or hardware solution that allocates those resources (load balancers). Load balancing comes in many forms, from simple round-robin DNS to expensive hardware switches. On a hardware load balancer you would input the ten computer names (computer01 to computer10), then specify the service type (http, ftp, smtp, etc.) and the share of the load each system should take. So you can distribute the load appropriately among many different systems with varying abilities.
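
(The scheduling piece of that is small enough to sketch in a few lines. This is just an illustration of weighted selection across the example pool above, not how any particular vendor's box works; the weights are made up.)

    import random

    # The example pool: computer01 .. computer10, each with a weight that
    # says how big a share of the traffic it should take.
    POOL = {"computer%02d" % n: 1 for n in range(1, 11)}
    POOL["computer01"] = 3   # pretend this box is beefier, so give it 3x

    def pick_server(pool):
        """Weighted random choice: probability proportional to weight."""
        names = list(pool)
        weights = [pool[name] for name in names]
        return random.choices(names, weights=weights, k=1)[0]

    # Dispatch a handful of simulated requests.
    for _ in range(5):
        print("request ->", pick_server(POOL))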

Google uses a proprietary method for managing high availability through software, and then has hardware load balancing devices to manage the clusters of computers.

on June 23, 2003 11:30 AM
# adamsj said:

When I think "High Availability" and "DBMS", I always think Teradata (in the non-OLTP world)--I've seen systems in the multi-Terabyte range stay up for months.

But you know what else? It's expensive as hell, runs on proprietary hardware, and costs an arm and a leg for support. Look at the TPC results and see just how expensive all that availability is.

(Disclaimer: I don't work for Teradata, but I do work extensively with their stuff. I like it.)

on June 23, 2003 07:46 PM
# Kiran said:

Wow, nice stuff to explain to a guy who dreams of shooting a fish in dark water on a dark night with dark glasses on...

Really, I felt like I was explaining this to my manager...

My manager always says he needs 365 x 7 x 24, and I always oppose it.

on June 13, 2007 01:44 PM