August 21, 2008

Dumber is Faster with Large Data Sets (and Disk Seeks)

I remember reading Disk is the new Tape earlier this year and how much it resonated. That's probably because I was working for Yahoo at the time and hearing a lot about their use of Hadoop for data processing. In fact, I even did a couple videos (1 and 2) about that.

Anyway, I recently faced the reality of this myself. When I wrote about The Long Term Performance of InnoDB I'd been beating my head against a wall trying to get millions of records out of InnoDB efficiently.

It was taking days to get all the records. Yes, days!

After joking that it'd probably be faster to just dump the tables out and do the work myself in Perl, I thought about Disk is the new Tape and realized what I was doing wrong.

Allow me to offer some background and explain...

There are several tables involved in the queries I needed to run. Two of them are "core" tables and the other two are LEFT JOINed because they hold optional data for the rows I'm pulling. There are well over a hundred million records to consider and I need only about 10-15% of them.

And these records fall into roughly 500 categories. So what I'd been doing is fetching a list of categories, running a query for each category to find the rows I actually need, processing the results, and writing them to disk for further processing.

The query looked something like this:

    SELECT field1, field2, field3, ... fieldN
      FROM stuff_meta sm
      JOIN stuff s              ON sm.item_id = s.item_id
 LEFT JOIN stuff_attributes sa  ON sm.item_id = sa.item_id
 LEFT JOIN stuff_dates      sd  ON sm.item_id = sd.item_id
     WHERE sm.cat_id  = ?
       AND sm.status IN ('A', 'B', 'C')

That seemed, at least in theory, to be the obvious way to approach the problem. But the idea of waiting several days for the results led me to think a bit more about it (and to try some InnoDB tuning along the way).

While it seems very counter-intuitive, this was sticking in my head:

I’m still trying to get my head around this concept of "linear" data processing. But I have found that I can do some things faster by reading sequentially through a batch of files rather than trying to stuff everything in a database (RDF or SQL) and doing big join queries.

So I gave it a try. I wrote a new version of the code that dropped the cat_id and status conditions from the WHERE clause. Combining that with mysql_use_result in the client API meant the code had to process a stream of many tens of millions of records, handling the status filtering and sorting records into buckets based on cat_id (plus some extra bookkeeping).
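To give a feel for it, here's a stripped-down sketch of the streaming version using DBI and DBD::mysql's mysql_use_result option. The column names, connection details, and output files are placeholders, and the real bookkeeping is messier, but the shape is the same: one big unfiltered scan, with the filtering and bucketing handled in Perl.

use strict;
use warnings;
use DBI;

my ($user, $pass) = ('someuser', 'somepass');   # placeholders
my $dbh = DBI->connect('DBI:mysql:database=stuff_db;host=localhost',
                       $user, $pass, { RaiseError => 1 });

# mysql_use_result streams rows as the server sends them instead of
# buffering the whole result set in client memory first.
my $sth = $dbh->prepare(q{
    SELECT sm.item_id, sm.cat_id, sm.status, s.field1, sa.field2, sd.field3
      FROM stuff_meta sm
      JOIN stuff s              ON sm.item_id = s.item_id
 LEFT JOIN stuff_attributes sa  ON sm.item_id = sa.item_id
 LEFT JOIN stuff_dates      sd  ON sm.item_id = sd.item_id
}, { mysql_use_result => 1 });

$sth->execute;

my %bucket_fh;   # one output file per category

while (my $row = $sth->fetchrow_hashref) {
    next unless $row->{status} =~ /^[ABC]$/;    # status filter, done client-side

    my $cat = $row->{cat_id};
    unless ($bucket_fh{$cat}) {
        open $bucket_fh{$cat}, '>', "bucket_$cat.txt" or die "open: $!";
    }
    print { $bucket_fh{$cat} }
        join("\t", map { defined $_ ? $_ : '' }   # LEFT JOIN columns may be NULL
                   @{$row}{qw(item_id field1 field2 field3)}), "\n";
}

close $_ for values %bucket_fh;
$dbh->disconnect;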

As an aside, I should note that there used to be an ORDER BY on the original query, but I abandoned that early on when I saw how much work MySQL was doing to sort the records. The ORDER BY made my code a bit simpler, but it was far more efficient to handle the ordering outside the database.

Anyway, the end result was that I was able to get all the data I needed in a mere 8 hours. In other words, treating MySQL as an SQL-powered tape drive yielded a roughly 12-fold improvement in performance.

Put another way, taking the brain-dead stupid, non-SQL, mainframe-like approach got me results 12 times faster than doing it the seemingly "correct" way.

Now this isn't exactly what the whole disk vs. tape thing is about but it's pretty close. I'm aware that InnoDB works with pages (that will contain multiple records, some of which I don't need) and that's part of the problem in this scenario. But it's a really interesting data point. And it's certainly going to change my thinking about working with our data in the future.

Actually, it already has. :-)

Dumber is faster.

As I've mentioned before, craigslist is hiring. We like good Perl/MySQL hackers. And good sysadmin and network types too.

Ping me if you're interested in either role.

Posted by jzawodn at 02:21 PM

August 19, 2008

Lembert Dome Hike in Yosemite

Last weekend afforded an opportunity to explore the Lembert Dome Hike in Yosemite National Park.

Lembert Dome is the monolithic dome that dominates the eastern end of Tuolumne Meadows in Yosemite National Park. It's a justifiably popular ascent, particularly among day hikers in the area, with the summit offering magnificent views of Tuolumne Meadows to the west, the Cathedral Range to the south, and the Sierra crest to the east.

The trail starts out a bit steep but the views are definitely worth the trek up, as is a quick side trip to Dog Lake.

Here are a few pictures.

Pile of Rocks
Some rocks to mark the start of the trail...

Lembert Dome Hike in Yosemite
The clouds helped keep the heat down.

Lembert Dome Hike in Yosemite
Almost there!

Lembert Dome Hike in Yosemite
Looking back where we came from.

Me Looking Goofy
Hey, it's me!

Lembert Dome Hike in Yosemite
Look at all those trees...

Dog Lake
Dog Lake

Farm Near Evergreen Lodge
Farm near Evergreen Lodge (just outside the park)

The rest are here: Lembert Dome Hike in Yosemite on Flickr

Posted by jzawodn at 07:35 AM

August 18, 2008

Open Source Queueing and Messaging Systems?

Dear Lazyweb,

I'm interested in getting an idea of what open source message queueing systems exist that are fast, stable, and have some good replication (think multi-colo) and fault tolerance built in. The idea being, of course, that some processes want to send messages into a queue (of work to be done) and other processes will fetch those messages and do stuff with them.

Ideally, I'm looking for a system that allows for different message priorities--meaning that I'd like to be able to mark some messages as less important, so it's okay if we lose them in a crash. It'd also be handy to be able to set expiration times on messages.

Bonus points for stuff with good Perl libraries.

Put another way, if you wanted to run something like Amazon's SQS on your own infrastructure, what would you use as the building blocks?
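To make that a bit more concrete, here's roughly the interface I'd want to write against. To be clear, Queue::Client and everything about it below is hypothetical--it's a sketch of the requirements, not an existing module.

# Entirely hypothetical module and API -- a sketch of what I'm after.
use Queue::Client;

my $q = Queue::Client->new(
    servers => [ 'queue1.colo-a.example.com', 'queue2.colo-b.example.com' ],
);

# Producer: a low-priority message with an expiration time, so it's
# acceptable to lose it in a crash or drop it once it goes stale.
$q->put('thumbnail-work', {
    data     => { item_id => 12345 },
    priority => 'low',
    expires  => 3600,      # seconds
});

# Consumer: grab the next message and do the work.
while (my $msg = $q->get('thumbnail-work')) {
    process($msg->data);   # process() is whatever the worker actually does
    $msg->ack;             # tell the queue we're done with it
}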

Stuff I already know of (some of which doesn't meet my own criteria):

But surely there's more. Feel free to spew others in the comments below...

And even if you don't know of any others, I'd love to hear about your experience with any of the systems above or those already mentioned in the comments.

Update: A lot of folks are replying with "what's wrong with XXX in your list?" I haven't tested these yet. I'm looking to see what the landscape looks like before I dive in.

Posted by jzawodn at 11:49 AM

August 12, 2008

The Long Term Performance of InnoDB

The InnoDB storage engine has done wonders for MySQL users that needed higher concurrency than MyISAM could provide for demanding web applications. And the automatic crash recovery is a real bonus too.

But InnoDB's performance (in terms of concurrency, not really raw speed) comes at a cost: disk space. The technique for achieving this, multiversion concurrency control, can chew up a lot of space. In fact, that Wikipedia article says:

The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand reads are never blocked, which can be important for workloads mostly involving reading values from the database.

Indeed.

Imagine a set of database tables with tens of millions of rows and a non-trivial amount of churn (new records coming in and old ones being expired or removed all the time). You might see this in something like a large classifieds site, for example.

Furthermore, imagine that you're using master-slave replication and the majority of reads hit the slaves. And some of those slaves are specifically used for longer-running queries. It turns out that the combination of versioning, heavy churn, and long-running queries can lead to a substantial difference in the size of a given InnoDB data file (.ibd) on disk.

Just how much of a difference are we talking about? Easily a factor of 4-5x or more. And when you're dealing with hundreds of gigabytes, that starts to add up!

It's no secret that InnoDB isn't the best choice for data warehouse style applications. But the disk bloat, fragmentation, and ongoing degradation in performance may be an argument for having some slaves that keep the same data in MyISAM tables.

I know, I know. I can do the ALTER TABLE trick to make InnoDB shrink the table by copying all the rows to a new one, but that does take time. Using InnoDB is definitely not a use-it-and-forget-about-it choice--but what database engine is, really?
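For anyone who hasn't run into it, the trick is just a "null" ALTER that rebuilds the table (the table name here is a placeholder):

    ALTER TABLE big_table ENGINE=InnoDB;

That copies every row into a fresh tablespace and swaps it in, which is why it takes so long on a big table--and you only get the space back on disk if innodb_file_per_table is on. OPTIMIZE TABLE ends up doing the same rebuild for InnoDB.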

Looking at the documentation for the InnoDB plug-in, I expect to see a real reduction in I/O when using the new indexes and compression on a data set like this. (Others sure have.) But I don't yet have a sense of how stable it is.
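If I'm reading the plug-in docs right, turning on compression for a table looks roughly like this (placeholder table name again, and it needs the Barracuda file format plus per-table tablespaces):

    SET GLOBAL innodb_file_format = 'Barracuda';
    SET GLOBAL innodb_file_per_table = 1;

    ALTER TABLE big_table ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;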

Anyone out there in blog-land have much experience with it?

Posted by jzawodn at 01:05 PM

August 07, 2008

Fun with Network Programming, race conditions, and recv() flags

Last week I had the opportunity to do a bit of protocol hacking and found myself stymied by what seemed like a race condition. As with most race conditions, it didn't happen often--anywhere from 1 in 300 to 1 in 5,000 runs. But it did happen and I couldn't really ignore it.

So I did what I often do when faced with code that's doing seemingly odd things: insert lots of debugging (otherwise known as "print statements"). Since I didn't know whether the bug was in the client (Perl) or the server (C++), I had to instrument both of them. I'd changed both a bit, so they were equally suspect in my mind.

Well, to make a long, boring, and potentially embarrassing story short, I soon figured out that the server was not at fault. The changes I made to the client were the real problem.

I had forgotten about how the recv() system call really works. I had code that looked something like this (in Perl):

recv($socket, $buffer, $length, 0);
...
if (length($buffer) != $length) {
    # complain here
}

The value of $length was provided by the server as part of its response. So the idea was that the client would read exactly $length bytes and then move on. If it read fewer, we'd be stuck checking again for more data. And if we did something like this:

while (my $chunk = <$socket>) {
    $buffer .= $chunk;
}

There's a good chance it could block forever and end up in a sort of deadlock, each side waiting for the other to do something. The server would be waiting for the next request and the client would be waiting for the server to be "done."

Unfortunately for me, the default behavior of recv() is not to wait for the full amount of data requested. That means the code can't get stuck there--it simply does a best-effort read. If you ask for 2048 bytes but only 1536 are currently available, you'll end up with 1536 bytes. And that's exactly the sort of thing that would happen every once in a while.

The MSG_WAITALL flag turned out to be the solution. You can probably guess what it does...

This flag requests that the operation block until the full request is satisfied. However, the call may still return less data than requested if a signal is caught, an error or disconnect occurs, or the next data to be received is of a different type than that returned.

That's pretty much exactly what I wanted in this situation. I'm willing to handle the signal, disconnect, and error cases. Once I made that change, the client and server never missed a beat. All the weird debugging code and attempts to "detect and fix" the problem were promptly ripped out and the code started to look correct again.
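For the record, the change itself was tiny. Something like this, with the MSG_WAITALL constant coming from the Socket module (error handling abbreviated):

use Socket qw(MSG_WAITALL);

# Block until all $length bytes have arrived, instead of taking
# whatever happens to be in the buffer right now.
my $ret = recv($socket, $buffer, $length, MSG_WAITALL);

# A short read can still happen on a signal, error, or disconnect,
# so the length check stays.
if (!defined $ret or length($buffer) != $length) {
    # complain here
}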

The moral of this story is that you should never assume that the default behavior is what you want. Check those flags.

Now don't get me started about quoting and database queries...

Posted by jzawodn at 08:42 PM

August 04, 2008

I'm Thinking

In Amazing Powers of Concentration, Brad Feld says something that resonated with me.

I've never really understood the phrase "I'm thinking." It's too abstract for me. I like to think I think all the time. So "I'm thinking" doesn't feel like it applies to anything. For example, when "I'm running", it's pretty clear what I'm doing. "I'm thinking" - not so much so.

That's so true. Thinking is an ongoing and difficult to see activity.

In fact, I know of some people who are so busy thinking at times that they find it difficult to sleep at night. I used to have that problem a lot. However, it's rare these days. I'm not sure why. Maybe I'm just more fond of sleep than I used to be.

I suppose that if you're into meditation, there is a time during the day when you force yourself not to think. But that's pretty rare, I suspect.

Oh, I almost forgot about television...

Posted by jzawodn at 07:02 AM

August 03, 2008

Feline Diabetes or Living with a Diabetic Cat

About a week and a half ago, I noticed that Barnes (one of our two older cats) was thinner than he used to be--so much so that I felt his bones when I gave him the sort of back scratching that he loves so much.

Both he and his brother (Noble) are about 10 years old and have nearly always been on the heavy side. And, of course, they don't get to a vet regularly because they utterly detest car trips.

Barnes and Noble

Last Thursday we realized that it wasn't getting any better and took him over to the vet (Kirkwood Animal Hospital and Dr. Ueno) to see what was going on. Some on-line reading led me to believe that it was likely a case of Hyperthyroidism, which I'd heard of and thought was somewhat common in aging cats.

However, the doctor called back on Friday morning to tell me that Barnes was diabetic. :-( Not only did that mean another trip to the vet and a 6-8 hour stay for glucose testing, it also likely meant insulin shots for the rest of his hopefully long life.

It wasn't long before I found the FelineDiabetes.com web site and began reading about what this was likely to mean: dietary changes, closer monitoring, daily shots, and so on.

To make a long story short, Barnes is doing better now. He and the other three cats are adjusting to eating a low-carb cat food (Purina DM). I have an appointment for his brother Noble to get checked out next week. If he's headed down the same path, a distinct possibility given the role that genetics can play, we'd like to catch it ASAP.

The food is more expensive and the insulin shots aren't nearly as bad as I expected. But I really wish this hadn't happened. Diabetes puts him at risk for other complications down the road--just like in humans.

What you need to know...

If you're a cat owner, here are a few suggestions from our experience:

  1. Feed your cats a good diet--one they were designed to eat. That means avoiding the cheap foods and excessive snacking.
  2. Help them get lots of exercise. Use cat toys, catnip, a laser pointer, whatever works for them.
  3. Keep your cats indoors--they'll live much longer lives.
  4. Get your cats to the vet yearly. Eventually they'll get used to it. And even if they don't, it's for their own good.

Oh, I just dug up some of the pictures I took of Barnes and Noble back in 1999 when I first adopted them. They were about 3-6 months old at the time.

Just to lighten things up a bit, if you haven't already seen it, check out An Engineer's Guide to Cats.

There's probably a lot more I could say about this but will save it for another time. I'm sure we have much to learn yet. Now I'm off to get an injection ready.

Posted by jzawodn at 08:23 AM

August 02, 2008

Two weeks into my new job at craigslist...

Many people have asked (via IM, email, Twitter) how my new job is going, what craigslist is like, etc. So here are a few thoughts about my first two weeks in the new job.

The Commute

Despite what folks said in the comments of my little announcement, the commute really isn't that bad. Taking I-280 from Willow Glen (San Jose) up to near Golden Gate Park is about 55 minutes from pulling out of the garage to parking in San Francisco. And I've been able to find parking on Lincoln each time I've gone up--usually within 4-6 blocks from the office.

So 55 minutes of driving plus about 10 minutes of walking (which is good for me anyway) is very manageable if you're not doing it every day. If I did, I'd be less upbeat about it, I'm sure.

Having said that, I am going to experiment with the mass transit options as well. I'd like to give all the reasonable options a fair shake.

The Hardware

My laptop, a Lenovo ThinkPad T61 running Ubuntu Linux, is performing quite well. It's had one lockup that I cannot attribute to anything in particular. But other than that, it's a joy to work on--especially with Emacs Snapshot and its most excellent font rendering. (Learning VIM is still on my todo list...)

The biggest hassle so far has been VPN related. Every once in a while my laptop decides to reconnect to my wireless router at home and when it does it replaces the custom resolv.conf file with my "normal" home one. That results in a VPN that sort of works and sort of doesn't. I'm getting better at noticing when this happens and fixing it, but I really need to find a way to keep that from happening at all.

The Culture

In two weeks, I've only had one experience that I would come close to classifying as a "meeting." There really aren't conference rooms (yay!) but it did involve a whiteboard. However, unlike meetings I'm used to, it involved only the most essential people, had a clearly defined goal, and was very useful to me.

The engineering team has a great old-school Perl and Unix mentality (and sense of humor) to it that I really dig. Our private IRC channel is filled with a mix of useful information sharing and old fashioned joking, complaining, and ranting. It reminds me a lot of Yahoo in the 1999-2000 time frame.

The Food

Unlike Yahoo, craigslist has an abundance of eating establishments within very short walking distance. I suspect it'll take months before I've sampled everything nearby.

The Work

What am I actually doing?

Well, it's a mix of things at this stage. Since I know only a little bit about how things actually work, I'm asking a lot of questions and trying to get a sense of what's what and where. That always takes time in a new environment and with a new code base. But eventually the day does come when you suddenly realize that it's not an issue anymore and you must have things mostly figured out.

I'm also playing with alternatives to our current search. I've spent a week or so getting to know Sphinx, the open source search engine by Andrew Aksyonoff. People often use it as a replacement for MySQL's full-text search capabilities.

So far I'm quite impressed with its speed and capabilities, not to mention Andrew's willingness to offer advice and suggestions. I've also been using Jon Schutz's Sphinx::Search Perl module. I've had to slightly modify the code of both to get them to perform the way we'd like, but the modifications aren't terribly extensive. As is often the case, what took the most time was figuring out what I really wanted to do and then how to do it.
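For the curious, talking to searchd from Perl with Sphinx::Search looks roughly like this. The host, port, index name, and query are placeholders for whatever your sphinx.conf defines:

use Sphinx::Search;

my $sph = Sphinx::Search->new();
$sph->SetServer('localhost', 3312);    # wherever searchd is listening
$sph->SetMatchMode(SPH_MATCH_ALL);     # every word must match
$sph->SetLimits(0, 20);                # first 20 results

my $results = $sph->Query('digital camera', 'postings');

if ($results) {
    for my $match (@{ $results->{matches} }) {
        print "doc=$match->{doc} weight=$match->{weight}\n";
    }
}
else {
    warn 'search failed: ' . $sph->GetLastError . "\n";
}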

I may have more to say about all this later.

Hiring

It looks like we've got a bit more room at craigslist. As Jim mentioned on the craigslist blog:

Worth mentioning that the CL tech hiring bit remains set to "1" for star LAMPerl developers, systems heavyweights, and networking wizards.

If you're a great Perl hacker, amazingly skilled networking geek, or someone who really knows systems and data center stuff, we may be waiting for you.

Ping me if you're interested and we'll get the ball rolling.

Finally...

Anyway, that's the story so far.

Am I happy in my new role? You bet.

Do I miss some of my old colleagues at Yahoo? Of course. In fact, I missed Chad's going away party due to a sick cat, which is a whole separate and sad story I need to tell.

See Also: Settling in to a New Environment at Craigslist

Posted by jzawodn at 09:08 AM