Dumber is Faster with Large Data Sets (and Disk Seeks) (by Jeremy Zawodny)

I remember reading Disk is the new Tape earlier this year and how much it resonated. That's probably because I was working for Yahoo at the time and hearing a lot about their use of Hadoop for data processing. In fact, I even did a couple videos (1 and 2) about that.

Anyway, I recently faced the reality of this myself. When I wrote about The Long Term Performance of InnoDB I'd been beating my head against a wall trying to get millions of records out of InnoDB efficiently.

It was taking days to get all the records. Yes, days!

After joking that it'd probably be faster to just dump the tables out and do the work myself in Perl, I thought about Disk is the new Tape and realized what I was doing wrong.

Allow me to offer some background and explain...

There are several tables involved in the queries I needed to run. Two of them are "core" tables and the other two are LEFT JOINed because they hold optional data for the rows I'm pulling. There are well over a hundred million records to consider and I need only about 10-15% of them.

And these records fall into roughly 500 categories. So what I'd been doing is fetching a list of categories, running a query for each category to find the rows I actually need, processing the results, and writing them to disk for further processing.

The query looked something like this:

    SELECT field1, field2, field3, ... field N
      FROM stuff_meta sm, stuff s
 LEFT JOIN stuff_attributes sa ON sm.item_id = sa.item_id
 LEFT JOIN stuff_dates      sd ON sm.item_id = sd.item_id
     WHERE sm.item_id = s.item_id
       AND sm.cat_id  = ?
       AND sm.status IN ('A', 'B', 'C')

That seemed, at least in theory, to be the obvious way to approach the problem. But the idea of waiting several days for the results led me to think a bit more about it (and to try some InnoDB tuning along the way).

While it seems very counter-intuitive, this was sticking in my head:

I’m still trying to get my head around this concept of "linear" data processing. But I have found that I can do some things faster by reading sequentially through a batch of files rather than trying to stuff everything in a database (RDF or SQL) and doing big join queries.

So I gave it a try. I wrote a new version of the code that eliminated the two AND bits in the WHERE clause. Combining that with using mysql_use_result in the client API, meant it had to process a stream of many tens of millions of records, handle the status filtering and shorting records into buckets based on cat_id (and some extra bookkeeping).

As an aside, I should note that there used to be an ORDER BY on that original query, but I abandoned that early on when I saw how much work MySQL was doing to sort the records. While it made my code a bit easier, it was far more efficient to track things outside the database.

Anyway, the end result was that I was able to get all the data I needed in merely 8 hours. In other words, treating MySQL as an SQL powered tape drive yielded a 12 fold improvement in performance.

Put another way, taking the brain-dead stupid, non-SQL, mainframe-like approach got me results 12 times faster than doing it the seemingly "correct" way.

Now this isn't exactly what the whole disk vs. tape thing is about but it's pretty close. I'm aware that InnoDB works with pages (that will contain multiple records, some of which I don't need) and that's part of the problem in this scenario. But it's a really interesting data point. And it's certainly going to change my thinking about working with our data in the future.

Actually, it already has. :-)

Dumber is faster.

As I've mentioned before, craigslist is hiring. We like good Perl/MySQL hackers. And good sysadmin and network types too.

Ping me if you're interested in either role.

Posted by jzawodn at August 21, 2008 02:21 PM | edit

Reader Comments

# Tom Printy said:

I wonder if this a disk thing or a MySQL thing....

If it was possible it might be interesting to load your data on some solid state drives and see what the difference would be.

I am sure you did this but you might also want to peek at your explain plan to make sure you are using indexes and the like.....

on August 21, 2008 03:38 PM

# Jonathan Disher said:

I believe the consensus was that InnoDB sucks balls on LEFT JOIN's. That can't be helping.

But yeah, it's truly sad when it's faster to dump more data than you care about (and discard the excess in your code). Sadly I may be headed this route in Buzz, we'll have to see.

on August 21, 2008 03:40 PM

# Kevin Scaldeferri said:

I've had this experience before. If you're going to iterate over multiple keys (like the categories in this example), it's almost always faster to just grab everything from the database and sort things into buckets in your code.

on August 21, 2008 04:15 PM

# Mark Callaghan said:

Why blame SQL, InnoDB and disks? Maybe the problem is MySQL. It has a very simple external sort (no async IO among other things). It doesn't have hash join. Index nested loops join is really good at generating too much random IO. A good hash join and external sort will do exactly what you want -- run as fast as the sequential IO throughput capacity of your server.

on August 21, 2008 08:16 PM

# Jeremy Zawodny said:

Mark:

I guess it's easy to blame when you've got a good idea of what it's doing inside. :-)

But, yeah, I definitely see your point. It's not necessarily the perfect tool for the job in this case. And, of course, the schema design isn't optimal either.

on August 21, 2008 10:26 PM

# NM said:

You could also have done that the "correct" way by using a temporary table, ideally stored on a separate spindle.

on August 21, 2008 11:22 PM

# Lee said:

Its very hard to judge a query like that without an idea of what indexes you have, and what query plan mysql is using.

Also, how selective is sm.status? Do you have an index on it?

on August 21, 2008 11:57 PM

# andy said:

"So what I'd been doing is fetching a list of categories, running a query for each category to find the rows I actually need, processing the results, and writing them to disk for further processing."

I think the problem is SQL itself. There are proprietary non-SQL database languages that handle this problem trivially.

In Progress, for example:

for each stuff_meta ...
find stuff_attributes ...
find stuff_dates ...
...
end

...which ensures that you only read stuff_meta once. Simple.

on August 22, 2008 01:19 AM

# Craig said:

Along those lines, Hadoop seems "dumber." So why use MySQL at all and worry about optimizing queries? Write a little piece of code that does what you want and is executed on a cloud of machines. Done. Isn't that dumber, er better?

on August 22, 2008 08:02 AM

# bob said:

When you say 'millions of records' how many do you mean exactly. I can query 10s of millions of records with a query more complicated than you're using on SQL Server in a matter of a minute or so. Hardware is dual Xeon quad core with 12GB of memory.

on August 22, 2008 09:56 AM

# Paul Houle said:

Full table scans are highly effective in many domains. At least there's an upper bound on the amount of I/O done, and it can be done in the most efficient possible manner.

There were some nice articles in the Bell System Technical Journal back around 1980 about the systems that were used to process large volumes of telephone billing records. They'd developed tools that were much in the spirit of 'awk' and 'grep' but used a binary data format to save CPU cycles and disk space.

on August 22, 2008 11:05 AM

# wolf550e said:

Andy: this is exactly what SQL was meant to solve. It's declarative instead of imperative, so you let the query planner convert the SQL statement into the plan you describe (a for loop), using what it knows about the tables, the indexes, the row sizes, the media the tables and indexes are stored on, maybe even how much of that is already in cache etc. A good SQL DB will perform this exact SQL in the best way you can imagine. MySQL is probably too simple for that, but we can't say without knowing the schema and seeing the plan.

I guess mysql would read 'stuff_meta' sequentially, filter by 'cat_id' and 'status' (if they're not indexed or not selective enough) and read 'stuff' using an index on item_id. If stuff is big that means lots or random IO.

on August 22, 2008 12:48 PM

# Brian Blood said:

Wouldn't using a HAVING clause produce the same filtering of the large result set into a smaller stream of records?

on August 25, 2008 01:23 PM

# Maurice said:

@Paul Houle

Interesting I worked on billing systems for BT doing complex billing for Dialcom /Telecom Gold services on PRIME superminis and we used a custom rule based system the you could configure to read almost any format of log (we had 4 different logs with multiple types of billing records in each log) and spit out an intermediate record tagged with the correct product code and also work out their break points where in each product i.e. so much for the first x characters so much for the second y chars and so on.

It was written by dialcom in PL/1G which was a surprise to us as it was supposed to be done in Fortran 77. the billing system was at one time one of the 3 or 4 international products sold by BT.

I suspect this shows the down side in starting out with mysql and just letting it grow like topsy. It would be interesting to see the speed of a DB2 or Oracle implementation of this as tuned by some of the DBA gurus I worked with in BT on of whom could drop interesting facts like “my boss used to be Dijkstra”

on August 26, 2008 09:14 AM

# Whitney Sorenson said:

I'm curious if you tried sending all the categories in a single IN query. Wouldn't this possibly cause the optimizer to ignore the index and do a full table scan instead, roughly simulating what you did in code?

Maybe it would be faster than issuing the individual queries.

on August 26, 2008 11:42 AM

# said:

Errrm, your indexing schemas must suck. If it is the filtering on category and status that is a problem. Not to say that mySQL/InnoDB aren't toys, but other databases, with appropriate indices will do this as fast as the disk can stream your major table. Or memory if you've got enough, and you preload the cache.

Generally, if you know how your database is operating (specifically the nesting of loops and the choice of indices) then you can force the desired behaviour.

Having said that, there is nothing wrong with doing stream processing, just avoid the database in the first place.

on August 28, 2008 03:03 AM

# Tim Hare said:

This is interesting, and the commenter who mentioned PL/1G almost touched on what I'm about to say, but didn't.

I mean no insult by this, but I think part of the issue is that younger people working in computers no longer get any training related to sequentially processing records in a file (the oft-denigrated "batch" work).

I could relate a story from my days converting library data from Apple IIs to IBM PCs as an example, but we don't need the "old guy story" here. The point is, that sometimes working with record-structured sequential files is just the right way to go, but if you've never done it before it might not ever occur to you.

on September 9, 2008 06:29 AM

# Dmitriy said:

This is exactly the type of thing that one has to do all the time, but one *shouldn't* have to do at all.

Like an earlier commenter pointed out, the power and promise of a declarative language is that you just tell it what you want, and it will figure out the best way to do that.

Sadly, none of the existing implementations (at least the ones I've tried -- but that does include both open and closed-source DBs) deliver fully on the promise, leading to people devolving to simpler access methods. It's like having a car that drives itself, but sometimes drives to the wrong place.. Sometimes it's faster to walk (or take a map/reduce bicycle).

Any time a user has to fall back to "dumber is better" or play with in clauses vs inner selects, that's a sign:

1. The user and/or the DBA need to spend time thinking about their schemas, data layout choices, etc.
2. The DB community, in the meantime, needs to keep plugging with figuring out ways to determine necessary indexes on the fly, adjust data storage policies according to observed and predicted usage, etc.
3. Someone at MS, Oracle, IBM, Sun, Postgres, etc needs to beat up on their optimizer until it starts delivering on the declarative promise. Which, of course, is really really hard.

But while the CS PhDs of the world keep working on those things -- thanks, Jeremy, for posting your tips, they are certain to be useful to people who are actually trying to get things done using imperfect (albeit good!) technology..

on September 9, 2008 01:03 PM

# said:

If your database job is taking longer than a full table scan, than you need to start from scratch and look at your explain plans and logic. If both are OK, then just do a full scan and put all of the other logic and variables into your code.

on October 8, 2008 05:48 AM

# leetz said:

Oh i remember this one. It's where "M" was 8x faster than SQL becuase it was optimized for searching strings.

on November 19, 2008 12:45 PM

Disclaimer: The opinions expressed here are mine and mine alone. My current, past, or previous employers are not responsible for what I write here, the comments left by others, or the photos I may share. If you have questions, please contact me. Also, I am not a journalist or reporter. Don't "pitch" me.

Privacy: I do not share or publish the email addresses or IP addresses of anyone posting a comment here without consent. However, I do reserve the right to remove comments that are spammy, off-topic, or otherwise unsuitable based on my comment policy. In a few cases, I may leave spammy comments but remove any URLs they contain.