I remember reading Disk is the new Tape earlier this year and how much it resonated. That's probably because I was working for Yahoo at the time and hearing a lot about their use of Hadoop for data processing. In fact, I even did a couple videos (1 and 2) about that.
Anyway, I recently faced the reality of this myself. When I wrote about The Long Term Performance of InnoDB I'd been beating my head against a wall trying to get millions of records out of InnoDB efficiently.
It was taking days to get all the records. Yes, days!
After joking that it'd probably be faster to just dump the tables out and do the work myself in Perl, I thought about Disk is the new Tape and realized what I was doing wrong.
Allow me to offer some background and explain...
There are several tables involved in the queries I needed to run. Two of them are "core" tables and the other two are LEFT JOINed because they hold optional data for the rows I'm pulling. There are well over a hundred million records to consider and I need only about 10-15% of them.
And these records fall into roughly 500 categories. So what I'd been doing is fetching a list of categories, running a query for each category to find the rows I actually need, processing the results, and writing them to disk for further processing.
The query looked something like this:
SELECT field1, field2, field3, ... field N FROM stuff_meta sm, stuff s LEFT JOIN stuff_attributes sa ON sm.item_id = sa.item_id LEFT JOIN stuff_dates sd ON sm.item_id = sd.item_id WHERE sm.item_id = s.item_id AND sm.cat_id = ? AND sm.status IN ('A', 'B', 'C')
That seemed, at least in theory, to be the obvious way to approach the problem. But the idea of waiting several days for the results led me to think a bit more about it (and to try some InnoDB tuning along the way).
While it seems very counter-intuitive, this was sticking in my head:
I’m still trying to get my head around this concept of "linear" data processing. But I have found that I can do some things faster by reading sequentially through a batch of files rather than trying to stuff everything in a database (RDF or SQL) and doing big join queries.
So I gave it a try. I wrote a new version of the code that eliminated the two AND bits in the WHERE clause. Combining that with using mysql_use_result in the client API, meant it had to process a stream of many tens of millions of records, handle the status filtering and shorting records into buckets based on cat_id (and some extra bookkeeping).
As an aside, I should note that there used to be an ORDER BY on that original query, but I abandoned that early on when I saw how much work MySQL was doing to sort the records. While it made my code a bit easier, it was far more efficient to track things outside the database.
Anyway, the end result was that I was able to get all the data I needed in merely 8 hours. In other words, treating MySQL as an SQL powered tape drive yielded a 12 fold improvement in performance.
Put another way, taking the brain-dead stupid, non-SQL, mainframe-like approach got me results 12 times faster than doing it the seemingly "correct" way.
Now this isn't exactly what the whole disk vs. tape thing is about but it's pretty close. I'm aware that InnoDB works with pages (that will contain multiple records, some of which I don't need) and that's part of the problem in this scenario. But it's a really interesting data point. And it's certainly going to change my thinking about working with our data in the future.
Actually, it already has. :-)
Dumber is faster.
Ping me if you're interested in either role.
Posted by jzawodn at August 21, 2008 02:21 PM