I'm thinking about ways to store archival data for the long term and wanted to solicit input, advice, warnings, etc. from anyone who's been down this road.
Essentially I'm dealing with a system where "live" and "recently live" data is stored in a set of replicated MySQL servers and queried in real-time. As time goes on, however, older "expired" data is moved to a smaller set of replicated "archive" servers that also happen to be MySQL.
This is problematic for a few reasons, but rather than be all negative, I'll express what I'm trying to do in the form of some goals.
There are a few high-level things I'd like this archive to handle based on current and future needs:
- Be able to store data for the foreseeable future. That means hundreds of millions of records and, ultimately, billions.
- Fast access to a small set of records. In other words, I like having MySQL and indexes that'll get me what I want in a hurry without having to write a lot of code. The archive needs to be able to handle real-time queries quickly. It does this today and needs to continue to work.
- Future-proof file/data format(s). One problem with simply using MySQL is that there will be schema changes over time. A column may be added, dropped, or renamed. Such a change can't easily be applied retroactively to a large data set sitting in one big table. And if you don't apply it, the code reading the data has to be prepared to deal with those changes: NULLs appearing, missing columns, and so on.
- Fault tolerance. In other words, the data has to live in more than one place.
- Support for large scans on the data. This can be to do full-text style searches, looking for patterns that can't easily be indexed, computing statistics, etc.
- It's worth noting that data is added to the archive on a constant basis and is queried regularly in a variety of ways. But there are no deletes or updates occurring. It's a write-heavy system most of the time.
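To make the schema-change concern above concrete, here's a minimal sketch of the kind of defensive reading code I'm talking about. It assumes records come back as JSON-ish dicts; the column names and defaults are made up for illustration:

```python
import json

# Hypothetical reader: older rows may predate newer columns, so supply
# explicit defaults instead of letting missing keys or NULLs surprise
# downstream code.
DEFAULTS = {"user_id": None, "body": "", "tags": []}

def normalize(raw):
    """Parse one archived record and fill in columns added after it was written."""
    record = json.loads(raw)
    for column, default in DEFAULTS.items():
        if record.get(column) is None:
            record[column] = default
    return record

old_row = '{"user_id": 42, "body": "hello"}'  # written before "tags" existed
print(normalize(old_row)["tags"])             # → []
```

The point is that the defaults live in one place rather than being scattered through every query and report.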
Pieces of a Possible Solution
I'm not sure that a single tool or piece of infrastructure will ever solve all the needs, but I'm thinking there may be several open source solutions that can be put to use.
You'll notice that this involves duplicating data, but storage is fairly cheap. And each piece is especially good at solving one or more of our needs.
- MySQL. I still believe there's a need for having a copy of the data either denormalized or in a star schema in a set of replicated MySQL instances using MyISAM. The transactional overhead of InnoDB isn't really needed here. To keep things manageable one might create tables per month or quarter or year. Down the road maybe Drizzle makes sense?
- Sphinx. I've been experimenting with Sphinx for indexing large amounts of textual data (with some numeric attributes) and it works remarkably well. This would be very useful instead of building MySQL full-text indexes or doing painful LIKE queries.
- Hadoop/HDFS and Flat Files or a simple record structure. To facilitate fast batch processing of large chunks of data, it'd be nice to have everything stored in HDFS as part of a small Hadoop cluster where one can use Hadoop Streaming to run jobs over the entire data set. But what's a good, future-proof file format that's also efficient? We could use something like XML (duh), JSON, or even Protocol Buffers. And it may make sense to compress the data with gzip too. Maybe put a month's worth of data per file and compress? Even Pig could be handy down the road.
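For the Sphinx piece, the setup is mostly configuration: point a source at MySQL, pull the text plus a few numeric attributes, and build an index from it. A rough sphinx.conf sketch, with placeholder credentials, table, and column names:

```
source archive_src
{
    type      = mysql
    sql_host  = localhost
    sql_user  = archive     # placeholder credentials
    sql_pass  = secret
    sql_db    = archive
    # Text to index plus a couple of numeric attributes to filter/sort on.
    # Attributes of timestamp type need a Unix timestamp, hence UNIX_TIMESTAMP().
    sql_query = SELECT id, UNIX_TIMESTAMP(created_at) AS created_at, user_id, body FROM archive_200809
    sql_attr_timestamp = created_at
    sql_attr_uint      = user_id
}

index archive_idx
{
    source = archive_src
    path   = /var/data/sphinx/archive_idx
}
```

With monthly tables, one could define a source and index per month and search across them together, rebuilding only the current month regularly.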
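On the table-per-month idea: a small helper sketch, with a completely made-up schema, showing what generating those monthly MyISAM tables might look like:

```python
from datetime import date

# Hypothetical helper: one MyISAM archive table per month, so old months
# can be compressed (myisampack) or moved around without touching live
# data. The columns here are placeholders, not a real schema.
DDL_TEMPLATE = """CREATE TABLE IF NOT EXISTS archive_{yyyymm} (
  id         BIGINT UNSIGNED NOT NULL,
  created_at DATETIME        NOT NULL,
  body       TEXT,
  PRIMARY KEY (id),
  KEY (created_at)
) ENGINE=MyISAM"""

def monthly_table_ddl(d: date) -> str:
    return DDL_TEMPLATE.format(yyyymm=d.strftime("%Y%m"))

print(monthly_table_ddl(date(2008, 9, 1)).splitlines()[0])
# → CREATE TABLE IF NOT EXISTS archive_200809 (
```

Queries that span months then union over a known, bounded set of table names, and dropping a whole month is a metadata operation instead of a giant DELETE.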
While it involves some data duplication, I believe these pieces could do a good job of handling a wide variety of use cases: real-time simple queries, full-text searching, and more intense searching or statistical processing that can't be pre-indexed.
So what else is there to consider here? Other tools or considerations when dealing with a growing archive of data whose structure may grow and change over time?
I'm mostly keeping discussion of storage hardware out of this, since it's not the piece I really deal with (a big disk is a big disk for most of my purposes), but if you have thoughts on that, feel free to say so. So far I'm only thinking 64-bit Linux boxes with RAID for MySQL and non-RAID for HDFS and Sphinx.
Posted by jzawodn at September 08, 2008 04:24 PM