I'm thinking about ways to store archival data for the long term and wanted to solicit input, advice, warnings, etc. from anyone who's been down this road.


Essentially I'm dealing with a system where "live" and "recently live" data is stored in a set of replicated MySQL servers and queried in real-time. As time goes on, however, older "expired" data is moved to a smaller set of replicated "archive" servers that also happen to be MySQL.

This is problematic for a few reasons, but rather than be all negative, I'll express what I'm trying to do in the form of some goals.


There are a few high-level things I'd like this archive to handle based on current and future needs:

  1. Be able to store data for the foreseeable future. That means hundreds of millions of records and, ultimately, billions.
  2. Fast access to a small set of records. In other words, I like having MySQL and indexes that'll get me what I want in a hurry without having to write a lot of code. The archive needs to be able to handle real-time queries quickly. It does this today and needs to continue to work.
  3. Future-proof file/data format(s). One problem with simply using MySQL is that there will be schema changes over time. A column may be added, dropped, or renamed. Such a change can't easily be applied retroactively to a large data set in a big table. But if you don't apply it, code needs to be prepared to deal with those changes, NULLs appearing, etc.
  4. Fault tolerance. In other words, the data has to live in more than one place.
  5. Support for large scans on the data. This can be to do full-text style searches, looking for patterns that can't easily be indexed, computing statistics, etc.
  6. It's worth noting that data is added to the archive on a constant basis and queried regularly in a variety of ways, but no deletes or updates ever occur. It's a write-heavy system most of the time.

Pieces of a Possible Solution

I'm not sure that a single tool or piece of infrastructure will ever solve all the needs, but I'm thinking there may be several open source solutions that can be put to use.

You'll notice that this involves duplicating data, but storage is fairly cheap. And each piece is especially good at solving one or more of our needs.

  1. MySQL. I still believe there's a need for having a copy of the data either denormalized or in a star schema in a set of replicated MySQL instances using MyISAM. The transactional overhead of InnoDB isn't really needed here. To keep things manageable one might create tables per month or quarter or year. Down the road maybe Drizzle makes sense?
  2. Sphinx. I've been experimenting with Sphinx for indexing large amounts of textual data (with some numeric attributes) and it works remarkably well. This would be very useful instead of building MySQL full-text indexes or doing painful LIKE queries.
  3. Hadoop/HDFS and flat files or a simple record structure. To facilitate fast batch processing of large chunks of data, it'd be nice to have everything stored in HDFS as part of a small Hadoop cluster where one can use Hadoop Streaming to run jobs over the entire data set. But what's a good future-proof file format that's also efficient? We could use something like XML (duh), JSON, or even Protocol Buffers. And it may make sense to compress the data with gzip too. Maybe put a month's worth of data per file and compress? Even Pig could be handy down the road.
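To make the Hadoop Streaming idea concrete, here's a minimal sketch of a mapper assuming records are stored one JSON object per line (the "created" field name is an assumption, not something from the archive's actual schema):

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: counts archived records per month.
# Assumes one JSON object per line with a "created" field like "2008-09-08".
import json
import sys

def map_line(line):
    """Emit "YYYY-MM\t1" for a single JSON record, or None if unparsable."""
    try:
        record = json.loads(line)
        month = record["created"][:7]  # "2008-09-08" -> "2008-09"
        return "%s\t1" % month
    except (ValueError, KeyError):
        return None  # skip malformed records rather than failing the job

if __name__ == "__main__":
    for line in sys.stdin:  # Hadoop Streaming feeds records on stdin
        out = map_line(line)
        if out is not None:
            print(out)
```

Paired with a trivial summing reducer (or `aggregate`), this would give per-month record counts over the whole archive without any schema coupling beyond the one field it reads.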

While it involves some data duplication, I believe these pieces could do a good job of handling a wide variety of use cases: real-time simple queries, full-text searching, and more intense searching or statistical processing that can't be pre-indexed.

So what else is there to consider here? Other tools or considerations when dealing with a growing archive of data whose structure may grow and change over time?

I'm mostly keeping discussion of storage hardware out of this, since it's not the piece I really deal with (a big disk is a big disk for most of my purposes), but if you have thoughts on that, feel free to say so. So far I'm only thinking 64-bit Linux boxes with RAID for MySQL and non-RAID for HDFS and Sphinx.


Posted by jzawodn at September 08, 2008 04:24 PM

Reader Comments
# blckholehorizon said:

Are you looking for something like table partitioning in MySQL 6.0, where you can specify the storage engine { Archive maybe? }

Too bad I don't see 6.0 becoming GA in the near future...

This doesn't give you protection against the future, but we can't have everything.

on September 8, 2008 06:21 PM
# Guillaume Theoret said:

I have a feeling Hadoop is probably the most future-proof solution given the many tools people are building on top of it.

Depending on your use case you should look into some of them.

Facebook only just released Hive, a tool that basically exposes an SQL-ish interface to data in Hadoop. (But I can't get it working yet =/ )

HBase is more or less for getting individual records out with decent latency.

Nutch is a full-text search engine that runs on top of Hadoop.

There's also stuff like Hypertable that can run on top of Hadoop too. (But for now Hypertable only has a C++ API)

on September 8, 2008 06:33 PM
# Harish Mallipeddi said:

You mention above that one of the requirements is that "the archive needs to be able to handle real-time queries quickly".

How would you achieve this with Hadoop? Hadoop is more for offline processing than real-time querying. You run a bunch of map-reduce tasks, wait for a couple of hours, and finally get your results.

Btw we have been using JSON as the format for all our HDFS files. During the map/reduce steps, JSON objects get (de)serialized to Java objects.

If you really want such real-time querying, then you should probably look at HBase (or Facebook's Cassandra) or something like that.

on September 8, 2008 06:51 PM
# mike said:

Why not think about ZFS as a backing filesystem (if you're using a filesystem)? It's meant to deter bit rot, can snapshot easily, etc. Or hell, even something like MogileFS could be repurposed too. Both allow for basically infinite storage addition (ZFS for sure).

on September 8, 2008 07:16 PM
# Jeremy Zawodny said:


Hadoop would not be used for the real-time stuff. MySQL would.

on September 8, 2008 07:31 PM
# Chris Anderson said:

Have you looked at CouchDB lately? It's designed with uses like this in mind, and with incremental Map/Reduce, it can query large data like Hadoop does, but without having to re-compute each query from scratch when new data needs to be taken into account.

CouchDB's replication is designed to allow offline access to portions of a larger database, as well as integrating modifications from multiple peers. The upshot is that you could insert new data into the "live" database, periodically defining a new "live" database, and merging the old one into the archive.

on September 8, 2008 11:01 PM
# Kevin Burton said:

We have about 2 years worth of blog data at Spinn3r...

It's all in MySQL. We keep support for legacy code by simply adding columns to the schema as we make changes.

When a feature is added we simply have a configuration variable for the underlying table to detect when the added column is present.

New code working on old tables just returns NULL for the missing columns.
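The additive-schema approach described here can be sketched roughly like this (column names are hypothetical, not Spinn3r's actual schema):

```python
# Sketch of the additive-schema idea: columns are only ever added, and code
# reading older tables treats missing columns as None/NULL. The column
# names here are made up for illustration.
EXPECTED_COLUMNS = ["id", "url", "title", "language"]  # current schema

def normalize_row(row):
    """Fill in None for any expected column absent from an older row dict."""
    return {col: row.get(col) for col in EXPECTED_COLUMNS}

# A row from a table created before the "language" column existed:
old_row = {"id": 7, "url": "http://example.com/", "title": "Hello"}
print(normalize_row(old_row)["language"])  # added-later column -> None
```

The point is that new code never breaks on old tables; it just sees NULLs for features that didn't exist yet.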

This is about 5T worth of data so certainly a non-trivial amount of content.

Offline storage and backups are another issue. We're using S3 for now but I'm thinking of just using plain olde HDDs and keeping them at the office.

Much cheaper........

on September 8, 2008 11:34 PM
# Abhinav Saxena said:

Regarding the flat file approach - you can partition the data into multiple files by hashing some column - say date/userid etc. This approach works well when your queries will always be like "select columns where userid='XXXX'", but I don't think that's the case here :-(

on September 9, 2008 01:08 AM
# Jan Lehnardt said:

In addition to Chris' recommendation of evaluating CouchDB: It handles data with a non-fixed schema quite gracefully. Feel free to hit me (or Chris) with follow-up questions. Or hop onto #couchdb on Freenode.

on September 9, 2008 02:58 AM
# Casey said:

Here's an alternative solution: store the data as text files (dated?) on a distributed, parallel, fault-tolerant file system like GlusterFS.

1) Text files are future-proof.
2) You should be able to put a fulltext search on top of it if you need one.
3) Backups should be pretty simple to implement since, if I understand these correctly, you just back up your filesystem.

Here's a link with more options.


I haven't used any of these, so this is more theory than fact. It does sound like an interesting solution, though.

on September 9, 2008 07:29 AM
# Ryan Grimm said:

Have you looked into using an XML database like MarkLogic at all? It is a highly scalable database with built-in clustering support, full-text indexing, content analytics, a content classifier, and a ton of other fun tricks. Seems like a perfect fit except for it not being open source. But eXist is an open source alternative that is being actively developed.

Disclaimer: I work on markmail.org which uses MarkLogic.

on September 9, 2008 09:10 AM
# Dmitriy said:

Hi Jeremy,

There are a lot of options for solving this type of challenge, all with their own set of advantages and disadvantages. Your choice will mostly depend on how much money your company is willing to spend on this project, what kind of resources you can devote to developing it, and what kind of resources you can commit to maintaining it. As a former Yahoo and a MySQL guru, I am sure you are familiar with quite a few of these, but I’ll do my best to outline the pros and cons of a variety of options.

None of these technologies give you resilience to schema changes “for free.” At the very least, you should write all code to know that it might get nulls (or some other default value) for new columns.

Option 1: Go big. Contract with a data warehouse specialist company like Teradata, Netezza, Greenplum, or Asterdata. Buy the SAN, or the cluster, or both, hire a few DBAs, and store everything. A bit of partitioning, a bit of MPP, a flexible schema design, and you can store hundreds of terabytes of data. Keep your archive as well as your current data in there, they can handle it.

Pros: Terabytes of data at the tips of your fingers. Non-engineers can query the data with minimal training. Existing SQL can be ported. Great performance assuming your team knows their stuff. Great scalability.

Cons: Big money. Big commitment. Need to find a team with experience in large data warehouses. Totally not worth it if you are in the low terabytes. When you hit 50 TB, you might want to start thinking about it.

Option 2: Go Hadoop. You still need a computing/storage cluster, but you don't have to buy it from an expensive vendor. You need a team of engineers to figure out how to get the data into Hadoop well. Fortunately for this case, updates are not an issue – this is a data archiving solution – but the lack of support for appends in HDFS may be troublesome. Pig is great, Hadoop is great, but you need a dedicated team of programmers to port existing code and write new code. You also incur the overhead of supporting the cluster – swapping out disks, etc. The vendors in option 1 can do that for you (for a price), but here you need to find the techs yourself.

Pros: very flexible access to data. Low up-front cost. Great scalability.

Cons: Significant development effort. Need to find a team with experience in distributed computing. Everything (other than the M/R framework) has to be built from the ground up. Existing reports are not portable.

Option 3: Replicate to backup MySQL boxes. If I were you, I'd probably ask me (meaning, you). So I won't presume.

Option 3a. Partition your MySQL tables on regular date intervals. I am not sure what kind of support MySQL has for partitioning; you might have to do this manually. Create a metadata table, or a set of metadata tables, that document schema changes, and be religious about updating it as you change your schema. Export a month’s (or week’s or day’s…) worth of data, compress it as much as possible, and store it on Amazon’s S3. When you need to query it, spin up an EC2 instance using a pre-built image with MySQL, etc, load up the stored snapshot, and go nuts. Write something that generates the right queries for you given a query you really want, and the schema information stored in the metadata (so it can decide which columns it doesn’t have and fill them in with nulls, return meaningful errors if you are filtering on non-existing columns, etc).
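The query-rewriting piece of option 3a could look something like this minimal sketch (the table and column names are assumptions for illustration):

```python
# Sketch of metadata-aware query generation: given the columns that existed
# when a snapshot was exported, select NULL for columns added since, so
# every snapshot returns result sets with the same shape. Names are made up.
def build_select(wanted_columns, snapshot_columns, table):
    """Build a SELECT over an old snapshot table, substituting NULL for
    columns that didn't exist when the snapshot was taken."""
    parts = []
    for col in wanted_columns:
        if col in snapshot_columns:
            parts.append(col)
        else:
            parts.append("NULL AS %s" % col)  # column added after this snapshot
    return "SELECT %s FROM %s" % (", ".join(parts), table)

sql = build_select(["id", "email", "country"], {"id", "email"}, "users_2008_01")
print(sql)  # SELECT id, email, NULL AS country FROM users_2008_01
```

The metadata tables would supply `snapshot_columns` per export; the same lookup could also reject filters on columns a given snapshot never had, with a meaningful error.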

Pros: Cheap. Minimal development and admin effort. Existing reports/code is mostly portable.

Cons: A bit of administrative pain. The metadata thing is cool but takes work. Spinning up an instance takes some time. Need to write some code to make sure the stuff you send to Amazon gets to Amazon. Oh, and also you may not be willing to send your stuff to Amazon..


on September 9, 2008 12:48 PM
# Peter Manis said:

Solr might be an option for indexing certain information for fast location within the archive. I have used it for full text search of doc/xls/pdf/etc and it worked extremely well and was very fast.

Erik Hatcher used it for the UVA Library


CNET and Archive.org use it for search as well


on September 9, 2008 04:56 PM
# hakmem said:

We have looked at this - in a non-profit context for archiving TBs of manuscript scan data. In another context I have looked at this for archiving financial markets data.

I think you want archival, but you also want an online store. Those are mutually exclusive. If you want archival in the strictest sense of the word, then you have to push the data to some sort of magnetic storage and ship it off to an underground salt mine somewhere. Period.

If you want distributed long term storage - I would recommend glusterfs. Gluster is cheap and easy to maintain. It gives you everything Lustre would without the high-priest overhead.

I would recommend that you store the data as MySQL dump files. Then import the whole thing as you need it. Since Gluster can give you huge I/O throughput, it will load fast - i.e., it will be CPU-bound. With Gluster, over a reasonably fast link, you can even have one of the storage bricks offsite...

Ideally I would do ZFS as the underlying filesystem to gluster - but it could be anything - ext3, xfs, ....

Please let us know what path you decide..

on September 9, 2008 11:35 PM
# Branedy said:

Long term has a relative meaning, when I was working at a telecom, the archiving issue was one of media upon which the data was stored. We ended up re-buying some 9 track reel to reel devices so as to recover some old data for an audit. File systems aside, what media will you choose, DLT even now is becoming obsolete, disk drives are changing even more frequently. Not a show stopper, but like I said, long term could be defined quite differently.

on September 13, 2008 04:57 AM
# martin said:

I'm very interested in MySQL partitioning, I've 0 experience with it but it looks really promising to me (and looks like the perfect solution for this job).
Any hints?

on September 15, 2008 07:53 AM
# jeff said:

hey jeremy,

your scenario sounds mildly similar to what we were facing with the inbox storage and search problem at facebook. cassandra is holding up quite well with around 40 T of data, the team is growing, and it was being evaluated for several more projects when last i was employed by facebook (about a month ago). the guys working on cassandra are exceptionally talented and committed to supporting external users, so we'd love to have you give it a spin. drop me a line if you have any problems getting started.


on September 18, 2008 12:25 AM
# Michael said:


Look at the Infobright data warehouse storage engine for MySQL. It looks like the most natural way to solve your problems. Cheap, easy, and open source. Built exactly for this kind of task.


on September 27, 2008 02:22 AM
# Robert said:

Very interesting post. Great input


on October 30, 2008 10:00 AM