Distributed Parallel Fault Tolerant File System Wanted (by Jeremy Zawodny)

After re-thinking and re-tooling some of the work I've been doing to take advantage of Gearman, I've started to wish for a big file system in the sky. I guess it's no surprise that Google uses GFS with their Map/Reduce jobs and that Hadoop has HDFS as a major piece of its infrastructure.

The Wikipedia page List of file systems has a section on Distributed parallel fault tolerant file systems that appears to be a good list of what's out there. The problem, of course, is that it's little more than a list.

Do you have any experience with one or more of those? Recommendations?

I should say that I'm only interested in something that's Open Source and have a minor bias against big Java things as well as stuff that appear as though it would cease to exist if a single company went out of business.

I'm not too worried about POSIX compliance. The main use would be for writing large files that other machines or processes would then read all or part of. I don't need updates. The ability to append would probably be nice, but that's easy to work around.

More specifically, these three have my eye at the moment:

CloudStore (was KFS) by Kosmix, a C++ clone of GFS
MogileFS from Danga, what can I say--I'm a Perl guy
HDFS the Hadoop file system

It's interesting that some solutions deal with blocks (often large) while others deal with files. I'm not sure I have a preference for either at the moment.

But I'm open to hearing about everything, so speak up! :-)

Posted by jzawodn at June 18, 2009 11:06 AM | edit

Reader Comments

# Sam Pullara said:

I think what you will find is a bit annoying. If you want a file system that works well for large files choose one that is based on GFS because those are block based. If you want one that works well for small files, choose one that is based on files rather than blocks. Right now HDFS (and probably cloudfront) is a poor choice for large numbers of small files (where small << block size). Also note that you might want the latest and greatest version (HDFS 0.19) if you need append.

on June 18, 2009 11:57 AM

# Chris Bolt said:

There's also GlusterFS, http://www.gluster.org/ which looks pretty interesting, and is just a layer on top of existing filesystems kind of like MogileFS.

on June 18, 2009 12:44 PM

# Matt Kangas said:

Regarding your use-case, how long do you want those "large files" to persist?

As a Mogile user, the one thing that still kinda scares me is the lack of file checksums. You'll be protected from one box (or disk) falling over but won't know when bitrot sets in.

That said, we've exceeded 2^32 items in our MogileFS cluster (2 db nodes, 8 storage nodes @ 3.8TB apiece) and it's still working great. Dirt cheap compared to any other solution I know of!

on June 18, 2009 01:02 PM

# Parand Tony Darugar said:

Hadoop/HDFS seems to have the most momentum right now. It's in heavy use at Yahoo as you probably know, as well as Facebook and a bunch of other places. It also has the advantage of having a healthy ecosystem - lots of projects build on the hdfs/hadoop base - for example, Hive, Pig, HBase. It also has great EC2/S3/EBS support, and there's now a shiny new company providing updates and support for it (Cloudera).

On the other hand HDFS is primarily intended for large files, so it may not work so well for you if you're looking to serve many small files from it.

on June 18, 2009 01:18 PM

# Erik Ljungstrom said:

I'm currently looking into GlusterFS, and I like what I've seen so far. It appends and it has really good performance from my initial tests. For many smaller files, I had to make a smaller change to the code to get it to perform the way I expected it to, I've written about that on my blog if you're interested.

I haven't tested it enough to dare say it's great and promote it, but in my opinion it's definitely worth putting on your shortlist! A bit of a minus for a rather apathetic community though.

on June 18, 2009 01:54 PM

# Ask Bjørn Hansen said:

I'm surprised you don't have something like this setup already.

We use MogileFS for "contributed files" (photos, videos, pdf files etc; only ~10 million files or so). Because we already have that working we also use it to archive log files; although we don't do much with them yet I'm planning to use it for basic map/reduce type processing. For content where we care we store an md5 of the file with the metadata (in a separate db) so we can check for bitrot.

I also looked at GlusterFS, but at the time it was missing some basic HA/reliability/redundancy/failover features.

People often come to the MFS mailing list having trouble with a basic install. It's not as simple as installing a few RPMs, but it's really pretty straight-forward as long as you're willing to put a little time into understanding how it works.

on June 21, 2009 01:02 AM

# arbitrage said:

glusterfs. glusterfs. we have seriously looked at the various solutions in the space and nothing can hold a candle to Glusterfs.

you seem to ask the same question every 9-12 months jeremy: http://jeremy.zawodny.com/blog/archives/010556.html

on June 21, 2009 03:19 AM

# Jeremy Zawodny said:

That's a related question from months back, but definitely not the same. :-)

When I get this all figured out, I'll write something up on it. Really!

on June 21, 2009 09:51 AM

# Case Larsen said:

Check out CloudStore (formerly kosmofs) http://kosmosfs.sourceforge.net/

as a plus, the primary developer is down the street from you in San Francisco :)

on June 23, 2009 04:35 AM

# mark terill said:

Here is another possible solution that looks quite interesting:

http://ceph.newdream.net/

Object rather than block based storage.

on June 30, 2009 08:29 PM

# CJ said:

I know you said open source, but we're using a product from a startup called Scale Computing that uses GPFS. We're really happy with it, and it turns out you can license GPFS and build a module that you can dynamically load into the kernel. It "just works" and since it's a kernel module, it's doesn't run in the user space (like a real filesystem). :)

on July 1, 2009 10:05 AM

# irider said:

Amazing ... I was looking something like this :) Thx

on July 9, 2009 03:59 AM

# gpapilion said:

Coming from a Hadoop shop I can tell you HDFS is great and also a bit of a pain.

It lacks kernel support so you cannot navigate it like a typical file system. If you're using the API and accessing problematically its not really an issue, but if you are trying to use it from the shell it can be a bit of a pain.

There is no append support.

Failover on the namednode is a little problematic, as in completely manual.

on July 13, 2009 11:02 AM

# said:

You may check this out:
http://www.ioremap.net/projects/pohmelfs

on July 13, 2009 06:04 PM

# said:

Facebook released an open source project that you might be interested in. I haven't used it myself, but check out Cassandra:

http://incubator.apache.org/cassandra/

on July 21, 2009 04:22 PM

# CzOlO said:

I don't know if you are still looking for the distributed file system?

I think you could give http://www.moosefs.org a try. It is open source and can keep files of sizes up to 2TB (and they plan to remove this limit). The system has nice web monitor with lists of available servers and hdds. There are implementations of the system of more than 500TB data distributed over 70 servers with more than 25 milion files. You can also specify how many copies you want to have of any file so that in case of failure of one server the whole data is still available. And adding new servers is just as easy as plugging them in the net.

on March 22, 2010 05:40 AM

Disclaimer: The opinions expressed here are mine and mine alone. My current, past, or previous employers are not responsible for what I write here, the comments left by others, or the photos I may share. If you have questions, please contact me. Also, I am not a journalist or reporter. Don't "pitch" me.

Privacy: I do not share or publish the email addresses or IP addresses of anyone posting a comment here without consent. However, I do reserve the right to remove comments that are spammy, off-topic, or otherwise unsuitable based on my comment policy. In a few cases, I may leave spammy comments but remove any URLs they contain.