Over on the Yahoo! Hadoop blog, you can read about how the webmap team in Yahoo! Search is using the Apache Hadoop distributed computing framework. They're using over 10,000 CPU cores to build the map and processing a ton of data to do so. They end up using over 5 petabytes of raw disk storage, eventually outputting over 300 terabytes of compressed data that's used to power every single search.
As part of that post, I got to interview Sameer and Arnab to learn more about the history of the webmap and why they moved from our proprietary infrastructure to using Hadoop.
One of the points I try to make during the interview is that this is a huge milestone for Hadoop. Yahoo! is using Hadoop in a very large-scale (and growing) production deployment. It's not just an experiment or research project. There's real money on the line. (It's too bad we had a technical glitch in the video right as we were discussing a Really Big Number.)
As Eric says in that post:
The Webmap launch demonstrates the power of Hadoop to solve truly Internet-sized problems and to function reliably in a large scale production setting. We can now say that the results generated by the billions of Web search queries run at Yahoo! every month depend to a large degree on data produced by Hadoop clusters.
It looks to me like 2008 and 2009 are going to be big growth years for the Hadoop project--and not just at Yahoo!
Update: You can get a Quicktime version of this video now.
Posted by jzawodn at February 19, 2008 07:18 AM