Over on the Yahoo! Hadoop blog, you can read about how the webmap team in Yahoo! Search is using the Apache Hadoop distributed computing framework. They're using over 10,000 CPU cores to build the map and processing a ton of data to do so. They end up using over 5 petabytes of raw disk storage, eventually outputting over 300 terabytes of compressed data that's used to power every single search.
As part of that post, I got to interview Sameer and Arnab to learn more about the history of the webmap and why they moved from our proprietary infrastructure to using Hadoop.
One of the points I try to make during the interview is that this is a huge milestone for Hadoop. Yahoo! is using Hadoop in a very large-scale (and growing) production deployment. It's not just an experiment or research project. There's real money on the line. (It's too bad we had a technical glitch in the video right as we were discussing a Really Big Number.)
As Eric says in that post:
The Webmap launch demonstrates the power of Hadoop to solve truly Internet-sized problems and to function reliably in a large scale production setting. We can now say that the results generated by the billions of Web search queries run at Yahoo! every month depend to a large degree on data produced by Hadoop clusters.
It looks to me like 2008 and 2009 are going to be big growth years for the Hadoop project--and not just at Yahoo!
Stay tuned...
Update: You can get a QuickTime version of this video now.
Posted by jzawodn at February 19, 2008 07:18 AM
The video is great. It's fascinating to hear about the evolution of Inktomi's webmap system: 20 nodes in Perl, 1000 nodes on "Dreadnaught", and 2,000+ nodes on Hadoop.
Is there an independent link to a full-screen version of the video? I would like to share this with friends.
Thanks!
The downloadable QuickTime video will be up shortly...
I second Jeff. It would be very useful to have a downloadable version of the video (or use a more sensible video streaming player that can buffer the stream - not all of the world is pampered with fat pipes, you see... :) )
Are there any plans to use Hadoop for other Yahoo properties like Mail, etc.?