Over on the Yahoo! Hadoop blog, you can read about how the webmap team in Yahoo! Search is using the Apache Hadoop distributed computing framework. They're using over 10,000 CPU cores to build the map and processing a ton of data to do so. They end up using over 5 petabytes of raw disk storage, eventually outputting over 300 terabytes of compressed data that's used to power every single search.
As part of that post, I got to interview Sameer and Arnab to learn more about the history of the webmap and why they moved from our proprietary infrastructure to using Hadoop.
One of the points I try to make during the interview is that this is a huge milestone for Hadoop. Yahoo! is using Hadoop in a very large-scale (and growing) production deployment. It's not just an experiment or research project. There's real money on the line. (It's too bad we had a technical glitch in the video right as we were discussing a Really Big Number.)
As Eric says in that post:
The Webmap launch demonstrates the power of Hadoop to solve truly Internet-sized problems and to function reliably in a large scale production setting. We can now say that the results generated by the billions of Web search queries run at Yahoo! every month depend to a large degree on data produced by Hadoop clusters.
It looks to me like 2008 and 2009 are going to be big growth years for the Hadoop project--and not just at Yahoo!
Update: You can get a Quicktime version of this video now.
Posted by jzawodn at February 19, 2008 07:18 AM