I've been trying to stay out of the size debate for the last few days while I digest what others have been saying. Now that I've done that, I get to react to a few of the things I've been reading.
A Rhetorical Question
First of all, many people are suddenly crying "size doesn't matter!" and that doesn't smell right. If size really doesn't matter, then why didn't anyone jump on Google for having that counter ("Searching 8,168,684,336 web pages") on their home page for so long? They have one of the most sparse home pages around but seem to believe it's important enough to waste a few bytes with that number.
We all know that number is bullshit anyway, right? When's the last time it changed? Oh, right. When MSN Search declared a larger number. Coincidence, I'm sure.
It seems odd to me that size became irrelevant right about the time that Yahoo! came out with a much larger number. It's almost as if some Google fans are in denial. There's got to be some reason that our number has evoked such emotional responses.
But, hey... that's just me.
All that aside, how can you argue that size doesn't matter? If Google indexed only 100,000 documents, would it be nearly as useful as it is today? Of course not. Without indexing a reasonable amount of the Web, they'd be missing important stuff.
Relevancy
Danny Sullivan says "Screw Size!" and he's right. Having a big f'ing index doesn't help if you can't figure out how to return relevant documents. He'd rather we compare relevancy.
I couldn't agree more. Relevancy is what matters and it's a simple test you can do yourself. Try your search on both sites and see which one provides the better results.
Or you can use RustySearch and rate results while you're at it. The RustyBrick Search Engine Relevancy Challenge aims to quantify which service produces better results. If you look at the results, you'll see that Yahoo is quite competitive. Last I looked, we were ahead of Google by a small margin.
I know, I know. It's not a perfect measure. There are flaws in the system. The audience is wrong. The sample not large enough. Etc.
There are a lot of holes you can poke there. But that doesn't mean it's not useful.
Speaking of holes...
The NCSA "Study"
I knew it was going to be one of those days when the NCSA results got linked on Slashdot. As I expected, the Slashdot herd jumped for joy at the chance to prove Yahoo wrong and hold Google up as the reigning champ of web search and all things non-evil.
But I didn't see anyone look very closely at the methodology or results, which are all public (as is the source code). As Seth noted, "The methodology is severely flawed, with a sampling-error bias." In fact, there are so many poor assumptions behind it that I had to laugh when I read about it. It's really more of a clever hack than a scientific comparison. I see little evidence that anyone looked at the actual results.
Using randomly chosen words doesn't reflect the real world at all. But even if you suspend logic for a while and look at some of the cases when Google "beat" Yahoo, it gets more interesting. The "extra" results on Google are dominated by pages that are simple large word lists.
Seth listed one that illustrates the problem quite well. Search for "alkaloid's observance" on Google and on Yahoo. Guess what. On Yahoo you find no results but Google shows several. Dig a bit deeper and you see that the pages Google found are garbage. This page (the #1 result) no longer contains the target phrase. So you check the cached copy and notice that it's just a bunch of gibberish words. (Hmm. A freshness problem and a quality problem?)
You know, we index those too. But we filter 'em out because they're pretty useless. I'm not sure why Google thinks those are good pages to include, but hey--it boosts the numbers! Our algorithms manage to suppress such pages and I doubt anyone misses them.
Believe it or not, coming up with a really good relevancy comparison is quite hard. And it's even harder to get right when you take the humans out of the loop.
The Bottom Line
So what's the point?
Index size matters, but it's not all that matters. A big index is a necessary but not sufficient condition for getting search right. Good algorithms for finding relevant documents do the heavy lifting required to find the right matches for each query.
We've got some of those too... :-)
Kaigene said it best back when we hit the 1 billion mark in images: "Yes, size does matter. But only if you know how to use it. ;-)"
May the best engine win!
Update: Someone just pointed out that Yahoo! Search now returns a few results for that query--both from Seth's blog. I guess this is more of a Heisenberg problem than I first thought!
Update #2: If you read French, this post might be interesting too. You can also translate it with Babelfish. It's a pretty good analysis of the problems with the NCSA test.
Update #3: Also, Gary Price listed several things to consider when trying to measure index sizes.
Posted by jzawodn at August 17, 2005 07:40 AM
Dear Jeremy
For days now I have been praying that you were not on vacation. You know why? I had been waiting to hear your comment on this. There are a lot of Google fanatics around. No doubt it's a great company, but when other companies do things better, they should admit it. Considering just the number of blogs, they could have a much larger database. But it is nobody else's problem that they are not updating the number. The argument is about the size Yahoo announced. Google should have a much higher number, but whose problem is it that they don't update it?
Good one
Jeremy, I don't think any rational person would not agree that having the most pages to choose from, when creating relevant search results, is better. I think what you are missing in your response is that the "rational people" who are saying, "who cares" are responding to Yahoo! making the claim about size in the first place. Who cares if Google has pasted a number on their home page forever. He hit me so I hit him back? Hopefully, "because Google is doing it" is not the mindset driving all of the development at Yahoo!. How about making a real claim like, "Yahoo! Search returns more relevant results (8 out of 10 searchers agree)?" Getting caught up in a silly argument about whose index is bigger, is well...silly.
BTW, I am a big Y! Search supporter.
Judging from a solitary narcissistic sample, Yahoo! is clearly more useful than Google.
So much for narcissism. I didn't know the form ate anchor tags. Here we go again ...
Yahoo: http://search.yahoo.com/search?p=link:http://plan8.blogspot.com/
Google: http://www.google.com/search?q=link:http://plan8.blogspot.com/
I did a very brief analysis of the results. They really are junk. I found that the most common difference between Google and Yahoo! was 3 results. So I dumped every two-word query that had three Google results and zero Yahoo! results. It amounted to 755 queries (out of their 10,000 sample, this is fairly significant). The funny part: the Google results for almost all of them are links to the same three dictionaries that Yahoo! filtered out. I didn't bother looking at the 4 to 0 queries. It's more of the same.
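For anyone who wants to repeat that kind of tally, here is a minimal sketch in Python. It assumes the published results have been loaded as (query, Google count, Yahoo count) tuples. That layout, the function names, and the sample rows are all made up for illustration and are not the study's actual format.

```python
from collections import Counter


def gap_histogram(rows):
    """Count how often each (google_hits, yahoo_hits) pair occurs."""
    return Counter((g, y) for _query, g, y in rows)


def select_queries(rows, google_hits, yahoo_hits):
    """Return the queries with exactly the given hit counts on each engine."""
    return [q for q, g, y in rows if g == google_hits and y == yahoo_hits]


# Made-up rows in the assumed (query, google_hits, yahoo_hits) layout.
sample_rows = [
    ("query one", 3, 0),
    ("query two", 3, 0),
    ("query three", 4, 0),
    ("query four", 12, 9),
]

histogram = gap_histogram(sample_rows)
three_to_zero = select_queries(sample_rows, 3, 0)

# The most common (google, yahoo) pair and the queries behind it.
print(histogram.most_common(1))
print(three_to_zero)
```

Once the 3-to-0 queries are isolated, the remaining step is manual: open the Google results and see whether they are the same few word-list pages.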
I've always thought size didn't matter, only the quality. Still, it's fun to whip out your search indexes and compare them (EWW!). And let's be honest, a post that says "Google's results contain a lot of useless, missing or irrelevant spam pages" is just old news. :-)
I'm actually afraid that, now that it has a bigger index, Yahoo's results will get cluttered up as well. So far, it looks to be working out for you guys, but don't let it slack for one second. Google is going to tackle their problems one of these years.
What I think is interesting is that the search for "alkaloid's observance" now returns 7 relevant results and Google still only returns those stupid spam pages. I for one switched a long time ago in order to bring balance to the force:
http://www.javarants.com/B1823453972/C1460871803/E1947612627/
First, Jeremy, thanks for including a link to my post from last week about why measuring and then comparing web database size totals is so difficult. A few more comments.
Precision vs. Recall
Day One of Library/Info Science Education:
A larger database increases recall (more hits, true) but lowers precision.
Given what we know about web searching behavior:
+ Limited number of search terms entered, often making user intent difficult to understand. Does the searcher even know what they're looking for?
+ Minimal use of advanced search tools and techniques that could be used to create a more precise query
+ Users looking at the first five results and saying it's "good enough."
What does increasing size mean for the typical searcher?
Like I said last week, the deep web is everything beyond results number eight or nine. If a useful and relevant result appears at #22, will the typical searcher see it?
Equally important is just what that total size number consists of. Quality and cleanliness should count. It's not difficult to find more web pages. Yes, quality is difficult to measure, but I think we would all agree that there is a rapidly growing amount of junk/spam out there. Page #7,435,345,004 is in the index, but does the page serve any value to the searcher? Is it worth counting in the first place?
One could argue that a searcher using a database that's smaller (and more easily controlled) than a large web database but targeted to a certain type of info or info need would naturally lower recall, increase precision, and produce a more relevant results set. Most of the large search engines also offer specialized databases (Yahoo News, Froogle, etc.), but from what I heard last week, few people use them compared to the general web databases.
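To make the two terms concrete, here is a small Python sketch using the standard textbook definitions: precision is the share of returned results that are relevant, and recall is the share of relevant documents that were returned. The document names and the two result sets are invented purely for illustration.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Invented example: four documents are relevant to the query.
relevant_docs = {"a", "b", "c", "d"}

# A smaller, cleaner index finds fewer relevant pages but little junk.
small_index_hits = ["a", "b", "x1"]

# A larger index finds one more relevant page, plus more junk.
large_index_hits = ["a", "b", "c", "x1", "x2", "x3"]

for name, hits in (("small", small_index_hits), ("large", large_index_hits)):
    print(name, precision(hits, relevant_docs), recall(hits, relevant_docs))
```

In this toy case the larger result set has higher recall and lower precision. Whether that holds for a real engine depends on how well it filters and ranks what it has indexed.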
One of the first commercial online database services out there was, and still is, Dialog:
http://www.dialog.com
They offer thousands of databases. Yes, it's possible to search hundreds of them simultaneously, but the power searcher knows what's available and picks the best databases to match their info need.
Means that
Hmm. “Best engine” and “win?” I don’t think either of those things exist. :-)
Yes, I believe that size does matter too. When I search I want to make sure that I don't miss any results, and I also don't want to see spam sites as the first results. The second is where Yahoo is better for me. When I search for my own name, Google shows a parked spammer domain as the topmost result, which is shameful.
But honestly one of the most important things that keeps me from using Y! Search is the sponsored results put in the worst place, where the real results should be. Even for me with a high res screen the results are pushed down to the 1/3rd of the screen which I don't like.
I'm using both engines.
I went back to Yahoo Next Today and tried again Yahoo Mindset. I really see something powerful in it, which might actually combine both number and relevance with this technology.
I hope to see this project carried on
Well, there is only one way to compare:
A double blind-test!
We need some bloggers from Google and some bloggers from Yahoo, and a script that just takes the search results and returns them in a very plain form (only url - no snippet). Then we give those people a few dozen words or phrases to search for and they rate the results. It should be easy enough to set up, but do the involved parties have the guts (and time, I guess) to do it?
Sure, it won't be very scientific, I mean how do you get a statistically significant sample for 10 Billion Pages or x million searches per day - but it will be loads of fun. :)
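The blinding part of such a script could look roughly like the Python sketch below. It is only an illustration under stated assumptions: results are assumed to arrive as a dict mapping an engine name to a list of dicts with a "url" key, the function name blind_results is hypothetical, the example.com URLs are placeholders, and fetching real results is left out entirely.

```python
import random


def blind_results(results_by_engine, rng=None):
    """Hide engine names behind neutral labels and keep only the URLs.

    Returns (blinded, key): blinded maps a label such as "A" to a list of
    URLs, and key maps each label back to the engine it came from.
    """
    rng = rng or random.Random()
    engines = list(results_by_engine)
    rng.shuffle(engines)  # random label assignment per query
    labels = [chr(ord("A") + i) for i in range(len(engines))]
    key = dict(zip(labels, engines))
    blinded = {
        label: [result["url"] for result in results_by_engine[engine]]
        for label, engine in key.items()
    }
    return blinded, key


# Placeholder results in the assumed layout; snippets get dropped.
sample = {
    "yahoo": [{"url": "http://example.com/1", "snippet": "..."}],
    "google": [{"url": "http://example.com/2", "snippet": "..."}],
}

blinded, key = blind_results(sample, rng=random.Random(0))
print(blinded)  # shown to the raters
# `key` stays hidden until all ratings are in.
```

Raters only ever see the labels and URLs. The key is revealed after the ratings are collected, which is what makes it a blind test.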
Ah, that's what you get for getting side tracked while writing a comment. Of course what I described is just a blind taste test, not a double blind test.
And of course there are already some people offering stuff like that on the web - given that it's a lot of SEO-related pages I am not sure whether it's appropriate to link to it from here. But searching for "search engine blind taste test" should find you some - if you are using the right search engine (ok, sorry, cheap joke - couldn't resist).
Size matters differently to different people, but for most people the numbers at play are meaningless. Most searches are casual, poorly keyworded, fast hit-and-run affairs that don't dig deeper than the first ten results for one or two queries.
However, the size debate is really about the long tail. Google developed its reputation based on serving each query equally, regardless of its obscurity, and delivering useful relevance to most users. It's not too much to say that long-tail service is the kernel of search's rebirth. Any long-tailed database needs to be large, but I imagine there are diminishing returns; fewer and fewer previously unserved people become satisfied as the tail extends beyond a certain length.
Is Google's index at a size where enlarging it improves the experience for a significant number of people? What number would be significant? Does it matter, or does the ideal search service extend the index infinitely to serve one previously unserved query? How fine are the incremental improvements to relevance, and what are the economics of such improvements?
I don't know the answers. And while I think the questions are pretty good to theorists, developers, power users, and industry observers, they have little pertinence to the search experience of most users.
Thanks for linking to one of my posts--appreciate it.
If it had been Google that announced they had doubled their index, we'd have tons of people - many of whom are now criticizing Yahoo - bowing to the search company and saying they're the best, how much better than Yahoo they are now that they even doubled their index and blah blah, and it'd be "great news".
Think about it.
Thanks for citing my study.
An English translation is now available at:
http://aixtal.blogspot.com/2005/08/yahoo-missing-pages-2.html
--jv
I decided to test some queries after reading Jeremy's article.
Not only is Yahoo's index bigger; IMO, it will take Google quite some time to catch up.
Here are some random searches and the number of results found (Y = Yahoo, G = Google).
[Shoping] Y 1,540,000,000 G 318,000,000
[hosting] Y 631,000,000 G 123,000,000
[network] Y 1,130,000,000 G 589,000,000
[social] Y 673,000,000 G 307,000,000
[Wedding] Y 251,000,000 G 42,700,000
[antidisestablishmentarianism] Y 80,000 G 36,100
["do no evil"] Y 275,000 G 82,400
[yahoo] Y 869,000,000 G 177,000,000
[google] Y 480,000,000 G 228,000,000
[microsoft] Y 562,000,000 G 258,000,000
[Spam] Y 210,000,000 G 94,500,000
[USPTO] Y 4,300,000 G 1,380,000
["larry page"] Y 735,000 G 175,000
["will smith"] Y 11,500,000 G 1,870,000
[NGCSU] Y 345,000 G 42,200
[UCLA] Y 30,100,000 G 21,900,000
[Stanford] Y 61,300,000 G 73,800,000
["google nightmare"] Y 1,330 G 250
["yahoo nightmare"] Y 132,000 G 6,710
["jeremy zawodny"] Y 1,760,000 G 699,000
[eugooglizer] Y 74 G 40
["george w. bush"] Y 50,200,000 G 18,000,000
["video games"] Y 248,000,000 G
[to be or not to be] Y 4,840,000,000 G 1,370,000,000
[movies] Y 957,000,000 G 152,000,000
[linkshare] Y 1,640,000 G 989,000
[eminem] Y 46,900,000 G 4,880,000
[Toyota] Y 101,000,000 G 12,900,000
[Honda] Y 135,000,000 G 14,400,000
[nissan] Y 80,200,000 G 9,340,000
[BMW] Y 120,000,000 G 13,400,000
[sims] Y 64,500,000 G 9,190,000
["average joe"] Y 3,150,000 G 722,000
[NASDAQ] Y 103,000,000 G 21,800,000
["dow jones"] Y 36,200,000 G 7,770,000
[zaxbys] Y 15,700 G 3,940
[Clinton] Y 113,000,000 G 29,400,000
Google had more results only for one term [Stanford] out of the above terms.
Some will obviously continue to argue in favor of Google but it's actually quite clear who is bigger.
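A quick way to summarize a list like this is to compute the Yahoo-to-Google ratio per query. The Python sketch below uses a handful of the figures from the list above. The variable and function names are just for illustration, and it only compares the counts each engine reported, which says nothing about the quality of the pages behind them.

```python
# (query, yahoo_count, google_count), a subset copied from the list above.
counts = [
    ("hosting", 631_000_000, 123_000_000),
    ("network", 1_130_000_000, 589_000_000),
    ("UCLA", 30_100_000, 21_900_000),
    ("Stanford", 61_300_000, 73_800_000),
    ("eminem", 46_900_000, 4_880_000),
]


def ratio_table(rows):
    """Return (query, yahoo/google ratio), skipping rows with no Google count."""
    return [(query, yahoo / google) for query, yahoo, google in rows if google]


ratios = ratio_table(counts)

# Queries where Google reported more results than Yahoo.
google_ahead = [query for query, ratio in ratios if ratio < 1]

for query, ratio in ratios:
    print(f"{query}: {ratio:.2f}x")
print("Google ahead on:", google_ahead)
```

Rows with a missing count, such as the "video games" entry, would need to be left out or filled in before running this.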
Hi,
I don't think that index size matters at all if your search method is useless. It's all good and well having a huge amount of clothes, but you never wear half of them because you forget you even have them or can't get to them.
I didn't see a search method change.
It is interesting to note that NCSA has published a strong disclaimer on the Google/Yahoo study:
http://vburton.ncsa.uiuc.edu/indexsize.html
I've written a follow-up here:
http://aixtal.blogspot.com/2005/08/yahoo-missing-pages-3.html