The current state of "feed search" is messy at best. Joseph Scott does a good job of presenting his impressions of the major feed search engines (where "feed" means RSS/Atom):
Say I wanted to track what people are saying about PostgreSQL. This can’t really be done with the traditional search engines (Google, Yahoo, etc) because they base their results on popularity (in one form or another). This doesn’t help me because I’m interested in what people are saying right now, not who has said the most popular thing. So I started using the feed search sites to see how they stacked up. The results were extremely disappointing.
His frustration is quite clear, and I feel the pain too. It's hard right now. He's certainly right that this "can't really be done with the traditional search engines," but his reasoning is a bit off. So rather than talk about how to fix the problem, I think it's worth looking at a few of the differences between these new "search engines" and the previous generation. In doing so, pieces of the solution may become obvious.
The problem isn't PageRank, WebRank, or whatever you want to call a relevancy function that looks at a bunch of edges and weights on an imaginary graph to determine "popularity". The problem is that today's leading search engines weren't designed to work in a world with millions of feeds.
Let's think about the differences between the web of the mid to late 90s (what Google and the Inktomi-powered Yahoo were built to crawl, index, and search) and the web of 2004, which seems to be overflowing with sites (not just blogs) pumping out RSS and Atom feeds on a regular basis.
Structured vs. Unstructured Data
The old school search engine crawlers are surely complicated conglomerations of code, and a lot of that complexity comes from handling HTML. HTML provides little real structure to documents and is often written quite sloppily by folks who don't know any better. That makes it hard to figure out what's important in a document and what's not. What's the title? What's a reasonable summary? When was it posted? There's no universal way to distinguish navigation from content on a page, for example.
RSS and Atom provide a relatively fixed and predictable set of metadata in XML. This removes the need to handle a lot of the crappy HTML out there, because the feed clearly says "this is the title... this is the author... this is the description..." and so on. Sure, a lot of folks get XML wrong too, but that's not as hard to work around, and the metadata generally comes out intact and meaningful.
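To make the contrast concrete, here's a rough sketch (Python, standard library only; the feed URL is hypothetical) of pulling that metadata out of an RSS 2.0 feed. The element names come straight from the spec, so there's no guessing:

    import xml.etree.ElementTree as ET
    from urllib.request import urlopen

    # Fetch and parse a feed. Unlike scraping HTML, there's no guesswork:
    # RSS 2.0 says exactly where the title, date, and description live.
    with urlopen("http://example.com/index.xml") as response:  # hypothetical URL
        tree = ET.parse(response)

    for item in tree.findall("./channel/item"):
        title = item.findtext("title")
        posted = item.findtext("pubDate")
        description = item.findtext("description")
        print(posted, title)
        print(description)

Try doing that reliably against a few thousand hand-rolled HTML templates.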
On the old web, most content never changes. The only real exceptions seem to be traditional news outlets (CNN, the New York Times, and so on). That means search engines can crawl most sites infrequently and nobody really notices. Missing the last few days' worth of stuff isn't that big a deal as long as you crawl those "news" sites regularly. Also, there's no way to find out which pages have changed without crawling the whole site, and that's quite expensive.
In the new world, feeds update frequently. Blogs start to look a lot like "news" sites to search engine crawlers. But the updates are contained within the feed, so there's no need to crawl every link on the site looking for the new stuff. In other words, the cost of staying current on site changes is much lower when feeds are available.
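Here's a quick sketch of what that cheaper polling might look like (Python again, with a hypothetical feed URL). The poller re-fetches a single small XML file, and HTTP conditional GET means an unchanged feed costs almost nothing:

    import urllib.error
    import urllib.request

    FEED_URL = "http://example.com/index.xml"  # hypothetical feed URL

    def fetch_if_changed(etag=None, last_modified=None):
        """Fetch the feed only if it has changed since the last poll.

        The server answers "304 Not Modified" (with no body) when the
        feed is unchanged, so checking for updates is nearly free.
        """
        request = urllib.request.Request(FEED_URL)
        if etag:
            request.add_header("If-None-Match", etag)
        if last_modified:
            request.add_header("If-Modified-Since", last_modified)
        try:
            with urllib.request.urlopen(request) as response:
                return (response.read(),
                        response.headers.get("ETag"),
                        response.headers.get("Last-Modified"))
        except urllib.error.HTTPError as err:
            if err.code == 304:  # nothing new since last poll
                return None, etag, last_modified
            raise

Compare that to re-crawling every page on a site just to discover that one of them changed.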
Many of these new-fangled content publishing systems (MovableType, WordPress, you name it) have the built-in ability to "ping" services like weblogs.com, Technorati, Feedster, My Yahoo, and so on. They do this to let those services know that something is new. The services typically react by fetching an updated copy of the feed within seconds and extracting the relevant info.
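The ping itself is a tiny XML-RPC call. Here's roughly what it looks like from the publishing side (Python; the weblogUpdates.ping method and the rpc.weblogs.com endpoint follow the published weblogs.com convention, and the blog name and URL are placeholders):

    import xmlrpc.client

    # One small call tells the service "this site just updated."
    # The service then re-fetches the site's feed on its own.
    server = xmlrpc.client.ServerProxy("http://rpc.weblogs.com/RPC2")
    result = server.weblogUpdates.ping("My Weblog", "http://example.com/")
    print(result)  # conventionally a struct like {'flerror': False, 'message': '...'}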
These real-time pings mean that we don't have to wait for a full polling or crawling cycle before getting the latest content. But the old school "web" search engines don't listen for these pings. Instead of seeing this post moments after I click the 'Post' button, they're generally 6-36 hours behind.
But what if they did listen for pings? Or maybe offered a compatible ping API?
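Listening wouldn't be hard, either. Here's a minimal sketch (Python standard library; the queuing is just a print for illustration) of a server that accepts the same weblogUpdates.ping calls and could kick off an immediate re-fetch of the pinged site's feed:

    from xmlrpc.server import SimpleXMLRPCServer

    def ping(weblog_name, weblog_url):
        # A real search engine would queue weblog_url here so its feed
        # gets re-fetched within seconds instead of on the next crawl.
        print("ping from %s: %s" % (weblog_name, weblog_url))
        return {"flerror": False, "message": "Thanks for the ping."}

    server = SimpleXMLRPCServer(("0.0.0.0", 8080))
    server.register_function(ping, "weblogUpdates.ping")
    server.serve_forever()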
The Real Kicker
Of course, this is all very temporary. Once this feed stuff hits the tipping point (I think we're close), things will get really, really interesting. Suddenly these feed sources will be the thing people care about. The model of "search and find" or "browse and read" will turn into "search, find, and subscribe" for a growing segment of Internet users, and it will really change how they deal with information on the web.
What's that gonna be like? Will the "web search" folks be ready? What about the browser folks?
I don't know for sure, but I get the feeling that we're a lot closer than you might think.
Posted by jzawodn at August 27, 2004 12:44 AM