The current state of "feed search" is messy at best. Joseph Scott does a good job of presenting his impressions on the major feed search engines (where "feed" means RSS/Atom):
Say I wanted to track what people are saying about PostgreSQL. This can’t really be done with the traditional search engines (Google, Yahoo, etc) because they base their results on popularity (in one form or another). This doesn’t help me because I’m interested in what people are saying right now, not who has said the most popular thing. So I started using the feed search sites to see how they stacked up. The results were extremely disappointing.
His frustration is quite clear, and I feel the pain too. It's hard right now. And he's certainly right that this "can't really be done with the traditional search engines," but his reasoning is a bit off. So rather than talk about how to fix the problem, I think it's worth looking at a few of the differences between these new "search engines" and the previous generation. In doing so, pieces of the solution may become obvious.
The problem isn't PageRank, WebRank, or whatever you want to call a relevancy function that looks at a bunch of edges and weights on an imaginary graph to determine "popularity". The problem is that today's leading search engines weren't designed to work in a world with millions of feeds.
Let's think about the differences between the web of the mid-to-late 90s (what Google and Inktomi/Yahoo were built to crawl, index, and search) and the web of 2004, which seems to be overflowing with sites (not just blogs) pumping out RSS and Atom feeds on a regular basis.
Structured vs. Unstructured Data
The old school search engine crawlers are surely complicated conglomerations of code. And a lot of it has to do with the fact that they're handling HTML. HTML provides little real structure to documents and is often written quite sloppily by folks who don't know any better. This makes it harder to figure out what's important and what's not in the document. What's the title? What's a reasonable summary? When was it posted? There's no universal way to distinguish navigation from content on a page, for example.
RSS and Atom provide a relatively fixed and predictable set of metadata in XML. This removes the need to handle a lot of the crappy HTML out there because the feed clearly says "this is the title... this is the author... this is the description..." and so on. Sure, a lot of folks get XML wrong too, but that's not as hard to work around, and the metadata generally comes out intact and meaningful.
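Because the fields are labeled, extracting them takes a few lines instead of a pile of scraping heuristics. A minimal sketch in Python (the feed snippet is made up for illustration; a real crawler would fetch the feed over HTTP first):

```python
import xml.etree.ElementTree as ET

# A tiny hand-written RSS 2.0 snippet. A real feed carries the same
# explicitly labeled fields, which is exactly the point.
feed = """<rss version="2.0"><channel>
  <title>Example Blog</title>
  <item>
    <title>Feed search</title>
    <pubDate>Fri, 27 Aug 2004 00:44:00 GMT</pubDate>
    <description>Why feed search differs from web search.</description>
  </item>
</channel></rss>"""

root = ET.fromstring(feed)
# No guessing about navigation vs. content: title, date, and summary
# are spelled out per item.
titles = [item.findtext("title") for item in root.iter("item")]
dates = [item.findtext("pubDate") for item in root.iter("item")]
print(titles, dates)
```

Compare that with trying to pull the same three fields out of arbitrary HTML.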
In the old web, most content rarely changes. The only real exceptions are traditional news outlets (CNN, the New York Times, and so on). That means search engines can crawl most sites infrequently and nobody really notices. Missing the last few days' worth of stuff isn't a big deal as long as you crawl those "news" sites regularly. Also, there's no way to find out which pages have changed without crawling the whole site, and that's quite expensive.
In the new world, feeds update frequently. Blogs start to look a lot like "news" sites to search engine crawlers. But the updates are contained within the feed, so there's no need to crawl every link on the site looking for the new stuff. In other words, the cost of staying current on site changes is much lower when feeds are available.
Many of these new-fangled content publishing systems (MovableType, WordPress, you name it) have the built-in ability to "ping" services like weblogs.com, Technorati, Feedster, My Yahoo, and so on. They do this to let those services know that something is new. The services typically react by fetching an updated copy of the feed within seconds and extracting the relevant info.
These real-time pings mean that we don't have to wait for a full polling or crawling cycle before getting the latest content. But the old school "web" search engines don't listen for these pings. Instead of seeing this post moments after I click the 'post' button, they're generally 6-36 hours behind.
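For the curious, the ping itself is tiny. Here's a sketch of the XML-RPC request body a publishing tool sends, built with Python's standard library (the blog name and URL are placeholders; weblogUpdates.ping is the method name the ping services accept):

```python
import xmlrpc.client

# Build (without sending) the request body a blog tool POSTs to a ping
# service such as weblogs.com: one call to weblogUpdates.ping with the
# site's name and URL as its two string parameters.
payload = xmlrpc.client.dumps(
    ("Example Blog", "http://example.com/blog/"),
    methodname="weblogUpdates.ping",
)
print(payload)
```

A real client POSTs that body to the service's RPC endpoint; the service reacts by fetching the updated feed.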
But what if they did listen for pings? Or maybe offered a compatible ping API?
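Nothing about the protocol stops them. A hypothetical sketch of such a listener (the function body and queue are mine, not any search engine's; the method name matches what blog tools already call):

```python
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical: a search engine registers the same XML-RPC method the
# blog tools already use, and queues each pinged URL for an immediate
# re-fetch instead of waiting for the next crawl cycle.
recrawl_queue = []

def ping(site_name, site_url):
    recrawl_queue.append(site_url)
    # flerror=False is the conventional "ping accepted" reply.
    return {"flerror": False, "message": "Thanks for the ping."}

server = SimpleXMLRPCServer(("localhost", 0), logRequests=False)
server.register_function(ping, "weblogUpdates.ping")
# server.serve_forever()  # commented out so the sketch doesn't block
```

Everything downstream (fetch, parse, index) then runs seconds after the post goes up rather than hours.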
The Real Kicker
Of course, this is all very temporary. Once this feed stuff hits the tipping point (I think we're close), things will get really, really interesting. Suddenly these feed sources will be the thing people care about. The model of "search and find" or "browse and read" will turn into "search, find, and subscribe" for a growing segment of Internet users and it will really change how they deal with information on the web.
What's that gonna be like? Will the "web search" folks be ready? What about the browser folks?
I don't know for sure, but I get the feeling that we're a lot closer than you might think.
Posted by jzawodn at August 27, 2004 12:44 AM
> But what if they did listen for pings? Or maybe offered a compatible ping API?
Then they would be spammed with pings... It's still better to let the search engine decide what is new, and not the webmaster. But I admit it really should work the other way round.
Why can't the popular web search engines build feed search into their current tools? It sounds like a feature, another tab at the top for Google or Yahoo! to take advantage of. Like you said, it's clearly worth investing in. And I don't see why it has to take the form of separate entities.
Couldn't you slip the head of search (since you did help him make his blog, right?) a suggestion?
Another possibility is to use an online aggregation-type service to do the work. You don't necessarily need pings, as the services are polling regularly anyway. Bloglines, for instance, hits my site every hour or so. I'm sure My Yahoo! does the same, or something very very similar. I agree that pinging the site could cause a bit of overload.
I'm not familiar with the Y! search mechanism, but the one at Bloglines works (or is able to work) on a most-recently updated basis. Strangely, "feed search" doesn't include your post as a result (yet, anyway), but the results that are presented are in a most-recent format. I'm sure there's some tweaking to do - but I think it's on the path you describe.
Excellent blog entry. I maintain a blog / online book all about Microsoft. I aggregate a lot of RSS feeds into my Drupal-powered site, but I made the feeds section on my main page show only Microsoft-related information. A little PHP and SQL searches through the table of entries imported via RSS for Microsoft keywords. So while my feed search for Microsoft isn't real-time, it's perfectly satisfactory for my needs. And since it's a custom job, I can search for many words/phrases at once ("Microsoft", "Bill Gates", "Internet Explorer", etc.). This is one specialized case where we're able to keep a customized feed search continually up to date on a web page. It certainly doesn't solve your ultimate goal of quality Google-style feed searches, but it's a very useful solution.
I did an investigation into a blog search tool a few years ago.
The premise was that until the semantic web arrives, the best method we have to understand a user's point of view is to examine the RSS feeds they subscribe to. If Googling my weblog is like searching my backup brain, then searching all the sites in my RSS news aggregator is like searching the brains of people I respect and find interesting.
Something along this line of thinking came up while I was writing up my thoughts on feed searches. My next entry was about an Apache module idea, mod_ping, which would let search engines (or anyone else, for that matter) know when "static" pages have been updated.
We guessed that there are about 15 million (really) structured data feeds now in XML: RDF, RSS, FOAF, Atom, et al. So what should the search engines do with them? Displaying the raw data in abstracts and cache isn't good enough. If they can translate PDF and Word, they should be able to translate this. But that's just display. As you point out, it's also all structured, which means there should be interesting stuff to be done from a search perspective.
Then there's the question of timeliness. You would think that one of the first things Google would do with Blogger would be to pipe all the updates into the search engine. But I don't think they're doing that yet. Then you'd expect them to find a way of opening this up to everyone else whether by providing a ping API or by scraping one of the ping aggregators. As someone said there are major spam implications of this. But it's a very similar problem to comments spam and we all have to deal with this, not just the search engines.
But then the Google API has remained static for ages now. Maybe they can start ploughing all that IPO money back into R&D and build some new APIs while finishing all the bits that are still in beta. Or maybe it would be easier to just buy Technorati, Blogdex, Daypop etc etc.
HTML is messy, no doubt. However, RSS is no cure-all. If we change "HTML" to "RSS" in this quote from your post, it's pretty correct:
"RSS provides little real structure to documents and is often written quite sloppily by folks who don't know any better"
Here are a few different "interpretations" of the spec (ah, which one might that be? 0.9x, 2.0, ATOM??)
(URLs removed to protect the guilty)
Here's one with nothing in <title></title>.
How to index/search on it? Also, it is full of HTML markup
<description><![CDATA[<p><B>The health-care war</b><br><br />
: I am no fan of Paul Krugman's; rarely make it all the way through a column. But today's is a <a href="http://www.nytimes.com/2004/08/27/opinion/27krugman.html?hp">winner</a>, for it rationally sets out the current choice in what I believe is one of the big two issues facing us the election (after the war on terrorism). <blockquote>In other words, rising health care costs aren't just causing a rapid rise in the ranks of the uninsured (confirmed by yesterday's Census Bureau report); they're also, because of their link to employment, a major reason why this economic recovery has generated fewer jobs than any previous economic expansion.</p>
Here's another with no title and full of markup
<a href="http://images.scripting.com/archiveScriptingCom/2004/08/01/sudan.gif"><img src="http://images.scripting.com/archiveScriptingCom/2004/08/01/africa.gif" width="65" height="71" border="0" align="right" hspace="15" vspace="5" alt="A picture named africa.gif"></a>Spoke with Jim Moore this afternoon, he says the <a href="http://platform.blogs.com/passionofthepresent/2004/07/holocaust_museu.html">genocide</a> in <a href="http://images.scripting.com/archiveScriptingCom/2004/08/01/sudan.gif">Sudan</a> has reached the boil-over point. Nigeria and France are preparing to enter, and the US has committed funds to support an intervention. The Holocaust Museum, for the first time, has said this is a genocide emergency. I've been reading the blogs, <a href="http://platform.blogs.com/passionofthepresent/">Passion of the Present</a> and Jim's Berkman <a href="http://blogs.law.harvard.edu/jim/">weblog</a>.
<pubDate>Sun, 01 Aug 2004 19:30:53 GMT</pubDate>
Yah, RSS forces more well-formedness of metadata than HTML, but lack of well-formedness is not the main reason search engines don't pay a lot of attention to HTML metatags, for example. Those tags _could_ be used to accurately describe content, but in practice they're not.
Maybe it's because I work on spam, but when I hear this proposal I just imagine getting a ping per minute from sites with deceptive titles...
Sure, META tags can lie. And there's nothing to stop RSS feeds from lying to the Yahoo and Google crawlers today.
The HTML vs. RSS point is more about simplicity and structure than trust. Just because someone says something is the title doesn't mean it is.
As for the pings, yeah. They're another way people can spam or abuse. But that seems to be the case with every new thing on-line these days. Witness the rampant blog comment spam. It barely existed a year or so ago. But that doesn't mean we're all gonna turn off our blogs and go home, either. :-)
I think you really got the points that make feed search different from normal web searches. From my experience writing www.plazoo.com, I see lots of these points having been questions in development. One thing that you did not mention is the underlying database. Because of the frequent updates, feed search needs a special database that you can read from and write to concurrently without write locks. That's because otherwise updates would lock readers (= searches) and slow down the search engine. Typical web search engines don't have this problem, as it doesn't matter if updates are done more infrequently. There is no need to be up to date to the second.
In my point of view that is the major point when creating a search engine for blogs and feeds. Finding feeds or polling them in a timely manner is not that much of a problem, although you're right it's different from web search engines.
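The property this comment describes, searches that never block on updates, is what MVCC-style storage gives you. A toy illustration (my example, using SQLite's WAL journal mode, not whatever Plazoo actually runs): one connection keeps inserting new feed items while another keeps querying, and the reader always answers from a consistent committed snapshot.

```python
import os
import sqlite3
import tempfile

# Writer connection sets up a tiny "feed items" table in WAL mode, where
# readers are never blocked by an in-flight write transaction.
path = os.path.join(tempfile.mkdtemp(), "feeds.db")
writer = sqlite3.connect(path)
writer.execute("PRAGMA journal_mode=WAL")
writer.execute("CREATE TABLE items (title TEXT)")
writer.execute("INSERT INTO items VALUES ('first post')")
writer.commit()

reader = sqlite3.connect(path)
writer.execute("INSERT INTO items VALUES ('second post')")  # not yet committed
# The search still answers instantly, seeing only the committed snapshot.
count = reader.execute("SELECT COUNT(*) FROM items").fetchone()[0]
writer.commit()
print(count)
```

The reader's count is 1: the uncommitted insert is invisible to it, and it never had to wait for a write lock to clear.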