Sometimes I'm a little surprised by how long some ideas take to bubble up. Other times I'm surprised by the form they take.

I'm doubly surprised this time.

Google Sitemaps (BETA, of course) has me scratching my head a bit. Rather than build on existing work, it seems that Google wants people to build up and submit sitemaps to them so they can increase the freshness and coverage (or comprehensiveness) of their web search index.

Of course, those are two of the four critical variables for Getting Search Right. Around these halls we call them RCFP:

  • Relevancy
  • Freshness
  • Comprehensiveness (or coverage, which is a smaller word)
  • Presentation

So it's clear what the motivations here are. Nicely, they've decided to apply a Creative Commons License to the work. It's good to see more and more CC licenses out there, especially from the Big Players.
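
For the curious, here's roughly what one of these sitemap files boils down to. This is my own sketch in Python, not anything from Google: the URLs, dates, and change frequencies are made up, and the 0.84 namespace is just what I see in their documentation.

    import xml.etree.ElementTree as ET

    # Namespace from the published sitemap schema (0.84 at the time of writing).
    SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"

    def build_sitemap(pages):
        """Build a minimal sitemap from (url, lastmod, changefreq) tuples."""
        urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
        for loc, lastmod, changefreq in pages:
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = loc
            ET.SubElement(url, "lastmod").text = lastmod
            ET.SubElement(url, "changefreq").text = changefreq
        return ET.tostring(urlset, encoding="unicode")

    # Made-up example pages; a real blog would list its permalinks here.
    print(build_sitemap([
        ("http://example.com/blog/", "2005-06-07", "daily"),
        ("http://example.com/blog/archives/", "2005-06-07", "weekly"),
    ]))

As I understand it, you then point Google at that file through their Sitemaps interface, and the changefreq values are only hints, not guarantees.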

Last summer, I wrote something titled Feed Search vs. Web Search in which I talked about some of the differences between the Googles and Yahoos of the world and the Technoratis and Feedsters.

Under the heading of "Real-Time Pings", I wrote:

Many of these new-fangled content publishing systems (MovableType, WordPress, you name it) have the built-in ability to "ping" services like weblogs.com, Technorati, Feedster, My Yahoo, and so on. They do this to let those services know that something is new. The services typically react by fetching an updated copy of the feed within seconds and extracting the relevant info.

These real-time pings mean that we don't have to wait for a full polling or crawling cycle before getting the latest content. But the old school "web" search engines don't listen for these pings. Instead of seeing this post moments after I click the 'post' button, they're generally 6-36 hours behind.

But what if they did listen for pings? Or maybe offered a compatible ping API?

Emphasis is, of course, mine.

I wonder why they're not simply offering to extend the current weblog ping protocol a bit to work toward the goals of freshness and coverage? It seems to me that with an installed base of millions of ping-generating tools, that'd be a no-brainer. I'm surprised that Danny Sullivan didn't ask this either.
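
For anyone who hasn't looked at it, the existing ping protocol is tiny: one XML-RPC call carrying a name and a URL. Here's a rough Python sketch; the Ping-O-Matic endpoint is just an example target, and the blog name and URL are placeholders.

    import xmlrpc.client

    # Any ping server will do; Ping-O-Matic relays to many of them.
    server = xmlrpc.client.ServerProxy("http://rpc.pingomatic.com/")

    # weblogUpdates.ping takes a human-readable name and the URL that changed.
    result = server.weblogUpdates.ping(
        "Example Blog",
        "http://example.com/blog/",
    )
    print(result)  # typically a struct like {'flerror': False, 'message': '...'}

A web search engine could accept exactly that call today, or simply subscribe to what the existing ping servers already aggregate.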

If I had my way, we'd be plugging ping server streams directly into our web crawlers.

See Also: So, You'd Like To Map Your Site by Anil Dash, with more "prior art" that wasn't used.

Posted by jzawodn at June 07, 2005 01:34 PM

Reader Comments
# jim winstead said:

dave winer has been lobbying for this for years.

but i think the reason to not 'just do it' is pretty clear: it will be an instant target for search engine spammers.

that's probably not an intractable problem, but i bet it goes a long way in explaining why it hasn't happened yet.

on June 7, 2005 01:46 PM
# Aaron Brazell said:

Hey Jeremy--

Off topic, but when are you going to talk about Yahoo! Photomail? I have a bunch of questions...

Aaron

on June 7, 2005 02:17 PM
# Jackson said:

I don't think that Google's SiteMap addresses the freshness aspect at all. We still have to wait for it to refresh the sitemap or a page with a link to a new page, though we can give it a suggestion of how frequently to check those pages for updates. I think Google is trying to address something else.

But Yahoo! should lead the way here and listen to pings!!!

on June 7, 2005 02:32 PM
# Matt said:

How would you like to extend the weblog ping protocol to work better? I'm positioned to affect both the ping generators (through WordPress) and the distribution (through Ping-O-Matic), so I'd be curious to hear your thoughts. Search engines already have ways to deal with spammers; I imagine they could just return a different message when a ping is received (don't send me any more pings from this source) and that would solve the problem from the ping distributor's side.
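
Something like this is what I'm picturing on the receiving end (just a sketch, and the port, blocklist, and wording are placeholders):

    from xmlrpc.server import SimpleXMLRPCServer

    # Sources the engine has decided to stop hearing about (placeholder rule).
    BLOCKED = {"http://spammy.example.com/"}

    def ping(blog_name, blog_url):
        if blog_url in BLOCKED:
            # The distributor sees this and stops relaying pings for that source.
            return {"flerror": True,
                    "message": "Do not send further pings for this source."}
        # Otherwise queue blog_url for crawling and acknowledge the ping.
        return {"flerror": False, "message": "Thanks for the ping."}

    server = SimpleXMLRPCServer(("0.0.0.0", 8080))
    server.register_function(ping, "weblogUpdates.ping")
    server.serve_forever()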

on June 7, 2005 02:50 PM
# josh said:

Matt said: "I imagine they could just return a different message when a ping is received (don't send me any more pings from this source) and that would solve the problem from the ping distributor's side."

Except what does a ping distributor do if they have updated content that they want to ping about but they're past their limit for the day? Pinging Technorati isn't a live-or-die thing, but search engine coverage is for a lot of businesses. You're putting Google or Yahoo in a position to say that any website can only update x times a day or it won't be indexed.

Now a site's breadth is a disadvantage.

on June 7, 2005 03:49 PM
# Mud's Tests said:

Yes,

I was confused by this as well. I'm still trying to make sense of the impacts, if any.

I plead, "clueless".

on June 7, 2005 04:15 PM
# Mud's Tests said:

The link to my specific comment/confusion is under my name in this posting.

on June 7, 2005 04:19 PM
# Ask Bjørn Hansen said:

I'm working on a CMS type tool that (with a small army of editors) will generate thousands of pages with (presumably) interesting content.

They'll be linked in various ways from millions of other content pages, but to make sure they get indexed we're going to make a bunch of pages listing all of them. Now for Google Search we can of course just build a sitemap, which is nice...

It's not about freshness, but about coverage. (And crawler optimizations I suppose; being able to say "don't bother with that page more than once a month").


- ask

on June 7, 2005 05:12 PM
# diego said:

One problem that I see with the ping approach is "page replacement" spam. The attacker creates a page meant to be seen only by the indexer, then pings and waits to see how it affects the rankings. Doesn't like how it works, modifies it and tries again. At least now they don't control how often the crawler stops by.

on June 7, 2005 05:19 PM
# Danny Howard said:

Uhmm ...

I'm curious why they invented a new standard, but it appears that as part of the project, they also support stuff like RSS:

http://www.google.com/webmasters/sitemaps/docs/en/faq.html#s8

I think the "ping" approach is kind of goofy but then I have a wife and other stuff that might be described as "a life" so ...

Thanks for the tip, Jeremy.

-danny

on June 7, 2005 07:49 PM
# Chris DiBona said:

I'm not really getting what is so bad about the path Google (my employer, natch) has chosen. Did we break a rule?

on June 7, 2005 10:46 PM
# Tim Converse said:

Sigh --- the same two objections as last time: 1) when pinging is the way to get noticed, the bad people will ping abusively, and 2) your proposal assumes that any delay in being represented in an index is due to the crawler not knowing what has changed. A hint about point #2 --- engines (like technorati) that respond to pings by including docs quickly also tend to be dog-slow at query time...

on June 8, 2005 12:31 AM
# Sencer said:

In the case of the prior art found by Anil: why should they restrict themselves to obscure, old proposals that nobody uses? If you had the chance to build something just the way you like it, would you restrict yourself to something that can only be found via archive.org?

Now the blog pings are a different thing. They are widely used, and I agree it would be nice if search engines started supporting them (when is Yahoo starting?), but it seems pretty obvious that the XML format Google proposed does a lot more than what you can do with pings. And I have already seen people using the ping feature of their weblog software to ping Google when they update their sitemap - so it's not "incompatible". What exactly is the issue here?

on June 8, 2005 12:56 AM
# Martin said:

Isn't this SiteMap technology a bit of a surrender? An admission that Google needs publishers' help to keep track of all the new content?


on June 8, 2005 02:59 AM
# Danny said:

Yup, pings could do a lot for (efficient) freshness, ideally implemented using a simple HTTP POST (at worst a GET) rather than unnecessarily complex XML-RPC.

One other aspect I haven't seen raised is how other services are meant to locate the sitemaps. I'd suggest using autodiscovery (link element in a homepage's head).
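
Roughly this on the consuming side (a sketch only; the rel value is hypothetical, since nothing is standardized yet):

    import urllib.request
    from html.parser import HTMLParser

    class SitemapLinkFinder(HTMLParser):
        """Collect href values from <link rel="sitemap"> tags in a page."""
        def __init__(self):
            super().__init__()
            self.sitemaps = []

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "link" and (a.get("rel") or "").lower() == "sitemap":
                self.sitemaps.append(a.get("href"))

    # Fetch a homepage and look for advertised sitemaps (example.com is a placeholder).
    html = urllib.request.urlopen("http://example.com/").read().decode("utf-8", "replace")
    finder = SitemapLinkFinder()
    finder.feed(html)
    print(finder.sitemaps)  # e.g. ['http://example.com/sitemap.xml']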

on June 8, 2005 03:31 AM
# Joseph Scott said:

After writing about my feed search complaints

http://joseph.randomnetworks.com/archives/2004/08/24/why-hasnt-anyone-figured-out-how-to-do-feed-searches/

I'd mentioned the idea of something like a mod_ping for Apache

http://joseph.randomnetworks.com/archives/2004/08/24/apache-module-idea-mod_ping/

Which would basically allow for any type of content to notify other services (search engines specifically) when it changed. Sure the concept is one that will obviously be abused, which is why more time and thought would have to go into it before anyone actually pursued the idea. In general this idea seems to be working pretty well for blogs/feeds. Getting updates seems to be even better since FeedMesh started:

http://www.intertwingly.net/blog/2004/09/11/FeedMesh

http://bobwyman.pubsub.com/main/2005/04/feedmesh_works_.html

http://feedmesh.cozy.org/index.php/Main_Page

So if search engines were able to get together and come up with something like this for a more general "hey, this site has been updated" type of ping, and then share that information in a way similar to FeedMesh, then I think we'd really have something. I'm not sure that SE companies would be willing to actually talk to each other about something that is so close to the core of what they do.

So Chris, I don't think Google did anything wrong (although some people will complain about the format, which might be a valid complaint), they just didn't go far enough. At this point I think you keep the sitemap idea and find ways to make pinging for sitemap updates work. Maybe even getting down to the point of a ping when just an individual page has been updated. The important thing at this point is to just start talking about this more. There are a lot of potential problems with this, and those need to be addressed before anyone even thinks of putting such a service out for public consumption. I'm tired of companies/software/people who put out products/services/software/technology without even considering how it will be abused for evil, specifically by spammers and their kind.

I don't think it is a trivial problem to solve, so please spend plenty of time looking at new services/software/etc. from the bad guys' perspective. Heck, hire someone whose job it is to try and break your new toys or use them for evil (like spam). Sorry to go on about this, but potential abuse/evil simply must be talked about and addressed before it goes out to the public, not after.

on June 8, 2005 06:23 AM
# Greg Stein said:

re: Joseph Scott

Why are you assuming that Google hasn't already considered all of those factors? That maybe part of the reason the Sitemaps documentation avoids specific discussion of latency, fetching, etc. is that there are extra processes going on to deal with those issues?

Or maybe even more simply put: just because all of your concerns were not specifically addressed in the external documentation, why presume they haven't been addressed?

And in terms of "far enough" -- does every product have to solve every problem in its first release? What if it can solve *some* problems *now*? Should it be withheld simply because it didn't go "far enough"?

[ disclaimer: I work for Google :-) ]

on June 8, 2005 03:23 PM
# Joseph Scott said:

Greg,

True, it certainly seems possible that Google implemented Sitemaps the way they did because, after debates and discussions of the issues, this is what seemed best. It is also certainly possible that there could be more to come. In all fairness I could have pointed that out in my comment, but then again I'm not the one who works at Google :-)

I'm not the only one wondering why some of these issues weren't at least mentioned in the FAQ. There is so much opportunity for abuse in many of the new technologies that come out that I'd like to see more discussion about what is being (and has been) done to limit it as much as possible. It would be nice if at least some of that discussion happened in public.

Kudos for your work on Subversion and WebDAV, thanks!

on June 9, 2005 08:22 AM
# powerpop said:

in all the discussion of sitemaps i don't see how an outside web service can query for a site's map - or retrieve an RSS feed of updated items - it would be great if i could query for updates from outside! is there an API for this that i just missed?

on June 23, 2005 04:44 PM
# Wahyu Wijanarko said:

I think Google just wants to find the best way to get all the links inside your website, because if Google only follows hyperlinks, maybe not all pages can be crawled.

on July 20, 2005 04:58 PM