More RSS Hacking (by Jeremy Zawodny)

I've finally gotten back to the RSS autodiscovery work that I mentioned a few weeks ago.

Since then, I've scrapped all my code and started over. I'm not relying on third party code to parse RSS, HTML, or XML anymore. I just began coding up support for the most common cases and things have taken off. The code can reliably find the RSS feed for nearly every blog on my blogroll.
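
For readers who want to see what handling "the most common cases" can look like, here is a minimal sketch in Python. It is not the code described above (that code isn't shown here), and it leans on the standard library's html.parser rather than hand-rolled parsing. The names FeedLinkFinder and find_feeds are made up for illustration, and it only looks for the application/rss+xml type.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption for this sketch: only RSS autodiscovery links are of interest.
FEED_TYPE = "application/rss+xml"


class FeedLinkFinder(HTMLParser):
    """Collect feed URLs advertised via <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names,
        # so <LINK REL=...> is handled the same as <link rel=...>.
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").strip().lower() == FEED_TYPE
        if "alternate" in rels and is_feed and attr.get("href"):
            # Resolve relative hrefs against the URL of the page itself.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def find_feeds(html, base_url):
    """Return absolute URLs of all RSS feeds advertised in the HTML."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds
```

Calling find_feeds(page_html, page_url) returns a list of absolute feed URLs, or an empty list when the page has no autodiscovery link, which is the case where you'd fall back to guessing at obvious-looking file names.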

Very cool. It's not quite the hell I thought it'd be. And it took far less code than expected. I'm not done by any means, but it's a good start.

There are a few notable exceptions, of course. Blogs that don't support autodiscovery and don't point to any obvious-looking files. And Slashdot. I have no idea how this happened, but they missed the "http:" portion of the URL! Seriously. Their HTML says:

<LINK REL="alternate" TITLE="Slashdot RSS" HREF="//slashdot.org/index.rss" TYPE="application/rss+xml">

Anyway... Other than a few anomalies it's not bad. Tomorrow I'll try much harder to find odd cases for it to cope with. I'd like to see my test suite go from 15 sites to about 50 or 80 representative URLs.

It's fun to code once in a while. :-)

Posted by jzawodn at September 09, 2003 10:19 PM

Reader Comments

# Alden Bates said:

//slashdot.org/ works in IE - guess it assumes http: protocol by default... I wonder how many browsers it'd break though. ;)

on September 10, 2003 02:31 AM

# paul said:

Actually, this is part of relative URIs as defined in RFC 2396.

"A relative reference beginning with two slash characters is termed a network-path reference, as defined by <net_path> in Section 3. Such references are rarely used."

This is so "[...] it is possible for a single set of hypertext documents to be simultaneously accessible and traversable via each of the "file", "http", and "ftp" schemes if the documents refer to each other using relative URI [...]"
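
To make the rule concrete, here is a small sketch using Python's standard urllib.parse.urljoin, which resolves a reference against a base URL. The helper name resolve and the example.org base URLs are made up for illustration. The point is that a network-path reference keeps its own host and path and takes only the scheme from the page it appears on.

```python
from urllib.parse import urljoin


def resolve(base_url, href):
    """Resolve an href found in a page against that page's own URL."""
    return urljoin(base_url, href)


href = "//slashdot.org/index.rss"

# The scheme comes from the base URL; host and path come from the href.
print(resolve("http://slashdot.org/", href))           # http://slashdot.org/index.rss
print(resolve("https://example.org/page.html", href))  # https://slashdot.org/index.rss
print(resolve("ftp://example.org/pub/", href))         # ftp://slashdot.org/index.rss
```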

on September 10, 2003 03:56 AM

# Peter said:

Boy, did you hit the nail on the head!

I don't know *how* many times I've started off with another's library and ended up scrapping it because it was poorly written, or bloated beyond belief (notable exception is curl--curl rocks).

Usually, with good programming practices and a bit of experience, a task is much simpler and more elegant than originally feared.

on September 10, 2003 05:22 AM

# Mike Hillyer said:

Aah yes, I too have cursed Slashdot's RSS autodiscovery.

on September 10, 2003 07:14 AM

# Hein Roehrig said:

Stupid question regarding Slashdot -- are there actually any RSS feeds except for /index.rss ? E.g. for people's ./ journals that would be nice but I failed to find any.

Sorry if this is FAQ or OT.

on September 10, 2003 01:24 PM

# Jeremy C. Wright said:

Hey Jeremy, I assume MT has autodiscovery on by default, right? I can't see it in the settings and would hate to be one of the blogs pissing you off ;)

on September 11, 2003 07:28 AM

# Jeremy Zawodny said:

Yup. It's built into the default templates in the recent versions. Yours is all set. :-)

on September 11, 2003 09:19 AM

# Jeremy C. Wright said:

Fantastic, thanks mate :)

on September 11, 2003 11:16 AM
