Okay, I'm in Portland for OSCON and catching up on a very big backlog of stuff.
Mark Fletcher, in RSS Scaling Issues, brings up an issue that I've been worrying about (but not talking about) for quite some time now.
Centralized services like Bloglines avoid this problem because we only fetch a feed once regardless of how many subscribers we have to it. Desktop aggregators can't do that, of course, and end up generating huge amounts of traffic to sites like InfoWorld. There are various things that a desktop aggregator can do to mitigate the load, like using the HTTP Last-Modified header and supporting gzip compression. But the aggregator still has to query the server, so there will always be a load issue.
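To make that concrete, here's a minimal sketch (not any particular aggregator's code) of a polite poll: conditional GET plus gzip, so an unchanged feed costs one small 304 response instead of a full download. The feed URL is just a placeholder.

```python
# A minimal sketch of a polite feed poll: send the validators back and accept
# gzip, so an unchanged feed costs one small 304 instead of the whole document.
import gzip
import urllib.error
import urllib.request

def fetch_feed(url, last_modified=None, etag=None):
    req = urllib.request.Request(url)
    req.add_header("Accept-Encoding", "gzip")
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    if etag:
        req.add_header("If-None-Match", etag)
    try:
        resp = urllib.request.urlopen(req)
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return None, last_modified, etag   # nothing new, nothing transferred
        raise
    body = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)
    # Store these and send them back on the next poll.
    return body, resp.headers.get("Last-Modified"), resp.headers.get("ETag")

# body, lm, et = fetch_feed("http://example.org/index.xml")
```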
Because Bloglines has a vested interest in increasing RSS (in the generic sense) adoption, we're looking at ways we can help. We are working on a couple of projects right now, and we're of course open to suggestions.
I've had very brief discussions about this with a few folks at work. Since we're clearly interested in seeing RSS grow, I wonder if we can't help somehow. And I wonder if we shouldn't be talking with Mark and others (who?) about this.
Thoughts and ideas out there?
Update: RSS Scaling BoF Session at OSCON, Thursday at 8pm.
Posted by jzawodn at July 26, 2004 07:39 PM
Squid?
Set up a network of squid proxies with geographic DNS entries. Squid passes data between caches extremely well.
Downsides: a fair amount of organization required, and no clear economic upside for those who'd need to take part.
Create a Bloglines-like server with a web services interface for clients. RSS reader clients can then be modified to poll this RSS "router" and only download new or updated items, as opposed to directly downloading the whole feed every time something changes. Think Pub/Sub for RSS. Eventually content producers may even publish their material directly rather than having this service scrape it hourly.
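Roughly, the client side of that idea might look like the sketch below; the /changed endpoint, its parameters, and the JSON it returns are all invented for illustration. The point is that one request to the router replaces a poll of every origin server.

```python
# Hypothetical client for an RSS "router": ask one central service which of
# our subscriptions have new items since the last check, instead of polling
# every feed directly. The endpoint and response format are made up.
import json
import urllib.parse
import urllib.request

ROUTER = "http://rss-router.example.com/changed"   # hypothetical service
SUBSCRIPTIONS = [
    "http://example.org/index.xml",
    "http://example.net/feed.rss",
]

def poll_router(since):
    query = urllib.parse.urlencode(
        {"since": int(since), "feeds": ",".join(SUBSCRIPTIONS)})
    with urllib.request.urlopen(ROUTER + "?" + query) as resp:
        # e.g. {"http://example.org/index.xml": [ ...new items... ], ...}
        return json.load(resp)

# for feed_url, new_items in poll_router(last_check).items():
#     hand just the new items to the reader's UI
```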
What about having desktop aggregators use push technology? Pubsub.com is already doing this using Jabber/XMPP. I would love to see Bloglines integrate Jabber notifications into their service so I don't have to pull/poll that website either. PS, I'm at OSCON too if you'd like to chat about this more.
Distribute the load.
Something could be built into RSS to give it a sort of BitTorrent-like functionality. If you log all requests for the RSS feed in question, you could send new requests to the previous sites that requested the feed. Use a simple algorithm to feed newcomers while keeping bandwidth low.
So if 500 people request my RSS feed, I should only have to respond to the request ... say ... 10% of the time.
Of course, this will bring up authenticity concerns. Something else could be built in to make sure the feed is coming from a credible source.
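One way to sketch that, assuming requesters volunteer a mirror URL when they fetch (the peer registry, the 10% figure, and the redirect targets are all made up for illustration, and as noted above the feed would need to be signed so a redirected reader could verify it):

```python
# Toy version of the redirect-to-a-previous-requester idea: the origin keeps
# a short list of recent fetchers willing to re-serve the feed and answers
# only ~10% of requests itself, redirecting the rest to one of those peers.
import random

class PeerRedirector:
    def __init__(self, serve_ratio=0.10, max_peers=50):
        self.serve_ratio = serve_ratio
        self.max_peers = max_peers
        self.peers = []   # mirror URLs offered by recent fetchers

    def handle(self, requester_mirror_url=None):
        """Return ('serve', None) or ('redirect', peer_url) for one request."""
        if self.peers and random.random() > self.serve_ratio:
            answer = ("redirect", random.choice(self.peers))
        else:
            answer = ("serve", None)           # origin answers this one itself
        if requester_mirror_url:               # newcomer offers to help later
            self.peers.append(requester_mirror_url)
            self.peers = self.peers[-self.max_peers:]
        return answer
```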
OT:
Jeremy's book has been reviewed - mentioned on rootprompt.org!!!
http://www.unixreview.com/documents/s=8989/ur0407j/
Yes, yes - I know, it's OT. Slap me across the face with an oversized wet kipper, but I thought I'd like to spread the news considering how many people read this blog.
I've been gathering some links on the subject.
FYI: http://ciberia.blogspot.com/2004/07/reference-scalability-of-syndication.html
The FeedBurner model seems like a practical solution that Yahoo could do as well - this is a case where more is better. Basically FeedBurner subscribes to your feed and then people subscribe to the feed from FeedBurner - so if a bazillion people subscribe, it's FeedBurner's problem. Obviously, they need some incentive to keep doing this, so I suspect they are going to start putting ads in the feeds at some point, although I could imagine the producer paying for the service (a sort of Slashdot insurance). It also means that they can implement gzip and If-Modified-Since even if the underlying feed doesn't support them.
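To illustrate that last point - this is just a toy sketch, not FeedBurner's actual implementation - here is a tiny mirror that re-publishes an origin feed and handles ETags, 304s, and gzip for subscribers even if the origin does none of that. The origin URL and refresh interval are arbitrary choices for the example.

```python
# Toy feed mirror: fetch the origin occasionally, then answer subscribers
# from cache with an ETag, 304 Not Modified, and gzip.
import gzip
import hashlib
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

ORIGIN = "http://example.org/index.xml"   # hypothetical origin feed
REFRESH = 1800                            # re-fetch the origin every 30 minutes
cache = {"body": b"", "etag": "", "fetched": 0.0}

def refresh():
    if cache["body"] and time.time() - cache["fetched"] < REFRESH:
        return
    with urllib.request.urlopen(ORIGIN) as resp:
        cache["body"] = resp.read()
    cache["etag"] = '"%s"' % hashlib.md5(cache["body"]).hexdigest()
    cache["fetched"] = time.time()

class FeedMirror(BaseHTTPRequestHandler):
    def do_GET(self):
        refresh()
        if self.headers.get("If-None-Match") == cache["etag"]:
            self.send_response(304)           # subscriber already has it
            self.end_headers()
            return
        body = cache["body"]
        self.send_response(200)
        self.send_header("Content-Type", "application/rss+xml")
        self.send_header("ETag", cache["etag"])
        if "gzip" in self.headers.get("Accept-Encoding", ""):
            body = gzip.compress(body)
            self.send_header("Content-Encoding", "gzip")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), FeedMirror).serve_forever()
```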
What we need is a new version of USENET that updates quickly and efficiently. This is exactly the type of problem that USENET solved.
There are two problems:
1. It doesn't update fast enough
2. ISPs end up paying for it
I don't know what the deal is with my trackbacks not working - probably my WordPress install, I suppose.
At any rate, my thought was to use FreeCache (or something like it) to distribute the load. Here's my original post in response to Jeremy's question:
http://joseph.randomnetworks.com/archives/2004/07/26/rss-scaling-problems/
Does this need anything more complicated than a DNS-like Last-Modified system? A web service which takes a list of URLs and gives you back a (URL, Last-Modified) tuple for each. URLs it has Last-Modified values cached for can be answered from cache; those that aren't cached are forwarded to the upstream service provider (or to the "root" web server) and the answer cached. TTL could be configured by the cache or derived from any polling indication in the feed itself.
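As a rough sketch of that service's core - the HEAD request stands in for "forward to the upstream provider or root web server", and the TTL is an arbitrary value here, not part of any real protocol:

```python
# Sketch of a DNS-like Last-Modified lookup: answer (url, last_modified)
# tuples from a local cache, going upstream only for entries older than the TTL.
import time
import urllib.request

TTL = 900        # seconds a cached answer stays fresh
_cache = {}      # url -> (last_modified, fetched_at)

def last_modified_upstream(url):
    """HEAD the origin -- stands in for asking the upstream cache/root server."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("Last-Modified")

def lookup(urls):
    now = time.time()
    answers = []
    for url in urls:
        cached = _cache.get(url)
        if cached and now - cached[1] < TTL:
            answers.append((url, cached[0]))       # answered from cache
        else:
            lm = last_modified_upstream(url)
            _cache[url] = (lm, now)
            answers.append((url, lm))
    return answers
```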
I think the main problem is that the bandwidth has to come from somewhere.
A popular blog author can get some relief from something like Bloglines or Yahoo, but that just means Bloglines or Yahoo has to pay for the bandwidth instead of the author, which merely shifts the problem.
So how does the load get distributed evenly? You have to put the burden on the reader. Unfortunately, that generally means subscription fees, when a better solution might be micro-payments, where you automatically get charged for the bandwidth you use from a site's upstream provider. For reading this post I used 16,770 bytes of Jeremy's bandwidth, so his hosting service should charge me $0.01677 (at a dollar per million bytes); on top of that, Jeremy should charge me for his exclusive content.
Maybe this analogy is flawed, but say that websites in general are like restaurants on the side of a road. Viewing a page is like getting in your car, driving over there, and getting a hamburger. Your computer is your car; you had to buy it. The bandwidth is the road upkeep and the gas you have to buy - only online you get the road upkeep and the burger for free. Why?
What is RSS (or Atom, for that matter) trying to solve?
If its role is to indicate that new content is available, then it should be pared down to the essentials, providing just enough to entice a prospective reader to read more.
If its role is to replace HTTP (stupid if you ask me), then you're not saving bandwidth, merely duplicating it.
If its role lies in some nebulous in-between state, then you should pay a price accordingly.
I consider syndication mechanisms to be more "biff" than "lynx", so I have no problem with getting reduced snippets. Heck, I'd be happy if the feed included only the content for the current day (or the last five articles for folks who don't post quite that often). When I want to read the content in full, I use my browser.
I also see no problem with the Slashdot model of denying "abusive" scanners. It's the electronic equivalent of your mom telling you to go outside and play more often. That, or if you're that desperate to read the absolute latest crap someone's seen, pay your share of their bandwidth bill.
You can use squid in httpd accelerator mode (reverse proxy). Basically you need one squid per httpd, running in front of it on port 80. No aggregator has to change anything; it's just a different server setup.
It cannot save any bandwidth, but it can reduce httpd load significantly.
An even simpler solution would be to randomize the polling a little bit: instead of everyone updating at exactly x:00, update with a few minutes or seconds of variation (x:01, and so on).
Or you could build in some slot allocation, some sort of HTTP header extension. Every RSS downloader would have to allocate a slot from the server first; it would be given a number of seconds to wait, a unique id (unique only for an hour or so), and an IP to contact for the actual RSS.
The actual RSS feed would only be served if the id checks out.
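Both ideas are easy to sketch on the client side; the X-Poll-After header below is invented, not part of any spec, and is only there to show the shape of a slot allocation.

```python
# Two ways a client can be polite about timing: jitter its hourly poll so it
# doesn't hit at exactly x:00, and honour a (hypothetical) wait the server
# hands back when allocating a slot.
import random
import time

def next_poll_delay(base_interval=3600, jitter=300):
    """Poll roughly hourly, but never at the top of the hour like everyone else."""
    return base_interval + random.uniform(-jitter, jitter)

def honour_slot(response_headers):
    """If the server allocated us a slot, wait until it opens before fetching."""
    wait = response_headers.get("X-Poll-After")    # invented header name
    if wait:
        time.sleep(int(wait))
```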
My trackback from WordPress doesn't appear to be working either, but here's one area of potential interest:
Let's look at RSS another way - basically it's a kind of "push" technology, except it's really us, the clients, who are pulling it.
This bandwidth + push issue has long been solved by the likes of Tibco, with their core multicast Rendezvous technology.
Never heard of them?
Folks like, errr, Nasdaq use it to push market data around instantly to traders' screens.
(and no - I don't work for them...)
I think everybody is forgetting that bandwidth costs are dropping. I pay less every year for 2-3 times more required bandwidth. Add all the bandwidth-saving mechanisms being suggested and maybe polling, as is, remains an option. And think about the share price of Nortel? :)
A squid reverse-proxy setup is one excellent way of helping the sites. DIY instructions for major blog engines would be good. A couple of recent front-page mentions from Yahoo Japan showed a very sharp spike in requests per second for Wikipedia... and the web servers showed no immediate spike at all.
Don't do mass updates of source sites at neat times. Do it piece by piece so everyone will want something different and may choose to spread the load based on announced different times for each piece. If you must do them all at once, do them at different odd times in each time zone or country IP range (if your servers are already geographically aware) so that smart clients will learn different update times.
Have each client randomly decide which second it is allocated and hit starting at that second, not on the exact minute.
Have feeds allocate a client a time slot and let them authenticate to get an update earlier if they follow the instruction and ask for the magic feed URL within the time slot encoded in the URL. And don't make those URLs unique - you want them to be cachable by your squids. This rewards the polite clients.
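For instance (the URL scheme here is invented purely to illustrate the point): if the slot window is encoded in the URL and the window boundaries are coarse, every polite client in the same window asks for exactly the same URL, which is what makes it cachable by a front-end squid.

```python
# Sketch of a "magic feed URL" whose only variable part is the client's slot
# window, so a proxy can cache one copy per window rather than per client.
import time
import zlib

SLOTS = 12   # twelve five-minute windows per hour

def slot_url(base_url, client_id, when=None):
    when = when or time.time()
    slot = zlib.crc32(client_id.encode()) % SLOTS            # this client's window
    window_start = int(when // 3600) * 3600 + slot * (3600 // SLOTS)
    # Identical for every client sharing the slot -> cachable upstream.
    return "%s?window=%d" % (base_url, window_start)
```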
Set up a cached feed protocol, designed so that intercepting proxies at ISPs can cache sensibly. Then preach the word to ISPs so they can save their own money by setting up a feed cache. European ISPs may be particularly good listeners, since US/Asia bandwidth is more costly than European. Remember to get someone other than Yahoo to tell AOL and Microsoft. :) If Yahoo does it for its own feeds, offer to maintain these and let them work for all feeds, not just Yahoo feeds. Even less work for the ISPs. Don't screw it up by using any available data for Yahoo commercial purposes. Or do and make the popularity data freely available to anyone, in the public domain. If this would save Yahoo enough money, give away preconfigured boxes to the top ISPs in the n markets which cause the biggest load surges.
I see two load problems to address: (1) awareness that there is new content, and (2) downloading the new content.
With (1), uber-aggregators could push notifications to clients, or provide a feed which lists the subscribed feeds which have changed. A client would then only fetch the feeds which have actually changed, and not poll those feeds which have not.
With (2), if a feed changes by just one entry then the whole feed gets retrieved again (sans any delta mechanisms). The Atom community has considered splitting a feed into a list of entries, and a separate retrievable file for each entry. Thus an aggregator client would download a much smaller simple feed, and then download just those entries which it hasn't downloaded yet.
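A client-side sketch of idea (2), with a deliberately dumbed-down index format (one entry URL per line) standing in for whatever the Atom proposals settle on:

```python
# Fetch a small index of entry URLs, then download only the entries we have
# not seen before. The index format here is invented for illustration.
import urllib.request

seen = set()   # a real client would persist this between runs

def sync(index_url):
    with urllib.request.urlopen(index_url) as resp:
        entry_urls = resp.read().decode().split()
    new_entries = []
    for url in entry_urls:
        if url in seen:
            continue                    # already downloaded, zero bytes spent
        with urllib.request.urlopen(url) as resp:
            new_entries.append(resp.read())
        seen.add(url)
    return new_entries
```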
Hi ... Eric from FeedBurner here.
We wrote a little something on this topic and FeedBurner's approach to bandwidth at "FeedBurner Saved My Bandwidth!". As Barnaby mentioned above, we provide a service to feed publishers to "displace" the bandwidth consumption to our servers, and we ensure that we handle If-Modified-Since requests and gzip encoding correctly for all accessing clients. We provide a number of other publisher services as well ... I encourage you to check out FeedBurner.
For desktop clients, there's the issue of the "last mile", where the clients are still polling to simulate a push. There are a number of ideas (some mentioned in this thread) to help with this. I have confidence that this community will figure something out before it gets out of control.
Y'know, this is an old, solved problem, and not really a tough one. (Other than the implied distributed-trust issue, which isn't that big a deal for this particular sort of limited application.)
The margins of this comment are too small to detail it, though. :) I probably ought to write up a full-scale proposal, though the only problem there is that I'm not going to follow through past that, unless I get an urge to do a prototype implementation.
In the short term, you can use a Content Delivery Network (Akamai, Speedera, OpenCDN, etc.). No need to change anything in technology, protocols, clients...
I'm wondering about the reality of this alleged threat in the short term. For me, between a visitor who loads my home page once and one who fetches the RSS feed even 12 times, the latter still consumes less bandwidth (visitors do not do both, only one or the other). Bandwidth costs are going down, we're reducing page sizes significantly by moving to web standards, and we're using conditional requests (304 Not Modified) and gzip compression. For big sites we're using a CDN (which actually can save you a lot in last-mile infrastructure costs). What's the horror story out there?