As a continuation of a series of explanatory blog entries on RSS, today I wanted to explain how conventional RSS polling works.
Not Push
Remember first that RSS is not push. It's polling. The conventional way of implementing alert technology is via push because experience has taught us that push scales more effectively than polling. There were several attempts to create pushing protocols on the Internet, but they failed because network administrators were blocking incoming Internet connections to combat viruses. This is why RSS became so popular, because it broke through the firewalls by using polling. Finally, we had alerts on the Internet.
Once an Hour
Nobody ever told us how often to poll RSS news feeds, but most of the original RSS readers polled the feeds by default once every hour. This convention has persisted and even today many RSS aggregators continue to poll feeds once per hour. Based on my own evidence, the average RSS reader polls a feed about eight times per day. Why not 24 times? Because many RSS clients are native and only poll when the user is logged on or even active.
TTL
Beyond the once an hour rule, RSS does have a few elements that give readers hints about when a feed should be polled. The first such hint was <ttl>. This was originally put into RSS in order to accommodate peer-to-peer clients like Morpheus. The idea was that the RSS feed could live in a P2P network for a number of minutes equal to the <ttl> before it is re-fetched from source. Unfortunately, RSS never took off in the P2P world and P2P itself has struggled to stay alive. Some publishers and aggregators have started to use <ttl> as the default polling interval in place of the one hour default. In fact, aggregators that cache feeds on behalf of many readers should always respect the <ttl> as a ceiling for how long they are allowed to cache the feed before refreshing.
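To make the <ttl> idea concrete, here is a minimal sketch in Python. It assumes an RSS 2.0 feed where <ttl> sits directly under <channel> and is expressed in minutes. The function name polling_interval_minutes and the 60 minute fallback are my own choices for illustration, not part of any standard library.

```python
import xml.etree.ElementTree as ET

# Assumption: fall back to the conventional once-an-hour default.
DEFAULT_INTERVAL_MINUTES = 60


def polling_interval_minutes(rss_text, default=DEFAULT_INTERVAL_MINUTES):
    """Return the feed's <ttl> in minutes, or the default if it is missing or invalid."""
    channel = ET.fromstring(rss_text).find("channel")
    if channel is None:
        return default
    ttl_text = channel.findtext("ttl")  # None when the element is absent
    try:
        ttl = int(ttl_text.strip())
    except (AttributeError, ValueError):
        # AttributeError: no <ttl> at all; ValueError: empty or non-numeric text.
        return default
    return ttl if ttl > 0 else default


feed = '<rss version="2.0"><channel><title>Example</title><ttl>30</ttl></channel></rss>'
print(polling_interval_minutes(feed))
```

A caching aggregator could use the returned value as the maximum age of its cached copy, and a desktop reader could use it in place of its hourly default.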
skipHours and skipDays
Two other elements that affect how often a feed is polled are skipHours and skipDays. Unfortunately, these elements are rarely implemented or respected. Their application is obvious, but there are holes in the implementation. These elements contain a list of <day> and <hour> elements naming the times when an RSS client should avoid taxing the source with excessive polls. The issues behind the elements are too confusing to be covered in this blog entry. I'll discuss them later in a separate entry. Just remember, these elements are rarely used or respected, so I wouldn't worry too much about implementing them perfectly in your own RSS application.
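For the curious, a simple reading of these two elements can be sketched in Python. This assumes the RSS 2.0 layout, where <skipHours> holds <hour> values from 0 to 23 in GMT and <skipDays> holds English day names such as Monday. The function name should_skip_poll is hypothetical, and the sketch deliberately ignores the edge cases I mentioned.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Indexed to match datetime.weekday(), where Monday is 0.
DAY_NAMES = ["monday", "tuesday", "wednesday", "thursday",
             "friday", "saturday", "sunday"]


def should_skip_poll(rss_text, now=None):
    """Return True if the feed asks not to be polled at the given moment.

    `now` must be a timezone-aware datetime; it defaults to the current time.
    """
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    channel = ET.fromstring(rss_text).find("channel")
    if channel is None:
        return False

    skip_days = {d.text.strip().lower()
                 for d in channel.findall("skipDays/day") if d.text}

    skip_hours = set()
    for h in channel.findall("skipHours/hour"):
        try:
            skip_hours.add(int(h.text))
        except (TypeError, ValueError):
            continue  # ignore empty or malformed <hour> values

    return DAY_NAMES[now.weekday()] in skip_days or now.hour in skip_hours
```

An aggregator would call this just before a scheduled poll and, if it returns True, push the poll back to the next allowed hour.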
Weblogs.com
The last method that I'll discuss here was a key part of the early blogosphere. Dave Winer created a website, Weblogs.com, that tracks which blogs were recently updated. Millions of blogs and RSS feeds ping this website each and every time they are updated. An aggregator that polls a lot of feeds should pull the shortChanges.xml file on this website to determine which blogs have been recently updated, and then poll only those. You can poll this file about once every 5 minutes for best results. The shortChanges file contains a list of blogs that have been updated in the last 5 minutes. You can also pull the changes.xml file on this website once every hour. This changes file contains a list of blogs that have been updated in the last hour.
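As a rough sketch of how an aggregator might use the changes file, the Python below matches pinged blogs against a subscription list. It assumes the file is a list of <weblog> elements, each with a url attribute holding the blog's address; check the actual file before relying on that. The function name feeds_to_poll and the subscriptions mapping are hypothetical, and fetching the file over HTTP is left out.

```python
import xml.etree.ElementTree as ET


def feeds_to_poll(changes_xml, subscriptions):
    """Return feed URLs whose blog appears in the changes file.

    `subscriptions` maps a blog's site URL to its feed URL (an assumed
    structure for this sketch). Trailing slashes are ignored when matching.
    """
    root = ET.fromstring(changes_xml)
    pinged = {w.get("url", "").rstrip("/") for w in root.iter("weblog")}
    return [feed for site, feed in subscriptions.items()
            if site.rstrip("/") in pinged]


# Hand-written sample in the assumed format, not real data.
sample = """<weblogUpdates version="1">
  <weblog name="Example Blog" url="http://blog.example.com/" when="12"/>
  <weblog name="Other Blog" url="http://other.example.org" when="40"/>
</weblogUpdates>"""

subscriptions = {
    "http://blog.example.com": "http://blog.example.com/rss.xml",
    "http://quiet.example.net": "http://quiet.example.net/index.rss",
}
print(feeds_to_poll(sample, subscriptions))
```

Feeds that never ping the site still need ordinary polling, so this works best as a shortcut layered on top of the hourly schedule rather than as a replacement for it.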
If you are writing an RSS aggregator, then I hope you'll take advantage of these techniques to improve your software's polling behaviour and reduce your bandwidth. Now go code something.