I'm slowly building a collection of traces for access to a relatively
new kind of Web content - RSS feeds.
If you're not familiar with it, RSS is a format that represents a list
of items, each with its own title, description, link, and other
metadata. Clients (often called "aggregators") periodically poll the
feed to build a view of the channel over time, adding new items in the
representation to a local store. In this manner, one can keep abreast
of news headlines and other chronologically-ordered lists.
This format is becoming more popular, both because of increasing
support (sites like MSDN, the New York Times and CNN all have RSS
feeds) and the arrival of "weblogs" (which also use RSS).
There are a number of interesting questions about RSS that come to
mind, including:
- what is the polling interval?
- how common is validation?
- what is the rate of change for the RSS?
- what is the size of the feed?
- what times of day does polling happen?
- how self-similar is RSS traffic (is it "lumpy" around the top of
the hour, for example?)
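
As a rough sketch of how the first two questions might be approached, the
following Python reads combined log format lines, groups requests by
(client, URI), and reports the median gap between successive polls along
with the fraction of 304 responses. The regular expression, the function
names and the output keys are assumptions made for illustration; they are
not part of the published traces. Note also that 304 responses only count
validation that succeeded, since a conditional request answered with a 200
looks like any other request in this log format.

```python
import re
from collections import defaultdict
from datetime import datetime
from statistics import median

# Assumed layout: client ident userinfo [time] "METHOD URI PROTO" status size ...
LOG_RE = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse(line):
    """Return (client, uri, timestamp, status), or None if the line does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    when = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return m.group("client"), m.group("uri"), when, int(m.group("status"))


def summarize(lines):
    """Median gap between polls per (client, URI) and share of 304 responses."""
    seen = defaultdict(list)
    total = not_modified = 0
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue  # skip lines that are not in the assumed layout
        client, uri, when, status = rec
        seen[(client, uri)].append(when)
        total += 1
        if status == 304:
            not_modified += 1
    gaps = []
    for times in seen.values():
        times.sort()
        gaps.extend((b - a).total_seconds() for a, b in zip(times, times[1:]))
    return {
        "median_poll_seconds": median(gaps) if gaps else None,
        "not_modified_fraction": not_modified / total if total else None,
    }
```

Because the grouping key is whatever appears in the client and URI fields,
it should work unchanged on hashed values, provided the hashing was applied
consistently across the trace.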
I suspect that RSS, because it is polled, is not at all typical Web
traffic, and therefore places unusual requirements on Web servers and
intermediaries. I also suspect that it may eventually require us to
rethink distribution; invalidation and other approaches may become much
more desirable, as opposed to polling.
Rather than keep all of the fun for myself (I have a day job), I've
placed the traces on the Web for the greater enjoyment of the caching
and traffic characterization community ;) They are at:
http://www.mnot.net/rss/traces/
(this will redirect to another site; please bookmark the URI above, in
case the redirect target changes).
So far, I have one trace; it has been anonymized (combined log format
with the client IP, ident, userinfo, URI, referer and user-agent fields
hashed or half-hashed). If there is another format that's more
suitable, please tell me.
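
For anyone preparing a similar trace, one common way to hash a field while
keeping it usable for grouping is a keyed hash. The sketch below is only
an illustration of that general idea; it is an assumption, not a
description of how this trace was actually anonymized, and the function
name, key handling and truncation length are all hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value, key, length=12):
    """Replace a log field with a truncated keyed hash (illustrative only).

    The same value and key always give the same token, so requests from
    one client can still be grouped without revealing the original field.
    """
    if value == "-":
        return value  # keep the "no value" marker used in log files
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]
```

Using a secret key rather than a bare hash makes it harder to recover
values such as IP addresses by simply hashing every possible input.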
This trace contains about 500,000 entries and represents a week's worth
of access to an RSS "scraping" service; i.e., it's a Web site that
processes other Web sites to produce a number of feeds. As such, it
contains accesses to multiple feeds.
Please tell me if this is interesting/useful, and send along any
results you come up with. I'm working on getting more traces; stay
tuned (are any of the repositories - e.g., W3C WCA, Internet Traffic
Archive - still active? Neither has seen anything new in quite some
time...).
Regards,