Really, robots.txt Is There for a Reason

Quite a while back in Internet history, a convention called robots.txt was introduced to control which content web crawlers and services are permitted to crawl. The file maps out which crawlers and bots are allowed to access which parts of a site. On our site I deal with a lot of user traffic, and the complicated nature of our queries means it's quite intensive to generate the user pages.
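For anyone unfamiliar with the format, a robots.txt file is just a plain-text set of rules served from the site root. A minimal example (the paths here are illustrative, not our actual layout) might look like this:

```
User-agent: *
Disallow: /users/
Crawl-delay: 10
```

The `User-agent` line names which crawler the rules apply to (`*` means everyone), `Disallow` marks paths off-limits, and `Crawl-delay`, though not part of the original standard, is honoured by many well-behaved crawlers as a minimum number of seconds between requests.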

There are multiple levels of caching, with memcache among them, but even at peak times we still have to serve pages to users. The most infuriating thing is that during the period each day when we send out tweets for users, we get bombarded with requests from crawlers and bots. This slows down both our service and our processing, yet if the crawlers in question honoured robots.txt as they should, this wouldn't be an issue at all. It wouldn't even be too bad if the requests were staggered over a period of time.
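To show how little effort honouring robots.txt takes, here is a sketch of what a well-behaved crawler should do using Python's standard `urllib.robotparser` module. The rules and URLs below are hypothetical, not taken from our actual site:

```python
import urllib.robotparser

# Hypothetical robots.txt content, as a list of lines.
rules = [
    "User-agent: *",
    "Disallow: /users/",
    "Crawl-delay: 10",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# A polite crawler checks before fetching...
print(rp.can_fetch("SomeBot", "https://example.com/users/alice"))  # False
print(rp.can_fetch("SomeBot", "https://example.com/about"))        # True

# ...and respects the requested delay between requests.
print(rp.crawl_delay("SomeBot"))  # 10
```

In a real crawler you would call `rp.set_url("https://example.com/robots.txt")` and `rp.read()` instead of parsing a hard-coded list, but the checks are the same three lines either way.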

The worst offenders at first glance are TweetMeme, Radian6, PostRank, TwittUrls, MetaURI, Twingly, Page-Store and Chainn.