You might have noticed an increase in the number of bots trawling the Web these days. Some are good, some are not. Good bots obey the robots.txt file, but unfortunately most bad bots don’t.
In fact, bad bots not only ignore the robots.txt file, they also scrape your content or download entire pages off your website. Bad bots add to your bandwidth consumption and choke up your spam dustbin. Every day, unknown bots crawl your sites without identifying themselves.
The problem is, how do we know which bots are good and which are bad? A wrongly configured robots.txt file will hinder legitimate spidering, which is exactly what we don't want.
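To illustrate how easy it is to get this wrong, here is a minimal robots.txt sketch. The paths are hypothetical examples, not taken from any particular site:

```
# Too broad - this shuts out every well-behaved spider, search engines included:
# User-agent: *
# Disallow: /

# Safer - block only the areas you genuinely don't want crawled:
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
```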
Perhaps we should take a leaf out of Wikipedia's book. As the premier content site on the Web, we can safely assume it is a daily target for content-scraping bots and all other kinds of nefarious bots.
At Wikipedia.org/robots.txt we probably have a useful list of bots to exclude from our own sites. I'm not 100% sure about this, but I feel we can very safely exclude some of the bots listed there, especially those branded as trouble by Wikipedia! It's a personal choice, but we can safely assume Wikipedia is one of the most heavily hit sites on the Web by both humans AND robots. I would like to hear any comments on this.
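If you decide to go down that route, the directives themselves are simple. The sketch below blocks a couple of bots by name; the user-agent strings here are placeholders, so substitute the actual names you find in Wikipedia's file:

```
# Block specific misbehaving crawlers outright (names are placeholders):
User-agent: ExampleScraperBot
Disallow: /

User-agent: AnotherBadBot
Disallow: /

# Everyone else may crawl normally:
User-agent: *
Disallow:
```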
Since bad bots never obey the rules, why bother?
There is also an opposite train of thought among webmasters, which is to keep the robots.txt file as sparse as possible. You should experiment and see if it makes a difference.
Two things about the robots.txt file:
- Don't think of password-protecting your robots.txt file. That defeats its very purpose. If you password-protect your robots.txt file, bots will not be able to read it and therefore won't know which pages are forbidden. Accept that the robots.txt file has to be publicly readable.
- Listing something in your robots.txt is no guarantee that it will be excluded. Many bots simply ignore the rules. The robots.txt file is really quite limited in its usefulness, as I implied earlier.
Google, eBay and Wikipedia all employ the robots.txt file on their respective sites, so it might be quite useful to take a look at these mega sites' robots.txt files for the educational value (the file always lives at the site root, e.g. google.com/robots.txt).
What about the bots that blatantly disobey the robots.txt file? Can we actually do something about them? Yes, we can, but it involves a fair deal of technical know-how. Here is a very good but technical write-up on it.
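One common approach, for those on an Apache server, is to refuse requests by user-agent before they ever reach your pages. This is only a minimal sketch: it assumes mod_rewrite is enabled, and the name BadBot is a placeholder for whatever crawler you want to turn away:

```
# .htaccess - send a 403 Forbidden to any request whose user-agent matches a known bad bot
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F,L]
```

Of course, a truly determined bot can fake its user-agent string, so treat this as mitigation rather than a cure.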