Finding bot names to exclude from your robots file.

You might have noticed an increase in the number of bots trawling the Web these days. Some are good, some are not. Good bots obey the robots.txt file, but unfortunately most bad bots don't.

In fact, bad bots not only ignore the robots.txt file, they also steal content or download entire pages from your website. Bad bots add to your bandwidth consumption and choke up your spam dustbin. Every day, unknown bots crawl your sites without identifying themselves.

The problem is: how do we tell the good bots from the bad ones? A badly written robots.txt file will hinder legitimate spidering, which is what we don't want.

Perhaps we should take a leaf out of Wikipedia's book: as the premier content site on the Web, it can safely be assumed to be a daily target for content-scraping bots and every other kind of nefarious bot.

In Wikipedia.org/robots.txt we probably have a useful list of bots to exclude from our own sites. I'm not 100% sure on this, but I feel we can very safely exclude some of the bots listed there, especially those Wikipedia itself brands as trouble! It's a personal choice, but Wikipedia is arguably the most heavily hit site on the Web by both humans AND robots. I would like to hear any comments on this.
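For illustration, the entries in a file like Wikipedia's tend to look like this. The user-agent names below are examples of site-grabbing tools it has listed at various times — check the live file for the current set:

```
# Ban known site-grabbing tools outright (names shown for illustration)
User-agent: HTTrack
Disallow: /

User-agent: wget
Disallow: /

# Everyone else gets the normal rules
User-agent: *
Disallow: /cgi-bin/
```

Each `User-agent` block applies to any bot whose announced name matches it, and `Disallow: /` tells that bot to stay off the entire site.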

Since bad bots never obey the rules, why bother?

There is also the opposite school of thought among webmasters: keep the robots.txt file as sparse as possible. Experiment and see if it makes a difference for your site.

Two things to know about the robots.txt file:

  • Don’t think of password-protecting your robots.txt file; that goes against its very purpose. If you password-protect it, bots can’t read it, and therefore they won’t know which pages are forbidden. Accept that the robots.txt file has to be readable by anyone, humans included.
  • Listing something in your robots.txt file is no guarantee that it will be excluded. Plenty of bots simply ignore the rules. As I implied earlier, the robots.txt file is really quite limited in its usefulness.
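The key point behind both of the items above is that robots.txt is purely advisory: a well-behaved crawler fetches it and checks it voluntarily before requesting a page. Python's standard library even ships a parser for this, which makes the mechanism easy to see. A quick sketch — the user-agent names and paths here are made up for illustration:

```python
from urllib import robotparser

# Feed the parser a small robots.txt (normally it would fetch /robots.txt itself)
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: HTTrack",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /private/",
])

# A polite crawler asks before every fetch:
print(rp.can_fetch("HTTrack", "http://example.com/index.html"))  # False
print(rp.can_fetch("GoodBot", "http://example.com/index.html"))  # True
print(rp.can_fetch("GoodBot", "http://example.com/private/x"))   # False
```

Nothing stops a bot from skipping the `can_fetch` call entirely — which is exactly what the bad ones do.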

Google, eBay and Wikipedia all employ robots.txt files on their respective sites, so it can be quite instructive to take a look at these mega-sites' files for their educational value.

What about the bots that blatantly disobey the robots.txt file? Can we actually do something about them? Yes, we can, but it involves a fair amount of technical know-how. Here is a very good but technical write-up on it.
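One common approach from such write-ups is to enforce the blocklist on the server side instead of relying on robots.txt: refuse any request whose User-Agent header matches a known bad bot. A minimal sketch in Python — the blocklist names are illustrative, and note that the worst offenders defeat this by spoofing a browser's User-Agent:

```python
# Hypothetical blocklist of site-grabber user-agent fragments (illustrative only)
BLOCKED_AGENTS = ("httrack", "sitesnagger", "webstripper")

def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent header matches a blocklisted bot."""
    ua = (user_agent or "").lower()
    return any(bad in ua for bad in BLOCKED_AGENTS)

print(is_blocked("Mozilla/5.0 (compatible; HTTrack 3.0)"))  # True
print(is_blocked("Mozilla/5.0 (Windows NT 10.0)"))          # False
```

In practice you would wire a check like this into your web server or application middleware and answer matching requests with a 403.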

3 Responses to “Finding bot names to exclude from your robots file.”

  1. I normally do not put anything into my robots.txt file that I do not want seen.
    Also, I never link to anything I want to keep private on my site. Linking to it, or putting it into the robots.txt file, is the only way bots will know it's there.
    I rant a lot (and I mean a lot) about this on my blog – http://www.a-daily-rant.com/

    The robots.txt file is just a road map for the bad bots out there, since it doesn't actually prevent them from accessing the site.

  2. I go for a flat architecture with my site, and whenever I can I exclude extraneous files with robots.txt.

  3. It is happening to my site too: I get lots of bots visiting my website daily. The resource you gave in this article is going to be very useful for me; it is better to know the names of the bots from Wikipedia and block them from entering a site. Other than the Wikipedia bot list, if anyone knows of other bots, post them here. Thanks.