Using a robots.txt file to prevent search engine spidering

By | April 6, 2007

When a search engine spider visits a site, say http://www.YourSite.com/, it first checks for YourSite.com/robots.txt. If the robots.txt file exists (that is, you actually created one), the spider reads the rules in it. For example, these two lines tell every spider to stay out of the entire site:

User-agent: *
Disallow: /
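Here is a minimal sketch of that lookup in Python, using the standard library's urllib.robotparser. The robots_url helper and the "ExampleSpider" user agent are made-up names for illustration, and real search engines may apply the rules with their own variations.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # robots.txt always lives at the root of the host, whatever page was requested.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The two rules shown above: every robot, whole site off limits.
RULES = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())  # parse the text directly, no network access needed

page = "http://www.YourSite.com/some/page.html"
location = robots_url(page)
blocked = not parser.can_fetch("ExampleSpider", page)  # hypothetical spider name

print(location)  # http://www.YourSite.com/robots.txt
print(blocked)   # True
```

A well-behaved spider does the same thing in spirit: work out where robots.txt lives, read it, and only then decide whether to request the page.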

Sometimes there are pages we don’t want search engines to spider, such as:

  • sales pages
  • site rules
  • disclaimers
  • privacy policies
  • private pages
  • contact pages (prevent spamming)

We certainly don’t want our “thank you” sales pages to show up in the search engines, since everyone could then come and download our stuff for free. I made the mistake of not adding a robots.txt file to one of my sites, and now the contact page has been spidered by Google, so maybe some spam is headed for my mailbox.

To do this, we add a robots.txt file to the public_html directory (the root of the site) and specify which pages or directories should NOT be indexed. Just create a plain text file in Notepad, list the pages that should not be spidered, save it as robots.txt and upload it. So if your thank you page is located at yoursite.com/thankyou, then you can put this code in your robots.txt file:

User-agent: *
Disallow: /thankyou
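To get a feel for what this rule covers, here is a small Python sketch that feeds it to the standard library's urllib.robotparser. The "ExampleSpider" name and the extra paths are assumptions for illustration. Python's parser treats the Disallow value as a path prefix, so anything that starts with /thankyou is blocked too; individual search engines may match slightly differently.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /thankyou
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path: str) -> bool:
    """Return True if a rule-following robot may fetch this path on yoursite.com."""
    return parser.can_fetch("ExampleSpider", "http://yoursite.com" + path)


# Hypothetical paths, chosen to show the prefix behaviour.
paths = ["/", "/index.html", "/thankyou", "/thankyou/download.zip", "/thankyou.html"]
results = {path: allowed(path) for path in paths}

for path, ok in results.items():
    print(path, "allowed" if ok else "blocked")
```

Note that /thankyou.html is blocked as well, which may or may not be what you want. If you only mean the directory, writing the rule as /thankyou/ narrows it down.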

For a particular HTML page, the easiest way to stop search engine spidering is to add a robots meta tag somewhere within the page’s <head> tags. If you don’t want the page to be indexed, use this code:

<META NAME="ROBOTS" CONTENT="NOINDEX">

If you only want to stop the links on the page from being followed, then use this meta tag:

<META NAME="ROBOTS" CONTENT="NOFOLLOW">

If you don’t want indexing AND link following, use this meta tag:

<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
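If you want to check that a page really carries the tag you intended, a rough sketch like the following can help. It uses Python's built-in html.parser to collect the robots directives from a page. The RobotsMetaParser class, the robots_directives function and the sample page are invented for this example, and this is not how any particular search engine reads the tag.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META NAME=...> works too.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_directives(html: str) -> set:
    """Return the lowercase robots directives declared in the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical thank you page, for illustration only.
page = (
    "<html><head>"
    '<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">'
    "</head><body>Thanks!</body></html>"
)

directives = robots_directives(page)
may_index = "noindex" not in directives
may_follow = "nofollow" not in directives
print(sorted(directives), may_index, may_follow)
```

The attribute values need plain straight quotes. Curly quotes pasted in from a word processor end up as part of the value, and the tag is then ignored.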

These meta tags will prevent that particular page from being spidered in the ways you specified.

Some spiders or robots do not obey the robots.txt file commands, and these are usually “bad” robots, like spam bots or content scrapers. So, having a robots.txt file helps in identifying the bad robots and blacklisting them. Here’s how to exclude a particular robot/spider from your entire site in the robots.txt file (this only works for robots that actually honor the file):

User-agent: BadRobotX
Disallow: /
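The idea of using robots.txt to spot bad robots can be sketched in Python like this. The request list is hypothetical sample data standing in for your server logs, "ExampleSpider" and "ScraperY" are made-up names, and the extra /thankyou rule is borrowed from earlier in this post. Any robot that requested a path the rules told it to avoid gets flagged.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: BadRobotX
Disallow: /

User-agent: *
Disallow: /thankyou
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Hypothetical (user agent, requested path) pairs, made up for illustration.
# In practice these would come from your web server's access log.
SAMPLE_REQUESTS = [
    ("ExampleSpider", "/index.html"),
    ("BadRobotX", "/index.html"),
    ("ScraperY", "/thankyou"),
]


def rule_breakers(requests):
    """Return the sorted user agents that fetched something robots.txt disallows."""
    offenders = set()
    for agent, path in requests:
        if not parser.can_fetch(agent, "http://yoursite.com" + path):
            offenders.add(agent)
    return sorted(offenders)


offenders = rule_breakers(SAMPLE_REQUESTS)
print(offenders)  # ['BadRobotX', 'ScraperY']
```

Once you know which user agents ignore the rules, you can block them at the server level instead, since robots.txt by itself is only a polite request.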
