Tell the Robots Where Not to Go


Part of being a webmaster or blogger is making sure the little things get done. One of the little things that can actually be a very big thing is telling the robots on the web how to behave when they come to your website.

A robot or web spider is software that automatically travels the Internet, requesting pages and then following the links in those pages to request more documents. The robots that most people are interested in are the search engine robots, like Googlebot, MSNbot and Yahoo Slurp (or whatever name they are using these days). There are many other robots out there besides the search engine robots, and you can find a comprehensive list of web robots at the Web Robots Database.

There are two ways to control the behavior of robots when they visit your website. One is using a robots.txt file, the other is using the robots meta tag.

The robots.txt file is a plain text file that is placed in the root web folder on your web server. A very simple robots.txt file looks like this:

# robots.txt
User-agent: *
Disallow: /cgi-bin/

This simply tells every robot that comes to your website to ignore the cgi-bin folder and all of the files in that folder. If you want to exclude the images folder as well, add the line:

Disallow: /images/

According to the web robots exclusion protocol, you can tell robots that visit your site which folders not to visit. Officially there is no way of telling the robots which folders you do want them to visit; everything is included unless you specifically state in the robots.txt file that a folder is excluded. Something interesting to note is that Google's own robots.txt file actually uses an Allow directive, even though that is not specified in the protocol.
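
As a rough sketch (the folder names here are just placeholders), Googlebot reads the following as "skip the images folder, except for the public subfolder", while robots that only understand the original protocol will simply ignore the Allow line:

# robots.txt
User-agent: Googlebot
Disallow: /images/
Allow: /images/public/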

The second way to control the robots is using the robots meta tag in the head portion of your web pages. As with the robots.txt file, most robots will index a page unless they are told not to, so even if you don't include the robots meta tag your pages should still get indexed by the search robots. Even so, I have tried to get in the habit of including the robots meta tag on all pages, even though the robots don't require it.
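
The tag itself goes in the head section of the page. With the default values it looks like this, which has the same effect as leaving the tag out entirely:

<meta name="robots" content="index, follow">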

If you want to exclude a page from the robots and keep the links on the page from being followed, you tell the robot not to index the page and not to follow its links by using the noindex and nofollow content values.
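
For example:

<meta name="robots" content="noindex, nofollow">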

One of the ways I have used the robots meta tag is to control how pages from catalogues are indexed. The main content of the catalogue is on the product pages; the listings for the individual catalogue categories contain duplicate information that you may not want the search engines to index, but you do want them to follow the links to the individual product pages. In this case I have found that using the robots meta tag (shown below) to keep the category pages out of the index while still following the links to the product pages has increased the number of products indexed by the search engines.
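
On the category pages that means a tag along these lines, telling the robots to skip the page itself but still follow the links on it:

<meta name="robots" content="noindex, follow">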

Controlling the robots when they visit your site can make a difference in how your site is indexed by the search engines and gives you the opportunity to direct the robots to the content that has the highest value for them to index. This can mean more indexed pages, less duplicate content from your site in the index, and, in the end, more visitors finding exactly what they are looking for on your website through the search engines.

Want more information?
The Web Robots Pages
The Official Google Webmaster Central Blog
