Tell the Robots Where Not to Go


Part of being a webmaster or blogger is making sure the little things get done. One of the little things that can actually be a very big thing is telling the robots on the web how to behave when they come to your website.

A robot or web spider is software that automatically travels the Internet, requesting pages and then following the links in those pages to request more documents. The robots that most people are interested in are the search engine robots, like Googlebot, MSNbot and Yahoo Slurp (or whatever name they are using these days). There are many other robots out there besides the search engine robots, and you can find a comprehensive list of web robots at the Web Robots Database.

There are two ways to control the behavior of robots when they visit your website. One is using a robots.txt file, the other is using the robots meta tag.

The robots.txt file is a plain text file that is placed in the root web folder on your web server. A very simple robots.txt file looks like this:

# robots.txt
User-agent: *
Disallow: /cgi-bin/

This simply tells every robot that comes to your website to ignore the cgi-bin folder and all of the files in that folder. If you want to exclude the images folder as well, add the line:

Disallow: /images/

According to the web robots exclusion protocol, you can tell robots that visit your site which folders not to visit. Officially there is no way of telling the robots which folders you do want them to visit; everything is included unless you specifically state in the robots.txt file that a folder is excluded. Something interesting to note is that Google's own robots.txt file actually uses an Allow directive, even though that is not specified in the protocol.
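
As a rough sketch (the folder names here are just placeholders), Googlebot reads the following as "skip the images folder, except for the public subfolder", while robots that only understand the original protocol will simply ignore the Allow line:

# robots.txt
User-agent: Googlebot
Disallow: /images/
Allow: /images/public/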

The second way to control the robots is using the robots meta tag in the head portion of your web pages. As with the robots.txt file, most robots will index a page unless they are told not to, so even if you don't include the robots meta tag your pages should still get indexed by the search robots. Even so, I have tried to get in the habit of including the robots meta tag on all pages, even though the robots don't require it.
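
The tag itself goes in the head section of the page. With the default values it looks like this, which has the same effect as leaving the tag out entirely:

<meta name="robots" content="index, follow">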

If you want to exclude a page from the robots and keep the links on the page from being followed, you tell the robot not to index the page and not to follow its links by using the noindex and nofollow content values.
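
For example:

<meta name="robots" content="noindex, nofollow">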

One of the ways I have used the robots meta tag is to control how pages from catalogues are indexed. The main content of the catalogue is on the product pages; the listings for the individual catalogue categories contain duplicate information that you may not want the search engines to index, but you do want them to follow the links to the individual product pages. In this case I have found that using the robots meta tag (shown below) to keep the category pages out of the index while still following the links to the product pages has increased the number of products indexed by the search engines.
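
On the category pages that means a tag along these lines, telling the robots to skip the page itself but still follow the links on it:

<meta name="robots" content="noindex, follow">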

Controlling the robots when they visit your site can make a difference in how your site is indexed by the search engines and gives you the opportunity to direct the robots to the content that has the highest value for them to index. This can mean more indexed pages, less duplicate content from your site in the index, and, in the end, more visitors finding exactly what they are looking for on your website through the search engines.

Want more information?
The Web Robots Pages
The Official Google Webmaster Central Blog
