Creating a Robots.txt File
By Sumantra Roy
Some people believe that they should create different pages for different search engines, each page optimized for one keyword and for one search engine. Now, while I don't recommend that people create different pages for different search engines, if you do decide to create such pages, there is one issue that you need to be aware of.
These pages, although optimized for different search engines, often turn out to be pretty similar to each other. The search engines can now detect when a site has created such similar-looking pages and will penalize or even ban such sites. To prevent your site from being penalized for spamming, you need to stop each search engine's spider from indexing the pages that are not meant for it, i.e. you need to prevent AltaVista from indexing pages meant for Google and vice-versa. The best way to do that is to use a robots.txt file.
You should create the robots.txt file using a text editor like Windows Notepad. Don't use a word processor to create such a file.
Here is the basic syntax of the robots.txt file:
User-Agent: [Spider Name]
Disallow: [File Name]
For instance, to tell AltaVista's spider, Scooter, not to spider the file named myfile1.html residing in the root directory of the server, you would write
User-Agent: Scooter
Disallow: /myfile1.html
To tell Google's spider, called Googlebot, not to spider the files myfile2.html and myfile3.html, you would write
User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
You can, of course, put multiple User-Agent statements in the same robots.txt file. Hence, to tell AltaVista not to spider the file named myfile1.html, and to tell Google not to spider the files myfile2.html and myfile3.html, you would write
User-Agent: Scooter
Disallow: /myfile1.html

User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
If you want to prevent all robots from spidering the file named myfile4.html, you can use the * wildcard character in the User-Agent line, i.e. you would write
User-Agent: *
Disallow: /myfile4.html
However, you cannot use the wildcard character in the Disallow line.
Once you have created the robots.txt file, you should upload it to the root directory of your domain. Uploading it to any sub-directory won't work; the robots.txt file needs to be in the root directory.
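To confirm that the upload worked, you can simply fetch the file from your domain root and look at it. Here's a minimal sketch in Python (assuming, purely for illustration, that your domain is www.yourdomain.com; replace it with your own):

import urllib.request

# Fetch robots.txt from the domain root; an HTTP 200 status means the file is in place.
with urllib.request.urlopen("http://www.yourdomain.com/robots.txt") as response:
    print(response.status)
    print(response.read().decode())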
I won't discuss the syntax and structure of the robots.txt file any further; you can get the complete specifications from here.
Now we come to how the robots.txt file can be used to prevent your site from being penalized for spamming in case you are creating different pages for different search engines. What you need to do is to prevent each search engine from spidering pages which are not meant for it.
For simplicity, let's assume that you are targeting only two keywords: "tourism in Australia" and "travel to Australia". Also, let's assume that you are targeting only three of the major search engines: AltaVista, HotBot and Google.
Now, suppose you have used the following convention for naming the files: each page is named by joining the individual words of its target keyword with hyphens, and then appending the first two letters of the name of the search engine for which the page is being optimized.
Hence, the files for AltaVista are:
tourism-in-australia-al.html
travel-to-australia-al.html
The files for HotBot are:
tourism-in-australia-ho.html
travel-to-australia-ho.html
The files for Google are:
tourism-in-australia-go.html
travel-to-australia-go.html
As I noted earlier, AltaVista's spider is called Scooter and Google's spider is called Googlebot.
A list of spiders for the major search engines can be found here.
Now, we know that HotBot uses Inktomi, and from this list we find that Inktomi's spider is called Slurp.
Using this knowledge, here's what the robots.txt file should contain:
User-Agent: Scooter
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Slurp
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Googlebot
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
When you put the above lines in the robots.txt file, you instruct each search engine not to spider the files meant for the other search engines.
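If you want to double-check this logic programmatically, here is a short sketch using Python's standard urllib.robotparser module (just one possible way to test the rules, not something the search engines require). It parses the rules above and prints which of the six pages each spider is still allowed to fetch; each spider should end up with only the two pages meant for it:

import urllib.robotparser

# The rules from the robots.txt file shown above.
rules = """\
User-Agent: Scooter
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Slurp
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Googlebot
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
"""

pages = [
    "/tourism-in-australia-al.html", "/travel-to-australia-al.html",
    "/tourism-in-australia-ho.html", "/travel-to-australia-ho.html",
    "/tourism-in-australia-go.html", "/travel-to-australia-go.html",
]

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# Each spider should only be allowed to fetch the two pages optimized for it.
for spider in ("Scooter", "Slurp", "Googlebot"):
    allowed = [page for page in pages if parser.can_fetch(spider, page)]
    print(spider, "may fetch:", allowed)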
When you have finished creating the robots.txt file, double-check to ensure that you have not made any errors anywhere in it. A small error can have disastrous consequences: a search engine may spider files which are not meant for it, in which case it can penalize your site for spamming, or it may not spider any files at all, in which case you won't get top rankings in that search engine.
A useful tool for checking the syntax of your robots.txt file can be found here. While it will help you correct syntactical errors in the robots.txt file, it won't help you correct any logical errors, for which you will still need to go through the robots.txt file thoroughly, as mentioned above.
Article by Sumantra Roy. Sumantra is one of the most respected and recognized search engine positioning specialists on the Internet. For more articles on search engine placement, subscribe to his 1st Search Ranking Newsletter or go to http://www.1stSearchRanking.com