Reasons to use the robots.txt file
- Not all robots that visit your website have good intentions! Many robots exist solely to scan your website and extract email addresses for spamming purposes. A list of the "evil" ones follows later.
- Your website may still be under construction, or sections of it may be date-sensitive. You can exclude all robots from every page while you are in the middle of creating the site and only let them in once the site is ready. This is useful not only for new websites being built but also for old ones getting re-launched.
- You may well have a membership area that you do not wish to be visible in Google's cache. Not letting the robot in is one way to prevent this.
- There are certain things you may wish to keep private. If you have a look at the Abakus robots.txt file (http://www.abakus-internet-marketing.de/robots.txt), you will notice that indexing of unnecessary forum files and profiles is stopped for privacy reasons. Some webmasters also block robots from their cgi-bin or image directories.
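A file that keeps all robots out of a cgi-bin and an images directory, as described above, might look like this (the directory names are illustrative):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
```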
What is the robots.txt file?
The robots.txt file is a plain ASCII text file containing specific instructions for search engine robots about content they are not allowed to index. These instructions are a deciding factor in how a search engine indexes your website's pages.
- The universal address of the robots.txt file is: www.domain.com/robots.txt (always lowercase for the filename and its contents). This is the first file that a robot visits: it picks up instructions for indexing the site content and follows them. The file contains two text fields. The User-agent field specifies the robot name to which the access policy in the following Disallow field applies; the Disallow field specifies URLs that the named robot may not access. A robots.txt example:
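A minimal example of the two fields (the robot name and directory are illustrative):

```
User-agent: Googlebot
Disallow: /private/
```

This tells the robot named Googlebot not to access anything under the /private/ directory.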
- Separate lines are required for specifying access for different user agents, and a Disallow field should not carry more than one command per line. There is no limit to the number of lines, though: both the User-agent and Disallow fields can be repeated with different commands any number of times.
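Following these rules, a file with several records looks like this, with one Disallow command per line and a separate record for each user agent (robot names and paths are illustrative):

```
User-agent: Googlebot
Disallow: /tmp/
Disallow: /logs/

User-agent: Slurp
Disallow: /tmp/
```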
- Blank lines must not appear within a single record (a User-agent line and its Disallow lines); a blank line marks the end of a record.
- Characters following a # are ignored up to the end of the line, as they are considered to be comments.
- Please also note that there is no “Allow” command in the standard robots.txt protocol. Content not blocked in the “Disallow” field is considered allowed.
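Comments and the implicit-allow rule can be seen together in one file (the path is illustrative):

```
# Keep all robots out of the staging area only;
# everything not listed in a Disallow field is implicitly allowed.
User-agent: *
Disallow: /staging/
```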
- You can check your robots.txt file with the robots.txt Validator at www.searchengineworld.com.
- The robots.txt file can be used to specify the directories on your server that you don't want robots to access and/or index, e.g. temporary, cgi, and private/back-end directories. (Note: careless handling of directory and file names can lead hackers to snoop around your site by studying the robots.txt file, since you may be listing names of directories that hold classified content.)
- Do not put multiple URLs in one Disallow line in the robots.txt file. Use a new Disallow line for every directory that you want to block access to.
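To illustrate (directory names are illustrative), this is wrong:

```
User-agent: *
Disallow: /cgi-bin/ /tmp/ /private/
```

and this is the correct form:

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
```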
- If you want to block access for all but one or more specific robots, then the specific ones should be mentioned first. Let's study this robots.txt example:
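The example referred to here, with the records in the wrong order (assuming MSNBot is the robot you want to let in), would look like this:

```
User-agent: *
Disallow: /

User-agent: MSNBot
Disallow:
```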
In the above case, MSNBot would read the first record (User-agent: *), conclude it is banned, and simply leave the site without indexing anything. The correct syntax is:
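With the specific robot listed first (reconstructed to match the example above):

```
User-agent: MSNBot
Disallow:

User-agent: *
Disallow: /
```

Here MSNBot matches its own record first; an empty Disallow field means nothing is blocked, so it may index the site, while all other robots are excluded entirely.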