Robots Exclusion Standard

Introduction

Search robots (also called wanderers or spiders) are programs that index web documents on the Internet.

In 1993-94 it became apparent that search robots often indexed documents against the will of web-site owners. Robots sometimes interfered with regular users, and the same files were indexed several times. In some cases robots indexed the wrong documents: very deep virtual directories, temporary information, or CGI scripts. The Robots Exclusion Standard was designed to solve these problems.

Function

To keep robots away from a web server or from parts of it, a file describing the desired robot behaviour is created on that server. The file must be accessible over HTTP at the local URL '/robots.txt'. The content of this file is described below.

This approach lets a robot find the rules governing its behaviour by requesting just one file. The file '/robots.txt' can easily be created on any existing web server.
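
As an illustration only (not part of the standard), the following Python sketch shows how a robot could derive the exclusion-file URL from any page URL and download it; the function name and the page URL in the usage comment are hypothetical:

import urllib.parse
import urllib.request

def fetch_robots_txt(page_url):
    # The standard fixes the location: always '/robots.txt' at the server root,
    # regardless of which document the robot intends to index.
    parts = urllib.parse.urlsplit(page_url)
    robots_url = urllib.parse.urlunsplit(
        (parts.scheme, parts.netloc, "/robots.txt", "", ""))
    with urllib.request.urlopen(robots_url) as response:
        return response.read().decode("utf-8", errors="replace")

# Example: fetch_robots_txt("http://www.site.com/docs/page.html")
# requests http://www.site.com/robots.txt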

The choice of this particular URL was dictated by several considerations:

- the file name should fit the file-naming restrictions of all common operating systems;
- the file name should not require any additional configuration of the web server;
- the file name should indicate the purpose of the file and be easy to remember;
- the likelihood of a clash with existing files should be minimal.

Format

The format and semantics of the '/robots.txt' file are as follows:

The file must contain one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL). Each record consists of lines of the form "<field>:<optional_space><value><optional_space>".

The <field> name is case-insensitive.

Comments can be included in the usual UNIX way: the '#' character marks the start of a comment, and the end of the line marks its end.

A record starts with one or more 'User-Agent' lines, followed by one or more 'Disallow' lines (described below). Unrecognized lines are ignored.
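
The following Python sketch (an illustration, not a normative parser; the function name is hypothetical) applies the rules above: records separated by blank lines, 'field: value' lines, '#' comments, and case-insensitive field names:

def parse_robots_txt(text):
    records = []   # each record is a list of (field, value) pairs
    current = []
    for raw_line in text.splitlines():   # splitlines() accepts CR, CR/NL, and NL
        line = raw_line.split("#", 1)[0].strip()   # drop comments and surrounding spaces
        if not line:
            if current:                  # a blank line ends the current record
                records.append(current)
                current = []
            continue
        if ":" in line:
            field, value = line.split(":", 1)
            current.append((field.strip().lower(), value.strip()))
        # lines without ':' are unrecognized and ignored
    if current:
        records.append(current)
    return records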

User-Agent

The value of this field is the name of the robot the record describes the access policy for. If several 'User-Agent' lines are present, the record applies to several robots; at least one such line is required per record. The value '*' describes the default access policy for any robot that has not matched any other record, and the '/robots.txt' file may contain at most one such record.

Disallow

The value of this field specifies a partial URL that must not be visited. This can be a full path or a partial path: any URL that starts with this value will not be retrieved. For example, 'Disallow: /help' forbids both '/help.html' and '/help/index.html', while 'Disallow: /help/' forbids '/help/index.html' but allows '/help.html'. An empty value means that any URL may be retrieved; at least one 'Disallow' line must be present in a record.

If the file '/robots.txt' is empty, does not conform to this format and semantics, or is missing, search robots act according to their own settings.
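
In practice a robot does not have to implement this logic itself. Python's standard library module urllib.robotparser, for instance, handles this file format; a short sketch (the robot name 'mybot' is hypothetical, the host is taken from the examples below):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.site.com/robots.txt")
rp.read()    # downloads and parses the file

# can_fetch() applies the User-Agent and Disallow rules described above;
# if '/robots.txt' is empty or absent (HTTP 404), it permits every URL.
print(rp.can_fetch("mybot", "http://www.site.com/cyberworld/map/index.html"))
print(rp.can_fetch("mybot", "http://www.site.com/index.html"))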

Examples

Example 1:

# robots.txt for http://www.site.com
User-Agent: *
Disallow: /cyberworld/map/ # this is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear

In this example the contents of '/cyberworld/map/' and '/tmp/' are closed to all robots.

Example 2:

# robots.txt for http://www.site.com
User-Agent: *
Disallow: /cyberworld/map/ # this is an infinite virtual URL space
# Cybermapper knows where to go
User-Agent: cybermapper
Disallow:

In this example the search robot 'cybermapper' is granted full access, while all other robots are denied access to the contents of '/cyberworld/map/'.
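
This behaviour can be checked with the urllib.robotparser module mentioned above; the robot name 'somebot' stands in for any robot other than 'cybermapper':

import urllib.robotparser

rules = """
User-Agent: *
Disallow: /cyberworld/map/

User-Agent: cybermapper
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("cybermapper", "/cyberworld/map/index.html"))   # True
print(rp.can_fetch("somebot", "/cyberworld/map/index.html"))       # False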

Example 3:

# robots.txt for http://www.site.com
User-Agent: *
Disallow: /

In this example every search robot is denied access to the entire server.
