

Network Working Group                                          M. Koster
INTERNET DRAFT                                                WebCrawler
Category: Informational                                    November 1996
Dec 4, 1996                                         Expires June 4, 1997
<draft-koster-robots-00.txt>

                      A Method for Web Robots Control


Status of this Memo

     This document is an Internet-Draft.  Internet-Drafts are
     working documents of the Internet Engineering Task Force
     (IETF), its areas, and its working groups.  Note that other
     groups may also distribute working documents as Internet-
     Drafts.

     Internet-Drafts are draft documents valid for a maximum of six
     months and may be updated, replaced, or obsoleted by other
     documents at any time.  It is inappropriate to use Internet-
     Drafts as reference material or to cite them other than as
     ``work in progress.''

     To learn the current status of any Internet-Draft, please
     check the ``1id-abstracts.txt'' listing contained in the
     Internet-Drafts Shadow Directories on ftp.is.co.za (Africa),
     nic.nordu.net (Europe), munnari.oz.au (Pacific Rim),
     ds.internic.net (US East Coast), or ftp.isi.edu (US West
     Coast).

Koster                draft-koster-robots-00.txt                [Page 1]

INTERNET DRAFT        A Method for Robots Control       December 4, 1996


Table of Contents

   1.    Abstract  . . . . . . . . . . . . . . . . . . . . . . . . . 2
   2.    Introduction  . . . . . . . . . . . . . . . . . . . . . . . 2
   3.    The Specification . . . . . . . . . . . . . . . . . . . . . 3
   3.1   Access method . . . . . . . . . . . . . . . . . . . . . . . 3
   3.2   File Format Description . . . . . . . . . . . . . . . . . . 4
   3.2.1 The User-agent line . . . . . . . . . . . . . . . . . . . . 5
   3.2.2 The Allow and Disallow lines  . . . . . . . . . . . . . . . 5
   3.3   Formal Syntax . . . . . . . . . . . . . . . . . . . . . . . 6
   3.4   Expiration  . . . . . . . . . . . . . . . . . . . . . . . . 8
   4.    Examples  . . . . . . . . . . . . . . . . . . . . . . . . . 8
   5.    Notes for Implementors  . . . . . . . . . . . . . . . . . . 9
   5.1   Backwards Compatibility . . . . . . . . . . . . . . . . . . 9
   5.2   Interoperability  . . . . . . . . . . . . . . . . . . . . . 10
   6.    Security Considerations . . . . . . . . . . . . . . . . . . 10
   7.    Acknowledgements  . . . . . . . . . . . . . . . . . . . . . 10
   8.    References  . . . . . . . . . . . . . . . . . . . . . . . . 11
   9.    Author's Address  . . . . . . . . . . . . . . . . . . . . . 11


1.  Abstract

   This memo defines a method for administrators of sites on the World-
   Wide Web to give instructions to visiting Web robots, most
   importantly what areas of the site are to be avoided.

   This document provides a more rigid specification of the Standard
   for Robots Exclusion [1], which has been in widespread use by the
   Web community since 1994.


2.  Introduction

   Web Robots (also called "Wanderers" or "Spiders") are Web client
   programs that automatically traverse the Web's hypertext structure
   by retrieving a document, and recursively retrieving all documents
   that are referenced.

   Note that "recursively" here doesn't limit the definition to any
   specific traversal algorithm; even if a robot applies some heuristic
   to the selection and order of documents to visit and spaces out
   requests over a long period of time, it qualifies to be called a
   robot.

   Robots are often used for maintenance and indexing purposes, by
   people other than the administrators of the site being visited. In
   some cases such visits may have undesirable effects which the
   administrators would like to prevent, such as indexing of an
   unannounced site, traversal of parts of the site which require vast
   resources of the server, recursive traversal of an infinite URL
   space, etc.

   The technique specified in this memo allows Web site administrators
   to indicate to visiting robots which parts of the site should be
   avoided. It is solely up to the visiting robot to consult this
   information and act accordingly. Blocking parts of the Web site
   regardless of a robot's compliance with this method is outside
   the scope of this memo.
  
  
3. The Specification

   This memo specifies a format for encoding instructions to visiting
   robots, and specifies an access method to retrieve these
   instructions. Robots must retrieve these instructions before visiting
   other URLs on the site, and use the instructions to determine if
   other URLs on the site can be accessed.

3.1 Access method

   The instructions must be accessible via HTTP [2] from the site that
   the instructions are to be applied to, as a resource of Internet
   Media Type [3] "text/plain" under a standard relative path on the
   server: "/robots.txt".

   For convenience we will refer to this resource as the "/robots.txt
   file", though the resource need in fact not originate from a file-
   system.

   Some examples of URLs [4] for sites and URLs for the corresponding
   "/robots.txt" files:

     http://www.foo.com/welcome.html http://www.foo.com/robots.txt

     http://www.bar.com:8001/        http://www.bar.com:8001/robots.txt
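
   The mapping from a site URL to its "/robots.txt" URL can be
   sketched as follows. This is a minimal illustration using Python's
   standard urllib.parse module; the function name robots_txt_url is
   hypothetical and not part of this memo.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(site_url: str) -> str:
    """Return the /robots.txt URL for the server that hosts site_url."""
    parts = urlsplit(site_url)
    # Keep the scheme and host[:port]; replace the path and drop any
    # query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

   Applied to the two example site URLs above, the sketch yields the
   two "/robots.txt" URLs listed beside them.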

   If the server response indicates Success (HTTP 2xx Status Code),
   the robot must read the content, parse it, and follow any
   instructions applicable to that robot.

   If the server response indicates the resource does not exist (HTTP
   Status Code 404), the robot can assume no instructions are
   available, and that access to the site is not restricted by
   /robots.txt.






   Specific behaviours for other server responses are not required by
   this specification, though the following behaviours are recommended:

     - On server response indicating access restrictions (HTTP Status
       Code 401 or 403) a robot should regard access to the site as
       completely restricted.

     - If the request attempt resulted in a temporary failure, a robot
       should defer visits to the site until such time as the resource
       can be retrieved.
 
     - On server response indicating Redirection (HTTP Status Code 3XX)
       a robot should follow the redirects until a resource can be
       found.
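
   The required and recommended reactions to the server response can
   be summarised in a small decision function. This is only a sketch:
   the function name and the returned labels are hypothetical, and
   treating 5xx status codes as "temporary failure" is an assumption,
   since this memo does not say which responses count as temporary.

```python
def robots_fetch_outcome(status: int) -> str:
    """Map the HTTP status of a /robots.txt request to a robot action."""
    if 200 <= status < 300:
        return "parse"             # read, parse, and follow the content
    if 300 <= status < 400:
        return "follow-redirect"   # keep following until a resource is found
    if status == 404:
        return "unrestricted"      # no instructions available
    if status in (401, 403):
        return "fully-restricted"  # regard the whole site as off limits
    if 500 <= status < 600:
        return "defer"             # assumption: 5xx is a temporary failure
    return "unspecified"           # not covered by the memo
```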


3.2 File Format Description

   The instructions are encoded as a formatted plain text object,
   described here. A complete BNF-like description of the syntax of this
   format is given in section 3.3.
 
   The format logically consists of a non-empty set of records,
   separated by blank lines. The records consist of a set of lines of
   the form:
 
     <Field> ":" <value>
 
   In this memo we refer to lines with a Field "foo" as "foo lines".

   The record starts with one or more User-agent lines, specifying
   which robots the record applies to, followed by "Disallow" and
   "Allow" instructions to that robot. For example:
 
     User-agent: webcrawler
     User-agent: infoseek
     Allow:    /tmp/ok.html
     Disallow: /tmp
     Disallow: /user/foo
   
   These lines are discussed separately below.
  
   Lines with Fields not explicitly specified by this specification
   may occur in the /robots.txt, allowing for future extension of the
   format. Consult the BNF for restrictions on the syntax of such
   extensions. Note specifically that for backwards compatibility
   with robots implementing earlier versions of this specification,
   breaking of lines is not allowed.


  


   Comments are allowed anywhere in the file, and consist of optional
   whitespace, followed by a comment character '#' followed by the
   comment, terminated by the end-of-line.
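
   A lenient reader for this format might look like the sketch below.
   It is not a full implementation of the grammar in section 3.3: the
   name parse_records is hypothetical, lower-casing of field names is
   an assumption made for leniency, and lines without a ":" are simply
   skipped.

```python
def parse_records(text: str) -> list[list[tuple[str, str]]]:
    """Split /robots.txt content into records of (field, value) pairs."""
    records, current = [], []
    for raw in text.splitlines():
        if raw.strip() == "":
            # A blank line ends the current record, if there is one.
            if current:
                records.append(current)
                current = []
            continue
        # Drop the comment: everything from '#' to the end of the line.
        body = raw.split("#", 1)[0]
        field, sep, value = body.partition(":")
        if not sep or not field.strip():
            continue  # comment-only or unparseable line: not a boundary
        current.append((field.strip().lower(), value.strip()))
    if current:
        records.append(current)
    return records
```

   Note that a line holding only a comment is skipped without closing
   the record, so comments may appear between the lines of a record.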
 
3.2.1 The User-agent line

   Name tokens are used to allow robots to identify themselves via a
   simple product token. Name tokens should be short and to the
   point. The name token a robot chooses for itself should be sent
   as part of the HTTP User-agent header, and must be well documented.

   These name tokens are used in User-agent lines in /robots.txt to
   identify to which specific robots the record applies. The robot
   must obey the first record in /robots.txt that contains a
   User-agent line whose value contains the name token of the robot
   as a substring. The name comparisons are case-insensitive. If no
   such record exists, it should obey the first record with a
   User-agent line with a "*" value, if present. If no record
   satisfies either condition, or no records are present at all,
   access is unlimited.
 
   For example, a fictional company FigTree Search Services, who names
   their robot "Fig Tree", might send HTTP requests like:
 
     GET / HTTP/1.0
     User-agent: FigTree/0.1 Robot libwww-perl/5.04
   
   and the robot might scan the "/robots.txt" file for records with:
 
     User-agent: figtree
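
   The record selection rule can be sketched as below. The record
   layout (a dict with an "agents" list) and the function name
   select_record are assumptions made for illustration only.

```python
from typing import Optional


def select_record(records: list[dict], name_token: str) -> Optional[dict]:
    """Pick the record a robot with the given name token must obey."""
    token = name_token.lower()
    # First choice: the first record with a User-agent value that
    # contains the robot's name token (case-insensitive substring).
    for record in records:
        if any(token in agent.lower() for agent in record["agents"]):
            return record
    # Fallback: the first record with a "*" User-agent value.
    for record in records:
        if "*" in record["agents"]:
            return record
    return None  # no applicable record: access is unlimited
```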
 
3.2.2 The Allow and Disallow lines

   These lines indicate whether accessing a URL that matches the
   corresponding path is allowed or disallowed. Note that these
   instructions apply to any HTTP method on a URL.
 
   To evaluate if access to a URL is allowed, a robot must attempt to
   match the paths in Allow and Disallow lines against the URL, in the
   order they occur in the record. The first match found is used. If no
   match is found, the default assumption is that the URL is allowed.

   The /robots.txt URL is always allowed, and must not appear in the
   Allow/Disallow rules.
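
   A first-match evaluation over one record can be sketched as
   follows. The rule representation, a list of (kind, path) pairs, and
   the name is_allowed are hypothetical. The sketch compares paths as
   plain string prefixes and leaves out %xx handling. Skipping a rule
   with an empty path is an assumption, based on the common use of an
   empty "Disallow:" line to mean that nothing is restricted.

```python
def is_allowed(rules: list[tuple[str, str]], url_path: str) -> bool:
    """Apply the Allow/Disallow rules of one record, first match wins."""
    if url_path == "/robots.txt":
        return True  # the /robots.txt URL is always allowed
    for kind, path in rules:
        if path == "":
            continue  # assumption: an empty value restricts nothing
        if url_path.startswith(path):
            return kind == "allow"
    return True  # no rule matched: allowed by default
```

   With the rules of the example record in section 3.2, in their
   original order, "/tmp/ok.html" is allowed because the Allow line is
   reached first, while any other URL under "/tmp" is disallowed.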

   The matching process compares every octet in the path portion of
   the URL and the path from the record. If a %xx encoded octet is
   encountered it is unencoded prior to comparison, unless it is the
   "/" character, which has special meaning in a path. The match
   evaluates positively if and only if the end of the path from the
   record is reached before a difference in octets is encountered.

   This table illustrates some examples:
 
     Record Path        URL path         Matches
     /tmp               /tmp               yes
     /tmp               /tmp.html          yes
     /tmp               /tmp/a.html        yes
     /tmp/              /tmp               no
     /tmp/              /tmp/              yes
     /tmp/              /tmp/a.html        yes
    
     /a%3cd.html        /a%3cd.html        yes
     /a%3Cd.html        /a%3cd.html        yes
     /a%3cd.html        /a%3Cd.html        yes
     /a%3Cd.html        /a%3Cd.html        yes
    
     /a%2fb.html        /a%2fb.html        yes
     /a%2fb.html        /a/b.html          no
     /a/b.html          /a%2fb.html        no
     /a/b.html          /a/b.html          yes
    
     /%7ejoe/index.html /~joe/index.html   yes
     /~joe/index.html   /%7Ejoe/index.html yes
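
   The rows of this table can be reproduced with a short sketch. The
   helper names are hypothetical. Every %xx escape is decoded before
   the prefix comparison, except the encoded "/" (%2F). As an extra
   precaution that this memo does not ask for, the encoded "%" (%25)
   is also kept encoded, so that "%252F" cannot be mistaken for an
   encoded slash after decoding.

```python
import re

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def _normalize(path: str) -> str:
    """Decode %xx escapes, keeping %2F and %25 in upper-case form."""
    def decode(match: re.Match) -> str:
        octet = int(match.group(1), 16)
        if octet in (0x2F, 0x25):
            return "%%%02X" % octet  # leave "/" and "%" encoded
        return chr(octet)
    return _ESCAPE.sub(decode, path)


def path_matches(record_path: str, url_path: str) -> bool:
    """True if the record path is a prefix of the URL path."""
    return _normalize(url_path).startswith(_normalize(record_path))
```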
   
3.3 Formal Syntax

  This is a BNF-like description, using the conventions of RFC 822 [5],
  except that "|" is used to designate alternatives.  Briefly, literals
  are quoted with "", parentheses "(" and ")" are used to group
  elements, optional elements are enclosed in [brackets], and elements
  may be preceded with <n>* to designate n or more repetitions of the
  following element; n defaults to 0.

    robotstxt    = *blankcomment
                 | *blankcomment record *( 1*commentblank 1*record )
                   *blankcomment
    blankcomment = 1*(blank | commentline)
    commentblank = *commentline blank *(blankcomment)
    blank        = *space CRLF
    CRLF         = CR LF
    record       = *commentline agentline *(commentline | agentline)
                   1*ruleline *(commentline | ruleline)







    agentline    = "User-agent:" *space agent  [comment] CRLF
    ruleline     = (disallowline | allowline | extension)
    disallowline = "Disallow" ":" *space path [comment] CRLF
    allowline    = "Allow" ":" *space rpath [comment] CRLF
    extension    = token ":" *space value [comment] CRLF
    value        = <any CHAR except CR or LF or "#">

    commentline  = comment CRLF
    comment      = *blank "#" anychar
    space        = 1*(SP | HT)
    rpath        = "/" path
    agent        = token
    anychar      = <any CHAR except CR or LF>
    CHAR         = <any US-ASCII character (octets 0 - 127)>
    CTL          = <any US-ASCII control character
                        (octets 0 - 31) and DEL (127)>
    CR           = <US-ASCII CR, carriage return (13)>
    LF           = <US-ASCII LF, linefeed (10)>
    SP           = <US-ASCII SP, space (32)>
    HT           = <US-ASCII HT, horizontal-tab (9)>

   The syntax for "token" is taken from RFC 1945 [2], reproduced here for
   convenience:
  
    token        = 1*<any CHAR except CTLs or tspecials>

    tspecials    = "(" | ")" | "<" | ">" | "@"
                 | "," | ";" | ":" | "\" | <">
                 | "/" | "[" | "]" | "?" | "="
                 | "{" | "}" | SP | HT

  The syntax for "path" is defined in RFC 1808 [6], reproduced here for
  convenience:

    path        = fsegment *( "/" segment )
    fsegment    = 1*pchar
    segment     =  *pchar

    pchar       = uchar | ":" | "@" | "&" | "="
    uchar       = unreserved | escape
    unreserved  = alpha | digit | safe | extra

    escape      = "%" hex hex
    hex         = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                         "a" | "b" | "c" | "d" | "e" | "f"

    alpha       = lowalpha | hialpha





    lowalpha    = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
                  "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
                  "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
    hialpha     = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
                  "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
                  "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"

    digit       = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
                  "8" | "9"

    safe        = "$" | "-" | "_" | "." | "+"
    extra       = "!" | "*" | "'" | "(" | ")" | ","

                  
3.4 Expiration

   Robots should cache /robots.txt files, but if they do they must
   periodically verify the cached copy is fresh before using its
   contents.

   Standard HTTP cache-control mechanisms can be used by both origin
   server and robots to influence the caching of the /robots.txt file.
   Specifically, robots should take note of the Expires header set by
   the origin server.

   If no cache-control directives are present robots should default to
   an expiry of 7 days.
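
   The expiry rule can be sketched as below, looking only at the
   Expires header. The function name cache_expiry is hypothetical, and
   falling back to the 7-day default when the header cannot be parsed
   is an assumption rather than something this memo requires.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

DEFAULT_LIFETIME = timedelta(days=7)


def cache_expiry(fetched_at: datetime,
                 expires_header: Optional[str]) -> datetime:
    """Return the time after which a cached /robots.txt is stale."""
    if expires_header:
        try:
            expires = parsedate_to_datetime(expires_header)
        except (TypeError, ValueError):
            expires = None  # unparseable header: use the default below
        if expires is not None:
            if expires.tzinfo is None:
                # Treat a date without zone information as UTC.
                expires = expires.replace(tzinfo=timezone.utc)
            return expires
    return fetched_at + DEFAULT_LIFETIME
```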


4. Examples

   This section contains an example of how a /robots.txt may be used.

   A fictional site may have the following URLs:

     http://www.fict.org/
     http://www.fict.org/index.html
     http://www.fict.org/robots.txt
     http://www.fict.org/server.html
     http://www.fict.org/services/fast.html
     http://www.fict.org/services/slow.html
     http://www.fict.org/orgo.gif
     http://www.fict.org/org/about.html
     http://www.fict.org/org/plans.html
     http://www.fict.org/%7Ejim/jim.html
     http://www.fict.org/%7Emak/mak.html

   The site may in the /robots.txt have specific rules for robots that
   send an HTTP User-agent "UnhipBot/0.1", "WebCrawler/3.0", and
   "Excite/1.0", and a set of default rules:

      # /robots.txt for http://www.fict.org/
      # comments to webmaster@fict.org

      User-agent: unhipbot
      Disallow: /

      User-agent: webcrawler
      User-agent: excite
      Disallow:

      User-agent: *
      Disallow: /org/plans.html
      Allow: /org/
      Allow: /serv
      Allow: /~mak
      Disallow: /

   The following matrix shows which robots are allowed to access URLs:

                                               unhipbot webcrawler other
                                                        & excite
     http://www.fict.org/                         No       Yes       No
     http://www.fict.org/index.html               No       Yes       No
     http://www.fict.org/robots.txt               Yes      Yes       Yes
     http://www.fict.org/server.html              No       Yes       Yes
     http://www.fict.org/services/fast.html       No       Yes       Yes
     http://www.fict.org/services/slow.html       No       Yes       Yes
     http://www.fict.org/orgo.gif                 No       Yes       No
     http://www.fict.org/org/about.html           No       Yes       Yes
     http://www.fict.org/org/plans.html           No       Yes       No
     http://www.fict.org/%7Ejim/jim.html          No       Yes       No
     http://www.fict.org/%7Emak/mak.html          No       Yes       Yes


5. Notes for Implementors

5.1   Backwards Compatibility

   Previous versions of this specification didn't provide the Allow
   line. The introduction of the Allow line causes robots to behave
   slightly differently under either specification:
  
   If a /robots.txt contains an Allow which overrides a later occurring
   Disallow, a robot ignoring Allow lines will not retrieve those
   parts. This is considered acceptable because there is no requirement
   for a robot to access URLs it is allowed to retrieve, and it is safe,
   in that no URLs a Web site administrator wants to Disallow will be
   allowed. It is expected this may in fact encourage robots to upgrade
   compliance to the specification in this memo.



5.2   Interoperability

   Implementors should pay particular attention to the robustness in
   parsing of the /robots.txt file. Web site administrators who are not
   aware of the /robots.txt mechanisms often notice repeated failing
   requests for it in their log files, and react by putting up pages
   asking "What are you looking for?".

   As the majority of /robots.txt files are created with platform-
   specific text editors, robots should be liberal in accepting files
   with different end-of-line conventions, specifically CR and LF in
   addition to CRLF.
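
   One way to accept all three end-of-line conventions is to split on
   them explicitly, as in this sketch. The function name split_lines
   is hypothetical.

```python
import re

# CRLF is listed first so that it counts as one line ending, not two.
_EOL = re.compile(r"\r\n|\r|\n")


def split_lines(text: str) -> list[str]:
    """Split text on CRLF, CR, or LF, keeping blank lines in between."""
    lines = _EOL.split(text)
    if lines and lines[-1] == "":
        lines.pop()  # a trailing line ending does not start a new line
    return lines
```

   Blank lines inside the text are kept, since they separate records.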


6. Security Considerations

   There are a few risks in the method described here, which may affect
   either origin server or robot.

   Web site administrators must realise this method is voluntary, and
   is not sufficient to guarantee that robots will not visit restricted
   parts of the URL space. Failure to use proper authentication or other
   restriction may result in exposure of restricted information. It is
   even possible that the occurrence of paths in the /robots.txt file
   may expose the existence of resources not otherwise linked to on the
   site, which may aid people guessing for URLs.

   Robots need to be aware that the amount of resources spent on dealing
   with the /robots.txt is a function of the file contents, which is not
   under the control of the robot. For example, the contents may be
   larger in size than the robot can deal with. To prevent denial-of-
   service attacks, robots are therefore encouraged to place limits on
   the resources spent on processing of /robots.txt.
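
   A simple size limit can be sketched as follows. The 64 KiB cap and
   the names MAX_ROBOTS_BYTES and read_limited are hypothetical, and
   the sketch assumes a file-like object whose read(n) returns up to n
   bytes; a stream that returns short reads would need a loop.

```python
import io
from typing import BinaryIO

MAX_ROBOTS_BYTES = 64 * 1024  # hypothetical cap, not taken from the memo


def read_limited(stream: BinaryIO,
                 max_bytes: int = MAX_ROBOTS_BYTES) -> tuple[bytes, bool]:
    """Read at most max_bytes and report whether more data was available."""
    data = stream.read(max_bytes + 1)  # one extra byte reveals truncation
    truncated = len(data) > max_bytes
    return data[:max_bytes], truncated


# Usage with an in-memory stream standing in for a response body.
body, truncated = read_limited(
    io.BytesIO(b"User-agent: *\nDisallow: /\n"), max_bytes=16)
```

   What a robot does with a truncated file is a policy choice that
   this memo leaves open.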

   The /robots.txt directives are retrieved and applied in separate,
   possibly unauthenticated HTTP transactions, and it is possible that
   one server can impersonate another or otherwise intercept a
   /robots.txt, and provide a robot with false information. This
   specification does not preclude authentication and encryption
   from being employed to increase security.

7. Acknowledgements

   The author would like to thank the subscribers to the robots
   mailing list for their contributions to this specification.








8. References

   [1] Koster, M., "A Standard for Robot Exclusion",
       http://info.webcrawler.com/mak/projects/robots/norobots.html,
       June 1994.
  
   [2] Berners-Lee, T., Fielding, R., and Frystyk, H., "Hypertext
       Transfer Protocol -- HTTP/1.0", RFC 1945, MIT/LCS, May 1996.

   [3] Postel, J., "Media Type Registration Procedure", RFC 1590,
       USC/ISI, March 1994.

   [4] Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform
       Resource Locators (URL)", RFC 1738, CERN, Xerox PARC,
       University of Minnesota, December 1994.

   [5] Crocker, D., "Standard for the Format of ARPA Internet Text
       Messages", STD 11, RFC 822, UDEL, August 1982.

   [6] Fielding, R., "Relative Uniform Resource Locators", RFC 1808,
       UC Irvine, June 1995.

9. Author's Address

   Martijn Koster
   WebCrawler
   America Online
   690 Fifth Street
   San Francisco
   CA 94107
  
   Phone: 415-3565431
   EMail: m.koster@webcrawler.com

                                                    Expires June 4, 1997







Posted by robot at Thu 20 May 2010 Time 20:44

Surely listing sensitive files is asking for trouble?

Some people are concerned that listing pages or directories in the /robots.txt file may invite unintended access. There are two answers to this.

The first answer is a workaround: You could put all the files you don't want robots to visit in a separate subdirectory, make that directory un-listable on the web (by configuring your server), and list only the directory name in the /robots.txt. Now an ill-willed robot won't traverse that directory unless you or someone else puts a direct link on the web to one of your files, and then it's not the fault of /robots.txt.

For example, rather than:

User-Agent: *
Disallow: /foo.html
Disallow: /bar.html

do:

User-Agent: *
Disallow: /norobots/

and make a "norobots" directory, put foo.html and bar.html into it, and configure your server to not generate a directory listing for that directory. Now all an attacker would learn is that you have a "norobots" directory, but he won't be able to list the files in there; he'd need to guess their names.

However, in practice this is a bad idea -- it's too fragile. Someone may publish a link to your files on their site. Or it may turn up in a publicly accessible log file, say of your user's proxy server, or maybe it will show up in someone's web server log as a Referer. Or someone may misconfigure your server at some future date, "fixing" it to show a directory listing. Which leads me to the real answer:

The real answer is that /robots.txt is not intended for access control, so don't try to use it as such. Think of it as a "No Entry" sign, not a locked door. If you have files on your web site that you don't want unauthorized people to access, then configure your server to do authentication, and configure appropriate authorization. Basic Authentication has been around since the early days of the web (and in e.g. Apache on UNIX is trivial to configure). Modern content management systems support access controls on individual pages and collections of resources.



Posted by robot at Thu 20 May 2010 Time 20:41

Why did this robot ignore my /robots.txt?

It could be that it was written by an inexperienced software writer. Occasionally schools set their students "write a web robot" assignments. But these days it's more likely that the robot is explicitly written to scan your site for information to abuse: it might be collecting email addresses to send email spam, looking for forms to post links ("spamdexing"), or looking for security holes to exploit.



Posted by robot at Thu 20 May 2010 Time 20:40

Can I block just bad robots?

In theory yes, in practice no. If the bad robot obeys /robots.txt, and you know the name it scans for in the User-Agent field, then you can create a section in your /robots.txt to exclude it specifically. But almost all bad robots ignore /robots.txt, making that pointless.

If the bad robot operates from a single IP address, you can block its access to your web server through server configuration or with a network firewall.

If copies of the robot operate at lots of different IP addresses, such as hijacked PCs that are part of a large botnet, then it becomes more difficult. The best option then is to use advanced firewall rules that automatically block access to IP addresses that make many connections; but that can hit good robots as well as your bad robots.



Posted by robot at Thu 20 May 2010 Time 20:38

About /robots.txt

In a nutshell

Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

There are two important considerations when using /robots.txt:

  • robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention.
  • the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use.

So don't try to use /robots.txt to hide information.


The details

The /robots.txt is a de-facto standard, and is not owned by any standards body. There are two historical descriptions: the original 1994 document "A Standard for Robot Exclusion", and the later Internet Draft specification "A Method for Web Robots Control".

The /robots.txt standard is not actively developed. See What about further development of /robots.txt? for more discussion.

The rest of this page gives an overview of how to use /robots.txt on your server, with some simple recipes. To learn more see also the FAQ.

How to create a /robots.txt file

Where to put it

The short answer: in the top-level directory of your web server.

The longer answer:

When a robot looks for the "/robots.txt" file for a URL, it strips the path component from the URL (everything from the first single slash), and puts "/robots.txt" in its place.

For example, for "http://www.example.com/shop/index.html", it will remove the "/shop/index.html", and replace it with "/robots.txt", and will end up with "http://www.example.com/robots.txt".

So, as a web site owner you need to put it in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your web site's main "index.html" welcome page. Where exactly that is, and how to put the file there, depends on your web server software.

Remember to use all lower case for the filename: "robots.txt", not "Robots.TXT".


What to put in it

The "/robots.txt" file is a text file, with one or more records. Usually it contains a single record looking like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

In this example, three directories are excluded.

Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/" on a single line. Also, you may not have blank lines in a record, as they are used to delimit multiple records.

Note also that globbing and regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".

What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:

To exclude all robots from the entire server
User-agent: *
Disallow: /

To allow all robots complete access
User-agent: *
Disallow:

(or just create an empty "/robots.txt" file, or don't use one at all)

To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

To exclude a single robot
User-agent: BadBot
Disallow: /

To allow a single robot
User-agent: Google
Disallow:

User-agent: *
Disallow: /

To exclude all files except one
This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:
User-agent: *
Disallow: /~joe/stuff/

Alternatively you can explicitly disallow all disallowed pages:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
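
One way to check a recipe before publishing it is to feed it to a parser and ask what it allows. The sketch below uses the urllib.robotparser module from Python's standard library on the "allow a single robot" recipe. The URL and the robot name "OtherBot" are made up for illustration, and this parser's behaviour may differ in details from any particular robot's.

```python
from urllib.robotparser import RobotFileParser

# The "allow a single robot" recipe, one list item per line.
lines = [
    "User-agent: Google",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(lines)

# Ask the parser what each robot may fetch.
google_ok = parser.can_fetch("Google", "http://www.example.com/index.html")
other_ok = parser.can_fetch("OtherBot", "http://www.example.com/index.html")
```

Here google_ok should come out true and other_ok false, matching the intent of the recipe.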


Posted by robot at Thu 20 May 2010 Time 20:36

A Standard for Robot Exclusion


Status of this document

This document represents a consensus on 30 June 1994 on the robots mailing list (robots-request@nexor.co.uk), between the majority of robot authors and other people with an interest in robots. It has also been open for discussion on the Technical World Wide Web mailing list (www-talk@info.cern.ch). This document is based on a previous working draft under the same title.

It is not an official standard backed by a standards body, or owned by any commercial organisation. It is not enforced by anybody, and there is no guarantee that all current and future robots will use it. Consider it a common facility the majority of robot authors offer the WWW community to protect WWW servers against unwanted accesses by their robots.

 

The latest version of this document can be found on http://www.robotstxt.org/wc/robots.html.

 

Introduction

WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages. For more information see the robots page.

In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).

 

These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.

 

The Method

The method used to exclude robots from a server is to create a file on the server which specifies an access policy for robots. This file must be accessible via HTTP on the local URL "/robots.txt". The contents of this file are specified below.

This approach was chosen because it can be easily implemented on any existing WWW server, and a robot can find the access policy with only a single document retrieval.
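
As an illustration of that single retrieval, a robot can derive the location of the policy file from any URL on the server. The Python sketch below is not part of this standard; the function name robots_url is hypothetical, and it assumes the standard library's urllib.parse is acceptable for URL handling.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the "/robots.txt" URL for the server that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host[:port]; replace the path and drop
    # any query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

For example, every page under http://www.example.com/ maps to http://www.example.com/robots.txt, so the file needs to be fetched only once per server.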

 

A possible drawback of this single-file approach is that only a server administrator can maintain such a list, not the individual document maintainers on the server. This can be resolved by a local process to construct the single file from a number of others, but if, or how, this is done is outside of the scope of this document.

 

The choice of the URL was motivated by several criteria:

 

  • The filename should fit in file naming restrictions of all common operating systems.
  • The filename extension should not require extra server configuration.
  • The filename should indicate the purpose of the file and be easy to remember.
  • The likelihood of a clash with existing files should be minimal.

The Format

The format and semantics of the "/robots.txt" file are as follows:

The file consists of one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL). Each record contains lines of the form "<field>:<optionalspace><value><optionalspace>". The field name is case insensitive.

 

Comments can be included in the file using UNIX Bourne shell conventions: the '#' character indicates that the preceding space (if any) and the remainder of the line up to the line termination are discarded. Lines containing only a comment are discarded completely, and therefore do not indicate a record boundary.
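
The line-level rules can be sketched in Python as follows. This is an illustrative reading of the text, not reference code: the name classify_line and the labels "blank", "comment", "invalid" and "field" are assumptions introduced here.

```python
from typing import Optional, Tuple


def classify_line(line: str) -> Tuple[str, Optional[Tuple[str, str]]]:
    """Classify one line of a "/robots.txt" file.

    Returns ("blank", None) for a record separator, ("comment", None) for
    a comment-only line, ("invalid", None) for a line without a colon, or
    ("field", (name, value)) with the field name lower-cased.
    """
    if not line.strip():
        return ("blank", None)      # blank lines separate records
    # Discard the comment together with any space that precedes it.
    content = line.split("#", 1)[0].strip()
    if not content:
        return ("comment", None)    # comment-only: not a record boundary
    name, colon, value = content.partition(":")
    if not colon:
        return ("invalid", None)
    # Field names are case insensitive; optional space around the value
    # is dropped.
    return ("field", (name.strip().lower(), value.strip()))
```

Keeping "blank" and "comment" as separate results matters, because only a blank line ends the current record.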

 

The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognised headers are ignored.
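
A minimal Python sketch of grouping lines into records is given below. It is one possible reading of the rules above, not a definitive parser: the name parse_records and the tuple layout are assumptions, and it relies on blank lines alone to separate records.

```python
from typing import List, Tuple

# (user-agent names, disallow prefixes) for one record
Record = Tuple[List[str], List[str]]


def parse_records(text: str) -> List[Record]:
    """Split the contents of a "/robots.txt" file into records."""
    records: List[Record] = []
    agents: List[str] = []
    disallows: List[str] = []

    def flush() -> None:
        nonlocal agents, disallows
        if agents:
            records.append((agents, disallows))
        agents, disallows = [], []

    # splitlines() accepts CR, CR/NL and NL line terminators.
    for raw in text.splitlines():
        if not raw.strip():
            flush()                 # a blank line ends the current record
            continue
        content = raw.split("#", 1)[0].strip()
        if not content:
            continue                # comment-only line: no record boundary
        name, colon, value = content.partition(":")
        if not colon:
            continue
        name = name.strip().lower()
        if name == "user-agent":
            agents.append(value.strip())
        elif name == "disallow":
            disallows.append(value.strip())
        # Any other (unrecognised) header is ignored.
    flush()
    return records
```

An empty Disallow value is kept as an empty string so that later code can tell "allow everything" apart from a record with no Disallow lines at all.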

 

User-agent
The value of this field is the name of the robot the record is describing access policy for.

If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record.

 

The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.
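
The recommended match might look like the Python sketch below. The function name agent_matches is hypothetical, and two details are assumptions rather than statements of this document: version information is taken to be everything from the first "/" in the robot's name, and the field value is looked for inside the robot's name rather than the other way round.

```python
def agent_matches(field_value: str, robot_name: str) -> bool:
    """Case-insensitive substring match, ignoring version information."""
    # "CyberMapper/1.2" -> "cybermapper" (assumed version syntax).
    name = robot_name.split("/", 1)[0].strip().lower()
    token = field_value.strip().lower()
    # An empty field value never matches a robot.
    return bool(token) and token in name
```

The special value '*' is not handled here; it is a default rather than a name to match.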

 

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.
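
Choosing the record that applies to a robot can be sketched in Python as follows. This is an illustration only: the name select_record and the (agents, disallows) record layout are assumptions, as is the direction of the substring match.

```python
from typing import List, Optional, Sequence, Tuple


def select_record(
    records: Sequence[Tuple[List[str], List[str]]], robot_name: str
) -> Optional[List[str]]:
    """Return the Disallow values that apply to robot_name.

    Each record is (user-agent values, disallow values). A record naming
    the robot wins over the '*' record; None means no record applies.
    """
    name = robot_name.split("/", 1)[0].strip().lower()
    default: Optional[List[str]] = None
    for agents, disallows in records:
        for agent in agents:
            token = agent.strip().lower()
            if token == "*":
                if default is None:
                    default = disallows   # remember the default policy
            elif token and token in name:
                return disallows          # a specific record matched
    return default
```

A result of None can be treated the same way as a missing file: the robot may consider itself welcome.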

 

 

 

 

Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.

An empty value indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
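
The Disallow rule is a plain prefix test, which the Python sketch below illustrates. The function name is_allowed is hypothetical; the paths in the usage note are the ones used in the description above.

```python
from typing import Iterable


def is_allowed(path: str, disallows: Iterable[str]) -> bool:
    """Return True if no Disallow value is a prefix of path."""
    for prefix in disallows:
        # An empty value disallows nothing, so it is skipped.
        if prefix and path.startswith(prefix):
            return False
    return True
```

With this, is_allowed("/help.html", ["/help"]) is False, while is_allowed("/help.html", ["/help/"]) is True, and a record whose only value is empty allows every path.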

 

 

The presence of an empty "/robots.txt" file has no explicit associated semantics; it will be treated as if it were not present, i.e. all robots will consider themselves welcome.

Examples

The following example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/" or "/tmp/", or "/foo.html":
# robots.txt for http://www.example.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html
This example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/", except the robot called "cybermapper":
# robots.txt for http://www.example.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
This example indicates that no robots should visit this site further:
# go away
User-agent: *
Disallow: /
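
The examples can be checked with an existing parser. The Python sketch below uses urllib.robotparser from the Python standard library, which is not mentioned in this document and may differ from it in details; the helper name may_fetch is hypothetical, and the file contents are the "cybermapper" example above.

```python
from urllib.robotparser import RobotFileParser

EXAMPLE = """\
# robots.txt for http://www.example.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
"""


def may_fetch(robots_txt: str, robot_name: str, url: str) -> bool:
    """Parse robots_txt and ask whether robot_name may retrieve url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(robot_name, url)
```

Here may_fetch(EXAMPLE, "cybermapper", "http://www.example.com/cyberworld/map/index.html") should be allowed, while the same URL should be refused for a robot with any other name.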

Example Code

Although it is not part of this specification, some example code in Perl is available in norobots.pl. It is a bit more flexible in its parsing than this document specifies, and is provided as-is, without warranty.

 

Note: This code is no longer available. Instead I recommend using the robots exclusion code in the Perl libwww-perl5 library, available from CPAN in the LWP directory.

Author's Address

Martijn Koster

 


