Robots Exclusion Standard

From LinuxReviews

The Robots Exclusion Standard is an unofficial standard which is followed by all polite web crawler software. A robots.txt file instructs web crawlers on how to behave when visiting a web server.

How to instruct crawlers

Create a file named robots.txt in your website's root, so that it is served at domain.tld/robots.txt.
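
Because the file always lives at the root, a crawler can derive its address from any page URL on the site. The following is a minimal sketch of that derivation using Python's standard library; the function name robots_url is made up for illustration, and domain.tld is the placeholder host used above.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://domain.tld/directory/file1.html?x=1"))
# prints http://domain.tld/robots.txt
```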

The two basic instructions are "User-agent" and "Disallow". The following allows every crawler to access everything:

User-agent: *
Disallow:

Disallowing nothing means allow everything. Disallowing / will disallow your whole domain:

User-agent: *
Disallow: /
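
To check how a crawler would read these two files, one option is Python's standard urllib.robotparser module. This is only a sketch: the helper build_parser, the agent name ExampleBot and the page URL are made up for illustration, and the rules are parsed from strings rather than fetched from a server.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content given as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


allow_all = build_parser("User-agent: *\nDisallow:\n")
deny_all = build_parser("User-agent: *\nDisallow: /\n")

page = "http://domain.tld/any/page.html"
print(allow_all.can_fetch("ExampleBot", page))  # True: nothing is disallowed
print(deny_all.can_fetch("ExampleBot", page))   # False: everything under / is disallowed
```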

These are the basic instructions. It is possible to disallow many files and folders by listing one "Disallow" line for each. It is also possible to have many User-agent/Disallow records in order to instruct crawlers differently. The original standard separates records with a blank line:

User-agent: NameOfBotWeDislike
Disallow: /

User-agent: CatchBadBots
Disallow: /trap/

User-agent: *
Disallow: /directory/file1.html
Disallow: /directory/file2.html
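
The effect of several records can be checked per crawler. The sketch below feeds the rules above to Python's standard urllib.robotparser and asks what each agent may fetch; SomeOtherBot and the URLs are invented for the example, and other parsers may match agent names slightly differently.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: NameOfBotWeDislike
Disallow: /

User-agent: CatchBadBots
Disallow: /trap/

User-agent: *
Disallow: /directory/file1.html
Disallow: /directory/file2.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "http://domain.tld"
# The disliked bot is shut out of the whole site.
print(parser.can_fetch("NameOfBotWeDislike", base + "/index.html"))      # False
# CatchBadBots is only kept out of /trap/.
print(parser.can_fetch("CatchBadBots", base + "/trap/"))                 # False
print(parser.can_fetch("CatchBadBots", base + "/index.html"))            # True
# Every other crawler falls through to the "*" record.
print(parser.can_fetch("SomeOtherBot", base + "/directory/file1.html"))  # False
print(parser.can_fetch("SomeOtherBot", base + "/directory/file3.html"))  # True
```

Note that a crawler uses only the record that matches it: SomeOtherBot is not affected by the /trap/ rule, because that rule belongs to the CatchBadBots record.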

Respected by some, not by others

The two basic instructions mentioned above are followed by all "polite" crawler software.

Some crawlers will also follow "Crawl-delay", given in seconds:

User-agent: *
Disallow: /trap/
Crawl-delay: 10 # Wait at least 10 seconds between crawls
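
A crawler written in Python could read this value with urllib.robotparser, which exposes it through crawl_delay() (Python 3.6 or later). The sketch below is one possible way to use it; the helper polite_delay, the agent name ExampleBot and the one-second fallback are assumptions made for the example, not part of the standard.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
Crawl-delay: 10 # Wait at least 10 seconds between crawls
"""


def polite_delay(parser: RobotFileParser, agent: str, default: float = 1.0) -> float:
    """Seconds to wait between requests; falls back to an assumed default."""
    delay = parser.crawl_delay(agent)  # None when no Crawl-delay applies
    return default if delay is None else float(delay)


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(polite_delay(parser, "ExampleBot"))  # 10.0
```

A crawler would then sleep for that many seconds between two requests to the same server, for example with time.sleep().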

Some also follow "Request-rate" (pages per interval, with the interval in seconds) and "Visit-time". Visit-time is read as GMT.

User-agent: *
Disallow: /trap/
Request-rate: 1/5         # maximum rate is one page every 5 seconds
Visit-time: 0600-0845     # only visit between 6:00 AM and 8:45 AM UT (GMT)
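
As a rough sketch of how a crawler might honor these two lines in Python: urllib.robotparser reads Request-rate through request_rate() (Python 3.6 or later) but ignores Visit-time, so the window is parsed by hand here. The helpers parse_visit_time and in_window are made up for illustration, assume the HHMM-HHMM format shown above, take the first Visit-time line regardless of which record it is in, and do not handle windows that wrap past midnight.

```python
from datetime import time
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
Request-rate: 1/5         # maximum rate is one page every 5 seconds
Visit-time: 0600-0845     # only visit between 6:00 AM and 8:45 AM UT (GMT)
"""


def parse_visit_time(robots_txt):
    """Return (start, end) from the first Visit-time line, or None."""
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        name, _, value = line.partition(":")
        if name.strip().lower() == "visit-time":
            start, _, end = value.strip().partition("-")
            return (time(int(start[:2]), int(start[2:])),
                    time(int(end[:2]), int(end[2:])))
    return None


def in_window(window, now):
    """True if the GMT time 'now' lies inside the (start, end) window."""
    start, end = window
    return start <= now <= end


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

rate = parser.request_rate("ExampleBot")
print(rate.requests, rate.seconds)       # 1 5

window = parse_visit_time(ROBOTS_TXT)
print(in_window(window, time(7, 30)))    # True
print(in_window(window, time(9, 0)))     # False
```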

More information

Examples:
