Robots.txt: Difference between revisions

From Meta, a Wikimedia project coordination wiki
{{MovedToMediaWiki}}
<- [[MediaWiki architecture]] < [[Apache config]]
Imported with full history. [[User:IAlex|<b><font color="#66A7CC">i</font><font color="#9966CC">Alex</font></b>]] 13:01, 8 November 2007 (UTC)

The [[en:Robots Exclusion Standard|Robots Exclusion Standard]] lets a site advise [[en:web robot|web robot]]s through the file <nowiki>{{SERVER}}/robots.txt</nowiki>, e.g. for this project {{SERVER}}/robots.txt.

== Nice robot ==

In your robots.txt file, you would be wise to deny access to the script directory, and with it diffs, old revisions, contribution lists, and similar dynamically generated pages, which could otherwise severely raise the load on the server.

If you are not using URL rewriting, this can be difficult to do cleanly. With a setup like Wikipedia's, where plain pages are reached via /wiki/Some_title and everything else via /w/wiki.phtml?title=Some_title&someoption=blah, it's easy:

 User-agent: *
 Disallow: /w/

Be careful, though! If you accidentally write this line instead:

 Disallow: /w

you'll also block access to the '''/w'''iki directory, because Disallow rules match by path prefix and /wiki/ begins with /w. Search engines will then drop your wiki.
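
The difference between the two rules can be checked before deploying them. The following is a minimal sketch using Python's standard <code>urllib.robotparser</code> module; the helper name <code>allowed</code> and the agent name <code>ExampleBot</code> are made up for illustration, and real crawlers may apply their own matching rules.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` under the given robots.txt text.

    `agent` is a hypothetical crawler name; both rule sets below use '*'.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# The intended rule (trailing slash) and the accidental one (no slash).
GOOD = "User-agent: *\nDisallow: /w/\n"
BAD = "User-agent: *\nDisallow: /w\n"

PAGE = "/wiki/Some_title"
SCRIPT = "/w/wiki.phtml?title=Some_title&someoption=blah"

if __name__ == "__main__":
    for label, rules in (("Disallow: /w/", GOOD), ("Disallow: /w", BAD)):
        print(label,
              "| page allowed:", allowed(rules, PAGE),
              "| script allowed:", allowed(rules, SCRIPT))
```

With the trailing slash only the script path is refused; without it, the plain page path is refused as well.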

== Problems ==

Unfortunately, there are three big problems with robots.txt:

=== Rate control ===

robots.txt only lets you specify which ''paths'' a bot is allowed to spider, not how fast it may request them. Even allowing just the plain page area can be a huge burden when a single spider requests two or three pages per second across two hundred thousand pages.

Some bots support a nonstandard extension for this; Inktomi responds to a "[http://mail.wikipedia.org/pipermail/wikitech-l/2003-August/005712.html Crawl-delay]" line, which specifies the minimum delay in seconds between hits. (Their default is 15 seconds.)
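
From the crawler's side, honoring such a line might look like the sketch below. It reads Crawl-delay with Python's standard <code>urllib.robotparser</code>. The agent name <code>ExampleBot</code>, the 30-second value, and the helper <code>crawl_delay_for</code> are assumptions for illustration; the 15-second fallback only mirrors the default mentioned above and is not taken from any crawler's code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one bot gets an explicit delay, others do not.
ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 30

User-agent: *
Disallow: /w/
"""


def crawl_delay_for(robots_txt, agent, default=15):
    """Return the delay in seconds `agent` should wait between requests.

    Falls back to `default` (an assumed value) when robots.txt gives no
    Crawl-delay for this agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(agent)
    return default if delay is None else delay


if __name__ == "__main__":
    print(crawl_delay_for(ROBOTS_TXT, "ExampleBot"))
    print(crawl_delay_for(ROBOTS_TXT, "SomeOtherBot"))
```

A polite crawler would then sleep for at least that many seconds between successive requests to the same site.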

Bots that don't behave well by default could be forced into line with some sort of [[request throttling]].
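
As a rough illustration of what server-side throttling could mean, the sketch below enforces a minimum interval between requests from the same client. It is not MediaWiki's mechanism; the class name <code>MinIntervalThrottle</code>, the idea of keying clients by an arbitrary string (an IP or user-agent, say), and the in-memory dictionary are all assumptions made for the example.

```python
import time


class MinIntervalThrottle:
    """Allow at most one request per `min_interval` seconds per client."""

    def __init__(self, min_interval, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}         # client key -> time of last allowed request

    def allow(self, client):
        """Return True if the request may proceed, False if it is too soon."""
        now = self.clock()
        last = self.last_seen.get(client)
        if last is not None and now - last < self.min_interval:
            return False
        self.last_seen[client] = now
        return True


if __name__ == "__main__":
    throttle = MinIntervalThrottle(min_interval=1.0)
    print(throttle.allow("192.0.2.10"))   # first request from this client
    print(throttle.allow("192.0.2.10"))   # immediate repeat from same client
```

A refused request would typically be answered with an error status instead of the page; a production setup would also need to expire old entries and share state between server processes.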

=== Don't ''index'' vs don't ''spider'' ===

Most search engine spiders treat a match on a robots.txt 'Disallow' entry as meaning that they should not return that URL in search results. [[en:Google|Google]] is a rare exception; its behavior is ''technically'' within the specification but is very annoying: it will still index such URLs and may return them in search results, though without the content or title of the page, showing nothing other than the URL.

This means that "edit" URLs will sometimes turn up in Google results, which is very, very annoying.

The only way to keep a URL out of Google's index is to ''let'' Google slurp the page and see a robots meta tag with a noindex directive, i.e. <nowiki><meta name="robots" content="noindex"></nowiki>. With our current system, this would be difficult to special-case.

:As nonexistent articles mostly bring up an edit page, can we not just set that robots="noindex" meta tag ''on the edit page HTML template?'' This way, the meta tag would be there on all edit pages, so none of them will get indexed. [[User:Ropers|Ropers]] 18:15, 28 Aug 2004 (UTC)

::We already do. The issue discussed above is that Google returns search results including URLs that are forbidden by robots.txt. Because they are forbidden by robots.txt, Google does not spider the pages and does not see the meta tag. --[[User:Brion VIBBER|Brion VIBBER]] 21:19, 28 Aug 2004 (UTC)

:::Ah. I misunderstood earlier. But then, can we not just do away with any mention of edit pages in robots.txt (which is what I think was proposed above by "letting Google slurp the page")? [[User:Ropers|Ropers]] 21:30, 28 Aug 2004 (UTC)
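
To verify that a page which spiders ''are'' allowed to fetch really carries the noindex directive, something like the sketch below could be used. It relies only on Python's standard <code>html.parser</code>; the class <code>RobotsMetaParser</code>, the function <code>has_noindex</code>, and the sample HTML are invented for illustration and do not reflect how any search engine actually parses pages.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Record whether a <meta name="robots"> tag contains 'noindex'."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        directives = {d.strip().lower() for d in attr.get("content", "").split(",")}
        if "noindex" in directives:
            self.noindex = True


def has_noindex(html):
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


if __name__ == "__main__":
    edit_page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
    print(has_noindex(edit_page))
```

The tag only helps if the page itself is not disallowed in robots.txt, since a spider that never fetches the page never sees it.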

=== Evil bots ===

Sometimes a custom-written bot isn't very smart or is outright malicious and doesn't obey robots.txt at all (or obeys the path restrictions but spiders very fast, bogging down the site). It may be necessary to block specific user-agent strings or individual IPs of offenders.
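
Such blocking is usually done in the web server configuration, but the logic is simple enough to sketch. In the example below the user-agent substrings and the address ranges (taken from the reserved documentation ranges) are placeholders, and <code>is_blocked</code> is a hypothetical helper, not part of any existing software.

```python
import ipaddress

# Placeholder values for illustration only; replace with real offenders.
BLOCKED_AGENT_SUBSTRINGS = ("badbot", "evilspider")
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),      # a whole range
    ipaddress.ip_network("198.51.100.7/32"),   # a single address
]


def is_blocked(user_agent, remote_addr):
    """Return True if the request matches a blocked user-agent or IP."""
    agent = user_agent.lower()
    if any(fragment in agent for fragment in BLOCKED_AGENT_SUBSTRINGS):
        return True
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    print(is_blocked("BadBot/1.0", "203.0.113.5"))
    print(is_blocked("Mozilla/5.0", "192.0.2.44"))
    print(is_blocked("Mozilla/5.0", "203.0.113.5"))
```

User-agent strings are trivially forged, so matching on them only stops bots that identify themselves honestly; IP blocks are harder to evade but can catch innocent users behind the same address.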

Consider also [[request throttling]].

Next page: [[Rewrite Rules]] >

Latest revision as of 06:02, 22 February 2022
