Robots.txt

From Meta, a Wikimedia project coordination wiki



Revision as of 08:14, 12 April 2005


The Robots Exclusion Standard allows advising web robots by means of the file {{SERVER}}/robots.txt, e.g. for this project //meta.wikimedia.org/robots.txt.

Nice robot

In your robots.txt file, you would be wise to deny access to the script directory, and thus to diffs, old revisions, contribution lists, and so on, all of which could severely raise the load on the server.

If you are not using URL rewriting, this can be difficult to do cleanly. With a setup like Wikipedia's, where plain pages are reached via /wiki/Some_title and everything else via /w/wiki.phtml?title=Some_title&someoption=blah, it's easy:

 User-agent: *
 Disallow: /w/

Be careful, though! If you put in this line by accident:

 Disallow: /w

you'll also block access to the /wiki directory, because robots.txt 'Disallow' entries are plain prefix matches, and search engines will drop your wiki.
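The prefix-match behavior can be checked with Python's standard urllib.robotparser module; this is just an illustration, with the bot name and host being placeholders:

```python
from urllib import robotparser

def parser_for(lines):
    """Build a RobotFileParser from in-memory robots.txt lines."""
    rp = robotparser.RobotFileParser()
    rp.parse(lines)
    return rp

# Correct rule: the trailing slash limits the match to the /w/ directory.
good = parser_for(["User-agent: *", "Disallow: /w/"])
print(good.can_fetch("TestBot", "https://example.org/wiki/Main_Page"))  # True
print(good.can_fetch("TestBot", "https://example.org/w/wiki.phtml"))    # False

# Accidental rule: "/w" is a bare prefix, so it also matches /wiki/...
bad = parser_for(["User-agent: *", "Disallow: /w"])
print(bad.can_fetch("TestBot", "https://example.org/wiki/Main_Page"))   # False
```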

Problems

Unfortunately, there are three big problems with robots.txt:

Rate control

You can only specify which paths a bot is allowed to spider. Even allowing only the plain page area can be a huge burden when a single spider is requesting two or three pages per second across two hundred thousand pages.

Some bots support a custom extension for this; Inktomi responds to a "Crawl-delay" line, which specifies the minimum delay in seconds between hits. (Their default is 15 seconds.)
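A robots.txt entry using it might look like this ("Slurp" is Inktomi's crawler user-agent; the delay value is illustrative):

 User-agent: Slurp
 Crawl-delay: 10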

Bots that don't behave well by default could be forced into line with some sort of request throttling.

Don't index vs don't spider

Most search engine spiders treat a match on a robots.txt 'Disallow' entry as meaning they should not return that URL in search results. Google is a rare exception; its behavior is technically within the spec but very annoying: it will index such URLs and may return them in search results, albeit showing nothing but the URL, with no content or title of the page.

This means that sometimes "edit" URLs will turn up in Google results, which is very VERY annoying.

The only way to keep a URL out of Google's index is to let Google fetch the page and see a robots meta tag specifying "noindex". Although this meta tag is already present in the edit page HTML template, Google does not spider the edit pages (because they are forbidden by robots.txt) and therefore never sees the tag.
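The tag in question is a standard robots meta tag in the page's head; a minimal example (templates often add "nofollow" as well, which tells the spider not to follow links on the page):

 <meta name="robots" content="noindex,nofollow">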

With our current system, this would be difficult to special case. It would be technically possible to exclude the edit pages from the disallow line in robots.txt, but this would require reworking some functions.

Evil bots

Sometimes a custom-written bot isn't very smart or is outright malicious and doesn't obey robots.txt at all (or obeys the path restrictions but spiders very fast, bogging down the site). It may be necessary to block specific user-agent strings or individual IPs of offenders.
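With Apache, for example, a specific user-agent string can be blocked using mod_setenvif together with the host-based access directives; a sketch, where "EvilBot" is a placeholder for the offending string:

 BrowserMatchNoCase "EvilBot" bad_bot
 Order Allow,Deny
 Allow from all
 Deny from env=bad_bot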

Consider also request throttling.
