Posted to dev@nutch.apache.org by "Rod Taylor (JIRA)" <ji...@apache.org> on 2005/12/03 20:34:30 UTC

[jira] Commented: (NUTCH-98) RobotRulesParser interprets robots.txt incorrectly

    [ http://issues.apache.org/jira/browse/NUTCH-98?page=comments#action_12359237 ] 

Rod Taylor commented on NUTCH-98:
---------------------------------

According to the Googlebot FAQ, Google's implementation obeys the longest matching rule rather than the first one.

See point 7 of http://www.google.com/webmasters/bot.html, which reads:

Also, there's a small difference between the way Googlebot handles the robots.txt file and the way the robots.txt standard says we should (keeping in mind the distinction between "should" and "must"). The standard says we should obey the first applicable rule, whereas Googlebot obeys the longest (that is, the most specific) applicable rule. This more intuitive practice matches what people actually do, and what they expect us to do. For example, consider the following robots.txt file:

User-Agent: *
Allow: /
Disallow: /cgi-bin 
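
To make the difference concrete, here is a minimal Java sketch of the two policies applied to the file above. It is not the Nutch RobotRulesParser code or the attached patch; the class RobotRulesSketch, the Rule type, and both method names are hypothetical, and it assumes plain prefix matching with no wildcards, no per-agent sections, and no special handling of an empty Disallow line.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the actual Nutch RobotRulesParser code. */
final class RobotRulesSketch {

    /** One Allow/Disallow line: a path prefix plus whether it permits access. */
    static final class Rule {
        final String prefix;
        final boolean allow;

        Rule(String prefix, boolean allow) {
            this.prefix = prefix;
            this.allow = allow;
        }
    }

    /** First-match policy: the first rule whose prefix matches decides. */
    static boolean isAllowedFirstMatch(List<Rule> rules, String path) {
        for (Rule r : rules) {
            if (path.startsWith(r.prefix)) {
                return r.allow;
            }
        }
        return true; // no rule applies, so the path is allowed
    }

    /** Longest-match policy: the matching rule with the longest prefix decides. */
    static boolean isAllowedLongestMatch(List<Rule> rules, String path) {
        Rule best = null;
        for (Rule r : rules) {
            // Strict '>' means the earlier rule wins when two prefixes tie in length.
            if (path.startsWith(r.prefix)
                    && (best == null || r.prefix.length() > best.prefix.length())) {
                best = r;
            }
        }
        return best == null || best.allow;
    }

    public static void main(String[] args) {
        // The robots.txt example above: Allow: / followed by Disallow: /cgi-bin
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule("/", true));
        rules.add(new Rule("/cgi-bin", false));

        System.out.println(isAllowedFirstMatch(rules, "/cgi-bin/search"));   // true
        System.out.println(isAllowedLongestMatch(rules, "/cgi-bin/search")); // false
    }
}
```

Under the first-match policy the leading "Allow: /" matches every path, so /cgi-bin is never blocked. Under the longest-match policy "/cgi-bin" is the more specific rule and wins. The same sketch, fed the rules from the issue description (Disallow: / then Allow: /rss), allows /rss only under the longest-match policy.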

> RobotRulesParser interprets robots.txt incorrectly
> --------------------------------------------------
>
>          Key: NUTCH-98
>          URL: http://issues.apache.org/jira/browse/NUTCH-98
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.7
>     Reporter: Jeff Bowden
>     Priority: Minor
>  Attachments: RobotRulesParser.java.diff
>
> Here's a simple example that the current RobotRulesParser gets wrong:
> User-agent: *
> Disallow: /
> Allow: /rss
> The problem is that the isAllowed function takes the first rule that matches and incorrectly decides that URLs starting with "/rss" are Disallowed.  The correct algorithm is to take the *longest* rule that matches.  I will attach a patch that fixes this.
