Posted to dev@nutch.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2012/08/15 15:58:38 UTC

[jira] [Commented] (NUTCH-1455) RobotRulesParser to match multi-word user-agent names

    [ https://issues.apache.org/jira/browse/NUTCH-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435123#comment-13435123 ] 

Ken Krugler commented on NUTCH-1455:
------------------------------------

I added a test to crawler-commons to confirm that its robots.txt parser handles this correctly :)
                
> RobotRulesParser to match multi-word user-agent names
> -----------------------------------------------------
>
>                 Key: NUTCH-1455
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1455
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 1.5.1
>            Reporter: Sebastian Nagel
>             Fix For: 1.6
>
>
> If a user-agent name configured in http.robots.agents contains spaces, it is not matched even if it is contained exactly in the robots.txt:
> http.robots.agents = "Download Ninja,*"
> If the robots.txt (http://en.wikipedia.org/robots.txt) contains
> {code}
> User-agent: Download Ninja
> Disallow: /
> {code}
> all content should be forbidden. But it isn't:
> {code}
> % curl 'http://en.wikipedia.org/robots.txt' > robots.txt
> % grep -A1 -i ninja robots.txt 
> User-agent: Download Ninja
> Disallow: /
> % cat test.urls
> http://en.wikipedia.org/
> % bin/nutch plugin lib-http org.apache.nutch.protocol.http.api.RobotRulesParser robots.txt test.urls 'Download Ninja'
> ...
> allowed:        http://en.wikipedia.org/
> {code}
> The rfc (http://www.robotstxt.org/norobots-rfc.txt) states that
> bq. The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring.
> Assuming that "Download Ninja" is a substring of itself, it should match, and http://en.wikipedia.org/ should be forbidden.
> The point is that the agent name from the User-Agent line is split at spaces while the names from the http.robots.agents property are not (they are only split at ",").
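The RFC rule quoted above can be sketched as a plain substring check: treat each comma-separated name from http.robots.agents as a whole token and test whether it occurs in the User-agent line's value, without splitting either side at spaces. This is only an illustrative sketch, not the actual Nutch RobotRulesParser code; the class and method names below are hypothetical.

```java
// Hypothetical sketch of RFC-style user-agent matching (not Nutch code):
// a record applies if a configured agent name is a substring of the
// robots.txt User-agent value, so "Download Ninja" stays one token.
public class AgentMatchSketch {

    /**
     * @param configuredAgents comma-separated names, as in http.robots.agents
     * @param userAgentValue   value of a robots.txt "User-agent:" line
     * @return true if any configured name (other than "*") is contained
     *         in the User-agent value, compared case-insensitively
     */
    static boolean matches(String configuredAgents, String userAgentValue) {
        String ua = userAgentValue.toLowerCase();
        for (String name : configuredAgents.split(",")) {
            name = name.trim().toLowerCase();
            if (name.isEmpty() || name.equals("*")) {
                continue; // wildcard record is handled separately
            }
            if (ua.contains(name)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Multi-word name matches as a whole, rather than being split
        // into "Download" and "Ninja" and compared token by token.
        System.out.println(matches("Download Ninja,*", "Download Ninja"));
        System.out.println(matches("Download Ninja,*", "Googlebot"));
    }
}
```

Under this reading, the wikipedia.org record for "Download Ninja" would apply and the fetch would be disallowed.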

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira