Posted to dev@nutch.apache.org by "Sebastian Nagel (Jira)" <ji...@apache.org> on 2020/08/03 19:08:00 UTC

[jira] [Resolved] (NUTCH-2801) RobotsRulesParser command-line checker to use http.robots.agents as fall-back

     [ https://issues.apache.org/jira/browse/NUTCH-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-2801.
------------------------------------
    Resolution: Fixed

> RobotsRulesParser command-line checker to use http.robots.agents as fall-back
> -----------------------------------------------------------------------------
>
>                 Key: NUTCH-2801
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2801
>             Project: Nutch
>          Issue Type: Bug
>          Components: checker, robots
>    Affects Versions: 1.17
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.18
>
>
> The RobotRulesParser command-line tool, used to check a list of URLs against one robots.txt file, should use the value of the property {{http.robots.agents}} as a fall-back if no user agent names are explicitly given as command-line arguments. In this case it should behave the same as the robots.txt parser: look first for {{http.agent.name}}, then for the other names listed in {{http.robots.agents}}, and finally pick the rules for {{User-agent: *}}.
> {noformat}
> $> cat robots.txt
> User-agent: Nutch
> Allow: /
> User-agent: *
> Disallow: /
> $> bin/nutch org.apache.nutch.protocol.RobotRulesParser \
>       -Dhttp.agent.name=mybot \
>       -Dhttp.robots.agents='nutch,goodbot' \
>       robots.txt urls.txt 
> Testing robots.txt for agent names: mybot,nutch,goodbot
> not allowed:    https://www.example.com/
> {noformat}
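> The log message "Testing ... for ...: mybot,nutch,goodbot" is misleading. Only the name "mybot" is actually checked, so the rules for {{User-agent: *}} are applied and the URL is reported as not allowed, although the rules for {{User-agent: Nutch}} would allow it.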
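
The fall-back order described above can be illustrated with a minimal, self-contained sketch. This is not Nutch's actual implementation: the class AgentFallbackSketch and its methods agentNames and selectGroup are hypothetical names introduced only for illustration, and the robots.txt file is reduced to a map from user-agent group to "allows everything". Real robots.txt matching is more lenient than the exact, lower-cased name comparison assumed here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of Nutch.
class AgentFallbackSketch {

  /** Ordered agent names: http.agent.name first, then the names from http.robots.agents. */
  static List<String> agentNames(String agentName, String robotsAgents) {
    List<String> names = new ArrayList<>();
    if (agentName != null && !agentName.trim().isEmpty()) {
      names.add(agentName.trim().toLowerCase(Locale.ROOT));
    }
    if (robotsAgents != null) {
      for (String n : robotsAgents.split(",")) {
        String name = n.trim().toLowerCase(Locale.ROOT);
        // Skip empty entries and names that are already in the list.
        if (!name.isEmpty() && !names.contains(name)) {
          names.add(name);
        }
      }
    }
    return names;
  }

  /** First agent name that has its own group, else "*", else null if no group applies. */
  static String selectGroup(List<String> names, Map<String, Boolean> groups) {
    for (String name : names) {
      if (groups.containsKey(name)) {
        return name;
      }
    }
    return groups.containsKey("*") ? "*" : null;
  }

  public static void main(String[] args) {
    // Simplified stand-in for the robots.txt above (assumption: exact match on lower-cased names).
    Map<String, Boolean> groups = new LinkedHashMap<>();
    groups.put("nutch", true);  // User-agent: Nutch / Allow: /
    groups.put("*", false);     // User-agent: *     / Disallow: /

    List<String> names = agentNames("mybot", "nutch,goodbot");
    String group = selectGroup(names, groups);
    System.out.println("Testing robots.txt for agent names: " + String.join(",", names));
    System.out.println("matched group: " + group + ", allowed: " + groups.get(group));
  }
}
```

With the values from the example, the sketch tries "mybot", then "nutch", and stops there, so the URL would be allowed. Checking only "mybot" would fall through to the "*" group, which matches the "not allowed" result reported above.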


