Posted to dev@nutch.apache.org by "Tejas Patil (JIRA)" <ji...@apache.org> on 2014/01/24 18:29:37 UTC
[jira] [Created] (NUTCH-1715) RobotRulesParser adds additional '*' to the robots name
Tejas Patil created NUTCH-1715:
----------------------------------
Summary: RobotRulesParser adds additional '*' to the robots name
Key: NUTCH-1715
URL: https://issues.apache.org/jira/browse/NUTCH-1715
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 2.2.1, 1.7
Reporter: Tejas Patil
Assignee: Tejas Patil
Fix For: 2.3, 1.8
In RobotRulesParser, when Nutch creates an agent string from multiple agents, it combines the agents from both 'http.agent.name' and 'http.robots.agents' and then appends a wildcard (*) to the end. This string is passed to crawler-commons when parsing the rules. The trailing wildcard (*) gets matched with the first rule in the robots file, so URLs are reported as robots denied even though the robots.txt actually allows them.
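The effect described above can be illustrated with a small, self-contained sketch. It is not the Nutch or crawler-commons code: the class AgentWildcardSketch, both helper methods, and the agent names "mybot", "otherbot" and "badbot" are hypothetical, and the matcher is a deliberately simplified model in which a crawler name of "*" matches any User-agent group, which is the behaviour this issue reports.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AgentWildcardSketch {

    // Builds a comma-separated agent string: the value of http.agent.name
    // first, then the values of http.robots.agents, and optionally a
    // trailing "*" (the behaviour reported in this issue).
    static String buildAgentNames(String agentName, String robotsAgents,
                                  boolean appendWildcard) {
        List<String> names = new ArrayList<>();
        names.add(agentName.trim());
        for (String agent : robotsAgents.split(",")) {
            String trimmed = agent.trim();
            if (!trimmed.isEmpty() && !names.contains(trimmed)) {
                names.add(trimmed);
            }
        }
        if (appendWildcard && !names.contains("*")) {
            names.add("*");
        }
        return String.join(",", names);
    }

    // Toy matcher (an assumption, not the crawler-commons algorithm):
    // returns the first User-agent group, in file order, that any of the
    // crawler names matches. A crawler name of "*" matches every group.
    static String firstMatchingGroup(List<String> groupAgents, String agentNames) {
        for (String group : groupAgents) {
            for (String name : agentNames.split(",")) {
                if (name.equals("*") || name.equalsIgnoreCase(group)) {
                    return group;
                }
            }
        }
        return null; // no group applies to this crawler
    }

    public static void main(String[] args) {
        // User-agent groups of a hypothetical robots.txt, in file order:
        // the first group disallows "badbot", the second allows "mybot".
        List<String> groups = Arrays.asList("badbot", "mybot");

        String withWildcard = buildAgentNames("mybot", "mybot,otherbot", true);
        String withoutWildcard = buildAgentNames("mybot", "mybot,otherbot", false);

        System.out.println(withWildcard + " -> " + firstMatchingGroup(groups, withWildcard));
        System.out.println(withoutWildcard + " -> " + firstMatchingGroup(groups, withoutWildcard));
    }
}
```

In this model, the agent string with the trailing wildcard selects the "badbot" group, whose rules were never meant for the crawler, while the string without it selects the "mybot" group as intended.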
This issue was reported by [~markus17]. The discussion on the nutch-user mailing list is here:
http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)