Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2015/12/10 16:37:11 UTC

[jira] [Comment Edited] (NUTCH-1995) Add support for wildcard to http.robot.rules.whitelist

    [ https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051126#comment-15051126 ] 

Markus Jelsma edited comment on NUTCH-1995 at 12/10/15 3:36 PM:
----------------------------------------------------------------

Guys, we upgraded to 1.11 but got these curious exceptions when running the crawler on Hadoop. The error does not appear when Nutch runs locally. Via CHANGES.txt, I've tracked it down to this issue:

{code}
2015-12-10 15:23:16,725 INFO [FetcherThread] org.apache.nutch.fetcher.Fetcher: fetch of http://www.example.org.nl/bla_bla failed with: java.lang.NoSuchMethodError: org.apache.nutch.protocol.http.api.HttpRobotRulesParser.isWhiteListed(Ljava/net/URL;)Z
	at org.apache.nutch.protocol.http.api.HttpRobotRulesParser.getRobotRulesSet(HttpRobotRulesParser.java:105)
	at org.apache.nutch.protocol.RobotRulesParser.getRobotRulesSet(RobotRulesParser.java:151)
	at org.apache.nutch.protocol.http.api.HttpBase.getRobotRules(HttpBase.java:567)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:730)
{code}
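
For reference, the JVM descriptor {{(Ljava/net/URL;)Z}} in the error denotes a method that takes a {{java.net.URL}} and returns {{boolean}}. Below is a minimal sketch of the shape of the method the fetcher expects to find on the classpath (hypothetical class and field names, not the actual Nutch source):

{code}
import java.net.URL;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: shows the signature implied by the descriptor
// isWhiteListed(Ljava/net/URL;)Z. A stale pre-1.11 protocol-http jar on
// the Hadoop classpath would lack this method, which is the classic
// cause of a runtime NoSuchMethodError like the one above.
public class RobotRulesWhitelistSketch {

  private final Set<String> whitelistedHosts;

  public RobotRulesWhitelistSketch(String commaSeparatedHosts) {
    // http.robot.rules.whitelist: comma-separated hostnames/IP addresses
    whitelistedHosts = new HashSet<>(
        Arrays.asList(commaSeparatedHosts.trim().split("\\s*,\\s*")));
  }

  // Signature matching isWhiteListed(Ljava/net/URL;)Z
  public boolean isWhiteListed(URL url) {
    return whitelistedHosts.contains(url.getHost());
  }
}
{code}

If a stale job jar is indeed the cause, rebuilding it (e.g. {{ant runtime}}) and redeploying to the cluster should make the error go away.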


> Add support for wildcard to http.robot.rules.whitelist
> ------------------------------------------------------
>
>                 Key: NUTCH-1995
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1995
>             Project: Nutch
>          Issue Type: Improvement
>          Components: robots
>    Affects Versions: 1.10
>            Reporter: Giuseppe Totaro
>            Assignee: Giuseppe Totaro
>              Labels: memex
>             Fix For: 1.11
>
>         Attachments: NUTCH-1995.MattmannNagelTotaro.05-26-2015.patch, NUTCH-1995.MattmannNagelTotaro.05-27-2015.patch, NUTCH-1995.MattmannNagelTotaro.patch, NUTCH-1995.patch
>
>
> The {{http.robot.rules.whitelist}} ([NUTCH-1927|https://issues.apache.org/jira/browse/NUTCH-1927]) configuration parameter allows you to specify a comma-separated list of hostnames or IP addresses for which robots.txt parsing is skipped.
> Adding wildcard support to {{http.robot.rules.whitelist}} could be very useful and would simplify the configuration, for example when many hostnames/addresses have to be listed. Here is an example:
> {noformat}
> <property>
>   <name>http.robot.rules.whitelist</name>
>   <value>*.sample.com</value>
>   <description>Comma separated list of hostnames or IP addresses to ignore 
>   robot rules parsing for. Use with care and only if you are explicitly
>   allowed by the site owner to ignore the site's robots.txt!
>   </description>
> </property>
> {noformat}
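> A wildcard entry like {{*.sample.com}} could be matched against a URL's hostname with a simple suffix check. Below is a minimal sketch of such matching (hypothetical helper, not the code from the attached patches; it assumes wildcards only occur as a leading {{*.}} prefix, as in the example above):
> {code}
> public class WildcardWhitelist {
>   // Hypothetical sketch: matches a whitelist pattern against a hostname.
>   // Only a leading "*." wildcard is supported, per the example above.
>   public static boolean hostMatches(String pattern, String host) {
>     if (pattern.startsWith("*.")) {
>       String suffix = pattern.substring(1);      // ".sample.com"
>       return host.endsWith(suffix)
>           || host.equals(pattern.substring(2));  // the bare "sample.com"
>     }
>     return host.equals(pattern);
>   }
> }
> {code}
> With this, {{hostMatches("*.sample.com", "www.sample.com")}} returns {{true}}, while {{hostMatches("*.sample.com", "badsample.com")}} returns {{false}}.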



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)