Posted to dev@nutch.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2019/12/05 14:45:00 UTC

[jira] [Commented] (NUTCH-2754) fetcher.max.crawl.delay ignored if exceeding 5 min. / 300 sec.

    [ https://issues.apache.org/jira/browse/NUTCH-2754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988873#comment-16988873 ] 

ASF GitHub Bot commented on NUTCH-2754:
---------------------------------------

sebastian-nagel commented on pull request #487: NUTCH-2754 fetcher.max.crawl.delay ignored if exceeding 5 min. / 300 sec.
URL: https://github.com/apache/nutch/pull/487
 
 
   Initialize crawler-commons' SimpleRobotRulesParser with the longest possible internal maxDelay so that Nutch can always enforce the max. crawl delay itself.
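 
   A minimal sketch of the idea, assuming a crawler-commons version (1.1 or later) whose SimpleRobotRulesParser exposes a constructor taking the max. crawl delay in milliseconds plus the max. number of logged warnings; the exact signature, class name, and agent name below are assumptions for illustration, not a quote of the PR:
 
       import crawlercommons.robots.BaseRobotRules;
       import crawlercommons.robots.SimpleRobotRulesParser;
       import java.nio.charset.StandardCharsets;
 
       public class MaxDelayFixSketch {
           public static void main(String[] args) {
               // Longest possible internal maxDelay: the parser never rejects a
               // robots.txt because of its Crawl-Delay; 10 caps logged warnings.
               SimpleRobotRulesParser parser =
                       new SimpleRobotRulesParser(Long.MAX_VALUE, 10);
               byte[] robotsTxt = "User-agent: *\nCrawl-delay: 600\n"
                       .getBytes(StandardCharsets.UTF_8);
               BaseRobotRules rules = parser.parseContent(
                       "https://example.com/robots.txt", robotsTxt,
                       "text/plain", "mybot");
               // The 600 s delay is now returned (in ms) instead of triggering
               // the parser-internal 300 s "disallow all" short cut, so Nutch
               // can compare it against fetcher.max.crawl.delay on its own.
               System.out.println(rules.getCrawlDelay());                       // 600000
               System.out.println(rules.isAllowed("https://example.com/page")); // true
           }
       }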
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> fetcher.max.crawl.delay ignored if exceeding 5 min. / 300 sec.
> --------------------------------------------------------------
>
>                 Key: NUTCH-2754
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2754
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, robots
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.17
>
>
> Sites specifying a Crawl-Delay of more than 5 minutes (301 seconds or more) are always excluded from fetching (all of their URLs are treated as disallowed), even if fetcher.max.crawl.delay is set to a higher value.
> We need to pass the (possibly higher) value of fetcher.max.crawl.delay to [crawler-commons' robots.txt parser|https://github.com/crawler-commons/crawler-commons/blob/c9c0ac6eda91b13d534e69f6da3fd15065414fb0/src/main/java/crawlercommons/robots/SimpleRobotRulesParser.java#L78]; otherwise it falls back to its internal default of 300 sec. and disallows all sites specifying a longer Crawl-Delay in their robots.txt (see the sketch below).
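
To make the failure mode concrete, here is a minimal sketch of the behaviour described above, using crawler-commons' default SimpleRobotRulesParser. The URL, robots.txt content, and agent name are made up for illustration; the parseContent() signature shown is the one from the crawler-commons 1.x line:

    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;
    import java.nio.charset.StandardCharsets;

    public class CrawlDelayDemo {
        public static void main(String[] args) {
            // robots.txt asking for a 10-minute delay (600 s > the 300 s cap)
            byte[] robotsTxt = "User-agent: *\nCrawl-delay: 600\n"
                    .getBytes(StandardCharsets.UTF_8);
            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
            BaseRobotRules rules = parser.parseContent(
                    "https://example.com/robots.txt", robotsTxt,
                    "text/plain", "mybot");
            // 600 s exceeds the parser-internal default of 300 s, so the parser
            // returns "allow none" rules: every URL of the site is disallowed,
            // no matter how high fetcher.max.crawl.delay is set in Nutch.
            System.out.println(rules.isAllowed("https://example.com/page")); // false
        }
    }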



--
This message was sent by Atlassian Jira
(v8.3.4#803005)