Posted to dev@nutch.apache.org by "Tejas Patil (JIRA)" <ji...@apache.org> on 2012/12/22 11:53:12 UTC

[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

    [ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538725#comment-13538725 ] 

Tejas Patil commented on NUTCH-1284:
------------------------------------

I searched for the relevant mail thread [0] to understand why this issue was created.
Quick recap of the issue:
Although fetcher.max.crawl.delay was set to -1, Nutch was marking the URL as ROBOTS_DENIED. With fetcher.max.crawl.delay = -1, the expected behavior is to wait for the amount of time given by the robots.txt Crawl-Delay, however long that might be.

Lewis could reproduce the issue. He suggested the change mentioned in the bug and hinted that there might be some problem with that property.

An additional condition also needed to be changed so that URLs are not marked DB_GONE when fetcher.max.crawl.delay = -1 (i.e. maxCrawlDelay = -1000, since the value is converted to milliseconds). After this change, I tested with the scenario mentioned in [0] and it worked fine.
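
To illustrate the condition described above, here is a minimal sketch. It is not the code from the attached patch: the class CrawlDelayCheck and the methods toMillis and exceedsMaxCrawlDelay are hypothetical names used only for this example. The sketch assumes the configured value is in seconds and is multiplied by 1000, so that -1 becomes -1000, and that a negative value means "no upper limit".

```java
// Hypothetical sketch, not the actual Nutch Fetcher code.
class CrawlDelayCheck {

    // Assumption: the property is given in seconds and stored in milliseconds,
    // so fetcher.max.crawl.delay = -1 becomes maxCrawlDelay = -1000.
    static long toMillis(int seconds) {
        return seconds * 1000L;
    }

    // Returns true when the page should be skipped because the robots.txt
    // Crawl-Delay is longer than the configured maximum.
    // A negative maximum disables the check, so the fetcher always waits.
    static boolean exceedsMaxCrawlDelay(long robotsDelayMs, long maxCrawlDelayMs) {
        return maxCrawlDelayMs >= 0 && robotsDelayMs > maxCrawlDelayMs;
    }

    public static void main(String[] args) {
        long unlimited = toMillis(-1);   // -1000
        long thirtySec = toMillis(30);   // 30000

        // Crawl-Delay of 120 s with no limit: wait, do not skip.
        System.out.println(exceedsMaxCrawlDelay(120_000L, unlimited)); // false
        // Crawl-Delay of 120 s with a 30 s limit: skip the page.
        System.out.println(exceedsMaxCrawlDelay(120_000L, thirtySec)); // true
        // Crawl-Delay of 5 s with a 30 s limit: wait, do not skip.
        System.out.println(exceedsMaxCrawlDelay(5_000L, thirtySec));   // false
    }
}
```

The point of the extra "maxCrawlDelayMs >= 0" guard is that, without it, any positive Crawl-Delay compares as greater than -1000 and the URL would be skipped even though no limit was intended.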

[0]: http://lucene.472066.n3.nabble.com/Re-Re-Re-Re-fetcher-max-crawl-delay-1-doesn-t-work-tc3749639.html
                
> Add site fetcher.max.crawl.delay as log output by default.
> ----------------------------------------------------------
>
>                 Key: NUTCH-1284
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1284
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Priority: Trivial
>             Fix For: 1.7
>
>         Attachments: NUTCH-1284.patch
>
>
> Currently, when manually scanning our log output we cannot infer which pages are governed by a crawl delay between successive fetch attempts of any given page within the site. The value should be made available as something like:
> {code}
> 2012-02-19 12:33:33,031 INFO  fetcher.Fetcher - fetching http://nutch.apache.org/ (crawl.delay=XXXms)
> {code}
> This way we can easily and quickly determine whether the fetcher is having to use this functionality or not. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira