Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2012/10/30 00:20:15 UTC

[jira] [Commented] (NUTCH-578) URL fetched with 403 is generated over and over again

    [ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486484#comment-13486484 ] 

Sebastian Nagel commented on NUTCH-578:
---------------------------------------

NUTCH-1245 provides a test to catch this problem.

Attached v5 patch:
* call setPageGoneSchedule in CrawlDbReducer.reduce when the retry limit is hit and the status is set to db_gone. All attached patches do this: it sets the fetchInterval to a value larger than one day, so that from now on the URL is no longer generated over and over again.
* reset the retry counter in setPageGoneSchedule so that it cannot overflow, and so that the URL again gets 3 trials once db.fetch.interval.max is reached.
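For illustration, a minimal sketch of the scheduling change described above. The class, method signature, and field names here are simplified assumptions for the sketch, not the actual Nutch code; the default of 90 days for db.fetch.interval.max is taken from nutch-default.xml.

```java
// Sketch of the v5 patch behavior: when a URL goes db_gone, push its next
// fetch far into the future and reset the retry counter.
public class PageGoneScheduleSketch {

    // Assumed default of db.fetch.interval.max: 90 days, in seconds.
    static final int MAX_INTERVAL = 90 * 24 * 60 * 60;

    // Simplified stand-in for CrawlDatum.
    static class Datum {
        int fetchInterval;       // seconds until the next fetch
        int retriesSinceFetch;   // compared against db.fetch.retry.max
    }

    // Called from the reducer once the retry limit is hit and the status
    // becomes db_gone: the large interval keeps the URL out of the daily
    // generate cycle, and the reset counter cannot overflow and grants a
    // fresh set of retries after db.fetch.interval.max has passed.
    static void setPageGoneSchedule(Datum d) {
        d.fetchInterval = MAX_INTERVAL;
        d.retriesSinceFetch = 0;
    }

    public static void main(String[] args) {
        Datum d = new Datum();
        d.fetchInterval = 24 * 60 * 60;  // one day: regenerated every cycle
        d.retriesSinceFetch = 3;         // retry limit reached
        setPageGoneSchedule(d);
        System.out.println(d.fetchInterval + " " + d.retriesSinceFetch);
    }
}
```

Without the reset, the counter keeps incrementing on every failed cycle; resetting it in the same place the interval is raised keeps the two pieces of state consistent.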
                
> URL fetched with 403 is generated over and over again
> -----------------------------------------------------
>
>                 Key: NUTCH-578
>                 URL: https://issues.apache.org/jira/browse/NUTCH-578
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 1.0.0
>         Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I have checked out the most recent version of the trunk as of Nov 20, 2007
>            Reporter: Nathaniel Powell
>            Assignee: Markus Jelsma
>             Fix For: 1.6
>
>         Attachments: crawl-urlfilter.txt, NUTCH-578.patch, NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, NUTCH-578_v5.patch, nutch-site.xml, regex-normalize.xml, urls.txt
>
>
> I have not changed the following parameter in the nutch-default.xml:
> <property>
>   <name>db.fetch.retry.max</name>
>   <value>3</value>
>   <description>The maximum number of times a url that has encountered
>   recoverable errors is generated for fetch.</description>
> </property>
> However, there is a URL on the site that I'm crawling, www.teachertube.com, which keeps being generated over and over again, for almost every segment (many more times than 3):
> fetch of http://www.teachertube.com/images/ failed with: Http code=403, url=http://www.teachertube.com/images/
> This is a bug, right?
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira