Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2015/09/16 06:27:46 UTC

[jira] [Resolved] (NUTCH-1922) DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches

     [ https://issues.apache.org/jira/browse/NUTCH-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved NUTCH-1922.
-----------------------------------------
    Resolution: Duplicate

This issue is a clone of NUTCH-1679, for which I just committed [~alxksn]'s most recent patch.


> DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches
> --------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1922
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1922
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Gerhard Gossen
>             Fix For: 2.3.1
>
>         Attachments: NUTCH-1922.patch
>
>
> When Nutch 2 finds a link to a URL that was crawled in a previous batch, it resets the fetch status of that URL to {{unfetched}}. This makes this URL available for a re-fetch, even if its crawl interval is not yet over.
> To reproduce, using version 2.3:
> {code}
> # Nutch configuration
> ant runtime
> cd runtime/local
> mkdir seeds
> echo http://www.l3s.de/~gossen/nutch/a.html > seeds/1.txt
> bin/crawl seeds test 2
> {code}
> This uses two files {{a.html}} and {{b.html}} that link to each other.
> In batch 1, Nutch downloads {{a.html}} and discovers the URL of {{b.html}}. In batch 2, Nutch downloads {{b.html}} and discovers the link to {{a.html}}. This should update the score and link fields of {{a.html}}, but not the fetch status. However, when I run {{bin/nutch readdb -crawlId test -url http://www.l3s.de/~gossen/nutch/a.html | grep -a status}}, it returns {{status: 1 (status_unfetched)}}.
> Expected would be {{status: 2 (status_fetched)}}.
> The reason seems to be that DbUpdateReducer assumes that [links to a URL not processed in the same batch always belong to new pages|https://github.com/apache/nutch/blob/release-2.3/src/java/org/apache/nutch/crawl/DbUpdateReducer.java#L97-L109]. Before NUTCH-1556, all pages in the crawl DB were processed by the DBUpdate job; that change made the job skip all pages with a different batch ID, so I assume it introduced this behavior.
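
The faulty assumption described above can be illustrated with a minimal sketch. This is not Nutch's actual code: the method names, the null-means-unknown convention, and the status constants' usage are illustrative simplifications of the logic around DbUpdateReducer lines 97-109, where a URL absent from the current batch is treated as a newly discovered page.

```java
// Illustrative sketch only -- not the real DbUpdateReducer API.
// Models how the fetch status of a URL seen via an inlink is decided.
public class DbUpdateSketch {
    static final int STATUS_UNFETCHED = 1; // status_unfetched
    static final int STATUS_FETCHED = 2;   // status_fetched

    // Buggy behavior (post-NUTCH-1556): a URL not processed in this
    // batch is assumed to be a new page, so its status is reset,
    // discarding the stored status from an earlier batch.
    static int buggyMerge(Integer storedStatus) {
        return STATUS_UNFETCHED;
    }

    // Intended behavior: only initialize URLs that have no row in the
    // crawl DB; a known page keeps its stored status, so its score and
    // link fields can be updated without scheduling a re-fetch.
    static int fixedMerge(Integer storedStatus) {
        return storedStatus == null ? STATUS_UNFETCHED : storedStatus;
    }

    public static void main(String[] args) {
        // a.html was fetched in batch 1; batch 2 discovers a link to it.
        System.out.println(buggyMerge(STATUS_FETCHED)); // prints 1
        System.out.println(fixedMerge(STATUS_FETCHED)); // prints 2
        // b.html is genuinely new: no stored row.
        System.out.println(fixedMerge(null));           // prints 1
    }
}
```

Under this reading, the buggy path reproduces the {{status: 1 (status_unfetched)}} observed for {{a.html}}, while the fixed path yields the expected {{status: 2 (status_fetched)}}.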



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)