You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2013/10/06 22:33:42 UTC
[jira] [Commented] (NUTCH-1588) Port NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again to 2.x

    [ https://issues.apache.org/jira/browse/NUTCH-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787763#comment-13787763 ] 

Sebastian Nagel commented on NUTCH-1588:
----------------------------------------

Ok, the patch is a port of the fix for 1.x (NUTCH-1245) and should work. Code style guidelines are not followed: [~talat], can you format the code accordingly (using [eclipse-codeformat.xml|http://svn.apache.org/viewvc/nutch/branches/2.x/eclipse-codeformat.xml], see [[1|http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Step_Three:_Using_the_JIRA_and_Developing]]). Thanks!

> Port NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again to 2.x
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1588
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1588
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Lewis John McGibbney
>            Priority: Critical
>             Fix For: 2.3
>
>         Attachments: NUTCH-1588.patch
>
>
> A document gone with 404 after db.fetch.interval.max (90 days) has passed
> is fetched over and over again but although fetch status is fetch_gone
> its status in CrawlDb keeps db_unfetched. Consequently, this document will
> be generated and fetched from now on in every cycle.
> To reproduce:
> # create a CrawlDatum in CrawlDb which retry interval hits db.fetch.interval.max (I manipulated the shouldFetch() in AbstractFetchSchedule to achieve this)
> # now this URL is fetched again
> # but when updating CrawlDb with the fetch_gone the CrawlDatum is reset to db_unfetched, the retry interval is fixed to 0.9 * db.fetch.interval.max (81 days)
> # this does not change with every generate-fetch-update cycle, here for two segments:
> {noformat}
> /tmp/testcrawl/segments/20120105161430
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
> Status: 1 (db_unfetched)
> Fetch time: Thu Jan 05 16:14:21 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: http://localhost/page_gone
> Crawl Fetch::
> Status: 37 (fetch_gone)
> Fetch time: Thu Jan 05 16:14:48 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: http://localhost/page_gone
> /tmp/testcrawl/segments/20120105161631
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
> Status: 1 (db_unfetched)
> Fetch time: Thu Jan 05 16:16:23 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: http://localhost/page_gone
> Crawl Fetch::
> Status: 37 (fetch_gone)
> Fetch time: Thu Jan 05 16:20:05 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: http://localhost/page_gone
> {noformat}
> As far as I can see it's caused by setPageGoneSchedule() in AbstractFetchSchedule. Some pseudo-code:
> {code}
> setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
>     datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * maxInterval
>     datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516
>     if (maxInterval < datum.fetchInterval) // necessarily true
>        forceRefetch()
> forceRefetch:
>     if (datum.fetchInterval > maxInterval) // true because it's 1.35 * maxInterval
>        datum.fetchInterval = 0.9 * maxInterval
>     datum.status = db_unfetched // 
> shouldFetch (called from generate / Generator.map):
>     if ((datum.fetchTime - curTime) > maxInterval)
>        // always true if the crawler is launched in short intervals
>        // (lower than 0.35 * maxInterval)
>        datum.fetchTime = curTime // forces a refetch
> {code}
> After setPageGoneSchedule is called via update the state is db_unfetched and the retry interval 0.9 * db.fetch.interval.max (81 days). 
> Although the fetch time in the CrawlDb is far in the future
> {noformat}
> % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
> URL: http://localhost/page_gone
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Sun May 06 05:20:05 CEST 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Score: 1.0
> Signature: null
> Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
> {noformat}
> the URL is generated again because (fetch time - current time) is larger than db.fetch.interval.max.
> The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35, and the fetch time is always close to current time + 1.35 * db.fetch.interval.max.
> It's possibly a side effect of NUTCH-516, and may be related to NUTCH-578



--
This message was sent by Atlassian JIRA
(v6.1#6144)