Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2012/07/17 14:16:33 UTC

[jira] [Commented] (NUTCH-1430) Freegenerator records overwrite CrawlDB records with AdaptiveFetchSchedule

    [ https://issues.apache.org/jira/browse/NUTCH-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416124#comment-13416124 ] 

Markus Jelsma commented on NUTCH-1430:
--------------------------------------

If a record already exists in the CrawlDB, it is simply overwritten by the FreeGenerator record. The bug has been present in all recent versions. Until it is fixed, it's a bad idea to use the FreeGenerator tool with AdaptiveFetchSchedule enabled on an existing CrawlDB.
                
> Freegenerator records overwrite CrawlDB records with AdaptiveFetchSchedule
> --------------------------------------------------------------------------
>
>                 Key: NUTCH-1430
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1430
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.5
>            Reporter: Markus Jelsma
>            Priority: Critical
>             Fix For: 1.6
>
>
> Steps to reproduce:
> Without AdaptiveFetchSchedule:
> {code}
> $ bin/nutch readdb crawl/crawldb/ -url http://www.openindex.io/en/home.html
> URL: http://www.openindex.io/en/home.html
> Version: 7
> Status: 2 (db_fetched)
> Fetch time: Thu Aug 16 13:58:23 CEST 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 0.0
> Signature: c2601ca503f2fc5edcb286501d7fb271
> Metadata: Content-Type: text/html_pst_: success(1), lastModified=0
> {code}
> With AdaptiveFetchSchedule:
> {code}
> $ bin/nutch readdb crawl/crawldb/ -url http://www.openindex.io/en/home.html
> URL: http://www.openindex.io/en/home.html
> Version: 7
> Status: 2 (db_fetched)
> Fetch time: Tue Jul 17 13:56:33 CEST 2012
> Modified time: Tue Jul 17 13:55:33 CEST 2012
> Retries since fetch: 0
> Retry interval: 60 seconds (0 days)
> Score: 0.0
> Signature: 23567bb52ee8b905b8649c4305ed82ee
> Metadata: Content-Type: text/html_pst_: success(1), lastModified=0
> {code}
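
To make the difference between the two dumps above explicit, something like the following could be used. This is only a rough sketch: the helper names (parse_readdb_record, changed_fields, retry_interval_seconds) are made up for illustration and are not part of Nutch. It assumes nothing beyond the "Key: value" line layout of the readdb output quoted above, with the two records pasted in as strings.

```python
# Sketch only: these helpers are hypothetical and not part of Nutch.
# They parse the "Key: value" lines printed by `bin/nutch readdb ... -url`
# (as quoted in the issue) and report which fields differ.

BEFORE = """\
URL: http://www.openindex.io/en/home.html
Version: 7
Status: 2 (db_fetched)
Fetch time: Thu Aug 16 13:58:23 CEST 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0
Signature: c2601ca503f2fc5edcb286501d7fb271
Metadata: Content-Type: text/html_pst_: success(1), lastModified=0
"""

AFTER = """\
URL: http://www.openindex.io/en/home.html
Version: 7
Status: 2 (db_fetched)
Fetch time: Tue Jul 17 13:56:33 CEST 2012
Modified time: Tue Jul 17 13:55:33 CEST 2012
Retries since fetch: 0
Retry interval: 60 seconds (0 days)
Score: 0.0
Signature: 23567bb52ee8b905b8649c4305ed82ee
Metadata: Content-Type: text/html_pst_: success(1), lastModified=0
"""


def parse_readdb_record(text):
    """Turn the 'Key: value' lines of one record into a dict."""
    record = {}
    for line in text.strip().splitlines():
        # Split on the first ": " only, so values containing ": " stay intact.
        key, sep, value = line.strip().partition(": ")
        if sep:
            record[key] = value
    return record


def changed_fields(before, after):
    """Return the sorted names of fields whose values differ."""
    keys = before.keys() | after.keys()
    return sorted(k for k in keys if before.get(k) != after.get(k))


def retry_interval_seconds(record):
    """Extract the leading number of seconds from the 'Retry interval' field."""
    return int(record["Retry interval"].split()[0])


before = parse_readdb_record(BEFORE)
after = parse_readdb_record(AFTER)

if __name__ == "__main__":
    for field in changed_fields(before, after):
        print(f"{field}: {before.get(field)!r} -> {after.get(field)!r}")
```

Run against the two records above, this lists the fetch time, modified time, retry interval and signature as changed, while status, score and metadata are identical.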
