Posted to user@nutch.apache.org by "Eason.Lee" <le...@gmail.com> on 2010/06/17 05:01:34 UTC

Problem with the number of URLs in CrawlDB

I am testing Nutch by crawling a test website.
All the pages on the website are static HTML pages,
but every time I crawl the site, the number of pages downloaded by Nutch is
different.
Looking into the CrawlDB files, I found that the signatures of some URLs are
null in one run but not null in another.

Details from the CrawlDB for the test run in which we got 60w (600,000) records:

http://site117-1.com/chn13/wjdt/wjzc/default.htm Version: 7
Status: 2 (db_fetched)
Fetch time: Fri Jul 09 03:04:20 EDT 2010
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.1721904E-4
Signature: null
Metadata: _pst_: success(1), lastModified=0

Details from the CrawlDB for the test run in which we got 70w (700,000) records:

http://site117-1.com/chn13/wjdt/wjzc/default.htm Version: 7
Status: 2 (db_fetched)
Fetch time: Sun Jul 11 04:57:53 EDT 2010
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.1721904E-4
Signature: 5f385fccc40b871e68cb86becbabea92
Metadata: _pst_: success(1), lastModified=0
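
To make the comparison between the two records above less error-prone, a small script along the following lines could list the fields that differ. This is only a sketch: it assumes the record layout shown above (the URL and the "Version" field on the first line, then one "Key: value" pair per line), and the names parse_record and diff_fields are hypothetical helpers, not part of Nutch.

```python
def parse_record(text):
    """Parse one CrawlDB dump record into (url, fields).

    Assumes the layout shown in this post: the first line holds the URL
    followed by "Version: N", and every later line is "Key: value".
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    url, first_field = lines[0].split(None, 1)
    fields = {}
    for line in [first_field] + lines[1:]:
        # Split on the first ": " only, so values may contain colons.
        key, _, value = line.partition(": ")
        fields[key] = value
    return url, fields


def diff_fields(record_a, record_b):
    """Return {field: (value_a, value_b)} for fields that differ."""
    url_a, a = parse_record(record_a)
    url_b, b = parse_record(record_b)
    if url_a != url_b:
        raise ValueError("records describe different URLs")
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# The two records quoted in this post.
RUN_60W = """
http://site117-1.com/chn13/wjdt/wjzc/default.htm Version: 7
Status: 2 (db_fetched)
Fetch time: Fri Jul 09 03:04:20 EDT 2010
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.1721904E-4
Signature: null
Metadata: _pst_: success(1), lastModified=0
"""

RUN_70W = """
http://site117-1.com/chn13/wjdt/wjzc/default.htm Version: 7
Status: 2 (db_fetched)
Fetch time: Sun Jul 11 04:57:53 EDT 2010
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.1721904E-4
Signature: 5f385fccc40b871e68cb86becbabea92
Metadata: _pst_: success(1), lastModified=0
"""

if __name__ == "__main__":
    for key, (old, new) in diff_fields(RUN_60W, RUN_70W).items():
        print(f"{key}: {old!r} -> {new!r}")
```

For the two records above, only the "Fetch time" and "Signature" fields differ; status, score, retry interval and metadata are identical.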

Can anyone tell me what the problem is?