Posted to user@nutch.apache.org by Martin Aesch <ma...@googlemail.com> on 2013/12/29 17:56:54 UTC

nutch retries

Dear nutchers,

below is an output of nutch-1.7 readdb -stats. Why is this retry count going so high?

In nutch-default, there is db.fetch.retry.max set to 3. I did not overwrite this property. Anything I missed?

Thanks,
Martin

13/12/29 15:24:35 INFO crawl.CrawlDbReader: Statistics for CrawlDb: crawl/crawldb
13/12/29 15:24:35 INFO crawl.CrawlDbReader: TOTAL urls: 222298055
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 0:    221451536
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 1:    393954
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 10:   13831
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 11:   13833
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 12:   13615
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 13:   13981
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 14:   13649
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 15:   14691
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 16:   14549
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 17:   32747
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 18:   6356
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 2:    111174
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 3:    62275
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 4:    46550
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 5:    35149
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 6:    17968
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 7:    15727
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 8:    13339
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 9:    13131
13/12/29 15:24:35 INFO crawl.CrawlDbReader: min score:  0.0
13/12/29 15:24:35 INFO crawl.CrawlDbReader: avg score:  0.037087735
13/12/29 15:24:35 INFO crawl.CrawlDbReader: max score:  587.999
13/12/29 15:24:35 INFO crawl.CrawlDbReader: status 1 (db_unfetched):    158627810
13/12/29 15:24:35 INFO crawl.CrawlDbReader: status 2 (db_fetched):      58450261
13/12/29 15:24:35 INFO crawl.CrawlDbReader: status 3 (db_gone): 2755726
13/12/29 15:24:35 INFO crawl.CrawlDbReader: status 4 (db_redir_temp):   889557
13/12/29 15:24:35 INFO crawl.CrawlDbReader: status 5 (db_redir_perm):   1574578
13/12/29 15:24:35 INFO crawl.CrawlDbReader: status 6 (db_notmodified):  123
13/12/29 15:24:35 INFO crawl.CrawlDbReader: CrawlDb statistics: done
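For reference, the property in question lives in conf/nutch-default.xml. A sketch of the stock entry (the description text is paraphrased here, not quoted verbatim from the 1.7 release):

```xml
<!-- conf/nutch-default.xml (sketch): default retry limit for transient fetch errors -->
<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
  <description>Maximum number of times a URL that hit recoverable
  (transient) errors is regenerated for fetching.</description>
</property>
```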



RE: nutch retries

Posted by Markus Jelsma <ma...@openindex.io>.
Hi - db.fetch.retry.max only sets the status to DB_GONE once the retry value exceeds it; it really doesn't do much more than that. You can use a custom scheduler or set db.gone.interval.max. That one is only in Nutch 1.8, if I remember correctly.
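If you do want to change the limit, the usual route is a local override in conf/nutch-site.xml, which takes precedence over nutch-default.xml. A minimal sketch (the value shown is just the stock default; raise or lower it to taste):

```xml
<!-- conf/nutch-site.xml: local overrides win over nutch-default.xml -->
<configuration>
  <property>
    <name>db.fetch.retry.max</name>
    <value>3</value>
    <!-- URLs whose retry counter exceeds this are marked DB_GONE;
         note the counter itself is not capped at this value -->
  </property>
</configuration>
```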

 
 
-----Original message-----
> From:Martin Aesch <ma...@googlemail.com>
> Sent: Sunday 29th December 2013 17:57
> To: user@nutch.apache.org
> Subject: nutch retries
> 
> Dear nutchers,
> 
> below is an output of nutch-1.7 readdb -stats. Why is this retry count going so high?
> 
> In nutch-default, there is db.fetch.retry.max set to 3. I did not overwrite this property. Anything I missed?
> 
> Thanks,
> Martin
> 
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: Statistics for CrawlDb: crawl/crawldb
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: TOTAL urls: 222298055
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 0:    221451536
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 1:    393954
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 10:   13831
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 11:   13833
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 12:   13615
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 13:   13981
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 14:   13649
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 15:   14691
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 16:   14549
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 17:   32747
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 18:   6356
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 2:    111174
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 3:    62275
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 4:    46550
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 5:    35149
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 6:    17968
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 7:    15727
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 8:    13339
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 9:    13131
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: min score:  0.0
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: avg score:  0.037087735
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: max score:  587.999
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: status 1 (db_unfetched):    158627810
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: status 2 (db_fetched):      58450261
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: status 3 (db_gone): 2755726
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: status 4 (db_redir_temp):   889557
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: status 5 (db_redir_perm):   1574578
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: status 6 (db_notmodified):  123
> 13/12/29 15:24:35 INFO crawl.CrawlDbReader: CrawlDb statistics: done
> 
> 
>