Posted to user@nutch.apache.org by Martin Aesch <ma...@googlemail.com> on 2013/12/29 17:56:54 UTC
nutch retries
Dear nutchers,
Below is the output of nutch-1.7 readdb -stats. Why is the retry count going so high?
In nutch-default.xml, db.fetch.retry.max is set to 3, and I did not override this property. Did I miss anything?
Thanks,
Martin
13/12/29 15:24:35 INFO crawl.CrawlDbReader: Statistics for CrawlDb: crawl/crawldb
13/12/29 15:24:35 INFO crawl.CrawlDbReader: TOTAL urls: 222298055
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 0: 221451536
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 1: 393954
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 10: 13831
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 11: 13833
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 12: 13615
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 13: 13981
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 14: 13649
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 15: 14691
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 16: 14549
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 17: 32747
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 18: 6356
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 2: 111174
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 3: 62275
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 4: 46550
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 5: 35149
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 6: 17968
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 7: 15727
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 8: 13339
13/12/29 15:24:35 INFO crawl.CrawlDbReader: retry 9: 13131
13/12/29 15:24:35 INFO crawl.CrawlDbReader: min score: 0.0
13/12/29 15:24:35 INFO crawl.CrawlDbReader: avg score: 0.037087735
13/12/29 15:24:35 INFO crawl.CrawlDbReader: max score: 587.999
13/12/29 15:24:35 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 158627810
13/12/29 15:24:35 INFO crawl.CrawlDbReader: status 2 (db_fetched): 58450261
13/12/29 15:24:35 INFO crawl.CrawlDbReader: status 3 (db_gone): 2755726
13/12/29 15:24:35 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 889557
13/12/29 15:24:35 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 1574578
13/12/29 15:24:35 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 123
13/12/29 15:24:35 INFO crawl.CrawlDbReader: CrawlDb statistics: done
RE: nutch retries
Posted by Markus Jelsma <ma...@openindex.io>.
Hi - db.fetch.retry.max only sets status DB_GONE if the retry value exceeds it; it really doesn't do much more than that. You can use a custom scheduler or set db.gone.interval.max. This is only in Nutch 1.8, if I remember correctly.
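As the explanation above suggests, db.fetch.retry.max governs only the threshold at which a record is switched to DB_GONE; it does not cap the retry counter itself. The property is overridden in nutch-site.xml like any other Nutch setting. A minimal sketch, assuming the stock nutch-default.xml property name (the value shown is just the shipped default for illustration):

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: local overrides for nutch-default.xml -->
<configuration>
  <!-- Number of fetch retries after which a CrawlDb record is
       marked db_gone. The retry counter itself can still grow
       past this value; only the status transition is affected. -->
  <property>
    <name>db.fetch.retry.max</name>
    <value>3</value>
  </property>
</configuration>
```

Raising this value delays the transition to db_gone; it does not by itself reset or limit the counts reported by readdb -stats.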
-----Original message-----
> From:Martin Aesch <ma...@googlemail.com>
> Sent: Sunday 29th December 2013 17:57
> To: user@nutch.apache.org
> Subject: nutch retries