Posted to user@nutch.apache.org by Peter Harrington <pe...@gmail.com> on 2011/09/07 19:28:15 UTC

CrawlDb and Generator time growing unnaturally

I run a Nutch 1.3 crawl with topN=5000 and depth=20.
For the first two crawl cycles the Generator and CrawlDb Update phases
take ~1 hour.  Around the 3rd cycle this increases to 3.5 hours, and
by around the 9th cycle these two phases take over 12 hours.  I have
plotted these times and the growth is not smooth (not linear,
exponential, or anything like that).  There are distinct discrete steps
in the Generator and CrawlDb times.  I expect these phases to take
longer as I accumulate more links, but not like this.
After the crawl completed I started crawling again, and the
Generator and CrawlDb times went back to ~1 hour.  It seems that
I can keep these times at 1 hour if I do not use a depth > 2.
Why is this happening?  Any ideas?
During these two phases the processor is 99% utilized and memory only 11%.

Re: CrawlDb and Generator time growing unnaturally

Posted by Markus Jelsma <ma...@openindex.io>.
It's likely you're normalizing and filtering URLs in both jobs. We don't do
any filtering or normalization in either job and rely on ParseOutputFormat
instead.
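
For reference, here is a minimal sketch of what skipping those steps could
look like when each cycle is run step by step rather than through the
all-in-one crawl command. It only builds the command lines and does not run
them. The helper names and paths are hypothetical, and the flag names
(-noFilter, -noNorm, -normalize, -filter) are assumptions based on the
Nutch 1.x usage strings, so check them by running `bin/nutch generate` and
`bin/nutch updatedb` without arguments before relying on them.

```python
# Sketch only: builds the command lines for the two slow steps of one cycle.
# Flag names are assumed from the Nutch 1.x usage strings; verify them locally.


def generate_cmd(crawldb, segments_dir, top_n, skip_filters=True):
    """Return the argv list for the Generator step (hypothetical helper)."""
    cmd = ["bin/nutch", "generate", crawldb, segments_dir, "-topN", str(top_n)]
    if skip_filters:
        # Ask the Generator to skip URL filtering and normalization.
        cmd += ["-noFilter", "-noNorm"]
    return cmd


def updatedb_cmd(crawldb, segment, skip_filters=True):
    """Return the argv list for the CrawlDb update step (hypothetical helper)."""
    cmd = ["bin/nutch", "updatedb", crawldb, segment]
    if not skip_filters:
        # Assumption: the updatedb command line filters and normalizes
        # only when these flags are passed.
        cmd += ["-normalize", "-filter"]
    return cmd


if __name__ == "__main__":
    # Placeholder paths; SEGMENT stands for the directory that generate creates.
    print(" ".join(generate_cmd("crawl/crawldb", "crawl/segments", 5000)))
    print(" ".join(updatedb_cmd("crawl/crawldb", "crawl/segments/SEGMENT")))
```

Timing one cycle with and without the filters would show whether they
account for the extra hours.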
