Posted to user@nutch.apache.org by Peter Harrington <pe...@gmail.com> on 2011/09/07 19:28:15 UTC
CrawlDb and Generator time growing unnaturally
I am running a Nutch 1.3 crawl with topN=5000 and depth=20.
For the first two crawl cycles the Generator and CrawlDb update phases
take ~1 hour. Around the 3rd cycle this increases to 3.5 hours, and by
around the 9th cycle these two phases take over 12 hours. I have
plotted these times, and the growth is not smooth (neither linear nor
exponential): there are distinct discrete steps
in the Generator and CrawlDb times. I expect these phases to take
longer as the number of links grows, but not like this.
After the crawl completed I started crawling again, and the
Generator and CrawlDb times went back to ~1 hour. It seems that
I can keep these times at 1 hour if I do not use a depth > 2.
Why is this happening? Any ideas?
During these two phases the processor is 99% utilized and memory is at only 11%.
Re: CrawlDb and Generator time growing unnaturally
Posted by Markus Jelsma <ma...@openindex.io>.
It's likely you're normalizing and filtering in both jobs. We don't do
any filtering or normalization in either job and rely on ParseOutputFormat
instead.
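
To make the suggestion above concrete, here is a minimal Python sketch that only builds the command lines for the two slow jobs, with URL filtering and normalization switched off. The flag names (-noFilter and -noNorm for generate, -filter and -normalize for updatedb) are given from memory of the Nutch 1.x usage messages and should be checked against what bin/nutch prints on your install. The helper names, the bin/nutch path and the example crawl directories are hypothetical. It also assumes the jobs are run step by step, since the one-step crawl command may not expose these switches, and that URLs were already filtered at parse time.

```python
def generate_cmd(crawldb, segments_dir, top_n, filter_urls=False):
    """Build argv for the Generator job (hypothetical helper).

    Assumption: generate filters and normalizes unless told otherwise,
    and -noFilter / -noNorm turn that off.
    """
    cmd = ["bin/nutch", "generate", crawldb, segments_dir, "-topN", str(top_n)]
    if not filter_urls:
        cmd += ["-noFilter", "-noNorm"]
    return cmd


def updatedb_cmd(crawldb, segment, filter_urls=False):
    """Build argv for the CrawlDb update job (hypothetical helper).

    Assumption: updatedb only filters and normalizes when -filter and
    -normalize are passed, so leaving them out skips both.
    """
    cmd = ["bin/nutch", "updatedb", crawldb, segment]
    if filter_urls:
        cmd += ["-filter", "-normalize"]
    return cmd


if __name__ == "__main__":
    # Example paths are placeholders, not taken from the original post.
    print(" ".join(generate_cmd("crawl/crawldb", "crawl/segments", 5000)))
    print(" ".join(updatedb_cmd("crawl/crawldb", "crawl/segments/SEGMENT")))
```

The fetch and parse steps would run between these two commands for each cycle. Timing a cycle with and without the filters is one way to check whether they account for the growth.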