Posted to user@nutch.apache.org by "saravan.krish" <sa...@cognizant.com> on 2009/10/30 08:23:29 UTC

What are the configuration parameters to fine tune Nutch performance

I am new to Nutch and have a few questions.
1) Can anyone please tell me which configuration parameters can be used to
improve and fine-tune Nutch performance?

2) Also, is there any way to resume the crawling process after it has failed?
-- 
View this message in context: http://old.nabble.com/What-are-the-configuration-parameters-to-fine-tune-Nutch-performance-tp26125943p26125943.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: What are the configuration parameters to fine tune Nutch performance

Posted by John Whelan <jo...@whelanlabs.com>.
The default tuning parameters are specified in nutch/conf/nutch-default.xml
and can be overridden in nutch/conf/nutch-site.xml. (Some can also be set on
the crawl command line, but I believe the 'best practice' is to configure
settings in nutch-site.xml.)

My personal belief is that the two most valuable parameters for tuning the
crawler are 'fetcher.threads.fetch' and 'fetcher.threads.per.host'. However,
there are lots of other parameters for tuning, and you might find more value
in some of the timeout parameters. (You might also want to look at tuning
your JVM heap space, but I've never seen a real need to tweak it.)
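
As a rough illustration of the override mechanism, a nutch-site.xml that
changes the two fetcher parameters might look something like the sketch below.
The property names are the ones mentioned above; the values (20 and 2) are
placeholders chosen only for this example, not recommendations, and the
descriptions are my own wording rather than text from nutch-default.xml.

```xml
<?xml version="1.0"?>
<!-- nutch/conf/nutch-site.xml: site-specific overrides of nutch-default.xml -->
<configuration>

  <property>
    <name>fetcher.threads.fetch</name>
    <!-- Placeholder value (assumption): total number of fetcher threads -->
    <value>20</value>
  </property>

  <property>
    <name>fetcher.threads.per.host</name>
    <!-- Placeholder value (assumption): threads allowed to hit one host at once -->
    <value>2</value>
  </property>

</configuration>
```

Be careful when raising the per-host setting, since it increases the load you
put on each site you crawl. Check nutch-default.xml in your own installation
for the actual defaults and descriptions before changing anything.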

As for resuming a failed crawl, I don't know of any way to do so. I
always discard the partial crawl and restart.

-- 
View this message in context: http://old.nabble.com/What-are-the-configuration-parameters-to-fine-tune-Nutch-performance-tp26125943p26250181.html