Posted to user@nutch.apache.org by Jon Shoberg <jo...@shoberg.net> on 2005/09/15 19:22:43 UTC

[0.7] Optimize Whole Web Crawl Process

Wondering if anyone would be willing to share 
optimizations/configurations they've done for the whole-web crawling 
strategy.  I'm using a dual-CPU system with 4GB of RAM and the 
performance has been lacking.  This is for a large academic domain with 
several hundred sub-domains, and I'm treating it as a whole-web crawl 
process.

Questions:

1) What JVM are you using for SMP (Fedora Core 4)?  Is there a JVM (and 
OS) combination where the underlying thread management will take full 
advantage of both CPUs?  It appears the Sun JVM is keeping Nutch on one CPU.

2) What have you done for memory management?  4GB of RAM lets the JVM 
grab a large memory slice, but with topN segments of 10K - 50K URLs the 
box will grind to a halt.

3) How are you scripting the processes of fetch, dedup, analyze, 
refetch, etc.?  The scripts from the wiki are a good starting point, but 
I'm wondering if there is a more advanced/optimized configuration 
someone is using (a rough sketch of my current loop follows 3a below).

3a) Specifically, how are you handling/scripting the creation, fetching, 
and merging of segments?  What sizes?  Are you using topN or some other 
method?
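
For reference, my current loop is essentially the whole-web sequence 
from the tutorial/wiki, roughly as below.  The paths, topN value, and 
loop depth are placeholders, and the exact arguments are whatever the 
0.7 tools accept on my install, so please read it as a sketch of the 
flow rather than a tested script:

   #!/bin/sh
   # Rough outline of my generate/fetch/update cycle -- values are placeholders.
   DB=db
   SEGS=segments

   for pass in 1 2 3 4 5; do
       bin/nutch generate $DB $SEGS -topN 10000    # new fetchlist, ~10K URLs
       s=`ls -d $SEGS/2* | tail -1`                # newest segment directory
       bin/nutch fetch $s                          # fetch it
       bin/nutch updatedb $DB $s                   # feed new links back to the webdb
   done

   bin/nutch analyze $DB 5                         # link-analysis passes
   for s in `ls -d $SEGS/2*`; do
       bin/nutch index $s                          # index each segment
   done
   bin/nutch dedup $SEGS dedup.tmp                 # drop duplicate pages
   bin/nutch merge index $SEGS/*/index             # merge per-segment indexes

Mostly I'm unsure whether to keep the segments this small and merge 
often, or to generate much larger fetchlists and fetch them in one go.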