Posted to user@nutch.apache.org by RP <rp...@earthlink.net> on 2006/12/21 05:24:03 UTC
Nutch tuning - speed improvements that worked for me

Some tuning results. Play with what you have and you might be
surprised! A simple tweak, running Java in server mode with the "-server"
switch, gave a 12-13% improvement for a readdb, as noted below. The
-server tweak did not help query results via Tomcat, but for basic Nutch
DB work it did pretty well (this is a standalone box, and a
resource-limited one at that). So I've put the switch right in the nutch
script in bin, where it's picked up just for Nutch. I also played with
the -Xm? type settings, and if you have a memory-limited machine like I
do, this helps reduce the swapping that was really slowing things down
(my Nutch install has a 1000m heap size, way too big for my box). There
are other Java things I've not tried yet (incremental garbage
collection, etc.). The experienced nutchers will have done this already,
but for other newbies like me this may help, and these Java tweaks apply
to all Nutch revs. Also, use jconsole to probe the JVM resources being
used in real time. My basic setup is now quite a bit faster than the
default config:

-client (default for java and Nutch)
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:     2631275
retry 0:        2618055
retry 1:        6847
retry 2:        741
retry 3:        5632
min score:      0.0
avg score:      5.279
max score:      4063232.0
status 1 (DB_unfetched):        2201893
status 2 (DB_fetched):  390543
status 3 (DB_gone):     38839
CrawlDb statistics: done

real    7m34.655s
user    7m19.948s
sys     0m10.032s

-server
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:     2631275
retry 0:        2618055
retry 1:        6847
retry 2:        741
retry 3:        5632
min score:      0.0
avg score:      5.279
max score:      4063232.0
status 1 (DB_unfetched):        2201893
status 2 (DB_fetched):  390543
status 3 (DB_gone):     38839
CrawlDb statistics: done

real    6m39.170s
user    6m22.691s
sys     0m10.191s

That's about 12% less wall-clock time and about 13% less user CPU time.
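
For anyone who wants to check the percentage on their own runs, here is a minimal Python sketch that converts the "time" figures above into seconds and works out the saving. The helper names are mine, purely for illustration, and it assumes the bash-style "XmY.ZZZs" format shown in the output above.

```python
import re


def parse_time(value: str) -> float:
    """Convert a bash `time` value such as '7m34.655s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognised time value: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def percent_saved(before: str, after: str) -> float:
    """Percentage of the original time saved by the faster run."""
    old, new = parse_time(before), parse_time(after)
    return (old - new) / old * 100.0


# Figures copied from the -client and -server readdb runs above.
real_saved = percent_saved("7m34.655s", "6m39.170s")
user_saved = percent_saved("7m19.948s", "6m22.691s")
print(f"real: {real_saved:.1f}% less, user: {user_saved:.1f}% less")
```

With the numbers from the two runs this prints "real: 12.2% less, user: 13.0% less".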

On another note, look at the switches you have available. For me,
turning off filtering in the generate step and turning off parsing
during the fetch gave a nice boost. I run filtering on the crawldb from
time to time, so there's no need to duplicate that effort in the
generate step, where it really slows things down. I just run the parse
after the fetch is done, and my combined times seem shorter than doing
it in one step, as I'm both CPU and bandwidth throttled. As always, your
mileage may vary, so give some things a try and you might get a nice
surprise in improved speed.
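
To compare one-step and two-step runs on your own box, a small timing harness along these lines may help. It is only a sketch: the segment path is a made-up placeholder, and the "-noParsing" switch name is my assumption about what your Nutch version accepts, so check the usage message that bin/nutch prints before relying on it.

```python
import subprocess
import sys
import time


def time_steps(steps):
    """Run each (label, argv) step in order; return {label: wall seconds}."""
    timings = {}
    for label, argv in steps:
        start = time.perf_counter()
        subprocess.run(argv, check=True)  # stop at the first failing step
        timings[label] = time.perf_counter() - start
    return timings


# Hypothetical segment path; the switch name below is an assumption,
# so confirm it against the usage output of your own bin/nutch.
SEGMENT = "crawl/segments/EXAMPLE"
ONE_STEP = [("fetch+parse", ["bin/nutch", "fetch", SEGMENT])]
TWO_STEP = [
    ("fetch", ["bin/nutch", "fetch", SEGMENT, "-noParsing"]),
    ("parse", ["bin/nutch", "parse", SEGMENT]),
]

# Harmless stand-in command so the sketch runs without a Nutch install.
demo = time_steps([("noop", [sys.executable, "-c", "pass"])])
print({label: round(seconds, 2) for label, seconds in demo.items()})
```

Pass ONE_STEP or TWO_STEP to time_steps() from the Nutch directory and sum the values to get the combined time. A segment can only be fetched once, so a fair comparison needs two segments of similar size.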

-- 
rp