Posted to user@nutch.apache.org by RP <rp...@earthlink.net> on 2006/12/21 05:24:03 UTC
Nutch tuning - speed improvements that worked for me
Some tuning results: play with what you have and you might be
surprised! A simple tweak, running Java with the "-server" switch,
gave a ~13% improvement for a readdb, as shown below. The -server tweak
did not help query results via Tomcat, but for basic Nutch DB work it
did pretty well (this is a standalone box, and a resource-limited one as
well). I've put the switch right in the nutch script in bin so
it's picked up just for Nutch. I also played with the -Xm? heap
settings (-Xms/-Xmx); if you have a memory-limited machine like I do,
this helps reduce the swapping that was really slowing things down (my
Nutch install has a 1000m heap size, way too big for my box). There are
other Java options I've not tried yet (incremental garbage collection,
etc.). The experienced Nutchers will have done this already, but for
other newbies like me this may help, and these Java tweaks apply to all
Nutch revs. Also, use jconsole to probe the JVM resources being used in
real time. My basic setup is now quite a bit faster than the default
config:
-client (default for Java and Nutch)
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 2631275
retry 0: 2618055
retry 1: 6847
retry 2: 741
retry 3: 5632
min score: 0.0
avg score: 5.279
max score: 4063232.0
status 1 (DB_unfetched): 2201893
status 2 (DB_fetched): 390543
status 3 (DB_gone): 38839
CrawlDb statistics: done
real 7m34.655s
user 7m19.948s
sys 0m10.032s
-server
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 2631275
retry 0: 2618055
retry 1: 6847
retry 2: 741
retry 3: 5632
min score: 0.0
avg score: 5.279
max score: 4063232.0
status 1 (DB_unfetched): 2201893
status 2 (DB_fetched): 390543
status 3 (DB_gone): 38839
CrawlDb statistics: done
real 6m39.170s
user 6m22.691s
sys 0m10.191s
That's ~13% less user (CPU) time, and about 12% less wall-clock time.
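To check the same arithmetic on your own runs, here is a small Python sketch that converts the figures printed by time into seconds and works out the saving. The helper names to_seconds and percent_saved are made up for illustration; nothing here comes from Nutch itself.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time`-style figure such as '7m34.655s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time format: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def percent_saved(before: str, after: str) -> float:
    """Percentage reduction going from `before` to `after`."""
    b, a = to_seconds(before), to_seconds(after)
    return 100.0 * (b - a) / b


if __name__ == "__main__":
    # Figures copied from the two readdb runs above.
    print("real:", round(percent_saved("7m34.655s", "6m39.170s"), 1))  # real: 12.2
    print("user:", round(percent_saved("7m19.948s", "6m22.691s"), 1))  # user: 13.0
```

A single run of each is not much of a benchmark, so it is worth repeating both runs a few times before trusting a difference of this size.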
On another note, look at the switches you have available. For me,
turning off filtering in the generate step and turning off parsing
during the fetch gave a nice boost. I run filtering on the crawldb from
time to time, so there's no need to duplicate that effort in the
generate step, where it really slows things down. I run the parse after
the fetch is done, and my combined times seem shorter than doing both in
one step, as I'm both CPU and bandwidth throttled. As always, your
mileage may vary, so give some things a try and you might get a nice
surprise in improved speed.
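For anyone who wants to script these tweaks, here is a rough Python sketch that only builds the environment and the command lines; it does not run anything. It rests on several assumptions that should be checked against your own install: that bin/nutch reads the NUTCH_OPTS and NUTCH_HEAPSIZE environment variables, that generate accepts -noFilter, and that fetch accepts -noParsing (running each command without arguments should print its usage). The 512 MB heap and the SEGMENT placeholder are illustrative only, and the function names are made up.

```python
import os


def tuned_env(base_env, heap_mb=None, server=True):
    """Return a copy of base_env carrying JVM options for bin/nutch.

    Assumption: bin/nutch passes NUTCH_OPTS to java and uses
    NUTCH_HEAPSIZE (in MB) for the maximum heap.
    """
    env = dict(base_env)
    if server:
        # Append -server to any options that are already set.
        env["NUTCH_OPTS"] = (env.get("NUTCH_OPTS", "") + " -server").strip()
    if heap_mb is not None:
        env["NUTCH_HEAPSIZE"] = str(heap_mb)
    return env


def split_steps(crawldb, segments_dir, segment, nutch="bin/nutch"):
    """Command lines for generate without filtering, then fetch, then parse."""
    return [
        [nutch, "generate", crawldb, segments_dir, "-noFilter"],
        [nutch, "fetch", segment, "-noParsing"],
        [nutch, "parse", segment],
    ]


if __name__ == "__main__":
    env = tuned_env(os.environ, heap_mb=512)  # 512 is a hypothetical value
    print("NUTCH_OPTS=" + env["NUTCH_OPTS"])
    print("NUTCH_HEAPSIZE=" + env["NUTCH_HEAPSIZE"])
    # SEGMENT stands for the directory that the generate step creates.
    for cmd in split_steps("crawl/crawldb", "crawl/segments",
                           "crawl/segments/SEGMENT"):
        print(" ".join(cmd))
```

To actually run a step, each list could be handed to subprocess.run(cmd, env=env, check=True). If your JVM only honours -server as the first argument, editing bin/nutch directly, as described above, is the safer route.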
--
rp