Posted to dev@nutch.apache.org by AJ Chen <ca...@gmail.com> on 2006/10/27 18:32:23 UTC

How to minimize reduce operations when using a single machine

I am using the 0.9-dev code with the local file system to crawl on a single
machine. After fetching pages, Nutch spends a huge amount of time in the
"reduce > sort" and "reduce > reduce" phases. This should not be necessary,
since only the local file system is in use. I'm not familiar with the
map-reduce code, but I guess it may be possible to control the number of map
and reduce operations. Is it possible to configure Nutch to break the fetch
job into only a few sub-operations, so that there are only one or a few map
and reduce operations? What setting or code can be changed to minimize the
time spent on map-reduce operations when crawling with a single machine?

Thanks,
AJ