Posted to user@nutch.apache.org by Olive g <ol...@hotmail.com> on 2006/03/15 17:41:07 UTC

Question on scalability

Hi everyone,

I am hoping someone can help me with this. I am indexing about 2 million
URLs on 12 machines, and I found that the results do not scale well. For
example:

when mapred.reduce.tasks was set to 12, the job took about 20 minutes in
total (11 minutes for reduce);
when mapred.reduce.tasks was set to 24, the job took about 28 minutes in
total (20 minutes for reduce);
when mapred.reduce.tasks was set to 6, the job took about 24 minutes in
total (16 minutes for reduce).

Is Hadoop/Nutch scalable at all, or can I tune some other parameters?

I already have:
mapred.map.tasks set to 100
mapred.job.tracker is not local
mapred.tasktracker.tasks.maximum is 2.
and everything else is default.

I would appreciate any advice on this.
Thank you.



Re: Question on scalability

Posted by Doug Cutting <cu...@apache.org>.
Olive g wrote:
> Is Hadoop/Nutch scalable at all, or can I tune some other parameters?

I'm not sure what you're asking.  How long does it take to run this on a 
single machine?  My guess is that it's much longer.  So things are 
scaling: they're running faster when more hardware is added.  In all 
cases you're using the same number of machines, but varying parameters 
and seeing different performance, as one would expect.  For your current 
configuration, indexing appears fastest when the number of reduce tasks 
equals the number of nodes.
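
As a rough reading of the reported timings, the sketch below subtracts the
reduce time from the total for each setting. It assumes the "minutes for
reduce" figure is a separate, non-overlapping part of the total, which the
original message does not confirm; the function and variable names are
illustrative only.

```python
# Timings reported in the question, in minutes, keyed by mapred.reduce.tasks.
# Each value is (total job time, reduce time).
reported = {
    6: (24, 16),
    12: (20, 11),
    24: (28, 20),
}


def non_reduce_minutes(total_min, reduce_min):
    """Time not attributed to reduce (assumes the two figures do not overlap)."""
    return total_min - reduce_min


# Non-reduce time for each setting of mapred.reduce.tasks.
breakdown = {n: non_reduce_minutes(t, r) for n, (t, r) in reported.items()}

# Setting with the smallest total job time.
fastest = min(reported, key=lambda n: reported[n][0])

for n in sorted(reported):
    total_min, reduce_min = reported[n]
    print(f"reduce.tasks={n:2d} total={total_min} reduce={reduce_min} other={breakdown[n]}")
print("fastest setting:", fastest)
```

Under that assumption the non-reduce portion stays at roughly 8 to 9 minutes
in all three runs, so nearly all of the variation is in the reduce phase.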

> I already have:
> mapred.map.tasks set to 100
> mapred.job.tracker is not local
> mapred.tasktracker.tasks.maximum is 2.
> and everything else is default.

How are you storing things?  Are you using dfs?

Are your nodes single-cpu or dual-cpu?  My guess is single-cpu, in which 
case you might see more consistent performance with 
mapred.tasktracker.tasks.maximum=1.
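
A simplified way to think about this setting is sketched below. It treats
mapred.tasktracker.tasks.maximum as the number of tasks a node may run at
once, and it assumes tasks are spread evenly across the 12 nodes. The
single-CPU figure is the guess made above, not a confirmed fact, and the
helper names are hypothetical.

```python
import math


def task_slots(nodes, tasks_maximum):
    """Cluster-wide number of tasks that can run at the same time."""
    return nodes * tasks_maximum


def reduce_waves(reduce_tasks, slots):
    """Rounds needed to run all reduce tasks, assuming an even spread."""
    return math.ceil(reduce_tasks / slots)


NODES = 12          # from the question
CPUS_PER_NODE = 1   # assumption: single-CPU nodes (a guess, not confirmed)

for tasks_maximum in (1, 2):
    slots = task_slots(NODES, tasks_maximum)
    # Worst-case number of tasks competing for one CPU on a node.
    per_cpu = tasks_maximum / CPUS_PER_NODE
    for reduce_tasks in (6, 12, 24):
        waves = reduce_waves(reduce_tasks, slots)
        print(f"tasks.maximum={tasks_maximum} slots={slots} "
              f"reduce.tasks={reduce_tasks} waves={waves} tasks/cpu<={per_cpu:g}")
```

In this model, tasks.maximum=2 on a single-CPU node lets two tasks compete
for one CPU, while tasks.maximum=1 avoids that contention at the cost of
running 24 reduce tasks in two rounds instead of one.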

How many disks do you have per node?  If you have multiple drives, then 
configuring mapred.local.dir to contain a list of directories, one per 
drive, might make things faster.

Doug