Posted to dev@nutch.apache.org by an...@orbita1.ru on 2005/11/21 11:22:56 UTC

mapred.map.tasks

Why does the mapred.map.tasks parameter need to be greater than the number
of available hosts? If we set it equal to the number of hosts, we get the
"negative progress percentages" problem.



RE: mapred.map.tasks

Posted by an...@orbita1.ru.
I tried to launch mapred on 2 machines: 192.168.0.250 and 192.168.0.111.

In nutch-site.xml I specified parameters:

1) On both machines:
<property>
  <name>fs.default.name</name>
  <value>192.168.0.250:9009</value>
  <description>The name of the default file system.  Either the
  literal string "local" or a host:port for NDFS.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>192.168.0.250:9010</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is "local".  
  </description>
</property>

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is "local".
  </description>
</property>
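
The descriptions above give a rule of thumb for mapred.map.tasks relative to
the number of hosts. A minimal Python sketch of checking a configuration
against it follows. The function names, the wrapping <conf> root element, and
the reading of "several times greater" as "at least 2x" (min_factor) are
assumptions made for illustration; none of this is part of Nutch.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper root; only the <property> entries mirror the post.
SAMPLE = """<conf>
<property><name>mapred.map.tasks</name><value>2</value></property>
<property><name>mapred.reduce.tasks</name><value>2</value></property>
</conf>"""


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the text."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")}


def check_task_counts(props, num_hosts, min_factor=2):
    """Compare mapred.map.tasks with the guidance in its description.

    min_factor is an assumed interpretation of "several times greater".
    Returns a list of human-readable notes (empty if nothing stands out).
    """
    notes = []
    map_tasks = int(props["mapred.map.tasks"])
    if map_tasks < min_factor * num_hosts:
        notes.append("mapred.map.tasks=%d is fewer than %d x %d hosts"
                     % (map_tasks, min_factor, num_hosts))
    return notes


props = read_properties(SAMPLE)
for note in check_task_counts(props, num_hosts=2):
    print(note)
```

With the values from this post (2 map tasks, 2 hosts) the check reports one
note; it says nothing about whether that setting causes the problem below.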
 



On 192.168.0.250 I started:
2)       bin/nutch-daemon.sh start datanode
3)       bin/nutch-daemon.sh start namenode
4)       bin/nutch-daemon.sh start jobtracker
5)       bin/nutch-daemon.sh start tasktracker

I created a directory named seeds with a file named urls in it. The urls
file contained 2 links. Then I added that directory to NDFS
(bin/nutch ndfs -put ./seeds seeds). The directory was added successfully.

 

Then I ran the command:
bin/nutch crawl seeds -depth 2

As a result, I got the following log written by the jobtracker:
....
051123 053118 Adding task 'task_m_z66npx' to set for tracker 'tracker_53845'
051123 053118 Adding task 'task_m_xaynqo' to set for tracker 'tracker_11518'
051123 053130 Task 'task_m_z66npx' has finished successfully.
 

Log written by tasktracker on 192.168.0.111:
......
051110 142607 task_m_z66npx 0.0% /user/root/seeds/urls:0+31
051110 142607 task_m_z66npx 1.0% /user/root/seeds/urls:0+31
051110 142607 Task task_m_z66npx is done.
 

Log written by tasktracker on 192.168.0.250:
....
051123 053125 task_m_xaynqo 0.12903225% /user/root/seeds/urls:31+31
051123 053126 task_m_xaynqo -683.9677% /user/root/seeds/urls:31+31
051123 053127 task_m_xaynqo -2129.9678% /user/root/seeds/urls:31+31
051123 053128 task_m_xaynqo -3483.0322% /user/root/seeds/urls:31+31
051123 053129 task_m_xaynqo -4976.2256% /user/root/seeds/urls:31+31
051123 053130 task_m_xaynqo -6449.1934% /user/root/seeds/urls:31+31
051123 053131 task_m_xaynqo -7898.258% /user/root/seeds/urls:31+31
051123 053132 task_m_xaynqo -9232.193% /user/root/seeds/urls:31+31
051123 053133 task_m_xaynqo -10694.3545% /user/root/seeds/urls:31+31
051123 053134 task_m_xaynqo -12139.226% /user/root/seeds/urls:31+31
051123 053135 task_m_xaynqo -13416.677% /user/root/seeds/urls:31+31
051123 053136 task_m_xaynqo -14885.741% /user/root/seeds/urls:31+31
... and so on, i.e. the log kept showing records with decreasing percentages.
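
To spot such tasks without reading the log by eye, the tasktracker lines
above can be scanned with a short Python sketch like the one below. The line
layout in the pattern is inferred only from the excerpts in this message, and
reading "31+31" as start offset plus length is an assumption; the function
names are hypothetical.

```python
import re

# Layout inferred from the log excerpts above:
#   <date> <time> <task id> <progress>% <path>:<start>+<length>
LINE_RE = re.compile(
    r"^(\d{6}) (\d{6}) (task_\w+) (-?\d+(?:\.\d+)?)% (\S+):(\d+)\+(\d+)$"
)


def parse_progress(line):
    """Parse one progress line; return None for any other kind of line."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    date, time, task, progress, path, start, length = m.groups()
    return {
        "date": date,
        "time": time,
        "task": task,
        "progress": float(progress),
        "path": path,
        "start": int(start),    # assumed: byte offset of the split
        "length": int(length),  # assumed: byte length of the split
    }


def negative_progress_tasks(lines):
    """Return the sorted IDs of tasks that ever report progress below zero."""
    bad = set()
    for line in lines:
        rec = parse_progress(line)
        if rec is not None and rec["progress"] < 0:
            bad.add(rec["task"])
    return sorted(bad)


LOG = [
    "051110 142607 task_m_z66npx 1.0% /user/root/seeds/urls:0+31",
    "051110 142607 Task task_m_z66npx is done.",
    "051123 053125 task_m_xaynqo 0.12903225% /user/root/seeds/urls:31+31",
    "051123 053126 task_m_xaynqo -683.9677% /user/root/seeds/urls:31+31",
]
print(negative_progress_tasks(LOG))  # prints ['task_m_xaynqo']
```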

 

I concluded that this was an attempt to split the inject across the 2
machines, i.e. there were 2 tasks: 'task_m_z66npx' and 'task_m_xaynqo'.
'task_m_z66npx' finished successfully, while 'task_m_xaynqo' caused
problems (negative progress).
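
The two ranges in the logs (urls:0+31 and urls:31+31) are consistent with a
62-byte file divided evenly between 2 map tasks. The Python sketch below
shows that arithmetic only. It assumes the notation means start+length and
that the input is cut into equal byte ranges; the 62-byte size is inferred
from the logs, and the function is hypothetical, not Nutch's actual
splitting code.

```python
def byte_range_splits(file_length, num_tasks):
    """Divide [0, file_length) into num_tasks contiguous (start, length) ranges.

    Each range gets file_length // num_tasks bytes; any remainder bytes
    are added to the last range.
    """
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    base = file_length // num_tasks
    splits = []
    for i in range(num_tasks):
        start = i * base
        if i < num_tasks - 1:
            length = base
        else:
            length = file_length - start  # last range absorbs the remainder
        splits.append((start, length))
    return splits


# 62 bytes and 2 tasks reproduce the ranges seen in the tasktracker logs.
for start, length in byte_range_splits(62, 2):
    print("urls:%d+%d" % (start, length))
# prints:
# urls:0+31
# urls:31+31
```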

But if I change the parameter mapred.reduce.tasks to 4, all tasks finish
successfully and everything works correctly.



-----Original Message-----
From: Doug Cutting [mailto:cutting@nutch.org] 
Sent: Tuesday, November 22, 2005 2:10 AM
To: nutch-dev@lucene.apache.org
Subject: Re: mapred.map.tasks

anton@orbita1.ru wrote:
> Why we need parameter mapred.map.tasks greater than number of available
> host? If we set it equal to number of host, we got "negative progress
> percentages" problem.

Can you please post a simple example that demonstrates the "negative 
progress" problem?  E.g., the minimal changes to your conf/ directory 
required to illustrate this, how you start your daemons, etc.

Thanks,

Doug


