Posted to user@nutch.apache.org by Daniel Clark <da...@verizon.net> on 2007/11/08 19:46:15 UTC

Cluster hadoop-site.xml Settings

I have a nine-box cluster running Hadoop and I want to get the best
performance out of it.  I'm crawling 5 million sites.  There are three
settings in hadoop-site.xml that I'm not clear on.  Please help.

 

Mapred Map & Reduce Tasks
=========================

I set the values below based on the note in each description, but the wiki
says to use multiples of the number of slave hosts.  Can I raise these to 36
(4 x 9 hosts) or even more to speed up my crawl?  What is recommended?

 

<property>
  <name>mapred.map.tasks</name>
  <value>9</value>
  <description>
    define mapred.map tasks to be number of slave hosts
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>9</value>
  <description>
    define mapred.reduce tasks to be number of slave hosts
  </description>
</property>
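
To make the question concrete, the variant being asked about would look
roughly like the sketch below.  The value 36 comes from the question itself
(4 x 9 slave hosts); it is an assumption for illustration, not a tested or
recommended setting, and whether it actually speeds up the crawl is exactly
what is unclear.

```xml
<!-- Sketch only: 36 = 4 x 9 slave hosts, taken from the question above.
     Not a verified recommendation. -->
<property>
  <name>mapred.map.tasks</name>
  <value>36</value>
  <description>
    assumed: a multiple (4x) of the number of slave hosts
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>36</value>
  <description>
    assumed: a multiple (4x) of the number of slave hosts
  </description>
</property>
```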

 

Replication
===========

The wiki said to use 2 or 3.  Why?  What is recommended for the best
performance?

 

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
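
For comparison, the other value the wiki mentions would be set as in the
sketch below.  With 3, each HDFS block is stored on three datanodes instead
of two, which costs more disk space and write traffic but tolerates losing
more nodes.  Which of the two is better for crawl performance on nine boxes
is the open question; the value here is only the alternative, not a
recommendation.

```xml
<!-- Sketch only: the alternative replication factor mentioned on the wiki. -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```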

 

~~~~~~~~~~~~~~~~~~~~~

Daniel Clark, President

DAC Systems, Inc.

 (703) 403-0340

~~~~~~~~~~~~~~~~~~~~~