Posted to user@nutch.apache.org by Daniel Clark <da...@verizon.net> on 2007/11/08 19:46:15 UTC
Cluster hadoop-site.xml Settings
I have a nine-box cluster running Hadoop and I want to get the best
performance out of it. I'm crawling 5 million sites. There are three settings
in hadoop-site.xml that I'm not clear on. Please help.
Mapred Map & Reduce Tasks
========================
I have the following, based on the description note, but the wiki says to use
multiples of the number of slave hosts. Can I raise both values to 36 or even
more to speed up my crawl? What is recommended?
<property>
<name>mapred.map.tasks</name>
<value>9</value>
<description>
define mapred.map tasks to be number of slave hosts
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>9</value>
<description>
define mapred.reduce tasks to be number of slave hosts
</description>
</property>
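For concreteness, the 36-task variant asked about above would look like the
sketch below. It assumes a multiplier of 4 tasks per slave host (9 x 4 = 36);
the multiplier only illustrates "a multiple of the number of slave hosts" and
is not a tested recommendation.

```xml
<!-- Sketch only: 9 slave hosts x 4 (assumed multiplier) = 36 -->
<property>
<name>mapred.map.tasks</name>
<value>36</value>
<description>
hypothetical: 4 x the number of slave hosts
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>36</value>
<description>
hypothetical: 4 x the number of slave hosts
</description>
</property>
```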
Replication
===========
The wiki said to use 2 or 3. Why? What is recommended for the best
performance?
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
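For comparison, the other value the wiki mentions would be set as in the
sketch below. The description text is an added note, not wording from the
wiki: a higher factor keeps more copies of each block, which costs more disk
space and write traffic but tolerates more failed nodes.

```xml
<!-- Sketch only: the alternative replication factor mentioned by the wiki -->
<property>
<name>dfs.replication</name>
<value>3</value>
<description>
number of copies kept of each block (added note, not from the wiki)
</description>
</property>
```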
~~~~~~~~~~~~~~~~~~~~~
Daniel Clark, President
DAC Systems, Inc.
(703) 403-0340
~~~~~~~~~~~~~~~~~~~~~