Posted to user@nutch.apache.org by mo...@richmondinformatics.com on 2006/03/22 11:05:58 UTC

Tuning nutch-0.8-dev (rev-374745 of 2006-02-03)

Hello Team,

Thanks to Andrzej for his support and a number of high level pointers in the
matter of performance tuning.

I am running the above version of Nutch with mapred/NDFS across a cluster of
five servers: one acting as namenode and jobtracker, and all five acting as
datanodes and tasktrackers.

Each server has a Xeon processor and 2 GB of RAM, and they are interconnected
by Gigabit copper Ethernet.

My aim is to do a shallow whole-web crawl as fast as possible.  Later, after
analysing the returns, I intend to re-tune in order to crawl priority sites
more deeply.

In order to achieve speed, the strategy has been to:

- Allocate 7 Mbps of bandwidth

- Filter out pdf, doc, xls and other binary/media suffixes in
  regex-urlfilter.txt (see the sketch after this list):

-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|pdf|doc)$

- Limit http.content.limit to 64k in nutch-site.xml and mapred-site.xml

  <name>http.content.limit</name>
  <value>65536</value>

- Limit the number of pages per site (unconfirmed)

  <name>generate.max.per.host</name>
  <value>10</value>

- Allocate a large number (150) of threads to the fetcher

  <name>fetcher.threads.fetch</name>
  <value>150</value>
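
For context on the URL filter entry above: in regex-urlfilter.txt the rules
are applied in order and the first matching pattern decides, so the suffix
exclusion has to appear before the final catch-all rule.  A minimal sketch of
the relevant portion of the file (the comments are mine):

  # skip URLs ending in unwanted suffixes; first matching rule wins
  -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|pdf|doc)$

  # accept anything else
  +.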

I did at one time see the fetcher fetching at speeds of up to 19 pages/s.
That must have been beginner's luck, though, because since then I've been
getting results like these:

060321 220217 task_m_3pokfb 0.21248446% 29936 pages, 2811 errors, 5.9 pages/s,
1087 kb/s,
060321 220218 task_m_3pokfb 0.21250586% 29939 pages, 2811 errors, 5.9 pages/s,
1087 kb/s,

I am confused because the "1087 kb/s" looks pretty healthy, and I'm wondering
how to get back up to, say, 12 pages/s.
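
(As a back-of-envelope check on those two counters, assuming they measure the
same transfers: 1087 kb/s / 5.9 pages/s is roughly 184 kb per page on average,
nearly three times the 64k content limit.)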

Also, on the subject of tuning for speed, I am confused about the relevance of
the "-numFetchers n" flag in the "generate" command.  I understand that it
causes "n" segments to be created, but, when using mapred, does the "fetch"
command then understand that it should allocate one fetcher per segment?

If so, is the benefit -

- resilience so that failed fetches can be re-started individually

- performance; or

- is it a throwback to pre-mapred implementations?

BTW, once I understand the performance envelope, I would like to add a
tuning tutorial to the FAQ, or perhaps to Dennis Kubes's NutchHadoopTutorial
(thanks Dennis).

Many thanks,

Monu Ogbe
---------

PS.  For completeness, the following is my nutch-site.xml.  mapred-site.xml is
an exact copy of it.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<!-- Do not modify this file directly.  Instead, copy entries that you -->
<!-- wish to modify from this file into nutch-site.xml and change them -->
<!-- there.  If nutch-site.xml does not already exist, create it.      -->

<nutch-conf>

<!-- HTTP properties -->

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation is applied.
  </description>
</property>

<!-- FTP properties -->

<!-- web db properties -->

<property>
  <name>db.max.outlinks.per.page</name>
  <value>1000</value>
  <description>Was 100. The maximum number of outlinks that we'll process
  for a page.
  </description>
</property>

<!-- generate properties -->

<property>
  <name>generate.max.per.host</name>
  <value>10</value>
  <description>Was -1. The maximum number of urls per host in a single
  fetchlist.  -1 if unlimited.
  ID128 - Set super-low, for testing.
  </description>
</property>

<!-- fetcher properties -->

<property>
  <name>fetcher.threads.fetch</name>
  <value>150</value>
  <description>The number of FetcherThreads the fetcher should use.
    This also determines the maximum number of requests that are
    made at once (each FetcherThread handles one connection).</description>
</property>

<property>
  <name>fetcher.parse</name>
  <value>true</value>
  <description>If true, fetcher will parse content.
  For ID128 purposes this will be set using the
  -noParse option in the command line, instead.
  </description>
</property>

<!-- i/o properties -->

<!-- file system properties -->

<property>
  <name>fs.default.name</name>
  <value>nutch1.my.domain:50000</value>
  <description>The name of the default file system.  Either the
  literal string "local" or a host:port for NDFS.</description>
</property>

<property>
  <name>ndfs.name.dir</name>
  <value>/home/nutch/ndfs/name</value>
  <description>Determines where on the local filesystem the NDFS name node
      should store the name table.</description>
</property>

<property>
  <name>ndfs.data.dir</name>
  <value>/home/nutch/ndfs/data</value>
  <description>Determines where on the local filesystem an NDFS data node
  should store its blocks.  If this is a comma- or space-delimited
  list of directories, then data will be stored in all named
  directories, typically on different devices.</description>
</property>

<property>
  <name>ndfs.replication</name>
  <value>3</value>
  <description>How many copies we try to have at all times. The actual
  number of replications is at max the number of datanodes in the
  cluster.</description>
</property>

<!-- map/reduce properties -->

<property>
  <name>mapred.job.tracker</name>
  <value>nutch1.my.domain:50020</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/home/nutch/mapred/local</value>
  <description>The local directory where MapReduce stores intermediate
  data files.  May be a space- or comma- separated list of
  directories on different devices in order to spread disk i/o.
  </description>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/home/nutch/mapred/system</value>
  <description>The shared directory where MapReduce stores control files.
  </description>
</property>

<property>
  <name>mapred.temp.dir</name>
  <value>/home/nutch/mapred/temp</value>
  <description>A shared directory for temporary files.
  </description>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>51</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is "local".
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>5</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is "local".
  </description>
</property>

<!-- indexer properties -->

<!-- analysis properties -->

<!-- searcher properties -->

<property>
  <name>searcher.max.hits</name>
  <value>200</value>
  <description>If positive, search stops after this many hits are
  found.  Setting this to small, positive values (e.g., 1000) can make
  searches much faster.  With a sorted index, the quality of the hits
  suffers little.</description>
</property>

<!-- URL normalizer properties -->

<!-- ipc properties -->

</nutch-conf>



Re: Tuning nutch-0.8-dev (rev-374745 of 2006-02-03)

Posted by Doug Cutting <cu...@apache.org>.
monu.ogbe@richmondinformatics.com wrote:
> Also, on the subject of tuning for speed, I am confused about the relevance of
> the "-numFetchers n" flag in the "generate" command.  I understand that it
> causes "n" segments to be created, but, when using mapred, does the "fetch"
> command then understand that it should allocate one fetcher per segment?

In 0.8 this determines the number of input directories that will be 
generated in each segment and, consequently, the number of map tasks 
when fetching.  URLs are hashed into these so that they are host-wise 
disjoint.
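
For reference, a sketch of a "generate" invocation that uses the flag (the
crawldb and segments paths here are illustrative, not prescribed):

  bin/nutch generate crawl/crawldb crawl/segments -topN 1000000 -numFetchers 5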

> If so, is the benefit -
> 
> - resilience so that failed fetches can be re-started individually
> 
> - performance; or

Both of these.  Multiple fetchlists can be fetched in parallel and, if 
they crash, can be restarted.  But if you use too many and don't have 
very many unique hosts, then each will be performance-limited by 
politeness (if a task has URLs from only 2 hosts, and it waits a second 
between accesses, then it can fetch at most 2 pages/second).
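
Put as a rough formula, the politeness ceiling for a single fetch task is:

  max pages/s per task ~= (unique hosts in the task's fetchlist)
                          / (per-host delay in seconds)

where the per-host delay is the one configured via fetcher.server.delay.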

> PS.  For completeness, the following is my nutch-site.xml.  mapred-site.xml is
> an exact copy of it.
> 
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
> 
> <!-- Do not modify this file directly.  Instead, copy entries that you -->
> <!-- wish to modify from this file into nutch-site.xml and change them -->
> <!-- there.  If nutch-site.xml does not already exist, create it.      -->

This comment is confusing in a nutch-site.xml...

> <property>
>   <name>mapred.map.tasks</name>
>   <value>51</value>
>   <description>The default number of map tasks per job.  Typically set
>   to a prime several times greater than number of available hosts.
>   Ignored when mapred.job.tracker is "local".
>   </description>
> </property>
> 
> <property>
>   <name>mapred.reduce.tasks</name>
>   <value>5</value>
>   <description>The default number of reduce tasks per job.  Typically set
>   to a prime close to the number of available hosts.  Ignored when
>   mapred.job.tracker is "local".
>   </description>
> </property>

These values should not be placed in nutch-site.xml, since that causes them 
to override job-specified values.  They should instead be in 
mapred-default.xml, so that jobs can sometimes override them.  For 
example, the generate task manipulates the number of reduce tasks in 
order to generate the appropriate number of input directories for 
fetching (as described above).
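
A minimal sketch of such a mapred-default.xml, assuming the same <nutch-conf>
root element used elsewhere in this configuration (values copied from the
posted file):

  <?xml version="1.0"?>
  <nutch-conf>

  <property>
    <name>mapred.map.tasks</name>
    <value>51</value>
  </property>

  <property>
    <name>mapred.reduce.tasks</name>
    <value>5</value>
  </property>

  </nutch-conf>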

> <property>
>   <name>searcher.max.hits</name>
>   <value>200</value>
>   <description>If positive, search stops after this many hits are
>   found.  Setting this to small, positive values (e.g., 1000) can make
>   searches much faster.  With a sorted index, the quality of the hits
>   suffers little.</description>
> </property>

Make sure you're using a sorted indexer if you're using this.  Otherwise 
your results could suffer greatly.

Doug