Posted to user@nutch.apache.org by Doug Cutting <cu...@apache.org> on 2006/04/05 23:47:01 UTC

Re: Tuning nutch-0.8-dev (rev-374745 of 2006-02-03)

monu.ogbe@richmondinformatics.com wrote:
> Also, on the subject of tuning for speed, I am confused about the relevance of
> the "-numFetchers n" flag in the "generate" command. I understand that it
> causes that "n" segments to be created, but, when using mapred, does the
> "fetch" command then understand that it should allocate one fetcher per
> segment?

In 0.8 this determines the number of input directories that will be
generated in each segment, and, consequently, the number of map tasks
when fetching.  URLs are hashed into these directories so that the
directories are hostwise disjoint: all URLs from a given host end up in
the same one.
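
To illustrate what hostwise disjoint means, here is a minimal sketch in
Python.  It is not Nutch's actual partitioning code; the function name
partition_by_host, the use of MD5, and the example URLs are all
assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlparse


def partition_by_host(urls, num_fetchers):
    """Assign each URL to one of num_fetchers lists by hashing its host.

    Hypothetical helper, not the Nutch implementation.
    """
    fetchlists = defaultdict(list)
    for url in urls:
        host = urlparse(url).hostname or ""
        # Use a stable digest; Python's built-in hash() of a str changes
        # from one interpreter run to the next.
        digest = hashlib.md5(host.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % num_fetchers
        fetchlists[index].append(url)
    return dict(fetchlists)


# Made-up URLs for the example.
urls = [
    "http://a.example.com/1",
    "http://a.example.com/2",
    "http://b.example.com/1",
    "http://c.example.org/x",
]
fetchlists = partition_by_host(urls, num_fetchers=3)
```

Because the list index depends only on the host, no host can appear in
two lists, so each fetch task can enforce politeness on its own.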

> If so, is the benefit -
> 
> - resilience so that failed fetches can be re-started individually
> 
> - performance; or

Both of these.  Multiple fetchlists can be fetched in parallel, and, if
one crashes, it can be restarted on its own.  But if you use too many
and don't have very many unique hosts, then each will be
performance-limited by politeness (if a task has URLs from only 2
hosts, and it waits a second between accesses to the same host, then it
can fetch at most 2 pages/second).
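
The arithmetic behind that limit can be sketched as follows.  This is a
rough upper bound only, assuming one request at a time per host and
ignoring download time; the function name is made up for the example.

```python
def max_pages_per_second(unique_hosts, delay_seconds):
    """Upper bound on one task's fetch rate under per-host politeness.

    Assumes at most one request per host every delay_seconds and
    ignores the time spent actually downloading each page.
    """
    return unique_hosts / delay_seconds


# The case from the text: 2 hosts, 1 second between accesses.
rate = max_pages_per_second(unique_hosts=2, delay_seconds=1.0)
print(rate)  # 2.0
```

So adding more fetchlists only helps while each one still has enough
distinct hosts to keep it busy.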

> PS.  For completeness, the following is my nutch-site.xml.  mapred-site.xml is
> an exact copy of it.
> 
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
> 
> <!-- Do not modify this file directly.  Instead, copy entries that you -->
> <!-- wish to modify from this file into nutch-site.xml and change them -->
> <!-- there.  If nutch-site.xml does not already exist, create it.      -->

This comment is confusing in a nutch-site.xml...

> <property>
>   <name>mapred.map.tasks</name>
>   <value>51</value>
>   <description>The default number of map tasks per job.  Typically set
>   to a prime several times greater than number of available hosts.
>   Ignored when mapred.job.tracker is "local".
>   </description>
> </property>
> 
> <property>
>   <name>mapred.reduce.tasks</name>
>   <value>5</value>
>   <description>The default number of reduce tasks per job.  Typically set
>   to a prime close to the number of available hosts.  Ignored when
>   mapred.job.tracker is "local".
>   </description>
> </property>

These values should not be placed in nutch-site.xml, since that causes
them to override job-specified values.  They should instead be in
mapred-default.xml, so that jobs can override them when they need to.
For example, the generate task manipulates the number of reduce tasks
in order to generate the appropriate number of input directories for
fetching (as described above).
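
The precedence described here can be modelled with a toy sketch in
Python.  It is not how the configuration classes are implemented; it
only mirrors the ordering stated above (mapred-default.xml is
overridden by the job, which is overridden by nutch-site.xml).  The
function name and the value 10 are assumptions for illustration.

```python
def effective_config(mapred_default, job_settings, nutch_site):
    """Merge three layers of settings; later layers win.

    Toy model of the precedence described in the text, not real
    Nutch or Hadoop code.
    """
    merged = dict(mapred_default)
    merged.update(job_settings)
    merged.update(nutch_site)
    return merged


# A hypothetical job that asks for 10 reduce tasks.
job = {"mapred.reduce.tasks": 10}

# Value kept in mapred-default.xml: the job's request takes effect.
good = effective_config({"mapred.reduce.tasks": 5}, job, {})

# Value placed in nutch-site.xml: the job's request is silently lost.
bad = effective_config({}, job, {"mapred.reduce.tasks": 5})

print(good["mapred.reduce.tasks"], bad["mapred.reduce.tasks"])  # 10 5
```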

> <property>
>   <name>searcher.max.hits</name>
>   <value>200</value>
>   <description>If positive, search stops after this many hits are
>   found.  Setting this to small, positive values (e.g., 1000) can make
>   searches much faster.  With a sorted index, the quality of the hits
>   suffers little.</description>
> </property>

Make sure you're using a sorted index if you set this.  Otherwise the
quality of your results could suffer greatly.

Doug