Posted to user@nutch.apache.org by Doug Cutting <cu...@apache.org> on 2006/04/05 23:47:01 UTC
Re: Tuning nutch-0.8-dev (rev-374745 of 2006-02-03)
monu.ogbe@richmondinformatics.com wrote:
> Also, on the subject of tuning for speed, I am confused about the relevance of
> the "-numFetchers n" flag in the "generate" command. I understand that it
> causes that "n" segments to be created, but, when using mapred, does the
> "fetch" command then understand that it should allocate one fetcher per
> segment?
In 0.8 this determines the number of input directories that will be
generated in each segment and, consequently, the number of map tasks
when fetching. URLs are hashed into these directories by host, so that
no host appears in more than one of them.
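To illustrate the idea (this is only a rough sketch, not Nutch's actual partitioning code; the function name and the choice of CRC32 as the hash are assumptions made for the example), hashing on the host name is enough to guarantee that all URLs from one host land in the same fetchlist:

```python
from urllib.parse import urlsplit
from zlib import crc32


def partition_by_host(urls, num_fetchers):
    """Split urls into num_fetchers lists so that no host spans two lists.

    Hypothetical helper: the hash function (CRC32 of the host name) is an
    illustrative choice, not what Nutch itself uses.
    """
    fetchlists = [[] for _ in range(num_fetchers)]
    for url in urls:
        host = urlsplit(url).hostname or ""
        # Same host -> same hash -> same fetchlist.
        index = crc32(host.encode("utf-8")) % num_fetchers
        fetchlists[index].append(url)
    return fetchlists


urls = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
    "http://c.example/1",
]
for i, fetchlist in enumerate(partition_by_host(urls, 3)):
    print(i, fetchlist)
```

Because the assignment depends only on the host, some fetchlists may end up empty or unevenly sized when there are few distinct hosts.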
> If so, is the benefit -
>
> - resilience so that failed fetches can be re-started individually
>
> - performance; or
Both of these. Multiple fetchlists can be fetched in parallel and, if
one crashes, it can be restarted on its own. But if you use too many and
don't have very many unique hosts, then each will be limited by
politeness (if a task has URLs from only 2 hosts and waits a second
between accesses to each host, it can fetch at most 2 pages/second).
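The arithmetic behind that limit can be sketched as follows (a back-of-the-envelope model only, assuming one request at a time per host and ignoring download time; the function name is made up for the example):

```python
def max_pages_per_second(unique_hosts, delay_seconds):
    """Upper bound on fetch rate for one task under a per-host politeness delay.

    Hypothetical helper: each host can be hit at most once every
    delay_seconds, so the task as a whole cannot exceed
    unique_hosts / delay_seconds pages per second.
    """
    return unique_hosts / delay_seconds


# A task with URLs from only 2 hosts and a 1-second delay between accesses:
print(max_pages_per_second(2, 1.0))  # prints 2.0
```

Under this model, adding more fetchlists does not raise the total rate once every host already has its own task; only more unique hosts (or a shorter delay) does.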
> PS. For completeness, the following is my nutch-site.xml. mapred-site.xml is
> an exact copy of it.
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
>
> <!-- Do not modify this file directly. Instead, copy entries that you -->
> <!-- wish to modify from this file into nutch-site.xml and change them -->
> <!-- there. If nutch-site.xml does not already exist, create it. -->
This comment is confusing in a nutch-site.xml...
> <property>
> <name>mapred.map.tasks</name>
> <value>51</value>
> <description>The default number of map tasks per job. Typically set
> to a prime several times greater than number of available hosts.
> Ignored when mapred.job.tracker is "local".
> </description>
> </property>
>
> <property>
> <name>mapred.reduce.tasks</name>
> <value>5</value>
> <description>The default number of reduce tasks per job. Typically set
> to a prime close to the number of available hosts. Ignored when
> mapred.job.tracker is "local".
> </description>
> </property>
These values should not be placed in nutch-site.xml, since that causes
them to override job-specified values. They should instead be in
mapred-default.xml, so that jobs can override them when they need to.
For example, the generate task manipulates the number of reduce tasks in
order to generate the appropriate number of input directories for
fetching (as described above).
> <property>
> <name>searcher.max.hits</name>
> <value>200</value>
> <description>If positive, search stops after this many hits are
> found. Setting this to small, positive values (e.g., 1000) can make
> searches much faster. With a sorted index, the quality of the hits
> suffers little.</description>
> </property>
Make sure you're using a sorted index if you're using this. Otherwise
the quality of your results could suffer greatly.
Doug