Posted to user@nutch.apache.org by "Insurance Squared Inc." <gc...@insurancesquared.com> on 2006/08/13 18:28:35 UTC
nutch-xml.conf
I've lost the thread, but someone here recently asked for our Nutch
XML configuration file. Our developer's back from holidays, so I've got
the info now. Note that some of the configuration variables are not in
the default file, as we've made modifications. On our server (dual Xeon,
8 GB of RAM, SCSI RAID 0) this config will fill about a 10 Mbps line. If
the number of threads is increased to about 50, it'll fill a 40 Mbps pipe
while crawling.
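As a rough sketch of the thread change mentioned above: it would presumably be a one-property override in the same file. Only the value differs from the config below; the bandwidth you actually see will depend on your own hardware and line.

```xml
<!-- Sketch only: same property as in the file below, with the thread
     count raised to the "about 50" mentioned above. -->
<property>
<name>fetcher.threads.fetch</name>
<value>50</value>
</property>
```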
We also exclude quite a number of file types that Nutch would crawl by
default (some rather obscure program files and the like). That helped us
initially as well, as did cutting down the maximum page size we download.
There are a lot of 3, 5 and 20 MB PDFs and Word documents out there
that'll really slow things down.
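Assuming the page-size cut refers to http.content.limit (an assumption; the post does not name the property), the relevant override looks roughly like this. The value is the one used in the config file that follows; 65536 bytes is 64 KB, so anything longer is truncated.

```xml
<!-- Sketch only: truncate downloaded content at 64 KB (65536 bytes).
     Assumed to be the setting behind "cutting down the size". -->
<property>
<name>http.content.limit</name>
<value>65536</value>
</property>
```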
Without further ado, here's our current config file:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<nutch-conf>
<property>
<name>address.ip.file</name>
<value>ip-address.txt</value>
<description>Name of file on CLASSPATH containing ip addresses used by
urlfilter-ip (IPURLFilter) plugin. (Keren added)</description>
</property>
<property>
<name>db.fetch.retry.max</name>
<value>3</value>
<description>The maximum number of times a url that has encountered
recoverable errors is generated for fetch.</description>
</property>
<property>
<name>db.ignore.external.links</name>
<value>false</value>
<description>If true, when adding new links to a page, links from a
different host are ignored. (Keren added)</description>
</property>
<property>
<name>db.ignore.internal.links</name>
<value>true</value>
<description>If true, when adding new links to a page, links from the
same host are ignored. This is an effective way to limit the size of the
link database, keeping only the highest quality links.</description>
</property>
<property>
<name>db.max.outlinks.per.page</name>
<value>100</value>
<description>The maximum number of outlinks that we'll process for a
page.</description>
</property>
<property>
<name>db.score.injected</name>
<value>1.0</value>
<description>The score of new pages added by the injector.</description>
</property>
<property>
<name>db.score.link.external</name>
<value>1.0</value>
<description>The score factor for new pages added due to a link from
another host relative to the referencing page's score.
</description>
</property>
<property>
<name>db.score.link.internal</name>
<value>1.0</value>
<description>The score factor for pages added due to a link from the
same host, relative to the referencing page's score.
</description>
</property>
<property>
<name>dropped.url.file</name>
<value>/home/xxx/xxxx/nutch/dropped_urls.out</value>
<description>Name of file containing dropped urls used by urlfilter-ip
(IPURLFilter) plugin. (Keren added)</description>
</property>
<property>
<name>fetcher.server.delay</name>
<value>5.0</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>20</value>
<description>The number of FetcherThreads the fetcher should use. This
also determines the maximum number of requests that are made at once
(each FetcherThread handles one connection).</description>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value>3</value>
<description>This number is the maximum number of threads that should
be allowed to access a host at one time.</description>
</property>
<property>
<name>http.agent.email</name>
<value>xxxxxxxxx</value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header.</description>
</property>
<property>
<name>http.agent.url</name>
<value>xxxxxxxxxxxxxxxxxxx</value>
<description>A URL to advertise in the User-Agent header. This will
appear in parentheses after the agent name.</description>
</property>
<property>
<name>http.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content, in bytes. If this
value is nonnegative (>=0), content longer than it will be truncated;
otherwise, no truncation at all.</description>
</property>
<property>
<name>http.max.delays</name>
<value>3</value>
<description>The number of times a thread will delay when trying to
fetch a page. Each time it finds that a host is busy, it will wait
fetcher.server.delay. After http.max.delays attempts, it will give up on
the page for now.</description>
</property>
<property>
<name>http.redirect.max</name>
<value>0</value>
<description>The maximum number of redirects the fetcher will follow
when trying to fetch a page.</description>
</property>
<property>
<name>indexer.boost.by.link.count</name>
<value>true</value>
<description>When true, scores for a page are multiplied by the log of
the number of incoming links to the page.</description>
</property>
<property>
<name>indexer.boost.link.count.weight</name>
<value>100.0</value>
<description>Scores for a page are multiplied by the log of (the number
of incoming links to the page * this parameter). (Keren added)</description>
</property>
<property>
<name>indexer.score.power</name>
<value>0.5</value>
<description>Determines the power of link analysis scores. Each page's
boost is set to <I>score<SUP>scorePower</SUP></I> where <I>score</I> is
its link analysis score and <I>scorePower</I> is the value of this
parameter. This is compiled into indexes, so, when this is changed, pages
must be re-indexed for it to take effect.</description>
</property>
<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|protocol-httpclient|urlfilter-ip|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded. In any case
you need to include at least the nutch-extensionpoints plugin. By default
Nutch includes crawling just HTML and plain text via HTTP, and basic
indexing and search plugins.
</description>
</property>
<property>
<name>urlfilter.ip.file</name>
<value>ip-urlfilter.txt</value>
<description>Name of file on CLASSPATH containing regular expressions
used by urlfilter-ip (IPURLFilter) plugin. (Keren added)</description>
</property>
</nutch-conf>