Posted to user@nutch.apache.org by "Insurance Squared Inc." <gc...@insurancesquared.com> on 2006/08/13 18:28:35 UTC

nutch-xml.conf

I've lost the thread, but someone here recently asked for our nutch 
xml configuration file.  Our developer's back from holidays so I've got 
the info now.  Note that some of the configuration variables are not in 
the default file, as we've made modifications.  On our server (dual Xeon, 
8 GB of RAM, SCSI RAID 0) this config will fill about a 10 Mbps line.  If 
the number of threads is increased to about 50, it'll fill a 40 Mbps pipe 
while crawling.

We also exclude quite a number of file types that nutch would crawl by 
default (some rather obscure program files and the like).  That helped 
us initially, as did cutting down the maximum page size we download.  
There are a lot of 3/5/20 MB PDFs and Word documents out there that'll 
really slow things down.
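
For anyone who wants to see the idea outside of nutch, here is a rough Python sketch of the two tricks above: skipping URLs by file suffix and truncating large downloads. It is not our plugin code. The suffix list and the function names are made up for illustration (I haven't listed which extensions we actually exclude); only the 65536 byte limit comes from the http.content.limit setting in the config.

```python
from urllib.parse import urlparse

# Hypothetical suffix list, for illustration only.
SKIPPED_SUFFIXES = (".exe", ".zip", ".iso", ".dmg")


def should_fetch(url: str) -> bool:
    """Return False for URLs whose path ends in a skipped suffix."""
    # Compare on the path only, so query strings do not hide the suffix.
    path = urlparse(url).path.lower()
    return not path.endswith(SKIPPED_SUFFIXES)


def truncate(content: bytes, limit: int = 65536) -> bytes:
    """Mimic http.content.limit: a negative limit means no truncation."""
    return content if limit < 0 else content[:limit]
```

With a limit like this, a 20 MB PDF costs at most 64 KB of stored content per fetch in this sketch, and a program download is never requested at all.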

Without further ado, here's our current config file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<nutch-conf>

<property>
 <name>address.ip.file</name>
 <value>ip-address.txt</value>
 <description>Name of file on CLASSPATH containing ip addresses used by
urlfilter-ip (IPURLFilter) plugin. (Keren added)</description>
</property>

<property>
 <name>db.fetch.retry.max</name>
 <value>3</value>
 <description>The maximum number of times a url that has encountered 
recoverable
errors is generated for fetch.</description>
</property>

<property>
 <name>db.ignore.external.links</name>
 <value>false</value>
 <description>If true, when adding new links to a page, links from a
different host are ignored. (Keren added)</description>
</property>

<property>
 <name>db.ignore.internal.links</name>
 <value>true</value>
 <description>If true, when adding new links to a page, links from the
same host are ignored.  This is an effective way to limit the size of the
link database, keeping only the highest quality links.</description>
</property>

<property>
 <name>db.max.outlinks.per.page</name>
 <value>100</value>
 <description>The maximum number of outlinks that we'll process for a
page.</description>
</property>

<property>
 <name>db.score.injected</name>
 <value>1.0</value>
 <description>The score of new pages added by the injector.</description>
</property>

<property>
 <name>db.score.link.external</name>
 <value>1.0</value>
 <description>The score factor for new pages added due to a link from
another host relative to the referencing page's score.
 </description>
</property>

<property>
 <name>db.score.link.internal</name>
 <value>1.0</value>
 <description>The score factor for pages added due to a link from the 
same host,
relative to the referencing page's score.
 </description>
</property>

<property>
 <name>dropped.url.file</name>
 <value>/home/xxx/xxxx/nutch/dropped_urls.out</value>
 <description>Name of file containing dropped urls used by urlfilter-ip
(IPURLFilter) plugin. (Keren added)</description>
</property>

<property>
 <name>fetcher.server.delay</name>
 <value>5.0</value>
 <description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>

<property>
 <name>fetcher.threads.fetch</name>
 <value>20</value>
 <description>The number of FetcherThreads the fetcher should use. This
also determines the maximum number of requests that are made at once (each
FetcherThread handles one connection).</description>
</property>

<property>
 <name>fetcher.threads.per.host</name>
 <value>3</value>
 <description>This number is the maximum number of threads that should 
be allowed
to access a host at one time.</description>
</property>

<property>
 <name>http.agent.email</name>
 <value>xxxxxxxxx</value>
 <description>An email address to advertise in the HTTP 'From' request 
header and
User-Agent header.</description>
</property>

<property>
 <name>http.agent.url</name>
 <value>xxxxxxxxxxxxxxxxxxx</value>
 <description>A URL to advertise in the User-Agent header. This will
appear in parentheses after the agent name.</description>
</property>

<property>
 <name>http.content.limit</name>
 <value>65536</value>
 <description>The length limit for downloaded content, in bytes. If this
value is nonnegative (>=0), content longer than it will be truncated;
otherwise, no truncation at all.</description>
</property>

<property>
 <name>http.max.delays</name>
 <value>3</value>
 <description>The number of times a thread will delay when trying to
fetch a page.  Each time it finds that a host is busy, it will wait
fetcher.server.delay.  After http.max.delays attempts, it will give up on
the page for now.</description>
</property>

<property>
 <name>http.redirect.max</name>
 <value>0</value>
 <description>The maximum number of redirects the fetcher will follow 
when trying
to fetch a page.</description>
</property>

<property>
 <name>indexer.boost.by.link.count</name>
 <value>true</value>
 <description>When true, scores for a page are multiplied by the log of
the number of incoming links to the page.</description>
</property>

<property>
 <name>indexer.boost.link.count.weight</name>
 <value>100.0</value>
 <description>Scores for a page are multiplied by the log of (the number
of incoming links to the page * this parameter). (Keren added)</description>
</property>

<property>
 <name>indexer.score.power</name>
 <value>0.5</value>
 <description>Determines the power of link analysis scores.  Each page's
boost is set to <I>score<SUP>scorePower</SUP></I> where <I>score</I> is its
link analysis score and <I>scorePower</I> is the value of this parameter.
This is compiled into indexes, so, when this is changed, pages must be
re-indexed for it to take effect.</description>
</property>

<property>
 <name>plugin.includes</name>
 
 <value>nutch-extensionpoints|protocol-httpclient|urlfilter-ip|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
 <description>Regular expression naming plugin directory names to
include.  Any plugin not matching this expression is excluded. In any case
you need to include at least the nutch-extensionpoints plugin. By default
Nutch includes crawling just HTML and plain text via HTTP, and basic
indexing and search plugins.</description>
</property>

<property>
 <name>urlfilter.ip.file</name>
 <value>ip-urlfilter.txt</value>
 <description>Name of file on CLASSPATH containing regular expressions
used by urlfilter-ip (IPURLFilter) plugin. (Keren added)</description>
</property>
</nutch-conf>
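
If you copy this file out of an email, it's worth checking that it still parses, since mail clients tend to wrap long lines in the middle of tags and values. Below is a small Python sketch for that, using only the standard library. The load_properties helper and the two-property sample are just for illustration and are not part of nutch; point it at your own file contents instead.

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text: str) -> dict:
    """Parse a nutch-conf style document into a {name: value} dict.

    Raises xml.etree.ElementTree.ParseError if the text is not
    well-formed, e.g. when a closing tag was wrapped across two lines.
    """
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


# Shortened sample using two of the properties from the config above.
sample = """<nutch-conf>
<property>
 <name>fetcher.threads.fetch</name>
 <value>20</value>
</property>
<property>
 <name>http.content.limit</name>
 <value>65536</value>
</property>
</nutch-conf>"""

props = load_properties(sample)
print(props["fetcher.threads.fetch"], props["http.content.limit"])
```

A parse error here usually points at the line where the mail wrapping broke a tag.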