Posted to user@nutch.apache.org by Audrey Liu <au...@gmail.com> on 2007/07/20 22:56:52 UTC

tweaking config files for better performance

Hi,

I am using Nutch 0.9 to crawl our Intranet site (~60,000 pages, ~28,000 of
them HTML). I've seen other posts where people mention getting their crawler
to fetch 20 pages/sec, but the best I've seen so far is only 8 pages/sec.

I've also read that fetcher threads tend to block when they try to fetch
pages from the same host. What configuration should I use to get the best
performance? My current settings in nutch-site.xml are as follows:

<property>
  <name>fetcher.threads.fetch</name>
  <value>200</value>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>50</value>
</property>

<property>
  <name>http.max.delays</name>
  <value>1</value>
</property>
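
For context, overrides like the three above belong inside the <configuration>
root element of conf/nutch-site.xml. A minimal sketch of the complete file,
assuming the standard Hadoop-style layout; the values are the ones quoted
above, and the comments are illustrative paraphrases of the property
descriptions rather than tuning advice:

```xml
<?xml version="1.0"?>
<!-- Sketch only: site-specific overrides of the defaults in nutch-default.xml. -->
<configuration>

  <!-- Total number of fetcher threads. -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>200</value>
  </property>

  <!-- Maximum number of threads allowed to access one host at a time. -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>50</value>
  </property>

  <!-- How many times a thread waits on a busy host before giving up on
       the page for now. -->
  <property>
    <name>http.max.delays</name>
    <value>1</value>
  </property>

</configuration>
```

Because the whole crawl targets a single Intranet host, the per-host limit is
likely to matter more than the total thread count.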

Any pointers are greatly appreciated!! Thanks in advance.

AL
-- 
View this message in context: http://www.nabble.com/tweaking-config-files-for-better-performance-tf4119552.html#a11715927
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: tweaking config files for better performance

Posted by Audrey Liu <au...@gmail.com>.
Hi,

Thanks for the reply!

I've tried the configuration from the link, but it didn't seem to help much;
at least, it didn't get me up to 20 pages/sec. Could it be because I'm
crawling an Intranet site?

I'd really like to know how other people got their crawls to run so fast.

Any pointers are appreciated! Thanks!!

Audrey

