Posted to user@nutch.apache.org by Feng Ji <fe...@gmail.com> on 2006/09/04 16:14:20 UTC

how to speed up crawling procedure

hi there,

Using Nutch 0.8, it takes me more than a day to crawl 30,000 pages
from one crawldb list. I am running Linux and Java 1.5 on a dual-CPU Dell
server.

My fetch settings are the defaults, which means the fetched file size is limited.

I wonder if there is anything else I can do to speed up the crawling
process. I am considering splitting the one crawldb list into multiple
ones; will that help me?

thanks for your time,

Michael,

Re: how to speed up crawling procedure

Posted by Frank Kempf <fl...@2112portals.com>.
No,
take this, put it into your nutch-site.xml, and adjust the number of threads.
Use as many as your OS, network, RAM, and CPU can afford.

<property>
   <name>fetcher.threads.fetch</name>
   <value>34</value>
   <description>The number of FetcherThreads the fetcher should use.
     This also determines the maximum number of requests that are
     made at once (each FetcherThread handles one connection).</description>
</property>
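
If many of your URLs come from the same host, the per-host limit also matters, since it caps how many of those fetcher threads can hit a single host at once. A related property from nutch-default.xml looks like the following; the value and description shown are from memory of the stock 0.8 defaults, so check your own nutch-default.xml:

```xml
<property>
   <name>fetcher.threads.per.host</name>
   <value>1</value>
   <description>The maximum number of threads that should be
     allowed to access a host at one time.</description>
</property>
```

Raising fetcher.threads.fetch alone will not help if all your URLs are on one host and this stays at 1.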

Kind Regards

   Frank

Re: how to speed up crawling procedure

Posted by Feng Ji <fe...@gmail.com>.
hi Frank:

Is the following config what you mean for the thread setup?

fetcher.threads.per.host in nutch-default.xml

thanks,

Michael,


On 9/4/06, Frank Kempf <fl...@2112portals.com> wrote:
>
> Hi,
> this sure is a question about scaling an application in general.
> You could be either bottlenecked by
> 1. Network
> 2. RAM
> 3. CPU
>
> Maybe you are not using enough threads for crawling.
> Check the logs, maybe you get hundreds of timeouts from unavailable
> servers.
> Also the logs have timestamps which may help to track time consuming
> operations.
>
> I use about 32 threads to crawl on an AMD 2300+ machine with 512 MB of
> RAM and a fast network on a VServer, and fetching 25,000 URLs takes me a
> few hours (fetching HTML only).
> I do not think 30,000 URLs is a big enough job to make splitting
> absolutely necessary.
>
> Kind regards
>
>    Frank
>

Re: how to speed up crawling procedure

Posted by Frank Kempf <fl...@2112portals.com>.
Hi,
this sure is a question about scaling an application in general.
You could be either bottlenecked by
1. Network
2. RAM
3. CPU

Maybe you are not using enough threads for crawling.
Check the logs; maybe you are getting hundreds of timeouts from unavailable servers.
The logs also have timestamps, which may help to track down time-consuming operations.
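
As a concrete illustration of that log check, a quick grep can count timeouts. The sample log below is fabricated for illustration only; point it at your real fetcher log and grep for whatever timeout message your version actually writes:

```shell
# Count fetch timeouts in a fetcher log. The log contents here are
# made up to show the idea; substitute your real log file and the
# timeout message it actually contains.
log=$(mktemp)
cat > "$log" <<'EOF'
2006-09-04 10:00:22 WARN fetcher.Fetcher - failed: java.net.SocketTimeoutException: Read timed out
2006-09-04 10:00:30 INFO fetcher.Fetcher - fetching http://example.com/b
2006-09-04 10:00:51 WARN fetcher.Fetcher - failed: java.net.SocketTimeoutException: Read timed out
EOF
timeouts=$(grep -c 'SocketTimeoutException' "$log")
echo "timeouts: $timeouts"    # prints "timeouts: 2"
rm -f "$log"
```

If the timeout count is a large fraction of your fetchlist, the bottleneck is slow or dead remote servers, not your hardware, and more threads will help far more than more RAM or CPU.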

I use about 32 threads to crawl on an AMD 2300+ machine with 512 MB of RAM
and a fast network on a VServer, and fetching 25,000 URLs takes me a few
hours (fetching HTML only).
I do not think 30,000 URLs is a big enough job to make splitting absolutely
necessary.
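
That said, if you do eventually want smaller fetch rounds, you do not need to split the crawldb itself: as I recall from the 0.8 tooling, the generate step takes a -topN limit so each segment only gets the top-scoring N URLs. The paths here are just examples:

```shell
# Generate a fetchlist capped at 10,000 URLs (example paths).
bin/nutch generate crawl/crawldb crawl/segments -topN 10000
```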

  Kind regards

    Frank