Posted to user@nutch.apache.org by webdev1977 <we...@gmail.com> on 2012/03/06 21:03:03 UTC

Optimizing crawling for small number of domains/sites (aka. intranet crawling)

Is there a guide to optimizing Nutch/Hadoop for crawling intranet sites?

Most of what I need to crawl are large stores of data (databases exposed
through HTML), share drive content, etc.  I have a very small number of
"sites" to crawl (two databases and one share drive).  The file share
crawling is PAINFULLY slow. I am reading the code as we speak, trying to
figure out why the protocol-file plugin is so slow. Based on the following
entry in the wiki, I don't think I am going to be able to increase the
fetch rate, because I am crawling just a few sites.

From http://wiki.apache.org/nutch/OptimizingCrawls :

"Fetching a lot of pages from a single site or a lot of pages from a few
sites will slow down fetching dramatically. For full web crawls you want an
even distribution so all fetching threads can be active. Setting
generate.max.per.host to a value > 0 will limit the number of pages from a
single host/domain to fetch."
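For reference, that wiki property is set in conf/nutch-site.xml. A minimal sketch (the value 100 is purely illustrative, not a recommendation):

```xml
<!-- conf/nutch-site.xml (sketch; the value 100 is illustrative) -->
<property>
  <name>generate.max.per.host</name>
  <value>100</value>
  <description>Maximum number of URLs per host in a single fetchlist;
  -1 means no limit.</description>
</property>
```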

Could code changes or property changes help speed things up? If so, could
someone give me a hint?

Thanks!

--
View this message in context: http://lucene.472066.n3.nabble.com/Optimizing-crawling-for-small-number-of-domains-sites-aka-intranet-crawling-tp3804830p3804830.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Optimizing crawling for small number of domains/sites (aka. intranet crawling)

Posted by webdev1977 <we...@gmail.com>.
Well, running it with 200 fetcher threads and no delay worked for about 20
minutes... then the file server crashed.

So... I think the DNS queries are the issue.  I am not able to set up
my own DNS server, but I did find this setting in java.security:
networkaddress.cache.ttl.  Since I am using Java 1.6, it was only caching
entries for 30 seconds.  I hope that setting this to -1 will help with all
the unnecessary calls to DNS. I don't even really need to use DNS... the IP
addresses and host names of our servers do not change (and if they do, I
know about it beforehand).
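If editing the java.security file is not convenient, the same TTL can also be set programmatically, as long as it happens before the first hostname lookup. A sketch (the class name is made up for illustration; -1 means cache successful lookups forever):

```java
import java.security.Security;

public class DnsCacheTtl {
    public static void main(String[] args) {
        // Cache successful DNS lookups forever (-1). This must run before
        // the first InetAddress lookup, because the JDK reads the policy
        // only once, when the name service is initialized.
        Security.setProperty("networkaddress.cache.ttl", "-1");

        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

Note that Security.setProperty changes the value for the running JVM only; the java.security file is the place to make it permanent.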



--
View this message in context: http://lucene.472066.n3.nabble.com/Optimizing-crawling-for-small-number-of-domains-sites-aka-intranet-crawling-tp3804830p3818986.html

Re: Optimizing crawling for small number of domains/sites (aka. intranet crawling)

Posted by Ferdy Galema <fe...@kalooga.com>.
Hi,

You want to use the property fetcher.threads.per.queue: set it to more
than 1. This turns off host blocking and simply fetches whenever a thread
is available. It will use the fetcher.server.min.delay value between
requests to the same host, which defaults to 0 (so that is the fastest).
Note that the property description of fetcher.server.min.delay refers to
the property fetcher.threads.per.host; this is incorrect, as it should
be fetcher.threads.per.queue.
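In conf/nutch-site.xml this advice might look like the following sketch (the thread count of 4 is an illustrative value; fetcher.server.min.delay is shown at its default of 0):

```xml
<!-- conf/nutch-site.xml (sketch; 4 threads per queue is illustrative) -->
<property>
  <name>fetcher.threads.per.queue</name>
  <value>4</value>
  <description>Setting this above 1 disables host blocking; several
  threads may fetch from the same host concurrently.</description>
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <value>0.0</value>
  <description>Minimum delay in seconds between requests to the same
  host when fetcher.threads.per.queue is greater than 1.</description>
</property>
```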

Ferdy.
