Posted to user@nutch.apache.org by "S.L" <si...@gmail.com> on 2014/06/22 17:51:15 UTC

Relationship between fetcher.threads.fetch and fetcher.threads.per.host

Hi All,

I would like to know the relationship between the two config properties
*fetcher.threads.fetch* and *fetcher.threads.per.host*.


   1. If, let's say, I am crawling 10 hosts in my seed file and set the
   fetcher.threads.per.host property to 3, should I set the
   fetcher.threads.fetch property to 10*3, i.e. >= 30?
   2. I can understand the *fetcher.threads.per.host* property, as it is
   self-explanatory: it means the number of concurrent connections to a
   particular host. However, I am not able to clearly follow what
   *fetcher.threads.fetch* does.
   3. Also, I would like to know how the *fetcher.threads.per.host* property
   comes into play in distributed mode.



Thanks in advance.

Re: Relationship between fetcher.threads.fetch and fetcher.threads.per.host

Posted by "Meraj A. Khan" <me...@gmail.com>.
Sebastian,

Thanks for the clear explanation. I have a couple of similar questions.


   1. If I set fetcher.threads.per.host (or the renamed
   fetcher.threads.per.queue property) to more than the default of 1, would my
   crawler still be within the crawl-delay limits for each host as specified
   in its robots.txt?
   2. It looks like the maximum we set in fetcher.threads.per.host
   only comes into play when the total number of threads for the map task is
   less than the value we specify in the fetcher.threads.fetch property?

Thanks.


On Sun, Jun 22, 2014 at 2:13 PM, Sebastian Nagel
<wastl.nagel@googlemail.com> wrote:

> Hi,
>
> > 1. fetcher.threads.fetch: 10*3 = 30
> Correct. But if there are 1000 hosts you would hardly
> set fetcher.threads.fetch to 3000, see question 2.
>
> Keep in mind that the property fetcher.threads.per.host has been
> renamed to fetcher.threads.per.queue with Nutch 1.4!
> A queue can be defined by host or IP, see fetcher.queue.mode.
>
> > 2. fetcher.threads.fetch
> If there are many hosts you would set fetcher.threads.per.host
> to 1 (the default), and use fetcher.threads.fetch to limit the
> load on your system (esp. to limit the network load).
>
> > 3. in distributed mode
> All URLs from the same host are placed in the same partition.
> This ensures that host-level blocking can be done in one single
> JVM.
>
> Sebastian

Re: Relationship between fetcher.threads.fetch and fetcher.threads.per.host

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

> 1. fetcher.threads.fetch: 10*3 = 30
Correct. But if there are 1000 hosts you would hardly
set fetcher.threads.fetch to 3000, see question 2.

Keep in mind that the property fetcher.threads.per.host has been
renamed to fetcher.threads.per.queue with Nutch 1.4!
A queue can be defined by host or IP, see fetcher.queue.mode.

> 2. fetcher.threads.fetch
If there are many hosts you would set fetcher.threads.per.host
to 1 (the default), and use fetcher.threads.fetch to limit the
load on your system (esp. to limit the network load).
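
To make the arithmetic in points 1 and 2 concrete, here is a rough sketch
(plain Python, not Nutch code). It assumes that each queue is served by at
most fetcher.threads.per.queue threads and that one fetcher task runs
fetcher.threads.fetch threads in total. The function name and its parameters
are made up for illustration.

```python
def max_concurrent_fetches(threads_fetch: int,
                           threads_per_queue: int,
                           num_queues: int) -> int:
    """Upper bound on simultaneous fetches within one fetcher task.

    Assumption: every queue (one per host or IP) may be accessed by at most
    threads_per_queue threads at a time, and the task has threads_fetch
    threads overall, so the smaller of the two limits wins.
    """
    return min(threads_fetch, threads_per_queue * num_queues)


# 10 hosts, 3 threads per host: 30 threads can keep every queue busy.
print(max_concurrent_fetches(30, 3, 10))     # 30
# 1000 hosts, 1 thread per host, 50 threads: the global limit caps the load.
print(max_concurrent_fetches(50, 1, 1000))   # 50
# More threads than queues can use: the surplus threads have nothing to do.
print(max_concurrent_fetches(100, 3, 10))    # 30
```

So with few hosts the per-queue setting is the effective limit, and with many
hosts fetcher.threads.fetch is.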

> 3. in distributed mode
All URLs from the same host are placed in the same partition.
This ensures that host-level blocking can be done in one single
JVM.
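
The idea behind point 3 can be sketched as follows. This is only an
illustration in plain Python, not the partitioner Nutch actually uses: the
function name is hypothetical and crc32 is a stand-in for whatever hash the
real implementation applies. The point is that the partition depends only on
the host, so all URLs of one host end up in the same task.

```python
from urllib.parse import urlsplit
from zlib import crc32


def partition_for(url: str, num_partitions: int) -> int:
    """Map a URL to a partition using only its host name."""
    host = urlsplit(url).hostname or ""
    # Stand-in hash (assumption): any hash that is stable across processes
    # works, because equal hosts must always yield the same partition.
    return crc32(host.encode("utf-8")) % num_partitions


urls = [
    "http://example.com/a",
    "http://example.com/b?page=2",
    "http://example.org/index.html",
]
for url in urls:
    # Both example.com URLs print the same partition number.
    print(partition_for(url, 4), url)
```

Because a host never spans two partitions, the per-host thread limit only
needs to be enforced locally within each task.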

Sebastian

