Posted to user@nutch.apache.org by Manish Verma <m_...@apple.com> on 2016/01/06 20:51:48 UTC

Concurrency And Crawl Delay ?

Hi,
I am using Nutch 1.10 and have some confusion about concurrency and crawl delay.

For example:

fetcher.server.min.delay = .300
fetcher.threads.per.queue = 10
fetcher.queue.mode = byHost (for simplicity, let's assume there is only one host)
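For reference, a sketch of how these settings might look in conf/nutch-site.xml (property names as above; values mirror this example, and the comments are mine):

```xml
<!-- Illustrative fragment for conf/nutch-site.xml -->
<property>
  <name>fetcher.queue.mode</name>
  <value>byHost</value>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>10</value>
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <!-- in seconds, so 0.3 = 300 ms; applies when fetcher.threads.per.queue > 1 -->
  <value>0.3</value>
</property>
```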



Now that we have defined 10 threads, how will this behave? Will 10 requests be sent to the host at the same time, or will the first thread hit and then, after 300 ms, the second thread hit?
If the threads cannot hit at the same time, then what is the use of having multiple threads, as each thread has to wait 300 ms?


Thanks MV

Re: Concurrency And Crawl Delay ?

Posted by Manish Verma <m_...@apple.com>.
So in the scenario below, 10 requests will be sent at the same time, then there would be a delay of 300 ms, and then 10 more requests, right?

But the fetcher.server.min.delay property description says "the minimum number of seconds the fetcher will delay between
successive requests to the same server."

If the above statement is true, how can 10 requests be sent at the same time?


> On Jan 6, 2016, at 1:32 PM, Sebastian Nagel <wa...@googlemail.com> wrote:
> 
> The property fetcher.threads.per.queue allows
> to let multiple threads fetch content from the same
> host in parallel.
> 
> Note that with fetcher.threads.per.queue > 1
> the delay is configured by fetcher.server.min.delay.
> Of course, due to the concurrent nature of the fetching,
> there could still be open connections
> and fetches in progress during the "delay".


Re: Concurrency And Crawl Delay ?

Posted by Sebastian Nagel <wa...@googlemail.com>.
The property fetcher.threads.per.queue allows
to let multiple threads fetch content from the same
host in parallel.

Note that with fetcher.threads.per.queue > 1
the delay is configured by fetcher.server.min.delay.
Of course, due to the concurrent nature of the fetching,
there could still be open connections
and fetches in progress during the "delay".
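A minimal sketch of this behavior (not Nutch's actual code; class and variable names are illustrative): a per-queue timer enforces a minimum gap between the *start* times of successive requests, while earlier, slower fetches can still be in flight during the delay.

```python
import threading
import time

class FetchQueue:
    """Toy per-host queue: enforces a minimum delay between the
    *start* times of successive requests, not between completions."""

    def __init__(self, min_delay):
        self.min_delay = min_delay
        self.lock = threading.Lock()
        self.next_start = 0.0

    def wait_for_slot(self):
        # Reserve the next allowed start time, then sleep until it arrives.
        with self.lock:
            now = time.monotonic()
            start_at = max(now, self.next_start)
            self.next_start = start_at + self.min_delay
        time.sleep(max(0.0, start_at - time.monotonic()))
        return start_at

starts = []
queue = FetchQueue(min_delay=0.05)  # 50 ms gap for a quick demo

def worker():
    starts.append(queue.wait_for_slot())
    time.sleep(0.2)  # simulated fetch, longer than the delay -> overlapping fetches

threads = [threading.Thread(target=worker) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Successive start times are at least min_delay apart, even though
# several fetches were in progress at once.
gaps = [b - a for a, b in zip(sorted(starts), sorted(starts)[1:])]
print(all(g >= 0.05 for g in gaps))  # -> True
```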

On 01/06/2016 09:54 PM, Manish Verma wrote:
> Thanks for replying Sebastian,
> 
> Just wanted to be clear here: I have multiple URLs to crawl, but the number of hosts is one. Do you mean that even in this case only one thread will be working?
> If this is the case, then what is the significance of the property fetcher.threads.per.queue?
> 
> In my case there would be only one queue, as all URLs reside on the same host, so what is the use of fetcher.threads.per.queue?
> 
> Thanks Manish
> 


Re: Concurrency And Crawl Delay ?

Posted by Manish Verma <m_...@apple.com>.
Thanks for replying, Sebastian.

Just wanted to be clear here: I have multiple URLs to crawl, but the number of hosts is one. Do you mean that even in this case only one thread will be working?
If this is the case, then what is the significance of the property fetcher.threads.per.queue?

In my case there would be only one queue, as all URLs reside on the same host, so what is the use of fetcher.threads.per.queue?

Thanks Manish



> On Jan 6, 2016, at 12:40 PM, Sebastian Nagel <wa...@googlemail.com> wrote:
> 
> Hi,
> 
> all requests to the same host are processed in the same
> fetch queue which also takes care that the configured
> delay (or that specified in robots.txt) is observed.
> With 10 threads and only one host to be crawled,
> 9 of the threads are just doing nothing. Things are
> different if there are multiple hosts to crawl (>=10).
> 
> Cheers,
> Sebastian


Re: Concurrency And Crawl Delay ?

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

all requests to the same host are processed in the same
fetch queue which also takes care that the configured
delay (or that specified in robots.txt) is observed.
With 10 threads and only one host to be crawled,
9 of the threads are just doing nothing. Things are
different if there are multiple hosts to crawl (>=10).

Cheers,
Sebastian
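The per-host partitioning described above can be sketched like this (illustrative only, not Nutch's actual FetchItemQueues implementation):

```python
from collections import defaultdict
from urllib.parse import urlparse

def partition_by_host(urls):
    """Group URLs into per-host fetch queues (fetcher.queue.mode=byHost)."""
    queues = defaultdict(list)
    for url in urls:
        queues[urlparse(url).hostname].append(url)
    return queues

urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
]
queues = partition_by_host(urls)
print(len(queues))  # -> 1: one host, one queue
# With fetcher.threads.per.queue=1, only one of the 10 fetcher threads
# can work on this single queue at a time; the other 9 have nothing to do.
```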

On 01/06/2016 08:51 PM, Manish Verma wrote:
> Hi,
> I am using Nutch 1.10 and have some confusion about concurrency and crawl delay.
> 
> For example:
> 
> fetcher.server.min.delay = .300
> fetcher.threads.per.queue = 10
> fetcher.queue.mode = byHost (for simplicity, let's assume there is only one host)
> 
> 
> 
> Now that we have defined 10 threads, how will this behave? Will 10 requests be sent to the host at the same time, or will the first thread hit and then, after 300 ms, the second thread hit?
> If the threads cannot hit at the same time, then what is the use of having multiple threads, as each thread has to wait 300 ms?
> 
> 
> Thanks MV
>