
Posted to user@nutch.apache.org by Dustine Rene Bernasor <du...@thecyberguardian.com> on 2012/05/24 06:12:44 UTC

nutch hadoop only one slave is crawling

I have a 3-slave Hadoop cluster and I am performing a crawl on a single
website. However, only 1 slave is fetching (though the other
slaves are still alive). Is this normal behavior if only 1 domain is
crawled? Is there any way to force the other slaves to fetch?

Thanks.

Re: nutch hadoop only one slave is crawling

Posted by "nutch.buddy@gmail.com" <nu...@gmail.com>.

The queue runs on a single datanode?



Rémy Amouroux wrote
> 
> Hi
> 
> The fetcher threads (10 by default, as configured in nutch-default.xml
> through fetcher.threads.fetch) take their jobs from queues.
> By default there is one queue per host (controlled by the
> fetcher.queue.mode property), and another property,
> fetcher.threads.per.queue (default 1 in nutch-default.xml), limits the
> number of threads allowed to access a queue at the same time, in order to
> be polite to the crawled website.
> 
> So if you are crawling only one website, you have one queue, and only
> one thread is allowed to fetch at any given moment.
> 
> By overriding fetcher.threads.per.queue in nutch-site.xml, you can have
> more threads fetching at the same time, capped by
> fetcher.threads.fetch.
> 
> Regards
> 
> PS: Be careful, and consider the impact of the new configuration on that
> website :-)
> 
> RemyA
> 



Re: nutch hadoop only one slave is crawling

Posted by Rémy Amouroux <re...@teorem.fr>.

Hi

The fetcher threads (10 by default, as configured in nutch-default.xml through fetcher.threads.fetch) take their jobs from queues.
By default there is one queue per host (controlled by the fetcher.queue.mode property), and another property, fetcher.threads.per.queue (default 1 in nutch-default.xml), limits the number of threads allowed to access a queue at the same time, in order to be polite to the crawled website.

So if you are crawling only one website, you have one queue, and only one thread is allowed to fetch at any given moment.

By overriding fetcher.threads.per.queue in nutch-site.xml, you can have more threads fetching at the same time, capped by fetcher.threads.fetch.
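
A minimal sketch of such an override in nutch-site.xml, using the property name mentioned above. The value 3 is only an illustrative assumption, not a recommendation; pick a value the crawled site can tolerate.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: local overrides of the defaults in nutch-default.xml -->
<configuration>
  <property>
    <name>fetcher.threads.per.queue</name>
    <!-- Illustrative value (assumption); the default is 1. -->
    <!-- The effective number is still capped by fetcher.threads.fetch. -->
    <value>3</value>
  </property>
</configuration>
```

With this setting, up to three threads may pull from the same host queue at once, so the crawled site can see up to three concurrent requests.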

Regards

PS: Be careful, and consider the impact of the new configuration on that website :-)

RemyA
