Posted to user@nutch.apache.org by Dustine Rene Bernasor <du...@thecyberguardian.com> on 2012/05/24 06:12:44 UTC
nutch hadoop only one slave is crawling
I have a 3-slave Hadoop cluster and I am crawling a single website.
However, only 1 slave is fetching (though the other slaves are still
alive). Is this normal behavior when only 1 domain is crawled? Is there
any way to force the other slaves to fetch?
Thanks.
Re: nutch hadoop only one slave is crawling
Posted by "nutch.buddy@gmail.com" <nu...@gmail.com>.
The queue runs on a single datanode?
Rémy Amouroux wrote
>
> Hi
>
> The fetcher threads (10 by default, as configured in nutch-default.xml
> through fetcher.threads.fetch) take their jobs from a queue.
> By default, there is only one queue per host (defined by the property
> fetcher.queue.mode), and another property limits the number of threads
> allowed to access a queue (fetcher.threads.per.queue, default is 1 in
> nutch-default.xml) in order to be polite to the crawled web site.
>
> So, since you are crawling only one website, you have one queue, and only
> one thread is allowed to fetch at any given moment.
>
> By modifying fetcher.threads.per.queue in nutch-site.xml, you can have
> more threads doing fetching at the same time, capped by
> fetcher.threads.fetch.
>
> Regards
>
> PS: be careful and think of the impact of the new configuration on this
> website :-)
>
> RemyA
Re: nutch hadoop only one slave is crawling
Posted by Rémy Amouroux <re...@teorem.fr>.
Hi
The fetcher threads (10 by default, as configured in nutch-default.xml through fetcher.threads.fetch) take their jobs from a queue.
By default, there is only one queue per host (defined by the property fetcher.queue.mode), and another property limits the number of threads allowed to access a queue (fetcher.threads.per.queue, default is 1 in nutch-default.xml) in order to be polite to the crawled web site.
So, since you are crawling only one website, you have one queue, and only one thread is allowed to fetch at any given moment.
By modifying fetcher.threads.per.queue in nutch-site.xml, you can have more threads fetching at the same time, capped by fetcher.threads.fetch.
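For illustration, a minimal sketch of such an override in nutch-site.xml. It uses only the property names mentioned above; the value 3 is an arbitrary assumption chosen for the example, not a recommendation, and should be picked with the target site's capacity in mind.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Overrides the default of 1 from nutch-default.xml.
       The value 3 is an assumed example, not a recommended setting. -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>3</value>
  </property>
  <!-- Total number of fetcher threads; this is the upper bound on
       concurrent fetching. 10 is the default cited above. -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
  </property>
</configuration>
```

With these values, up to 3 threads may pull from the single host queue at once. This only changes how many threads fetch from the host at the same time; whether the work is spread across several slaves is a separate question that this setting alone may not address.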
Regards
PS: be careful and think of the impact of the new configuration on this website :-)
RemyA