Posted to user@nutch.apache.org by Talat UYARER <ta...@agmlab.com> on 2013/07/25 08:40:29 UTC

Duplicate Fetches for Fetch Job

Hi,

We are using Nutch for high-volume crawls. We noticed that the FetcherJob 
ReduceTask fetches some websites multiple times when queues run for a long 
time. I found that the cause is the 
mapred.reduce.tasks.speculative.execution setting in Hadoop, which 
defaults to true. I suggest setting this value to false for FetcherJob. 
What do you think?

Talat
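
The effect described above can be illustrated with a minimal, Hadoop-free
sketch. It assumes a toy model in which speculation always launches exactly
one extra attempt of the slow reduce task and both attempts run to the end
(a real cluster may kill the slower attempt partway through). The names
SpeculativeFetchSketch and simulate, and the example.com URLs, are
hypothetical and are not part of Nutch or Hadoop. Only one attempt's output
is kept, but a fetch is an external side effect, so every attempt hits the
site again:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SpeculativeFetchSketch {

    /**
     * Counts how many times each URL is requested.
     * Toy assumption: speculation means exactly two attempts of the same
     * task, and each attempt walks the whole fetch queue.
     */
    static Map<String, Integer> simulate(List<String> urls, boolean speculative) {
        Map<String, Integer> fetchCounts = new LinkedHashMap<>();
        int attempts = speculative ? 2 : 1;
        for (int attempt = 0; attempt < attempts; attempt++) {
            for (String url : urls) {
                // Stand-in for the real HTTP request made by the fetcher.
                fetchCounts.merge(url, 1, Integer::sum);
            }
        }
        return fetchCounts;
    }

    public static void main(String[] args) {
        List<String> queue = Arrays.asList(
                "http://example.com/a", "http://example.com/b");
        System.out.println("speculative=true  -> " + simulate(queue, true));
        System.out.println("speculative=false -> " + simulate(queue, false));
    }
}
```

In this model each URL is requested twice with speculation on and once with
it off, which matches the duplicate fetches reported above.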

Re: Duplicate Fetches for Fetch Job

Posted by Talat UYARER <ta...@agmlab.com>.
Thanks Tejas. I had some hesitation at first; I will go ahead and open an 
issue and upload a patch.

On 25-07-2013 10:12, Tejas Patil wrote:
> 1.x has speculative execution turned off:
> Fetcher.java:1328:    job.setSpeculativeExecution(false);
>
> but 2.x doesn't. It makes sense to do that. I don't see any good reason
> not to have it in 2.x. Could you open a JIRA for this and upload a patch?
>
>
> On Wed, Jul 24, 2013 at 11:40 PM, Talat UYARER <ta...@agmlab.com> wrote:
>
>> Hi,
>>
>> We are using Nutch for high-volume crawls. We noticed that the FetcherJob
>> ReduceTask fetches some websites multiple times when queues run for a long
>> time. I found that the cause is the mapred.reduce.tasks.speculative.execution
>> setting in Hadoop, which defaults to true. I suggest setting this value to
>> false for FetcherJob. What do you think?
>>
>> Talat
>>


Re: Duplicate Fetches for Fetch Job

Posted by Tejas Patil <te...@gmail.com>.
1.x has speculative execution turned off:
Fetcher.java:1328:    job.setSpeculativeExecution(false);

but 2.x doesn't. It makes sense to do that. I don't see any good reason
not to have it in 2.x. Could you open a JIRA for this and upload a patch?
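
A per-job override of this kind can be sketched without Hadoop on the
classpath. The sketch below uses java.util.Properties as a stand-in for
Hadoop's configuration object, which is an assumption made purely for
illustration; JobOverrideSketch and its methods are hypothetical names, not
Nutch or Hadoop code. It only shows the idea that a value set on the job
wins over the cluster default of true:

```java
import java.util.Properties;

class JobOverrideSketch {
    // Property name as quoted in the original message.
    static final String KEY = "mapred.reduce.tasks.speculative.execution";

    /** Cluster-wide defaults; the thread states this property defaults to true. */
    static Properties clusterDefaults() {
        Properties defaults = new Properties();
        defaults.setProperty(KEY, "true");
        return defaults;
    }

    /** Job-level view that falls back to the defaults unless overridden. */
    static Properties fetchJobConf(boolean disableSpeculation) {
        Properties job = new Properties(clusterDefaults());
        if (disableSpeculation) {
            // Plays the role of the job-level "turn speculation off" call.
            job.setProperty(KEY, "false");
        }
        return job;
    }

    static boolean reduceSpeculationEnabled(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(KEY));
    }

    public static void main(String[] args) {
        System.out.println(reduceSpeculationEnabled(fetchJobConf(false)));
        System.out.println(reduceSpeculationEnabled(fetchJobConf(true)));
    }
}
```

Without the override the job inherits the default and speculation stays on;
with it, the fetch job runs a single attempt per reduce task regardless of
how the cluster is configured.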


On Wed, Jul 24, 2013 at 11:40 PM, Talat UYARER <ta...@agmlab.com> wrote:

> Hi,
>
> We are using Nutch for high-volume crawls. We noticed that the FetcherJob
> ReduceTask fetches some websites multiple times when queues run for a long
> time. I found that the cause is the mapred.reduce.tasks.speculative.execution
> setting in Hadoop, which defaults to true. I suggest setting this value to
> false for FetcherJob. What do you think?
>
> Talat
>