Posted to mapreduce-user@hadoop.apache.org by Mat Kelcey <ma...@gmail.com> on 2011/11/20 22:31:48 UTC

is there a way to just abandon a map task?

Hi,

I have a largish job running that, due to the quirks of the third
party input format I'm using, has 280,000 map tasks. (I know this is
far from ideal, but it'll do for me.)

I'm passing this data (the Common Crawl web crawl dataset) through a
visible-text-from-HTML extraction library (boilerpipe), which is
struggling with _1_ particular task. It hits a sequence of records
that are _insanely_ slow to parse for some reason. Rather than a few
minutes per split, it took 7+ hours before I started explicitly trying
to fail the task (hadoop job -fail-task). Since I'm running with bad
record skipping, I was hoping I could issue -fail-task a few times and
ride over the bad records, but it looks like there are quite a few there.
Since it's only 1 of the 280,000, I'm actually happy to just give up on
the entire split.
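
(For context, this is roughly how I have skipping enabled, via the old
mapred API; the job class name here is just a stand-in for mine:)

  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.SkipBadRecords;

  JobConf conf = new JobConf(ExtractTextJob.class); // ExtractTextJob is a stand-in name
  // enter skip mode after 2 failed attempts of the same task
  SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
  // tolerate skipping up to 1000 records around each bad record
  SkipBadRecords.setMapperMaxSkipRecords(conf, 1000);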

Now, if I were running a map-only job, I'd just kill the job, since I'd
have the output of the other 279,999. This job has a no-op reduce step,
though, since I wanted to take the chance to compact the output into a
much smaller number of sequence files (I regret that decision now). As
such, I can't just kill the job, since I'd lose the rest of the
processed data (if I understand correctly?).
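
(The no-op reduce is roughly this, continuing the conf above; the
reducer count of 100 is illustrative, not my real number:)

  import org.apache.hadoop.mapred.SequenceFileOutputFormat;
  import org.apache.hadoop.mapred.lib.IdentityReducer;

  // pass-through reduce purely for compaction: identity reducer with
  // a small reducer count, so the output lands in ~100 sequence files
  conf.setReducerClass(IdentityReducer.class);
  conf.setNumReduceTasks(100);
  conf.setOutputFormat(SequenceFileOutputFormat.class);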

So does anyone know a way to just abandon the entire split?

Cheers,
Mat

Re: is there a way to just abandon a map task?

Posted by Arun C Murthy <ac...@hortonworks.com>.
On Nov 20, 2011, at 5:18 PM, Mat Kelcey wrote:

> Thanks for the suggestion, Arun. I hadn't seen these params before.
> 
> No way to do it for a job in flight though, I guess?
> 

Unfortunately, no. You'll need to re-run the job.

Also, you want to use 'bin/mapred job -fail-task <taskattemptid>' 4 times to abandon the task, since each task is retried up to mapred.map.max.attempts (4 by default) times. If you use '-kill-task', killed attempts don't count against that limit, so the task will just keep being re-run.
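
For example, with a made-up attempt id (the final digit is the attempt
number, which increments after each failed attempt):

  bin/mapred job -fail-task attempt_201111201030_0042_m_000017_0
  bin/mapred job -fail-task attempt_201111201030_0042_m_000017_1
  bin/mapred job -fail-task attempt_201111201030_0042_m_000017_2
  bin/mapred job -fail-task attempt_201111201030_0042_m_000017_3

Once the 4th attempt has been failed, the task has exhausted
mapred.map.max.attempts and won't be rescheduled.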

Arun


Re: is there a way to just abandon a map task?

Posted by Mat Kelcey <ma...@gmail.com>.
Thanks for the suggestion, Arun. I hadn't seen these params before.

No way to do it for a job in flight though, I guess?

Cheers,
Mat

On 20 November 2011 16:43, Arun C Murthy <ac...@hortonworks.com> wrote:
> Mat,
>
>  Take a look at mapred.max.(map|reduce).failures.percent.
>
>  See:
>  http://hadoop.apache.org/common/docs/r0.20.205.0/api/org/apache/hadoop/mapred/JobConf.html#setMaxMapTaskFailuresPercent(int)
>  http://hadoop.apache.org/common/docs/r0.20.205.0/api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceTaskFailuresPercent(int)
>
> hth,
> Arun

Re: is there a way to just abandon a map task?

Posted by Arun C Murthy <ac...@hortonworks.com>.
Mat,

 Take a look at mapred.max.(map|reduce).failures.percent.

 See: 
 http://hadoop.apache.org/common/docs/r0.20.205.0/api/org/apache/hadoop/mapred/JobConf.html#setMaxMapTaskFailuresPercent(int) 
 http://hadoop.apache.org/common/docs/r0.20.205.0/api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceTaskFailuresPercent(int)
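
 For example (a minimal sketch; the 1% threshold and job class name are
 illustrative):

  import org.apache.hadoop.mapred.JobConf;

  JobConf conf = new JobConf(MyJob.class); // MyJob is a placeholder
  // let the job succeed even if up to 1% of map tasks exhaust
  // all their attempts and fail
  conf.setMaxMapTaskFailuresPercent(1);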

hth,
Arun
