Posted to mapreduce-user@hadoop.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/12/26 20:44:42 UTC

Reducers fail without messages on 20.205.0

Hi,

We sometimes see reducers fail just when all mappers are finishing. All 
mappers finish roughly at the same time. The reducers only dump the following 
exception:

java.lang.Throwable: Child Error
	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 137.
	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

The reducers' own log output also shows nothing that gives a clue; this is the 
last part of the log:

2011-12-26 19:35:19,116 INFO org.apache.hadoop.io.compress.CodecPool: Got 
brand-new decompressor
2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool: Got 
brand-new decompressor
2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool: Got 
brand-new decompressor
2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging on-
disk files
2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201112261420_0006_r_000009_0 Thread waiting: Thread for merging on-
disk files
2011-12-26 19:35:19,121 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging in 
memory files
2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201112261420_0006_r_000009_0 Need another 50 map output(s) where 0 is 
already in progress
2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201112261420_0006_r_000009_0 Thread started: Thread for polling Map 
Completion Events
2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201112261420_0006_r_000009_0 Scheduled 0 outputs (0 slow hosts and0 
dup hosts)
2011-12-26 19:35:24,124 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201112261420_0006_r_000009_0 Scheduled 2 outputs (0 slow hosts and0 
dup hosts)
2011-12-26 19:35:25,805 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts and0 
dup hosts)
2011-12-26 19:36:21,578 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201112261420_0006_r_000009_0 Need another 47 map output(s) where 0 is 
already in progress
2011-12-26 19:36:21,593 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts and0 
dup hosts)
2011-12-26 19:36:42,412 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts and0 
dup hosts)

Is there any advice?
Thanks

Re: Reducers fail without messages on 20.205.0

Posted by Markus Jelsma <ma...@openindex.io>.
So it seems. We had good results on 20.203 with a large value configured for 
JVM reuse. What changed?

Thanks

> I wouldn't use jvm reuse at this point. It's had a number of issues over
> time and I've consistently switched it off for a long while now.
> 
> Arun
> 
> On Dec 26, 2011, at 2:50 PM, Markus Jelsma wrote:
> > Hi,
> > 
> >> Markus,
> >> 
> >> Good to know you fixed it now :)
> >> 
> >> Also consider raising reduce slowstart to a more suitable value, like
> >> 80-85% -- this is pretty beneficial in both heap consumption and more
> >> performant Reduce slots. The property name is
> >> "mapred.reduce.slowstart.completed.maps" and carries a float value
> >> (percentage).
> > 
> > Is that going to help if one has a non-zero value for reusing JVM
> > instances? We have little concurrent jobs running because the process
> > flow is linear and we prefer the benefit of copying map output while
> > it's running.
> > For this particular job, however, it is not the case since all mappers
> > finish almost simultaneously so changing that value would not make a
> > difference right? 80% finishes within a matter of seconds.
> > 
> > For the record, it's an Apache Nutch fetcher job configured to terminate
> > after 1 hour. This downloads URL's, hence the disabling of speculative
> > execution.

Re: Reducers fail without messages on 20.205.0

Posted by Arun C Murthy <ac...@hortonworks.com>.
I wouldn't use jvm reuse at this point. It's had a number of issues over time and I've consistently switched it off for a long while now.
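For reference, JVM reuse on the 0.20.x line is controlled by a single job property; a sketch of explicitly disabling it in mapred-site.xml (1 is also the default, i.e. a fresh JVM per task; -1 would mean unlimited reuse):

```xml
<!-- mapred-site.xml: run exactly one task per child JVM (no reuse) -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>1</value>
</property>
```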

Arun

On Dec 26, 2011, at 2:50 PM, Markus Jelsma wrote:

> Hi,
> 
>> Markus,
>> 
>> Good to know you fixed it now :)
>> 
>> Also consider raising reduce slowstart to a more suitable value, like
>> 80-85% -- this is pretty beneficial in both heap consumption and more
>> performant Reduce slots. The property name is
>> "mapred.reduce.slowstart.completed.maps" and carries a float value
>> (percentage).
> 
> Is that going to help if one has a non-zero value for reusing JVM instances? 
> We have few concurrent jobs running because the process flow is linear and 
> we prefer the benefit of copying map output while the mappers are running.
> For this particular job, however, that is not the case: all mappers finish 
> almost simultaneously, so changing that value would not make a difference, 
> right? The jump from 80% to 100% map completion takes a matter of seconds.
> 
> For the record, it's an Apache Nutch fetcher job configured to terminate after 
> 1 hour. It downloads URLs, hence speculative execution is disabled.


Re: Reducers fail without messages on 20.205.0

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

> Markus,
> 
> Good to know you fixed it now :)
> 
> Also consider raising reduce slowstart to a more suitable value, like
> 80-85% -- this is pretty beneficial in both heap consumption and more
> performant Reduce slots. The property name is
> "mapred.reduce.slowstart.completed.maps" and carries a float value
> (percentage).

Is that going to help if one has a non-zero value for reusing JVM instances? 
We have few concurrent jobs running because the process flow is linear, and 
we prefer the benefit of copying map output while the mappers are running.
For this particular job, however, that is not the case: all mappers finish 
almost simultaneously, so changing that value would not make a difference, 
right? The jump from 80% to 100% map completion takes a matter of seconds.

For the record, it's an Apache Nutch fetcher job configured to terminate after 
1 hour. It downloads URLs, hence speculative execution is disabled.

> 
> ( P.s. You may also want to watch this general take on the topic:
> http://www.cloudera.com/videos/hadoop-world-2011-presentation-video-hadoop-troubleshooting-101 )

Something interesting for tomorrow :)

Many thanks for sharing thoughts!
Cheers

> 
> On 27-Dec-2011, at 2:29 AM, Markus Jelsma wrote:
> > Spot on!
> > 
> > The machines have 16GM RAM with 1050MB for nine mappers and 340MB for two
> > reducers and only 60MB for datanode and tasktracker daemons. I had
> > already accounted for Java processes taking more than -Xmx of physical
> > RAM, it seems an edge case caused the mappers to take a bit more RAM
> > than usual. Then when the reducers start (all at the same time) it was
> > not enough.
> > 
> > Thanks!
> > 
> >> Markus,
> >> 
> >> Thanks for clarifying! What heap size does this task launch with and
> >> what's the machine RAM and the TaskTracker slots? Can you rule out the
> >> Linux OOM killer taking part here?
> >> 
> >> On 27-Dec-2011, at 1:50 AM, Markus Jelsma wrote:
> >>> Hi,
> >>> 
> >>> They were indeed reported as failed and speculative execution is
> >>> disabled because this particular job is not side-effect free. We've
> >>> seen this before but it's rare.
> >>> 
> >>> Thanks
> >>> 
> >>>> Exit code 137 would mean a SIGKILL. Are you positive its reported as a
> >>>> failed task and not as a killed task? If not, it may just be a result
> >>>> of speculative reducer execution at work and is nothing to worry
> >>>> about.
> >>>> 
> >>>> On 27-Dec-2011, at 1:14 AM, Markus Jelsma wrote:
> >>>>> Hi,
> >>>>> 
> >>>>> We sometimes see reducers fail just when all mappers are finishing.
> >>>>> All mappers finish roughly at the same time. The reducers only dump
> >>>>> the following exception:
> >>>>> 
> >>>>> java.lang.Throwable: Child Error
> >>>>> 
> >>>>> 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
> >>>>> 
> >>>>> Caused by: java.io.IOException: Task process exit with nonzero status
> >>>>> of 137.
> >>>>> 
> >>>>> 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
> >>>>> 
> >>>>> The reducers own log output also shows nothing that gives a clue,
> >>>>> this is the last part of the log:
> >>>>> 
> >>>>> 2011-12-26 19:35:19,116 INFO org.apache.hadoop.io.compress.CodecPool:
> >>>>> Got brand-new decompressor
> >>>>> 2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool:
> >>>>> Got brand-new decompressor
> >>>>> 2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool:
> >>>>> Got brand-new decompressor
> >>>>> 2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask:
> >>>>> attempt_201112261420_0006_r_000009_0 Thread started: Thread for
> >>>>> merging on- disk files
> >>>>> 2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask:
> >>>>> attempt_201112261420_0006_r_000009_0 Thread waiting: Thread for
> >>>>> merging on- disk files
> >>>>> 2011-12-26 19:35:19,121 INFO org.apache.hadoop.mapred.ReduceTask:
> >>>>> attempt_201112261420_0006_r_000009_0 Thread started: Thread for
> >>>>> merging in memory files
> >>>>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
> >>>>> attempt_201112261420_0006_r_000009_0 Need another 50 map output(s)
> >>>>> where 0 is already in progress
> >>>>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
> >>>>> attempt_201112261420_0006_r_000009_0 Thread started: Thread for
> >>>>> polling Map Completion Events
> >>>>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
> >>>>> attempt_201112261420_0006_r_000009_0 Scheduled 0 outputs (0 slow
> >>>>> hosts and0 dup hosts)
> >>>>> 2011-12-26 19:35:24,124 INFO org.apache.hadoop.mapred.ReduceTask:
> >>>>> attempt_201112261420_0006_r_000009_0 Scheduled 2 outputs (0 slow
> >>>>> hosts and0 dup hosts)
> >>>>> 2011-12-26 19:35:25,805 INFO org.apache.hadoop.mapred.ReduceTask:
> >>>>> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow
> >>>>> hosts and0 dup hosts)
> >>>>> 2011-12-26 19:36:21,578 INFO org.apache.hadoop.mapred.ReduceTask:
> >>>>> attempt_201112261420_0006_r_000009_0 Need another 47 map output(s)
> >>>>> where 0 is already in progress
> >>>>> 2011-12-26 19:36:21,593 INFO org.apache.hadoop.mapred.ReduceTask:
> >>>>> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow
> >>>>> hosts and0 dup hosts)
> >>>>> 2011-12-26 19:36:42,412 INFO org.apache.hadoop.mapred.ReduceTask:
> >>>>> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow
> >>>>> hosts and0 dup hosts)
> >>>>> 
> >>>>> Is there any advice?
> >>>>> Thanks

Re: Reducers fail without messages on 20.205.0

Posted by Harsh J <ha...@cloudera.com>.
Markus,

Good to know you fixed it now :)

Also consider raising reduce slowstart to a more suitable value, like 80-85% -- this is pretty beneficial for both heap consumption and for keeping Reduce slots productive. The property name is "mapred.reduce.slowstart.completed.maps" and it carries a float value (a fraction of completed maps, e.g. 0.85).
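As an illustration (the 85% figure is the suggested upper end of the range above, not a value tested on this cluster), the setting would look like:

```xml
<!-- mapred-site.xml or per-job conf: launch reducers only
     after 85% of the map tasks have completed -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.85</value>
</property>
```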

( P.s. You may also want to watch this general take on the topic: http://www.cloudera.com/videos/hadoop-world-2011-presentation-video-hadoop-troubleshooting-101 )

On 27-Dec-2011, at 2:29 AM, Markus Jelsma wrote:

> Spot on!
> 
> The machines have 16GB RAM, with 1050MB for each of nine mappers, 340MB for 
> each of two reducers, and only 60MB each for the datanode and tasktracker 
> daemons. I had already accounted for Java processes taking more than -Xmx of 
> physical RAM; it seems an edge case caused the mappers to take a bit more RAM 
> than usual. Then, when the reducers started (all at the same time), there was 
> not enough left.
> 
> Thanks!
> 
>> Markus,
>> 
>> Thanks for clarifying! What heap size does this task launch with and what's
>> the machine RAM and the TaskTracker slots? Can you rule out the Linux OOM
>> killer taking part here?
>> 
>> On 27-Dec-2011, at 1:50 AM, Markus Jelsma wrote:
>>> Hi,
>>> 
>>> They were indeed reported as failed and speculative execution is disabled
>>> because this particular job is not side-effect free. We've seen this
>>> before but it's rare.
>>> 
>>> Thanks
>>> 
>>>> Exit code 137 would mean a SIGKILL. Are you positive its reported as a
>>>> failed task and not as a killed task? If not, it may just be a result of
>>>> speculative reducer execution at work and is nothing to worry about.
>>>> 
>>>> On 27-Dec-2011, at 1:14 AM, Markus Jelsma wrote:
>>>>> Hi,
>>>>> 
>>>>> We sometimes see reducers fail just when all mappers are finishing. All
>>>>> mappers finish roughly at the same time. The reducers only dump the
>>>>> following exception:
>>>>> 
>>>>> java.lang.Throwable: Child Error
>>>>> 
>>>>> 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
>>>>> 
>>>>> Caused by: java.io.IOException: Task process exit with nonzero status
>>>>> of 137.
>>>>> 
>>>>> 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>>>>> 
>>>>> The reducers own log output also shows nothing that gives a clue, this
>>>>> is the last part of the log:
>>>>> 
>>>>> 2011-12-26 19:35:19,116 INFO org.apache.hadoop.io.compress.CodecPool:
>>>>> Got brand-new decompressor
>>>>> 2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool:
>>>>> Got brand-new decompressor
>>>>> 2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool:
>>>>> Got brand-new decompressor
>>>>> 2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging
>>>>> on- disk files
>>>>> 2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Thread waiting: Thread for merging
>>>>> on- disk files
>>>>> 2011-12-26 19:35:19,121 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging
>>>>> in memory files
>>>>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Need another 50 map output(s)
>>>>> where 0 is already in progress
>>>>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Thread started: Thread for polling
>>>>> Map Completion Events
>>>>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Scheduled 0 outputs (0 slow hosts
>>>>> and0 dup hosts)
>>>>> 2011-12-26 19:35:24,124 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Scheduled 2 outputs (0 slow hosts
>>>>> and0 dup hosts)
>>>>> 2011-12-26 19:35:25,805 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts
>>>>> and0 dup hosts)
>>>>> 2011-12-26 19:36:21,578 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Need another 47 map output(s)
>>>>> where 0 is already in progress
>>>>> 2011-12-26 19:36:21,593 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts
>>>>> and0 dup hosts)
>>>>> 2011-12-26 19:36:42,412 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts
>>>>> and0 dup hosts)
>>>>> 
>>>>> Is there any advice?
>>>>> Thanks


Re: Reducers fail without messages on 20.205.0

Posted by Markus Jelsma <ma...@openindex.io>.
Spot on!

The machines have 16GB RAM, with 1050MB for each of nine mappers, 340MB for 
each of two reducers, and only 60MB each for the datanode and tasktracker 
daemons. I had already accounted for Java processes taking more than -Xmx of 
physical RAM; it seems an edge case caused the mappers to take a bit more RAM 
than usual. Then, when the reducers started (all at the same time), there was 
not enough left.
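A rough back-of-the-envelope check of that budget (a sketch only; the 40% RSS-over--Xmx overhead is an illustrative assumption, not a measured figure, and it assumes the 1050/340/60 MB values are per-process heaps):

```shell
# Nominal -Xmx budget: 9 mappers x 1050 MB + 2 reducers x 340 MB
# + datanode and tasktracker daemons at 60 MB each.
echo $(( 9*1050 + 2*340 + 2*60 ))               # prints 10250 (MB)

# Real JVM RSS exceeds -Xmx (thread stacks, permgen, native buffers);
# with an illustrative 40% overhead the footprint approaches the 16 GB
# of physical RAM once the OS and page cache are accounted for:
echo $(( (9*1050 + 2*340 + 2*60) * 140 / 100 )) # prints 14350 (MB)
```

When an edge case pushes the mappers past even that margin while all reducers start at once, the kernel's OOM killer sends SIGKILL, which is exactly the exit status 137 seen in the task logs.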

Thanks!

> Markus,
> 
> Thanks for clarifying! What heap size does this task launch with and what's
> the machine RAM and the TaskTracker slots? Can you rule out the Linux OOM
> killer taking part here?
> 
> On 27-Dec-2011, at 1:50 AM, Markus Jelsma wrote:
> > Hi,
> > 
> > They were indeed reported as failed and speculative execution is disabled
> > because this particular job is not side-effect free. We've seen this
> > before but it's rare.
> > 
> > Thanks
> > 
> >> Exit code 137 would mean a SIGKILL. Are you positive its reported as a
> >> failed task and not as a killed task? If not, it may just be a result of
> >> speculative reducer execution at work and is nothing to worry about.
> >> 
> >> On 27-Dec-2011, at 1:14 AM, Markus Jelsma wrote:
> >>> Hi,
> >>> 
> >>> We sometimes see reducers fail just when all mappers are finishing. All
> >>> mappers finish roughly at the same time. The reducers only dump the
> >>> following exception:
> >>> 
> >>> java.lang.Throwable: Child Error
> >>> 
> >>> 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
> >>> 
> >>> Caused by: java.io.IOException: Task process exit with nonzero status
> >>> of 137.
> >>> 
> >>> 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
> >>> 
> >>> The reducers own log output also shows nothing that gives a clue, this
> >>> is the last part of the log:
> >>> 
> >>> 2011-12-26 19:35:19,116 INFO org.apache.hadoop.io.compress.CodecPool:
> >>> Got brand-new decompressor
> >>> 2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool:
> >>> Got brand-new decompressor
> >>> 2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool:
> >>> Got brand-new decompressor
> >>> 2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging
> >>> on- disk files
> >>> 2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Thread waiting: Thread for merging
> >>> on- disk files
> >>> 2011-12-26 19:35:19,121 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging
> >>> in memory files
> >>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Need another 50 map output(s)
> >>> where 0 is already in progress
> >>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Thread started: Thread for polling
> >>> Map Completion Events
> >>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Scheduled 0 outputs (0 slow hosts
> >>> and0 dup hosts)
> >>> 2011-12-26 19:35:24,124 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Scheduled 2 outputs (0 slow hosts
> >>> and0 dup hosts)
> >>> 2011-12-26 19:35:25,805 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts
> >>> and0 dup hosts)
> >>> 2011-12-26 19:36:21,578 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Need another 47 map output(s)
> >>> where 0 is already in progress
> >>> 2011-12-26 19:36:21,593 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts
> >>> and0 dup hosts)
> >>> 2011-12-26 19:36:42,412 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts
> >>> and0 dup hosts)
> >>> 
> >>> Is there any advice?
> >>> Thanks

Re: Reducers fail without messages on 20.205.0

Posted by Harsh J <ha...@cloudera.com>.
Markus,

Thanks for clarifying! What heap size does this task launch with, what's the machine's RAM, and how many TaskTracker slots are there? Can you rule out the Linux OOM killer taking part here?
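One quick way to check for the OOM killer on the node that ran the failed attempt (the exact kernel message wording varies across kernel versions, so this pattern is a best-effort sketch):

```shell
# Scan the kernel ring buffer for OOM-killer activity. A hit such as
#   Out of memory: Killed process 12345 (java) ...
# confirms the kernel sent SIGKILL, matching the exit status 137
# reported by the TaskTracker.
dmesg | grep -iE 'out of memory|killed process'
```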

On 27-Dec-2011, at 1:50 AM, Markus Jelsma wrote:

> Hi,
> 
> They were indeed reported as failed and speculative execution is disabled 
> because this particular job is not side-effect free. We've seen this before 
> but it's rare.
> 
> Thanks
> 
>> Exit code 137 would mean a SIGKILL. Are you positive its reported as a
>> failed task and not as a killed task? If not, it may just be a result of
>> speculative reducer execution at work and is nothing to worry about.
>> 
>> On 27-Dec-2011, at 1:14 AM, Markus Jelsma wrote:
>>> Hi,
>>> 
>>> We sometimes see reducers fail just when all mappers are finishing. All
>>> mappers finish roughly at the same time. The reducers only dump the
>>> following exception:
>>> 
>>> java.lang.Throwable: Child Error
>>> 
>>> 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
>>> 
>>> Caused by: java.io.IOException: Task process exit with nonzero status of
>>> 137.
>>> 
>>> 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>>> 
>>> The reducers own log output also shows nothing that gives a clue, this is
>>> the last part of the log:
>>> 
>>> 2011-12-26 19:35:19,116 INFO org.apache.hadoop.io.compress.CodecPool: Got
>>> brand-new decompressor
>>> 2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool: Got
>>> brand-new decompressor
>>> 2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool: Got
>>> brand-new decompressor
>>> 2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask:
>>> attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging
>>> on- disk files
>>> 2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask:
>>> attempt_201112261420_0006_r_000009_0 Thread waiting: Thread for merging
>>> on- disk files
>>> 2011-12-26 19:35:19,121 INFO org.apache.hadoop.mapred.ReduceTask:
>>> attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging
>>> in memory files
>>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
>>> attempt_201112261420_0006_r_000009_0 Need another 50 map output(s) where
>>> 0 is already in progress
>>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
>>> attempt_201112261420_0006_r_000009_0 Thread started: Thread for polling
>>> Map Completion Events
>>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
>>> attempt_201112261420_0006_r_000009_0 Scheduled 0 outputs (0 slow hosts
>>> and0 dup hosts)
>>> 2011-12-26 19:35:24,124 INFO org.apache.hadoop.mapred.ReduceTask:
>>> attempt_201112261420_0006_r_000009_0 Scheduled 2 outputs (0 slow hosts
>>> and0 dup hosts)
>>> 2011-12-26 19:35:25,805 INFO org.apache.hadoop.mapred.ReduceTask:
>>> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts
>>> and0 dup hosts)
>>> 2011-12-26 19:36:21,578 INFO org.apache.hadoop.mapred.ReduceTask:
>>> attempt_201112261420_0006_r_000009_0 Need another 47 map output(s) where
>>> 0 is already in progress
>>> 2011-12-26 19:36:21,593 INFO org.apache.hadoop.mapred.ReduceTask:
>>> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts
>>> and0 dup hosts)
>>> 2011-12-26 19:36:42,412 INFO org.apache.hadoop.mapred.ReduceTask:
>>> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts
>>> and0 dup hosts)
>>> 
>>> Is there any advice?
>>> Thanks


Re: Reducers fail without messages on 20.205.0

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

They were indeed reported as failed and speculative execution is disabled 
because this particular job is not side-effect free. We've seen this before 
but it's rare.

Thanks

> Exit code 137 would mean a SIGKILL. Are you positive it's reported as a
> failed task and not as a killed task? If not, it may just be a result of
> speculative reducer execution at work and nothing to worry about.
> 
> On 27-Dec-2011, at 1:14 AM, Markus Jelsma wrote:
> > Hi,
> > 
> > We sometimes see reducers fail just when all mappers are finishing. All
> > mappers finish roughly at the same time. The reducers only dump the
> > following exception:
> > 
> > java.lang.Throwable: Child Error
> > 
> > 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
> > 
> > Caused by: java.io.IOException: Task process exit with nonzero status of
> > 137.
> > 
> > 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
> > 
> > The reducers own log output also shows nothing that gives a clue, this is
> > the last part of the log:
> > 
> > 2011-12-26 19:35:19,116 INFO org.apache.hadoop.io.compress.CodecPool: Got
> > brand-new decompressor
> > 2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool: Got
> > brand-new decompressor
> > 2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool: Got
> > brand-new decompressor
> > 2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask:
> > attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging
> > on- disk files
> > 2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask:
> > attempt_201112261420_0006_r_000009_0 Thread waiting: Thread for merging
> > on- disk files
> > 2011-12-26 19:35:19,121 INFO org.apache.hadoop.mapred.ReduceTask:
> > attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging
> > in memory files
> > 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
> > attempt_201112261420_0006_r_000009_0 Need another 50 map output(s) where
> > 0 is already in progress
> > 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
> > attempt_201112261420_0006_r_000009_0 Thread started: Thread for polling
> > Map Completion Events
> > 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
> > attempt_201112261420_0006_r_000009_0 Scheduled 0 outputs (0 slow hosts
> > and0 dup hosts)
> > 2011-12-26 19:35:24,124 INFO org.apache.hadoop.mapred.ReduceTask:
> > attempt_201112261420_0006_r_000009_0 Scheduled 2 outputs (0 slow hosts
> > and0 dup hosts)
> > 2011-12-26 19:35:25,805 INFO org.apache.hadoop.mapred.ReduceTask:
> > attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts
> > and0 dup hosts)
> > 2011-12-26 19:36:21,578 INFO org.apache.hadoop.mapred.ReduceTask:
> > attempt_201112261420_0006_r_000009_0 Need another 47 map output(s) where
> > 0 is already in progress
> > 2011-12-26 19:36:21,593 INFO org.apache.hadoop.mapred.ReduceTask:
> > attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts
> > and0 dup hosts)
> > 2011-12-26 19:36:42,412 INFO org.apache.hadoop.mapred.ReduceTask:
> > attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts
> > and0 dup hosts)
> > 
> > Is there any advice?
> > Thanks

Re: Reducers fail without messages on 20.205.0

Posted by Harsh J <ha...@cloudera.com>.
Exit code 137 would mean a SIGKILL. Are you positive it's reported as a failed task and not as a killed task? If not, it may just be a result of speculative reducer execution at work and nothing to worry about.
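The 137 follows the usual shell convention of 128 + signal number (SIGKILL is signal 9), which is easy to reproduce:

```shell
# Kill a background child with SIGKILL and observe its exit status.
sleep 60 &
kill -9 $!
wait $!      # wait's status is the child's: 128 + 9
echo $?      # prints 137
```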

On 27-Dec-2011, at 1:14 AM, Markus Jelsma wrote:

> Hi,
> 
> We sometimes see reducers fail just when all mappers are finishing. All 
> mappers finish roughly at the same time. The reducers only dump the following 
> exception:
> 
> java.lang.Throwable: Child Error
> 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
> Caused by: java.io.IOException: Task process exit with nonzero status of 137.
> 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
> 
> The reducers' own log output also shows nothing that gives a clue; this is the 
> last part of the log:
> 
> 2011-12-26 19:35:19,116 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask: 
> attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging on-
> disk files
> 2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask: 
> attempt_201112261420_0006_r_000009_0 Thread waiting: Thread for merging on-
> disk files
> 2011-12-26 19:35:19,121 INFO org.apache.hadoop.mapred.ReduceTask: 
> attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging in 
> memory files
> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask: 
> attempt_201112261420_0006_r_000009_0 Need another 50 map output(s) where 0 is 
> already in progress
> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask: 
> attempt_201112261420_0006_r_000009_0 Thread started: Thread for polling Map 
> Completion Events
> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask: 
> attempt_201112261420_0006_r_000009_0 Scheduled 0 outputs (0 slow hosts and0 
> dup hosts)
> 2011-12-26 19:35:24,124 INFO org.apache.hadoop.mapred.ReduceTask: 
> attempt_201112261420_0006_r_000009_0 Scheduled 2 outputs (0 slow hosts and0 
> dup hosts)
> 2011-12-26 19:35:25,805 INFO org.apache.hadoop.mapred.ReduceTask: 
> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts and0 
> dup hosts)
> 2011-12-26 19:36:21,578 INFO org.apache.hadoop.mapred.ReduceTask: 
> attempt_201112261420_0006_r_000009_0 Need another 47 map output(s) where 0 is 
> already in progress
> 2011-12-26 19:36:21,593 INFO org.apache.hadoop.mapred.ReduceTask: 
> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts and0 
> dup hosts)
> 2011-12-26 19:36:42,412 INFO org.apache.hadoop.mapred.ReduceTask: 
> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts and0 
> dup hosts)
> 
> Is there any advice?
> Thanks