You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by Andreas Kostyrka <an...@kostyrka.org> on 2008/07/24 10:12:27 UTC

hadoop 0.17.1 reducer not fetching map output problem

Hi!

I'm experiencing hung reducers, with the following symptoms:

> Task Logs: 'task_200807230647_0008_r_000009_1'
>
>
> stdout logs
>
>
>
> stderr logs
>
>
>
> syslog logs
>
> red.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output
> location(s); scheduling... 2008-07-24 07:56:11,064 INFO
> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> 07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> scheduling... 2008-07-24 07:56:16,074 INFO
> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> 07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> scheduling... 2008-07-24 07:56:21,084 INFO
> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> 07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> scheduling... 2008-07-24 07:56:26,094 INFO
> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> 07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> scheduling... 2008-07-24 07:56:31,104 INFO
> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> 07:56:36,113 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> scheduling... 2008-07-24 07:56:36,114 INFO
> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> 07:56:41,123 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> scheduling... 2008-07-24 07:56:41,126 INFO
> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)


Notice how it needs 6 map outputs, all map tasks have finished, and it still 
just hangs there.

The second speculative copy of that reducer task needs 14 map outputs with the 
same messages :(

Other observations:

killing the reduce tasks via job -killtask ends up with restarting the job on 
the same node, and curiously the new job gets jammed at the same position  
(6/14 maps needed).

The only remedy to this problem seems to be a complete restart of the cluster 
and reprocessing. That gets really boring with jobs that took a day to 
process first :(

Andreas

Re: hadoop 0.17.1 reducer not fetching map output problem

Posted by Andreas Kostyrka <an...@kostyrka.org>.
On Thursday 24 July 2008 21:40:22 Devaraj Das wrote:
> On 7/25/08 12:09 AM, "Andreas Kostyrka" <an...@kostyrka.org> wrote:
> > On Thursday 24 July 2008 15:19:22 Devaraj Das wrote:
> >> Could you try to kill the tasktracker hosting the task the next time
> >> when it happens? I just want to isolate the problem - whether it is a
> >> problem in the TT-JT communication or in the Task-TT communication. From
> >> your description it looks like the problem is between the JT-TT
> >> communication. But pls run the experiment when it happens again and let
> >> us know what happens.
> >
> > Well, I did restart the tasktracker where the reduce job was running, but
> > that lead only to a situation where the jobtracker did not restart the
> > job, showed it as still running, and was not able to kill the reduce task
> > via hadoop job -kill-task nor -fail-task.
>
> The reduce task would eventually be reexecuted (after some timeout,
> defaulting to 10 minutes, the tasktracker would be assumed as lost and all
> reducers that were running on that node would be reexecuted).
>
> > I hope to avoid a repeat, I'll be relapsing out cluster to 0.15 today. A
> > peer at another startup confirmed the whole batch of problems I've been
> > experiencing, and for him 0.15 works for production.
> >
> > <rant-mode>
> > No question, 0.17 is way better than 0.16, on the other hand I wonder how
> > 0.16 could get released? (I'm using streaming.jar, and with 0.16.x I've
> > introduced reducing to our workloads, and before 0.16 failed >80% of the
> > jobs with reducers not being able to get their output. 0.17.0 improved
> > that to a point where one can, with some pain, e.g. restarting the
> > cluster daily, not storing anything important on HDFS, only temporary
> > data, ..., use it somehow for production, at least for small jobs.) So
> > one wonders how 0.16 got released? Or was it meant only as developer-only
> > bug fixing series?
> > </rant-mode>
>
> Pls raise jiras for the specific problems.

I know, that's why I bracketed it as rantmode. OTOH, many of these issues had 
either this creepy feeling where you wondered if you did something wrong or 
were issues where I had to react relatively quickly, which usually destroys 
the faulty state. (I know, as a developer having reproduced a bug is golden. 
As an admin asked about processing lag, it's rather to opposite)

Plus fixing the issue in the next release or even via a patch means that I 
have a non-working cluster till then. Now I that means I would need to start 
debugging the cluster utility software instead of our apps. ;(

Andreas

Re: hadoop 0.17.1 reducer not fetching map output problem

Posted by Devaraj Das <dd...@yahoo-inc.com>.


On 7/25/08 12:09 AM, "Andreas Kostyrka" <an...@kostyrka.org> wrote:

> On Thursday 24 July 2008 15:19:22 Devaraj Das wrote:
>> Could you try to kill the tasktracker hosting the task the next time when
>> it happens? I just want to isolate the problem - whether it is a problem in
>> the TT-JT communication or in the Task-TT communication. From your
>> description it looks like the problem is between the JT-TT communication.
>> But pls run the experiment when it happens again and let us know what
>> happens.
> 
> Well, I did restart the tasktracker where the reduce job was running, but that
> lead only to a situation where the jobtracker did not restart the job, showed
> it as still running, and was not able to kill the reduce task via hadoop
> job -kill-task nor -fail-task.

The reduce task would eventually be reexecuted (after some timeout,
defaulting to 10 minutes, the tasktracker would be assumed as lost and all
reducers that were running on that node would be reexecuted).

> 
> I hope to avoid a repeat, I'll be relapsing out cluster to 0.15 today. A peer
> at another startup confirmed the whole batch of problems I've been
> experiencing, and for him 0.15 works for production.
> 
> <rant-mode>
> No question, 0.17 is way better than 0.16, on the other hand I wonder how 0.16
> could get released? (I'm using streaming.jar, and with 0.16.x I've introduced
> reducing to our workloads, and before 0.16 failed >80% of the jobs with
> reducers not being able to get their output. 0.17.0 improved that to a point
> where one can, with some pain, e.g. restarting the cluster daily, not storing
> anything important on HDFS, only temporary data, ..., use it somehow for
> production, at least for small jobs.) So one wonders how 0.16 got released?
> Or was it meant only as developer-only bug fixing series?
> </rant-mode>
>
Pls raise jiras for the specific problems.
 
> Sorry, this has been driving me up the walls into an asylum till I compared
> notes with a collegue, and decided that I'm not crazy ;)
> 
> Andreas
> 
>> 
>> Thanks,
>> Devaraj
>> 
>> On 7/24/08 1:42 PM, "Andreas Kostyrka" <an...@kostyrka.org> wrote:
>>> Hi!
>>> 
>>> I'm experiencing hung reducers, with the following symptoms:
>>>> Task Logs: 'task_200807230647_0008_r_000009_1'
>>>> 
>>>> 
>>>> stdout logs
>>>> 
>>>> 
>>>> 
>>>> stderr logs
>>>> 
>>>> 
>>>> 
>>>> syslog logs
>>>> 
>>>> red.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output
>>>> location(s); scheduling... 2008-07-24 07:56:11,064 INFO
>>>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>>>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>>>> 07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>>>> 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>>>> map-outputs from tasktracker and 0 map-outputs from previous failures
>>>> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>>>> scheduling... 2008-07-24 07:56:16,074 INFO
>>>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>>>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>>>> 07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>>>> 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>>>> map-outputs from tasktracker and 0 map-outputs from previous failures
>>>> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>>>> scheduling... 2008-07-24 07:56:21,084 INFO
>>>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>>>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>>>> 07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>>>> 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>>>> map-outputs from tasktracker and 0 map-outputs from previous failures
>>>> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>>>> scheduling... 2008-07-24 07:56:26,094 INFO
>>>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>>>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>>>> 07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>>>> 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>>>> map-outputs from tasktracker and 0 map-outputs from previous failures
>>>> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>>>> scheduling... 2008-07-24 07:56:31,104 INFO
>>>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>>>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>>>> 07:56:36,113 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>>>> 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>>>> map-outputs from tasktracker and 0 map-outputs from previous failures
>>>> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>>>> scheduling... 2008-07-24 07:56:36,114 INFO
>>>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>>>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>>>> 07:56:41,123 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>>>> 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>>>> map-outputs from tasktracker and 0 map-outputs from previous failures
>>>> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>>>> scheduling... 2008-07-24 07:56:41,126 INFO
>>>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>>>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>>> 
>>> Notice how it needs 6 map outputs, all map tasks have finished, and it
>>> still just hangs there.
>>> 
>>> The second speculative copy of that reducer task needs 14 map outputs
>>> with the same messages :(
>>> 
>>> Other observations:
>>> 
>>> killing the reduce tasks via job -killtask ends up with restarting the
>>> job on the same node, and curiously the new job gets jammed at the same
>>> position (6/14 maps needed).
>>> 
>>> The only remedy to this problem seems to be a complete restart of the
>>> cluster and reprocessing. That gets really boring with jobs that took a
>>> day to process first :(
>>> 
>>> Andreas
> 
> 



Re: hadoop 0.17.1 reducer not fetching map output problem

Posted by Andreas Kostyrka <an...@kostyrka.org>.
On Thursday 24 July 2008 15:19:22 Devaraj Das wrote:
> Could you try to kill the tasktracker hosting the task the next time when
> it happens? I just want to isolate the problem - whether it is a problem in
> the TT-JT communication or in the Task-TT communication. From your
> description it looks like the problem is between the JT-TT communication.
> But pls run the experiment when it happens again and let us know what
> happens.

Well, I did restart the tasktracker where the reduce job was running, but that 
lead only to a situation where the jobtracker did not restart the job, showed 
it as still running, and was not able to kill the reduce task via hadoop 
job -kill-task nor -fail-task.

I hope to avoid a repeat, I'll be relapsing out cluster to 0.15 today. A peer 
at another startup confirmed the whole batch of problems I've been 
experiencing, and for him 0.15 works for production.

<rant-mode>
No question, 0.17 is way better than 0.16, on the other hand I wonder how 0.16 
could get released? (I'm using streaming.jar, and with 0.16.x I've introduced 
reducing to our workloads, and before 0.16 failed >80% of the jobs with 
reducers not being able to get their output. 0.17.0 improved that to a point 
where one can, with some pain, e.g. restarting the cluster daily, not storing 
anything important on HDFS, only temporary data, ..., use it somehow for 
production, at least for small jobs.) So one wonders how 0.16 got released? 
Or was it meant only as developer-only bug fixing series?
</rant-mode>

Sorry, this has been driving me up the walls into an asylum till I compared 
notes with a collegue, and decided that I'm not crazy ;)

Andreas

>
> Thanks,
> Devaraj
>
> On 7/24/08 1:42 PM, "Andreas Kostyrka" <an...@kostyrka.org> wrote:
> > Hi!
> >
> > I'm experiencing hung reducers, with the following symptoms:
> >> Task Logs: 'task_200807230647_0008_r_000009_1'
> >>
> >>
> >> stdout logs
> >>
> >>
> >>
> >> stderr logs
> >>
> >>
> >>
> >> syslog logs
> >>
> >> red.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output
> >> location(s); scheduling... 2008-07-24 07:56:11,064 INFO
> >> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> >> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> >> 07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> >> 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> >> map-outputs from tasktracker and 0 map-outputs from previous failures
> >> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> >> scheduling... 2008-07-24 07:56:16,074 INFO
> >> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> >> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> >> 07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> >> 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> >> map-outputs from tasktracker and 0 map-outputs from previous failures
> >> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> >> scheduling... 2008-07-24 07:56:21,084 INFO
> >> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> >> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> >> 07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> >> 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> >> map-outputs from tasktracker and 0 map-outputs from previous failures
> >> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> >> scheduling... 2008-07-24 07:56:26,094 INFO
> >> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> >> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> >> 07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> >> 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> >> map-outputs from tasktracker and 0 map-outputs from previous failures
> >> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> >> scheduling... 2008-07-24 07:56:31,104 INFO
> >> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> >> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> >> 07:56:36,113 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> >> 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> >> map-outputs from tasktracker and 0 map-outputs from previous failures
> >> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> >> scheduling... 2008-07-24 07:56:36,114 INFO
> >> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> >> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> >> 07:56:41,123 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> >> 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> >> map-outputs from tasktracker and 0 map-outputs from previous failures
> >> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> >> scheduling... 2008-07-24 07:56:41,126 INFO
> >> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> >> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> >
> > Notice how it needs 6 map outputs, all map tasks have finished, and it
> > still just hangs there.
> >
> > The second speculative copy of that reducer task needs 14 map outputs
> > with the same messages :(
> >
> > Other observations:
> >
> > killing the reduce tasks via job -killtask ends up with restarting the
> > job on the same node, and curiously the new job gets jammed at the same
> > position (6/14 maps needed).
> >
> > The only remedy to this problem seems to be a complete restart of the
> > cluster and reprocessing. That gets really boring with jobs that took a
> > day to process first :(
> >
> > Andreas



Re: hadoop 0.17.1 reducer not fetching map output problem

Posted by Devaraj Das <dd...@yahoo-inc.com>.
Could you try to kill the tasktracker hosting the task the next time when it
happens? I just want to isolate the problem - whether it is a problem in the
TT-JT communication or in the Task-TT communication. From your description
it looks like the problem is between the JT-TT communication. But pls run
the experiment when it happens again and let us know what happens.

Thanks,
Devaraj


On 7/24/08 1:42 PM, "Andreas Kostyrka" <an...@kostyrka.org> wrote:

> Hi!
> 
> I'm experiencing hung reducers, with the following symptoms:
> 
>> Task Logs: 'task_200807230647_0008_r_000009_1'
>> 
>> 
>> stdout logs
>> 
>> 
>> 
>> stderr logs
>> 
>> 
>> 
>> syslog logs
>> 
>> red.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output
>> location(s); scheduling... 2008-07-24 07:56:11,064 INFO
>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>> 07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>> 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>> map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>> scheduling... 2008-07-24 07:56:16,074 INFO
>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>> 07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>> 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>> map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>> scheduling... 2008-07-24 07:56:21,084 INFO
>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>> 07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>> 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>> map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>> scheduling... 2008-07-24 07:56:26,094 INFO
>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>> 07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>> 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>> map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>> scheduling... 2008-07-24 07:56:31,104 INFO
>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>> 07:56:36,113 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>> 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>> map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>> scheduling... 2008-07-24 07:56:36,114 INFO
>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>> 07:56:41,123 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>> 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>> map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>> scheduling... 2008-07-24 07:56:41,126 INFO
>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> 
> 
> Notice how it needs 6 map outputs, all map tasks have finished, and it still
> just hangs there.
> 
> The second speculative copy of that reducer task needs 14 map outputs with the
> same messages :(
> 
> Other observations:
> 
> killing the reduce tasks via job -killtask ends up with restarting the job on
> the same node, and curiously the new job gets jammed at the same position
> (6/14 maps needed).
> 
> The only remedy to this problem seems to be a complete restart of the cluster
> and reprocessing. That gets really boring with jobs that took a day to
> process first :(
> 
> Andreas