You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by Andreas Kostyrka <an...@kostyrka.org> on 2008/07/24 10:12:27 UTC
hadoop 0.17.1 reducer not fetching map output problem
Hi!
I'm experiencing hung reducers, with the following symptoms:
> Task Logs: 'task_200807230647_0008_r_000009_1'
>
>
> stdout logs
>
>
>
> stderr logs
>
>
>
> syslog logs
>
> red.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output
> location(s); scheduling... 2008-07-24 07:56:11,064 INFO
> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> 07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> scheduling... 2008-07-24 07:56:16,074 INFO
> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> 07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> scheduling... 2008-07-24 07:56:21,084 INFO
> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> 07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> scheduling... 2008-07-24 07:56:26,094 INFO
> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> 07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> scheduling... 2008-07-24 07:56:31,104 INFO
> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> 07:56:36,113 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> scheduling... 2008-07-24 07:56:36,114 INFO
> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> 07:56:41,123 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> scheduling... 2008-07-24 07:56:41,126 INFO
> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
Notice how it needs 6 map outputs, all map tasks have finished, and it still
just hangs there.
The second speculative copy of that reducer task needs 14 map outputs with the
same messages :(
Other observations:
killing the reduce tasks via job -killtask ends up with restarting the job on
the same node, and curiously the new job gets jammed at the same position
(6/14 maps needed).
The only remedy to this problem seems to be a complete restart of the cluster
and reprocessing. That gets really boring with jobs that took a day to
process first :(
Andreas
Re: hadoop 0.17.1 reducer not fetching map output problem
Posted by Andreas Kostyrka <an...@kostyrka.org>.
On Thursday 24 July 2008 21:40:22 Devaraj Das wrote:
> On 7/25/08 12:09 AM, "Andreas Kostyrka" <an...@kostyrka.org> wrote:
> > On Thursday 24 July 2008 15:19:22 Devaraj Das wrote:
> >> Could you try to kill the tasktracker hosting the task the next time
> >> when it happens? I just want to isolate the problem - whether it is a
> >> problem in the TT-JT communication or in the Task-TT communication. From
> >> your description it looks like the problem is between the JT-TT
> >> communication. But pls run the experiment when it happens again and let
> >> us know what happens.
> >
> > Well, I did restart the tasktracker where the reduce job was running, but
> > that lead only to a situation where the jobtracker did not restart the
> > job, showed it as still running, and was not able to kill the reduce task
> > via hadoop job -kill-task nor -fail-task.
>
> The reduce task would eventually be reexecuted (after some timeout,
> defaulting to 10 minutes, the tasktracker would be assumed as lost and all
> reducers that were running on that node would be reexecuted).
>
> > I hope to avoid a repeat, I'll be relapsing out cluster to 0.15 today. A
> > peer at another startup confirmed the whole batch of problems I've been
> > experiencing, and for him 0.15 works for production.
> >
> > <rant-mode>
> > No question, 0.17 is way better than 0.16, on the other hand I wonder how
> > 0.16 could get released? (I'm using streaming.jar, and with 0.16.x I've
> > introduced reducing to our workloads, and before 0.16 failed >80% of the
> > jobs with reducers not being able to get their output. 0.17.0 improved
> > that to a point where one can, with some pain, e.g. restarting the
> > cluster daily, not storing anything important on HDFS, only temporary
> > data, ..., use it somehow for production, at least for small jobs.) So
> > one wonders how 0.16 got released? Or was it meant only as developer-only
> > bug fixing series?
> > </rant-mode>
>
> Pls raise jiras for the specific problems.
I know, that's why I bracketed it as rantmode. OTOH, many of these issues had
either this creepy feeling where you wondered if you did something wrong or
were issues where I had to react relatively quickly, which usually destroys
the faulty state. (I know, as a developer having reproduced a bug is golden.
As an admin asked about processing lag, it's rather to opposite)
Plus fixing the issue in the next release or even via a patch means that I
have a non-working cluster till then. Now I that means I would need to start
debugging the cluster utility software instead of our apps. ;(
Andreas
Re: hadoop 0.17.1 reducer not fetching map output problem
Posted by Devaraj Das <dd...@yahoo-inc.com>.
On 7/25/08 12:09 AM, "Andreas Kostyrka" <an...@kostyrka.org> wrote:
> On Thursday 24 July 2008 15:19:22 Devaraj Das wrote:
>> Could you try to kill the tasktracker hosting the task the next time when
>> it happens? I just want to isolate the problem - whether it is a problem in
>> the TT-JT communication or in the Task-TT communication. From your
>> description it looks like the problem is between the JT-TT communication.
>> But pls run the experiment when it happens again and let us know what
>> happens.
>
> Well, I did restart the tasktracker where the reduce job was running, but that
> lead only to a situation where the jobtracker did not restart the job, showed
> it as still running, and was not able to kill the reduce task via hadoop
> job -kill-task nor -fail-task.
The reduce task would eventually be reexecuted (after some timeout,
defaulting to 10 minutes, the tasktracker would be assumed as lost and all
reducers that were running on that node would be reexecuted).
>
> I hope to avoid a repeat, I'll be relapsing out cluster to 0.15 today. A peer
> at another startup confirmed the whole batch of problems I've been
> experiencing, and for him 0.15 works for production.
>
> <rant-mode>
> No question, 0.17 is way better than 0.16, on the other hand I wonder how 0.16
> could get released? (I'm using streaming.jar, and with 0.16.x I've introduced
> reducing to our workloads, and before 0.16 failed >80% of the jobs with
> reducers not being able to get their output. 0.17.0 improved that to a point
> where one can, with some pain, e.g. restarting the cluster daily, not storing
> anything important on HDFS, only temporary data, ..., use it somehow for
> production, at least for small jobs.) So one wonders how 0.16 got released?
> Or was it meant only as developer-only bug fixing series?
> </rant-mode>
>
Pls raise jiras for the specific problems.
> Sorry, this has been driving me up the walls into an asylum till I compared
> notes with a collegue, and decided that I'm not crazy ;)
>
> Andreas
>
>>
>> Thanks,
>> Devaraj
>>
>> On 7/24/08 1:42 PM, "Andreas Kostyrka" <an...@kostyrka.org> wrote:
>>> Hi!
>>>
>>> I'm experiencing hung reducers, with the following symptoms:
>>>> Task Logs: 'task_200807230647_0008_r_000009_1'
>>>>
>>>>
>>>> stdout logs
>>>>
>>>>
>>>>
>>>> stderr logs
>>>>
>>>>
>>>>
>>>> syslog logs
>>>>
>>>> red.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output
>>>> location(s); scheduling... 2008-07-24 07:56:11,064 INFO
>>>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>>>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>>>> 07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>>>> 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>>>> map-outputs from tasktracker and 0 map-outputs from previous failures
>>>> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>>>> scheduling... 2008-07-24 07:56:16,074 INFO
>>>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>>>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>>>> 07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>>>> 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>>>> map-outputs from tasktracker and 0 map-outputs from previous failures
>>>> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>>>> scheduling... 2008-07-24 07:56:21,084 INFO
>>>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>>>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>>>> 07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>>>> 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>>>> map-outputs from tasktracker and 0 map-outputs from previous failures
>>>> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>>>> scheduling... 2008-07-24 07:56:26,094 INFO
>>>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>>>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>>>> 07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>>>> 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>>>> map-outputs from tasktracker and 0 map-outputs from previous failures
>>>> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>>>> scheduling... 2008-07-24 07:56:31,104 INFO
>>>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>>>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>>>> 07:56:36,113 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>>>> 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>>>> map-outputs from tasktracker and 0 map-outputs from previous failures
>>>> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>>>> scheduling... 2008-07-24 07:56:36,114 INFO
>>>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>>>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>>>> 07:56:41,123 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>>>> 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>>>> map-outputs from tasktracker and 0 map-outputs from previous failures
>>>> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>>>> scheduling... 2008-07-24 07:56:41,126 INFO
>>>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>>>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>>>
>>> Notice how it needs 6 map outputs, all map tasks have finished, and it
>>> still just hangs there.
>>>
>>> The second speculative copy of that reducer task needs 14 map outputs
>>> with the same messages :(
>>>
>>> Other observations:
>>>
>>> killing the reduce tasks via job -killtask ends up with restarting the
>>> job on the same node, and curiously the new job gets jammed at the same
>>> position (6/14 maps needed).
>>>
>>> The only remedy to this problem seems to be a complete restart of the
>>> cluster and reprocessing. That gets really boring with jobs that took a
>>> day to process first :(
>>>
>>> Andreas
>
>
Re: hadoop 0.17.1 reducer not fetching map output problem
Posted by Andreas Kostyrka <an...@kostyrka.org>.
On Thursday 24 July 2008 15:19:22 Devaraj Das wrote:
> Could you try to kill the tasktracker hosting the task the next time when
> it happens? I just want to isolate the problem - whether it is a problem in
> the TT-JT communication or in the Task-TT communication. From your
> description it looks like the problem is between the JT-TT communication.
> But pls run the experiment when it happens again and let us know what
> happens.
Well, I did restart the tasktracker where the reduce job was running, but that
lead only to a situation where the jobtracker did not restart the job, showed
it as still running, and was not able to kill the reduce task via hadoop
job -kill-task nor -fail-task.
I hope to avoid a repeat, I'll be relapsing out cluster to 0.15 today. A peer
at another startup confirmed the whole batch of problems I've been
experiencing, and for him 0.15 works for production.
<rant-mode>
No question, 0.17 is way better than 0.16, on the other hand I wonder how 0.16
could get released? (I'm using streaming.jar, and with 0.16.x I've introduced
reducing to our workloads, and before 0.16 failed >80% of the jobs with
reducers not being able to get their output. 0.17.0 improved that to a point
where one can, with some pain, e.g. restarting the cluster daily, not storing
anything important on HDFS, only temporary data, ..., use it somehow for
production, at least for small jobs.) So one wonders how 0.16 got released?
Or was it meant only as developer-only bug fixing series?
</rant-mode>
Sorry, this has been driving me up the walls into an asylum till I compared
notes with a collegue, and decided that I'm not crazy ;)
Andreas
>
> Thanks,
> Devaraj
>
> On 7/24/08 1:42 PM, "Andreas Kostyrka" <an...@kostyrka.org> wrote:
> > Hi!
> >
> > I'm experiencing hung reducers, with the following symptoms:
> >> Task Logs: 'task_200807230647_0008_r_000009_1'
> >>
> >>
> >> stdout logs
> >>
> >>
> >>
> >> stderr logs
> >>
> >>
> >>
> >> syslog logs
> >>
> >> red.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output
> >> location(s); scheduling... 2008-07-24 07:56:11,064 INFO
> >> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> >> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> >> 07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> >> 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> >> map-outputs from tasktracker and 0 map-outputs from previous failures
> >> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> >> scheduling... 2008-07-24 07:56:16,074 INFO
> >> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> >> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> >> 07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> >> 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> >> map-outputs from tasktracker and 0 map-outputs from previous failures
> >> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> >> scheduling... 2008-07-24 07:56:21,084 INFO
> >> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> >> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> >> 07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> >> 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> >> map-outputs from tasktracker and 0 map-outputs from previous failures
> >> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> >> scheduling... 2008-07-24 07:56:26,094 INFO
> >> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> >> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> >> 07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> >> 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> >> map-outputs from tasktracker and 0 map-outputs from previous failures
> >> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> >> scheduling... 2008-07-24 07:56:31,104 INFO
> >> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> >> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> >> 07:56:36,113 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> >> 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> >> map-outputs from tasktracker and 0 map-outputs from previous failures
> >> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> >> scheduling... 2008-07-24 07:56:36,114 INFO
> >> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> >> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
> >> 07:56:41,123 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
> >> 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
> >> map-outputs from tasktracker and 0 map-outputs from previous failures
> >> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
> >> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
> >> scheduling... 2008-07-24 07:56:41,126 INFO
> >> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
> >> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> >
> > Notice how it needs 6 map outputs, all map tasks have finished, and it
> > still just hangs there.
> >
> > The second speculative copy of that reducer task needs 14 map outputs
> > with the same messages :(
> >
> > Other observations:
> >
> > killing the reduce tasks via job -killtask ends up with restarting the
> > job on the same node, and curiously the new job gets jammed at the same
> > position (6/14 maps needed).
> >
> > The only remedy to this problem seems to be a complete restart of the
> > cluster and reprocessing. That gets really boring with jobs that took a
> > day to process first :(
> >
> > Andreas
Re: hadoop 0.17.1 reducer not fetching map output problem
Posted by Devaraj Das <dd...@yahoo-inc.com>.
Could you try to kill the tasktracker hosting the task the next time when it
happens? I just want to isolate the problem - whether it is a problem in the
TT-JT communication or in the Task-TT communication. From your description
it looks like the problem is between the JT-TT communication. But pls run
the experiment when it happens again and let us know what happens.
Thanks,
Devaraj
On 7/24/08 1:42 PM, "Andreas Kostyrka" <an...@kostyrka.org> wrote:
> Hi!
>
> I'm experiencing hung reducers, with the following symptoms:
>
>> Task Logs: 'task_200807230647_0008_r_000009_1'
>>
>>
>> stdout logs
>>
>>
>>
>> stderr logs
>>
>>
>>
>> syslog logs
>>
>> red.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output
>> location(s); scheduling... 2008-07-24 07:56:11,064 INFO
>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>> 07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>> 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>> map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>> scheduling... 2008-07-24 07:56:16,074 INFO
>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>> 07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>> 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>> map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>> scheduling... 2008-07-24 07:56:21,084 INFO
>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>> 07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>> 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>> map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>> scheduling... 2008-07-24 07:56:26,094 INFO
>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>> 07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>> 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>> map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>> scheduling... 2008-07-24 07:56:31,104 INFO
>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>> 07:56:36,113 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>> 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>> map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>> scheduling... 2008-07-24 07:56:36,114 INFO
>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>> 07:56:41,123 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>> 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>> map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>> scheduling... 2008-07-24 07:56:41,126 INFO
>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>
>
> Notice how it needs 6 map outputs, all map tasks have finished, and it still
> just hangs there.
>
> The second speculative copy of that reducer task needs 14 map outputs with the
> same messages :(
>
> Other observations:
>
> killing the reduce tasks via job -killtask ends up with restarting the job on
> the same node, and curiously the new job gets jammed at the same position
> (6/14 maps needed).
>
> The only remedy to this problem seems to be a complete restart of the cluster
> and reprocessing. That gets really boring with jobs that took a day to
> process first :(
>
> Andreas