Posted to common-dev@hadoop.apache.org by "Arun C Murthy (JIRA)" <ji...@apache.org> on 2007/09/10 12:50:29 UTC
[jira] Commented: (HADOOP-1862) reduces are getting stuck trying to find map outputs

    [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526123 ] 

Arun C Murthy commented on HADOOP-1862:
---------------------------------------

Hmm... one straw to clutch:

{noformat}


$ cat 1862-event.log | grep task_200709041519_0023_m_001149
OBSOLETE task_200709041519_0023_m_001149_0 http://a.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001149_0
FAILED task_200709041519_0023_m_001149_0 null
SUCCEEDED task_200709041519_0023_m_001149_1 http://b.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001149_1
SUCCEEDED task_200709041519_0023_m_001149_2 http://c.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001149_2


$ cat 1862-event.log | grep task_200709041519_0023_m_001816
OBSOLETE task_200709041519_0023_m_001816_0 http://x.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001816_0
FAILED task_200709041519_0023_m_001816_0 null
SUCCEEDED task_200709041519_0023_m_001816_1 http://y.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001816_1
SUCCEEDED task_200709041519_0023_m_001816_2 http://z.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001816_2


{noformat}
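
The two greps above can be generalized to the whole event log. What follows is only a rough sketch in present-day Java, assuming each line of 1862-event.log has the shape shown above (status, attempt id, then a URL or null); the class and method names are made up for illustration and are not part of Hadoop.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not Hadoop code: finds map tasks that have more than
// one SUCCEEDED attempt in an event log shaped like the excerpts above.
public class DuplicateSuccessScan {

    // Groups SUCCEEDED attempt ids by task id (the attempt id minus its
    // trailing "_N") and keeps only the tasks with two or more of them.
    static Map<String, List<String>> duplicateSuccesses(List<String> lines) {
        Map<String, List<String>> byTask = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 2 || !fields[0].equals("SUCCEEDED")) {
                continue; // skip blank lines and OBSOLETE/FAILED/other events
            }
            String attemptId = fields[1];
            int cut = attemptId.lastIndexOf('_');
            if (cut < 0) {
                continue; // not an attempt id in the assumed format
            }
            String taskId = attemptId.substring(0, cut);
            byTask.computeIfAbsent(taskId, k -> new ArrayList<>()).add(attemptId);
        }
        byTask.values().removeIf(attempts -> attempts.size() < 2);
        return byTask;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: DuplicateSuccessScan <event-log>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        duplicateSuccesses(lines).forEach(
            (task, attempts) -> System.out.println(task + " " + attempts));
    }
}
```

Run against the log, it would list every map with more than one SUCCEEDED attempt, which should include the two maps above.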


Essentially, in {{JobInProgress.updateTaskStatuses(TaskInProgress, TaskStatus, JobTrackerMetrics)}} a {{TaskCompletionEvent.Status.SUCCEEDED}} event is added whether or not the TIP is already complete, so each reducer sees 2 {{TaskCompletionEvent.Status.SUCCEEDED}} events for the same map, as above. Clearly the fetch from one of them will fail, since either _1 or _2 will be {{KILLED}}... not a happy situation.
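
To make the hypothesis concrete, here is a minimal sketch of the kind of guard that seems to be missing. It is not the actual {{JobInProgress}} code and not a proposed patch: {{Tip}}, {{Event}} and {{onAttemptSucceeded}} are simplified, hypothetical stand-ins, and the only behaviour assumed is what is described above (a second successful attempt of an already-complete TIP ends up {{KILLED}}).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; the types below are hypothetical stand-ins,
// not the real Hadoop TaskInProgress / TaskCompletionEvent classes.
public class CompletionEventSketch {

    enum Status { SUCCEEDED, FAILED, KILLED, OBSOLETE }

    // Stand-in for a task completion event seen by the reducers.
    static final class Event {
        final String attemptId;
        final Status status;

        Event(String attemptId, Status status) {
            this.attemptId = attemptId;
            this.status = status;
        }
    }

    // Stand-in for a TIP: remembers which attempt (if any) completed it.
    static final class Tip {
        private String successfulAttempt; // null until an attempt succeeds

        boolean isComplete() {
            return successfulAttempt != null;
        }

        void markComplete(String attemptId) {
            successfulAttempt = attemptId;
        }
    }

    // Publishes a SUCCEEDED event only for the first attempt that completes
    // the TIP. A later successful attempt is assumed to be killed, so no
    // event is published for it and reducers never try to fetch from it.
    static void onAttemptSucceeded(Tip tip, String attemptId, List<Event> events) {
        if (tip.isComplete()) {
            return; // TIP already complete: do not add a second SUCCEEDED
        }
        tip.markComplete(attemptId);
        events.add(new Event(attemptId, Status.SUCCEEDED));
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        Tip tip = new Tip();
        onAttemptSucceeded(tip, "task_200709041519_0023_m_001149_1", events);
        onAttemptSucceeded(tip, "task_200709041519_0023_m_001149_2", events);
        for (Event e : events) {
            System.out.println(e.status + " " + e.attemptId);
        }
    }
}
```

With a check like this, the event list for m_001149 would carry a single SUCCEEDED entry instead of the two seen in the log.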

Like I said, I'll try to dig deeper; maybe this will help someone beat me to it. *smile*

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.1
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>            Priority: Blocker
>             Fix For: 0.15.0
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events, all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 and 6 times each. I do see that some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.