You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2007/09/07 23:40:31 UTC

[jira] Created: (HADOOP-1862) reduces are getting stuck trying to find map outputs

reduces are getting stuck trying to find map outputs
----------------------------------------------------

                 Key: HADOOP-1862
                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
    Affects Versions: 0.14.1
            Reporter: Owen O'Malley
            Assignee: Owen O'Malley
            Priority: Blocker
             Fix For: 0.15.0


Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1862:
----------------------------------

    Attachment: HADOOP-1862_debug.patch

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.1
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530947 ] 

Devaraj Das commented on HADOOP-1862:
-------------------------------------

The patch looks good.

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0, 0.13.1, 0.14.0, 0.14.1
>            Reporter: Owen O'Malley
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.14.2
>
>         Attachments: HADOOP-1862_1_20070912.patch, HADOOP-1862_2_20070927.patch, HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1862:
----------------------------------

    Attachment: HADOOP-1862_1_20070912.patch

Here is a first take on this one, with very preliminary testing done.

I know Devaraj has some second-thoughts, but yet... *smile*

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0, 0.13.1, 0.14.0, 0.14.1
>            Reporter: Owen O'Malley
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 14.2
>
>         Attachments: HADOOP-1862_1_20070912.patch, HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1862:
----------------------------------

    Attachment: HADOOP-1862_3_20071002.patch

Updated patch, also resets the 'size' variable {{MapOutputCopier.run}} on fetch-failures.

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0, 0.13.1, 0.14.0, 0.14.1
>            Reporter: Owen O'Malley
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.14.2
>
>         Attachments: HADOOP-1862_1_20070912.patch, HADOOP-1862_2_20070927.patch, HADOOP-1862_3_20071002.patch, HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526123 ] 

Arun C Murthy commented on HADOOP-1862:
---------------------------------------

Hmm... one straw to clutch:

{noformat}


$ cat 1862-event.log | grep task_200709041519_0023_m_001149
OBSOLETE task_200709041519_0023_m_001149_0 http://a.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001149_0
FAILED task_200709041519_0023_m_001149_0 null
SUCCEEDED task_200709041519_0023_m_001149_1 http://b.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001149_1
SUCCEEDED task_200709041519_0023_m_001149_2 http://c.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001149_2


$ cat 1862-event.log | grep task_200709041519_0023_m_001816
OBSOLETE task_200709041519_0023_m_001816_0 http://x.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001816_0
FAILED task_200709041519_0023_m_001816_0 null
SUCCEEDED task_200709041519_0023_m_001816_1 http://y.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001816_1
SUCCEEDED task_200709041519_0023_m_001816_2 http://z.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001816_2


{noformat}


Essentially, in {{JobInProgress.updateTaskStatuses(TaskInProgress, TaskStatus, JobTrackerMetrics)}} the {{TaskCompletionEvent.Status.SUCCEEDED}} is added irrespective of whether the TIP is already complete or not, leading to each reducer seeing 2 {{TaskCompletionEvent.Status.SUCCEEDED}} events as above... clearly the fetch from one of them will fail since either _1 or _2 will be {{KILLED}}, not a happy situation. 

Like I said, I'll try to dig deeper, maybe this could help someone beat me to it. *smile*

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.1
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>            Priority: Blocker
>             Fix For: 0.15.0
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1862:
----------------------------------

    Attachment:     (was: HADOOP-1862_debug.patch)

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.1
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1862:
----------------------------------

    Attachment: HADOOP-1862_debug.patch

Ok, I was fortunate to be able to look at an actual job exhibiting this same behaviour (thanks to Christian and HADOOP-1874) and here are some insights:

Essentially there seems like a issue with {{TaskCompletionEvent}}s received by the {{ReduceTask}}. More than one *stuck* reducer (looking for map outputs) never actually scheduled a copy from the maps from which it is missing outputs, which leads me to believe that there is an issue with missing {{TaskCompletionEvents}}s.

The other minor bug is the one I pointed out in my previous comment, so I've attached another patch which incorporates:
a) previous fix
b) a debug statement to help track the received {{TaskCompletionEvent}}s at the {{Reduce}} task.

Christian: I'd appreciate if you could try this out... Thanks!

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.1
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531949 ] 

Hudson commented on HADOOP-1862:
--------------------------------

Integrated in Hadoop-Nightly #258 (See [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/258/])

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0, 0.13.1, 0.14.0, 0.14.1
>            Reporter: Owen O'Malley
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.14.2
>
>         Attachments: HADOOP-1862_1_20070912.patch, HADOOP-1862_2_20070927.patch, HADOOP-1862_3_20071002.patch, HADOOP-1862_branch14_1_20071003.patch, HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1862:
----------------------------------

    Attachment: HADOOP-1862_branch14_1_20071003.patch

Attached patch for branch-0.14

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0, 0.13.1, 0.14.0, 0.14.1
>            Reporter: Owen O'Malley
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.14.2
>
>         Attachments: HADOOP-1862_1_20070912.patch, HADOOP-1862_2_20070927.patch, HADOOP-1862_3_20071002.patch, HADOOP-1862_branch14_1_20071003.patch, HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1862:
----------------------------------

    Status: Patch Available  (was: Open)

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.1, 0.14.0, 0.13.1, 0.13.0
>            Reporter: Owen O'Malley
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.14.2
>
>         Attachments: HADOOP-1862_1_20070912.patch, HADOOP-1862_2_20070927.patch, HADOOP-1862_3_20071002.patch, HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527954 ] 

Devaraj Das commented on HADOOP-1862:
-------------------------------------

Sorry, I am not convinced that the patch solves the issue at hand. I would like to wait for the problem to be replicated with a version of hadoop that includes the patch for HADOOP-1874. Then I would like to see whether we still run into the situation reported by this bug, apply this patch on that version, and see whether it solves the problem. This is a significant change and I just want to make sure that this patch really addresses the reported bug.

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0, 0.13.1, 0.14.0, 0.14.1
>            Reporter: Owen O'Malley
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.14.2
>
>         Attachments: HADOOP-1862_1_20070912.patch, HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-1862:
----------------------------------

        Fix Version/s:     (was: 0.15.0)
                       14.2
    Affects Version/s: 0.13.0
                       0.13.1
                       0.14.0

Ok, I just found this one and it is a bad one. The problem is that the TaskTracker in fetchMapCompletionEvents stores a cache of the completion events indexed by TIP id instead TASK id. So there is a event race condition between tasks in the same tip and if the last event is the failed one, then the reduces get stuck, because that map is marked as never completed. I've marked this as a 0.14.2 fix, but we might need a 0.13 fix too.

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0, 0.13.1, 0.14.0, 0.14.1
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>            Priority: Blocker
>             Fix For: 14.2
>
>         Attachments: HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das resolved HADOOP-1862.
---------------------------------

    Resolution: Fixed

I just committed this. Thanks, Arun!

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0, 0.13.1, 0.14.0, 0.14.1
>            Reporter: Owen O'Malley
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.14.2
>
>         Attachments: HADOOP-1862_1_20070912.patch, HADOOP-1862_2_20070927.patch, HADOOP-1862_3_20071002.patch, HADOOP-1862_branch14_1_20071003.patch, HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1862:
----------------------------------

    Attachment: HADOOP-1862_2_20070927.patch

Here is a well-tested patch - works with and without speculative-execution, lots of lost tasktrackers etc.

However, I want to wait for Devaraj to run a few tests with this *and* HADOOP-1874 before we get this in... hence I'm not making it PA right-away. Clearly any reviews/ideas on this are welcome.

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0, 0.13.1, 0.14.0, 0.14.1
>            Reporter: Owen O'Malley
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.14.2
>
>         Attachments: HADOOP-1862_1_20070912.patch, HADOOP-1862_2_20070927.patch, HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531176 ] 

Hadoop QA commented on HADOOP-1862:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
http://issues.apache.org/jira/secure/attachment/12366638/HADOOP-1862_2_20070927.patch
against trunk revision r580461.

    @author +1.  The patch does not contain any @author tags.

    javadoc +1.  The javadoc tool did not generate any warning messages.

    javac +1.  The applied patch does not generate any new compiler warnings.

    findbugs +1.  The patch does not introduce any new Findbugs warnings.

    core tests +1.  The patch passed core unit tests.

    contrib tests -1.  The patch failed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/845/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/845/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/845/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/845/console

This message is automatically generated.

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0, 0.13.1, 0.14.0, 0.14.1
>            Reporter: Owen O'Malley
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.14.2
>
>         Attachments: HADOOP-1862_1_20070912.patch, HADOOP-1862_2_20070927.patch, HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1862:
----------------------------------

    Attachment: HADOOP-1862_prelim.patch

Illustrative fix...

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.1
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1862:
--------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Arun!

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0, 0.13.1, 0.14.0, 0.14.1
>            Reporter: Owen O'Malley
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.14.2
>
>         Attachments: HADOOP-1862_1_20070912.patch, HADOOP-1862_2_20070927.patch, HADOOP-1862_3_20071002.patch, HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1862:
--------------------------------

    Status: Patch Available  (was: Open)

Submitting the patch on behalf of Arun to get it through Hudson.

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.1, 0.14.0, 0.13.1, 0.13.0
>            Reporter: Owen O'Malley
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.14.2
>
>         Attachments: HADOOP-1862_1_20070912.patch, HADOOP-1862_2_20070927.patch, HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531847 ] 

Hadoop QA commented on HADOOP-1862:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
http://issues.apache.org/jira/secure/attachment/12366927/HADOOP-1862_3_20071002.patch
against trunk revision r581101.

    @author +1.  The patch does not contain any @author tags.

    javadoc +1.  The javadoc tool did not generate any warning messages.

    javac +1.  The applied patch does not generate any new compiler warnings.

    findbugs +1.  The patch does not introduce any new Findbugs warnings.

    core tests +1.  The patch passed core unit tests.

    contrib tests +1.  The patch passed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/865/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/865/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/865/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/865/console

This message is automatically generated.

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0, 0.13.1, 0.14.0, 0.14.1
>            Reporter: Owen O'Malley
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.14.2
>
>         Attachments: HADOOP-1862_1_20070912.patch, HADOOP-1862_2_20070927.patch, HADOOP-1862_3_20071002.patch, HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1862:
----------------------------------

    Status: Open  (was: Patch Available)

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.1, 0.14.0, 0.13.1, 0.13.0
>            Reporter: Owen O'Malley
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.14.2
>
>         Attachments: HADOOP-1862_1_20070912.patch, HADOOP-1862_2_20070927.patch, HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Reopened: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das reopened HADOOP-1862:
---------------------------------


Reopening the issue since the patch doesn't apply to 0.14. Please submit a separate patch for that.

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0, 0.13.1, 0.14.0, 0.14.1
>            Reporter: Owen O'Malley
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.14.2
>
>         Attachments: HADOOP-1862_1_20070912.patch, HADOOP-1862_2_20070927.patch, HADOOP-1862_3_20071002.patch, HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1862) reduces are getting stuck trying to find map outputs

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1862:
----------------------------------

    Assignee: Arun C Murthy  (was: Owen O'Malley)

Ok, here is the proposal to fix this one:

a) Maintain a straight cache at the {{TaskTracker}} i.e. straight copy of the events sent by the {{JobTracker}} and don't manipulate them.
b) Process {{OBSOLETE}} {{TaskCompletionEvent}}s at the {{ReduceTask}}.
c) Keep my earlier fix to {{TaskInProgress.completedTask}} i.e. do not add a {{SUCCESS}} event until it is clear that this is the *true* successful event, and that the TIP hadn't already been completed by some other task-attempt.

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0, 0.13.1, 0.14.0, 0.14.1
>            Reporter: Owen O'Malley
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 14.2
>
>         Attachments: HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.