
Posted to common-dev@hadoop.apache.org by "Devaraj Das (JIRA)" <ji...@apache.org> on 2007/03/29 14:20:25 UTC

[jira] Created: (HADOOP-1183) MapTask completion event lost

MapTask completion event lost
-----------------------------

                 Key: HADOOP-1183
                 URL: https://issues.apache.org/jira/browse/HADOOP-1183
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
            Reporter: Devaraj Das
         Assigned To: Devaraj Das
            Priority: Critical


A couple of reducers were continuously trying to fetch map outputs from a lost tasktracker. Although the tasks running on that TT successfully reexecuted elsewhere, the tasktrackers didn't correctly note those events.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-1183) MapTask completion not recorded properly at the Reducer's end

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-1183:
----------------------------------

    Fix Version/s:     (was: 0.12.3)
                   0.13.0
           Status: Open  (was: Patch Available)

I'm uneasy about this patch. The underlying code is very complex, the patch adds substantial complexity, and it isn't clear to me that this is the right direction. I think we should postpone this fix and likely redesign the fetcher in 0.13.

One possible approach to simplifying this section of code would be to make an array of states for each of the map outputs (INITIAL, LOCATED, FETCHING, DONE, FAILED) and process the map outputs using a DFA. Another structure that might make sense is an array of the best MapOutputLocation for each map.
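
As a rough illustration of the state-array idea above, here is a minimal Java sketch. The class name, the event names, and the specific transitions are assumptions made for the example; only the five state names come from the comment, and none of this is the actual fetcher code.

```java
import java.util.Arrays;

// Illustrative sketch only: class name, event names, and transitions are
// assumptions, not Hadoop's actual fetcher code.
class MapOutputStates {
  enum State { INITIAL, LOCATED, FETCHING, DONE, FAILED }
  enum Event { LOCATION_KNOWN, FETCH_STARTED, FETCH_SUCCEEDED, FETCH_FAILED }

  private final State[] states;  // one entry per map output

  MapOutputStates(int numMaps) {
    states = new State[numMaps];
    Arrays.fill(states, State.INITIAL);
  }

  State get(int mapId) {
    return states[mapId];
  }

  /** Applies one event to one map output and returns its new state. */
  State apply(int mapId, Event event) {
    states[mapId] = next(states[mapId], event);
    return states[mapId];
  }

  /** Transition function; events that do not apply in a state are ignored. */
  static State next(State state, Event event) {
    switch (state) {
      case INITIAL:
      case FAILED:
        // A (new) location makes the output fetchable again.
        return event == Event.LOCATION_KNOWN ? State.LOCATED : state;
      case LOCATED:
        return event == Event.FETCH_STARTED ? State.FETCHING : state;
      case FETCHING:
        if (event == Event.FETCH_SUCCEEDED) return State.DONE;
        if (event == Event.FETCH_FAILED) return State.FAILED;
        return state;
      default:
        return state;  // DONE is terminal
    }
  }

  /** True once every map output has been fetched. */
  boolean allDone() {
    for (State s : states) {
      if (s != State.DONE) return false;
    }
    return true;
  }
}
```

Keeping the transitions in one function is what makes this simpler to reason about: every legal move is visible in a single switch. Where each output currently lives would be tracked in a separate structure, such as the per-map location array mentioned above.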

Thoughts?

> MapTask completion not recorded properly at the Reducer's end
> -------------------------------------------------------------
>
>                 Key: HADOOP-1183
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1183
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>         Assigned To: Devaraj Das
>            Priority: Critical
>             Fix For: 0.13.0
>
>         Attachments: 1183.new.patch, 1183.new1.patch, 1183.patch
>
>
> A couple of reducers were continuously trying to fetch map outputs from a lost tasktracker. Although the tasks running on that lost TT successfully reexecuted elsewhere, the Reducers' tasktrackers didn't correctly note those events.


[jira] Updated: (HADOOP-1183) MapTask completion not recorded properly at the Reducer's end

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1183:
--------------------------------

    Status: Patch Available  (was: Open)

> MapTask completion not recorded properly at the Reducer's end
> -------------------------------------------------------------
>
>                 Key: HADOOP-1183
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1183
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>         Assigned To: Devaraj Das
>            Priority: Critical
>             Fix For: 0.12.3
>
>         Attachments: 1183.new.patch, 1183.new1.patch, 1183.patch
>
>
> A couple of reducers were continuously trying to fetch map outputs from a lost tasktracker. Although the tasks running on that lost TT successfully reexecuted elsewhere, the Reducers' tasktrackers didn't correctly note those events.


[jira] Commented: (HADOOP-1183) MapTask completion not recorded properly at the Reducer's end

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486268 ] 

Devaraj Das commented on HADOOP-1183:
-------------------------------------

I agree with Owen that the fetcher needs some redesigning along those lines. In any case, anyone who sees reduces hanging (with the fetcher continuously trying to fetch output from a failed or lost map) should apply this patch.



[jira] Updated: (HADOOP-1183) MapTask completion not recorded properly at the Reducer's end

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1183:
--------------------------------

    Attachment: 1183.patch

Retries of map output fetches might overwrite new events received from the JT for the same maps. Let's assume that a tasktracker is lost while we are in the process of fetching map outputs from it. There is a race between the moment a map output fetch completes with a failure and the moment a new event for the same map task is obtained. If the new event arrives before the failure is reported, and the fetch corresponding to the new event is not scheduled before the failed fetch completes, then the new event is lost (overwritten by the retry for the old failed fetch).

The attached patch should handle this issue by explicitly handling the FAILED events. Please review it (while I am testing it on a big cluster).
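
To make the race described above concrete, here is a small Java sketch of one way to keep a retry from overwriting a newer event. It is not the code in 1183.patch. The class, its methods, and the `eventSeq` counter (assumed to increase with each event received from the JT) are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the code in 1183.patch. All names are illustrative.
class LatestMapLocations {
  static final class Location {
    final String host;
    final int eventSeq;  // assumed: increases with each event received

    Location(String host, int eventSeq) {
      this.host = host;
      this.eventSeq = eventSeq;
    }
  }

  // Most recent known location for each map id.
  private final Map<Integer, Location> latest = new HashMap<Integer, Location>();

  /** Records a completion event; a newer event replaces an older one. */
  synchronized void onCompletionEvent(int mapId, String host, int eventSeq) {
    Location cur = latest.get(mapId);
    if (cur == null || eventSeq > cur.eventSeq) {
      latest.put(mapId, new Location(host, eventSeq));
    }
  }

  synchronized Location current(int mapId) {
    return latest.get(mapId);
  }

  /**
   * Called when a fetch from the given location fails. Returns the location
   * to try next: the newer one if an event arrived while the fetch was in
   * flight, otherwise the same location again (a plain retry).
   */
  synchronized Location onFetchFailed(int mapId, Location failed) {
    Location cur = latest.get(mapId);
    return (cur != null && cur.eventSeq > failed.eventSeq) ? cur : failed;
  }
}
```

The point of the comparison in `onFetchFailed` is that the failure handler consults the latest event before re-queuing, so a stale location cannot replace a fresher one.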

> MapTask completion not recorded properly at the Reducer's end
> -------------------------------------------------------------
>
>                 Key: HADOOP-1183
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1183
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Devaraj Das
>         Assigned To: Devaraj Das
>            Priority: Critical
>         Attachments: 1183.patch
>
>
> A couple of reducers were continuously trying to fetch map outputs from a lost tasktracker. Although the tasks running on that lost TT successfully reexecuted elsewhere, the Reducers' tasktrackers didn't correctly note those events.


[jira] Resolved: (HADOOP-1183) MapTask completion not recorded properly at the Reducer's end

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das resolved HADOOP-1183.
---------------------------------

    Resolution: Duplicate

The original bug is a duplicate of HADOOP-1270. The comment on redesigning shuffle is now a new issue - HADOOP-1337.



[jira] Updated: (HADOOP-1183) MapTask completion not recorded properly at the Reducer's end

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1183:
--------------------------------

    Attachment: 1183.new.patch

This patch handles failed maps slightly better: it records only those failed maps whose outputs we haven't fetched yet.
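
The bookkeeping described above might look roughly like the following Java sketch. The class and method names are hypothetical and are not taken from 1183.new.patch.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; names are illustrative, not from 1183.new.patch.
class FailedMapTracker {
  private final Set<Integer> fetchedMaps = new HashSet<Integer>();
  private final Set<Integer> failedMaps = new HashSet<Integer>();

  /** Marks a map output as successfully fetched. */
  void markFetched(int mapId) {
    fetchedMaps.add(mapId);
    failedMaps.remove(mapId);
  }

  /**
   * Handles a FAILED event for a map. The failure is recorded only if the
   * map's output has not been fetched yet. Returns true if newly recorded.
   */
  boolean onMapFailed(int mapId) {
    if (fetchedMaps.contains(mapId)) {
      return false;  // output already copied; the failure no longer matters
    }
    return failedMaps.add(mapId);
  }

  boolean isRecordedAsFailed(int mapId) {
    return failedMaps.contains(mapId);
  }
}
```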

> MapTask completion not recorded properly at the Reducer's end
> -------------------------------------------------------------
>
>                 Key: HADOOP-1183
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1183
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Devaraj Das
>         Assigned To: Devaraj Das
>            Priority: Critical
>         Attachments: 1183.new.patch, 1183.patch
>
>
> A couple of reducers were continuously trying to fetch map outputs from a lost tasktracker. Although the tasks running on that lost TT successfully reexecuted elsewhere, the Reducers' tasktrackers didn't correctly note those events.


[jira] Updated: (HADOOP-1183) MapTask completion not recorded properly at the Reducer's end

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1183:
--------------------------------

    Description: A couple of reducers were continuously trying to fetch map outputs from a lost tasktracker. Although the tasks running on that lost TT successfully reexecuted elsewhere, the Reducers' tasktrackers didn't correctly note those events.  (was: A couple of reducers were continuously trying to fetch map outputs from a lost tasktracker. Although the tasks running on that TT successfully reexecuted elsewhere, the tasktrackers didn't correctly note those events.)
        Summary: MapTask completion not recorded properly at the Reducer's end  (was: MapTask completion event lost)

> MapTask completion not recorded properly at the Reducer's end
> -------------------------------------------------------------
>
>                 Key: HADOOP-1183
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1183
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Devaraj Das
>         Assigned To: Devaraj Das
>            Priority: Critical
>
> A couple of reducers were continuously trying to fetch map outputs from a lost tasktracker. Although the tasks running on that lost TT successfully reexecuted elsewhere, the Reducers' tasktrackers didn't correctly note those events.


[jira] Updated: (HADOOP-1183) MapTask completion not recorded properly at the Reducer's end

Posted by "Nigel Daley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nigel Daley updated HADOOP-1183:
--------------------------------

        Fix Version/s: 0.12.3
    Affects Version/s: 0.12.2

> MapTask completion not recorded properly at the Reducer's end
> -------------------------------------------------------------
>
>                 Key: HADOOP-1183
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1183
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>         Assigned To: Devaraj Das
>            Priority: Critical
>             Fix For: 0.12.3
>
>         Attachments: 1183.new.patch, 1183.patch
>
>
> A couple of reducers were continuously trying to fetch map outputs from a lost tasktracker. Although the tasks running on that lost TT successfully reexecuted elsewhere, the Reducers' tasktrackers didn't correctly note those events.


[jira] Updated: (HADOOP-1183) MapTask completion not recorded properly at the Reducer's end

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1183:
--------------------------------

    Attachment: 1183.new1.patch

Indentation changes, as per Owen's comments. Thanks, Owen.

