Posted to common-dev@hadoop.apache.org by "Devaraj Das (JIRA)" <ji...@apache.org> on 2007/03/07 10:22:24 UTC

[jira] Created: (HADOOP-1077) Race condition in fetching map outputs (might lead to hung reduces)

Race condition in fetching map outputs (might lead to hung reduces)
-------------------------------------------------------------------

                 Key: HADOOP-1077
                 URL: https://issues.apache.org/jira/browse/HADOOP-1077
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
            Reporter: Devaraj Das
         Assigned To: Devaraj Das


Sometimes when a map task is lost while the map-output fetch is happening from the TT for that task, and the lost map has successfully executed on some other node, the event for that successful execution is lost at the fetching TT. The fetching TT might eventually fail to fetch the output for the lost task, but then since the event for the new run of the lost map might also have been lost, the fetching TT might hang.

This "hung" problem was discovered while working on HADOOP-1060.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1077) Race condition in fetching map outputs (might lead to hung reduces)

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478722 ] 

Devaraj Das commented on HADOOP-1077:
-------------------------------------

A clarification (the description should read as follows):
Sometimes when a map task is lost while the map-output fetch is happening from the TT for that task, and the lost map has, *in the meantime*, successfully executed on some other node, the event for that successful execution is lost at the fetching TT. The fetching TT might eventually fail to fetch the output for the lost task, but then since the event for the new run of the lost map might also have been lost, the fetching TT might hang.


[jira] Updated: (HADOOP-1077) Race condition in fetching map outputs (might lead to hung reduces)

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1077:
--------------------------------

    Attachment: 1077.patch

This patch does the following:
1) Adds a new hashmap to track the fetches in progress. It takes care of the case where we receive a new map output location for a given mapId while we are still fetching (and will probably fail to fetch) the output for that same mapId from some other location.
2) An entry for a mapId is made in the fetchesInProgress map as soon as we receive a map output location from the JobTracker, provided neededOutputs has the mapId. The entry for the mapId is not deleted until we have successfully copied the output.
3) While the output is being copied, the TT serving it may be lost, and we come to know about it only after a while (after the connect times out, etc.). In the meantime, if another execution of the lost task happened somewhere, we get the event for that and schedule that fetch as well. The output of the first successful copier is considered valid and the other becomes obsolete. In the existing code, this second event would be lost since we depended only on the neededOutputs list (from which entries are removed as soon as the fetches are scheduled). Now an additional check is done to see whether the mapId exists in fetchesInProgress.
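
The three points above can be illustrated with a minimal sketch. This is not the code from 1077.patch: the class FetchTracker and the methods onLocation and onCopySucceeded are hypothetical names chosen for illustration, and only neededOutputs and fetchesInProgress come from the description. It assumes map IDs are plain ints and that the copier threads report success through a callback.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the bookkeeping described above; not the patch itself.
class FetchTracker {
  // Map IDs for which no fetch has been scheduled yet.
  private final Set<Integer> neededOutputs = new HashSet<Integer>();
  // Map IDs with a fetch scheduled but no successful copy yet,
  // together with every host reported for that map ID.
  private final Map<Integer, List<String>> fetchesInProgress =
      new HashMap<Integer, List<String>>();

  public FetchTracker(int numMaps) {
    for (int mapId = 0; mapId < numMaps; mapId++) {
      neededOutputs.add(mapId);
    }
  }

  /** Handles a map output location event; returns true if a fetch is scheduled. */
  public synchronized boolean onLocation(int mapId, String host) {
    // The entry leaves neededOutputs as soon as the first fetch is scheduled.
    boolean needed = neededOutputs.remove(mapId);
    List<String> hosts = fetchesInProgress.get(mapId);
    if (!needed && hosts == null) {
      return false; // output already copied (or unknown map ID): ignore
    }
    if (hosts == null) {
      hosts = new ArrayList<String>();
      fetchesInProgress.put(mapId, hosts);
    }
    // The additional check: a later location for a map ID that is still
    // being fetched is scheduled too instead of being dropped.
    hosts.add(host);
    return true;
  }

  /** Called when a copy finishes; returns false if another copier already won. */
  public synchronized boolean onCopySucceeded(int mapId) {
    // The entry is deleted only once the output has been copied successfully.
    return fetchesInProgress.remove(mapId) != null;
  }

  public static void main(String[] args) {
    FetchTracker tracker = new FetchTracker(1);
    System.out.println(tracker.onLocation(0, "tt-a"));  // first location
    System.out.println(tracker.onLocation(0, "tt-b"));  // re-run of the lost map
    System.out.println(tracker.onCopySucceeded(0));     // first successful copier
    System.out.println(tracker.onCopySucceeded(0));     // obsolete copier
  }
}
```

With only the neededOutputs check, the second onLocation call would return false and the reduce could wait forever on a location that no longer serves the output.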

[jira] Commented: (HADOOP-1077) Race condition in fetching map outputs (might lead to hung reduces)

Posted by "David Bowen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478836 ] 

David Bowen commented on HADOOP-1077:
-------------------------------------

It is nice to see a patch with such good comments!

At the risk of being a coding-style bore, here are a couple of very minor suggestions: (1) long synchronized blocks are a bit hard to read given the two-space indentation style - it may be preferable to break them out into separate methods; (2) some may disagree, but I see no need to write method arguments like "new Integer(loc.getMapId())" when you can now write just "loc.getMapId()" and the compiler will automatically do the conversion.
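
To illustrate suggestion (2), here is a small sketch of the autoboxing conversion available since Java 5. The MapOutputLocation class below is a hypothetical stand-in with only a getMapId() method, and isTracked is an illustrative helper; neither is taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

class AutoboxingExample {
  // Hypothetical stand-in for the location type; only getMapId() matters here.
  static class MapOutputLocation {
    private final int mapId;

    public MapOutputLocation(int mapId) {
      this.mapId = mapId;
    }

    public int getMapId() {
      return mapId;
    }
  }

  public static boolean isTracked(Map<Integer, String> fetches, MapOutputLocation loc) {
    // Explicit form: fetches.containsKey(new Integer(loc.getMapId()))
    // With autoboxing the compiler converts the int to an Integer itself.
    return fetches.containsKey(loc.getMapId());
  }

  public static void main(String[] args) {
    Map<Integer, String> fetches = new HashMap<Integer, String>();
    MapOutputLocation loc = new MapOutputLocation(42);
    fetches.put(loc.getMapId(), "tt-a"); // int key boxed to Integer
    System.out.println(isTracked(fetches, loc));
  }
}
```

One caveat when relying on the conversion: on a List, remove(int) is an overload that removes by index, so the shorthand is only safe where the method takes an Object or Integer parameter, as with Map and Set.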


[jira] Updated: (HADOOP-1077) Race condition in fetching map outputs (might lead to hung reduces)

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1077:
----------------------------------

    Attachment: 1077.2.patch

Attaching Devaraj's patch since he is asleep by now... :)

[jira] Updated: (HADOOP-1077) Race condition in fetching map outputs (might lead to hung reduces)

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1077:
---------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Devaraj!

[jira] Updated: (HADOOP-1077) Race condition in fetching map outputs (might lead to hung reduces)

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1077:
----------------------------------

    Status: Patch Available  (was: Open)

[jira] Updated: (HADOOP-1077) Race condition in fetching map outputs (might lead to hung reduces)

Posted by "Nigel Daley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nigel Daley updated HADOOP-1077:
--------------------------------

    Fix Version/s: 0.12.1
         Priority: Blocker  (was: Major)
