Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2012/12/03 22:09:59 UTC

[jira] [Created] (MAPREDUCE-4842) Shuffle race can hang reducer

Jason Lowe created MAPREDUCE-4842:
-------------------------------------

             Summary: Shuffle race can hang reducer
                 Key: MAPREDUCE-4842
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2
    Affects Versions: 0.23.5, 2.0.3-alpha
            Reporter: Jason Lowe


Saw an instance where the shuffle caused multiple reducers in a job to hang.  It looked similar to the problem described in MAPREDUCE-3721, where the fetchers were all being told to WAIT by the MergeManager but no merge was taking place.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-4842) Shuffle race can hang reducer

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-4842:
-------------------------------------

    Priority: Blocker  (was: Major)
    
> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.3-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Priority: Blocker
>         Attachments: MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.  It looked similar to the problem described in MAPREDUCE-3721, where the fetchers were all being told to WAIT by the MergeManager but no merge was taking place.


[jira] [Commented] (MAPREDUCE-4842) Shuffle race can hang reducer

Posted by "Mariappan Asokan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13511039#comment-13511039 ] 

Mariappan Asokan commented on MAPREDUCE-4842:
---------------------------------------------

Hi Jason, Arun, and Alejandro,
  I came up with a simpler solution to this nasty problem.  Instead of a single list {{inputs}} in {{MergeThread}}, we can keep a FIFO list of these lists.  This will make sure that more than one merge can be pending.  The {{run()}} method in {{MergeThread}} will keep pulling map output lists from the FIFO list and merging them (this is a typical producer-consumer scenario).

I will outline the changes below:

In {{MergeThread}},

* A member of type {{LinkedList<List<T>>}} ({{pendingToBeMerged}}) is added and the member {{inputs}} is removed.

* The {{isInProgress()}} method is removed.

* The {{startMerge()}} method will no longer be {{synchronized}}.  It will add the passed list to the tail of {{pendingToBeMerged}} and call {{notifyAll()}} on the monitor of {{pendingToBeMerged}}.

* The {{run()}} method will sit in a tight loop.  As long as there is an item (a list of map outputs) to be consumed, it will consume (merge) the item and remove it from {{pendingToBeMerged}}.  If {{pendingToBeMerged}} has no more items, it will set {{inProgress}} to {{false}} and then call {{notifyAll()}} on the object's monitor.

In {{MergeManager}},

* All calls to {{isInProgress()}} are removed.

* Unnecessary {{synchronized}} blocks on merge thread objects are removed, since the methods that contain them are themselves {{synchronized}}.
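
The {{MergeThread}} changes outlined above could be sketched roughly as follows.  This is only an illustration of the producer-consumer idea, not the actual patch: the class name {{QueueingMergeThread}}, the {{closed}} flag, and the {{close()}} method are assumptions added so the sketch can shut down cleanly, the loop blocks in {{wait()}} when the queue is empty, and the {{inProgress}} bookkeeping is left out.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical, simplified stand-in for MergeThread; not the Hadoop class.
abstract class QueueingMergeThread<T> extends Thread {
  // FIFO of pending merge inputs, guarded by its own monitor.
  private final LinkedList<List<T>> pendingToBeMerged = new LinkedList<List<T>>();
  // Assumption: added so that this sketch can be shut down.
  private boolean closed = false;

  // Producer side: queue one batch of map outputs and wake the merge thread.
  public void startMerge(List<T> inputs) {
    synchronized (pendingToBeMerged) {
      pendingToBeMerged.addLast(new ArrayList<T>(inputs));
      pendingToBeMerged.notifyAll();
    }
  }

  // Stop accepting work, let the thread drain the queue, and wait for it to exit.
  public void close() {
    synchronized (pendingToBeMerged) {
      closed = true;
      pendingToBeMerged.notifyAll();
    }
    try {
      join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  // Consumer side: take batches in FIFO order and merge each one.
  @Override
  public void run() {
    while (true) {
      List<T> inputs;
      synchronized (pendingToBeMerged) {
        while (pendingToBeMerged.isEmpty() && !closed) {
          try {
            pendingToBeMerged.wait();
          } catch (InterruptedException e) {
            return;
          }
        }
        if (pendingToBeMerged.isEmpty()) {
          return; // closed and fully drained
        }
        inputs = pendingToBeMerged.removeFirst();
      }
      merge(inputs); // run the merge outside the queue lock
    }
  }

  public abstract void merge(List<T> inputs);
}
```

Because each call to {{startMerge()}} only appends to the queue, a second batch can be handed over while the first merge is still running, which is the property the proposal relies on.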

I created a patch with the above changes and tested it on my laptop.  The mapreduce tests seem to run without any problem.  However, I do not claim that it is completely tested.  It has to go through the rigorous testing that Jason did.

If you are interested in taking a look at the patch, I will post it to this Jira.  I welcome your questions and suggestions on the idea of the patch.

-- Asokan

                
> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>         Attachments: MAPREDUCE-4842.patch, MAPREDUCE-4842.patch, MAPREDUCE-4842.patch, MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.  It looked similar to the problem described in MAPREDUCE-3721, where the fetchers were all being told to WAIT by the MergeManager but no merge was taking place.


[jira] [Updated] (MAPREDUCE-4842) Shuffle race can hang reducer

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4842:
----------------------------------

            Assignee: Arun C Murthy
    Target Version/s: 2.0.3-alpha, 0.23.6
              Status: Patch Available  (was: Open)
    
> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.5, 2.0.2-alpha
>            Reporter: Jason Lowe
>            Assignee: Arun C Murthy
>            Priority: Blocker
>         Attachments: MAPREDUCE-4842.patch, MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.  It looked similar to the problem described in MAPREDUCE-3721, where the fetchers were all being told to WAIT by the MergeManager but no merge was taking place.


[jira] [Updated] (MAPREDUCE-4842) Shuffle race can hang reducer

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-4842:
-------------------------------------

    Status: Open  (was: Patch Available)
    
> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.5, 2.0.2-alpha
>            Reporter: Jason Lowe
>            Assignee: Arun C Murthy
>            Priority: Blocker
>         Attachments: MAPREDUCE-4842.patch, MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.  It looked similar to the problem described in MAPREDUCE-3721, where the fetchers were all being told to WAIT by the MergeManager but no merge was taking place.


[jira] [Updated] (MAPREDUCE-4842) Shuffle race can hang reducer

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-4842:
-------------------------------------

    Attachment: MAPREDUCE-4842.patch

Great catch Jason! Thanks!

It seems like we are missing a hook in MergeThread.run to re-check the condition and trigger another merge at the end of the merge itself.

Here is an illustrative patch.

Thoughts?
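
The idea of re-checking at the end of a merge can be sketched on a toy model.  This is not the attached patch: the class {{MergeTriggerModel}}, the field {{mergeThreshold}}, and the helper {{startMergeIfNeeded()}} are hypothetical, and the "merge" is only a flag and a counter.

```java
// Hypothetical, simplified model of the merge trigger; not the Hadoop MergeManager.
class MergeTriggerModel {
  private final long mergeThreshold; // assumed name for the commit threshold
  private long commitMemory = 0;
  private boolean inProgress = false;
  private int mergesStarted = 0;

  MergeTriggerModel(long mergeThreshold) {
    this.mergeThreshold = mergeThreshold;
  }

  // A fetch finished: account for its bytes and start a merge if possible.
  synchronized void closeInMemoryFile(long size) {
    commitMemory += size;
    startMergeIfNeeded();
  }

  // Called by the merge thread when a merge ends.
  void mergeFinished() {
    synchronized (this) {
      inProgress = false;
    }
    // The hook: re-check the condition here instead of waiting for another fetch.
    checkAndRestartMerge();
  }

  synchronized void checkAndRestartMerge() {
    startMergeIfNeeded();
  }

  // Only called while holding this object's lock.
  private void startMergeIfNeeded() {
    if (!inProgress && commitMemory >= mergeThreshold) {
      inProgress = true;
      commitMemory = 0; // everything committed so far goes into this merge
      mergesStarted++;
    }
  }

  synchronized int getMergesStarted() {
    return mergesStarted;
  }

  synchronized boolean isInProgress() {
    return inProgress;
  }
}
```

In this model, data committed while a merge is running is picked up by {{mergeFinished()}} even if {{closeInMemoryFile()}} is never called again.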
                
> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.3-alpha, 0.23.5
>            Reporter: Jason Lowe
>         Attachments: MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.  It looked similar to the problem described in MAPREDUCE-3721, where the fetchers were all being told to WAIT by the MergeManager but no merge was taking place.


[jira] [Commented] (MAPREDUCE-4842) Shuffle race can hang reducer

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510664#comment-13510664 ] 

Alejandro Abdelnur commented on MAPREDUCE-4842:
-----------------------------------------------

One minor nit: the scope of the exceptionReporter instance variable has been changed from private to protected for testing purposes.  It should be package-private instead.  Preferably, we should add a package-private getter method instead (it could be annotated with Guava's @VisibleForTesting annotation).  Other than that, it looks good to me.
                
> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>         Attachments: MAPREDUCE-4842.patch, MAPREDUCE-4842.patch, MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.  It looked similar to the problem described in MAPREDUCE-3721, where the fetchers were all being told to WAIT by the MergeManager but no merge was taking place.


[jira] [Commented] (MAPREDUCE-4842) Shuffle race can hang reducer

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509066#comment-13509066 ] 

Jason Lowe commented on MAPREDUCE-4842:
---------------------------------------

Here's the sequence of events that I believe led to the hang during shuffle.  See {{MergeManager}} for context of variable references.

# Fetchers started fetching data
# Enough data finishes transferring to reach the {{commitMemory}} threshold and an in-memory merge starts
# While the merge takes place some of the output data is freed before the merge completes, lowering {{commitMemory}} and {{usedMemory}} which allows more data to be fetched
# Eventually we fetch so much data that {{usedMemory}} exceeds {{memoryLimit}}, and further fetchers are told to WAIT
# All of the outstanding fetches complete and call {{closeInMemoryFile}}, but we don't start a merge because the previous merge is still marked in progress
# Merge completes, allowing a new merge to be started on the next {{closeInMemoryFile}} call
# With no outstanding fetches and no new fetches allowed, we never call {{closeInMemoryFile}} again and never start the next merge
# With no merge in progress and therefore nothing to wait upon, fetcher threads proceed to pummel the {{MergeManager}} asking for merge data reservations that are never given, and the reducer log grows rather rapidly
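
The sequence above can be replayed on a small single-threaded model of the accounting.  This is a sketch under assumptions, not Hadoop code: the class {{ShuffleHangModel}}, the names {{mergeThreshold}}, {{reserve()}}, {{freeMergedOutput()}}, {{finishMerge()}}, {{isHung()}} and {{replaySteps()}}, and all of the sizes are made up for illustration.

```java
// Hypothetical single-threaded model of the accounting in the steps above; not Hadoop code.
class ShuffleHangModel {
  private final long memoryLimit;
  private final long mergeThreshold; // assumed name for the commitMemory threshold
  private long usedMemory = 0;
  private long commitMemory = 0;
  private boolean mergeInProgress = false;

  ShuffleHangModel(long memoryLimit, long mergeThreshold) {
    this.memoryLimit = memoryLimit;
    this.mergeThreshold = mergeThreshold;
  }

  // A fetcher asks for memory; false means it is told to WAIT.
  boolean reserve(long size) {
    if (usedMemory > memoryLimit) {
      return false;
    }
    usedMemory += size;
    return true;
  }

  // A fetch completed; a merge starts only if none is marked in progress.
  void closeInMemoryFile(long size) {
    commitMemory += size;
    if (!mergeInProgress && commitMemory >= mergeThreshold) {
      mergeInProgress = true;
    }
  }

  // The running merge frees merged output before it completes (step 3).
  void freeMergedOutput(long size) {
    usedMemory -= size;
    commitMemory -= size;
  }

  // The merge completes (step 6); nothing re-checks the threshold here.
  void finishMerge() {
    mergeInProgress = false;
  }

  // Fetchers are told to WAIT, yet no merge is running to free memory.
  boolean isHung() {
    return usedMemory > memoryLimit && !mergeInProgress;
  }

  // Replays steps 1-8 with made-up sizes and returns the final state.
  static ShuffleHangModel replaySteps() {
    ShuffleHangModel m = new ShuffleHangModel(100, 60);
    m.reserve(30);             // step 1: two fetches start
    m.reserve(30);
    m.closeInMemoryFile(30);   // step 2: threshold reached, merge starts
    m.closeInMemoryFile(30);
    m.freeMergedOutput(60);    // step 3: memory freed, merge still in progress
    m.reserve(55);             // step 4: more fetches are admitted...
    m.reserve(55);             //         ...until usedMemory exceeds the limit
    m.closeInMemoryFile(55);   // step 5: fetches finish, merge still in progress
    m.closeInMemoryFile(55);
    m.finishMerge();           // step 6: a merge could start, but nobody asks
    return m;                  // steps 7-8: every further reserve() is refused
  }
}
```

After {{replaySteps()}}, {{commitMemory}} is well above the threshold, but the only code path that would start a merge is {{closeInMemoryFile()}}, and no fetch is left to call it.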
                
> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.3-alpha, 0.23.5
>            Reporter: Jason Lowe
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.  It looked similar to the problem described in MAPREDUCE-3721, where the fetchers were all being told to WAIT by the MergeManager but no merge was taking place.


[jira] [Updated] (MAPREDUCE-4842) Shuffle race can hang reducer

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-4842:
-------------------------------------

    Affects Version/s:     (was: 2.0.3-alpha)
                       2.0.2-alpha
    
> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Priority: Blocker
>         Attachments: MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.  It looked similar to the problem described in MAPREDUCE-3721, where the fetchers were all being told to WAIT by the MergeManager but no merge was taking place.


[jira] [Assigned] (MAPREDUCE-4842) Shuffle race can hang reducer

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy reassigned MAPREDUCE-4842:
----------------------------------------

    Assignee: Jason Lowe  (was: Arun C Murthy)
    
> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>         Attachments: MAPREDUCE-4842.patch, MAPREDUCE-4842.patch, MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.  It looked similar to the problem described in MAPREDUCE-3721, where the fetchers were all being told to WAIT by the MergeManager but no merge was taking place.


[jira] [Updated] (MAPREDUCE-4842) Shuffle race can hang reducer

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4842:
----------------------------------

    Attachment: MAPREDUCE-4842.patch

Thanks for the reviews, Alejandro and Arun.  I updated the patch to address Alejandro's comment and also added a comment clarifying why the merge callback occurs outside of the lock and after inProgress is cleared per a side discussion with Arun.
                
> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>         Attachments: MAPREDUCE-4842.patch, MAPREDUCE-4842.patch, MAPREDUCE-4842.patch, MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.  It looked similar to the problem described in MAPREDUCE-3721, where the fetchers were all being told to WAIT by the MergeManager but no merge was taking place.


[jira] [Commented] (MAPREDUCE-4842) Shuffle race can hang reducer

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510027#comment-13510027 ] 

Jason Lowe commented on MAPREDUCE-4842:
---------------------------------------

I think this approach will work.  One nit is we may want to rename checkAndRestartMerge() to something like onSuccessfulMerge() since that's a more general concept and accurately reflects when the method will be called.
                
> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Priority: Blocker
>         Attachments: MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.  It looked similar to the problem described in MAPREDUCE-3721, where the fetchers were all being told to WAIT by the MergeManager but no merge was taking place.


[jira] [Updated] (MAPREDUCE-4842) Shuffle race can hang reducer

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-4842:
-------------------------------------

    Attachment: MAPREDUCE-4842.patch

Jason, nice unit test! Thanks!

I've modified it a little to have 2 barriers (mergeStart and mergeComplete) rather than reusing the same barrier 4 times (that confused me a lot when I was reviewing it).

Other than that, it looks great. +1

Also, if you don't mind, I'll assign the jira to you - since you've done all the heavy lifting and deserve way more credit than I do. Thanks again!
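
For readers without the patch at hand, the two-barrier pattern might look roughly like the sketch below.  It is not the actual unit test: it assumes the barriers are java.util.concurrent.CyclicBarrier instances with two parties each, and the class TwoBarrierSketch, the method runTwoMerges(), and the counter standing in for merge work are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a test that steps a merge thread with two barriers.
class TwoBarrierSketch {
  static List<Integer> runTwoMerges() {
    // Two parties each: the test thread and the merge thread.
    final CyclicBarrier mergeStart = new CyclicBarrier(2);
    final CyclicBarrier mergeComplete = new CyclicBarrier(2);
    final AtomicInteger completedMerges = new AtomicInteger();

    Thread merger = new Thread(new Runnable() {
      @Override
      public void run() {
        try {
          for (int i = 0; i < 2; i++) {
            mergeStart.await();                // tell the test a merge has started
            completedMerges.incrementAndGet(); // stand-in for the merge work
            mergeComplete.await();             // wait until the test lets it finish
          }
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      }
    });
    merger.start();

    List<Integer> observed = new ArrayList<Integer>();
    try {
      for (int i = 0; i < 2; i++) {
        mergeStart.await();
        // A real test would set up the state it wants to race with here,
        // while the merge is known to be in progress.
        mergeComplete.await();
        observed.add(completedMerges.get());
      }
      merger.join();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
    return observed;
  }
}
```

Giving each phase its own named barrier makes it clear at every await() which point of the merge the test is synchronizing with.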
                
> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Arun C Murthy
>            Priority: Blocker
>         Attachments: MAPREDUCE-4842.patch, MAPREDUCE-4842.patch, MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.  It looked similar to the problem described in MAPREDUCE-3721, where the fetchers were all being told to WAIT by the MergeManager but no merge was taking place.


[jira] [Updated] (MAPREDUCE-4842) Shuffle race can hang reducer

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4842:
----------------------------------

    Attachment: MAPREDUCE-4842.patch

Updated the patch to add a test case and rename checkAndRestartMerge to onSuccessfulMerge.
                
> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Priority: Blocker
>         Attachments: MAPREDUCE-4842.patch, MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.  It looked similar to the problem described in MAPREDUCE-3721, where the fetchers were all being told to WAIT by the MergeManager but no merge was taking place.
