Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2012/12/03 22:09:59 UTC
[jira] [Created] (MAPREDUCE-4842) Shuffle race can hang reducer
Jason Lowe created MAPREDUCE-4842:
-------------------------------------
Summary: Shuffle race can hang reducer
Key: MAPREDUCE-4842
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mrv2
Affects Versions: 0.23.5, 2.0.3-alpha
Reporter: Jason Lowe
Saw an instance where the shuffle caused multiple reducers in a job to hang. It looked similar to the problem described in MAPREDUCE-3721, where the fetchers were all being told to WAIT by the MergeManager but no merge was taking place.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4842) Shuffle race can hang reducer
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated MAPREDUCE-4842:
-------------------------------------
Priority: Blocker (was: Major)
[jira] [Commented] (MAPREDUCE-4842) Shuffle race can hang reducer
Posted by "Mariappan Asokan (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13511039#comment-13511039 ]
Mariappan Asokan commented on MAPREDUCE-4842:
---------------------------------------------
Hi Jason, Arun, and Alejandro,
I came up with a simpler solution to this nasty problem. Instead of a single list {{inputs}} in {{MergeThread}}, we can keep a FIFO list of these lists. This ensures that more than one merge can be pending. The {{run()}} method in {{MergeThread}} will keep pulling map output lists from the FIFO list and merging them (a typical producer-consumer scenario).
I will outline the changes below:
In {{MergeThread}},
* A member of type {{LinkedList<List<T>>}} ({{pendingToBeMerged}}) is added and the member {{inputs}} is removed.
* The {{isInProgress()}} method is removed.
* The {{startMerge()}} method will no longer be {{synchronized}}. It will add the passed list to the tail of {{pendingToBeMerged}} and call {{notifyAll()}} on the monitor of {{pendingToBeMerged}}.
* The {{run()}} method will sit in a tight loop. So long as there is an item (a list of map outputs) to be consumed, it will consume (merge) the item and remove it from {{pendingToBeMerged}}. If {{pendingToBeMerged}} has no more items, it will set {{inProgress}} to {{false}} and then call {{notifyAll()}} on the object's monitor.
In {{MergeManager}},
* All calls to {{isInProgress()}} are removed.
* Unnecessary {{synchronized}} blocks on merge thread objects are removed, since the methods that contain them are themselves {{synchronized}}.
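The producer-consumer idea outlined above can be sketched roughly as follows. This is not the attached patch: the class name {{QueueingMergeThread}}, the abstract {{merge()}} hook, the {{waitForMerge()}} helper and the demo class are all hypothetical. One further assumption is that {{startMerge()}} briefly takes the thread's own monitor so that {{inProgress}} stays consistent with the queue; the proposal itself only says the method is no longer declared {{synchronized}}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

/** Sketch only: a merge thread that consumes a FIFO queue of input lists. */
abstract class QueueingMergeThread<T> extends Thread {
    // FIFO of pending merges; takes the place of the single "inputs" list.
    private final LinkedList<List<T>> pendingToBeMerged = new LinkedList<>();
    // True while any merge is queued or running; guarded by "this".
    private boolean inProgress = false;

    /** Hypothetical hook that performs one merge. */
    protected abstract void merge(List<T> inputs);

    /** Producer: append one list of map outputs and wake the consumer. */
    void startMerge(List<T> inputs) {
        synchronized (this) {
            inProgress = true;
            synchronized (pendingToBeMerged) {
                pendingToBeMerged.addLast(new ArrayList<>(inputs));
                pendingToBeMerged.notifyAll();
            }
        }
    }

    /** Blocks the caller until the queue has been drained. */
    synchronized void waitForMerge() throws InterruptedException {
        while (inProgress) {
            wait();
        }
    }

    /** Consumer: merge queued lists in arrival order until interrupted. */
    @Override
    public void run() {
        try {
            while (true) {
                List<T> inputs;
                synchronized (pendingToBeMerged) {
                    while (pendingToBeMerged.isEmpty()) {
                        pendingToBeMerged.wait();
                    }
                    inputs = pendingToBeMerged.removeFirst();
                }
                merge(inputs);
                // Same lock order as startMerge(): "this" first, then the queue.
                synchronized (this) {
                    synchronized (pendingToBeMerged) {
                        if (pendingToBeMerged.isEmpty()) {
                            inProgress = false;
                            notifyAll();
                        }
                    }
                }
            }
        } catch (InterruptedException e) {
            // Interrupted: treat as a shutdown request and leave the loop.
        }
    }
}

class QueueingMergeDemo {
    static List<List<Integer>> runDemo() throws InterruptedException {
        final List<List<Integer>> merged =
            Collections.synchronizedList(new ArrayList<List<Integer>>());
        QueueingMergeThread<Integer> thread = new QueueingMergeThread<Integer>() {
            @Override
            protected void merge(List<Integer> inputs) {
                merged.add(inputs);
            }
        };
        thread.setDaemon(true);
        thread.start();
        thread.startMerge(Arrays.asList(1, 2)); // first merge is queued
        thread.startMerge(Arrays.asList(3));    // second may queue behind the first
        thread.waitForMerge();
        thread.interrupt();
        return merged;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runDemo());
    }
}
```

Because a second {{startMerge()}} call is queued rather than dropped, a merge request that arrives while another merge is running is not lost, which is the property the proposal is after.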
I created a patch with the above changes and tested it on my laptop. The mapreduce tests seem to run without any problem. However, I do not claim that it is completely tested. It has to go through the rigorous testing that Jason did.
If you are interested in taking a look at the patch, I will post it to this Jira. I welcome your questions and suggestions on the idea of the patch.
-- Asokan
[jira] [Updated] (MAPREDUCE-4842) Shuffle race can hang reducer
Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Lowe updated MAPREDUCE-4842:
----------------------------------
Assignee: Arun C Murthy
Target Version/s: 2.0.3-alpha, 0.23.6
Status: Patch Available (was: Open)
[jira] [Updated] (MAPREDUCE-4842) Shuffle race can hang reducer
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated MAPREDUCE-4842:
-------------------------------------
Status: Open (was: Patch Available)
[jira] [Updated] (MAPREDUCE-4842) Shuffle race can hang reducer
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated MAPREDUCE-4842:
-------------------------------------
Attachment: MAPREDUCE-4842.patch
Great catch Jason! Thanks!
It seems like we are missing a hook in MergeThread.run to re-check the condition and trigger another merge at the end of the merge itself.
Here is an illustrative patch.
Thoughts?
[jira] [Commented] (MAPREDUCE-4842) Shuffle race can hang reducer
Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510664#comment-13510664 ]
Alejandro Abdelnur commented on MAPREDUCE-4842:
-----------------------------------------------
One minor nit: the scope of the exceptionReporter instance variable has been changed from private to protected for testing purposes. It should be package-private instead. Preferably, we should add a package-private getter method instead (it could be annotated with Guava's @VisibleForTesting annotation). Other than that, looks good to me.
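A minimal sketch of that suggestion, under stated assumptions: {{MergeManagerSketch}} and {{Reporter}} are hypothetical stand-ins for the real classes, and a local marker annotation takes the place of Guava's {{com.google.common.annotations.VisibleForTesting}} so the example compiles without Guava on the classpath.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/** Local stand-in for Guava's @VisibleForTesting; documentation only. */
@Retention(RetentionPolicy.SOURCE)
@Target({ElementType.METHOD, ElementType.FIELD})
@interface VisibleForTesting {}

/** Hypothetical reporter type used only by this sketch. */
interface Reporter {
    void reportException(Throwable t);
}

class MergeManagerSketch {
    // The field stays private; tests go through the getter below.
    private final Reporter exceptionReporter;

    MergeManagerSketch(Reporter exceptionReporter) {
        this.exceptionReporter = exceptionReporter;
    }

    // No access modifier: package-private, so only same-package code
    // (such as a unit test) can call it.
    @VisibleForTesting
    Reporter getExceptionReporter() {
        return exceptionReporter;
    }
}

class GetterDemo {
    /** Returns true if the getter hands back the reporter passed in. */
    static boolean sameReporter() {
        Reporter reporter = new Reporter() {
            @Override
            public void reportException(Throwable t) {
                // Ignored in this sketch.
            }
        };
        return new MergeManagerSketch(reporter).getExceptionReporter() == reporter;
    }

    public static void main(String[] args) {
        System.out.println(sameReporter());
    }
}
```

Compared with a protected field, the getter keeps the field itself encapsulated and does not widen access to subclasses in other packages.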
[jira] [Commented] (MAPREDUCE-4842) Shuffle race can hang reducer
Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509066#comment-13509066 ]
Jason Lowe commented on MAPREDUCE-4842:
---------------------------------------
Here's the sequence of events that I believe led to the hang during shuffle. See {{MergeManager}} for context of variable references.
# Fetchers started fetching data
# Enough data finishes transferring to reach the {{commitMemory}} threshold and an in-memory merge starts
# While the merge takes place, some of the output data is freed before the merge completes, lowering {{commitMemory}} and {{usedMemory}}, which allows more data to be fetched
# Eventually we try to fetch too much data because {{usedMemory}} exceeds {{memoryLimit}} and further fetchers are told to WAIT
# All of the outstanding fetches complete and call {{closeInMemoryFile}}, but we don't start a merge because the previous merge is still marked in progress
# Merge completes, allowing a new merge to be started on the next {{closeInMemoryFile}} call
# With no outstanding fetches and no new fetches allowed, we never call {{closeInMemoryFile}} again and never start the next merge
# With no merge in progress and therefore nothing to wait upon, fetcher threads proceed to pummel the {{MergeManager}} asking for merge data reservations that are never given, and the reducer log grows rather rapidly
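The sequence above can be replayed in a toy, single-threaded model. This is only a sketch under assumptions, not the real {{MergeManager}}: the class {{ShuffleHangModel}}, its method bodies, and the sizes (a limit of 100 and a merge threshold of 60) are hypothetical and chosen to keep the arithmetic easy to follow. Only the names {{usedMemory}}, {{commitMemory}}, {{memoryLimit}} and {{closeInMemoryFile}} come from the description.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy, single-threaded model of the sequence; not the real MergeManager. */
class ShuffleHangModel {
    // Hypothetical sizes, chosen only to make the steps easy to follow.
    static final long MEMORY_LIMIT = 100;
    static final long MERGE_THRESHOLD = 60;

    long usedMemory = 0;      // memory reserved by fetches
    long commitMemory = 0;    // memory of completed fetches awaiting merge
    boolean mergeInProgress = false;
    final List<String> log = new ArrayList<>();

    /** Returns true if the fetch may proceed, false if the fetcher must WAIT. */
    boolean reserve(long size) {
        if (usedMemory > MEMORY_LIMIT) {
            log.add("WAIT");
            return false;
        }
        usedMemory += size;
        return true;
    }

    /** Called when a fetch completes; the only place a merge is started. */
    void closeInMemoryFile(long size) {
        commitMemory += size;
        if (!mergeInProgress && commitMemory >= MERGE_THRESHOLD) {
            mergeInProgress = true;
            log.add("merge started");
        }
    }

    /** The running merge releases some of its input. */
    void freeMergedData(long size) {
        usedMemory -= size;
        commitMemory -= size;
    }

    /** The merge ends, but nothing re-checks the merge threshold. */
    void mergeFinished() {
        mergeInProgress = false;
        log.add("merge finished");
    }

    /** Stuck: fetchers must wait, data is ready to merge, yet no merge runs. */
    boolean isHung() {
        return !mergeInProgress
            && usedMemory > MEMORY_LIMIT
            && commitMemory >= MERGE_THRESHOLD;
    }

    static ShuffleHangModel replay() {
        ShuffleHangModel m = new ShuffleHangModel();
        m.reserve(30);                 // steps 1-2: two fetches run ...
        m.reserve(30);
        m.closeInMemoryFile(30);
        m.closeInMemoryFile(30);       // ... and the first merge starts
        m.freeMergedData(40);          // step 3: memory freed mid-merge
        m.reserve(50);                 // more data is fetched (used = 70)
        m.reserve(60);                 // still allowed (used = 130)
        m.reserve(10);                 // step 4: over the limit, WAIT
        m.closeInMemoryFile(50);       // step 5: no merge, one is in progress
        m.closeInMemoryFile(60);
        m.freeMergedData(20);          // step 6: first merge completes
        m.mergeFinished();
        m.reserve(10);                 // steps 7-8: WAIT, and nothing will merge
        return m;
    }

    public static void main(String[] args) {
        ShuffleHangModel m = replay();
        System.out.println(m.log + " hung=" + m.isHung());
    }
}
```

In the final state no fetch is outstanding, so nothing calls {{closeInMemoryFile}} again, and every further {{reserve}} call is refused.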
[jira] [Updated] (MAPREDUCE-4842) Shuffle race can hang reducer
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated MAPREDUCE-4842:
-------------------------------------
Affects Version/s: 2.0.2-alpha (was: 2.0.3-alpha)
[jira] [Assigned] (MAPREDUCE-4842) Shuffle race can hang reducer
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy reassigned MAPREDUCE-4842:
----------------------------------------
Assignee: Jason Lowe (was: Arun C Murthy)
[jira] [Updated] (MAPREDUCE-4842) Shuffle race can hang reducer
Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Lowe updated MAPREDUCE-4842:
----------------------------------
Attachment: MAPREDUCE-4842.patch
Thanks for the reviews, Alejandro and Arun. I updated the patch to address Alejandro's comment and also added a comment clarifying why the merge callback occurs outside of the lock and after inProgress is cleared per a side discussion with Arun.
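The ordering mentioned here (clear inProgress, release the lock, then invoke the merge callback) might look roughly like the sketch below. It is an illustration under assumptions rather than the attached patch: {{CallbackMergeThread}}, the abstract {{merge()}} method and the demo are hypothetical, and only the names inProgress and onSuccessfulMerge come from the thread. One plausible reading of the ordering, assumed here and not stated in the thread, is that it lets the callback call {{startMerge()}} again without being rejected and without holding the thread's lock while calling out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch only: a merge thread with a completion hook. */
abstract class CallbackMergeThread<T> extends Thread {
    private List<T> inputs = null;       // guarded by "this"
    private boolean inProgress = false;  // guarded by "this"

    protected abstract void merge(List<T> inputs);

    /** Hook invoked after each completed merge. */
    protected abstract void onSuccessfulMerge();

    /** Starts a merge unless one is already marked in progress. */
    synchronized void startMerge(List<T> newInputs) {
        if (!inProgress) {
            inputs = newInputs;
            inProgress = true;
            notifyAll();
        }
    }

    @Override
    public void run() {
        while (true) {
            List<T> current;
            try {
                synchronized (this) {
                    while (!inProgress) {
                        wait();
                    }
                    current = inputs;
                }
            } catch (InterruptedException e) {
                return; // shutdown request
            }
            merge(current);
            synchronized (this) {
                inputs = null;
                inProgress = false;
                notifyAll();
            }
            // Called with no lock held and with inProgress already cleared,
            // so the hook can re-check conditions and call startMerge().
            onSuccessfulMerge();
        }
    }
}

class CallbackMergeDemo {
    /** Returns the number of merges that ran, or -1 on timeout. */
    static int runDemo() throws InterruptedException {
        final AtomicInteger merges = new AtomicInteger();
        final CountDownLatch done = new CountDownLatch(2);
        final AtomicBoolean moreDataWaiting = new AtomicBoolean(true);
        CallbackMergeThread<String> thread = new CallbackMergeThread<String>() {
            @Override
            protected void merge(List<String> inputs) {
                merges.incrementAndGet();
                done.countDown();
            }

            @Override
            protected void onSuccessfulMerge() {
                // Re-check: data committed during the merge triggers another one.
                if (moreDataWaiting.getAndSet(false)) {
                    startMerge(Arrays.asList("c"));
                }
            }
        };
        thread.setDaemon(true);
        thread.start();
        thread.startMerge(Arrays.asList("a", "b"));
        boolean finished = done.await(5, TimeUnit.SECONDS);
        thread.interrupt();
        return finished ? merges.get() : -1;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runDemo());
    }
}
```

In the demo the second merge is started only by the hook, with no further external call, which mirrors the case where no fetch is left to trigger it.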
[jira] [Commented] (MAPREDUCE-4842) Shuffle race can hang reducer
Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510027#comment-13510027 ]
Jason Lowe commented on MAPREDUCE-4842:
---------------------------------------
I think this approach will work. One nit is we may want to rename checkAndRestartMerge() to something like onSuccessfulMerge() since that's a more general concept and accurately reflects when the method will be called.
[jira] [Updated] (MAPREDUCE-4842) Shuffle race can hang reducer
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated MAPREDUCE-4842:
-------------------------------------
Attachment: MAPREDUCE-4842.patch
Jason, nice unit test! Thanks!
I've modified it a little to use 2 barriers (mergeStart and mergeComplete) rather than reuse the same barrier 4 times (which confused me a lot when I was reviewing it).
Other than that, it looks great. +1
Also, if you don't mind, I'll assign the jira to you - since you've done all the heavy lifting and deserve way more credit than I do. Thanks again!
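For readers unfamiliar with the two-barrier pattern, here is a small sketch of how separate mergeStart and mergeComplete barriers can pin a test to known points in a merge. The actual unit test is not shown in this thread, so everything except the two barrier names is hypothetical, including the use of {{java.util.concurrent.CyclicBarrier}}, the two-merge loop and the event strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;

class TwoBarrierDemo {
    static List<String> runDemo() throws Exception {
        // One barrier per synchronization point; each has two parties:
        // the simulated merge thread and the test thread.
        final CyclicBarrier mergeStart = new CyclicBarrier(2);
        final CyclicBarrier mergeComplete = new CyclicBarrier(2);
        final List<String> events =
            Collections.synchronizedList(new ArrayList<String>());

        Thread merger = new Thread(() -> {
            try {
                for (int i = 1; i <= 2; i++) {
                    events.add("merge " + i + " started");
                    mergeStart.await(5, TimeUnit.SECONDS);    // tell the test we began
                    mergeComplete.await(5, TimeUnit.SECONDS); // wait for permission to end
                    events.add("merge " + i + " finished");
                }
            } catch (Exception e) {
                events.add("error: " + e);
            }
        });
        merger.setDaemon(true);
        merger.start();

        for (int i = 1; i <= 2; i++) {
            mergeStart.await(5, TimeUnit.SECONDS);
            // The merge is now held between the two barriers, so the test
            // can act while a merge is known to be in progress.
            events.add("test acts during merge " + i);
            mergeComplete.await(5, TimeUnit.SECONDS);
        }
        merger.join(5000);
        return events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : runDemo()) {
            System.out.println(event);
        }
    }
}
```

Naming each barrier after the point it guards makes every {{await()}} call self-describing, which is the readability gain over reusing one barrier for all four rendezvous points.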
[jira] [Updated] (MAPREDUCE-4842) Shuffle race can hang reducer
Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Lowe updated MAPREDUCE-4842:
----------------------------------
Attachment: MAPREDUCE-4842.patch
Updated the patch to add a test case and rename checkAndRestartMerge to onSuccessfulMerge.