Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2012/10/18 01:48:03 UTC

[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478496#comment-13478496 ] 

Jason Lowe commented on MAPREDUCE-4730:
---------------------------------------

Here's what I have gathered so far from a heap dump of an AM attempt that was just about to run out of memory.  Most of the memory was consumed by byte buffers, and most of those appeared to be RPC response buffers.

I think there might be a flow control issue in the IPC layer that led to this.  More than half of the mappers finished before the first reducer started, and thousands of reducers all launched within a few seconds of each other.  They all asked the AM for map task completion events, and the AM currently caps each response at 10000 events per query.  Since more than 10000 maps completed before the first reducers started, each reducer received a full event list, which took around 900K per response buffer.  There were many IPC Handler threads to service the calls, but only one Responder thread to send out the rather large response buffers.  I see there's a blocking queue to prevent too many calls from coming in at once, but I didn't see any flow control between the Handlers and the Responder thread.  It appears that as long as the Handler threads can keep the call queue relatively low, they can continue to queue up call response data faster than the Responder thread can send it out.  Eventually this will exhaust available memory, leading to an OOM.
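
For a rough sense of scale, using the numbers above: about 3000 reducers each waiting on a response of about 900K would mean on the order of 2.7 GB of queued response data in the worst case, if nothing stops the Handlers from getting that far ahead of the Responder.  The following is a minimal sketch of the kind of flow control that appears to be missing.  It is not the Hadoop IPC code; every class, method and parameter name in it is hypothetical.  It assumes that a bounded queue sits between the handler threads and a single responder thread, so that a handler blocks instead of allocating yet another response buffer when the responder falls behind.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; none of these names exist in Hadoop.
final class BoundedResponseQueueSketch {
    // Sentinel that tells the responder thread to stop.
    private static final byte[] POISON = new byte[0];

    /**
     * Starts `handlers` producer threads that each enqueue `perHandler`
     * response buffers of `responseBytes` bytes, plus one slow responder.
     * Returns {peak number of queued responses observed, responses sent}.
     */
    static int[] run(int handlers, int perHandler, int responseBytes, int capacity) {
        BlockingQueue<byte[]> pending = new ArrayBlockingQueue<>(capacity);
        AtomicInteger peak = new AtomicInteger();
        AtomicInteger sent = new AtomicInteger();

        // Single responder: drains the queue more slowly than handlers fill it.
        Thread responder = new Thread(() -> {
            try {
                while (true) {
                    byte[] response = pending.take();
                    if (response == POISON) {
                        return;
                    }
                    Thread.sleep(1); // stand-in for a slow network write
                    sent.incrementAndGet();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        responder.start();

        Thread[] handlerThreads = new Thread[handlers];
        for (int i = 0; i < handlers; i++) {
            handlerThreads[i] = new Thread(() -> {
                try {
                    for (int j = 0; j < perHandler; j++) {
                        // put() blocks while the queue is full: this is the
                        // back-pressure from the responder to the handlers.
                        pending.put(new byte[responseBytes]);
                        peak.accumulateAndGet(pending.size(), Math::max);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            handlerThreads[i].start();
        }

        try {
            for (Thread t : handlerThreads) {
                t.join();
            }
            pending.put(POISON);
            responder.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return new int[] { peak.get(), sent.get() };
    }

    public static void main(String[] args) {
        int[] result = run(4, 10, 900 * 1024, 8);
        System.out.println("peak queued responses: " + result[0]);
        System.out.println("responses sent: " + result[1]);
    }
}
```

With a bound like this, the memory held by pending responses is limited to roughly the queue capacity times the response size, at the cost of Handler threads stalling while the Responder catches up.  Whether a count-based or a byte-based limit fits the real IPC server better is an open question this sketch does not try to answer.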
                
> AM crashes due to OOM while serving up map task completion events
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-4730
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4730
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.3
>            Reporter: Jason Lowe
>            Priority: Blocker
>
> We're seeing a repeatable OOM crash in the AM for a task with around 30000 maps and 3000 reducers.  Details to follow.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira