Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2013/10/15 17:36:43 UTC

[jira] [Updated] (MAPREDUCE-5584) ShuffleHandler becomes unresponsive during gridmix runs and can leak file descriptors

     [ https://issues.apache.org/jira/browse/MAPREDUCE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-5584:
----------------------------------

    Priority: Blocker  (was: Major)

The reducers were timing out while trying to contact certain nodes to fetch map outputs.  Simple GET probes to the shuffle port on these nodes showed that they were indeed totally unresponsive.  Examination of the nodes showed that they had leaked a significant number of file descriptors, with the corresponding sockets sitting in the CLOSE_WAIT state.
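For anyone who wants to repeat the GET probe, here is a rough sketch of one way to do it, not the exact probe used above.  The class name ShuffleProbe and the method isResponsive are made up for illustration, and the default port 13562 in main is an assumption (it is the usual default for mapreduce.shuffle.port; check the configuration on the cluster being probed).  Any HTTP status counts as "responsive"; only a connect failure or a read timeout counts as unresponsive.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class ShuffleProbe {
    // Hypothetical helper: returns true if the server answers a plain GET
    // with any HTTP status before the timeout, false otherwise.
    static boolean isResponsive(String host, int port, int timeoutMs) {
        HttpURLConnection conn = null;
        try {
            URL url = new URL("http", host, port, "/");
            conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(timeoutMs);
            conn.setReadTimeout(timeoutMs);
            conn.setRequestMethod("GET");
            conn.getResponseCode(); // even an error status means the handler is alive
            return true;
        } catch (IOException e) {
            // Connection refused, connect timeout, or read timeout.
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        // Assumed default shuffle port; override on the command line if needed.
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 13562;
        System.out.println(host + ":" + port + " responsive="
            + isResponsive(host, port, 5000));
    }
}
```

A node in the state described above would accept the TCP connection but never answer, so the probe would report false once the read timeout expires.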

The jstacks of the NodeManager processes on these nodes also showed that all of the Netty handlers were stuck in LocalDirAllocator.getLocalPathToRead.  Each handler was either blocked on the synchronized lock or waiting for fs.exists() to return, which forks and execs {{stat}} since HADOOP-9652.
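To illustrate why this pattern stalls every handler at once, here is a small self-contained sketch.  It is not the Hadoop code: SlowAllocator and its lookup method are hypothetical stand-ins, the slow existence check is simulated with Thread.sleep, and the sketch assumes the slow check runs while the lock is held.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class SerializedLookupDemo {
    // Hypothetical stand-in for a path lookup guarded by a single lock.
    static class SlowAllocator {
        private final long checkMillis;

        SlowAllocator(long checkMillis) {
            this.checkMillis = checkMillis;
        }

        // Only one thread at a time can be inside this method.
        synchronized String lookup(String path) throws InterruptedException {
            Thread.sleep(checkMillis); // simulates an existence check that forks a process
            return path;
        }
    }

    // Runs one lookup per handler thread and returns the wall-clock time in ms.
    static long runMillis(int handlers, long checkMillis) throws Exception {
        SlowAllocator allocator = new SlowAllocator(checkMillis);
        ExecutorService pool = Executors.newFixedThreadPool(handlers);
        long start = System.nanoTime();
        try {
            List<Future<String>> results = new ArrayList<>();
            for (int i = 0; i < handlers; i++) {
                final String path = "map_" + i + ".out";
                results.add(pool.submit(() -> allocator.lookup(path)));
            }
            for (Future<String> f : results) {
                f.get(); // wait for every handler to finish
            }
        } finally {
            pool.shutdown();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("8 handlers x 50 ms check took "
            + runMillis(8, 50) + " ms");
    }
}
```

Even with one thread per request, the total time grows roughly as handlers times check duration, because the lookups run one after another.  With a fixed handler pool, a slow enough check leaves no thread free to service new connections.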

> ShuffleHandler becomes unresponsive during gridmix runs and can leak file descriptors
> -------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5584
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5584
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Priority: Blocker
>
> While running gridmix on 2.3 we noticed that jobs were running much slower than normal.  We tracked this down to reducers having difficulty shuffling data from maps.  Details to follow.



--
This message was sent by Atlassian JIRA
(v6.1#6144)