Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2012/05/31 21:19:23 UTC

[jira] [Created] (MAPREDUCE-4298) NodeManager crashed after running out of file descriptors

Jason Lowe created MAPREDUCE-4298:
-------------------------------------

             Summary: NodeManager crashed after running out of file descriptors
                 Key: MAPREDUCE-4298
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4298
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2
    Affects Versions: 0.23.3
            Reporter: Jason Lowe


A node on one of our clusters fell over because it ran out of open file descriptors.  Log details with stack traceback to follow.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4298) NodeManager crashed after running out of file descriptors

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291492#comment-13291492 ] 

Jason Lowe commented on MAPREDUCE-4298:
---------------------------------------

This occurred again on one of our clusters.  It turns out I was mistaken earlier: the file descriptor ulimit for our nodemanager daemons is set to 32768, not 8192.  Fortunately, this time we were able to examine some nodemanagers that had leaked numerous file descriptors but had not yet fallen over.
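
The exact commands used for that examination aren't recorded on the issue, but on Linux one way to see which files a process holds open, and how many descriptors point at the same file, is to resolve the symlinks under /proc/<pid>/fd.  A rough, hypothetical sketch (not taken from this issue):

{noformat}
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class FdByTarget {
  public static void main(String[] args) throws IOException {
    String pid = args.length > 0 ? args[0] : "self";
    Map<String, Integer> counts = new HashMap<String, Integer>();
    // Each entry under /proc/<pid>/fd is a symlink named after the descriptor number
    // and pointing at whatever the descriptor refers to (file, socket, pipe, ...).
    try (DirectoryStream<Path> fds = Files.newDirectoryStream(Paths.get("/proc", pid, "fd"))) {
      for (Path fd : fds) {
        try {
          String target = Files.readSymbolicLink(fd).toString();
          Integer seen = counts.get(target);
          counts.put(target, seen == null ? 1 : seen + 1);
        } catch (IOException racedAway) {
          // descriptor was closed between listing and reading; skip it
        }
      }
    }
    // Print only targets referenced by more than one descriptor -- e.g. the same
    // map output file held open hundreds of times.
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() > 1) {
        System.out.println(e.getValue() + "\t" + e.getKey());
      }
    }
  }
}
{noformat}

Run against the NodeManager's pid, any file listed with a large count is a candidate leak.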

Almost all of the file descriptors were referencing map outputs for the shuffle, often with hundreds of descriptors open to the same file.  Interestingly, almost all of the map files corresponded to just one job.  Examining the NM log around the time that job ran, I found numerous exceptions showing things had not gone smoothly during the shuffle for that job.  For example:

{noformat}
 [New I/O server worker #1-5]java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100)
        at sun.nio.ch.IOUtil.write(IOUtil.java:56)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
        at org.jboss.netty.channel.socket.nio.SocketSendBufferPool$PooledSendBuffer.transferTo(SocketSendBufferPool.java:239)
        at org.jboss.netty.channel.socket.nio.NioWorker.write0(NioWorker.java:470)
        at org.jboss.netty.channel.socket.nio.NioWorker.writeFromUserCode(NioWorker.java:388)
        at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.handleAcceptedSocket(NioServerSocketPipelineSink.java:137)
        at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.eventSunk(NioServerSocketPipelineSink.java:76)
        at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:68)
        at org.jboss.netty.handler.stream.ChunkedWriteHandler.flush(ChunkedWriteHandler.java:253)
        at org.jboss.netty.handler.stream.ChunkedWriteHandler.handleDownstream(ChunkedWriteHandler.java:123)
        at org.jboss.netty.channel.Channels.write(Channels.java:611)
        at org.jboss.netty.channel.Channels.write(Channels.java:578)
        at org.jboss.netty.channel.AbstractChannel.write(AbstractChannel.java:259)
        at org.apache.hadoop.mapred.ShuffleHandler$Shuffle.sendMapOutput(ShuffleHandler.java:477)
        at org.apache.hadoop.mapred.ShuffleHandler$Shuffle.messageReceived(ShuffleHandler.java:397)
        at org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:144)
        at org.jboss.netty.handler.codec.http.HttpChunkAggregator.messageReceived(HttpChunkAggregator.java:116)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:302)
        at org.jboss.netty.handler.codec.replay.ReplayingDecoder.unfoldAndfireMessageReceived(ReplayingDecoder.java:523)
        at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:507)
        at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:444)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:274)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:261)
        at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:350)
        at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:281)
        at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:201)
        at org.jboss.netty.util.internal.IoWorkerRunnable.run(IoWorkerRunnable.java:46)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
{noformat}

Looking closer at the job, I could see that it had run with 15000 maps and 2000 reduces.  Hundreds of the reducers had failed after running out of heap space during the shuffle phase, which led to broken pipe and connection reset errors on the nodemanagers trying to serve shuffle data to those reducers when they died.

I was able to reproduce the broken pipe issue and step through the code with a debugger.  Normally the file descriptor is closed via a listener added to the ChannelFuture returned when the map data is written; that listener's operationComplete() callback closes the file.  However, when there is an I/O error sending the shuffle header, Netty closes the channel automatically (and we also explicitly close it in a channel exception handler).  By the time we try to write the map file data to the channel, the channel is already closed, and I could see that when we write to a closed channel, the ChannelFuture's operationComplete() callback is never invoked.  No operationComplete() callback means we leak the file descriptor for the map file.  If multiple map files are being sent to the reducer, we leak multiple file descriptors for the same error.
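
For reference, the pattern in question looks roughly like the sketch below.  This is a simplified illustration of the idea, not the actual ShuffleHandler.sendMapOutput() code; the channel, path, offset, and length parameters are stand-ins.

{noformat}
import java.io.RandomAccessFile;

import org.jboss.netty.channel.Channel;
import org.jboss.netty.channel.ChannelFuture;
import org.jboss.netty.channel.ChannelFutureListener;
import org.jboss.netty.channel.DefaultFileRegion;
import org.jboss.netty.channel.FileRegion;

public class ShuffleSendSketch {
  // Writes one map output over the channel and relies on the write future to close
  // the underlying file once the transfer finishes (or fails).
  void sendMapOutput(Channel ch, String mapOutputPath, long offset, long length) throws Exception {
    final RandomAccessFile spill = new RandomAccessFile(mapOutputPath, "r"); // opens a descriptor

    FileRegion region = new DefaultFileRegion(spill.getChannel(), offset, length);

    // If ch is already closed (e.g. the earlier shuffle header write failed with a
    // broken pipe), Netty 3.2.3.Final never completes this future, so the listener
    // below never runs and the descriptor opened above is leaked.
    ChannelFuture writeFuture = ch.write(region);
    writeFuture.addListener(new ChannelFutureListener() {
      public void operationComplete(ChannelFuture future) throws Exception {
        spill.close();
      }
    });
  }
}
{noformat}

With 3.2.4.Final (per NETTY-374), the future is still notified when a write is attempted on a closed channel, so the listener fires and the file is closed.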

I searched around and discovered this is a known issue in Netty 3.2.3.Final, the version we're currently using.  See https://issues.jboss.org/browse/NETTY-374.  It's fixed in version 3.2.4.Final.
                
> NodeManager crashed after running out of file descriptors
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-4298
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4298
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3
>            Reporter: Jason Lowe
>
> A node on one of our clusters fell over because it ran out of open file descriptors.  Log details with stack traceback to follow.


        

[jira] [Commented] (MAPREDUCE-4298) NodeManager crashed after running out of file descriptors

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286869#comment-13286869 ] 

Jason Lowe commented on MAPREDUCE-4298:
---------------------------------------

The relevant portion of the NM logfile:

{noformat}
 [New I/O server worker #1-2]2012-05-25 21:30:36,719 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/1/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000833_0/file.out not found
 [New I/O server worker #1-15]2012-05-25 21:30:36,719 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/1/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000833_0/file.out not found
 [New I/O server worker #1-23]2012-05-25 21:30:36,719 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/1/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000833_0/file.out not found
 [New I/O server worker #1-2]2012-05-25 21:30:36,720 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/9/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000831_0/file.out not found
 [New I/O server worker #1-17]2012-05-25 21:30:36,720 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/1/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000833_0/file.out not found
 [New I/O server worker #1-24]2012-05-25 21:30:36,721 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/10/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000030_0/file.out not found
 [New I/O server worker #1-30]2012-05-25 21:30:36,722 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/1/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000314_0/file.out not found
 [New I/O server worker #1-18]2012-05-25 21:30:36,722 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_1337019383242_29334_01_000148 succeeded 
 [ContainersLauncher #2677]2012-05-25 21:30:36,722 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1337019383242_29334_01_000148 transitioned from RUNNING to EXITED_WITH_SUCCESS
 [AsyncDispatcher event handler]2012-05-25 21:30:36,722 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1337019383242_29334_01_000148
 [AsyncDispatcher event handler]2012-05-25 21:30:36,727 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/9/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000831_0/file.out not found
 [New I/O server worker #1-7]2012-05-25 21:30:36,727 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/0/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000263_0/file.out not found
 [New I/O server worker #1-27]2012-05-25 21:30:36,728 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/0/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000263_0/file.out not found
 [New I/O server worker #1-13]2012-05-25 21:30:36,728 FATAL org.apache.hadoop.conf.Configuration: error parsing conf file: java.io.FileNotFoundException: /grid/0/Releases/conf-xx-0.23.3.1204240816-20120514-000/hadoop/datanode/core-site.xml (Too many open files)
  [AsyncDispatcher event handler]2012-05-25 21:30:36,728 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
 [AsyncDispatcher event handler]java.lang.RuntimeException: java.io.FileNotFoundException: /grid/0/Releases/conf-xx-0.23.3.1204240816-20120514-000/hadoop/datanode/core-site.xml (Too many open files)
        at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1723)
        at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1588)
        at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1534)
        at org.apache.hadoop.conf.Configuration.get(Configuration.java:581)
        at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:598)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1332)
        at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:142)
        at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:233)
        at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:315)
        at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:313)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
        at org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:313)
        at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:426)
        at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:406)
        at org.apache.hadoop.fs.FileContext.getLocalFSFileContext(FileContext.java:392)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:353)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:144)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:54)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:125)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:74)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.FileNotFoundException: /grid/0/Releases/conf-xx-0.23.3.1204240816-20120514-000/hadoop/datanode/core-site.xml (Too many open files)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:106)
        at java.io.FileInputStream.<init>(FileInputStream.java:66)
        at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:70)
        at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:161)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:653)
        at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:186)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:772)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:235)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:180)
        at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1637)
        ... 22 more
2012-05-25 21:30:36,729 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye..
{noformat}

It's not clear yet exactly why the NM hit the open file ulimit, which was set to 8192.  There were jobs running on the cluster with 4000+ reducers launched (potentially multiple such jobs running simultaneously).

I checked the ShuffleHandler, which uses Netty, and didn't see any limit on the number of connections Netty would accept.  I wrote a quick Python script that simply connects to the shuffle port as fast as it can, keeping all of the sockets open, and shortly afterwards the Netty boss thread crashed with this in the NM logfile:

{noformat}
Exception in thread "ShuffleHandler Netty Boss #0" java.lang.InternalError
        at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:755)
        at sun.misc.URLClassPath.getResource(URLClassPath.java:169)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:194)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at sun.misc.Launcher$ExtClassLoader.findClass(Launcher.java:229)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        at java.util.ResourceBundle$RBClassLoader.loadClass(ResourceBundle.java:435)
        at java.util.ResourceBundle$Control.newBundle(ResourceBundle.java:2289)
        at java.util.ResourceBundle.loadBundle(ResourceBundle.java:1364)
        at java.util.ResourceBundle.findBundle(ResourceBundle.java:1328)
        at java.util.ResourceBundle.findBundle(ResourceBundle.java:1282)
        at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1224)
        at java.util.ResourceBundle.getBundle(ResourceBundle.java:705)
        at java.util.logging.Level.getLocalizedName(Level.java:223)
        at java.util.logging.SimpleFormatter.format(SimpleFormatter.java:64)
        at java.util.logging.StreamHandler.publish(StreamHandler.java:179)
        at java.util.logging.ConsoleHandler.publish(ConsoleHandler.java:88)
        at java.util.logging.Logger.log(Logger.java:481)
        at java.util.logging.Logger.doLog(Logger.java:503)
        at java.util.logging.Logger.logp(Logger.java:703)
        at org.jboss.netty.logging.JdkLogger.warn(JdkLogger.java:86)
        at org.jboss.netty.logging.InternalLoggerFactory$1.warn(InternalLoggerFactory.java:133)
        at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.run(NioServerSocketPipelineSink.java:260)
        at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.jboss.netty.util.internal.IoWorkerRunnable.run(IoWorkerRunnable.java:46)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.util.zip.ZipException: error in opening zip file
        at java.util.zip.ZipFile.open(Native Method)
        at java.util.zip.ZipFile.<init>(ZipFile.java:127)
        at java.util.jar.JarFile.<init>(JarFile.java:135)
        at java.util.jar.JarFile.<init>(JarFile.java:72)
        at sun.misc.URLClassPath$JarLoader.getJarFile(URLClassPath.java:646)
        at sun.misc.URLClassPath$JarLoader.access$600(URLClassPath.java:540)
        at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:607)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.misc.URLClassPath$JarLoader.ensureOpen(URLClassPath.java:599)
        at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:753)
        ... 31 more
{noformat}

So in this case, Netty probably got an EMFILE error when trying to accept a new connection after hitting the descriptor limit, and when it tried to log a warning it ended up invoking the class loader, which was unable to open a new file descriptor for the jar file.
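
The quick test script itself isn't attached; a rough equivalent looks like the following (shown here in Java rather than Python, with the host and port taken as arguments for a test NodeManager's shuffle address; never point this at a production node):

{noformat}
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

public class ShufflePortFlood {
  public static void main(String[] args) throws Exception {
    // args: <host> <shuffle-port> of a test NodeManager
    String host = args[0];
    int port = Integer.parseInt(args[1]);
    List<Socket> sockets = new ArrayList<Socket>();
    // Open connections as fast as possible and never close them, so the server's
    // open descriptor count keeps climbing until it hits the process ulimit.
    while (true) {
      Socket s = new Socket();
      s.connect(new InetSocketAddress(host, port));
      sockets.add(s);
      if (sockets.size() % 1000 == 0) {
        System.out.println("open connections: " + sockets.size());
      }
    }
  }
}
{noformat}

The client side will eventually hit its own descriptor limit as well; raising its ulimit or running several copies gets past that.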

Given how many reducers were running, in theory they could all hit a single nodemanager at nearly the same time while waiting for the last map to finish.  If for some reason the reducer shuffle connections lingered long enough, we'd exhaust the file descriptors on the nodemanager given enough reducers within a short enough time span; for example, two concurrent jobs with 4000+ reducers each fetching from the same node would already approach the 8192-descriptor limit.  I can't tell from the logs whether this is exactly what happened in this case, but it is concerning that there appears to be no limit on the number of connections the shuffle handler will accept before we hit hard OS limits that crash the NM in various ways.
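
Nothing on this issue or in the attached patch adds such a limit (the patch only upgrades Netty), but for illustration, one possible shape of a cap at the Netty layer is a pipeline handler that refuses channels beyond a configured count, along these lines:

{noformat}
import java.util.concurrent.atomic.AtomicInteger;

import org.jboss.netty.channel.ChannelHandlerContext;
import org.jboss.netty.channel.ChannelStateEvent;
import org.jboss.netty.channel.SimpleChannelUpstreamHandler;

// A single instance of this handler would be added to every accepted channel's
// pipeline.  Channels beyond 'limit' are closed immediately instead of being served,
// shedding load before the process runs out of file descriptors.
public class ConnectionLimitHandler extends SimpleChannelUpstreamHandler {
  private final AtomicInteger open = new AtomicInteger();
  private final int limit;

  public ConnectionLimitHandler(int limit) {
    this.limit = limit;
  }

  @Override
  public void channelOpen(ChannelHandlerContext ctx, ChannelStateEvent e) throws Exception {
    if (open.incrementAndGet() > limit) {
      // channelClosed() below will undo the increment once the close completes.
      e.getChannel().close();
      return;
    }
    super.channelOpen(ctx, e);
  }

  @Override
  public void channelClosed(ChannelHandlerContext ctx, ChannelStateEvent e) throws Exception {
    open.decrementAndGet();
    super.channelClosed(ctx, e);
  }
}
{noformat}

Even with a cap, rejected reducers would see connection errors and retry, but the NM itself would stay up, which seems like the preferable failure mode.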
                
> NodeManager crashed after running out of file descriptors
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-4298
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4298
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3
>            Reporter: Jason Lowe
>
> A node on one of our clusters fell over because it ran out of open file descriptors.  Log details with stack traceback to follow.


        

[jira] [Updated] (MAPREDUCE-4298) NodeManager crashed after running out of file descriptors

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4298:
----------------------------------

    Status: Open  (was: Patch Available)

The patch for the Netty version change is against Hadoop Common code, so I filed HADOOP-8495.
                
> NodeManager crashed after running out of file descriptors
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-4298
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4298
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-4298.patch
>
>
> A node on one of our clusters fell over because it ran out of open file descriptors.  Log details with stack traceback to follow.


        

[jira] [Updated] (MAPREDUCE-4298) NodeManager crashed after running out of file descriptors

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4298:
----------------------------------

            Assignee: Jason Lowe
    Target Version/s: 0.23.3
              Status: Patch Available  (was: Open)
    
> NodeManager crashed after running out of file descriptors
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-4298
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4298
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: MAPREDUCE-4298.patch
>
>
> A node on one of our clusters fell over because it ran out of open file descriptors.  Log details with stack traceback to follow.


        

[jira] [Updated] (MAPREDUCE-4298) NodeManager crashed after running out of file descriptors

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4298:
----------------------------------

    Priority: Critical  (was: Major)

Bumping priority since one misconfigured job can take down nodemanagers across a significant portion of the cluster.
                
> NodeManager crashed after running out of file descriptors
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-4298
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4298
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-4298.patch
>
>
> A node on one of our clusters fell over because it ran out of open file descriptors.  Log details with stack traceback to follow.


        

[jira] [Resolved] (MAPREDUCE-4298) NodeManager crashed after running out of file descriptors

Posted by "Thomas Graves (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved MAPREDUCE-4298.
--------------------------------------

    Resolution: Duplicate

dup of HADOOP-8495
                
> NodeManager crashed after running out of file descriptors
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-4298
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4298
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.3, 2.0.0-alpha, 3.0.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-4298.patch
>
>
> A node on one of our clusters fell over because it ran out of open file descriptors.  Log details with stack traceback to follow.


        

[jira] [Updated] (MAPREDUCE-4298) NodeManager crashed after running out of file descriptors

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4298:
----------------------------------

          Component/s: nodemanager
    Affects Version/s: 3.0.0
                       2.0.0-alpha
    
> NodeManager crashed after running out of file descriptors
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-4298
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4298
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.3, 2.0.0-alpha, 3.0.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-4298.patch
>
>
> A node on one of our clusters fell over because it ran out of open file descriptors.  Log details with stack traceback to follow.


        

[jira] [Updated] (MAPREDUCE-4298) NodeManager crashed after running out of file descriptors

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4298:
----------------------------------

    Attachment: MAPREDUCE-4298.patch

Patch to update the Netty version from 3.2.3.Final to 3.2.4.Final.  I manually tested it on a single-node cluster to verify that a broken pipe during the write of the shuffle header does not lead to a leaked file descriptor.
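
The verification steps aren't spelled out above; one low-tech way to confirm the descriptor count stays flat while repeatedly triggering the broken-pipe case is to poll /proc/<pid>/fd on the NodeManager host, e.g. with a hypothetical helper like:

{noformat}
import java.io.File;

// Polls the number of open descriptors for a given pid (the NodeManager's) once a
// second and prints the count, so growth during repeated broken-pipe shuffles is
// easy to spot.  The pid is supplied by whoever runs it.
public class WatchFdCount {
  public static void main(String[] args) throws InterruptedException {
    String pid = args.length > 0 ? args[0] : "self";
    File fdDir = new File("/proc/" + pid + "/fd");
    while (true) {
      String[] entries = fdDir.list();
      System.out.println(System.currentTimeMillis() + " open fds: "
          + (entries == null ? "unknown" : entries.length));
      Thread.sleep(1000);
    }
  }
}
{noformat}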
                
> NodeManager crashed after running out of file descriptors
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-4298
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4298
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3
>            Reporter: Jason Lowe
>         Attachments: MAPREDUCE-4298.patch
>
>
> A node on one of our clusters fell over because it ran out of open file descriptors.  Log details with stack traceback to follow.
