Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2012/05/31 21:45:23 UTC

[jira] [Commented] (MAPREDUCE-4298) NodeManager crashed after running out of file descriptors

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286869#comment-13286869 ] 

Jason Lowe commented on MAPREDUCE-4298:
---------------------------------------

The relevant portion of the NM logfile:

{noformat}
 [New I/O server worker #1-2]2012-05-25 21:30:36,719 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/1/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000833_0/file.out not found
 [New I/O server worker #1-15]2012-05-25 21:30:36,719 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/1/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000833_0/file.out not found
 [New I/O server worker #1-23]2012-05-25 21:30:36,719 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/1/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000833_0/file.out not found
 [New I/O server worker #1-2]2012-05-25 21:30:36,720 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/9/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000831_0/file.out not found
 [New I/O server worker #1-17]2012-05-25 21:30:36,720 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/1/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000833_0/file.out not found
 [New I/O server worker #1-24]2012-05-25 21:30:36,721 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/10/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000030_0/file.out not found
 [New I/O server worker #1-30]2012-05-25 21:30:36,722 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/1/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000314_0/file.out not found
 [New I/O server worker #1-18]2012-05-25 21:30:36,722 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_1337019383242_29334_01_000148 succeeded 
 [ContainersLauncher #2677]2012-05-25 21:30:36,722 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1337019383242_29334_01_000148 transitioned from RUNNING to EXITED_WITH_SUCCESS
 [AsyncDispatcher event handler]2012-05-25 21:30:36,722 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1337019383242_29334_01_000148
 [AsyncDispatcher event handler]2012-05-25 21:30:36,727 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/9/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000831_0/file.out not found
 [New I/O server worker #1-7]2012-05-25 21:30:36,727 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/0/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000263_0/file.out not found
 [New I/O server worker #1-27]2012-05-25 21:30:36,728 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/0/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000263_0/file.out not found
 [New I/O server worker #1-13]2012-05-25 21:30:36,728 FATAL org.apache.hadoop.conf.Configuration: error parsing conf file: java.io.FileNotFoundException: /grid/0/Releases/conf-xx-0.23.3.1204240816-20120514-000/hadoop/datanode/core-site.xml (Too many open files)
  [AsyncDispatcher event handler]2012-05-25 21:30:36,728 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
 [AsyncDispatcher event handler]java.lang.RuntimeException: java.io.FileNotFoundException: /grid/0/Releases/conf-xx-0.23.3.1204240816-20120514-000/hadoop/datanode/core-site.xml (Too many open files)
        at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1723)
        at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1588)
        at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1534)
        at org.apache.hadoop.conf.Configuration.get(Configuration.java:581)
        at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:598)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1332)
        at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:142)
        at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:233)
        at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:315)
        at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:313)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
        at org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:313)
        at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:426)
        at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:406)
        at org.apache.hadoop.fs.FileContext.getLocalFSFileContext(FileContext.java:392)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:353)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:144)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:54)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:125)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:74)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.FileNotFoundException: /grid/0/Releases/conf-xx-0.23.3.1204240816-20120514-000/hadoop/datanode/core-site.xml (Too many open files)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:106)
        at java.io.FileInputStream.<init>(FileInputStream.java:66)
        at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:70)
        at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:161)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:653)
        at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:186)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:772)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:235)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:180)
        at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1637)
        ... 22 more
2012-05-25 21:30:36,729 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye..
{noformat}

It's not clear yet exactly why the NM hit its open file ulimit, which was set to 8192.  There were jobs running on the cluster that launched 4000+ reducers, and potentially multiple jobs of that type were running simultaneously.

I checked the ShuffleHandler, which uses Netty, and I didn't see any limit on the number of connections Netty will accept.  I wrote a quick Python script that simply connects to the shuffle port as fast as it can, keeping all of the sockets open (a sketch of the script follows the log below).  Shortly after starting it, the Netty boss thread crashed with this in the NM logfile:

{noformat}
Exception in thread "ShuffleHandler Netty Boss #0" java.lang.InternalError
        at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:755)
        at sun.misc.URLClassPath.getResource(URLClassPath.java:169)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:194)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at sun.misc.Launcher$ExtClassLoader.findClass(Launcher.java:229)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        at java.util.ResourceBundle$RBClassLoader.loadClass(ResourceBundle.java:435)
        at java.util.ResourceBundle$Control.newBundle(ResourceBundle.java:2289)
        at java.util.ResourceBundle.loadBundle(ResourceBundle.java:1364)
        at java.util.ResourceBundle.findBundle(ResourceBundle.java:1328)
        at java.util.ResourceBundle.findBundle(ResourceBundle.java:1282)
        at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1224)
        at java.util.ResourceBundle.getBundle(ResourceBundle.java:705)
        at java.util.logging.Level.getLocalizedName(Level.java:223)
        at java.util.logging.SimpleFormatter.format(SimpleFormatter.java:64)
        at java.util.logging.StreamHandler.publish(StreamHandler.java:179)
        at java.util.logging.ConsoleHandler.publish(ConsoleHandler.java:88)
        at java.util.logging.Logger.log(Logger.java:481)
        at java.util.logging.Logger.doLog(Logger.java:503)
        at java.util.logging.Logger.logp(Logger.java:703)
        at org.jboss.netty.logging.JdkLogger.warn(JdkLogger.java:86)
        at org.jboss.netty.logging.InternalLoggerFactory$1.warn(InternalLoggerFactory.java:133)
        at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.run(NioServerSocketPipelineSink.java:260)
        at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.jboss.netty.util.internal.IoWorkerRunnable.run(IoWorkerRunnable.java:46)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.util.zip.ZipException: error in opening zip file
        at java.util.zip.ZipFile.open(Native Method)
        at java.util.zip.ZipFile.<init>(ZipFile.java:127)
        at java.util.jar.JarFile.<init>(JarFile.java:135)
        at java.util.jar.JarFile.<init>(JarFile.java:72)
        at sun.misc.URLClassPath$JarLoader.getJarFile(URLClassPath.java:646)
        at sun.misc.URLClassPath$JarLoader.access$600(URLClassPath.java:540)
        at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:607)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.misc.URLClassPath$JarLoader.ensureOpen(URLClassPath.java:599)
        at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:753)
        ... 31 more
{noformat}
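
For reference, here's a minimal sketch of the kind of script used.  The host is a hypothetical placeholder, and the port should be whatever mapreduce.shuffle.port is set to on the NM:

{noformat}
#!/usr/bin/env python
# Minimal sketch of the test script described above: connect to the NM
# shuffle port as fast as possible and never close the sockets.
# HOST and PORT are hypothetical placeholders for the node under test.
import socket

HOST = 'nm-host.example.com'   # hypothetical NodeManager host
PORT = 8080                    # whatever mapreduce.shuffle.port is set to

socks = []                     # keep references so the sockets stay open
try:
    while True:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((HOST, PORT))
        socks.append(s)
except socket.error as e:
    print('gave up after %d connections: %s' % (len(socks), e))
{noformat}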

So in this case Netty probably got an EMFILE error when trying to accept a new connection after the process hit its file descriptor limit, and when it tried to log a warning it ended up invoking the class loader, which in turn could not obtain a file descriptor to reopen the jar file, hence the ZipException.

Given how many reducers were running, in theory they could all hit a single nodemanager at nearly the same time while waiting for the last map to finish.  If for some reason the reducers' shuffle connections lingered long enough, enough reducers in a short enough time span would exhaust the file descriptors on the nodemanager.  I can't tell from the logs whether that is exactly what happened in this case, but it is concerning that there appears to be no limit on the number of connections the shuffle handler will accept before we hit hard OS limits that can crash the NM in various ways.
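
For illustration only, here's a toy sketch (in Python, not the actual Netty-based ShuffleHandler code) of the kind of accept-side cap that would shed connections before the process ulimit is reached.  The MAX_CONNS value and port are hypothetical:

{noformat}
#!/usr/bin/env python
# Toy illustration (not ShuffleHandler code): cap the number of connections
# being served concurrently so the server sheds load instead of exhausting
# its file descriptor limit.  MAX_CONNS and the port are hypothetical.
import socket
import threading

MAX_CONNS = 4096                      # hypothetical cap, below the 8192 ulimit
slots = threading.BoundedSemaphore(MAX_CONNS)

def handle(conn):
    try:
        conn.recv(4096)               # placeholder for real shuffle work
        conn.sendall(b'...')
    finally:
        conn.close()
        slots.release()               # free the slot when the socket closes

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('', 8080))               # assumed shuffle port
server.listen(128)

while True:
    conn, _ = server.accept()
    if not slots.acquire(False):      # over the cap: drop instead of EMFILE
        conn.close()
        continue
    threading.Thread(target=handle, args=(conn,)).start()
{noformat}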
                
> NodeManager crashed after running out of file descriptors
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-4298
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4298
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3
>            Reporter: Jason Lowe
>
> A node on one of our clusters fell over because it ran out of open file descriptors.  Log details with stack traceback to follow.
