Posted to yarn-issues@hadoop.apache.org by "Eric Yang (JIRA)" <ji...@apache.org> on 2018/06/11 22:59:00 UTC

[jira] [Commented] (YARN-8414) Nodemanager crashes soon if ATSv2 HBase is either down or absent

    [ https://issues.apache.org/jira/browse/YARN-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16508922#comment-16508922 ] 

Eric Yang commented on YARN-8414:
---------------------------------

The timeline collector retries HBase writes without any pause between failures. This leaves a large number of TCP sockets in the CLOSE_WAIT state. A socket is not released until it reaches the TCP socket timeout (20 seconds by default on Linux). Retries were creating sockets far faster than that timeout could release them, so the node manager exhausted system resources and the node manager process was brought down.
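
To illustrate the kind of pause that is missing, here is a minimal sketch of a capped exponential backoff between retries. It is not the actual YARN or HBase code and not necessarily the fix that will be applied to this issue; the class BackoffRetry, its methods, and all parameter names are hypothetical and exist only for this example.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of YARN or HBase): retries a call with a pause between failures. */
final class BackoffRetry {
    private BackoffRetry() {}

    /** Delay before the retry that follows failed attempt number `attempt` (0-based). */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        // Clamp the shift so large attempt counts cannot overflow for realistic base delays.
        int shift = Math.min(attempt, 20);
        return Math.min(maxMillis, baseMillis << shift);
    }

    /** Runs `write`, sleeping between failed attempts; rethrows the last IOException. */
    static <T> T callWithBackoff(Callable<T> write, int maxAttempts,
                                 long baseMillis, long maxMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return write.call();
            } catch (IOException e) {
                last = e;
                // Pause before the next attempt so failed connections have time to be
                // released instead of piling up; no sleep after the final attempt.
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(delayMillis(attempt, baseMillis, maxMillis));
                }
            }
        }
        throw last;
    }
}
```

With a base delay of 100 ms and a cap of 5000 ms, the pauses in this sketch would grow as 100, 200, 400, 800 ms and so on until they reach the cap, which bounds how many new sockets a failing writer can open per second.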

> Nodemanager crashes soon if ATSv2 HBase is either down or absent
> ----------------------------------------------------------------
>
>                 Key: YARN-8414
>                 URL: https://issues.apache.org/jira/browse/YARN-8414
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>    Affects Versions: 3.1.0
>            Reporter: Eric Yang
>            Priority: Critical
>
> The test cluster has 1000 apps running, and a user triggers capacity scheduler queue changes.  This crashes all node managers.  It looks like the node manager runs into "Too many open files" while aggregating logs for containers:
> {code}
> 2018-06-07 21:17:59,307 WARN  server.AbstractConnector (AbstractConnector.java:handleAcceptFailure(544)) -
> java.io.IOException: Too many open files
>         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
>         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
>         at org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:371)
>         at org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:601)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>         at java.lang.Thread.run(Thread.java:745)
> 2018-06-07 21:17:59,758 WARN  util.SysInfoLinux (SysInfoLinux.java:readProcMemInfoFile(238)) - Couldn't read /proc/meminfo; can't determine memory settings
> 2018-06-07 21:17:59,758 WARN  util.SysInfoLinux (SysInfoLinux.java:readProcMemInfoFile(238)) - Couldn't read /proc/meminfo; can't determine memory settings
> 2018-06-07 21:18:00,842 WARN  client.ConnectionUtils (ConnectionUtils.java:getStubKey(236)) - Can not resolve y012.l42scl.hortonworks.com, please check your network
> java.net.UnknownHostException: y012.l42scl.hortonworks.com: System error
>         at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
>         at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
>         at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
>         at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1192)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1126)
>         at java.net.InetAddress.getByName(InetAddress.java:1076)
>         at org.apache.hadoop.hbase.client.ConnectionUtils.getStubKey(ConnectionUtils.java:233)
>         at org.apache.hadoop.hbase.client.ConnectionImplementation.getClient(ConnectionImplementation.java:1189)
>         at org.apache.hadoop.hbase.client.ReversedScannerCallable.prepare(ReversedScannerCallable.java:111)
>         at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.prepare(ScannerCallableWithReplicas.java:399)
>         at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
>         at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> Timeline service has thousands of exceptions:
> {code}
> 2018-06-07 21:18:34,182 ERROR client.AsyncProcess (AsyncProcess.java:submit(291)) - Failed to get region location
> java.io.InterruptedIOException
>         at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:265)
>         at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:437)
>         at org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:312)
>         at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:597)
>         at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:834)
>         at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:732)
>         at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:281)
>         at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:236)
>         at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:307)
>         at org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:212)
>         at org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:170)
>         at org.apache.hadoop.yarn.server.timelineservice.storage.common.TypedBufferedMutator.mutate(TypedBufferedMutator.java:54)
>         at org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHelper.store(ColumnRWHelper.java:153)
>         at org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHelper.store(ColumnRWHelper.java:107)
>         at org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.store(HBaseTimelineWriterImpl.java:395)
>         at org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.write(HBaseTimelineWriterImpl.java:198)
>         at org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector.writeTimelineEntities(TimelineCollector.java:164)
>         at org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector.putEntitiesAsync(TimelineCollector.java:196)
>         at org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorWebService.putEntities(TimelineCollectorWebService.java:173)
>         at sun.reflect.GeneratedMethodAccessor145.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>         at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
>         at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
>         at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
>         at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>         at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>         at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>         at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>         at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
>         at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
>         at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
>         at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
>         at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
>         at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
>         at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>         at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>         at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)
>         at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:304)
>         at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
>         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>         at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
>         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>         at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>         at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>         at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>         at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>         at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>         at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>         at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>         at org.eclipse.jetty.server.Server.handle(Server.java:534)
>         at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
>         at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>         at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>         at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
>         at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>         at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>         at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>         at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>         at java.lang.Thread.run(Thread.java:745)
> 2018-06-07 21:18:36,266 INFO  retry.RetryInvocationHandler (RetryInvocationHandler.java:log(411)) - java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "y001.l42scl.hortonworks.com":8020; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getServerDefaults over y001.l42scl.hortonworks.com:8020 after 10 failover attempts. Trying to failover after sleeping for 9634ms.
> 2018-06-07 21:18:36,612 WARN  storage.HBaseTimelineWriterImpl (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: flowName=null appId=application_1528316765723_0030 userId=csingh clusterId=yarn-cluster . Not proceeding with writing to hbase
> 2018-06-07 21:18:38,396 INFO  client.RpcRetryingCallerImpl (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=6, retries=6, started=4213 ms ago, cancelled=false, msg=Call to y012.l42scl.hortonworks.com/172.26.32.112:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: y012.l42scl.hortonworks.com/172.26.32.112:17020, details=row 'prod.timelineservice.entity,csingh!yarn-cluster!scale-1-182!^?���(�^@<!^?���)8��^?���!COMPONENT!^@^@^@^@^@^@^@^@!simple,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=y012.l42scl.hortonworks.com,17020,1528302866813, seqNum=-1
> 2018-06-07 21:18:38,662 ERROR util.ShutdownHookManager (ShutdownHookManager.java:run(82)) - ShutdownHookManger shutdown forcefully
> {code}
> Nodes were temporarily unable to resolve hostnames to IP addresses.


