Posted to yarn-issues@hadoop.apache.org by "Botong Huang (JIRA)" <ji...@apache.org> on 2018/10/08 17:31:00 UTC

[jira] [Commented] (YARN-8855) Application fails if one of the subclusters is down.

    [ https://issues.apache.org/jira/browse/YARN-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642204#comment-16642204 ] 

Botong Huang commented on YARN-8855:
------------------------------------

Thanks [~rahulanand90] for reporting it! Which federation policy (yarn.federation.policy-manager) and code version are you using? This should already be fixed in the latest trunk and branch-2.

> Application fails if one of the subclusters is down.
> ----------------------------------------------------
>
>                 Key: YARN-8855
>                 URL: https://issues.apache.org/jira/browse/YARN-8855
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Rahul Anand
>            Priority: Major
>
> If one of the subclusters is down, the application keeps retrying and eventually fails. About 30 failover attempts appear in the logs. The detailed exception is below.
> {code:java}
> 2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Container container_e03_1538297667953_0005_01_000001 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE | ContainerImpl.java:2093
> 2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Removing container_e03_1538297667953_0005_01_000001 from application application_1538297667953_0005 | ApplicationImpl.java:512
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping resource-monitoring for container_e03_1538297667953_0005_01_000001 | ContainersMonitorImpl.java:932
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Considering container container_e03_1538297667953_0005_01_000001 for log-aggregation | AppLogAggregatorImpl.java:538
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Got event CONTAINER_STOP for appId application_1538297667953_0005 | AuxServices.java:350
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping container container_e03_1538297667953_0005_01_000001 | YarnShuffleService.java:295
> 2018-10-08 14:21:21,247 | WARN | NM Event dispatcher | couldn't find container container_e03_1538297667953_0005_01_000001 while processing FINISH_CONTAINERS event | ContainerManagerImpl.java:1660
> 2018-10-08 14:21:22,248 | INFO | Node Status Updater | Removed completed containers from NM context: [container_e03_1538297667953_0005_01_000001] | NodeStatusUpdaterImpl.java:696
> 2018-10-08 14:21:26,734 | INFO | pool-16-thread-1 | Failing over to the ResourceManager for SubClusterId: cluster2 | FederationRMFailoverProxyProvider.java:124
> 2018-10-08 14:21:26,735 | INFO | pool-16-thread-1 | Flushing subClusters from cache and rehydrating from store, most likely on account of RM failover. | FederationStateStoreFacade.java:258
> 2018-10-08 14:21:26,738 | INFO | pool-16-thread-1 | Connecting to /192.168.0.25:8032 subClusterId cluster2 with protocol ApplicationClientProtocol as user root (auth:SIMPLE) | FederationRMFailoverProxyProvider.java:145
> 2018-10-08 14:21:26,741 | INFO | pool-16-thread-1 | java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to node-master1-IYTxR:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 28 failover attempts. Trying to failover after sleeping for 15261ms. | RetryInvocationHandler.java:411
> 2018-10-08 14:21:42,002 | INFO | pool-16-thread-1 | Failing over to the ResourceManager for SubClusterId: cluster2 | FederationRMFailoverProxyProvider.java:124
> 2018-10-08 14:21:42,003 | INFO | pool-16-thread-1 | Flushing subClusters from cache and rehydrating from store, most likely on account of RM failover. | FederationStateStoreFacade.java:258
> 2018-10-08 14:21:42,005 | INFO | pool-16-thread-1 | Connecting to /192.168.0.25:8032 subClusterId cluster2 with protocol ApplicationClientProtocol as user root (auth:SIMPLE) | FederationRMFailoverProxyProvider.java:145
> 2018-10-08 14:21:42,007 | INFO | pool-16-thread-1 | java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to node-master1-IYTxR:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 29 failover attempts. Trying to failover after sleeping for 21175ms. | RetryInvocationHandler.java:411
> 2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Failing over to the ResourceManager for SubClusterId: cluster2 | FederationRMFailoverProxyProvider.java:124
> 2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Flushing subClusters from cache and rehydrating from store, most likely on account of RM failover. | FederationStateStoreFacade.java:258
> 2018-10-08 14:22:03,186 | INFO | pool-16-thread-1 | Connecting to /192.168.0.25:8032 subClusterId cluster2 with protocol ApplicationClientProtocol as user root (auth:SIMPLE) | FederationRMFailoverProxyProvider.java:145
> 2018-10-08 14:22:03,189 | ERROR | pool-16-thread-1 | Failed to register application master: cluster2 Application: appattempt_1538297667953_0005_000001 | FederationInterceptor.java:1106
> java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to node-master1-IYTxR:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
>     at sun.reflect.GeneratedConstructorAccessor59.newInstance(Unknown Source)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:755)
>     at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1517)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1459)
> {code}
> cc [~botong] 
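
To confirm how many failover attempts a NodeManager log like the one quoted above contains, and how long the client slept between them, a small standalone scan along these lines may help. This is only a sketch and not part of Hadoop: the class name FailoverLogScan, its method names, and the regular expression are assumptions derived solely from the RetryInvocationHandler message format visible in the excerpt, and the two sample lines are shortened copies of messages from that excerpt.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Hadoop class: scans log lines for the
// "after N failover attempts ... sleeping for Mms" message shown above.
public class FailoverLogScan {

    // Assumed pattern, based only on the message text in the quoted log.
    private static final Pattern RETRY = Pattern.compile(
        "after (\\d+) failover attempts\\. Trying to failover after sleeping for (\\d+)ms");

    /** Returns the highest failover attempt count found, or -1 if no line matches. */
    public static int maxAttempts(List<String> lines) {
        int max = -1;
        for (String line : lines) {
            Matcher m = RETRY.matcher(line);
            if (m.find()) {
                max = Math.max(max, Integer.parseInt(m.group(1)));
            }
        }
        return max;
    }

    /** Sums the sleep durations, in milliseconds, reported by matching lines. */
    public static long totalSleepMillis(List<String> lines) {
        long total = 0L;
        for (String line : lines) {
            Matcher m = RETRY.matcher(line);
            if (m.find()) {
                total += Long.parseLong(m.group(2));
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Shortened copies of two messages from the excerpt above.
        List<String> sample = Arrays.asList(
            "submitApplication over cluster2 after 28 failover attempts. "
                + "Trying to failover after sleeping for 15261ms. | RetryInvocationHandler.java:411",
            "submitApplication over cluster2 after 29 failover attempts. "
                + "Trying to failover after sleeping for 21175ms. | RetryInvocationHandler.java:411");
        System.out.println("max failover attempts: " + maxAttempts(sample));
        System.out.println("total sleep (ms): " + totalSleepMillis(sample));
    }
}
```

Running the scan over the full log, rather than the two sample lines, would show whether the retries stop at a fixed attempt count, which is useful to report alongside the policy manager and code version.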


