Posted to yarn-issues@hadoop.apache.org by "Yuqi Wang (JIRA)" <ji...@apache.org> on 2018/12/19 11:29:00 UTC

[jira] [Comment Edited] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

    [ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16724903#comment-16724903 ] 

Yuqi Wang edited comment on YARN-9151 at 12/19/18 11:28 AM:
------------------------------------------------------------

BTW, [~jianhe], for YARN-4438, you said:
{quote}_If it is due to close(), don't we want to force give-up so the other RM becomes active? If it is on initAndStartLeaderLatch(), *this RM will never become active; don't we want to just die?*_

What do you mean by force give-up ? exit RM ?
 The underlying curator implementation *will retry the connection in background*, even though the exception is thrown. See *Guaranteeable* interface in Curator. I think exit RM is too harsh here. Even though RM remains at standby, all services should be already shutdown, so there's no harm to the end users ?
{quote}
However, in this case, if we are using CuratorBasedElectorService, I think Curator will *NOT* retry the connection, because I saw the following in the log and confirmed it in Curator's code:

*Background exception was not retry-able or retry gave up for UnknownHostException*
{code:java}
2018-12-14 14:14:20,847 ERROR [Curator-Framework-0] org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception was not retry-able or retry gave up
java.net.UnknownHostException: BN2AAP10C07C229
	at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
	at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
	at java.net.InetAddress.getAllByName(InetAddress.java:1192)
	at java.net.InetAddress.getAllByName(InetAddress.java:1126)
	at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
	at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:461)
	at org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:146)
	at org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
	at org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:55)
	at org.apache.curator.ConnectionState.reset(ConnectionState.java:218)
	at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:193)
	at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
	at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:806)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{code}
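To illustrate (a standalone snippet against Curator's public retry API, not code from YARN or from this patch): an UnknownHostException is not one of the ZooKeeper connection exceptions that Curator considers retryable, so the background loop logs the error above and gives up.
{code:java}
import java.net.UnknownHostException;
import org.apache.curator.RetryLoop;

public class RetryableCheck {
  public static void main(String[] args) {
    // Curator only retries connection-level KeeperExceptions (connection loss,
    // session expired, operation timeout, ...). A DNS resolution failure is a
    // plain UnknownHostException, hence "Background exception was not
    // retry-able or retry gave up" and no further retries.
    System.out.println(
        RetryLoop.isRetryException(new UnknownHostException("BN2AAP10C07C229")));
    // Expected output: false
  }
}
{code}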
Besides, in YARN-4438, I did not see you use the Curator *Guaranteeable* interface.
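For context, using the guaranteed (background-retried) operations in Curator generally looks like the following; this is a generic illustration with a placeholder connect string and path, not code from YARN-4438:
{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class GuaranteedDeleteExample {
  public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181",          // placeholder connect string
        new ExponentialBackoffRetry(1000, 3)); // retry policy for foreground ops
    client.start();
    // guaranteed(): if this delete fails due to a connection problem, Curator
    // keeps retrying it in the background until it succeeds or the client is
    // closed. Only operations built this way get that guarantee.
    client.delete().guaranteed().forPath("/some/placeholder/path");
    client.close();
  }
}
{code}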

So, in the patch, if rejoining the election throws an exception, it will send EMBEDDED_ELECTOR_FAILED, and then the RM will crash.
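A minimal sketch of that flow in the elector's rejoin path (the helper name and the exact RMFatalEvent wiring are assumptions for illustration, not the actual patch code):
{code:java}
// Sketch only: if re-entering the election fails, do not swallow the
// exception; report a fatal event so the RM exits instead of silently
// staying outside the election forever.
try {
  rejoinElection();  // hypothetical helper that closes and recreates the leader latch
} catch (Exception e) {
  // With the patch, EMBEDDED_ELECTOR_FAILED is no longer handled by only
  // transitioning to standby; the central fatal-event handler crashes the RM.
  rmContext.getDispatcher().getEventHandler().handle(
      new RMFatalEvent(RMFatalEventType.EMBEDDED_ELECTOR_FAILED, e));
}
{code}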



> Standby RM hangs (not retry or crash) forever due to forever lost from leader election
> --------------------------------------------------------------------------------------
>
>                 Key: YARN-9151
>                 URL: https://issues.apache.org/jira/browse/YARN-9151
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.9.2
>            Reporter: Yuqi Wang
>            Assignee: Yuqi Wang
>            Priority: Major
>              Labels: patch
>             Fix For: 3.1.1
>
>         Attachments: YARN-9151.001.patch, yarn_rm.zip
>
>
> {color:#205081}*Issue Summary:*{color}
>  Standby RM hangs forever (it neither retries nor crashes) because it is permanently lost from the leader election.
>  
> {color:#205081}*Issue Repro Steps:*{color}
>  # Start multiple RMs in HA mode
>  # Modify all hostnames in the zk connect string to different values in DNS.
>  (In reality, we needed to replace old/bad ZK machines with new/good ZK machines, so their DNS hostnames changed.)
>  
> {color:#205081}*Issue Logs:*{color}
> See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
>   Start to becomeActive 
>     Start RMActiveServices 
>     Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
>     Stop CommonNodeLabelsManager
>     Stop RMActiveServices
>     Create and Init RMActiveServices
>   Fail to becomeActive 
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException (Here the exception is eaten and only an event is sent)
>   Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Already in standby state
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException (Here the exception is eaten and only an event is sent)
>   Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Found RMActiveServices's StandByTransitionRunnable object has already run previously, so immediately return
>    
> (The standby RM failed to rejoin the election and will never retry or crash later, so there are no ZK-related logs afterwards and the standby RM hangs forever, even after the ZK connect string hostnames are changed back to the original ones in DNS.)
> {noformat}
> So, this should be a bug in the RM, because *the RM should always keep trying to join the election* (giving up the election should only happen when the RM decides to crash); otherwise, an RM that is not inside the election can never become active again and start doing real work.
>  
> {color:#205081}*Caused By:*{color}
> It was introduced by YARN-3742.
> What that JIRA wanted to improve is that, when a STATE_STORE_OP_FAILED RMFatalEvent happens, the RM should transition to standby instead of crashing.
>  *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition the RM to standby, instead of crashing it.* (In contrast, before this change, the RM crashed on all of them instead of going to standby.)
>  So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, the RM can be left unable to work, such as staying in standby forever.
> And as the author [said|https://issues.apache.org/jira/browse/YARN-3742?focusedCommentId=15891385&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15891385]:
> {quote}I think a good approach here would be to change the RMFatalEvent handler to transition to standby as the default reaction, *with shutdown as a special case for certain types of failures.*
> {quote}
> But the author was *too optimistic when implementing the patch.*
>  
> {color:#205081}*What the Patch's solution:*{color}
> So, to be *conservative*, we had better *only transition to standby for the failures in the {color:#14892c}whitelist{color}:*
>  public enum RMFatalEventType {
>  {color:#14892c}// Source <- Store{color}
>  {color:#14892c}STATE_STORE_FENCED,{color}
>  {color:#14892c}STATE_STORE_OP_FAILED,{color}
> // Source <- Embedded Elector
>  EMBEDDED_ELECTOR_FAILED,
> {color:#14892c}// Source <- Admin Service{color}
>  {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}
> // Source <- Critical Thread Crash
>  CRITICAL_THREAD_CRASH
>  }
> And the others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and failure types added in the future, should crash the RM, because we *cannot ensure* that they will *never* leave the RM unable to work in standby state, and the *conservative* way is to crash the RM. Besides, after a crash, the RM's external watchdog service can notice it and try to repair the RM machine, send alerts, etc.
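> Below is a minimal sketch (with assumed method names, not the actual patch code) of what "whitelist to standby, everything else crashes" means in the RMFatalEvent handler:
> {code:java}
> // Sketch only: keep standby as the reaction for whitelisted failures and
> // crash the RM for every other (including future) failure type.
> private void handleFatalEvent(RMFatalEvent event) {
>   switch (event.getType()) {
>     case STATE_STORE_FENCED:
>     case STATE_STORE_OP_FAILED:
>     case TRANSITION_TO_ACTIVE_FAILED:
>       // Whitelisted: the RM is known to still work after moving to standby.
>       transitionToStandby();  // assumed helper
>       break;
>     default:
>       // EMBEDDED_ELECTOR_FAILED, CRITICAL_THREAD_CRASH, and any type added
>       // later: we cannot ensure standby still works, so crash and let the
>       // external watchdog repair the machine or raise alerts.
>       ExitUtil.terminate(1, "RM received a fatal event: " + event.getType());
>   }
> }
> {code}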
> For more details, please check the patch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org