Posted to issues@ozone.apache.org by "Neil Joshi (Jira)" <ji...@apache.org> on 2023/05/06 17:16:00 UTC
[jira] [Commented] (HDDS-8558) [SCM HA] NotLeaderExceptions after SCM transfer leader to new node.

    [ https://issues.apache.org/jira/browse/HDDS-8558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720238#comment-17720238 ] 

Neil Joshi commented on HDDS-8558:
----------------------------------

Found while creating Docker cluster dev-environment tests for SCM decommissioning. See PR https://github.com/apache/ozone/pull/4649 (HDDS-8518).

> [SCM HA] NotLeaderExceptions after SCM transfer leader to new node.
> -------------------------------------------------------------------
>
>                 Key: HDDS-8558
>                 URL: https://issues.apache.org/jira/browse/HDDS-8558
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM HA
>            Reporter: Neil Joshi
>            Priority: Major
>
> With SCM HA, if a new SCM node is added to the quorum and leadership is manually transferred to the new node, RPC calls to the SCM fail with NotLeaderExceptions. Client failover _never_ resolves and _never_ fails over to the newly added node.
>  
> Steps to reproduce:
> i.) start an SCM HA cluster
> ii.) add a new SCM to the quorum
> iii.) manually transfer leadership to the newly added node
> iv.) perform an RPC call to the SCM from a client
> {code:java}
> Transfer leadership successfully to 00bd9308-3467-4229-8587-3b4576834c72.
> bash-4.2$ ozone admin scm roles
> com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdds.ratis.ServerNotLeaderException): Server:f319b7a5-c4b5-48ec-bfef-ed61e6c2e082 is not the leader. Suggested leader is Server:scm4.org:9860.
>     at org.apache.hadoop.hdds.ratis.ServerNotLeaderException.convertToNotLeaderException(ServerNotLeaderException.java:106)
>     at org.apache.hadoop.hdds.scm.ha.RatisUtil.checkRatisException(RatisUtil.java:246)
>     at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:199)
>     at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
>     at java.base/java.security.AccessController.doPrivileged(Native Method)
>     at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)
> , while invoking $Proxy20.submitRequest over nodeId=scm1,nodeAddress=scm1.org/172.25.0.116:9860 after 3 failover attempts. Trying to failover after sleeping for 2000ms. Current retry count: 3.
> com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdds.ratis.ServerNotLeaderException): Server:c8019701-b4ea-42f9-bff5-86087900efe3 is not the leader. Suggested leader is Server:scm4.org:9860.
>     at org.apache.hadoop.hdds.ratis.ServerNotLeaderException.convertToNotLeaderException(ServerNotLeaderException.java:106)
>     at org.apache.hadoop.hdds.scm.ha.RatisUtil.checkRatisException(RatisUtil.java:246)
>     at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:199)
>     at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
>     at java.base/java.security.AccessController.doPrivileged(Native Method)
>     at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)
> , while invoking $Proxy20.submitRequest over nodeId=scm2,nodeAddress=scm2.org/172.25.0.117:9860 after 4 failover attempts. Trying to failover after sleeping for 2000ms. Current retry count: 4.
> com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdds.ratis.ServerNotLeaderException): Server:10496103-c9cf-4275-8b08-c44e08fbc0a6 is not the leader. Suggested leader is Server:scm4.org:9860.
>     at org.apache.hadoop.hdds.ratis.ServerNotLeaderException.convertToNotLeaderException(ServerNotLeaderException.java:106)
>     at org.apache.hadoop.hdds.scm.ha.RatisUtil.checkRatisException(RatisUtil.java:246)
>     at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:199)
>     at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
>     at java.base/java.security.AccessController.doPrivileged(Native Method)
>     at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)
> , while invoking $Proxy20.submitRequest over nodeId=scm3,nodeAddress=scm3.org/172.25.0.118:9860 after 5 failover attempts. Trying to failover after sleeping for 2000ms. Current retry count: 5.{code}
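
In the log above the client rotates over scm1, scm2 and scm3, while every reply suggests scm4.org:9860 as the leader. The following toy model only illustrates that symptom; it is not Ozone's failover proxy code, and the issue does not establish the root cause. The class name, method name and the followHint flag are hypothetical. It assumes a client that knows a fixed list of node IDs and either ignores or follows the suggested-leader hint.

```java
import java.util.ArrayList;
import java.util.List;

public class FailoverSketch {

    /**
     * Returns the node IDs contacted, in order, until the leader answers
     * or maxAttempts is exhausted. Hypothetical model, not Ozone code.
     */
    public static List<String> contactOrder(List<String> configured,
                                            String leader,
                                            int maxAttempts,
                                            boolean followHint) {
        List<String> known = new ArrayList<>(configured);
        List<String> contacted = new ArrayList<>();
        int index = 0;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String target = known.get(index % known.size());
            contacted.add(target);
            if (target.equals(leader)) {
                return contacted; // the leader answered, stop retrying
            }
            // A non-leader replies "not the leader, suggested leader is X".
            if (followHint && !known.contains(leader)) {
                known.add(leader);        // learn the suggested node
                index = known.size() - 1; // and try it next
            } else {
                index++;                  // plain round-robin failover
            }
        }
        return contacted;
    }

    public static void main(String[] args) {
        List<String> configured = List.of("scm1", "scm2", "scm3");
        // Ignoring the hint: the client never reaches scm4.
        System.out.println(contactOrder(configured, "scm4", 6, false));
        // Following the hint: the second call goes to scm4.
        System.out.println(contactOrder(configured, "scm4", 6, true));
    }
}
```

With the hint ignored, the model prints [scm1, scm2, scm3, scm1, scm2, scm3], which matches the rotation seen in the log. With the hint followed, it prints [scm1, scm4].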


