Posted to issues@ozone.apache.org by "Neil Joshi (Jira)" <ji...@apache.org> on 2023/05/08 23:39:00 UTC

[jira] [Resolved] (HDDS-8558) [SCM HA] NotLeaderExceptions after SCM transfer leader to new node.

     [ https://issues.apache.org/jira/browse/HDDS-8558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neil Joshi resolved HDDS-8558.
------------------------------
    Resolution: Workaround

In a running SCM HA cluster, when a new node is added to the quorum, its node ID and host address:port are not known to the pre-existing SCM nodes unless every node's configuration has been updated to include the newly added node.

 

Because the pre-existing nodes in the SCM HA cluster are not configured with the newly added node, once that node is elected leader of the quorum, every RPC call to a pre-existing SCM node raises a ServerNotLeaderException, and failover never tries the newly added node.
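
The behaviour described above can be illustrated with a toy model. This is only a sketch and not Ozone's actual failover proxy provider: it assumes failover simply cycles through a configured list of node IDs, and the class name FailoverSketch, the method reachesLeader, and the node ID scm4 for the new node are all hypothetical.

```java
import java.util.List;

/** Toy model of failover over a configured node list; not Ozone's real implementation. */
final class FailoverSketch {

    /**
     * Cycles round-robin through the configured node ids and reports whether
     * any attempt lands on the current leader within maxAttempts tries.
     */
    static boolean reachesLeader(List<String> configuredNodes, String leader, int maxAttempts) {
        if (configuredNodes.isEmpty()) {
            return false; // nothing to fail over to
        }
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String target = configuredNodes.get(attempt % configuredNodes.size());
            if (target.equals(leader)) {
                return true; // the request is served by the leader
            }
            // Otherwise the follower answers "not the leader" and the caller moves on.
        }
        return false;
    }

    public static void main(String[] args) {
        // Stale configuration: the hypothetical new node scm4 leads but is not listed.
        System.out.println(reachesLeader(List.of("scm1", "scm2", "scm3"), "scm4", 15));
        // Updated configuration: scm4 is listed, so failover can reach it.
        System.out.println(reachesLeader(List.of("scm1", "scm2", "scm3", "scm4"), "scm4", 15));
    }
}
```

In this model, no number of retries helps while the leader is missing from the configured list, which matches the retry loop seen in the report.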

 

To resolve this and configure the SCM HA cluster properly, update ozone-site.xml on each SCM node with the following properties:

 
{code:xml}
<property>
<name>ozone.scm.nodes.scmservice</name>
<value>scm1,scm2,scm3,new_scm_node_id</value>
</property>
{code}
 

And:

{code:xml}
<property>
<name>ozone.scm.address.scmservice.new_scm_node_id</name>
<value>new_scm_host</value>
</property>
{code}
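
As a rough way to double-check an updated configuration, the sketch below reports node IDs that are listed for a service but have no address entry. It assumes the address key follows the pattern ozone.scm.address.<serviceId>.<nodeId> and works on a plain Map rather than a real ozone-site.xml. The class name ScmHaConfigCheck and the method nodesMissingAddress are hypothetical and not part of Ozone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: finds listed SCM node ids that lack an address property. */
final class ScmHaConfigCheck {

    static List<String> nodesMissingAddress(Map<String, String> props, String serviceId) {
        List<String> missing = new ArrayList<>();
        String nodes = props.getOrDefault("ozone.scm.nodes." + serviceId, "");
        for (String raw : nodes.split(",")) {
            String nodeId = raw.trim(); // tolerate spaces after commas
            if (nodeId.isEmpty()) {
                continue;
            }
            // Assumed key pattern: ozone.scm.address.<serviceId>.<nodeId>
            String address = props.get("ozone.scm.address." + serviceId + "." + nodeId);
            if (address == null || address.trim().isEmpty()) {
                missing.add(nodeId);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The new node id is listed, but its address property was forgotten.
        Map<String, String> props = Map.of(
            "ozone.scm.nodes.scmservice", "scm1,scm2,scm3, new_scm_node_id",
            "ozone.scm.address.scmservice.scm1", "scm1.org",
            "ozone.scm.address.scmservice.scm2", "scm2.org",
            "ozone.scm.address.scmservice.scm3", "scm3.org");
        System.out.println(nodesMissingAddress(props, "scmservice")); // prints [new_scm_node_id]
    }
}
```

Running the same check against the properties of every SCM node would flag any node whose configuration was not updated.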
 

cc [~nanda]

 

> [SCM HA] NotLeaderExceptions after SCM transfer leader to new node.
> -------------------------------------------------------------------
>
>                 Key: HDDS-8558
>                 URL: https://issues.apache.org/jira/browse/HDDS-8558
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM HA
>            Reporter: Neil Joshi
>            Assignee: Neil Joshi
>            Priority: Major
>
> With SCM HA, if a new SCM node is added to the quorum and leadership is manually transferred to it, RPC calls to the SCM fail with ServerNotLeaderException.  Failover _never_ resolves and _never_ reaches the newly added node.
>  
> Steps to reproduce:
> i.) start an SCM HA cluster
> ii.) add a new SCM to the quorum
> iii.) manually transfer leadership to the newly added node
> iv.) perform an RPC call to the SCM from a client
> {code:java}
> Transfer leadership successfully to 00bd9308-3467-4229-8587-3b4576834c72.
> bash-4.2$ ozone admin scm roles
> com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdds.ratis.ServerNotLeaderException): Server:f319b7a5-c4b5-48ec-bfef-ed61e6c2e082 is not the leader. Suggested leader is Server:scm4.org:9860.
>     at org.apache.hadoop.hdds.ratis.ServerNotLeaderException.convertToNotLeaderException(ServerNotLeaderException.java:106)
>     at org.apache.hadoop.hdds.scm.ha.RatisUtil.checkRatisException(RatisUtil.java:246)
>     at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:199)
>     at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
>     at java.base/java.security.AccessController.doPrivileged(Native Method)
>     at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)
> , while invoking $Proxy20.submitRequest over nodeId=scm1,nodeAddress=scm1.org/172.25.0.116:9860 after 3 failover attempts. Trying to failover after sleeping for 2000ms. Current retry count: 3.
> com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdds.ratis.ServerNotLeaderException): Server:c8019701-b4ea-42f9-bff5-86087900efe3 is not the leader. Suggested leader is Server:scm4.org:9860.
>     at org.apache.hadoop.hdds.ratis.ServerNotLeaderException.convertToNotLeaderException(ServerNotLeaderException.java:106)
>     at org.apache.hadoop.hdds.scm.ha.RatisUtil.checkRatisException(RatisUtil.java:246)
>     at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:199)
>     at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
>     at java.base/java.security.AccessController.doPrivileged(Native Method)
>     at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)
> , while invoking $Proxy20.submitRequest over nodeId=scm2,nodeAddress=scm2.org/172.25.0.117:9860 after 4 failover attempts. Trying to failover after sleeping for 2000ms. Current retry count: 4.
> com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdds.ratis.ServerNotLeaderException): Server:10496103-c9cf-4275-8b08-c44e08fbc0a6 is not the leader. Suggested leader is Server:scm4.org:9860.
>     at org.apache.hadoop.hdds.ratis.ServerNotLeaderException.convertToNotLeaderException(ServerNotLeaderException.java:106)
>     at org.apache.hadoop.hdds.scm.ha.RatisUtil.checkRatisException(RatisUtil.java:246)
>     at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:199)
>     at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
>     at java.base/java.security.AccessController.doPrivileged(Native Method)
>     at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)
> , while invoking $Proxy20.submitRequest over nodeId=scm3,nodeAddress=scm3.org/172.25.0.118:9860 after 5 failover attempts. Trying to failover after sleeping for 2000ms. Current retry count: 5.{code}


