You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by "Sammi Chen (Jira)" <ji...@apache.org> on 2023/08/30 14:50:00 UTC

[jira] [Comment Edited] (HDDS-9156) cert-rotation acceptance test is failing randomly

    [ https://issues.apache.org/jira/browse/HDDS-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760475#comment-17760475 ] 

Sammi Chen edited comment on HDDS-9156 at 8/30/23 2:49 PM:
-----------------------------------------------------------

Here is how the scm4.org check failed.   On scm4.org side, it started at 11:11:50, 


{code:java}
2023-08-14 11:11:50,970 [main] INFO server.StorageContainerManagerStarter: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting StorageContainerManager
STARTUP_MSG:   host = scm4.org/172.25.0.120
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 1.4.0-SNAPSHOT
{code}

It joined the SCM HA raft ring at round 11:12:08


{code:java}
2023-08-14 11:12:08,527 [grpc-default-executor-0] INFO impl.SnapshotInstallationHandler: 39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: receive installSnapshot: 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c->39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f#0-t3,notify:(t:1, i:0)
2023-08-14 11:12:08,661 [grpc-default-executor-0] INFO ha.SCMStateMachine: leader changed, yet current SCM is still follower.
2023-08-14 11:12:08,662 [grpc-default-executor-0] INFO server.RaftServer$Division: 39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: change Leader from null to 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c at term 3 for installSnapshot, leader elected after 11538ms
2023-08-14 11:12:08,745 [grpc-default-executor-0] INFO impl.SnapshotInstallationHandler: 39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: Received notification to install snapshot at index 0
2023-08-14 11:12:08,755 [grpc-default-executor-0] INFO impl.SnapshotInstallationHandler: 39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: InstallSnapshot notification result: ALREADY_INSTALLED, current snapshot index: -1
{code}

It received the new set configuration ratis command at 11:12:11, this new configuration includes scm4.org


{code:java}
2023-08-14 11:12:11,541 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f-server-thread1] INFO server.RaftServer$Division: 39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: set configuration 407: peers:[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER, 85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER, 24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER, 39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|admin:|client:|dataStream:|priority:0|startupRole:FOLLOWER]|listeners:[], old=null
{code}

Then at 11:12:16, it received the rotation preparation command for the root certificate 4,


{code:java}
2023-08-14 11:12:16,034 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO security.RootCARotationHandlerImpl: Received rotation prepare command of root certificate 4
....
2023-08-14 11:12:16,052 [RootCARotationManager] INFO security.RootCARotationManager: SubCARotationPrepareTask[rootCertId = 4] - started.
{code}

And later, it received these four requests,


{code:java}
2023-08-14 11:12:16,063 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO security.RootCARotationHandlerImpl: Received rotation prepare ack of root certificate 4 from scm 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c
2023-08-14 11:12:16,067 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO security.RootCARotationHandlerImpl: Received rotation prepare ack of root certificate 4 from scm 85b682d7-8fe7-4853-84d4-764f5b803819
2023-08-14 11:12:16,094 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO security.RootCARotationHandlerImpl: Received rotation prepare ack of root certificate 4 from scm 24a88f8e-8ac3-49fd-946b-8313e286b62e


2023-08-14 11:12:16,098 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO security.RootCARotationHandlerImpl: Received rotation commit command of root certificate 4
{code}

And right after that, it failed during rotation commit command handling and stopped itself. 

{code:java}
2023-08-14 11:12:19,651 [RootCARotationManager] INFO security.RootCARotationManager: SubCARotationPrepareTask[rootCertId = 4] - scm key generated.
2023-08-14 11:12:19,653 [RootCARotationManager] INFO security.RootCARotationManager: In-progress root CA directory /data/metadata/scm/ca-next-progress is deleted for 'RootCARotationManager is interrupted'
2023-08-14 11:12:19,654 [RootCARotationManager] INFO client.SCMCertificateClient: Creating csr for SCM->hostName:scm4.org,scmId:39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f,clusterId:CID-516c56b1-3e02-48ae-a9af-e8c128b9df30,subject:scm-sub-827283986661@scm4.org
2023-08-14 11:12:19,694 [RootCARotationManager] INFO ozone.OzoneSecurityUtil: Adding ip:172.25.0.120,host:scm4.org
2023-08-14 11:12:19,697 [RootCARotationManager] INFO ozone.OzoneSecurityUtil: ip:127.0.0.1 not returned.
2023-08-14 11:12:19,868 [RootCARotationManager] ERROR client.SCMCertificateClient: Error while fetching/storing SCM signed certificate.
java.io.InterruptedIOException: Call interrupted
	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1561)
	at org.apache.hadoop.ipc.Client.call(Client.java:1513)
	at org.apache.hadoop.ipc.Client.call(Client.java:1410)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:250)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:132)
	at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
	at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
	at org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.submitRequest(SCMSecurityProtocolClientSideTranslatorPB.java:102)
	at org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.getSCMCertChain(SCMSecurityProtocolClientSideTranslatorPB.java:233)
	at org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient.signAndStoreCertificate(SCMCertificateClient.java:208)
	at org.apache.hadoop.hdds.scm.security.RootCARotationManager$SubCARotationPrepareTask.run(RootCARotationManager.java:587)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-14 11:12:16,102 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] ERROR security.RootCARotationHandlerImpl: Failed to move /data/metadata/scm/sub-ca-next to /data/metadata/scm/sub-ca
java.nio.file.NoSuchFileException: /data/metadata/scm/sub-ca-next -> /data/metadata/scm/sub-ca
	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
	at java.base/sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:417)
	at java.base/sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:267)
	at java.base/java.nio.file.Files.move(Files.java:1422)
	at org.apache.hadoop.hdds.scm.security.RootCARotationHandlerImpl.rotationCommit(RootCARotationHandlerImpl.java:140)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.process(SCMStateMachine.java:188)
	at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.applyTransaction(SCMStateMachine.java:148)
	at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1777)
	at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:242)
	at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:184)
	at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-14 11:12:16,103 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO server.StorageContainerManager: Container Balancer is not running.
2023-08-14 11:12:16,109 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO server.StorageContainerManager: Stopping Replication Manager Service.
2023-08-14 11:12:16,109 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO replication.ReplicationManager: Stopping Replication Monitor Thread.
2023-08-14 11:12:16,109 [Under Replicated Processor] WARN replication.UnhealthyReplicationProcessor: Under Replicated Processor interrupted. Exiting...
2023-08-14 11:12:16,110 [Over Replicated Processor] WARN replication.UnhealthyReplicationProcessor: Over Replicated Processor interrupted. Exiting...
2023-08-14 11:12:16,112 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO server.StorageContainerManager: Stopping the Datanode Admin Monitor.
2023-08-14 11:12:16,114 [ReplicationMonitor] INFO replication.ReplicationManager: Replication Monitor Thread is stopped
2023-08-14 11:12:16,115 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO server.StorageContainerManager: Stopping datanode service RPC server
2023-08-14 11:12:16,115 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO server.SCMDatanodeProtocolServer: Stopping the RPC server for DataNodes

{code}

scm2.org was the leader when scm4.org was starting. Here is some logs from scm2, it accepted scm4.org, and sent out the set configuration request on 11:12:05. Since scm4.org was added, there were 4 peers in the SCM HA raft ring now. So in next root ca rotation, scm2 should expect 4 acks of rotation preparation from all SCMs(including itself).

{code:java}
2023-08-14 11:12:05,833 [IPC Server handler 76 on default port 9863] INFO ha.SCMRatisServerImpl: 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c: Submitting SetConfiguration request to Ratis server with new SCM peers list: [61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER, 85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER, 24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER, 39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|priority:0|startupRole:FOLLOWER]
2023-08-14 11:12:05,834 [IPC Server handler 76 on default port 9863] INFO server.RaftServer$Division: 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30: receive setConfiguration SetConfigurationRequest:client-E2E7B5FA32F0->61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30, cid=21, seq=0, RW, null, SET_UNCONDITIONALLY, servers:[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER, 85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER, 24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER, 39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|priority:0|startupRole:FOLLOWER], listeners:[]
2023-08-14 11:12:05,834 [IPC Server handler 76 on default port 9863] INFO server.RaftServer$Division: 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30-LeaderStateImpl: startSetConfiguration SetConfigurationRequest:client-E2E7B5FA32F0->61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30, cid=21, seq=0, RW, null, SET_UNCONDITIONALLY, servers:[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER, 85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER, 24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER, 39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|priority:0|startupRole:FOLLOWER], listeners:[]
{code}

But on 11:12:10, it sent out the rotation commit request after received only 3 SCM's acks for rotation preparation.  

{code:java}
2023-08-14 11:12:10,373 [RootCARotationManager-Active] INFO security.RootCARotationManager: numFromHADetails 3, numFromRatisServer 3
2023-08-14 11:12:10,382 [61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30-StateMachineUpdater] INFO security.RootCARotationHandlerImpl: Received rotation commit command of root certificate 4
{code}

In the current implementation, RaftServer.Division.getRaftConf().getCurrentPeers().size() is called to get the current peers count in the ratis group. It looks like ratis doesn't count scm4.org in this failed acceptance test run. 



 




was (Author: sammi):
Here is how the scm4.org check failed.   On scm4.org side, it started at 11:11:50, 


{code:java}
2023-08-14 11:11:50,970 [main] INFO server.StorageContainerManagerStarter: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting StorageContainerManager
STARTUP_MSG:   host = scm4.org/172.25.0.120
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 1.4.0-SNAPSHOT
{code}

It joined the SCM HA raft ring at round 11:12:08


{code:java}
2023-08-14 11:12:08,527 [grpc-default-executor-0] INFO impl.SnapshotInstallationHandler: 39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: receive installSnapshot: 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c->39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f#0-t3,notify:(t:1, i:0)
2023-08-14 11:12:08,661 [grpc-default-executor-0] INFO ha.SCMStateMachine: leader changed, yet current SCM is still follower.
2023-08-14 11:12:08,662 [grpc-default-executor-0] INFO server.RaftServer$Division: 39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: change Leader from null to 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c at term 3 for installSnapshot, leader elected after 11538ms
2023-08-14 11:12:08,745 [grpc-default-executor-0] INFO impl.SnapshotInstallationHandler: 39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: Received notification to install snapshot at index 0
2023-08-14 11:12:08,755 [grpc-default-executor-0] INFO impl.SnapshotInstallationHandler: 39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: InstallSnapshot notification result: ALREADY_INSTALLED, current snapshot index: -1
{code}

It received the new set configuration ratis command at 11:12:11, this new configuration includes scm4.org


{code:java}
2023-08-14 11:12:11,541 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f-server-thread1] INFO server.RaftServer$Division: 39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: set configuration 407: peers:[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER, 85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER, 24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER, 39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|admin:|client:|dataStream:|priority:0|startupRole:FOLLOWER]|listeners:[], old=null
{code}

Then at 11:12:16, it received the rotation preparation command for the root certificate 4,


{code:java}
2023-08-14 11:12:16,034 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO security.RootCARotationHandlerImpl: Received rotation prepare command of root certificate 4
....
2023-08-14 11:12:16,052 [RootCARotationManager] INFO security.RootCARotationManager: SubCARotationPrepareTask[rootCertId = 4] - started.
{code}

And later, it received these four requests,


{code:java}
2023-08-14 11:12:16,063 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO security.RootCARotationHandlerImpl: Received rotation prepare ack of root certificate 4 from scm 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c
2023-08-14 11:12:16,067 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO security.RootCARotationHandlerImpl: Received rotation prepare ack of root certificate 4 from scm 85b682d7-8fe7-4853-84d4-764f5b803819
2023-08-14 11:12:16,094 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO security.RootCARotationHandlerImpl: Received rotation prepare ack of root certificate 4 from scm 24a88f8e-8ac3-49fd-946b-8313e286b62e


2023-08-14 11:12:16,098 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO security.RootCARotationHandlerImpl: Received rotation commit command of root certificate 4
{code}

And right after that, it failed during rotation commit command handling and stopped itself. 

{code:java}
2023-08-14 11:12:19,651 [RootCARotationManager] INFO security.RootCARotationManager: SubCARotationPrepareTask[rootCertId = 4] - scm key generated.
2023-08-14 11:12:19,653 [RootCARotationManager] INFO security.RootCARotationManager: In-progress root CA directory /data/metadata/scm/ca-next-progress is deleted for 'RootCARotationManager is interrupted'
2023-08-14 11:12:19,654 [RootCARotationManager] INFO client.SCMCertificateClient: Creating csr for SCM->hostName:scm4.org,scmId:39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f,clusterId:CID-516c56b1-3e02-48ae-a9af-e8c128b9df30,subject:scm-sub-827283986661@scm4.org
2023-08-14 11:12:19,694 [RootCARotationManager] INFO ozone.OzoneSecurityUtil: Adding ip:172.25.0.120,host:scm4.org
2023-08-14 11:12:19,697 [RootCARotationManager] INFO ozone.OzoneSecurityUtil: ip:127.0.0.1 not returned.
2023-08-14 11:12:19,868 [RootCARotationManager] ERROR client.SCMCertificateClient: Error while fetching/storing SCM signed certificate.
java.io.InterruptedIOException: Call interrupted
	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1561)
	at org.apache.hadoop.ipc.Client.call(Client.java:1513)
	at org.apache.hadoop.ipc.Client.call(Client.java:1410)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:250)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:132)
	at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
	at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
	at org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.submitRequest(SCMSecurityProtocolClientSideTranslatorPB.java:102)
	at org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.getSCMCertChain(SCMSecurityProtocolClientSideTranslatorPB.java:233)
	at org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient.signAndStoreCertificate(SCMCertificateClient.java:208)
	at org.apache.hadoop.hdds.scm.security.RootCARotationManager$SubCARotationPrepareTask.run(RootCARotationManager.java:587)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-14 11:12:16,102 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] ERROR security.RootCARotationHandlerImpl: Failed to move /data/metadata/scm/sub-ca-next to /data/metadata/scm/sub-ca
java.nio.file.NoSuchFileException: /data/metadata/scm/sub-ca-next -> /data/metadata/scm/sub-ca
	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
	at java.base/sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:417)
	at java.base/sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:267)
	at java.base/java.nio.file.Files.move(Files.java:1422)
	at org.apache.hadoop.hdds.scm.security.RootCARotationHandlerImpl.rotationCommit(RootCARotationHandlerImpl.java:140)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.process(SCMStateMachine.java:188)
	at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.applyTransaction(SCMStateMachine.java:148)
	at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1777)
	at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:242)
	at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:184)
	at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-14 11:12:16,103 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO server.StorageContainerManager: Container Balancer is not running.
2023-08-14 11:12:16,109 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO server.StorageContainerManager: Stopping Replication Manager Service.
2023-08-14 11:12:16,109 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO replication.ReplicationManager: Stopping Replication Monitor Thread.
2023-08-14 11:12:16,109 [Under Replicated Processor] WARN replication.UnhealthyReplicationProcessor: Under Replicated Processor interrupted. Exiting...
2023-08-14 11:12:16,110 [Over Replicated Processor] WARN replication.UnhealthyReplicationProcessor: Over Replicated Processor interrupted. Exiting...
2023-08-14 11:12:16,112 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO server.StorageContainerManager: Stopping the Datanode Admin Monitor.
2023-08-14 11:12:16,114 [ReplicationMonitor] INFO replication.ReplicationManager: Replication Monitor Thread is stopped
2023-08-14 11:12:16,115 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO server.StorageContainerManager: Stopping datanode service RPC server
2023-08-14 11:12:16,115 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] INFO server.SCMDatanodeProtocolServer: Stopping the RPC server for DataNodes

{code}

scm2.org was the leader when scm4.org was starting. Here is some logs from scm2, it accepted scm4.org, and sent out the set configuration request on 11:12:05. Since scm4.org was added, there were 4 peers in the SCM HA raft ring now. So in next root ca rotation, scm2 should expect 4 acks of rotation preparation from all SCMs(including itself).

{code:java}
2023-08-14 11:12:05,833 [IPC Server handler 76 on default port 9863] INFO ha.SCMRatisServerImpl: 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c: Submitting SetConfiguration request to Ratis server with new SCM peers list: [61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER, 85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER, 24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER, 39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|priority:0|startupRole:FOLLOWER]
2023-08-14 11:12:05,834 [IPC Server handler 76 on default port 9863] INFO server.RaftServer$Division: 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30: receive setConfiguration SetConfigurationRequest:client-E2E7B5FA32F0->61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30, cid=21, seq=0, RW, null, SET_UNCONDITIONALLY, servers:[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER, 85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER, 24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER, 39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|priority:0|startupRole:FOLLOWER], listeners:[]
2023-08-14 11:12:05,834 [IPC Server handler 76 on default port 9863] INFO server.RaftServer$Division: 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30-LeaderStateImpl: startSetConfiguration SetConfigurationRequest:client-E2E7B5FA32F0->61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30, cid=21, seq=0, RW, null, SET_UNCONDITIONALLY, servers:[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER, 85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER, 24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER, 39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|priority:0|startupRole:FOLLOWER], listeners:[]
{code}

But on 11:12:10, it sent out the rotation commit request after received only 3 SCM's acks for rotation preparation.  

{code:java}
2023-08-14 11:12:10,373 [RootCARotationManager-Active] INFO security.RootCARotationManager: numFromHADetails 3, numFromRatisServer 3
2023-08-14 11:12:10,382 [61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30-StateMachineUpdater] INFO security.RootCARotationHandlerImpl: Received rotation commit command of root certificate 4
{code}

In the current implementation, RaftServer.Division.getRaftConf().getCurrentPeers().size() is called to get the current peers count in the ratis group. It looks like ratis doesn't count scm4.org in.  



 



> cert-rotation acceptance test is failing randomly
> -------------------------------------------------
>
>                 Key: HDDS-9156
>                 URL: https://issues.apache.org/jira/browse/HDDS-9156
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Certificates, test
>            Reporter: Siyao Meng
>            Assignee: Sammi Chen
>            Priority: Major
>              Labels: pull-request-available
>
> -Despite there being no "FAIL"ed test items:-
> https://github.com/apache/ozone/actions/runs/5825414635/job/15797755458?pr=5172#logs
> ah. {{ozone admin scm roles}} command timed out:
> {code}
> ozone admin scm roles | grep scm4.org hasn't succeed yet
> Timed out waiting on ozone admin scm roles | grep scm4.org to be successful
> {code}
> cc [~pifta]
> Another failure:
> https://github.com/apache/ozone/actions/runs/5827776199/job/15804891746
> {code}
> Creating ozonesecure-ha_scm4.org_1 ... done
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is available on scm4.org
> ==============================================================================
> Kinit :: Kinit test user                                                      
> ==============================================================================
> Kinit                                                                 | PASS |
> ------------------------------------------------------------------------------
> Kinit :: Kinit test user                                              | PASS |
> 1 test, 1 passed, 0 failed
> ==============================================================================
> Output:  /tmp/smoketest/ozonesecure-ha/result/robot-5.xml
> ozone admin scm roles | grep scm4.org hasn't succeed yet
> ozone admin scm roles | grep scm4.org hasn't succeed yet
> ozone admin scm roles | grep scm4.org hasn't succeed yet
> ozone admin scm roles | grep scm4.org hasn't succeed yet
> ozone admin scm roles | grep scm4.org hasn't succeed yet
> Timed out waiting on ozone admin scm roles | grep scm4.org to be successful
> Stopping ozonesecure-ha_scm4.org_1  ... 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org