You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2021/06/01 11:05:12 UTC

[GitHub] [ozone] GlenGeng commented on pull request #2295: HDDS-5284. [SCM-HA] SCM start failed with PipelineNotFoundException.

GlenGeng commented on pull request #2295:
URL: https://github.com/apache/ozone/pull/2295#issuecomment-852035731


   Hey guys, in our production cluster, we meet the similar contention problem, and the consequence is worse: all SCMs crashed at the same time, and can not be restarted again.
   
   The pipeline is `fa2906e2-d989-402f-8148-d9bef3ac5258`,
   
   1, pipeline is created at `14:38:06`
   
   > 2021-05-30 14:38:06,651 [RatisPipelineUtilsThread - 0] INFO org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 to datanode:71935d59-fadc-4951-a08f-035450e70737
   2021-05-30 14:38:06,651 [RatisPipelineUtilsThread - 0] INFO org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 to datanode:0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7
   2021-05-30 14:38:06,651 [RatisPipelineUtilsThread - 0] INFO org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 to datanode:160ad09d-73b5-4392-b08b-3b93befd71d2
   2021-05-30 14:38:06,653 [9e77f811-8df6-4a59-9642-0f40d6f01764@group-58D600BFB107-StateMachineUpdater] INFO org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Created pipeline Pipeline[ Id: fa2906e2-d989-402f-8148-d9bef3ac5258, Nodes: 71935d59-fadc-4951-a08f-035450e70737{ip: 11.61.45.54, host: 11.61.45.54, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, STANDALONE=9859], networkLocation: /rack582788, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7{ip: 11.26.17.139, host: 11.26.17.139, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, STANDALONE=9859], networkLocation: /rack561349, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}160ad09d-73b5-4392-b08b-3b93befd71d2{ip: 11.61.45.51, host: 11.61.45.51, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, STANDALONE=9859], networkLocation: /rack5827
 88, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, Type:RATIS, Factor:THREE, State:ALLOCATED, leaderId:, CreationTimestamp2021-05-30T06:38:06.651Z].
   
   
   2, pipeline action and pipeline scrubber is triggered at the same time, at `14:44:06`, and the pipeline is moved to CLOSED state twice, and removed by one of them.
   
   > 2021-05-30 14:44:06,650 [EventQueue-PipelineActionsForPipelineActionHandler] INFO org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler: Received pipeline action CLOSE for PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 from datanode 0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7. Reason : 0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7 is in candidate state for 304440ms
   2021-05-30 14:44:06,654 [RatisPipelineUtilsThread - 0] INFO org.apache.hadoop.hdds.scm.pipeline.PipelineManagerV2Impl: Scrubbing pipeline: id: PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 since it stays at ALLOCATED stage for -6 mins.
   2021-05-30 14:44:06,654 [EventQueue-PipelineActionsForPipelineActionHandler] INFO org.apache.hadoop.hdds.scm.pipeline.PipelineManagerV2Impl: Pipeline Pipeline[ Id: fa2906e2-d989-402f-8148-d9bef3ac5258, Nodes: 71935d59-fadc-4951-a08f-035450e70737{ip: 11.61.45.54, host: 11.61.45.54, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, STANDALONE=9859], networkLocation: /rack582788, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7{ip: 11.26.17.139, host: 11.26.17.139, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, STANDALONE=9859], networkLocation: /rack561349, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}160ad09d-73b5-4392-b08b-3b93befd71d2{ip: 11.61.45.51, host: 11.61.45.51, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, STANDALONE=9859], networkLocation: /rack582788, certSerialId: null, persiste
 dOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, Type:RATIS, Factor:THREE, State:ALLOCATED, leaderId:, CreationTimestamp2021-05-30T06:38:06.651Z] moved to CLOSED state
   2021-05-30 14:44:06,654 [EventQueue-PipelineActionsForPipelineActionHandler] INFO org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Send pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 close command to datanode 71935d59-fadc-4951-a08f-035450e70737
   2021-05-30 14:44:06,654 [EventQueue-PipelineActionsForPipelineActionHandler] INFO org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Send pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 close command to datanode 0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7
   2021-05-30 14:44:06,654 [EventQueue-PipelineActionsForPipelineActionHandler] INFO org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Send pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 close command to datanode 160ad09d-73b5-4392-b08b-3b93befd71d2
   2021-05-30 14:44:06,657 [RatisPipelineUtilsThread - 0] INFO org.apache.hadoop.hdds.scm.pipeline.PipelineManagerV2Impl: Pipeline Pipeline[ Id: fa2906e2-d989-402f-8148-d9bef3ac5258, Nodes: 71935d59-fadc-4951-a08f-035450e70737{ip: 11.61.45.54, host: 11.61.45.54, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, STANDALONE=9859], networkLocation: /rack582788, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7{ip: 11.26.17.139, host: 11.26.17.139, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, STANDALONE=9859], networkLocation: /rack561349, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}160ad09d-73b5-4392-b08b-3b93befd71d2{ip: 11.61.45.51, host: 11.61.45.51, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, STANDALONE=9859], networkLocation: /rack582788, certSerialId: null, persistedOpState: IN_SERVICE, 
 persistedOpStateExpiryEpochSec: 0}, Type:RATIS, Factor:THREE, State:ALLOCATED, leaderId:, CreationTimestamp2021-05-30T06:38:06.651Z] moved to CLOSED state
   2021-05-30 14:44:06,657 [RatisPipelineUtilsThread - 0] INFO org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Send pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 close command to datanode 71935d59-fadc-4951-a08f-035450e70737
   2021-05-30 14:44:06,657 [RatisPipelineUtilsThread - 0] INFO org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Send pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 close command to datanode 0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7
   2021-05-30 14:44:06,657 [RatisPipelineUtilsThread - 0] INFO org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Send pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 close command to datanode 160ad09d-73b5-4392-b08b-3b93befd71d2
   2021-05-30 14:44:06,659 [9e77f811-8df6-4a59-9642-0f40d6f01764@group-58D600BFB107-StateMachineUpdater] INFO org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline Pipeline[ Id: fa2906e2-d989-402f-8148-d9bef3ac5258, Nodes: 71935d59-fadc-4951-a08f-035450e70737{ip: 11.61.45.54, host: 11.61.45.54, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, STANDALONE=9859], networkLocation: /rack582788, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7{ip: 11.26.17.139, host: 11.26.17.139, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, STANDALONE=9859], networkLocation: /rack561349, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}160ad09d-73b5-4392-b08b-3b93befd71d2{ip: 11.61.45.51, host: 11.61.45.51, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, STANDALONE=9859], networkLocation: /rack582788, cert
 SerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, Type:RATIS, Factor:THREE, State:CLOSED, leaderId:, CreationTimestamp2021-05-30T06:38:06.651Z] removed.
   
   
   3, when the second removePipeline is applied by SCMStateMachine, the PipelineNotFoundException crashed the SCM, and SCM refuses to start again.
   
   > 2021-05-30 14:44:06,659 [EventQueue-PipelineActionsForPipelineActionHandler] INFO org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler: Invoking method public abstract void org.apache.hadoop.hdds.scm.pipeline.StateManager.removePipeline(org.apache.hadoop.hdds.protocol.proto.HddsProtos$PipelineID) throws java.io.IOException on target org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl@335aa187, cost 1709.382us
   2021-05-30 14:44:06,662 [9e77f811-8df6-4a59-9642-0f40d6f01764@group-58D600BFB107-StateMachineUpdater] ERROR org.apache.ratis.statemachine.StateMachine: Terminating with exit status 1: PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 not found
   org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException: PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 not found
           at org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.getPipeline(PipelineStateMap.java:131)
           at org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.removePipeline(PipelineStateMap.java:343)
           at org.apache.hadoop.hdds.scm.pipeline.PipelineStateManagerV2Impl.removePipeline(PipelineStateManagerV2Impl.java:228)
           at sun.reflect.GeneratedMethodAccessor132.invoke(Unknown Source)
           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.lang.reflect.Method.invoke(Method.java:498)
           at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.process(SCMStateMachine.java:155)
           at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.applyTransaction(SCMStateMachine.java:132)
           at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1661)
           at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:234)
           at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:179)
           at java.lang.Thread.run(Thread.java:748)
   2021-05-30 14:44:06,664 [shutdown-hook-0] INFO org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SHUTDOWN_MSG: 
   /************************************************************
   SHUTDOWN_MSG: Shutting down StorageContainerManager at tdw-11-0-119-77/11.0.119.77
   ************************************************************/
   
   
   I was planing to create a jira for the contention issue, seems your guys meet it as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org