Posted to issues@ozone.apache.org by "Bharat Viswanadham (Jira)" <ji...@apache.org> on 2021/06/01 10:24:00 UTC

[jira] [Commented] (HDDS-5263) SCM may stay in safe mode forever due to incorrect open pipeline count

    [ https://issues.apache.org/jira/browse/HDDS-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354991#comment-17354991 ] 

Bharat Viswanadham commented on HDDS-5263:
------------------------------------------

SCM went into safe mode after an SCM restart and never came out of it.
|INFO|SCMSafeModeManager|SCM in safe mode. Pipelines with at least one datanode reported count is 1, required at least one datanode reported per pipeline count is 6|

However, at this time, Recon shows 6 open Ratis(3) pipelines and 10 open Ratis(1) pipelines.

 


When SCM started, it had 6 pipelines in the open state; this count is read from the DB:
{code:java}
2021-05-20 18:00:54,613 INFO org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Total pipeline count is 6, pipeline's with at least one datanode reported threshold count is 6
{code}
But once the SCM Ratis server starts, it replays logs from the last applied index recorded in TransactionInfo, and after that I see that all of those pipelines are removed (possibly because of earlier pipeline closes).

Because this SafeMode rule is not successfully validated, SCM never came out of safe mode.

https://issues.apache.org/jira/browse/HDDS-4399 took care to consider open pipelines. That works for non-HA, where updates are written to the DB immediately. In HA, however, we write to the DBTransactionBuffer, so pipelines may already be closed without the close having been applied to the DB. If SCM is restarted at that point, PipelineManager is initialized first, reads from the DB, and gets a pipeline count of 6. SCM then replays its transactions, which removes those pipelines if the pipeline close happened earlier. Because of this, the SCM safe mode rule can never be validated.
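
The ordering problem described above can be modelled with a small standalone sketch. This is not Ozone code: the class StaleThresholdSketch, its method names, and the pipeline ids p1..p6 are all hypothetical, and the safe mode rule is reduced to a simple "reported count >= threshold" comparison. It only illustrates why a threshold taken from the DB before the Ratis log replay can become unreachable.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of the restart ordering; not the actual SCM classes. */
final class StaleThresholdSketch {

    /** Threshold as fixed at startup: the number of open pipelines found in the DB. */
    static int thresholdAtStartup(Set<String> openPipelinesInDb) {
        return openPipelinesInDb.size();
    }

    /**
     * Models the log replay: close transactions that never reached the DB
     * (assumed here to have been held in a buffer) remove their pipelines.
     */
    static Set<String> replay(Set<String> openPipelinesInDb, List<String> unflushedCloses) {
        Set<String> remaining = new LinkedHashSet<>(openPipelinesInDb);
        remaining.removeAll(unflushedCloses);
        return remaining;
    }

    /** Simplified rule: enough pipelines must have reported to meet the threshold. */
    static boolean canExitSafeMode(int reportedPipelines, int threshold) {
        return reportedPipelines >= threshold;
    }

    public static void main(String[] args) {
        Set<String> db = new LinkedHashSet<>(List.of("p1", "p2", "p3", "p4", "p5", "p6"));

        // Step 1: the threshold is computed from the (stale) DB contents.
        int threshold = thresholdAtStartup(db);

        // Step 2: replay removes the pipelines that were closed before the restart.
        Set<String> remaining = replay(db, List.of("p1", "p2", "p3", "p4", "p5", "p6"));

        // Step 3: at most the remaining pipelines can ever report.
        boolean canExit = canExitSafeMode(remaining.size(), threshold);

        System.out.println("threshold=" + threshold
                + " remaining=" + remaining.size()
                + " canExit=" + canExit);
    }
}
```

In this model the threshold stays at 6 while no pipeline from the original set is left to report, so the rule can never pass, which is the same shape as the "count is 1, required ... is 6" message above.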

 
{code:java}
2021-05-20 18:00:55,963 INFO org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline Pipeline[ Id: c79a2082-9cac-4bcf-b303-9beaf84e5998, Nodes: d8f40fc5-ea38-4fd2-a588-4aaf9ac544d6
{ip: xxxx.xx.xx.xxx, host: xxx.com, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
ea53e24e-3d10-4d41-93c9-a568a1627cca
{ip: xxx.xx.xx.xxx, host: xxx.com, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
9416da18-1fc4-4cb3-8200-6a71698c808e
{ip: xxx.xx.xx.xxx, host: xxx.com, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
, ReplicationConfig: RATIS/THREE, State:CLOSED, leaderId:9416da18-1fc4-4cb3-8200-6a71698c808e, CreationTimestamp2021-05-20T18:00:54.497Z] removed.

2021-05-20 18:00:55,970 INFO org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline Pipeline[ Id: e1b21d65-e80f-4ade-8e78-9bd956183a7c, Nodes: 8fd99eff-7f50-4b56-ad03-1e796030268d
:
:
{ip: xxx.xx.xx.xxx, host: xxx.com, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
, ReplicationConfig: RATIS/THREE, State:CLOSED, leaderId:ea53e24e-3d10-4d41-93c9-a568a1627cca, CreationTimestamp2021-05-20T18:00:54.497Z] removed.
{code}


> SCM may stay in safe mode forever due to incorrect open pipeline count
> ----------------------------------------------------------------------
>
>                 Key: HDDS-5263
>                 URL: https://issues.apache.org/jira/browse/HDDS-5263
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM HA
>            Reporter: George Huang
>            Assignee: Bharat Viswanadham
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: SCM HA SafeMode .pdf
>
>
> After an unclean shutdown, SCM may never come out of the safe mode.
> Attached a document to explain the problem and the proposal.


