Posted to issues@ozone.apache.org by "Glen Geng (Jira)" <ji...@apache.org> on 2021/03/23 06:40:00 UTC

[jira] [Commented] (HDDS-5015) SequenceID is not consistent when setup a multi node SCM HA cluster.

    [ https://issues.apache.org/jira/browse/HDDS-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17306806#comment-17306806 ] 

Glen Geng commented on HDDS-5015:
---------------------------------

*The root cause is:*

When localId is not set in the sequenceId table, SCM initializes it to UniqueId.next(). When 3 SCMs are set up from scratch, each of them individually sets its localId to its own UniqueId.next(), so the sequenceId diverges from the very beginning.
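
To illustrate the divergence, here is a minimal simulation. It is a sketch only, not Ozone code: FakeUniqueId, init_local_id, the bit layout and the clock values are all hypothetical, chosen just to mimic a generator that depends on each machine's local clock.

```python
import itertools


class FakeUniqueId:
    """Hypothetical stand-in for a timestamp-based ID generator (not Ozone's UniqueId)."""

    def __init__(self, clock_millis):
        self._clock_millis = clock_millis
        self._counter = itertools.count()

    def next(self):
        # Assumed layout: local clock in the high bits, a counter in the low bits.
        return (self._clock_millis << 16) | next(self._counter)


def init_local_id(sequence_id_table, id_generator):
    """Seed localId only when the table has no value yet."""
    if "localId" not in sequence_id_table:
        sequence_id_table["localId"] = id_generator.next()
    return sequence_id_table["localId"]


# Three freshly installed SCMs: empty tables, slightly different local clocks.
clocks = {
    "scm1": 1_616_000_000_000,
    "scm2": 1_616_000_000_173,
    "scm3": 1_616_000_000_184,
}
tables = {name: {} for name in clocks}
local_ids = {
    name: init_local_id(tables[name], FakeUniqueId(millis))
    for name, millis in clocks.items()
}

diverged = len(set(local_ids.values())) > 1
print(local_ids, "diverged:", diverged)
```

Because every node seeds the value on its own, nothing ever brings the three tables back in line.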

*Short term solution is:*

Make the 3 SCMs agree on the localId.
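
One way to picture such an agreement, as a sketch only: a single node (assumed here to be the leader) proposes the value and every SCM applies that same value. bootstrap_local_id and the propose callback are hypothetical names; the loop stands in for whatever replication mechanism the real fix uses, and only the from-scratch case is covered.

```python
def bootstrap_local_id(tables, leader, propose):
    """Have the leader choose localId once, then apply it on every SCM.

    tables  : dict mapping SCM name -> its sequenceId table (a dict)
    leader  : name of the SCM whose choice wins (assumption)
    propose : callable returning a fresh ID, called at most once
    """
    agreed = tables[leader].get("localId")
    if agreed is None:
        agreed = propose()
    for table in tables.values():
        # Only fills empty tables; an SCM that already diverged is left as-is.
        table.setdefault("localId", agreed)
    return agreed


# Three empty tables; the proposed value is scm1's localId from the issue description.
tables = {"scm1": {}, "scm2": {}, "scm3": {}}
agreed = bootstrap_local_id(tables, "scm1", propose=lambda: 105898712280731336)
print({name: table["localId"] for name, table in tables.items()})
```

The point is that the ID generator runs on one node only, so the starting value no longer depends on each machine's clock.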

*Long term solution is:*

The short term solution comes first; the long term solution will be HDDS-5016. During bootstrap, an SCM always downloads a checkpoint from the leader SCM and replaces its own scm.db with the leader's.

 

*The short term solution is safe:*

Upgrade in-memory SCM to bypass-ratis SCM: not affected.

Upgrade in-memory SCM to single-node SCM: not affected.

Upgrade in-memory SCM to three-node SCM cluster: not supported yet.

Set up a bypass-ratis SCM: not affected.

Set up a three-node SCM cluster from scratch: fixed by the short term solution.

 

> SequenceID is not consistent when setup a multi node SCM HA cluster.
> --------------------------------------------------------------------
>
>                 Key: HDDS-5015
>                 URL: https://issues.apache.org/jira/browse/HDDS-5015
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: SCM HA
>            Reporter: Xu Shao Hong
>            Assignee: Glen Geng
>            Priority: Major
>
> We set up a three-node SCM HA cluster for testing.
> Using the ozone debug ldb tool, we found that the sequenceIDs are not the same across the three SCMs. The cause is localID, which is initialized based on each machine's own timestamp.
> The ldb results fetched from scm.db on the 3 SCMs:
> *scm1*
> 17000 END 
>  8000 END 
>  105898712280731336 END
> *scm2*
> 17000 END
>  8000 END
>  105898723592162080 END
> *scm3*
> 17000 END
>  8000 END
>  105898724336720504 END


