Posted to issues@ozone.apache.org by "Glen Geng (Jira)" <ji...@apache.org> on 2021/02/02 12:11:00 UTC

[jira] [Updated] (HDDS-4740) admin command should be regardless of leadership of SCM.

     [ https://issues.apache.org/jira/browse/HDDS-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Glen Geng updated HDDS-4740:
----------------------------
    Summary: admin command should be regardless of leadership of SCM.  (was: bin/ozone admin replicationmanager start|stop|status should be replicated over Ratis)

> admin command should be regardless of leadership of SCM.
> --------------------------------------------------------
>
>                 Key: HDDS-4740
>                 URL: https://issues.apache.org/jira/browse/HDDS-4740
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: SCM HA
>    Affects Versions: 1.1.0
>            Reporter: Glen Geng
>            Priority: Major
>
> *Requirement*
> 1, When the admin stops rm (the replication manager), rm in all SCMs should stop, and a re-election should not cause rm to start on the new leader.
> 2, When the admin starts rm, only the rm on an SCM that is the leader and out of safe mode should take effect. If the leader is in safe mode, rm does not take effect even if the admin starts it explicitly.
> 3, An admin rm start/stop does not survive a restart of an SCM instance. After the admin stops rm for the SCM cluster, the admin should watch for SCM crashes, because a restarted SCM does not remember the stop.
>  
> *Status*
> 1, For now, admin rm start/stop creates/destroys the rm thread.
> 2, SCMContainerLocationFailoverProxyProvider has been proxied by FailoverProxyProvider (fpp); it round-robins through the SCMs in ozone.scm.names until the request is handled successfully. On the server side, whenever a client request is received, SCM does the isLeader check first and returns an nle (not-leader exception) to trigger fpp to fail over to the next SCM.
> 3, SCMService decides whether the next iteration of rm takes effect by switching between RUNNING and PAUSING.
>  
> *Solution:*
> When an rm stop/start request is received on the server side, SCM skips the isLeader check and just destroys/creates the rm thread; the client side fakes an exception to trigger fpp to try the next SCM in round-robin order.
> The RUNNING and PAUSING status and rm start/stop can be treated separately. The admin operations and the raft status are requirements in two different dimensions.
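
The request path in the solution can be sketched as a toy model. This is not the Ozone code: the names `AdminFanOutSketch`, `Scm`, `NotLeader`, `sendOrdinary` and `sendRmAdmin` are hypothetical, and the "faked exception" on the client side is simplified here to a plain loop that moves on to the next SCM after each successful call.

```java
import java.util.Arrays;
import java.util.List;

class AdminFanOutSketch {
    // Hypothetical stand-in for the not-leader exception that triggers failover.
    static class NotLeader extends RuntimeException {}

    static class Scm {
        final boolean isLeader;
        boolean rmThreadAlive = true; // assumption: rm thread exists by default

        Scm(boolean isLeader) { this.isLeader = isLeader; }

        // Ordinary request: the leader check comes first.
        String handleOrdinary() {
            if (!isLeader) throw new NotLeader();
            return "ok";
        }

        // rm start/stop: no leader check, just create/destroy the thread.
        void handleRmAdmin(boolean start) { rmThreadAlive = start; }
    }

    // Ordinary path: try SCMs in order until one handles the request.
    // Returns the index of the SCM that handled it, or -1 if none did.
    static int sendOrdinary(List<Scm> scms) {
        for (int i = 0; i < scms.size(); i++) {
            try {
                scms.get(i).handleOrdinary();
                return i;
            } catch (NotLeader e) {
                // fail over to the next SCM
            }
        }
        return -1;
    }

    // Admin path: every SCM applies the request. The client keeps going
    // after each success, as if a failover had been triggered.
    static int sendRmAdmin(List<Scm> scms, boolean start) {
        int applied = 0;
        for (Scm scm : scms) {
            scm.handleRmAdmin(start);
            applied++;
        }
        return applied;
    }

    public static void main(String[] args) {
        List<Scm> scms = Arrays.asList(new Scm(false), new Scm(true), new Scm(false));
        System.out.println(sendOrdinary(scms));       // 1: only the leader answers
        System.out.println(sendRmAdmin(scms, false)); // 3: every SCM applies the stop
    }
}
```

The point of the sketch is only the contrast between the two paths: an ordinary request stops at the first SCM that accepts it, while an rm start/stop visits every SCM listed for the client.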
>  
> *We can achieve the above requirements:*
> 1, When the admin stops rm, rm in all SCMs should stop, and a re-election should not cause rm to start on the new leader.
> Met: admin rm stop destroys the rm thread in all SCMs.
>  
> 2, When the admin starts rm, only the rm on an SCM that is the leader and out of safe mode should take effect. If the leader is in safe mode, rm does not take effect even if the admin starts it explicitly.
> Met: admin rm start creates the rm thread in all SCMs, but SCMStatus (RUNNING or PAUSING) is decided by leadership and safe mode.
>  
> 3, An admin rm start/stop does not survive a restart of an SCM instance. After the admin stops rm for the SCM cluster, the admin should watch for SCM crashes, because a restarted SCM does not remember the stop.
> Met. This is actually a relaxed requirement.
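
The three checks above can be walked through on a single SCM with a small model of the two dimensions. This is a sketch under assumptions, not the actual implementation: `RmGate` and its methods are hypothetical names, only RUNNING and PAUSING come from the description, and the model assumes a freshly started SCM has its rm thread created.

```java
class RmGate {
    enum ServiceStatus { RUNNING, PAUSING }

    // Dimension 1: admin intent; stands in for whether the rm thread exists.
    // Assumption: a freshly started SCM has its rm thread created.
    private boolean rmThreadAlive = true;

    // Dimension 2: driven only by leadership and safe mode.
    private ServiceStatus status = ServiceStatus.PAUSING;

    void adminStart() { rmThreadAlive = true; }

    void adminStop() { rmThreadAlive = false; }

    void onStateChange(boolean isLeader, boolean inSafeMode) {
        status = (isLeader && !inSafeMode)
            ? ServiceStatus.RUNNING
            : ServiceStatus.PAUSING;
    }

    // A restart forgets the admin decision and comes back paused.
    void restart() {
        rmThreadAlive = true;
        status = ServiceStatus.PAUSING;
    }

    // rm does work only when both dimensions allow it.
    boolean takesEffect() {
        return rmThreadAlive && status == ServiceStatus.RUNNING;
    }

    public static void main(String[] args) {
        RmGate scm = new RmGate();

        // Requirement 1: stopped by admin, then elected leader.
        scm.adminStop();
        scm.onStateChange(true, false);
        System.out.println(scm.takesEffect()); // false

        // Requirement 2: started by admin while the leader is in safe mode.
        scm.adminStart();
        scm.onStateChange(true, true);
        System.out.println(scm.takesEffect()); // false
        scm.onStateChange(true, false);
        System.out.println(scm.takesEffect()); // true

        // Requirement 3: a stop is lost when the SCM restarts.
        scm.adminStop();
        scm.restart();
        scm.onStateChange(true, false);
        System.out.println(scm.takesEffect()); // true
    }
}
```

The last case is why the third item asks the admin to watch for crashes: in this model a restarted SCM that later becomes leader runs rm again even though the cluster-wide stop was never revoked.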


