Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2021/06/02 09:54:23 UTC

[GitHub] [ozone] bharatviswa504 edited a comment on pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

bharatviswa504 edited a comment on pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#issuecomment-852885323


   With this approach we see an issue:
   
   
   Before the restart, 2 pipelines were closed; let's say SCM removed them and created a new pipeline. The SCM pipeline table still has the 2 old pipelines, because the remove and create operations were not persisted to the DB when SCM was force killed.
   
   Because we call refresh and validate the pipeline rules on each applyTransaction, the safe mode pipeline rules are validated right after the 2nd pipeline remove transaction, and we do not wait for all the pending transactions. In this case SCM comes out of safe mode early.
   
   As a result, reads and writes might fail even after SCM is out of safe mode.
   ```
   2021-06-02 05:51:04,208 INFO org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Refreshed total pipeline count is 1, healthy pipeline threshold count is 1
   2021-06-02 05:51:04,208 INFO org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Total pipeline count is 1, pipeline's with at least one datanode reported threshold count is 1
   2021-06-02 05:51:04,209 INFO org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Refreshed total pipeline count is 0, healthy pipeline threshold count is 0
   2021-06-02 05:51:04,209 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: HealthyPipelineSafeModeRule rule is successfully validated
   2021-06-02 05:51:04,209 INFO org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Total pipeline count is 0, pipeline's with at least one datanode reported threshold count is 0
   ```
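
The sequence in the log above can be reproduced with a toy model. This is only a sketch, not Ozone code: the class, the `Txn` type, and the method names are hypothetical, and the rule is simplified to "every pipeline in the table must have been reported by a datanode" (threshold equal to the total count), which is an assumption made to keep the example small.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only, not Ozone code: every class and method name here is hypothetical.
final class SafeModeReplaySketch {

  /** One replayed log entry: creates or removes a pipeline. */
  static final class Txn {
    final boolean create;
    final String pipelineId;

    Txn(boolean create, String pipelineId) {
      this.create = create;
      this.pipelineId = pipelineId;
    }
  }

  /** Assumed rule: all pipelines in the table must be reported (threshold = total count). */
  static boolean rulePasses(Set<String> table, Set<String> reported) {
    int healthy = 0;
    for (String id : table) {
      if (reported.contains(id)) {
        healthy++;
      }
    }
    return healthy >= table.size(); // an empty table passes trivially: 0 >= 0
  }

  /**
   * Replays the log on top of the persisted table. Returns how many
   * transactions had been applied when the rule first passed, or -1 if never.
   */
  static int txnsAppliedAtExit(Set<String> persisted, List<Txn> log,
      Set<String> reported, boolean validateEachTxn) {
    Set<String> table = new HashSet<>(persisted);
    int applied = 0;
    for (Txn t : log) {
      if (t.create) {
        table.add(t.pipelineId);
      } else {
        table.remove(t.pipelineId);
      }
      applied++;
      if (validateEachTxn && rulePasses(table, reported)) {
        return applied;
      }
    }
    return rulePasses(table, reported) ? applied : -1;
  }

  /** The scenario from the comment: remove 2 old pipelines, then create a new one. */
  static List<Txn> exampleLog() {
    return Arrays.asList(new Txn(false, "p1"), new Txn(false, "p2"), new Txn(true, "p3"));
  }

  public static void main(String[] args) {
    Set<String> persisted = new HashSet<>(Arrays.asList("p1", "p2"));
    Set<String> noReports = Collections.emptySet();
    // Validating per transaction: the rule passes after the 2nd remove, before p3 exists.
    System.out.println(txnsAppliedAtExit(persisted, exampleLog(), noReports, true));  // prints 2
    // Validating once after the full replay: p3 is unreported, so safe mode is kept.
    System.out.println(txnsAppliedAtExit(persisted, exampleLog(), noReports, false)); // prints -1
  }
}
```

With per-transaction validation the table is briefly empty, so a threshold of 0 is met with 0 reports, which matches the "total pipeline count is 0, threshold count is 0" lines in the log.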
   
   After an offline discussion with @bshashikant:
   1. We think we should refresh the SCM safe mode rules once, after leader ready, on all SCMs.
   2. We should start the DN RPC port only after leader ready, so that SCM does not come out of safe mode early based on a DB that is not up to date.
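
The two points above could be sketched roughly as a one-shot gate. This is an illustration under assumptions, not the change made in the PR: `LeaderReadyGateSketch`, its two `Runnable` hooks, and the callback names are all hypothetical stand-ins for whatever SCM actually wires up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch only: the class, hooks, and method names are hypothetical, not Ozone APIs.
final class LeaderReadyGateSketch {
  private final AtomicBoolean leaderReady = new AtomicBoolean(false);
  private final Runnable refreshSafeModeRules;
  private final Runnable startDatanodeRpc;

  LeaderReadyGateSketch(Runnable refreshSafeModeRules, Runnable startDatanodeRpc) {
    this.refreshSafeModeRules = refreshSafeModeRules;
    this.startDatanodeRpc = startDatanodeRpc;
  }

  /** Called for every replayed transaction; deliberately leaves safe mode rules alone. */
  void onApplyTransaction() {
    // Intentionally empty: no refresh or validation while the log is still replaying.
  }

  /** Called when the leader is ready; the hooks run at most once. */
  void onLeaderReady() {
    if (leaderReady.compareAndSet(false, true)) {
      refreshSafeModeRules.run(); // point 1: refresh once, against the up-to-date table
      startDatanodeRpc.run();     // point 2: only now accept datanode reports
    }
  }

  /** Small usage example: records the order in which the hooks fire. */
  static List<String> demo() {
    List<String> events = new ArrayList<>();
    LeaderReadyGateSketch gate = new LeaderReadyGateSketch(
        () -> events.add("refresh"),
        () -> events.add("startDnRpc"));
    gate.onApplyTransaction();
    gate.onApplyTransaction();
    gate.onApplyTransaction();
    gate.onLeaderReady();
    gate.onLeaderReady(); // a repeated notification is a no-op
    return events;
  }
}
```

Refreshing before the datanode RPC port opens means no report can be counted against the stale pipeline table.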
   

