Posted to issues@ozone.apache.org by "Hanisha Koneru (Jira)" <ji...@apache.org> on 2022/04/04 22:40:00 UTC

[jira] [Updated] (HDDS-6447) Refine SCM handling of unhealthy container replicas

     [ https://issues.apache.org/jira/browse/HDDS-6447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hanisha Koneru updated HDDS-6447:
---------------------------------
        Parent: HDDS-6548
    Issue Type: Sub-task  (was: Bug)

> Refine SCM handling of unhealthy container replicas
> ---------------------------------------------------
>
>                 Key: HDDS-6447
>                 URL: https://issues.apache.org/jira/browse/HDDS-6447
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: SCM
>            Reporter: Hanisha Koneru
>            Assignee: Hanisha Koneru
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, containers are marked UNHEALTHY for one of the following reasons:
>  # If an operation fails on an open/closing container, it is marked unhealthy so that subsequent write transactions also fail.
>  # If Container Scrubber is enabled and ContainerMetadataScanner detects an error during KeyValueContainerCheck#fastCheck().
>  ** Metadata path or Chunks path is not accessible as a directory
>  ** Container checksum verification fails
>  ** On-disk Container Yaml data does not match in-memory container data (ContainerType, ContainerID, Container DBType, Metadata Path)
>  # If Container Scrubber is enabled and ContainerDataScanner (runs only on closed and quasi-closed containers) detects any block with a missing or corrupted chunk file.
> If a container in “open” state in SCM is marked unhealthy (in the container report), SCM asks the DNs to close the container. But for a “closing” container with an “unhealthy” replica, SCM leaves the container replica as is.
> If ReplicationManager does not find a healthy replica for a container, it does not replicate that container. So if there is only 1 replica of a container and it is unhealthy, SCM will never replicate it and there is potential for data loss if that single replica is lost for any reason (for example: disk failure).
> If there is a _Quasi-Closed_ replica and an _Unhealthy_ container, SCM will delete the unhealthy container. In this scenario, SCM should not delete the unhealthy container if it can be recovered, as it is possible that the unhealthy container is ahead of the quasi-closed container.
> SCM should be more conservative with deleting unhealthy containers as they could possibly be recovered. This Jira proposes to let SCM replicate an unhealthy container if there is no other replica. Also, if there is only a quasi-closed replica and an unhealthy replica, SCM should not delete the unhealthy replica.
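
The proposal above can be read as a small decision rule. The following is a minimal sketch of that rule only, not SCM's or ReplicationManager's actual code: the class, enum, and method names are hypothetical, "no other replica" is interpreted as "every known replica is unhealthy", and deleting an unhealthy replica when a CLOSED replica exists is an assumption that the issue description does not state.

```java
import java.util.List;

// Hypothetical sketch; these names are illustrative and are not Ozone classes.
final class UnhealthyReplicaPolicy {

    // Replica states mentioned in the issue description.
    enum ReplicaState { OPEN, CLOSING, QUASI_CLOSED, CLOSED, UNHEALTHY }

    // What to do with the unhealthy replica(s) of one container.
    enum Action { NONE, REPLICATE_UNHEALTHY, KEEP_UNHEALTHY, DELETE_UNHEALTHY }

    static Action decide(List<ReplicaState> replicas) {
        if (!replicas.contains(ReplicaState.UNHEALTHY)) {
            return Action.NONE; // nothing unhealthy to handle
        }
        boolean allUnhealthy = replicas.stream()
                .allMatch(s -> s == ReplicaState.UNHEALTHY);
        if (allUnhealthy) {
            // Proposed: with no other replica, replicate the unhealthy one
            // so that a single disk failure does not lose the data.
            return Action.REPLICATE_UNHEALTHY;
        }
        if (replicas.contains(ReplicaState.CLOSED)) {
            // Assumption (not stated in the issue): a CLOSED replica is
            // authoritative, so the unhealthy replica may be removed.
            return Action.DELETE_UNHEALTHY;
        }
        // Proposed: with only quasi-closed (or still closing) replicas, keep
        // the unhealthy replica, since it may be ahead of the others.
        return Action.KEEP_UNHEALTHY;
    }

    public static void main(String[] args) {
        System.out.println(decide(List.of(ReplicaState.UNHEALTHY)));
        System.out.println(decide(
                List.of(ReplicaState.QUASI_CLOSED, ReplicaState.UNHEALTHY)));
    }
}
```

In this sketch the conservative outcome (KEEP_UNHEALTHY) is the fallback, so any combination not explicitly listed errs on the side of preserving data.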



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
