You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by "siddhantsangwan (via GitHub)" <gi...@apache.org> on 2023/02/10 13:43:25 UTC

[GitHub] [ozone] siddhantsangwan commented on pull request #4227: HDDS-7847. Handle Replication of Unhealthy Replicas in RM

siddhantsangwan commented on PR #4227:
URL: https://github.com/apache/ozone/pull/4227#issuecomment-1425830248

   > When checking for over replication of healthy replicas, we have a stricter check in RatisReplicationCheckHandler:
   > 
   >       if (!overHealth.isReplicatedOkAfterPending() &&
   >           !overHealth.hasMismatchedReplicas()) {
   >         request.getReplicationQueue().enqueue(overHealth);
   >       }
   > This means we don't enqueue at all if there are some mismatched replicas. This is the same point (number 5) that I put in the description above. This behaviour is different from Legacy but I couldn't come up with a good way to check for sufficient number of matching replicas. Legacy checks this by constructing RatisContainerReplicaCount with matching replicas only. I tried to do that here (in hasSufficientMatchingReplicas that I forgot to delete) but realised this can be wrong because the pending ops in that object will include ops on other replicas also.
   > 
   > For example, consider a CLOSED container with the following replicas:
   > CLOSED, CLOSED, CLOSED, UNHEALTHY
   > Suppose Legacy sent a delete command for the UNHEALTHY replica, so we have a pending delete. In the next RM iteration, if we construct RatisContainerReplicaCount with only the matching replicas, we will still have a pending delete being counted. This will cause the container to be seen as under replicated (3 matching replicas - 1 delete). Any ideas how this can be done?
   > 
   > The other place where we're queuing for over rep is in RatisUnhealthyReplicationCheckHandler. verifyPerfectReplication checks perfect replication with just the healthy replicas. I think you're right, we should be checking only with matching replicas here like legacy does in handleOverReplicatedExcessUnhealthy. It then becomes the same problem as above. Do you think it's ok to queue them here as long as RatisOverReplicationHandler tries to delete unhealthy replicas first?
   
   I've tried to resolve this situation. RatisReplicationCheckHandler will still not queue for over rep if we have any mismatched replicas. RatisUnhealthyReplicationCheckHandler will check if we have sufficient number of matching replicas (isSafelyOverReplicated()).
   
   
   Fixed all the other comments except refactoring RatisContainerReplicaCount. This PR is huge as is - we can refactor in an improvement jira if required.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org