Posted to issues@ozone.apache.org by "Stephen O'Donnell (Jira)" <ji...@apache.org> on 2019/11/22 16:13:00 UTC
[jira] [Comment Edited] (HDDS-2459) Refactor ReplicationManager to consider maintenance states

    [ https://issues.apache.org/jira/browse/HDDS-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974366#comment-16974366 ] 

Stephen O'Donnell edited comment on HDDS-2459 at 11/22/19 4:12 PM:
-------------------------------------------------------------------

In the decommission design doc, we had an algorithm to determine the number of replicas that need to be created or destroyed so that a container is perfectly replicated. The algorithm was:

{code}
/**
 * Calculate the number of the missing replicas.
 * 
 * @return the number of the missing replicas. If it's less than zero, the container is over replicated.
 */
int getReplicationCount(int expectedCount, int healthy, 
   int maintenance, int inFlight) {

   //for over replication, count only with the healthy replicas
   if (expectedCount < healthy) {
      return expectedCount - healthy;
   }
   
   int replicaCount = expectedCount - (healthy + maintenance + inFlight);

   if (replicaCount == 0 && healthy < 1) {
      replicaCount++;
   }
   
   //over replication is already handled
   return Math.max(0, replicaCount);
}
{code}
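
As a quick check of the design-doc algorithm above, the sketch below restates it as a static Java method and evaluates three cases. The class name DesignDocCount and the sample inputs are illustrative assumptions and do not come from the design doc.

```java
// Illustrative sketch only: class name and sample values are assumptions.
final class DesignDocCount {

  // Same arithmetic as the design-doc algorithm, with replicaCount declared
  // as a local variable.
  static int getReplicationCount(int expectedCount, int healthy,
      int maintenance, int inFlight) {
    // For over replication, count only the healthy replicas.
    if (expectedCount < healthy) {
      return expectedCount - healthy;
    }
    int replicaCount = expectedCount - (healthy + maintenance + inFlight);
    // If the count is met without a single healthy replica, ask for one more.
    if (replicaCount == 0 && healthy < 1) {
      replicaCount++;
    }
    // Over replication is already handled, so never return a negative here.
    return Math.max(0, replicaCount);
  }

  public static void main(String[] args) {
    // 4 healthy with expected count 3: over replicated by one.
    System.out.println(getReplicationCount(3, 4, 0, 0)); // -1
    // 2 healthy + 1 maintenance: nothing to do.
    System.out.println(getReplicationCount(3, 2, 1, 0)); // 0
    // 0 healthy + 3 maintenance: one new replica is requested.
    System.out.println(getReplicationCount(3, 0, 3, 0)); // 1
  }
}
```

Note that the over-replicated branch returns before any inflight operations are looked at.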

The code from the design doc needs a minor correction to handle inflight deletes on over replication, and to handle replication factor 1 containers, so it would look like this:

{code}
  public int additionalReplicaNeeded2() {

    if (repFactor < healthyCount) {
      return repFactor - healthyCount + inFlightDel;
    }

    int delta = repFactor - (healthyCount + maintenanceCount + inFlightAdd - inFlightDel);

    if (delta == 0 && healthyCount < minHealthyForMaintenance) {
      delta += Math.min(repFactor, minHealthyForMaintenance) - healthyCount;
    }
    return Math.max(0, delta);
  }
{code}

I also came up with the logic below, which is very similar although a little more verbose. The only difference between the above and the below is that, in the case of 3 IN_SERVICE replicas and one or more inflight deletes, the above will return 1 new replica needed, but the below will return zero. The reasoning is that we should let the delete complete or fail, and then deal with any over or under replication once the inflight operations have cleared.

There is also a bug in the above when there is 1 IN_SERVICE replica and 3 MAINTENANCE replicas with minHealthy = 2. In this case the logic returns zero rather than the intended 1. This scenario could come about if 3 hosts are put into maintenance and then 1 new replica gets created.

{code}
  /**
   * Calculates the delta of replicas which need to be created or removed
   * to ensure the container is correctly replicated.
   *
   * Decisions around over-replication are made only on healthy replicas,
   * ignoring any in maintenance and also any inflight adds. InFlight adds are
   * ignored, as they may not complete, so if we have:
   *
   *     H, H, H, IN_FLIGHT_ADD
   *
   * And then schedule a delete, we could end up under-replicated (add fails,
   * delete completes). It is better to let the inflight operations complete
   * and then deal with any further over or under replication.
   *
   * For maintenance replicas, assuming replication factor 3, and minHealthy
   * 2, it is possible for all 3 hosts to be put into maintenance, leaving the
   * following (H = healthy, M = maintenance):
   *
   *     H, H, M, M, M
   *
   * Even though we are tracking 5 replicas, this is not over replicated as we
   * ignore the maintenance copies. Later, the replicas could look like:
   *
   *     H, H, H, H, M
   *
   * At this stage, the container is over replicated by 1, so one replica can be
   * removed.
   *
   * For containers which have exactly replication factor healthy replicas, we
   * ignore any inflight adds or deletes, as they may fail. Instead, wait for
   * them to complete and then deal with any excess or deficit.
   *
   * For under replicated containers we do consider inflight adds and deletes
   * to avoid scheduling more adds than needed. There is additional logic
   * around containers with maintenance replicas to ensure
   * minHealthyForMaintenance replicas are maintained.
   *
   * @return Delta of replicas needed. Negative indicates over replication and
   *         replicas should be removed. Positive indicates under replication
   *         and zero indicates the container has replicationFactor healthy
   *         replicas.
   */
  public int additionalReplicaNeeded() {
    int delta = repFactor - healthyCount;

    if (delta < 0) {
      // Over replicated, so may need to remove a replica. Do not consider
      // inFlightAdds, as they may fail, but do consider inFlightDel which
      // will reduce the over-replication if it completes.
      // Note this could make the delta positive if there are too many in flight
      // deletes, which will result in an additional replica being scheduled.
      return delta + inFlightDel;
    } else if (delta > 0) {
      // May be under-replicated, depending on maintenance. When a container is
      // under-replicated, we must consider in flight add and delete when
      // calculating the new replicas needed.
      delta = Math.max(0, delta - maintenanceCount);
      // Check we have enough healthy replicas
      // Use a local so the calculation does not overwrite the configured value.
      int minHealthy = Math.min(repFactor, minHealthyForMaintenance);
      int neededHealthy = Math.max(0, minHealthy - healthyCount);
      delta = Math.max(neededHealthy, delta);
      return delta - inFlightAdd + inFlightDel;
    } else { // delta == 0
      // We have exactly the number of healthy replicas needed, but there may
      // be inflight add or delete. Some of these may fail, but we want to
      // avoid scheduling needless extra replicas. Therefore enforce a lower
      // bound of 0 on the delta, but include the in flight requests in the
      // calculation.
      return Math.max(0, delta + inFlightDel - inFlightAdd);
    }
  }
{code}
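
To make the maintenance scenarios concrete, here is a rough sketch that ports both methods above to static functions taking their inputs as parameters, then evaluates the cases from the comment and the Javadoc. The class name ReplicaDeltaSketch, the method names viaDesignDoc and viaVerbose, and the parameter-passing style are assumptions made for illustration. Only the arithmetic is taken from the methods above.

```java
// Illustrative sketch only: names and parameter passing are assumptions.
final class ReplicaDeltaSketch {

  // Port of additionalReplicaNeeded2().
  static int viaDesignDoc(int repFactor, int healthy, int maintenance,
      int inFlightAdd, int inFlightDel, int minHealthy) {
    if (repFactor < healthy) {
      return repFactor - healthy + inFlightDel;
    }
    int delta =
        repFactor - (healthy + maintenance + inFlightAdd - inFlightDel);
    if (delta == 0 && healthy < minHealthy) {
      delta += Math.min(repFactor, minHealthy) - healthy;
    }
    return Math.max(0, delta);
  }

  // Port of additionalReplicaNeeded().
  static int viaVerbose(int repFactor, int healthy, int maintenance,
      int inFlightAdd, int inFlightDel, int minHealthy) {
    int delta = repFactor - healthy;
    if (delta < 0) {
      // Over replicated: only inflight deletes offset the excess.
      return delta + inFlightDel;
    } else if (delta > 0) {
      // Maintenance copies count towards the total, but a minimum number
      // of healthy replicas is still required.
      delta = Math.max(0, delta - maintenance);
      int min = Math.min(repFactor, minHealthy);
      int neededHealthy = Math.max(0, min - healthy);
      delta = Math.max(neededHealthy, delta);
      return delta - inFlightAdd + inFlightDel;
    } else {
      return Math.max(0, delta + inFlightDel - inFlightAdd);
    }
  }

  public static void main(String[] args) {
    // 1 IN_SERVICE, 3 MAINTENANCE, repFactor 3, minHealthy 2.
    System.out.println(viaDesignDoc(3, 1, 3, 0, 0, 2)); // 0
    System.out.println(viaVerbose(3, 1, 3, 0, 0, 2));   // 1
    // H, H, M, M, M: not over replicated.
    System.out.println(viaVerbose(3, 2, 3, 0, 0, 2));   // 0
    // H, H, H, H, M: over replicated by one.
    System.out.println(viaVerbose(3, 4, 1, 0, 0, 2));   // -1
  }
}
```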

The following logic also describes the conditions the replicas of a container must meet for it to be considered sufficiently replicated. Note that inflight adds are ignored and inflight deletes are considered until they complete:

{code}
  /**
   * Return true if the container is sufficiently replicated. Decommissioning
   * and Decommissioned containers are ignored in this check, assuming they will
   * eventually be removed from the cluster.
   * This check ignores inflight additions, as those replicas have not yet been
   * created and the create could fail for some reason.
   * The check does consider inflight deletes as there may be 3 healthy replicas
   * now, but once the delete completes it will reduce to 2.
   * We also assume a replica in Maintenance state cannot be removed, so the
   * pending delete would affect only the healthy replica count.
   *
   * @return True if the container is sufficiently replicated and False
   *         otherwise.
   */
  public boolean isSufficientlyReplicated() {
    return (healthyCount + maintenanceCount - inFlightDel) >= repFactor
        && healthyCount - inFlightDel
        >= Math.min(repFactor, minHealthyForMaintenance);
  }
{code}
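
As a rough illustration of the check above, the sketch below evaluates the same expression for a few replica layouts. The class name SufficientSketch and the parameter-passing style are assumptions for illustration. The boolean expression itself is copied from isSufficientlyReplicated().

```java
// Illustrative sketch only: class name and parameter passing are assumptions.
final class SufficientSketch {

  // Same expression as isSufficientlyReplicated(), with inputs as parameters.
  static boolean isSufficientlyReplicated(int repFactor, int healthy,
      int maintenance, int inFlightDel, int minHealthy) {
    return (healthy + maintenance - inFlightDel) >= repFactor
        && healthy - inFlightDel >= Math.min(repFactor, minHealthy);
  }

  public static void main(String[] args) {
    // H, H, H with nothing in flight.
    System.out.println(isSufficientlyReplicated(3, 3, 0, 0, 2)); // true
    // H, H, H with one pending delete: would drop to 2 copies.
    System.out.println(isSufficientlyReplicated(3, 3, 0, 1, 2)); // false
    // H, H, M with minHealthy 2.
    System.out.println(isSufficientlyReplicated(3, 2, 1, 0, 2)); // true
    // H, M, M, M with minHealthy 2: too few healthy replicas.
    System.out.println(isSufficientlyReplicated(3, 1, 3, 0, 2)); // false
    // Replication factor 1 with only a maintenance copy.
    System.out.println(isSufficientlyReplicated(1, 0, 1, 0, 2)); // false
  }
}
```

The last case shows why the minimum is capped with Math.min(repFactor, minHealthyForMaintenance): a factor 1 container needs one healthy replica, not two.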


was (Author: sodonnell):
In the decommission design doc, we had an algorithm to determine the number of replicas that need to be created or destroyed so that a container is perfectly replicated. The algorithm was:

{code}
/**
 * Calculate the number of the missing replicas.
 * 
 * @return the number of the missing replicas. If it's less than zero, the container is over replicated.
 */
int getReplicationCount(int expectedCount, int healthy, 
   int maintenance, int inFlight) {

   //for over replication, count only with the healthy replicas
   if (expectedCount < healthy) {
      return expectedCount - healthy;
   }
   
   int replicaCount = expectedCount - (healthy + maintenance + inFlight);

   if (replicaCount == 0 && healthy < 1) {
      replicaCount++;
   }
   
   //over replication is already handled
   return Math.max(0, replicaCount);
}
{code}

The code from the design doc needs a minor correction to handle inflight deletes on over replication, and to handle replication factor 1 containers, so it would look like this:

{code}
  public int additionalReplicaNeeded2() {

    if (repFactor < healthyCount) {
      return repFactor - healthyCount + inFlightDel;
    }

    int delta = repFactor - (healthyCount + maintenanceCount + inFlightAdd - inFlightDel);

    if (delta == 0 && healthyCount < minHealthyForMaintenance) {
      delta += Math.min(repFactor, minHealthyForMaintenance) - healthyCount;
    }
    return Math.max(0, delta);
  }
{code}

I also came up with the logic below, which is very similar although a little more verbose. The only difference between the above and the below is that, in the case of 3 IN_SERVICE replicas and one or more inflight deletes, the above will return 1 new replica needed, but the below will return zero. The reasoning is that we should let the delete complete or fail, and then deal with any over or under replication once the inflight operations have cleared.

There is also a bug in the above when there is 1 IN_SERVICE replica and 3 MAINTENANCE replicas with minHealthy = 2. In this case the logic returns zero rather than the intended 1. This scenario could come about if 3 hosts are put into maintenance and then 1 new replica gets created.

{code}
  /**
   * Calculates the delta of replicas which need to be created or removed
   * to ensure the container is correctly replicated.
   *
   * Decisions around over-replication are made only on healthy replicas,
   * ignoring any in maintenance and also any inflight adds. InFlight adds are
   * ignored, as they may not complete, so if we have:
   *
   *     H, H, H, IN_FLIGHT_ADD
   *
   * And then schedule a delete, we could end up under-replicated (add fails,
   * delete completes). It is better to let the inflight operations complete
   * and then deal with any further over or under replication.
   *
   * For maintenance replicas, assuming replication factor 3, and minHealthy
   * 2, it is possible for all 3 hosts to be put into maintenance, leaving the
   * following (H = healthy, M = maintenance):
   *
   *     H, H, M, M, M
   *
   * Even though we are tracking 5 replicas, this is not over replicated as we
   * ignore the maintenance copies. Later, the replicas could look like:
   *
   *     H, H, H, H, M
   *
   * At this stage, the container is over replicated by 1, so one replica can be
   * removed.
   *
   * For containers which have exactly replication factor healthy replicas, we
   * ignore any inflight adds or deletes, as they may fail. Instead, wait for
   * them to complete and then deal with any excess or deficit.
   *
   * For under replicated containers we do consider inflight adds and deletes
   * to avoid scheduling more adds than needed. There is additional logic
   * around containers with maintenance replicas to ensure
   * minHealthyForMaintenance replicas are maintained.
   *
   * @return Delta of replicas needed. Negative indicates over replication and
   *         replicas should be removed. Positive indicates under replication
   *         and zero indicates the container has replicationFactor healthy
   *         replicas.
   */
  public int additionalReplicaNeeded() {
    int delta = repFactor - healthyCount;

    if (delta < 0) {
      // Over replicated, so may need to remove a replica. Do not consider
      // inFlightAdds, as they may fail, but do consider inFlightDel which
      // will reduce the over-replication if it completes.
      return delta + inFlightDel;
    } else if (delta > 0) {
      // May be under-replicated, depending on maintenance. When a container is
      // under-replicated, we must consider inflight add and delete when
      // calculating the new replicas needed.
      delta = Math.max(0, delta - maintenanceCount);
      // Check we have enough healthy replicas
      // Use a local so the calculation does not overwrite the configured value.
      int minHealthy = Math.min(repFactor, minHealthyForMaintenance);
      int neededHealthy = Math.max(0, minHealthy - healthyCount);
      delta = Math.max(neededHealthy, delta);
      return delta - inFlightAdd + inFlightDel;
    } else { // delta == 0
      // We have exactly the number of healthy replicas needed, but there may
      // be inflight add or delete. Ignore them until they complete or fail
      // and then deal with the excess or deficit.
      return delta;
    }
  }
{code}

The following logic also describes the conditions the replicas of a container must meet for it to be considered sufficiently replicated. Note that inflight adds are ignored and inflight deletes are considered until they complete:

{code}
  /**
   * Return true if the container is sufficiently replicated. Decommissioning
   * and Decommissioned containers are ignored in this check, assuming they will
   * eventually be removed from the cluster.
   * This check ignores inflight additions, as those replicas have not yet been
   * created and the create could fail for some reason.
   * The check does consider inflight deletes as there may be 3 healthy replicas
   * now, but once the delete completes it will reduce to 2.
   * We also assume a replica in Maintenance state cannot be removed, so the
   * pending delete would affect only the healthy replica count.
   *
   * @return True if the container is sufficiently replicated and False
   *         otherwise.
   */
  public boolean isSufficientlyReplicated() {
    return (healthyCount + maintenanceCount - inFlightDel) >= repFactor
        && healthyCount - inFlightDel
        >= Math.min(repFactor, minHealthyForMaintenance);
  }
{code}

> Refactor ReplicationManager to consider maintenance states
> ----------------------------------------------------------
>
>                 Key: HDDS-2459
>                 URL: https://issues.apache.org/jira/browse/HDDS-2459
>             Project: Hadoop Distributed Data Store
>          Issue Type: Sub-task
>          Components: SCM
>    Affects Versions: 0.5.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In its current form the replication manager does not consider decommission or maintenance states when checking if replicas are sufficiently replicated. With the introduction of maintenance states, it needs to consider decommission and maintenance states when deciding if blocks are over or under replicated.
> It also needs to provide an API to allow the decommission manager to check if blocks are over or under replicated, so the decommission manager can decide if a node has completed decommission and maintenance or not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
