You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@helix.apache.org by GitBox <gi...@apache.org> on 2020/10/06 20:03:56 UTC

[GitHub] [helix] jiajunwang commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

jiajunwang commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500491326



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -428,13 +471,30 @@ public synchronized void setIdealStates(List<IdealState> idealStates) {
   }
 
   /**
-   * Returns the LiveInstances for each of the instances that are curretnly up and running
+   * Returns the LiveInstances for each of the instances that are currently up and running
    * @return
    */
   public Map<String, LiveInstance> getLiveInstances() {
+    return getLiveInstances(true);
+  }
+
+  /**
+   * Returns the LiveInstances for each of the instances that are currently up and running
+   * @param excludeTimeoutInstances - Only effective during maintenance mode. If true, filter out
+   *                                instances that are timed-out during maintenance mode; instances
+   *                                are timed-out if they have been offline for a while before going
+   *                                live during maintenance mode.
+   * @return
+   */
+  public Map<String, LiveInstance> getLiveInstances(boolean excludeTimeoutInstances) {

Review comment:
       I suggest making it explicit.
   One method getAllLiveInstances(), another one something like getActiveLiveInstance().
   
   In the future, the later one can be extended to return different lists according to the conditions (maintenance mode is the first example).

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +788,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));

Review comment:
       Just put the logic into a non-static method getOnlineTimestamp()? Which directly returns List<Long>.

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -253,6 +257,10 @@ private void refreshIdealState(final HelixDataAccessor accessor,
   private void refreshLiveInstances(final HelixDataAccessor accessor,
       Set<HelixConstants.ChangeType> refreshedType) {
     if (_propertyDataChangedMap.get(HelixConstants.ChangeType.LIVE_INSTANCE).getAndSet(false)) {
+      // Keep a snapshot of old live instances for maintenance mode
+      if (isMaintenanceModeEnabled()) {

Review comment:
       Please be careful about the cache update order. refreshLiveInstances is called before updateMaintenanceInfo. So I guess this map never works as expected. As I suggested, just remove this map to reduce the potential risk. ROI is low.

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -299,6 +307,40 @@ private void updateMaintenanceInfo(final HelixDataAccessor accessor) {
     // The following flag is to guarantee that there's only one update per pineline run because we
     // check for whether maintenance recovery could happen twice every pipeline
     _hasMaintenanceSignalChanged = false;
+
+    // If maintenance mode has exited, clear cached timed-out nodes
+    if (!_isMaintenanceModeEnabled) {
+      _timedOutInstanceDuringMaintenance.clear();
+      _liveInstanceSnapshotForMaintenance.clear();
+    }
+  }
+
+  private void timeoutNodesDuringMaintenance(final HelixDataAccessor accessor) {
+    // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+    // for timed-out nodes
+    long timeOutWindow = -1;
+    if (_clusterConfig != null) {
+      timeOutWindow = _clusterConfig.getOfflineNodeTimeOutForMaintenanceMode();
+    }
+    if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+      for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+        // 1. Check timed-out cache and don't do repeated work;
+        // 2. Check for nodes that didn't exist in the last iteration, because it has been checked;

Review comment:
       I guess this one is corresponding to the _liveInstanceSnapshotForMaintenance check. I prefer not to add this.
   1. If the cache update fails due to ZK issue, then we might have the _liveInstanceSnapshotForMaintenance updated but _timedOutInstanceDuringMaintenance is not fully calculated.
   2. The performance enhancement of using this map is not significant. Let's ensure it works fine first.

##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -153,16 +183,73 @@ private void updateOfflineHistory(long time) {
     if (list.size() == HISTORY_SIZE) {
       list.remove(0);
     }
-
-    DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss:SSS");
-    df.setTimeZone(TimeZone.getTimeZone("UTC"));
-    String dateTime = df.format(new Date(time));
-
-    list.add(dateTime);
+    list.add(historyDateLongToString(time));
   }
 
   @Override
   public boolean isValid() {
     return true;
   }
+
+  /**
+   * Parses a history date in string format to its millisecond representation.
+   * Returns -1 if parsing fails.
+   */
+  public static long historyDateStringToLong(String dateString) {
+    SimpleDateFormat simpleDateFormat = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
+    try {
+      Date date = simpleDateFormat.parse(dateString);
+      return date.getTime();
+    } catch (ParseException e) {
+      LOG.warn("Failed to parse participant history date string: " + dateString);
+      return -1;
+    }
+  }
+
+  /**
+   * Parses a history date in millisecond to string.
+   */
+  public static String historyDateLongToString(long dateLong) {
+    DateFormat df = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    df.setTimeZone(TimeZone.getTimeZone("UTC"));
+    return df.format(new Date(dateLong));
+  }
+
+  /**
+   * Parses the session entry map that has been converted to string back to a map.
+   * NOTE TO CALLER: This assumes the divider between entries is ", " and the divider between
+   * key/value is "="; if the string is malformed, parsing correctness is not guaranteed. Always
+   * check if a key is contained before using the key.
+   */
+  public static Map<String, String> parseSessionHistoryStringToMap(String sessionHistoryString) {
+    sessionHistoryString = sessionHistoryString.substring(1, sessionHistoryString.length() - 1);
+    Map<String, String> sessionHistoryMap = new HashMap<>();
+
+    for (String sessionHistoryKeyValuePair : sessionHistoryString.split(", ")) {
+      String[] keyValuePair = sessionHistoryKeyValuePair.split("=");
+      if (keyValuePair.length < 2) {
+        continue;

Review comment:
       nit, warning in these abnormal case for debugging.

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +782,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }
+      }
+    }
+    if (onlineTimestamps.isEmpty() || onlineTimestamps.get(0) > _maintenanceSignal.getTimestamp()) {
+      // Node is a new node in this maintenance mode, no need to time out since no partitions are
+      // assigned to it
+      return false;
+    }
+
+    // Parse offline timestamps from offline list
+    List<Long> offlineTimestamps = new ArrayList<>();
+    List<String> historyOfflineList = history.getOffline();
+    if (historyOfflineList != null) {
+      long tailOfflineTime;
+      for (int i = historyOfflineList.size() - 1; i >= 0; i--) {
+        tailOfflineTime =
+            ParticipantHistory.parseHistoryDateStringToLong(historyOfflineList.get(i));
+        // Ignore bad format case
+        if (tailOfflineTime != -1) {
+          if (tailOfflineTime <= onlineTimestamps.get(0)) {
+            break;
+          }
+          offlineTimestamps.add(0, tailOfflineTime);
+        }
+      }
+    }
+
+    // At this point, onlineTimestamps contains at least 1 timestamp that's before maintenance mode;
+    // offlineTimestamps contains 0+ timestamp that's > the first online timestamp.

Review comment:
       As I mentioned above, the current logic is too complicated than it needs to be. For a newly added node, the offline time is timestamp 0. So it is definitely timeout.

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +788,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }
+      }
+    }
+    if (onlineTimestamps.isEmpty() || onlineTimestamps.get(0) > _maintenanceSignal.getTimestamp()) {

Review comment:
       I agree these nodes shall be excluded. But we are actually doing this for a different reason.
   Note that the logic is a node is offline for a long time and back online during the M mode. In this case, the new node previous offline time is 0. So it is definitely timed out. So we don't need this logic here. The later logic shall be able to cover this case.

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +788,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts

Review comment:
       nit, I think you need all the online timestamp before the _maintenanceSignal create time. Otherwise, you don't need this list. So the comment is not accurate.

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -299,6 +307,40 @@ private void updateMaintenanceInfo(final HelixDataAccessor accessor) {
     // The following flag is to guarantee that there's only one update per pineline run because we
     // check for whether maintenance recovery could happen twice every pipeline
     _hasMaintenanceSignalChanged = false;
+
+    // If maintenance mode has exited, clear cached timed-out nodes
+    if (!_isMaintenanceModeEnabled) {
+      _timedOutInstanceDuringMaintenance.clear();
+      _liveInstanceSnapshotForMaintenance.clear();
+    }
+  }
+
+  private void timeoutNodesDuringMaintenance(final HelixDataAccessor accessor) {
+    // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+    // for timed-out nodes
+    long timeOutWindow = -1;
+    if (_clusterConfig != null) {
+      timeOutWindow = _clusterConfig.getOfflineNodeTimeOutForMaintenanceMode();
+    }
+    if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {

Review comment:
       To reduce the implicit dependencies between the refresh/update methods, please input the isMaintenanceModeEnabled as a parameter. So the caller is more likely to pass the refreshed value.
   
   Overall, we shall reduce referring to the private field directly to decouple methods and prevent potential bugs.

##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -153,16 +183,73 @@ private void updateOfflineHistory(long time) {
     if (list.size() == HISTORY_SIZE) {
       list.remove(0);
     }
-
-    DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss:SSS");
-    df.setTimeZone(TimeZone.getTimeZone("UTC"));
-    String dateTime = df.format(new Date(time));
-
-    list.add(dateTime);
+    list.add(historyDateLongToString(time));
   }
 
   @Override
   public boolean isValid() {
     return true;
   }
+
+  /**
+   * Parses a history date in string format to its millisecond representation.
+   * Returns -1 if parsing fails.
+   */
+  public static long historyDateStringToLong(String dateString) {
+    SimpleDateFormat simpleDateFormat = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
+    try {
+      Date date = simpleDateFormat.parse(dateString);
+      return date.getTime();
+    } catch (ParseException e) {
+      LOG.warn("Failed to parse participant history date string: " + dateString);
+      return -1;
+    }
+  }
+
+  /**
+   * Parses a history date in millisecond to string.
+   */
+  public static String historyDateLongToString(long dateLong) {
+    DateFormat df = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    df.setTimeZone(TimeZone.getTimeZone("UTC"));
+    return df.format(new Date(dateLong));
+  }

Review comment:
       As commented above, these 2 methods can be private if we directly return the lists of Long

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +788,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {

Review comment:
       nit, the logic might be clearer for the readers if you just subtract the list and then read the last one. Otherwise, you have multiple lists with different orders. And get(0) in the following code could be a mystery for others.

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +788,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }
+      }
+    }
+    if (onlineTimestamps.isEmpty() || onlineTimestamps.get(0) > _maintenanceSignal.getTimestamp()) {
+      // Node is a new node in this maintenance mode, no need to time out since no partitions are
+      // assigned to it
+      return false;
+    }
+
+    // Parse offline timestamps from offline list
+    List<Long> offlineTimestamps = new ArrayList<>();
+    List<String> historyOfflineList = history.getOffline();
+    if (historyOfflineList != null) {
+      long tailOfflineTime;
+      for (int i = historyOfflineList.size() - 1; i >= 0; i--) {
+        tailOfflineTime =
+            ParticipantHistory.historyDateStringToLong(historyOfflineList.get(i));

Review comment:
       Same here, getOfflineTimestamp




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org