You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@helix.apache.org by GitBox <gi...@apache.org> on 2020/09/25 21:43:59 UTC

[GitHub] [helix] NealSun96 opened a new pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

NealSun96 opened a new pull request #1413:
URL: https://github.com/apache/helix/pull/1413


   ### Issues
   
   - [x] My PR addresses the following Helix issues and references them in the PR description:
   
   Fixes #1412
   
   ### Description
   
   - [x] Here are some details about my PR, including screenshots of any UI changes:
   
   As described in the issue, we would like to avoid the scenario where nodes that have been offline for a while come back to life during maintenance mode. Solution: if a node has been offline for a certain period of time, the maintenance mode rebalancer will not assign partitions to it. 
   
   To accomplish this, two aspects of changes are necessary:
   1. A new customer-defined value is required to enable this feature: MAINTENANCE_NODE_TIME_OUT. This value represents "how long a node has been offline such that it should be timed out". This value is set in ClusterConfig, similar to DELAY_REBALANCE_TIME.
   2. A new logic should be added to the refreshLiveInstances() function of BaseControllerDataProvider. When the maintenance mode is enabled, find the new live nodes, and filter out the timed-out nodes; the timed-out nodes are treated as not live throughout the pipeline. 
   
   ### Tests
   
   - [x] The following tests are written for this issue:
   
   Everything under: TestParticipantHistory.java, TestOfflineNodeTimeoutDuringMaintenanceMode.java
   testGetMaintenanceOfflineNodeTimeOut(), testSetMaintenanceOfflineNodeTimeOut()
   
   - [ ] The following is the result of the "mvn test" command on the appropriate module:
   
   (Before CI test pass, please copy & paste the result of "mvn test")
   
   ### Documentation (Optional)
   
   - In case of new functionality, my PR adds documentation in the following wiki page:
   
   (Link the GitHub wiki you added)
   
   ### Commits
   
   - My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Code Quality
   
   - My diff has been formatted using helix-style.xml 
   (helix-style-intellij.xml if IntelliJ IDE is used)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500676281



##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -64,6 +69,20 @@ public ParticipantHistory(ZNRecord znRecord) {
     super(znRecord);
   }
 
+  /**
+   * @return The list field for HISTORY
+   */
+  public List<String> getHistory() {
+    return _record.getListField(ConfigProperty.HISTORY.name());
+  }
+
+  /**
+   * @return The list field for OFFLINE
+   */
+  public List<String> getOffline() {
+    return _record.getListField(ConfigProperty.OFFLINE.name());
+  }
+

Review comment:
       They are public because 
   1. we should provide accessors since now there are actual code usage of ParticipantHistory; and, 
   2. an integration test calls these functions. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499757672



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -428,13 +463,32 @@ public synchronized void setIdealStates(List<IdealState> idealStates) {
   }
 
   /**
-   * Returns the LiveInstances for each of the instances that are curretnly up and running
+   * Returns the LiveInstances for each of the instances that are currently up and running
    * @return
    */
   public Map<String, LiveInstance> getLiveInstances() {

Review comment:
       In short: everywhere else that isn't critical to "timing out" instances. 
   
   The new `getLiveInstances(boolean)` is minimized to only one place: `MessageGenerationPhase`. This approach minimizes any impact to existing code while also making the "timed-out" instances not reachable from the controller; effectively, those instances will have no operations done on them. This decision is made after @jiajunwang 's comment that we shouldn't directly alter the live instance cache; we should only alter the getter. 
   
   Please let me know if this change makes sense. I believe only changing `MessageGenerationPhase` is sufficient. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499792289



##########
File path: helix-core/src/main/java/org/apache/helix/controller/stages/MessageGenerationPhase.java
##########
@@ -89,7 +89,7 @@ protected void processEvent(ClusterEvent event, ResourcesStateMap resourcesState
           + ". Requires HelixManager|DataCache|RESOURCES|CURRENT_STATE|INTERMEDIATE_STATE");
     }
 
-    Map<String, LiveInstance> liveInstances = cache.getLiveInstances();
+    Map<String, LiveInstance> liveInstances = cache.getLiveInstances(true);

Review comment:
       Could this create inconsistency since LivInstances used by all other places in the pipeline are different from here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500491326



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -428,13 +471,30 @@ public synchronized void setIdealStates(List<IdealState> idealStates) {
   }
 
   /**
-   * Returns the LiveInstances for each of the instances that are curretnly up and running
+   * Returns the LiveInstances for each of the instances that are currently up and running
    * @return
    */
   public Map<String, LiveInstance> getLiveInstances() {
+    return getLiveInstances(true);
+  }
+
+  /**
+   * Returns the LiveInstances for each of the instances that are currently up and running
+   * @param excludeTimeoutInstances - Only effective during maintenance mode. If true, filter out
+   *                                instances that are timed-out during maintenance mode; instances
+   *                                are timed-out if they have been offline for a while before going
+   *                                live during maintenance mode.
+   * @return
+   */
+  public Map<String, LiveInstance> getLiveInstances(boolean excludeTimeoutInstances) {

Review comment:
       I suggest making it explicit.
   One method getAllLiveInstances(), another one something like getActiveLiveInstance().
   
   In the future, the later one can be extended to return different lists according to the conditions (maintenance mode is the first example).

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +788,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));

Review comment:
       Just put the logic into a non-static method getOnlineTimestamp()? Which directly returns List<Long>.

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -253,6 +257,10 @@ private void refreshIdealState(final HelixDataAccessor accessor,
   private void refreshLiveInstances(final HelixDataAccessor accessor,
       Set<HelixConstants.ChangeType> refreshedType) {
     if (_propertyDataChangedMap.get(HelixConstants.ChangeType.LIVE_INSTANCE).getAndSet(false)) {
+      // Keep a snapshot of old live instances for maintenance mode
+      if (isMaintenanceModeEnabled()) {

Review comment:
       Please be careful about the cache update order. refreshLiveInstances is called before updateMaintenanceInfo. So I guess this map never works as expected. As I suggested, just remove this map to reduce the potential risk. ROI is low.

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -299,6 +307,40 @@ private void updateMaintenanceInfo(final HelixDataAccessor accessor) {
     // The following flag is to guarantee that there's only one update per pineline run because we
     // check for whether maintenance recovery could happen twice every pipeline
     _hasMaintenanceSignalChanged = false;
+
+    // If maintenance mode has exited, clear cached timed-out nodes
+    if (!_isMaintenanceModeEnabled) {
+      _timedOutInstanceDuringMaintenance.clear();
+      _liveInstanceSnapshotForMaintenance.clear();
+    }
+  }
+
+  private void timeoutNodesDuringMaintenance(final HelixDataAccessor accessor) {
+    // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+    // for timed-out nodes
+    long timeOutWindow = -1;
+    if (_clusterConfig != null) {
+      timeOutWindow = _clusterConfig.getOfflineNodeTimeOutForMaintenanceMode();
+    }
+    if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+      for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+        // 1. Check timed-out cache and don't do repeated work;
+        // 2. Check for nodes that didn't exist in the last iteration, because it has been checked;

Review comment:
       I guess this one is corresponding to the _liveInstanceSnapshotForMaintenance check. I prefer not to add this.
   1. If the cache update fails due to ZK issue, then we might have the _liveInstanceSnapshotForMaintenance updated but _timedOutInstanceDuringMaintenance is not fully calculated.
   2. The performance enhancement of using this map is not significant. Let's ensure it works fine first.

##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -153,16 +183,73 @@ private void updateOfflineHistory(long time) {
     if (list.size() == HISTORY_SIZE) {
       list.remove(0);
     }
-
-    DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss:SSS");
-    df.setTimeZone(TimeZone.getTimeZone("UTC"));
-    String dateTime = df.format(new Date(time));
-
-    list.add(dateTime);
+    list.add(historyDateLongToString(time));
   }
 
   @Override
   public boolean isValid() {
     return true;
   }
+
+  /**
+   * Parses a history date in string format to its millisecond representation.
+   * Returns -1 if parsing fails.
+   */
+  public static long historyDateStringToLong(String dateString) {
+    SimpleDateFormat simpleDateFormat = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
+    try {
+      Date date = simpleDateFormat.parse(dateString);
+      return date.getTime();
+    } catch (ParseException e) {
+      LOG.warn("Failed to parse participant history date string: " + dateString);
+      return -1;
+    }
+  }
+
+  /**
+   * Parses a history date in millisecond to string.
+   */
+  public static String historyDateLongToString(long dateLong) {
+    DateFormat df = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    df.setTimeZone(TimeZone.getTimeZone("UTC"));
+    return df.format(new Date(dateLong));
+  }
+
+  /**
+   * Parses the session entry map that has been converted to string back to a map.
+   * NOTE TO CALLER: This assumes the divider between entries is ", " and the divider between
+   * key/value is "="; if the string is malformed, parsing correctness is not guaranteed. Always
+   * check if a key is contained before using the key.
+   */
+  public static Map<String, String> parseSessionHistoryStringToMap(String sessionHistoryString) {
+    sessionHistoryString = sessionHistoryString.substring(1, sessionHistoryString.length() - 1);
+    Map<String, String> sessionHistoryMap = new HashMap<>();
+
+    for (String sessionHistoryKeyValuePair : sessionHistoryString.split(", ")) {
+      String[] keyValuePair = sessionHistoryKeyValuePair.split("=");
+      if (keyValuePair.length < 2) {
+        continue;

Review comment:
       nit, warning in these abnormal case for debugging.

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +782,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }
+      }
+    }
+    if (onlineTimestamps.isEmpty() || onlineTimestamps.get(0) > _maintenanceSignal.getTimestamp()) {
+      // Node is a new node in this maintenance mode, no need to time out since no partitions are
+      // assigned to it
+      return false;
+    }
+
+    // Parse offline timestamps from offline list
+    List<Long> offlineTimestamps = new ArrayList<>();
+    List<String> historyOfflineList = history.getOffline();
+    if (historyOfflineList != null) {
+      long tailOfflineTime;
+      for (int i = historyOfflineList.size() - 1; i >= 0; i--) {
+        tailOfflineTime =
+            ParticipantHistory.parseHistoryDateStringToLong(historyOfflineList.get(i));
+        // Ignore bad format case
+        if (tailOfflineTime != -1) {
+          if (tailOfflineTime <= onlineTimestamps.get(0)) {
+            break;
+          }
+          offlineTimestamps.add(0, tailOfflineTime);
+        }
+      }
+    }
+
+    // At this point, onlineTimestamps contains at least 1 timestamp that's before maintenance mode;
+    // offlineTimestamps contains 0+ timestamp that's > the first online timestamp.

Review comment:
       As I mentioned above, the current logic is too complicated than it needs to be. For a newly added node, the offline time is timestamp 0. So it is definitely timeout.

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +788,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }
+      }
+    }
+    if (onlineTimestamps.isEmpty() || onlineTimestamps.get(0) > _maintenanceSignal.getTimestamp()) {

Review comment:
       I agree these nodes shall be excluded. But we are actually doing this for a different reason.
   Note that the logic is a node is offline for a long time and back online during the M mode. In this case, the new node previous offline time is 0. So it is definitely timed out. So we don't need this logic here. The later logic shall be able to cover this case.

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +788,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts

Review comment:
       nit, I think you need all the online timestamp before the _maintenanceSignal create time. Otherwise, you don't need this list. So the comment is not accurate.

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -299,6 +307,40 @@ private void updateMaintenanceInfo(final HelixDataAccessor accessor) {
     // The following flag is to guarantee that there's only one update per pineline run because we
     // check for whether maintenance recovery could happen twice every pipeline
     _hasMaintenanceSignalChanged = false;
+
+    // If maintenance mode has exited, clear cached timed-out nodes
+    if (!_isMaintenanceModeEnabled) {
+      _timedOutInstanceDuringMaintenance.clear();
+      _liveInstanceSnapshotForMaintenance.clear();
+    }
+  }
+
+  private void timeoutNodesDuringMaintenance(final HelixDataAccessor accessor) {
+    // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+    // for timed-out nodes
+    long timeOutWindow = -1;
+    if (_clusterConfig != null) {
+      timeOutWindow = _clusterConfig.getOfflineNodeTimeOutForMaintenanceMode();
+    }
+    if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {

Review comment:
       To reduce the implicit dependencies between the refresh/update methods, please input the isMaintenanceModeEnabled as a parameter. So the caller is more likely to pass the refreshed value.
   
   Overall, we shall reduce referring to the private field directly to decouple methods and prevent potential bugs.

##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -153,16 +183,73 @@ private void updateOfflineHistory(long time) {
     if (list.size() == HISTORY_SIZE) {
       list.remove(0);
     }
-
-    DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss:SSS");
-    df.setTimeZone(TimeZone.getTimeZone("UTC"));
-    String dateTime = df.format(new Date(time));
-
-    list.add(dateTime);
+    list.add(historyDateLongToString(time));
   }
 
   @Override
   public boolean isValid() {
     return true;
   }
+
+  /**
+   * Parses a history date in string format to its millisecond representation.
+   * Returns -1 if parsing fails.
+   */
+  public static long historyDateStringToLong(String dateString) {
+    SimpleDateFormat simpleDateFormat = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
+    try {
+      Date date = simpleDateFormat.parse(dateString);
+      return date.getTime();
+    } catch (ParseException e) {
+      LOG.warn("Failed to parse participant history date string: " + dateString);
+      return -1;
+    }
+  }
+
+  /**
+   * Parses a history date in millisecond to string.
+   */
+  public static String historyDateLongToString(long dateLong) {
+    DateFormat df = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    df.setTimeZone(TimeZone.getTimeZone("UTC"));
+    return df.format(new Date(dateLong));
+  }

Review comment:
       As commented above, these 2 methods can be private if we directly return the lists of Long

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +788,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {

Review comment:
       nit, the logic might be clearer for the readers if you just subtract the list and then read the last one. Otherwise, you have multiple lists with different orders. And get(0) in the following code could be a mystery for others.

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +788,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }
+      }
+    }
+    if (onlineTimestamps.isEmpty() || onlineTimestamps.get(0) > _maintenanceSignal.getTimestamp()) {
+      // Node is a new node in this maintenance mode, no need to time out since no partitions are
+      // assigned to it
+      return false;
+    }
+
+    // Parse offline timestamps from offline list
+    List<Long> offlineTimestamps = new ArrayList<>();
+    List<String> historyOfflineList = history.getOffline();
+    if (historyOfflineList != null) {
+      long tailOfflineTime;
+      for (int i = historyOfflineList.size() - 1; i >= 0; i--) {
+        tailOfflineTime =
+            ParticipantHistory.historyDateStringToLong(historyOfflineList.get(i));

Review comment:
       Same here, getOfflineTimestamp




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499740780



##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -112,6 +129,22 @@ public long getLastOfflineTime() {
     return offlineTime;
   }
 
+  /**
+   * Get the time when this node last goes offline in history. If the node does not have offline
+   * history or contains invalid date as the last element, return -1.
+   *
+   * @return
+   */
+  public long getLastTimeInOfflineHistory() {

Review comment:
       What is difference this w/ getLastOfflineTime?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r502129112



##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -153,16 +185,99 @@ private void updateOfflineHistory(long time) {
     if (list.size() == HISTORY_SIZE) {
       list.remove(0);
     }
-
-    DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss:SSS");
-    df.setTimeZone(TimeZone.getTimeZone("UTC"));
-    String dateTime = df.format(new Date(time));
-
-    list.add(dateTime);
+    list.add(historyDateLongToString(time));
   }
 
   @Override
   public boolean isValid() {
     return true;
   }
+
+  /*
+   * Parses a history date in string format to its millisecond representation.
+   * Returns -1 if parsing fails.
+   */
+  private static long historyDateStringToLong(String dateString) {
+    SimpleDateFormat simpleDateFormat = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
+    try {
+      Date date = simpleDateFormat.parse(dateString);
+      return date.getTime();
+    } catch (ParseException e) {
+      LOG.warn("Failed to parse participant history date string: " + dateString);
+      return -1;
+    }
+  }
+
+  /*
+   * Parses a history date in millisecond to string.
+   */
+  private static String historyDateLongToString(long dateLong) {
+    DateFormat df = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    df.setTimeZone(TimeZone.getTimeZone("UTC"));
+    return df.format(new Date(dateLong));
+  }
+
+  /**
+   * Parses the session entry map that has been converted to string back to a map.
+   * NOTE TO CALLER: This assumes the divider between entries is ", " and the divider between
+   * key/value is "="; if the string is malformed, parsing correctness is not guaranteed. Always
+   * check if a key is contained before using the key.
+   */
+  public static Map<String, String> parseSessionHistoryStringToMap(String sessionHistoryString) {
+    sessionHistoryString = sessionHistoryString.substring(1, sessionHistoryString.length() - 1);
+    Map<String, String> sessionHistoryMap = new HashMap<>();
+
+    for (String sessionHistoryKeyValuePair : sessionHistoryString.split(", ")) {
+      String[] keyValuePair = sessionHistoryKeyValuePair.split("=");
+      if (keyValuePair.length < 2) {
+        LOG.warn("Ignore key value pair while parsing session history due to missing '=': " +
+            sessionHistoryKeyValuePair);
+        continue;
+      }
+      sessionHistoryMap.put(keyValuePair[0], keyValuePair[1]);
+    }
+
+    return sessionHistoryMap;
+  }
+
+  /*
+   * Take a string session history entry and extract the TIME field out of it. Return -1 if the TIME
+   * field doesn't exist or if the TIME field cannot be parsed to a long.
+   */
+  private static long extractTimeFromSessionHistoryString(String sessionHistoryString) {
+    Map<String, String> sessionHistoryMap = parseSessionHistoryStringToMap(sessionHistoryString);
+    if (!sessionHistoryMap.containsKey(ConfigProperty.TIME.name())) {
+      return -1;
+    }
+    try {
+      return Long.parseLong(sessionHistoryMap.get(ConfigProperty.TIME.name()));
+    } catch (NumberFormatException e) {
+      return -1;
+    }
+  }
+
+  /**
+   * For each entry in History, return its millisecond timestamp; for timestamps that cannot be
+   * parsed, skip them.
+   */
+  public List<Long> getHistoryTimestampsAsMilliseconds() {

Review comment:
       I understand what you mean, and since I also want to call it getOnlineTimestamps, I made the change. 
   
   Though I don't think the old naming is bad: callers get the History timestamps, as advertised. If they don't know what these are they should read the class. 😃 

##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -64,6 +69,20 @@ public ParticipantHistory(ZNRecord znRecord) {
     super(znRecord);
   }
 
+  /**
+   * @return The list field for HISTORY
+   */
+  public List<String> getHistory() {
+    return _record.getListField(ConfigProperty.HISTORY.name());
+  }
+
+  /**
+   * @return The list field for OFFLINE
+   */
+  public List<String> getOffline() {
+    return _record.getListField(ConfigProperty.OFFLINE.name());
+  }
+

Review comment:
       I disagree with this. For a HelixProperty class, if there is a need of a field, there should be an accessor. It is true that we have timestamp parsers now, but for example, HISTORY contains other informations too. Down the road, there may be other code that relies on other HISTORY information. What would happen then? An accessor will be built. Right now there is a need to these fields, and I don't see a reason to not create accessors. 
   
   For integration test it's not in the same package, so I don't think package private (nor protected) can help. This is a minor point; there are workarounds. My main point is the previous paragraph. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500471814



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -299,6 +306,33 @@ private void updateMaintenanceInfo(final HelixDataAccessor accessor) {
     // The following flag is to guarantee that there's only one update per pineline run because we
     // check for whether maintenance recovery could happen twice every pipeline
     _hasMaintenanceSignalChanged = false;
+
+    // If maintenance mode has exited, clear cached timed-out nodes
+    if (!_isMaintenanceModeEnabled) {
+      _timedOutInstanceDuringMaintenance.clear();
+    }
+  }
+
+  private void timeoutNodesDuringMaintenance(final HelixDataAccessor accessor) {
+    // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+    // for timed-out nodes
+    long timeOutWindow = -1;
+    if (_clusterConfig != null) {
+      timeOutWindow = _clusterConfig.getOfflineNodeTimeOutForMaintenanceMode();
+    }
+    if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+      for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+        // 1. Check timed-out cache and don't do repeated work;
+        // 2. Check for nodes that didn't exist in the last iteration, because it has been checked;
+        // 3. For all other nodes, check if it's timed-out.
+        // When maintenance mode is first entered, all nodes will be checked as a result.
+        if (!_timedOutInstanceDuringMaintenance.contains(instance)
+            && !_liveInstanceSnapshotForMaintenance.containsKey(instance)
+            && isInstanceTimedOutDuringMaintenance(accessor, instance, timeOutWindow)) {
+          _timedOutInstanceDuringMaintenance.add(instance);

Review comment:
       No, `_liveInstanceSnapshotForMaintenance` is equal to `getLiveInstances()` from the **last** pipeline. If a long-offline instance coms back after the first pipeline and before the next pipeline, it will **not** be in the snapshot obtained during the next pipeline, and will therefore be checked by the timeout logic. Whether this instance fails the check, it will be stored either in the snapshot or in the "timed-out cache", and it will not be checked by the pipeline after. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r496292698



##########
File path: helix-core/src/main/java/org/apache/helix/model/ClusterConfig.java
##########
@@ -122,7 +122,13 @@
     // don't specify their individual pool sizes, this value will be used for all participants; if
     // none of participants or the cluster define pool sizes,
     // TaskConstants.DEFAULT_TASK_THREAD_POOL_SIZE will be used to create pool sizes.
-    GLOBAL_TARGET_TASK_THREAD_POOL_SIZE
+    GLOBAL_TARGET_TASK_THREAD_POOL_SIZE,
+
+    // The time out window for offline nodes during maintenance mode; if an offline node has been
+    // offline for more than this specified time period, it's treated as offline for the rest of
+    // the maintenance mode's duration even when it comes online.
+    // The unit is milliseconds.
+    MAINTENANCE_OFFLINE_NODE_TIME_OUT

Review comment:
       Sure, but wouldn't that be too long? I don't know the convention about variable length, so I took a shorter version. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499903126



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +782,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }
+      }
+    }
+    if (onlineTimestamps.isEmpty() || onlineTimestamps.get(0) > _maintenanceSignal.getTimestamp()) {
+      // Node is a new node in this maintenance mode, no need to time out since no partitions are
+      // assigned to it
+      return false;
+    }
+
+    // Parse offline timestamps from offline list
+    List<Long> offlineTimestamps = new ArrayList<>();
+    List<String> historyOfflineList = history.getOffline();
+    if (historyOfflineList != null) {
+      long tailOfflineTime;
+      for (int i = historyOfflineList.size() - 1; i >= 0; i--) {
+        tailOfflineTime =
+            ParticipantHistory.parseHistoryDateStringToLong(historyOfflineList.get(i));
+        // Ignore bad format case
+        if (tailOfflineTime != -1) {
+          if (tailOfflineTime <= onlineTimestamps.get(0)) {
+            break;
+          }
+          offlineTimestamps.add(0, tailOfflineTime);
+        }
+      }
+    }
+
+    // At this point, onlineTimestamps contains at least 1 timestamp that's before maintenance mode;
+    // offlineTimestamps contains 0+ timestamp that's > the first online timestamp.
+    if (!offlineTimestamps.isEmpty()
+        && offlineTimestamps.get(offlineTimestamps.size() - 1) > onlineTimestamps
+        .get(onlineTimestamps.size() - 1)) {
+      // Make sure for each offline timestamp, 2 online timestamps creates a range around it;
+      // this is usually true unless the timestamps are malformed (missing online timestamps)
+      onlineTimestamps.add(currentTime);
+    }
+
+    // Hop between each pair of online timestamps and find the maximum offline window
+    for (int onlineTimestampIndex = 1, offlineTimestampIndex = 0;
+        onlineTimestampIndex < onlineTimestamps.size() && offlineTimestampIndex < offlineTimestamps
+            .size(); onlineTimestampIndex++) {
+      long onlineTimestamp = onlineTimestamps.get(onlineTimestampIndex);
+      long offlineTimeStamp = offlineTimestamps.get(offlineTimestampIndex);
+
+      // If the offline timestamp isn't within this pair of online timestamp, continue
+      if (offlineTimeStamp > onlineTimestamp) {
+        continue;
+      }
+
+      // Check the largest offline window against the timeout window
+      if (onlineTimestamp - offlineTimeStamp > timeOutWindow) {

Review comment:
       @lei-xia We need to handle the case of "offline for too long" during maintenance mode as well, right? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r501166020



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -428,13 +466,37 @@ public synchronized void setIdealStates(List<IdealState> idealStates) {
   }
 
   /**
-   * Returns the LiveInstances for each of the instances that are curretnly up and running
-   * @return
+   * Returns the LiveInstances for each of the instances that are currently up and running,
+   * excluding the instances that are considered offline during maintenance mode. Instances
+   * are timed-out if they have been offline for a while before going live during maintenance mode.
    */
   public Map<String, LiveInstance> getLiveInstances() {
+    return getLiveInstances(true);
+  }
+
+  /**
+   * Returns the LiveInstances for each of the instances that are currently up and running
+   */
+  public Map<String, LiveInstance> getAllLiveInstances() {

Review comment:
       I'm just going to remove this function and the boolean flag, since during maintenance mode, we always want to return the non-timed-out instances. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499743741



##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -112,6 +129,22 @@ public long getLastOfflineTime() {
     return offlineTime;
   }
 
+  /**
+   * Get the time when this node last goes offline in history. If the node does not have offline
+   * history or contains invalid date as the last element, return -1.
+   *
+   * @return
+   */
+  public long getLastTimeInOfflineHistory() {

Review comment:
       `getLastOfflineTime` returns LAST_OFFLINE_TIME, which can be either a timestamp or a -1 for ONLINE. `getLastTimeInOfflineHistory` returns the last timestamp in OFFLINE (list field), which is always a timestamp. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r498610539



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -253,9 +254,34 @@ private void refreshIdealState(final HelixDataAccessor accessor,
   private void refreshLiveInstances(final HelixDataAccessor accessor,
       Set<HelixConstants.ChangeType> refreshedType) {
     if (_propertyDataChangedMap.get(HelixConstants.ChangeType.LIVE_INSTANCE).getAndSet(false)) {
+      // Keep a copy of old live instances in case of maintenance mode
+      Map<String, LiveInstance> oldLiveInstances = getLiveInstances();
       _liveInstanceCache.refresh(accessor);
       _updateInstanceOfflineTime = true;
       refreshedType.add(HelixConstants.ChangeType.LIVE_INSTANCE);
+
+      // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+      // for timed-out nodes
+      long timeOutWindow = _clusterConfig.getMaintenanceOfflineNodeTimeOut();
+      if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+        for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+          // For every 'new' live node, check if it's timed-out
+          if (!oldLiveInstances.containsKey(instance) && isInstanceTimedOutDuringMaintenance(

Review comment:
       The new method should address the problem. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r496290055



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -253,9 +254,34 @@ private void refreshIdealState(final HelixDataAccessor accessor,
   private void refreshLiveInstances(final HelixDataAccessor accessor,
       Set<HelixConstants.ChangeType> refreshedType) {
     if (_propertyDataChangedMap.get(HelixConstants.ChangeType.LIVE_INSTANCE).getAndSet(false)) {
+      // Keep a copy of old live instances in case of maintenance mode
+      Map<String, LiveInstance> oldLiveInstances = getLiveInstances();
       _liveInstanceCache.refresh(accessor);
       _updateInstanceOfflineTime = true;
       refreshedType.add(HelixConstants.ChangeType.LIVE_INSTANCE);
+
+      // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+      // for timed-out nodes
+      long timeOutWindow = _clusterConfig.getMaintenanceOfflineNodeTimeOut();
+      if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+        for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+          // For every 'new' live node, check if it's timed-out
+          if (!oldLiveInstances.containsKey(instance) && isInstanceTimedOutDuringMaintenance(

Review comment:
       Could you elaborate why there would be race conditions? The cache refresh is synchronously computed in the pipeline main thread; rebalancing is calculated on the same pipeline. 
   
   Also with Jiajun's comment: if filtering is done at getter, I don't think there would possibly be race conditions in this scenario. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500752035



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -333,6 +370,7 @@ public synchronized void refresh(HelixDataAccessor accessor) {
     _stateModelDefinitionCache.refresh(accessor);
     _clusterConstraintsCache.refresh(accessor);
     updateMaintenanceInfo(accessor);
+    timeoutNodesDuringMaintenance(accessor, _clusterConfig, _isMaintenanceModeEnabled);

Review comment:
       Any specific reason we do not put this into refreshLiveInstances()?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r502130473



##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -64,6 +69,20 @@ public ParticipantHistory(ZNRecord znRecord) {
     super(znRecord);
   }
 
+  /**
+   * @return The list field for HISTORY
+   */
+  public List<String> getHistory() {
+    return _record.getListField(ConfigProperty.HISTORY.name());
+  }
+
+  /**
+   * @return The list field for OFFLINE
+   */
+  public List<String> getOffline() {
+    return _record.getListField(ConfigProperty.OFFLINE.name());
+  }
+

Review comment:
       I disagree with this. For a HelixProperty class, if there is a need of a field, there should be an accessor. It is true that we have timestamp parsers now, but for example, HISTORY contains other informations too. Down the road, there may be other code that relies on other HISTORY information. What would happen then? An accessor will be built. Right now there is a need to these fields, and I don't see a reason to not create accessors. 
   
   For integration test it's not in the same package, so I don't think package private (nor protected) can help. This is a minor point; there are workarounds. My main point is the previous paragraph. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500621000



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +782,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }

Review comment:
       Moved to ParticipantHistory

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +782,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }
+      }
+    }
+    if (onlineTimestamps.isEmpty() || onlineTimestamps.get(0) > _maintenanceSignal.getTimestamp()) {
+      // Node is a new node in this maintenance mode, no need to time out since no partitions are
+      // assigned to it
+      return false;
+    }
+
+    // Parse offline timestamps from offline list
+    List<Long> offlineTimestamps = new ArrayList<>();
+    List<String> historyOfflineList = history.getOffline();
+    if (historyOfflineList != null) {
+      long tailOfflineTime;
+      for (int i = historyOfflineList.size() - 1; i >= 0; i--) {
+        tailOfflineTime =
+            ParticipantHistory.parseHistoryDateStringToLong(historyOfflineList.get(i));
+        // Ignore bad format case
+        if (tailOfflineTime != -1) {
+          if (tailOfflineTime <= onlineTimestamps.get(0)) {
+            break;
+          }
+          offlineTimestamps.add(0, tailOfflineTime);
+        }
+      }

Review comment:
       Moved to ParticipantHistory




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500604388



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -299,6 +307,40 @@ private void updateMaintenanceInfo(final HelixDataAccessor accessor) {
     // The following flag is to guarantee that there's only one update per pineline run because we
     // check for whether maintenance recovery could happen twice every pipeline
     _hasMaintenanceSignalChanged = false;
+
+    // If maintenance mode has exited, clear cached timed-out nodes
+    if (!_isMaintenanceModeEnabled) {
+      _timedOutInstanceDuringMaintenance.clear();
+      _liveInstanceSnapshotForMaintenance.clear();
+    }
+  }
+
+  private void timeoutNodesDuringMaintenance(final HelixDataAccessor accessor) {
+    // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+    // for timed-out nodes
+    long timeOutWindow = -1;
+    if (_clusterConfig != null) {
+      timeOutWindow = _clusterConfig.getOfflineNodeTimeOutForMaintenanceMode();
+    }
+    if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+      for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+        // 1. Check timed-out cache and don't do repeated work;
+        // 2. Check for nodes that didn't exist in the last iteration, because it has been checked;

Review comment:
       Discussed offline: the snapshot variable is replaced and now it's okay for ZK to be disrupted at any point. `_timedOutInstanceDuringMaintenance` can be partially completed - other nodes will be checked again. The new snapshot variable (which is just `_liveInstanceExcludeTimedOutForMaintenance`) is computed after this step and will always store the correct truth of "what instances are checked"; it can be partially completed - other nodes will be checked again. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499853548



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +782,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }
+      }
+    }
+    if (onlineTimestamps.isEmpty() || onlineTimestamps.get(0) > _maintenanceSignal.getTimestamp()) {
+      // Node is a new node in this maintenance mode, no need to time out since no partitions are
+      // assigned to it
+      return false;
+    }
+
+    // Parse offline timestamps from offline list
+    List<Long> offlineTimestamps = new ArrayList<>();
+    List<String> historyOfflineList = history.getOffline();
+    if (historyOfflineList != null) {
+      long tailOfflineTime;
+      for (int i = historyOfflineList.size() - 1; i >= 0; i--) {
+        tailOfflineTime =
+            ParticipantHistory.parseHistoryDateStringToLong(historyOfflineList.get(i));
+        // Ignore bad format case
+        if (tailOfflineTime != -1) {
+          if (tailOfflineTime <= onlineTimestamps.get(0)) {
+            break;
+          }
+          offlineTimestamps.add(0, tailOfflineTime);
+        }
+      }

Review comment:
       Please see below. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499853384



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +782,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }
+      }
+    }
+    if (onlineTimestamps.isEmpty() || onlineTimestamps.get(0) > _maintenanceSignal.getTimestamp()) {
+      // Node is a new node in this maintenance mode, no need to time out since no partitions are
+      // assigned to it
+      return false;
+    }
+
+    // Parse offline timestamps from offline list
+    List<Long> offlineTimestamps = new ArrayList<>();
+    List<String> historyOfflineList = history.getOffline();
+    if (historyOfflineList != null) {
+      long tailOfflineTime;
+      for (int i = historyOfflineList.size() - 1; i >= 0; i--) {
+        tailOfflineTime =
+            ParticipantHistory.parseHistoryDateStringToLong(historyOfflineList.get(i));
+        // Ignore bad format case
+        if (tailOfflineTime != -1) {
+          if (tailOfflineTime <= onlineTimestamps.get(0)) {
+            break;
+          }
+          offlineTimestamps.add(0, tailOfflineTime);
+        }
+      }
+    }
+
+    // At this point, onlineTimestamps contains at least 1 timestamp that's before maintenance mode;
+    // offlineTimestamps contains 0+ timestamp that's > the first online timestamp.

Review comment:
       This ties back to the previous 2 sections of algorithm: they are not just parsing the dates. The first section stops parsing as soon as the first timestamp that's before maintenance mode; if there's no such timestamp, it just returns false (for newly added host). The second section stops parsing as soon as the timestamp is just after the first online timestamp. This is also why these logic should not be a member logic to ParticipantHistory because they are not meant to be generic. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499754517



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -299,6 +306,33 @@ private void updateMaintenanceInfo(final HelixDataAccessor accessor) {
     // The following flag is to guarantee that there's only one update per pineline run because we
     // check for whether maintenance recovery could happen twice every pipeline
     _hasMaintenanceSignalChanged = false;
+
+    // If maintenance mode has exited, clear cached timed-out nodes
+    if (!_isMaintenanceModeEnabled) {
+      _timedOutInstanceDuringMaintenance.clear();
+    }
+  }
+
+  private void timeoutNodesDuringMaintenance(final HelixDataAccessor accessor) {
+    // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+    // for timed-out nodes
+    long timeOutWindow = -1;
+    if (_clusterConfig != null) {
+      timeOutWindow = _clusterConfig.getOfflineNodeTimeOutForMaintenanceMode();
+    }
+    if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+      for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+        // 1. Check timed-out cache and don't do repeated work;
+        // 2. Check for nodes that didn't exist in the last iteration, because it has been checked;
+        // 3. For all other nodes, check if it's timed-out.
+        // When maintenance mode is first entered, all nodes will be checked as a result.
+        if (!_timedOutInstanceDuringMaintenance.contains(instance)
+            && !_liveInstanceSnapshotForMaintenance.containsKey(instance)
+            && isInstanceTimedOutDuringMaintenance(accessor, instance, timeOutWindow)) {
+          _timedOutInstanceDuringMaintenance.add(instance);

Review comment:
       For the first time it enters maintenance mode, there will be no snapshot is stored. It's because maintenance mode signal is processed after refreshLiveInstances(). This way, all nodes will be checked when we first enter maintenance mode. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r501167946



##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -153,16 +185,101 @@ private void updateOfflineHistory(long time) {
     if (list.size() == HISTORY_SIZE) {
       list.remove(0);
     }
-
-    DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss:SSS");
-    df.setTimeZone(TimeZone.getTimeZone("UTC"));
-    String dateTime = df.format(new Date(time));
-
-    list.add(dateTime);
+    list.add(historyDateLongToString(time));
   }
 
   @Override
   public boolean isValid() {
     return true;
   }
+
+  /*
+   * Parses a history date in string format to its millisecond representation.
+   * Returns -1 if parsing fails.
+   */
+  private static long historyDateStringToLong(String dateString) {
+    SimpleDateFormat simpleDateFormat = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
+    try {
+      Date date = simpleDateFormat.parse(dateString);
+      return date.getTime();
+    } catch (ParseException e) {
+      LOG.warn("Failed to parse participant history date string: " + dateString);
+      return -1;
+    }
+  }
+
+  /*
+   * Parses a history date in millisecond to string.
+   */
+  private static String historyDateLongToString(long dateLong) {
+    DateFormat df = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    df.setTimeZone(TimeZone.getTimeZone("UTC"));
+    return df.format(new Date(dateLong));
+  }
+
+  /**
+   * Parses the session entry map that has been converted to string back to a map.
+   * NOTE TO CALLER: This assumes the divider between entries is ", " and the divider between
+   * key/value is "="; if the string is malformed, parsing correctness is not guaranteed. Always
+   * check if a key is contained before using the key.
+   */
+  public static Map<String, String> parseSessionHistoryStringToMap(String sessionHistoryString) {
+    sessionHistoryString = sessionHistoryString.substring(1, sessionHistoryString.length() - 1);
+    Map<String, String> sessionHistoryMap = new HashMap<>();
+
+    for (String sessionHistoryKeyValuePair : sessionHistoryString.split(", ")) {
+      String[] keyValuePair = sessionHistoryKeyValuePair.split("=");
+      if (keyValuePair.length < 2) {
+        LOG.warn("Ignore key value pair while parsing session history due to missing '=': " +
+            sessionHistoryKeyValuePair);
+        continue;
+      }
+      sessionHistoryMap.put(keyValuePair[0], keyValuePair[1]);
+    }
+
+    return sessionHistoryMap;
+  }
+
+  /*
+   * Take a string session history entry and extract the TIME field out of it. Return -1 if the TIME
+   * field doesn't exist or if the TIME field cannot be parsed to a long.
+   */
+  private static long extractTimeFromSessionHistoryString(String sessionHistoryString) {

Review comment:
       Ok.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r496291271



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +756,41 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    PropertyKey.Builder keyBuilder = accessor.keyBuilder();
+    PropertyKey propertyKey = keyBuilder.participantHistory(instance);
+    ParticipantHistory history = accessor.getProperty(propertyKey);
+
+    // Cannot use _instanceOfflineTimeMap. _instanceOfflineTimeMap is last updated in the previous
+    // pipeline execution; it is possible a new offline timestamp is updated after the previous
+    // pipeline execution, so we need the most updated timestamp.
+    long lastOfflineTime = history.getLastTimeInOfflineHistory();
+    // lastOfflineTime is only negative if there is no offline history or the time format is wrong.
+    // Since this instance is a 'new' live instance, not having offline history = first time created
+    // instance; during maintenance mode, no partition will be assigned to such a new instance,
+    // therefore it's okay to no time it out. The wrong format case shouldn't happen at all and will
+    // not be handled either.
+    if (lastOfflineTime < 0) {
+      return false;
+    }
+
+    long currentTime = System.currentTimeMillis();
+    if (currentTime - lastOfflineTime > timeOutWindow) {
+      LogUtil.logWarn(logger, getClusterEventId(), String.format(
+          "During maintenance mode, instance %s is timed-out due to its offline time. Current time: "

Review comment:
       Why is that? This calculation is only done during the maintenance mode. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r497213697



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -253,9 +254,34 @@ private void refreshIdealState(final HelixDataAccessor accessor,
   private void refreshLiveInstances(final HelixDataAccessor accessor,
       Set<HelixConstants.ChangeType> refreshedType) {
     if (_propertyDataChangedMap.get(HelixConstants.ChangeType.LIVE_INSTANCE).getAndSet(false)) {
+      // Keep a copy of old live instances in case of maintenance mode
+      Map<String, LiveInstance> oldLiveInstances = getLiveInstances();
       _liveInstanceCache.refresh(accessor);
       _updateInstanceOfflineTime = true;
       refreshedType.add(HelixConstants.ChangeType.LIVE_INSTANCE);
+
+      // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+      // for timed-out nodes
+      long timeOutWindow = _clusterConfig.getMaintenanceOfflineNodeTimeOut();
+      if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+        for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+          // For every 'new' live node, check if it's timed-out
+          if (!oldLiveInstances.containsKey(instance) && isInstanceTimedOutDuringMaintenance(

Review comment:
       Come to think about it, it's not fine: if `oldLiveInstances` is lost, every node is subject to timeout check; however, the timeout check only regards the node's last offline time. That means if a node is online for 8 weeks and last offline time is 9 weeks ago, we will still mark it as timed_out. That would be incorrect. Let me think about this case and update. @jiajunwang 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r496290055



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -253,9 +254,34 @@ private void refreshIdealState(final HelixDataAccessor accessor,
   private void refreshLiveInstances(final HelixDataAccessor accessor,
       Set<HelixConstants.ChangeType> refreshedType) {
     if (_propertyDataChangedMap.get(HelixConstants.ChangeType.LIVE_INSTANCE).getAndSet(false)) {
+      // Keep a copy of old live instances in case of maintenance mode
+      Map<String, LiveInstance> oldLiveInstances = getLiveInstances();
       _liveInstanceCache.refresh(accessor);
       _updateInstanceOfflineTime = true;
       refreshedType.add(HelixConstants.ChangeType.LIVE_INSTANCE);
+
+      // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+      // for timed-out nodes
+      long timeOutWindow = _clusterConfig.getMaintenanceOfflineNodeTimeOut();
+      if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+        for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+          // For every 'new' live node, check if it's timed-out
+          if (!oldLiveInstances.containsKey(instance) && isInstanceTimedOutDuringMaintenance(

Review comment:
       Could you elaborate why there would be race conditions? The cache refresh is synchronously computed in the pipeline main thread; rebalancing is calculated on the same pipeline. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r496292744



##########
File path: helix-core/src/test/java/org/apache/helix/model/TestClusterConfig.java
##########
@@ -298,6 +298,26 @@ public void testAsyncGlobalRebalanceOption() {
             false), true);
   }
 
+  @Test
+  public void testGetMaintenanceOfflineNodeTimeOut() {

Review comment:
       Sounds good. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r501166433



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +790,58 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,

Review comment:
       It's isInstanceTimedOut, which is grammatically correct. I'd prefer is over should. 

##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -153,16 +185,101 @@ private void updateOfflineHistory(long time) {
     if (list.size() == HISTORY_SIZE) {
       list.remove(0);
     }
-
-    DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss:SSS");
-    df.setTimeZone(TimeZone.getTimeZone("UTC"));
-    String dateTime = df.format(new Date(time));
-
-    list.add(dateTime);
+    list.add(historyDateLongToString(time));
   }
 
   @Override
   public boolean isValid() {
     return true;
   }
+
+  /*
+   * Parses a history date in string format to its millisecond representation.
+   * Returns -1 if parsing fails.
+   */
+  private static long historyDateStringToLong(String dateString) {
+    SimpleDateFormat simpleDateFormat = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
+    try {
+      Date date = simpleDateFormat.parse(dateString);
+      return date.getTime();
+    } catch (ParseException e) {
+      LOG.warn("Failed to parse participant history date string: " + dateString);
+      return -1;
+    }
+  }
+
+  /*
+   * Parses a history date in millisecond to string.
+   */
+  private static String historyDateLongToString(long dateLong) {
+    DateFormat df = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    df.setTimeZone(TimeZone.getTimeZone("UTC"));
+    return df.format(new Date(dateLong));
+  }
+
+  /**
+   * Parses the session entry map that has been converted to string back to a map.
+   * NOTE TO CALLER: This assumes the divider between entries is ", " and the divider between
+   * key/value is "="; if the string is malformed, parsing correctness is not guaranteed. Always
+   * check if a key is contained before using the key.
+   */
+  public static Map<String, String> parseSessionHistoryStringToMap(String sessionHistoryString) {

Review comment:
       Ok. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500668848



##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -153,16 +185,99 @@ private void updateOfflineHistory(long time) {
     if (list.size() == HISTORY_SIZE) {
       list.remove(0);
     }
-
-    DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss:SSS");
-    df.setTimeZone(TimeZone.getTimeZone("UTC"));
-    String dateTime = df.format(new Date(time));
-
-    list.add(dateTime);
+    list.add(historyDateLongToString(time));
   }
 
   @Override
   public boolean isValid() {
     return true;
   }
+
+  /*
+   * Parses a history date in string format to its millisecond representation.
+   * Returns -1 if parsing fails.
+   */
+  private static long historyDateStringToLong(String dateString) {
+    SimpleDateFormat simpleDateFormat = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
+    try {
+      Date date = simpleDateFormat.parse(dateString);
+      return date.getTime();
+    } catch (ParseException e) {
+      LOG.warn("Failed to parse participant history date string: " + dateString);
+      return -1;
+    }
+  }
+
+  /*
+   * Parses a history date in millisecond to string.
+   */
+  private static String historyDateLongToString(long dateLong) {
+    DateFormat df = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    df.setTimeZone(TimeZone.getTimeZone("UTC"));
+    return df.format(new Date(dateLong));
+  }
+
+  /**
+   * Parses the session entry map that has been converted to string back to a map.
+   * NOTE TO CALLER: This assumes the divider between entries is ", " and the divider between
+   * key/value is "="; if the string is malformed, parsing correctness is not guaranteed. Always
+   * check if a key is contained before using the key.
+   */
+  public static Map<String, String> parseSessionHistoryStringToMap(String sessionHistoryString) {
+    sessionHistoryString = sessionHistoryString.substring(1, sessionHistoryString.length() - 1);
+    Map<String, String> sessionHistoryMap = new HashMap<>();
+
+    for (String sessionHistoryKeyValuePair : sessionHistoryString.split(", ")) {
+      String[] keyValuePair = sessionHistoryKeyValuePair.split("=");
+      if (keyValuePair.length < 2) {
+        LOG.warn("Ignore key value pair while parsing session history due to missing '=': " +
+            sessionHistoryKeyValuePair);
+        continue;
+      }
+      sessionHistoryMap.put(keyValuePair[0], keyValuePair[1]);
+    }
+
+    return sessionHistoryMap;
+  }
+
+  /*
+   * Take a string session history entry and extract the TIME field out of it. Return -1 if the TIME
+   * field doesn't exist or if the TIME field cannot be parsed to a long.
+   */
+  private static long extractTimeFromSessionHistoryString(String sessionHistoryString) {
+    Map<String, String> sessionHistoryMap = parseSessionHistoryStringToMap(sessionHistoryString);
+    if (!sessionHistoryMap.containsKey(ConfigProperty.TIME.name())) {
+      return -1;
+    }
+    try {
+      return Long.parseLong(sessionHistoryMap.get(ConfigProperty.TIME.name()));
+    } catch (NumberFormatException e) {
+      return -1;

Review comment:
       Log the error here.

##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -64,6 +69,20 @@ public ParticipantHistory(ZNRecord znRecord) {
     super(znRecord);
   }
 
+  /**
+   * @return The list field for HISTORY
+   */
+  public List<String> getHistory() {
+    return _record.getListField(ConfigProperty.HISTORY.name());
+  }
+
+  /**
+   * @return The list field for OFFLINE
+   */
+  public List<String> getOffline() {
+    return _record.getListField(ConfigProperty.OFFLINE.name());
+  }
+

Review comment:
       I think you don't need these 2 methods to be public?

##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -153,16 +185,99 @@ private void updateOfflineHistory(long time) {
     if (list.size() == HISTORY_SIZE) {
       list.remove(0);
     }
-
-    DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss:SSS");
-    df.setTimeZone(TimeZone.getTimeZone("UTC"));
-    String dateTime = df.format(new Date(time));
-
-    list.add(dateTime);
+    list.add(historyDateLongToString(time));
   }
 
   @Override
   public boolean isValid() {
     return true;
   }
+
+  /*
+   * Parses a history date in string format to its millisecond representation.
+   * Returns -1 if parsing fails.
+   */
+  private static long historyDateStringToLong(String dateString) {
+    SimpleDateFormat simpleDateFormat = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
+    try {
+      Date date = simpleDateFormat.parse(dateString);
+      return date.getTime();
+    } catch (ParseException e) {
+      LOG.warn("Failed to parse participant history date string: " + dateString);
+      return -1;
+    }
+  }
+
+  /*
+   * Parses a history date in millisecond to string.
+   */
+  private static String historyDateLongToString(long dateLong) {
+    DateFormat df = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    df.setTimeZone(TimeZone.getTimeZone("UTC"));
+    return df.format(new Date(dateLong));
+  }
+
+  /**
+   * Parses the session entry map that has been converted to string back to a map.
+   * NOTE TO CALLER: This assumes the divider between entries is ", " and the divider between
+   * key/value is "="; if the string is malformed, parsing correctness is not guaranteed. Always
+   * check if a key is contained before using the key.
+   */
+  public static Map<String, String> parseSessionHistoryStringToMap(String sessionHistoryString) {
+    sessionHistoryString = sessionHistoryString.substring(1, sessionHistoryString.length() - 1);
+    Map<String, String> sessionHistoryMap = new HashMap<>();
+
+    for (String sessionHistoryKeyValuePair : sessionHistoryString.split(", ")) {
+      String[] keyValuePair = sessionHistoryKeyValuePair.split("=");
+      if (keyValuePair.length < 2) {
+        LOG.warn("Ignore key value pair while parsing session history due to missing '=': " +
+            sessionHistoryKeyValuePair);
+        continue;
+      }
+      sessionHistoryMap.put(keyValuePair[0], keyValuePair[1]);
+    }
+
+    return sessionHistoryMap;
+  }
+
+  /*
+   * Take a string session history entry and extract the TIME field out of it. Return -1 if the TIME
+   * field doesn't exist or if the TIME field cannot be parsed to a long.
+   */
+  private static long extractTimeFromSessionHistoryString(String sessionHistoryString) {
+    Map<String, String> sessionHistoryMap = parseSessionHistoryStringToMap(sessionHistoryString);
+    if (!sessionHistoryMap.containsKey(ConfigProperty.TIME.name())) {
+      return -1;
+    }
+    try {
+      return Long.parseLong(sessionHistoryMap.get(ConfigProperty.TIME.name()));
+    } catch (NumberFormatException e) {
+      return -1;
+    }
+  }
+
+  /**
+   * For each entry in History, return its millisecond timestamp; for timestamps that cannot be
+   * parsed, skip them.
+   */
+  public List<Long> getHistoryTimestampsAsMilliseconds() {

Review comment:
       Shall we just call it getOnlineTimestampsAsMilliseconds?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500445155



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -299,6 +306,33 @@ private void updateMaintenanceInfo(final HelixDataAccessor accessor) {
     // The following flag is to guarantee that there's only one update per pineline run because we
     // check for whether maintenance recovery could happen twice every pipeline
     _hasMaintenanceSignalChanged = false;
+
+    // If maintenance mode has exited, clear cached timed-out nodes
+    if (!_isMaintenanceModeEnabled) {
+      _timedOutInstanceDuringMaintenance.clear();
+    }
+  }
+
+  private void timeoutNodesDuringMaintenance(final HelixDataAccessor accessor) {
+    // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+    // for timed-out nodes
+    long timeOutWindow = -1;
+    if (_clusterConfig != null) {
+      timeOutWindow = _clusterConfig.getOfflineNodeTimeOutForMaintenanceMode();
+    }
+    if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+      for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+        // 1. Check timed-out cache and don't do repeated work;
+        // 2. Check for nodes that didn't exist in the last iteration, because it has been checked;
+        // 3. For all other nodes, check if it's timed-out.
+        // When maintenance mode is first entered, all nodes will be checked as a result.
+        if (!_timedOutInstanceDuringMaintenance.contains(instance)
+            && !_liveInstanceSnapshotForMaintenance.containsKey(instance)
+            && isInstanceTimedOutDuringMaintenance(accessor, instance, timeOutWindow)) {
+          _timedOutInstanceDuringMaintenance.add(instance);

Review comment:
       _liveInstanceSnapshotForMaintenance will be refreshed in the beginning of every pipeline, it contains all liveInstances (including these that should be timeout-ed), right? Say, if an (long-offline) instance comes back after the first pipeline before the next pipeline, that instance will be included in the _liveInstanceSnapshotForMaintenance and won't be checked here?   I.e, is _liveInstanceSnapshotForMaintenance 
    always equal to getLiveInstances()?  If it is, what is point of keeping a separate cache?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499891686



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -428,13 +463,32 @@ public synchronized void setIdealStates(List<IdealState> idealStates) {
   }
 
   /**
-   * Returns the LiveInstances for each of the instances that are curretnly up and running
+   * Returns the LiveInstances for each of the instances that are currently up and running
    * @return
    */
   public Map<String, LiveInstance> getLiveInstances() {

Review comment:
       Now getLiveInstances will default to excluding timed-out ones. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r496292864



##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -40,7 +41,9 @@
   private static Logger LOG = LoggerFactory.getLogger(ParticipantHistory.class);
 
   private final static int HISTORY_SIZE = 20;
-  private enum ConfigProperty {
+  final static String HISTORY_DATE_FORMAT = "yyyy-MM-dd'T'HH:mm:ss:SSS";

Review comment:
       That's a great idea; I'll just redefine it in test. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r496292467



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +756,41 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    PropertyKey.Builder keyBuilder = accessor.keyBuilder();
+    PropertyKey propertyKey = keyBuilder.participantHistory(instance);
+    ParticipantHistory history = accessor.getProperty(propertyKey);
+
+    // Cannot use _instanceOfflineTimeMap. _instanceOfflineTimeMap is last updated in the previous
+    // pipeline execution; it is possible a new offline timestamp is updated after the previous
+    // pipeline execution, so we need the most updated timestamp.
+    long lastOfflineTime = history.getLastTimeInOfflineHistory();

Review comment:
       1. Sounds good;
   2. I will add that in the javadoc, what do you think? I don't think it should be strictly enforced. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] dasahcc commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
dasahcc commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499044458



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +782,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }
+      }
+    }
+    if (onlineTimestamps.isEmpty() || onlineTimestamps.get(0) > _maintenanceSignal.getTimestamp()) {
+      // Node is a new node in this maintenance mode, no need to time out since no partitions are
+      // assigned to it
+      return false;
+    }
+
+    // Parse offline timestamps from offline list
+    List<Long> offlineTimestamps = new ArrayList<>();
+    List<String> historyOfflineList = history.getOffline();
+    if (historyOfflineList != null) {
+      long tailOfflineTime;
+      for (int i = historyOfflineList.size() - 1; i >= 0; i--) {
+        tailOfflineTime =
+            ParticipantHistory.parseHistoryDateStringToLong(historyOfflineList.get(i));
+        // Ignore bad format case
+        if (tailOfflineTime != -1) {
+          if (tailOfflineTime <= onlineTimestamps.get(0)) {
+            break;
+          }
+          offlineTimestamps.add(0, tailOfflineTime);
+        }
+      }
+    }
+
+    // At this point, onlineTimestamps contains at least 1 timestamp that's before maintenance mode;
+    // offlineTimestamps contains 0+ timestamp that's > the first online timestamp.
+    if (!offlineTimestamps.isEmpty()
+        && offlineTimestamps.get(offlineTimestamps.size() - 1) > onlineTimestamps
+        .get(onlineTimestamps.size() - 1)) {
+      // Make sure for each offline timestamp, 2 online timestamps creates a range around it;
+      // this is usually true unless the timestamps are malformed (missing online timestamps)
+      onlineTimestamps.add(currentTime);
+    }
+
+    // Hop between each pair of online timestamps and find the maximum offline window
+    for (int onlineTimestampIndex = 1, offlineTimestampIndex = 0;
+        onlineTimestampIndex < onlineTimestamps.size() && offlineTimestampIndex < offlineTimestamps
+            .size(); onlineTimestampIndex++) {
+      long onlineTimestamp = onlineTimestamps.get(onlineTimestampIndex);
+      long offlineTimeStamp = offlineTimestamps.get(offlineTimestampIndex);
+
+      // If the offline timestamp isn't within this pair of online timestamp, continue
+      if (offlineTimeStamp > onlineTimestamp) {
+        continue;
+      }
+
+      // Check the largest offline window against the timeout window
+      if (onlineTimestamp - offlineTimeStamp > timeOutWindow) {

Review comment:
       It has the edge case for first interval. It is the interval online time just before maintenance mode, but somehow we dont check whether the offline time is before maintenance.
   
   Let's say if the instance is offline before maintenance we need the potential timeout = online time - Math.max(offlinetimestamp, maintenance entering time);




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500471814



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -299,6 +306,33 @@ private void updateMaintenanceInfo(final HelixDataAccessor accessor) {
     // The following flag is to guarantee that there's only one update per pineline run because we
     // check for whether maintenance recovery could happen twice every pipeline
     _hasMaintenanceSignalChanged = false;
+
+    // If maintenance mode has exited, clear cached timed-out nodes
+    if (!_isMaintenanceModeEnabled) {
+      _timedOutInstanceDuringMaintenance.clear();
+    }
+  }
+
+  private void timeoutNodesDuringMaintenance(final HelixDataAccessor accessor) {
+    // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+    // for timed-out nodes
+    long timeOutWindow = -1;
+    if (_clusterConfig != null) {
+      timeOutWindow = _clusterConfig.getOfflineNodeTimeOutForMaintenanceMode();
+    }
+    if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+      for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+        // 1. Check timed-out cache and don't do repeated work;
+        // 2. Check for nodes that didn't exist in the last iteration, because it has been checked;
+        // 3. For all other nodes, check if it's timed-out.
+        // When maintenance mode is first entered, all nodes will be checked as a result.
+        if (!_timedOutInstanceDuringMaintenance.contains(instance)
+            && !_liveInstanceSnapshotForMaintenance.containsKey(instance)
+            && isInstanceTimedOutDuringMaintenance(accessor, instance, timeOutWindow)) {
+          _timedOutInstanceDuringMaintenance.add(instance);

Review comment:
       No, `_liveInstanceSnapshotForMaintenance` is equal to `getLiveInstances()` from the *last* pipeline. If a long-offline instance coms back after the first pipeline and before the next pipeline, it will *not* be in the snapshot, and will therefore be checked by the timeout logic. Whether this instance fails the check, it will be stored either in the snapshot or in the "timed-out cache", and it will not be checked by the pipeline after. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] dasahcc commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
dasahcc commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r495299701



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -253,9 +254,34 @@ private void refreshIdealState(final HelixDataAccessor accessor,
   private void refreshLiveInstances(final HelixDataAccessor accessor,
       Set<HelixConstants.ChangeType> refreshedType) {
     if (_propertyDataChangedMap.get(HelixConstants.ChangeType.LIVE_INSTANCE).getAndSet(false)) {
+      // Keep a copy of old live instances in case of maintenance mode
+      Map<String, LiveInstance> oldLiveInstances = getLiveInstances();
       _liveInstanceCache.refresh(accessor);
       _updateInstanceOfflineTime = true;
       refreshedType.add(HelixConstants.ChangeType.LIVE_INSTANCE);
+
+      // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+      // for timed-out nodes
+      long timeOutWindow = _clusterConfig.getMaintenanceOfflineNodeTimeOut();
+      if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+        for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+          // For every 'new' live node, check if it's timed-out
+          if (!oldLiveInstances.containsKey(instance) && isInstanceTimedOutDuringMaintenance(

Review comment:
       I think it may not be the place to do that. We can do it in maintenance recovery stage. I cannot remember the details of how to move the last offline time to history and set -1. It happened at cache refresh as well. So I think it is better to make it after cache refresh fully complete.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r501165058



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -333,6 +370,7 @@ public synchronized void refresh(HelixDataAccessor accessor) {
     _stateModelDefinitionCache.refresh(accessor);
     _clusterConstraintsCache.refresh(accessor);
     updateMaintenanceInfo(accessor);
+    timeoutNodesDuringMaintenance(accessor, _clusterConfig, _isMaintenanceModeEnabled);

Review comment:
       Yes, because we want this step to be after updateMaintenanceInfo(), which makes the most logical sense. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r496292296



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +756,41 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    PropertyKey.Builder keyBuilder = accessor.keyBuilder();
+    PropertyKey propertyKey = keyBuilder.participantHistory(instance);
+    ParticipantHistory history = accessor.getProperty(propertyKey);
+
+    // Cannot use _instanceOfflineTimeMap. _instanceOfflineTimeMap is last updated in the previous
+    // pipeline execution; it is possible a new offline timestamp is updated after the previous
+    // pipeline execution, so we need the most updated timestamp.
+    long lastOfflineTime = history.getLastTimeInOfflineHistory();
+    // lastOfflineTime is only negative if there is no offline history or the time format is wrong.
+    // Since this instance is a 'new' live instance, not having offline history = first time created
+    // instance; during maintenance mode, no partition will be assigned to such a new instance,
+    // therefore it's okay to no time it out. The wrong format case shouldn't happen at all and will
+    // not be handled either.
+    if (lastOfflineTime < 0) {

Review comment:
       This isn't the actual `LAST_OFFLINE_TIME` field in History; I can understand the confusion due to variable name. 
   
   I implemented a new function called `getLastTimeInOfflineHistory()` where the latest date in `OFFLINE` is returned. In the case when that cannot be returned, -1 will be the error indicator; you could refer the comment above for more detail. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500755468



##########
File path: helix-core/src/test/java/org/apache/helix/integration/controller/TestOfflineNodeTimeoutDuringMaintenanceMode.java
##########
@@ -0,0 +1,242 @@
+package org.apache.helix.integration.controller;

Review comment:
       Do we have an end-to-end integration test here, such as verify that a timeout-ed instance during maintenance mode will not get assigned any partitions (received any state transition messages)?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500603383



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -253,6 +257,10 @@ private void refreshIdealState(final HelixDataAccessor accessor,
   private void refreshLiveInstances(final HelixDataAccessor accessor,
       Set<HelixConstants.ChangeType> refreshedType) {
     if (_propertyDataChangedMap.get(HelixConstants.ChangeType.LIVE_INSTANCE).getAndSet(false)) {
+      // Keep a snapshot of old live instances for maintenance mode
+      if (isMaintenanceModeEnabled()) {

Review comment:
       Discussed offline. This variable is removed and replaced by the cached result of live instances excluding timed-out ones. That way the intention is more clear. 

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -299,6 +307,40 @@ private void updateMaintenanceInfo(final HelixDataAccessor accessor) {
     // The following flag is to guarantee that there's only one update per pineline run because we
     // check for whether maintenance recovery could happen twice every pipeline
     _hasMaintenanceSignalChanged = false;
+
+    // If maintenance mode has exited, clear cached timed-out nodes
+    if (!_isMaintenanceModeEnabled) {
+      _timedOutInstanceDuringMaintenance.clear();
+      _liveInstanceSnapshotForMaintenance.clear();
+    }
+  }
+
+  private void timeoutNodesDuringMaintenance(final HelixDataAccessor accessor) {
+    // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+    // for timed-out nodes
+    long timeOutWindow = -1;
+    if (_clusterConfig != null) {
+      timeOutWindow = _clusterConfig.getOfflineNodeTimeOutForMaintenanceMode();
+    }
+    if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {

Review comment:
       Done. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r501378507



##########
File path: helix-core/src/test/java/org/apache/helix/integration/controller/TestOfflineNodeTimeoutDuringMaintenanceMode.java
##########
@@ -0,0 +1,242 @@
+package org.apache.helix.integration.controller;

Review comment:
       It's better to include an integration test to cover the real case in mvn test. CHO is not available to everyone.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r496291084



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -253,9 +254,34 @@ private void refreshIdealState(final HelixDataAccessor accessor,
   private void refreshLiveInstances(final HelixDataAccessor accessor,
       Set<HelixConstants.ChangeType> refreshedType) {
     if (_propertyDataChangedMap.get(HelixConstants.ChangeType.LIVE_INSTANCE).getAndSet(false)) {
+      // Keep a copy of old live instances in case of maintenance mode
+      Map<String, LiveInstance> oldLiveInstances = getLiveInstances();
       _liveInstanceCache.refresh(accessor);
       _updateInstanceOfflineTime = true;
       refreshedType.add(HelixConstants.ChangeType.LIVE_INSTANCE);
+
+      // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+      // for timed-out nodes
+      long timeOutWindow = _clusterConfig.getMaintenanceOfflineNodeTimeOut();
+      if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+        for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+          // For every 'new' live node, check if it's timed-out
+          if (!oldLiveInstances.containsKey(instance) && isInstanceTimedOutDuringMaintenance(

Review comment:
       `oldLiveInstances` is for performance only. Losing it during leadership switch is still fine. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500654872



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +782,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }
+      }
+    }
+    if (onlineTimestamps.isEmpty() || onlineTimestamps.get(0) > _maintenanceSignal.getTimestamp()) {
+      // Node is a new node in this maintenance mode, no need to time out since no partitions are
+      // assigned to it
+      return false;
+    }
+
+    // Parse offline timestamps from offline list
+    List<Long> offlineTimestamps = new ArrayList<>();
+    List<String> historyOfflineList = history.getOffline();
+    if (historyOfflineList != null) {
+      long tailOfflineTime;
+      for (int i = historyOfflineList.size() - 1; i >= 0; i--) {
+        tailOfflineTime =
+            ParticipantHistory.parseHistoryDateStringToLong(historyOfflineList.get(i));
+        // Ignore bad format case
+        if (tailOfflineTime != -1) {
+          if (tailOfflineTime <= onlineTimestamps.get(0)) {
+            break;
+          }
+          offlineTimestamps.add(0, tailOfflineTime);
+        }
+      }
+    }
+
+    // At this point, onlineTimestamps contains at least 1 timestamp that's before maintenance mode;
+    // offlineTimestamps contains 0+ timestamp that's > the first online timestamp.

Review comment:
       The algorithm has been reworked to look less complicated. Its essence remains - all mentioned edge cases are taken care of. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499822985



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +782,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }
+      }
+    }
+    if (onlineTimestamps.isEmpty() || onlineTimestamps.get(0) > _maintenanceSignal.getTimestamp()) {
+      // Node is a new node in this maintenance mode, no need to time out since no partitions are
+      // assigned to it
+      return false;
+    }
+
+    // Parse offline timestamps from offline list
+    List<Long> offlineTimestamps = new ArrayList<>();
+    List<String> historyOfflineList = history.getOffline();
+    if (historyOfflineList != null) {
+      long tailOfflineTime;
+      for (int i = historyOfflineList.size() - 1; i >= 0; i--) {
+        tailOfflineTime =
+            ParticipantHistory.parseHistoryDateStringToLong(historyOfflineList.get(i));
+        // Ignore bad format case
+        if (tailOfflineTime != -1) {
+          if (tailOfflineTime <= onlineTimestamps.get(0)) {
+            break;
+          }
+          offlineTimestamps.add(0, tailOfflineTime);
+        }
+      }
+    }
+
+    // At this point, onlineTimestamps contains at least 1 timestamp that's before maintenance mode;
+    // offlineTimestamps contains 0+ timestamp that's > the first online timestamp.
+    if (!offlineTimestamps.isEmpty()
+        && offlineTimestamps.get(offlineTimestamps.size() - 1) > onlineTimestamps
+        .get(onlineTimestamps.size() - 1)) {
+      // Make sure for each offline timestamp, 2 online timestamps creates a range around it;
+      // this is usually true unless the timestamps are malformed (missing online timestamps)
+      onlineTimestamps.add(currentTime);
+    }
+
+    // Hop between each pair of online timestamps and find the maximum offline window
+    for (int onlineTimestampIndex = 1, offlineTimestampIndex = 0;
+        onlineTimestampIndex < onlineTimestamps.size() && offlineTimestampIndex < offlineTimestamps
+            .size(); onlineTimestampIndex++) {
+      long onlineTimestamp = onlineTimestamps.get(onlineTimestampIndex);
+      long offlineTimeStamp = offlineTimestamps.get(offlineTimestampIndex);
+
+      // If the offline timestamp isn't within this pair of online timestamp, continue
+      if (offlineTimeStamp > onlineTimestamp) {
+        continue;
+      }
+
+      // Check the largest offline window against the timeout window
+      if (onlineTimestamp - offlineTimeStamp > timeOutWindow) {

Review comment:
       Do we really need to go through every pair? Can we just look at the pair with offline time before maintenance entrance and online time after it? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499821463



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +782,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }
+      }
+    }
+    if (onlineTimestamps.isEmpty() || onlineTimestamps.get(0) > _maintenanceSignal.getTimestamp()) {
+      // Node is a new node in this maintenance mode, no need to time out since no partitions are
+      // assigned to it
+      return false;
+    }
+
+    // Parse offline timestamps from offline list
+    List<Long> offlineTimestamps = new ArrayList<>();
+    List<String> historyOfflineList = history.getOffline();
+    if (historyOfflineList != null) {
+      long tailOfflineTime;
+      for (int i = historyOfflineList.size() - 1; i >= 0; i--) {
+        tailOfflineTime =
+            ParticipantHistory.parseHistoryDateStringToLong(historyOfflineList.get(i));
+        // Ignore bad format case
+        if (tailOfflineTime != -1) {
+          if (tailOfflineTime <= onlineTimestamps.get(0)) {
+            break;
+          }
+          offlineTimestamps.add(0, tailOfflineTime);
+        }
+      }

Review comment:
       Similar here, the logic should belong to ParticipantHistory, such as getAllOfflineHistory().




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r501257422



##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -153,16 +185,99 @@ private void updateOfflineHistory(long time) {
     if (list.size() == HISTORY_SIZE) {
       list.remove(0);
     }
-
-    DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss:SSS");
-    df.setTimeZone(TimeZone.getTimeZone("UTC"));
-    String dateTime = df.format(new Date(time));
-
-    list.add(dateTime);
+    list.add(historyDateLongToString(time));
   }
 
   @Override
   public boolean isValid() {
     return true;
   }
+
+  /*
+   * Parses a history date in string format to its millisecond representation.
+   * Returns -1 if parsing fails.
+   */
+  private static long historyDateStringToLong(String dateString) {
+    SimpleDateFormat simpleDateFormat = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
+    try {
+      Date date = simpleDateFormat.parse(dateString);
+      return date.getTime();
+    } catch (ParseException e) {
+      LOG.warn("Failed to parse participant history date string: " + dateString);
+      return -1;
+    }
+  }
+
+  /*
+   * Parses a history date in millisecond to string.
+   */
+  private static String historyDateLongToString(long dateLong) {
+    DateFormat df = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    df.setTimeZone(TimeZone.getTimeZone("UTC"));
+    return df.format(new Date(dateLong));
+  }
+
+  /**
+   * Parses the session entry map that has been converted to string back to a map.
+   * NOTE TO CALLER: This assumes the divider between entries is ", " and the divider between
+   * key/value is "="; if the string is malformed, parsing correctness is not guaranteed. Always
+   * check if a key is contained before using the key.
+   */
+  public static Map<String, String> parseSessionHistoryStringToMap(String sessionHistoryString) {
+    sessionHistoryString = sessionHistoryString.substring(1, sessionHistoryString.length() - 1);
+    Map<String, String> sessionHistoryMap = new HashMap<>();
+
+    for (String sessionHistoryKeyValuePair : sessionHistoryString.split(", ")) {
+      String[] keyValuePair = sessionHistoryKeyValuePair.split("=");
+      if (keyValuePair.length < 2) {
+        LOG.warn("Ignore key value pair while parsing session history due to missing '=': " +
+            sessionHistoryKeyValuePair);
+        continue;
+      }
+      sessionHistoryMap.put(keyValuePair[0], keyValuePair[1]);
+    }
+
+    return sessionHistoryMap;
+  }
+
+  /*
+   * Take a string session history entry and extract the TIME field out of it. Return -1 if the TIME
+   * field doesn't exist or if the TIME field cannot be parsed to a long.
+   */
+  private static long extractTimeFromSessionHistoryString(String sessionHistoryString) {
+    Map<String, String> sessionHistoryMap = parseSessionHistoryStringToMap(sessionHistoryString);
+    if (!sessionHistoryMap.containsKey(ConfigProperty.TIME.name())) {
+      return -1;
+    }
+    try {
+      return Long.parseLong(sessionHistoryMap.get(ConfigProperty.TIME.name()));
+    } catch (NumberFormatException e) {
+      return -1;
+    }
+  }
+
+  /**
+   * For each entry in History, return its millisecond timestamp; for timestamps that cannot be
+   * parsed, skip them.
+   */
+  public List<Long> getHistoryTimestampsAsMilliseconds() {

Review comment:
       But it is not clear to the caller what do you get.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499750407



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -428,13 +463,32 @@ public synchronized void setIdealStates(List<IdealState> idealStates) {
   }
 
   /**
-   * Returns the LiveInstances for each of the instances that are curretnly up and running
+   * Returns the LiveInstances for each of the instances that are currently up and running
    * @return
    */
   public Map<String, LiveInstance> getLiveInstances() {

Review comment:
       In which places we will need  LiveInstances containing timeouted instance?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r501257959



##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -64,6 +69,20 @@ public ParticipantHistory(ZNRecord znRecord) {
     super(znRecord);
   }
 
+  /**
+   * @return The list field for HISTORY
+   */
+  public List<String> getHistory() {
+    return _record.getListField(ConfigProperty.HISTORY.name());
+  }
+
+  /**
+   * @return The list field for OFFLINE
+   */
+  public List<String> getOffline() {
+    return _record.getListField(ConfigProperty.OFFLINE.name());
+  }
+

Review comment:
       1. you have the methods which return the timestamps. The raw string is hard to use and we shall not return it.
   2. integration test can rely on protected methods.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499754517



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -299,6 +306,33 @@ private void updateMaintenanceInfo(final HelixDataAccessor accessor) {
     // The following flag is to guarantee that there's only one update per pineline run because we
     // check for whether maintenance recovery could happen twice every pipeline
     _hasMaintenanceSignalChanged = false;
+
+    // If maintenance mode has exited, clear cached timed-out nodes
+    if (!_isMaintenanceModeEnabled) {
+      _timedOutInstanceDuringMaintenance.clear();
+    }
+  }
+
+  private void timeoutNodesDuringMaintenance(final HelixDataAccessor accessor) {
+    // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+    // for timed-out nodes
+    long timeOutWindow = -1;
+    if (_clusterConfig != null) {
+      timeOutWindow = _clusterConfig.getOfflineNodeTimeOutForMaintenanceMode();
+    }
+    if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+      for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+        // 1. Check timed-out cache and don't do repeated work;
+        // 2. Check for nodes that didn't exist in the last iteration, because it has been checked;
+        // 3. For all other nodes, check if it's timed-out.
+        // When maintenance mode is first entered, all nodes will be checked as a result.
+        if (!_timedOutInstanceDuringMaintenance.contains(instance)
+            && !_liveInstanceSnapshotForMaintenance.containsKey(instance)
+            && isInstanceTimedOutDuringMaintenance(accessor, instance, timeOutWindow)) {
+          _timedOutInstanceDuringMaintenance.add(instance);

Review comment:
       For the first time it enters maintenance mode, there will be no snapshot stored. That's because maintenance mode signal is processed after refreshLiveInstances(). 
   
   Since the snapshot is empty, all nodes will be checked when we first enter maintenance mode. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500578304



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -253,6 +257,10 @@ private void refreshIdealState(final HelixDataAccessor accessor,
   private void refreshLiveInstances(final HelixDataAccessor accessor,
       Set<HelixConstants.ChangeType> refreshedType) {
     if (_propertyDataChangedMap.get(HelixConstants.ChangeType.LIVE_INSTANCE).getAndSet(false)) {
+      // Keep a snapshot of old live instances for maintenance mode
+      if (isMaintenanceModeEnabled()) {

Review comment:
       This is intentional. When maintenance mode is first started, the snapshot is left empty so that all nodes are checked. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang merged pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
jiajunwang merged pull request #1413:
URL: https://github.com/apache/helix/pull/1413


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499749749



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -428,13 +463,32 @@ public synchronized void setIdealStates(List<IdealState> idealStates) {
   }
 
   /**
-   * Returns the LiveInstances for each of the instances that are curretnly up and running
+   * Returns the LiveInstances for each of the instances that are currently up and running
    * @return
    */
   public Map<String, LiveInstance> getLiveInstances() {
-    return _liveInstanceCache.getPropertyMap();
+    return getLiveInstances(false);
+  }
+
+  /**
+   * Returns the LiveInstances for each of the instances that are currently up and running
+   * @param filterTimedOutInstances - Only set true during maintenance mode. If true, filter out
+   *                                instances that are timed-out during maintenance mode; instances
+   *                                are timed-out if they have been offline for a while before going
+   *                                live during maintenance mode.
+   * @return
+   */
+  public Map<String, LiveInstance> getLiveInstances(boolean filterTimedOutInstances) {

Review comment:
       Can we cache the results somewhere (we may already have it?) instead of recomputing every time this method is called?  I think getLiveInstances are called in many places during one pipeline.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499821088



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +782,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }

Review comment:
        Can we wrap up the logic here into a method in ParticipantHistory, such as getAllOnlineHistory()?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r502594599



##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -153,16 +185,99 @@ private void updateOfflineHistory(long time) {
     if (list.size() == HISTORY_SIZE) {
       list.remove(0);
     }
-
-    DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss:SSS");
-    df.setTimeZone(TimeZone.getTimeZone("UTC"));
-    String dateTime = df.format(new Date(time));
-
-    list.add(dateTime);
+    list.add(historyDateLongToString(time));
   }
 
   @Override
   public boolean isValid() {
     return true;
   }
+
+  /*
+   * Parses a history date in string format to its millisecond representation.
+   * Returns -1 if parsing fails.
+   */
+  private static long historyDateStringToLong(String dateString) {
+    SimpleDateFormat simpleDateFormat = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
+    try {
+      Date date = simpleDateFormat.parse(dateString);
+      return date.getTime();
+    } catch (ParseException e) {
+      LOG.warn("Failed to parse participant history date string: " + dateString);
+      return -1;
+    }
+  }
+
+  /*
+   * Parses a history date in millisecond to string.
+   */
+  private static String historyDateLongToString(long dateLong) {
+    DateFormat df = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    df.setTimeZone(TimeZone.getTimeZone("UTC"));
+    return df.format(new Date(dateLong));
+  }
+
+  /**
+   * Parses the session entry map that has been converted to string back to a map.
+   * NOTE TO CALLER: This assumes the divider between entries is ", " and the divider between
+   * key/value is "="; if the string is malformed, parsing correctness is not guaranteed. Always
+   * check if a key is contained before using the key.
+   */
+  public static Map<String, String> parseSessionHistoryStringToMap(String sessionHistoryString) {
+    sessionHistoryString = sessionHistoryString.substring(1, sessionHistoryString.length() - 1);
+    Map<String, String> sessionHistoryMap = new HashMap<>();
+
+    for (String sessionHistoryKeyValuePair : sessionHistoryString.split(", ")) {
+      String[] keyValuePair = sessionHistoryKeyValuePair.split("=");
+      if (keyValuePair.length < 2) {
+        LOG.warn("Ignore key value pair while parsing session history due to missing '=': " +
+            sessionHistoryKeyValuePair);
+        continue;
+      }
+      sessionHistoryMap.put(keyValuePair[0], keyValuePair[1]);
+    }
+
+    return sessionHistoryMap;
+  }
+
+  /*
+   * Take a string session history entry and extract the TIME field out of it. Return -1 if the TIME
+   * field doesn't exist or if the TIME field cannot be parsed to a long.
+   */
+  private static long extractTimeFromSessionHistoryString(String sessionHistoryString) {
+    Map<String, String> sessionHistoryMap = parseSessionHistoryStringToMap(sessionHistoryString);
+    if (!sessionHistoryMap.containsKey(ConfigProperty.TIME.name())) {
+      return -1;
+    }
+    try {
+      return Long.parseLong(sessionHistoryMap.get(ConfigProperty.TIME.name()));
+    } catch (NumberFormatException e) {
+      return -1;
+    }
+  }
+
+  /**
+   * For each entry in History, return its millisecond timestamp; for timestamps that cannot be
+   * parsed, skip them.
+   */
+  public List<Long> getHistoryTimestampsAsMilliseconds() {

Review comment:
       A meaningful method name is always better than comments.
   The best code does not need comments. Although it is almost impossible to achieve.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500658828



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -299,6 +306,33 @@ private void updateMaintenanceInfo(final HelixDataAccessor accessor) {
     // The following flag is to guarantee that there's only one update per pineline run because we
     // check for whether maintenance recovery could happen twice every pipeline
     _hasMaintenanceSignalChanged = false;
+
+    // If maintenance mode has exited, clear cached timed-out nodes
+    if (!_isMaintenanceModeEnabled) {
+      _timedOutInstanceDuringMaintenance.clear();
+    }
+  }
+
+  private void timeoutNodesDuringMaintenance(final HelixDataAccessor accessor) {
+    // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+    // for timed-out nodes
+    long timeOutWindow = -1;
+    if (_clusterConfig != null) {
+      timeOutWindow = _clusterConfig.getOfflineNodeTimeOutForMaintenanceMode();
+    }
+    if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+      for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+        // 1. Check timed-out cache and don't do repeated work;
+        // 2. Check for nodes that didn't exist in the last iteration, because it has been checked;
+        // 3. For all other nodes, check if it's timed-out.
+        // When maintenance mode is first entered, all nodes will be checked as a result.
+        if (!_timedOutInstanceDuringMaintenance.contains(instance)
+            && !_liveInstanceSnapshotForMaintenance.containsKey(instance)
+            && isInstanceTimedOutDuringMaintenance(accessor, instance, timeOutWindow)) {
+          _timedOutInstanceDuringMaintenance.add(instance);

Review comment:
       This has been changed. We now rely on the previous calculation result, which makes more sense when reading. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r502129112



##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -153,16 +185,99 @@ private void updateOfflineHistory(long time) {
     if (list.size() == HISTORY_SIZE) {
       list.remove(0);
     }
-
-    DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss:SSS");
-    df.setTimeZone(TimeZone.getTimeZone("UTC"));
-    String dateTime = df.format(new Date(time));
-
-    list.add(dateTime);
+    list.add(historyDateLongToString(time));
   }
 
   @Override
   public boolean isValid() {
     return true;
   }
+
+  /*
+   * Parses a history date in string format to its millisecond representation.
+   * Returns -1 if parsing fails.
+   */
+  private static long historyDateStringToLong(String dateString) {
+    SimpleDateFormat simpleDateFormat = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
+    try {
+      Date date = simpleDateFormat.parse(dateString);
+      return date.getTime();
+    } catch (ParseException e) {
+      LOG.warn("Failed to parse participant history date string: " + dateString);
+      return -1;
+    }
+  }
+
+  /*
+   * Parses a history date in millisecond to string.
+   */
+  private static String historyDateLongToString(long dateLong) {
+    DateFormat df = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    df.setTimeZone(TimeZone.getTimeZone("UTC"));
+    return df.format(new Date(dateLong));
+  }
+
+  /**
+   * Parses the session entry map that has been converted to string back to a map.
+   * NOTE TO CALLER: This assumes the divider between entries is ", " and the divider between
+   * key/value is "="; if the string is malformed, parsing correctness is not guaranteed. Always
+   * check if a key is contained before using the key.
+   */
+  public static Map<String, String> parseSessionHistoryStringToMap(String sessionHistoryString) {
+    sessionHistoryString = sessionHistoryString.substring(1, sessionHistoryString.length() - 1);
+    Map<String, String> sessionHistoryMap = new HashMap<>();
+
+    for (String sessionHistoryKeyValuePair : sessionHistoryString.split(", ")) {
+      String[] keyValuePair = sessionHistoryKeyValuePair.split("=");
+      if (keyValuePair.length < 2) {
+        LOG.warn("Ignore key value pair while parsing session history due to missing '=': " +
+            sessionHistoryKeyValuePair);
+        continue;
+      }
+      sessionHistoryMap.put(keyValuePair[0], keyValuePair[1]);
+    }
+
+    return sessionHistoryMap;
+  }
+
+  /*
+   * Take a string session history entry and extract the TIME field out of it. Return -1 if the TIME
+   * field doesn't exist or if the TIME field cannot be parsed to a long.
+   */
+  private static long extractTimeFromSessionHistoryString(String sessionHistoryString) {
+    Map<String, String> sessionHistoryMap = parseSessionHistoryStringToMap(sessionHistoryString);
+    if (!sessionHistoryMap.containsKey(ConfigProperty.TIME.name())) {
+      return -1;
+    }
+    try {
+      return Long.parseLong(sessionHistoryMap.get(ConfigProperty.TIME.name()));
+    } catch (NumberFormatException e) {
+      return -1;
+    }
+  }
+
+  /**
+   * For each entry in History, return its millisecond timestamp; for timestamps that cannot be
+   * parsed, skip them.
+   */
+  public List<Long> getHistoryTimestampsAsMilliseconds() {

Review comment:
       I understand what you mean, and since I also want to call it getOnlineTimestamps, I made the change. 
   
   Though I don't think the old naming is bad: callers get the History timestamps, as advertised. If they don't know what these are they should read the class. 😃 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r502596688



##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -64,6 +69,20 @@ public ParticipantHistory(ZNRecord znRecord) {
     super(znRecord);
   }
 
+  /**
+   * @return The list field for HISTORY
+   */
+  public List<String> getHistory() {
+    return _record.getListField(ConfigProperty.HISTORY.name());
+  }
+
+  /**
+   * @return The list field for OFFLINE
+   */
+  public List<String> getOffline() {
+    return _record.getListField(ConfigProperty.OFFLINE.name());
+  }
+

Review comment:
       "What would happen then?" Then we add other methods to access this information explicitly. The string return which requires additional parse is a bad idea in general.
   1. What if we change the string format? These methods return string will be non-backward-compatible. And we don't want to restrict ourselves unnecessarily, right?
   2. Like you add the getOnlineTimestamp now, we can add other methods that parse the HISTORY string and return the required information.
   3. The necessity of testing does not require us to make it public. We can make the test in the same package. Or read the field directly from ZNRecord. There are many ways to achieve it and public is the most impactful and costly (maintenance cost) way. I think it is not good option.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on pull request #1413:
URL: https://github.com/apache/helix/pull/1413#issuecomment-706427049


   This PR is ready to be merged, approved by @dasahcc @jiajunwang 
   Final commit message:
   ## Feature: Offline Node Timeout During Maintenance Mode  ##
   We would like to avoid the scenario where nodes that have been offline for a while come back to life during maintenance mode. Solution: if a node has been offline for a certain period of time, the maintenance mode rebalancer will not assign partitions to it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500754161



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +790,58 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,

Review comment:
       minor:  isInstanceTimeout... -> shouldInstanceTimeout...




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499050485



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +782,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }
+      }
+    }
+    if (onlineTimestamps.isEmpty() || onlineTimestamps.get(0) > _maintenanceSignal.getTimestamp()) {
+      // Node is a new node in this maintenance mode, no need to time out since no partitions are
+      // assigned to it
+      return false;
+    }
+
+    // Parse offline timestamps from offline list
+    List<Long> offlineTimestamps = new ArrayList<>();
+    List<String> historyOfflineList = history.getOffline();
+    if (historyOfflineList != null) {
+      long tailOfflineTime;
+      for (int i = historyOfflineList.size() - 1; i >= 0; i--) {
+        tailOfflineTime =
+            ParticipantHistory.parseHistoryDateStringToLong(historyOfflineList.get(i));
+        // Ignore bad format case
+        if (tailOfflineTime != -1) {
+          if (tailOfflineTime <= onlineTimestamps.get(0)) {
+            break;
+          }
+          offlineTimestamps.add(0, tailOfflineTime);
+        }
+      }
+    }
+
+    // At this point, onlineTimestamps contains at least 1 timestamp that's before maintenance mode;
+    // offlineTimestamps contains 0+ timestamp that's > the first online timestamp.
+    if (!offlineTimestamps.isEmpty()
+        && offlineTimestamps.get(offlineTimestamps.size() - 1) > onlineTimestamps
+        .get(onlineTimestamps.size() - 1)) {
+      // Make sure for each offline timestamp, 2 online timestamps creates a range around it;
+      // this is usually true unless the timestamps are malformed (missing online timestamps)
+      onlineTimestamps.add(currentTime);
+    }
+
+    // Hop between each pair of online timestamps and find the maximum offline window
+    for (int onlineTimestampIndex = 1, offlineTimestampIndex = 0;
+        onlineTimestampIndex < onlineTimestamps.size() && offlineTimestampIndex < offlineTimestamps
+            .size(); onlineTimestampIndex++) {
+      long onlineTimestamp = onlineTimestamps.get(onlineTimestampIndex);
+      long offlineTimeStamp = offlineTimestamps.get(offlineTimestampIndex);
+
+      // If the offline timestamp isn't within this pair of online timestamp, continue
+      if (offlineTimeStamp > onlineTimestamp) {
+        continue;
+      }
+
+      // Check the largest offline window against the timeout window
+      if (onlineTimestamp - offlineTimeStamp > timeOutWindow) {

Review comment:
       The first offline time is always after the first online time based on my parsing logic, see above. 
   
   Therefore we always start checking by the second online time, and do (second online time - first offline time), for example. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] dasahcc commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
dasahcc commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r496115113



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -253,9 +254,34 @@ private void refreshIdealState(final HelixDataAccessor accessor,
   private void refreshLiveInstances(final HelixDataAccessor accessor,
       Set<HelixConstants.ChangeType> refreshedType) {
     if (_propertyDataChangedMap.get(HelixConstants.ChangeType.LIVE_INSTANCE).getAndSet(false)) {
+      // Keep a copy of old live instances in case of maintenance mode
+      Map<String, LiveInstance> oldLiveInstances = getLiveInstances();
       _liveInstanceCache.refresh(accessor);
       _updateInstanceOfflineTime = true;
       refreshedType.add(HelixConstants.ChangeType.LIVE_INSTANCE);
+
+      // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+      // for timed-out nodes
+      long timeOutWindow = _clusterConfig.getMaintenanceOfflineNodeTimeOut();
+      if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+        for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+          // For every 'new' live node, check if it's timed-out
+          if (!oldLiveInstances.containsKey(instance) && isInstanceTimedOutDuringMaintenance(

Review comment:
       Then that may not be the snapshot we need. Because, this change could cause race condition and let old machine starting bootstrapping, which breaks our rule.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499853480



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +782,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }

Review comment:
       Please see below. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500605923



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -428,13 +471,30 @@ public synchronized void setIdealStates(List<IdealState> idealStates) {
   }
 
   /**
-   * Returns the LiveInstances for each of the instances that are curretnly up and running
+   * Returns the LiveInstances for each of the instances that are currently up and running
    * @return
    */
   public Map<String, LiveInstance> getLiveInstances() {
+    return getLiveInstances(true);
+  }
+
+  /**
+   * Returns the LiveInstances for each of the instances that are currently up and running
+   * @param excludeTimeoutInstances - Only effective during maintenance mode. If true, filter out
+   *                                instances that are timed-out during maintenance mode; instances
+   *                                are timed-out if they have been offline for a while before going
+   *                                live during maintenance mode.
+   * @return
+   */
+  public Map<String, LiveInstance> getLiveInstances(boolean excludeTimeoutInstances) {

Review comment:
       Discussed offline: function is made private. It can be extended later on. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500753771



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -428,13 +466,37 @@ public synchronized void setIdealStates(List<IdealState> idealStates) {
   }
 
   /**
-   * Returns the LiveInstances for each of the instances that are curretnly up and running
-   * @return
+   * Returns the LiveInstances for each of the instances that are currently up and running,
+   * excluding the instances that are considered offline during maintenance mode. Instances
+   * are timed-out if they have been offline for a while before going live during maintenance mode.
    */
   public Map<String, LiveInstance> getLiveInstances() {
+    return getLiveInstances(true);
+  }
+
+  /**
+   * Returns the LiveInstances for each of the instances that are currently up and running
+   */
+  public Map<String, LiveInstance> getAllLiveInstances() {

Review comment:
       These two methods (getLiveInstances() and getAllLiveInstances() are very confusing to each other.  We would need a better naming here.  We have four type of instances:  All Instances, Live Instances, enabled/disabled Instances, timeout-ed instances.  We have a method called getEnabledLiveInstances(), maybe we can name this new method getNonTimeoutLiveInstances()?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499744772



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -428,13 +463,32 @@ public synchronized void setIdealStates(List<IdealState> idealStates) {
   }
 
   /**
-   * Returns the LiveInstances for each of the instances that are curretnly up and running
+   * Returns the LiveInstances for each of the instances that are currently up and running
    * @return
    */
   public Map<String, LiveInstance> getLiveInstances() {
-    return _liveInstanceCache.getPropertyMap();
+    return getLiveInstances(false);
+  }
+
+  /**
+   * Returns the LiveInstances for each of the instances that are currently up and running
+   * @param filterTimedOutInstances - Only set true during maintenance mode. If true, filter out
+   *                                instances that are timed-out during maintenance mode; instances
+   *                                are timed-out if they have been offline for a while before going
+   *                                live during maintenance mode.
+   * @return
+   */
+  public Map<String, LiveInstance> getLiveInstances(boolean filterTimedOutInstances) {

Review comment:
       filterTimedOutInstances is a bit confusing, it could mean either the result contains only Timeouted Instances or the other way. Suggest to rename it as:  excludeTimeoutInstances.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500754782



##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -153,16 +185,101 @@ private void updateOfflineHistory(long time) {
     if (list.size() == HISTORY_SIZE) {
       list.remove(0);
     }
-
-    DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss:SSS");
-    df.setTimeZone(TimeZone.getTimeZone("UTC"));
-    String dateTime = df.format(new Date(time));
-
-    list.add(dateTime);
+    list.add(historyDateLongToString(time));
   }
 
   @Override
   public boolean isValid() {
     return true;
   }
+
+  /*
+   * Parses a history date in string format to its millisecond representation.
+   * Returns -1 if parsing fails.
+   */
+  private static long historyDateStringToLong(String dateString) {
+    SimpleDateFormat simpleDateFormat = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
+    try {
+      Date date = simpleDateFormat.parse(dateString);
+      return date.getTime();
+    } catch (ParseException e) {
+      LOG.warn("Failed to parse participant history date string: " + dateString);
+      return -1;
+    }
+  }
+
+  /*
+   * Parses a history date in millisecond to string.
+   */
+  private static String historyDateLongToString(long dateLong) {
+    DateFormat df = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    df.setTimeZone(TimeZone.getTimeZone("UTC"));
+    return df.format(new Date(dateLong));
+  }
+
+  /**
+   * Parses the session entry map that has been converted to string back to a map.
+   * NOTE TO CALLER: This assumes the divider between entries is ", " and the divider between
+   * key/value is "="; if the string is malformed, parsing correctness is not guaranteed. Always
+   * check if a key is contained before using the key.
+   */
+  public static Map<String, String> parseSessionHistoryStringToMap(String sessionHistoryString) {
+    sessionHistoryString = sessionHistoryString.substring(1, sessionHistoryString.length() - 1);
+    Map<String, String> sessionHistoryMap = new HashMap<>();
+
+    for (String sessionHistoryKeyValuePair : sessionHistoryString.split(", ")) {
+      String[] keyValuePair = sessionHistoryKeyValuePair.split("=");
+      if (keyValuePair.length < 2) {
+        LOG.warn("Ignore key value pair while parsing session history due to missing '=': " +
+            sessionHistoryKeyValuePair);
+        continue;
+      }
+      sessionHistoryMap.put(keyValuePair[0], keyValuePair[1]);
+    }
+
+    return sessionHistoryMap;
+  }
+
+  /*
+   * Take a string session history entry and extract the TIME field out of it. Return -1 if the TIME
+   * field doesn't exist or if the TIME field cannot be parsed to a long.
+   */
+  private static long extractTimeFromSessionHistoryString(String sessionHistoryString) {

Review comment:
       getTimeFromSessionHistory




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500678399



##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -153,16 +185,99 @@ private void updateOfflineHistory(long time) {
     if (list.size() == HISTORY_SIZE) {
       list.remove(0);
     }
-
-    DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss:SSS");
-    df.setTimeZone(TimeZone.getTimeZone("UTC"));
-    String dateTime = df.format(new Date(time));
-
-    list.add(dateTime);
+    list.add(historyDateLongToString(time));
   }
 
   @Override
   public boolean isValid() {
     return true;
   }
+
+  /*
+   * Parses a history date in string format to its millisecond representation.
+   * Returns -1 if parsing fails.
+   */
+  private static long historyDateStringToLong(String dateString) {
+    SimpleDateFormat simpleDateFormat = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
+    try {
+      Date date = simpleDateFormat.parse(dateString);
+      return date.getTime();
+    } catch (ParseException e) {
+      LOG.warn("Failed to parse participant history date string: " + dateString);
+      return -1;
+    }
+  }
+
+  /*
+   * Parses a history date in millisecond to string.
+   */
+  private static String historyDateLongToString(long dateLong) {
+    DateFormat df = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    df.setTimeZone(TimeZone.getTimeZone("UTC"));
+    return df.format(new Date(dateLong));
+  }
+
+  /**
+   * Parses the session entry map that has been converted to string back to a map.
+   * NOTE TO CALLER: This assumes the divider between entries is ", " and the divider between
+   * key/value is "="; if the string is malformed, parsing correctness is not guaranteed. Always
+   * check if a key is contained before using the key.
+   */
+  public static Map<String, String> parseSessionHistoryStringToMap(String sessionHistoryString) {
+    sessionHistoryString = sessionHistoryString.substring(1, sessionHistoryString.length() - 1);
+    Map<String, String> sessionHistoryMap = new HashMap<>();
+
+    for (String sessionHistoryKeyValuePair : sessionHistoryString.split(", ")) {
+      String[] keyValuePair = sessionHistoryKeyValuePair.split("=");
+      if (keyValuePair.length < 2) {
+        LOG.warn("Ignore key value pair while parsing session history due to missing '=': " +
+            sessionHistoryKeyValuePair);
+        continue;
+      }
+      sessionHistoryMap.put(keyValuePair[0], keyValuePair[1]);
+    }
+
+    return sessionHistoryMap;
+  }
+
+  /*
+   * Take a string session history entry and extract the TIME field out of it. Return -1 if the TIME
+   * field doesn't exist or if the TIME field cannot be parsed to a long.
+   */
+  private static long extractTimeFromSessionHistoryString(String sessionHistoryString) {
+    Map<String, String> sessionHistoryMap = parseSessionHistoryStringToMap(sessionHistoryString);
+    if (!sessionHistoryMap.containsKey(ConfigProperty.TIME.name())) {
+      return -1;
+    }
+    try {
+      return Long.parseLong(sessionHistoryMap.get(ConfigProperty.TIME.name()));
+    } catch (NumberFormatException e) {
+      return -1;
+    }
+  }
+
+  /**
+   * For each entry in History, return its millisecond timestamp; for timestamps that cannot be
+   * parsed, skip them.
+   */
+  public List<Long> getHistoryTimestampsAsMilliseconds() {

Review comment:
       I had this debate too, but I think in the context of ParticipantHistory, the field names should be respected. The field is HISTORY. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499748505



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -299,6 +306,33 @@ private void updateMaintenanceInfo(final HelixDataAccessor accessor) {
     // The following flag is to guarantee that there's only one update per pineline run because we
     // check for whether maintenance recovery could happen twice every pipeline
     _hasMaintenanceSignalChanged = false;
+
+    // If maintenance mode has exited, clear cached timed-out nodes
+    if (!_isMaintenanceModeEnabled) {
+      _timedOutInstanceDuringMaintenance.clear();
+    }
+  }
+
+  private void timeoutNodesDuringMaintenance(final HelixDataAccessor accessor) {
+    // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+    // for timed-out nodes
+    long timeOutWindow = -1;
+    if (_clusterConfig != null) {
+      timeOutWindow = _clusterConfig.getOfflineNodeTimeOutForMaintenanceMode();
+    }
+    if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+      for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+        // 1. Check timed-out cache and don't do repeated work;
+        // 2. Check for nodes that didn't exist in the last iteration, because it has been checked;
+        // 3. For all other nodes, check if it's timed-out.
+        // When maintenance mode is first entered, all nodes will be checked as a result.
+        if (!_timedOutInstanceDuringMaintenance.contains(instance)
+            && !_liveInstanceSnapshotForMaintenance.containsKey(instance)
+            && isInstanceTimedOutDuringMaintenance(accessor, instance, timeOutWindow)) {
+          _timedOutInstanceDuringMaintenance.add(instance);

Review comment:
       There could be some issue with this if a long-offline instance (say InstanceA) came back just after the cluster enter maintenance mode and before the next pipeline started.  InstanceA will be in the _liveInstanceSnapshotForMaintenance set (see line 261), so it will skip the timeout check here. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499855639



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -428,13 +463,32 @@ public synchronized void setIdealStates(List<IdealState> idealStates) {
   }
 
   /**
-   * Returns the LiveInstances for each of the instances that are curretnly up and running
+   * Returns the LiveInstances for each of the instances that are currently up and running
    * @return
    */
   public Map<String, LiveInstance> getLiveInstances() {

Review comment:
       Actually I'm rethinking this. There could be some external view issue. Will update. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499755286



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -428,13 +463,32 @@ public synchronized void setIdealStates(List<IdealState> idealStates) {
   }
 
   /**
-   * Returns the LiveInstances for each of the instances that are curretnly up and running
+   * Returns the LiveInstances for each of the instances that are currently up and running
    * @return
    */
   public Map<String, LiveInstance> getLiveInstances() {
-    return _liveInstanceCache.getPropertyMap();
+    return getLiveInstances(false);
+  }
+
+  /**
+   * Returns the LiveInstances for each of the instances that are currently up and running
+   * @param filterTimedOutInstances - Only set true during maintenance mode. If true, filter out
+   *                                instances that are timed-out during maintenance mode; instances
+   *                                are timed-out if they have been offline for a while before going
+   *                                live during maintenance mode.
+   * @return
+   */
+  public Map<String, LiveInstance> getLiveInstances(boolean filterTimedOutInstances) {

Review comment:
       Makes sense. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r495305614



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -253,9 +254,34 @@ private void refreshIdealState(final HelixDataAccessor accessor,
   private void refreshLiveInstances(final HelixDataAccessor accessor,
       Set<HelixConstants.ChangeType> refreshedType) {
     if (_propertyDataChangedMap.get(HelixConstants.ChangeType.LIVE_INSTANCE).getAndSet(false)) {
+      // Keep a copy of old live instances in case of maintenance mode
+      Map<String, LiveInstance> oldLiveInstances = getLiveInstances();
       _liveInstanceCache.refresh(accessor);
       _updateInstanceOfflineTime = true;
       refreshedType.add(HelixConstants.ChangeType.LIVE_INSTANCE);
+
+      // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+      // for timed-out nodes
+      long timeOutWindow = _clusterConfig.getMaintenanceOfflineNodeTimeOut();
+      if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+        for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+          // For every 'new' live node, check if it's timed-out
+          if (!oldLiveInstances.containsKey(instance) && isInstanceTimedOutDuringMaintenance(

Review comment:
       It needs to be here. The reason is that we need the old snapshot; see line 258. Once refresh happens, the old snapshot will be lost, and we wouldn't be able to compute the 'new' live nodes delta. 
   
   In terms of history update, it's updateOfflineInstanceHistory() down below. You can see that I also handled that case. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499855767



##########
File path: helix-core/src/main/java/org/apache/helix/controller/stages/MessageGenerationPhase.java
##########
@@ -89,7 +89,7 @@ protected void processEvent(ClusterEvent event, ResourcesStateMap resourcesState
           + ". Requires HelixManager|DataCache|RESOURCES|CURRENT_STATE|INTERMEDIATE_STATE");
     }
 
-    Map<String, LiveInstance> liveInstances = cache.getLiveInstances();
+    Map<String, LiveInstance> liveInstances = cache.getLiveInstances(true);

Review comment:
       Duplicate discussion with the above comment. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499822372



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +782,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }
+      }
+    }
+    if (onlineTimestamps.isEmpty() || onlineTimestamps.get(0) > _maintenanceSignal.getTimestamp()) {
+      // Node is a new node in this maintenance mode, no need to time out since no partitions are
+      // assigned to it
+      return false;
+    }
+
+    // Parse offline timestamps from offline list
+    List<Long> offlineTimestamps = new ArrayList<>();
+    List<String> historyOfflineList = history.getOffline();
+    if (historyOfflineList != null) {
+      long tailOfflineTime;
+      for (int i = historyOfflineList.size() - 1; i >= 0; i--) {
+        tailOfflineTime =
+            ParticipantHistory.parseHistoryDateStringToLong(historyOfflineList.get(i));
+        // Ignore bad format case
+        if (tailOfflineTime != -1) {
+          if (tailOfflineTime <= onlineTimestamps.get(0)) {
+            break;
+          }
+          offlineTimestamps.add(0, tailOfflineTime);
+        }
+      }
+    }
+
+    // At this point, onlineTimestamps contains at least 1 timestamp that's before maintenance mode;
+    // offlineTimestamps contains 0+ timestamp that's > the first online timestamp.

Review comment:
       This may not be true, for example, a newly added host?  (I.e, our code should be able to handle these corner cases properly).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500754642



##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -153,16 +185,101 @@ private void updateOfflineHistory(long time) {
     if (list.size() == HISTORY_SIZE) {
       list.remove(0);
     }
-
-    DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss:SSS");
-    df.setTimeZone(TimeZone.getTimeZone("UTC"));
-    String dateTime = df.format(new Date(time));
-
-    list.add(dateTime);
+    list.add(historyDateLongToString(time));
   }
 
   @Override
   public boolean isValid() {
     return true;
   }
+
+  /*
+   * Parses a history date in string format to its millisecond representation.
+   * Returns -1 if parsing fails.
+   */
+  private static long historyDateStringToLong(String dateString) {
+    SimpleDateFormat simpleDateFormat = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
+    try {
+      Date date = simpleDateFormat.parse(dateString);
+      return date.getTime();
+    } catch (ParseException e) {
+      LOG.warn("Failed to parse participant history date string: " + dateString);
+      return -1;
+    }
+  }
+
+  /*
+   * Parses a history date in millisecond to string.
+   */
+  private static String historyDateLongToString(long dateLong) {
+    DateFormat df = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    df.setTimeZone(TimeZone.getTimeZone("UTC"));
+    return df.format(new Date(dateLong));
+  }
+
+  /**
+   * Parses the session entry map that has been converted to string back to a map.
+   * NOTE TO CALLER: This assumes the divider between entries is ", " and the divider between
+   * key/value is "="; if the string is malformed, parsing correctness is not guaranteed. Always
+   * check if a key is contained before using the key.
+   */
+  public static Map<String, String> parseSessionHistoryStringToMap(String sessionHistoryString) {

Review comment:
       Let us follow the name convention of another method, such as sessionHistoryStringToMap()? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r496290592



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -253,9 +254,34 @@ private void refreshIdealState(final HelixDataAccessor accessor,
   private void refreshLiveInstances(final HelixDataAccessor accessor,
       Set<HelixConstants.ChangeType> refreshedType) {
     if (_propertyDataChangedMap.get(HelixConstants.ChangeType.LIVE_INSTANCE).getAndSet(false)) {
+      // Keep a copy of old live instances in case of maintenance mode
+      Map<String, LiveInstance> oldLiveInstances = getLiveInstances();
       _liveInstanceCache.refresh(accessor);
       _updateInstanceOfflineTime = true;
       refreshedType.add(HelixConstants.ChangeType.LIVE_INSTANCE);
+
+      // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+      // for timed-out nodes
+      long timeOutWindow = _clusterConfig.getMaintenanceOfflineNodeTimeOut();
+      if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+        for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+          // For every 'new' live node, check if it's timed-out
+          if (!oldLiveInstances.containsKey(instance) && isInstanceTimedOutDuringMaintenance(
+              accessor, instance, timeOutWindow)) {
+            _timedOutInstanceDuringMaintenance.add(instance);
+          }
+        }
+
+        // Remove all timed-out nodes that were recorded in this maintenance duration
+        for (String instance : _timedOutInstanceDuringMaintenance) {
+          _liveInstanceCache.deletePropertyByName(instance);

Review comment:
       That's a good idea. Will try. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499749749



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -428,13 +463,32 @@ public synchronized void setIdealStates(List<IdealState> idealStates) {
   }
 
   /**
-   * Returns the LiveInstances for each of the instances that are curretnly up and running
+   * Returns the LiveInstances for each of the instances that are currently up and running
    * @return
    */
   public Map<String, LiveInstance> getLiveInstances() {
-    return _liveInstanceCache.getPropertyMap();
+    return getLiveInstances(false);
+  }
+
+  /**
+   * Returns the LiveInstances for each of the instances that are currently up and running
+   * @param filterTimedOutInstances - Only set true during maintenance mode. If true, filter out
+   *                                instances that are timed-out during maintenance mode; instances
+   *                                are timed-out if they have been offline for a while before going
+   *                                live during maintenance mode.
+   * @return
+   */
+  public Map<String, LiveInstance> getLiveInstances(boolean filterTimedOutInstances) {

Review comment:
       Can we cache the results somewhere (we may already have it?) instead of computing everytime this method is called?  I think gertLiveInstances are called in many places during one pipeline.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r502694065



##########
File path: helix-core/src/test/java/org/apache/helix/integration/controller/TestOfflineNodeTimeoutDuringMaintenanceMode.java
##########
@@ -0,0 +1,242 @@
+package org.apache.helix.integration.controller;

Review comment:
       Updated. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r499741494



##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -153,16 +183,73 @@ private void updateOfflineHistory(long time) {
     if (list.size() == HISTORY_SIZE) {
       list.remove(0);
     }
-
-    DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss:SSS");
-    df.setTimeZone(TimeZone.getTimeZone("UTC"));
-    String dateTime = df.format(new Date(time));
-
-    list.add(dateTime);
+    list.add(parseHistoryDateLongToString(time));
   }
 
   @Override
   public boolean isValid() {
     return true;
   }
+
+  /**
+   * Parses a history date in string format to its millisecond representation.
+   * Returns -1 if parsing fails.
+   */
+  public static long parseHistoryDateStringToLong(String dateString) {

Review comment:
       minor:  -> historyDateStringToLong? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r496140640



##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -253,9 +254,34 @@ private void refreshIdealState(final HelixDataAccessor accessor,
   private void refreshLiveInstances(final HelixDataAccessor accessor,
       Set<HelixConstants.ChangeType> refreshedType) {
     if (_propertyDataChangedMap.get(HelixConstants.ChangeType.LIVE_INSTANCE).getAndSet(false)) {
+      // Keep a copy of old live instances in case of maintenance mode
+      Map<String, LiveInstance> oldLiveInstances = getLiveInstances();
       _liveInstanceCache.refresh(accessor);
       _updateInstanceOfflineTime = true;
       refreshedType.add(HelixConstants.ChangeType.LIVE_INSTANCE);
+
+      // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+      // for timed-out nodes
+      long timeOutWindow = _clusterConfig.getMaintenanceOfflineNodeTimeOut();
+      if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+        for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+          // For every 'new' live node, check if it's timed-out
+          if (!oldLiveInstances.containsKey(instance) && isInstanceTimedOutDuringMaintenance(

Review comment:
       If there is a leadership switch, the oldLiveInstances list will be empty. So the behavior would rely on isInstanceTimedOutDuringMaintenance() only. My guess is that we don't need oldLiveInstances list, since it won't be able to cover leadership switch case anyway.
   
   If isInstanceTimedOutDuringMaintenance() has anything missed, so you have to relies on the oldLiveInstances list, then we need to fix it within the scope of isInstanceTimedOutDuringMaintenance().

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +756,41 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    PropertyKey.Builder keyBuilder = accessor.keyBuilder();
+    PropertyKey propertyKey = keyBuilder.participantHistory(instance);
+    ParticipantHistory history = accessor.getProperty(propertyKey);
+
+    // Cannot use _instanceOfflineTimeMap. _instanceOfflineTimeMap is last updated in the previous
+    // pipeline execution; it is possible a new offline timestamp is updated after the previous
+    // pipeline execution, so we need the most updated timestamp.
+    long lastOfflineTime = history.getLastTimeInOfflineHistory();
+    // lastOfflineTime is only negative if there is no offline history or the time format is wrong.
+    // Since this instance is a 'new' live instance, not having offline history = first time created
+    // instance; during maintenance mode, no partition will be assigned to such a new instance,
+    // therefore it's okay to no time it out. The wrong format case shouldn't happen at all and will
+    // not be handled either.
+    if (lastOfflineTime < 0) {
+      return false;
+    }
+
+    long currentTime = System.currentTimeMillis();
+    if (currentTime - lastOfflineTime > timeOutWindow) {
+      LogUtil.logWarn(logger, getClusterEventId(), String.format(
+          "During maintenance mode, instance %s is timed-out due to its offline time. Current time: "

Review comment:
       nit, this calculation does not need to be bound with the maintenance mode, let's don't mention "maintenance mode" in the log.

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +756,41 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    PropertyKey.Builder keyBuilder = accessor.keyBuilder();
+    PropertyKey propertyKey = keyBuilder.participantHistory(instance);
+    ParticipantHistory history = accessor.getProperty(propertyKey);
+
+    // Cannot use _instanceOfflineTimeMap. _instanceOfflineTimeMap is last updated in the previous
+    // pipeline execution; it is possible a new offline timestamp is updated after the previous
+    // pipeline execution, so we need the most updated timestamp.
+    long lastOfflineTime = history.getLastTimeInOfflineHistory();
+    // lastOfflineTime is only negative if there is no offline history or the time format is wrong.
+    // Since this instance is a 'new' live instance, not having offline history = first time created
+    // instance; during maintenance mode, no partition will be assigned to such a new instance,
+    // therefore it's okay to no time it out. The wrong format case shouldn't happen at all and will
+    // not be handled either.
+    if (lastOfflineTime < 0) {

Review comment:
       Strictly check for ONLINE? Of course, keeping the < 0 checks for safe is still good here, but optional. I suggest adding "lastOfflineTime == ONLINE" to make the logic easier for the reader to understand.

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +756,41 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    PropertyKey.Builder keyBuilder = accessor.keyBuilder();
+    PropertyKey propertyKey = keyBuilder.participantHistory(instance);
+    ParticipantHistory history = accessor.getProperty(propertyKey);
+
+    // Cannot use _instanceOfflineTimeMap. _instanceOfflineTimeMap is last updated in the previous
+    // pipeline execution; it is possible a new offline timestamp is updated after the previous
+    // pipeline execution, so we need the most updated timestamp.
+    long lastOfflineTime = history.getLastTimeInOfflineHistory();

Review comment:
       There is a chance that we failed to record the offlinetime, or if pipeline runs slower, the offline time is not accurately recorded. I don't think there is a solution for it, but let's mention in the comment.

##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -40,7 +41,9 @@
   private static Logger LOG = LoggerFactory.getLogger(ParticipantHistory.class);
 
   private final static int HISTORY_SIZE = 20;
-  private enum ConfigProperty {
+  final static String HISTORY_DATE_FORMAT = "yyyy-MM-dd'T'HH:mm:ss:SSS";

Review comment:
       private?

##########
File path: helix-core/src/main/java/org/apache/helix/model/ClusterConfig.java
##########
@@ -122,7 +122,13 @@
     // don't specify their individual pool sizes, this value will be used for all participants; if
     // none of participants or the cluster define pool sizes,
     // TaskConstants.DEFAULT_TASK_THREAD_POOL_SIZE will be used to create pool sizes.
-    GLOBAL_TARGET_TASK_THREAD_POOL_SIZE
+    GLOBAL_TARGET_TASK_THREAD_POOL_SIZE,
+
+    // The time out window for offline nodes during maintenance mode; if an offline node has been
+    // offline for more than this specified time period, it's treated as offline for the rest of
+    // the maintenance mode's duration even when it comes online.
+    // The unit is milliseconds.
+    MAINTENANCE_OFFLINE_NODE_TIME_OUT

Review comment:
       nit, OFFLINE_NODE_TIME_OUT_FOR_MAINTENANCE_MODE?

##########
File path: helix-core/src/test/java/org/apache/helix/model/TestClusterConfig.java
##########
@@ -298,6 +298,26 @@ public void testAsyncGlobalRebalanceOption() {
             false), true);
   }
 
+  @Test
+  public void testGetMaintenanceOfflineNodeTimeOut() {

Review comment:
       1. We need a test to cover the get default value case.
   2. Might be easier for you to combine the 2 new tests.

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +756,41 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor accessor, String instance,
+      long timeOutWindow) {
+    PropertyKey.Builder keyBuilder = accessor.keyBuilder();
+    PropertyKey propertyKey = keyBuilder.participantHistory(instance);
+    ParticipantHistory history = accessor.getProperty(propertyKey);
+
+    // Cannot use _instanceOfflineTimeMap. _instanceOfflineTimeMap is last updated in the previous
+    // pipeline execution; it is possible a new offline timestamp is updated after the previous
+    // pipeline execution, so we need the most updated timestamp.
+    long lastOfflineTime = history.getLastTimeInOfflineHistory();

Review comment:
       Due to the same reason, we might want to limit the timeout setup to be more than 5 mins (for example). A very small number does not make sense, and it won't work as expected anyway.

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -253,9 +254,34 @@ private void refreshIdealState(final HelixDataAccessor accessor,
   private void refreshLiveInstances(final HelixDataAccessor accessor,
       Set<HelixConstants.ChangeType> refreshedType) {
     if (_propertyDataChangedMap.get(HelixConstants.ChangeType.LIVE_INSTANCE).getAndSet(false)) {
+      // Keep a copy of old live instances in case of maintenance mode
+      Map<String, LiveInstance> oldLiveInstances = getLiveInstances();
       _liveInstanceCache.refresh(accessor);
       _updateInstanceOfflineTime = true;
       refreshedType.add(HelixConstants.ChangeType.LIVE_INSTANCE);
+
+      // If maintenance mode is enabled and timeout window is specified, filter 'new' live nodes
+      // for timed-out nodes
+      long timeOutWindow = _clusterConfig.getMaintenanceOfflineNodeTimeOut();
+      if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+        for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+          // For every 'new' live node, check if it's timed-out
+          if (!oldLiveInstances.containsKey(instance) && isInstanceTimedOutDuringMaintenance(
+              accessor, instance, timeOutWindow)) {
+            _timedOutInstanceDuringMaintenance.add(instance);
+          }
+        }
+
+        // Remove all timed-out nodes that were recorded in this maintenance duration
+        for (String instance : _timedOutInstanceDuringMaintenance) {
+          _liveInstanceCache.deletePropertyByName(instance);

Review comment:
       It's better to ensure the cache has all the facts. Removing the timedout instances from the cache might be suboptimal since after this, no one tracking the real alive nodes anymore.
   It would be easier to keep tracking the whole list and the timedout list, then if the filter option is turned on, we filter in the get method. In this way, we still have the capability to return the full list.

##########
File path: helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -40,7 +41,9 @@
   private static Logger LOG = LoggerFactory.getLogger(ParticipantHistory.class);
 
   private final static int HISTORY_SIZE = 20;
-  private enum ConfigProperty {
+  final static String HISTORY_DATE_FORMAT = "yyyy-MM-dd'T'HH:mm:ss:SSS";

Review comment:
       If required in the test, then either put it in HelixConstants or redefine a string in the test case (this helps to prevent if anyone changes the format that breaks the compatibility)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1413: Feature: Offline Node Timeout During Maintenance Mode

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r501168338



##########
File path: helix-core/src/test/java/org/apache/helix/integration/controller/TestOfflineNodeTimeoutDuringMaintenanceMode.java
##########
@@ -0,0 +1,242 @@
+package org.apache.helix.integration.controller;

Review comment:
       I will cover that in a CHO test. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org