You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@helix.apache.org by GitBox <gi...@apache.org> on 2020/10/28 17:53:18 UTC

[GitHub] [helix] zhangmeng916 opened a new pull request #1494: Implement offline nodes purging

zhangmeng916 opened a new pull request #1494:
URL: https://github.com/apache/helix/pull/1494

### Issues

- [X] My PR addresses the following Helix issues and references them in the PR description:

Fixed #1493

### Description

- [X] Here are some details about my PR, including screenshots of any UI changes:

Helix plan to have the function to allow users to purge offline participants that have been timed out. This PR provides both Java API and REST API for users to purge offline participants. Users may specify the timeout through API input or just fall back to the default value defined in cluster config. If either is defined as a timeout, Helix will not purge participants.

### Tests

- [X] The following tests are written for this issue:

TestZkHelixAdmin
TestClusterAccessor

- [ ] The following is the result of the "mvn test" command on the appropriate module:

(Before CI test pass, please copy & paste the result of "mvn test")

### Documentation (Optional)

- In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

### Commits

- My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
1. Subject is separated from body by a blank line
1. Subject is limited to 50 characters (not including Jira issue reference)
1. Subject does not end with a period
1. Subject uses the imperative mood ("add", not "adding")
1. Body wraps at 72 characters
1. Body explains "what" and "why", not "how"

### Code Quality

- My diff has been formatted using helix-style.xml
(helix-style-intellij.xml if IntelliJ IDE is used)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] narendly commented on a change in pull request #1494: Add admin and Rest API to purge instances that have been offline more than specified period of time

Posted by GitBox <gi...@apache.org>.

narendly commented on a change in pull request #1494:
URL: https://github.com/apache/helix/pull/1494#discussion_r514898910



##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -250,6 +251,13 @@ public void dropInstance(String clusterName, InstanceConfig instanceConfig) {
     }
   }
 
+  @Override
+  public void purgeOfflineInstances(String clusterName, Long timeout) {
+    Map<String, InstanceConfig> timeoutOfflineInstances = findTimeoutOfflineInstances(clusterName
+        , timeout);
+    timeoutOfflineInstances.values().forEach(instance -> dropInstance(clusterName, instance));

Review comment:
       two high-level comments:
   1. are we not using the lock? as discussed offline, i think to complete this feature, it would be a good idea to use the distributed lock here since all sorts of race conditions are possible
   2. what kind of failure handling are we doing in this call? we're making multiple dropInstance calls here and there's no guarantee that all would go through. plus, the method itself isn't returning anything so there's no way for the user to determine whether the purge operation was successful or not




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] narendly commented on a change in pull request #1494: Add admin and Rest API to purge instances that have been offline more than specified period of time

Posted by GitBox <gi...@apache.org>.

narendly commented on a change in pull request #1494:
URL: https://github.com/apache/helix/pull/1494#discussion_r514897345



##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -2061,4 +2069,43 @@ public ZKHelixAdmin build() {
               _zkAddress), false);
     }
   }
+
+  private Map<String, InstanceConfig> findTimeoutOfflineInstances(String clusterName,
+      Long timeout) {
+    Map<String, InstanceConfig> instanceConfigMap = new HashMap<>();
+    // in case there is no customized timeout value, use the one defined in cluster config
+    if (timeout == null) {
+      timeout = _configAccessor.getClusterConfig(clusterName).getOfflineNodeTimeOutForPurge();
+    }

Review comment:
       instead of doing the `null` check, maybe you could define a `NOT_SET` constant of some sort...




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] zhangmeng916 commented on a change in pull request #1494: Implement offline nodes purging

Posted by GitBox <gi...@apache.org>.

zhangmeng916 commented on a change in pull request #1494:
URL: https://github.com/apache/helix/pull/1494#discussion_r514453058



##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -2061,4 +2070,45 @@ public ZKHelixAdmin build() {
               _zkAddress), false);
     }
   }
+
+  private Map<String, InstanceConfig> findTimeoutOfflineInstances(String clusterName,
+      Map<String, String> customizedPurgeMap) {
+    String timeout = customizedPurgeMap.get(OFFLINE_NODE_PURGE_TIMEOUT);
+    Long timeoutValue;
+    // in case there is no customized timeout value, use the one defined in cluster config
+    if (timeout == null) {
+      timeoutValue = _configAccessor.getClusterConfig(clusterName).getOfflineNodeTimeOutForPurge();
+    } else {
+      timeoutValue = Long.valueOf(timeout);
+    }
+
+    Map<String, InstanceConfig> instanceConfigMap = new HashMap<>();
+    String path = PropertyPathBuilder.instanceConfig(clusterName);
+    BaseDataAccessor<ZNRecord> baseAccessor = new ZkBaseDataAccessor<>(_zkClient);
+    List<ZNRecord> znRecords = baseAccessor.getChildren(path, null, 0, 0, 0);
+    for (ZNRecord record : znRecords) {
+      if (record != null) {
+        InstanceConfig instanceConfig = new InstanceConfig(record);
+        instanceConfigMap.put(instanceConfig.getInstanceName(), instanceConfig);
+      }
+    }
+
+    path = PropertyPathBuilder.liveInstance(clusterName);
+    List<String> liveNodes = baseAccessor.getChildNames(path, 0);
+    liveNodes.forEach(liveNode -> instanceConfigMap.remove(liveNode));
+
+    Set<String> toRemoveInstances = new HashSet<>();
+    for (String instanceName : instanceConfigMap.keySet()) {
+      String historyPath = PropertyPathBuilder.instanceHistory(clusterName, instanceName);
+      ZNRecord znRecord = baseAccessor.get(historyPath, null, 0);
+      ParticipantHistory participantHistory = new ParticipantHistory(znRecord);
+      long lastOfflineTime = participantHistory.getLastOfflineTime();
+      if (lastOfflineTime == -1 || timeoutValue < 0

Review comment:
       Yeah, good catch!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] dasahcc commented on a change in pull request #1494: Implement offline nodes purging

Posted by GitBox <gi...@apache.org>.

dasahcc commented on a change in pull request #1494:
URL: https://github.com/apache/helix/pull/1494#discussion_r513786034



##########
File path: helix-core/src/main/java/org/apache/helix/HelixAdmin.java
##########
@@ -228,6 +228,13 @@ void addResource(String clusterName, String resourceName, int numPartitions, Str
    */
   void dropInstance(String clusterName, InstanceConfig instanceConfig);
 
+  /**
+   * Purge offline instances from a cluster
+   * @param clusterName
+   * @param customizedPurgeMap

Review comment:
       Can we specify the map meaning here? It could be confusing if we dont clarify it in Java doc.

##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -2061,4 +2070,45 @@ public ZKHelixAdmin build() {
               _zkAddress), false);
     }
   }
+
+  private Map<String, InstanceConfig> findTimeoutOfflineInstances(String clusterName,
+      Map<String, String> customizedPurgeMap) {
+    String timeout = customizedPurgeMap.get(OFFLINE_NODE_PURGE_TIMEOUT);

Review comment:
       I am also confused here. What could other entries other than this timeout?

##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -105,6 +106,7 @@
   public static final String CONNECTION_TIMEOUT = "helixAdmin.timeOutInSec";
   private static final String MAINTENANCE_ZNODE_ID = "maintenance";
   private static final int DEFAULT_SUPERCLUSTER_REPLICA = 3;
+  public static final String OFFLINE_NODE_PURGE_TIMEOUT = "purge.timeout";

Review comment:
       Better to have public constants class to hold this kind of key words.

##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -2061,4 +2070,45 @@ public ZKHelixAdmin build() {
               _zkAddress), false);
     }
   }
+
+  private Map<String, InstanceConfig> findTimeoutOfflineInstances(String clusterName,
+      Map<String, String> customizedPurgeMap) {
+    String timeout = customizedPurgeMap.get(OFFLINE_NODE_PURGE_TIMEOUT);
+    Long timeoutValue;
+    // in case there is no customized timeout value, use the one defined in cluster config
+    if (timeout == null) {
+      timeoutValue = _configAccessor.getClusterConfig(clusterName).getOfflineNodeTimeOutForPurge();
+    } else {
+      timeoutValue = Long.valueOf(timeout);
+    }
+
+    Map<String, InstanceConfig> instanceConfigMap = new HashMap<>();
+    String path = PropertyPathBuilder.instanceConfig(clusterName);
+    BaseDataAccessor<ZNRecord> baseAccessor = new ZkBaseDataAccessor<>(_zkClient);
+    List<ZNRecord> znRecords = baseAccessor.getChildren(path, null, 0, 0, 0);
+    for (ZNRecord record : znRecords) {
+      if (record != null) {
+        InstanceConfig instanceConfig = new InstanceConfig(record);
+        instanceConfigMap.put(instanceConfig.getInstanceName(), instanceConfig);
+      }
+    }
+
+    path = PropertyPathBuilder.liveInstance(clusterName);
+    List<String> liveNodes = baseAccessor.getChildNames(path, 0);
+    liveNodes.forEach(liveNode -> instanceConfigMap.remove(liveNode));

Review comment:
       We can create a set and use instanceConfigMap.keySet().removeAll(set).

##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -2061,4 +2070,45 @@ public ZKHelixAdmin build() {
               _zkAddress), false);
     }
   }
+
+  private Map<String, InstanceConfig> findTimeoutOfflineInstances(String clusterName,
+      Map<String, String> customizedPurgeMap) {
+    String timeout = customizedPurgeMap.get(OFFLINE_NODE_PURGE_TIMEOUT);
+    Long timeoutValue;
+    // in case there is no customized timeout value, use the one defined in cluster config
+    if (timeout == null) {
+      timeoutValue = _configAccessor.getClusterConfig(clusterName).getOfflineNodeTimeOutForPurge();
+    } else {
+      timeoutValue = Long.valueOf(timeout);

Review comment:
       Shall we do a try catch? If the timeout is invalid string by human error, shall we stop it or use default value?

##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -2061,4 +2070,45 @@ public ZKHelixAdmin build() {
               _zkAddress), false);
     }
   }
+
+  private Map<String, InstanceConfig> findTimeoutOfflineInstances(String clusterName,
+      Map<String, String> customizedPurgeMap) {
+    String timeout = customizedPurgeMap.get(OFFLINE_NODE_PURGE_TIMEOUT);
+    Long timeoutValue;
+    // in case there is no customized timeout value, use the one defined in cluster config
+    if (timeout == null) {
+      timeoutValue = _configAccessor.getClusterConfig(clusterName).getOfflineNodeTimeOutForPurge();
+    } else {
+      timeoutValue = Long.valueOf(timeout);
+    }
+
+    Map<String, InstanceConfig> instanceConfigMap = new HashMap<>();
+    String path = PropertyPathBuilder.instanceConfig(clusterName);
+    BaseDataAccessor<ZNRecord> baseAccessor = new ZkBaseDataAccessor<>(_zkClient);
+    List<ZNRecord> znRecords = baseAccessor.getChildren(path, null, 0, 0, 0);
+    for (ZNRecord record : znRecords) {
+      if (record != null) {
+        InstanceConfig instanceConfig = new InstanceConfig(record);
+        instanceConfigMap.put(instanceConfig.getInstanceName(), instanceConfig);
+      }
+    }
+
+    path = PropertyPathBuilder.liveInstance(clusterName);
+    List<String> liveNodes = baseAccessor.getChildNames(path, 0);
+    liveNodes.forEach(liveNode -> instanceConfigMap.remove(liveNode));
+
+    Set<String> toRemoveInstances = new HashSet<>();
+    for (String instanceName : instanceConfigMap.keySet()) {
+      String historyPath = PropertyPathBuilder.instanceHistory(clusterName, instanceName);
+      ZNRecord znRecord = baseAccessor.get(historyPath, null, 0);
+      ParticipantHistory participantHistory = new ParticipantHistory(znRecord);
+      long lastOfflineTime = participantHistory.getLastOfflineTime();
+      if (lastOfflineTime == -1 || timeoutValue < 0
+          || System.currentTimeMillis() - lastOfflineTime < timeoutValue) {
+        toRemoveInstances.add(instanceName);
+      }
+    }
+    toRemoveInstances.forEach(instance -> instanceConfigMap.remove(instance));

Review comment:
       Same here, we can use instanceConfigMap.keySet().removeAll(toRemoveInstances); It has better performance.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] pkuwm commented on a change in pull request #1494: Implement offline nodes purging

Posted by GitBox <gi...@apache.org>.

pkuwm commented on a change in pull request #1494:
URL: https://github.com/apache/helix/pull/1494#discussion_r513915274



##########
File path: helix-rest/src/main/java/org/apache/helix/rest/server/resources/helix/ClusterAccessor.java
##########
@@ -282,6 +282,17 @@ public Response updateCluster(@PathParam("clusterId") String clusterId,
           return badRequest(e.getMessage());
         }
         break;
+      case purgeOfflineParticipants:
+        Map<String, String> customizedPurgeMap = new HashMap<>();
+        try {
+          customizedPurgeMap =
+              OBJECT_MAPPER.readValue(content, new TypeReference<HashMap<String, String>>() {

Review comment:
       It seems only timeout is needed. Anther option is, we can add a QueryParam timeout, instead of putting it a body content? It'd be easier to use.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] lei-xia commented on pull request #1494: Implement offline nodes purging

Posted by GitBox <gi...@apache.org>.

lei-xia commented on pull request #1494:
URL: https://github.com/apache/helix/pull/1494#issuecomment-718323604


   To make the PR title more clear, maybe change it to "Add admin and Rest API to purge instances that have been offline more than specified period of time"?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] lei-xia commented on a change in pull request #1494: Implement offline nodes purging

Posted by GitBox <gi...@apache.org>.

lei-xia commented on a change in pull request #1494:
URL: https://github.com/apache/helix/pull/1494#discussion_r513896977



##########
File path: helix-core/src/main/java/org/apache/helix/HelixAdmin.java
##########
@@ -228,6 +228,13 @@ void addResource(String clusterName, String resourceName, int numPartitions, Str
    */
   void dropInstance(String clusterName, InstanceConfig instanceConfig);
 
+  /**
+   * Purge offline instances from a cluster
+   * @param clusterName
+   * @param customizedPurgeMap
+   */
+  void purgeOfflineInstances(String clusterName, Map<String, String> customizedPurgeMap);

Review comment:
       Should the API looks like:  purgeOfflineInstances(clusterName, long timeout), which will delete all instances that have been offline longer than timeout?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] pkuwm commented on a change in pull request #1494: Implement offline nodes purging

Posted by GitBox <gi...@apache.org>.

pkuwm commented on a change in pull request #1494:
URL: https://github.com/apache/helix/pull/1494#discussion_r513906307



##########
File path: helix-rest/src/main/java/org/apache/helix/rest/server/resources/helix/ClusterAccessor.java
##########
@@ -282,6 +282,17 @@ public Response updateCluster(@PathParam("clusterId") String clusterId,
           return badRequest(e.getMessage());
         }
         break;
+      case purgeOfflineParticipants:
+        Map<String, String> customizedPurgeMap = new HashMap<>();
+        try {
+          customizedPurgeMap =
+              OBJECT_MAPPER.readValue(content, new TypeReference<HashMap<String, String>>() {
+              });
+        } catch (Exception e) {
+          // NOP

Review comment:
       Can we have a warning log for why json parsing fails which could help us debug.
   Not sure if we have a unit test to test this kind of failure: content(eg. format) is incorret. Then we should not return OK, but a bad client request. 

##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -2061,4 +2070,45 @@ public ZKHelixAdmin build() {
               _zkAddress), false);
     }
   }
+
+  private Map<String, InstanceConfig> findTimeoutOfflineInstances(String clusterName,
+      Map<String, String> customizedPurgeMap) {

Review comment:
       From the API logic, it seems what is needed is only `timeout`? Then I think we don't have to pass along the map, but only the `timeout` value from the rest api layer: `timeout = customizedPurgeMap.get(OFFLINE_NODE_PURGE_TIMEOUT);`
   Then the APIs only need the timeout param, not a map.

##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -250,6 +252,13 @@ public void dropInstance(String clusterName, InstanceConfig instanceConfig) {
     }
   }
 
+  @Override
+  public void purgeOfflineInstances(String clusterName, Map<String, String> customizedPurgeMap) {
+    Map<String, InstanceConfig> timeoutOfflineInstances = findTimeoutOfflineInstances(clusterName,
+        customizedPurgeMap);
+    timeoutOfflineInstances.values().forEach(value -> dropInstance(clusterName, value));

Review comment:
       `value` -> `instanceConfig` to be clearer?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] narendly commented on a change in pull request #1494: Add admin and Rest API to purge instances that have been offline more than specified period of time

Posted by GitBox <gi...@apache.org>.

narendly commented on a change in pull request #1494:
URL: https://github.com/apache/helix/pull/1494#discussion_r515746448



##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -250,6 +251,13 @@ public void dropInstance(String clusterName, InstanceConfig instanceConfig) {
     }
   }
 
+  @Override
+  public void purgeOfflineInstances(String clusterName, Long timeout) {
+    Map<String, InstanceConfig> timeoutOfflineInstances = findTimeoutOfflineInstances(clusterName
+        , timeout);
+    timeoutOfflineInstances.values().forEach(instance -> dropInstance(clusterName, instance));

Review comment:
       @zhangmeng916 let's read that one a little more carefully. 
   
   > as discussed, i think what we ultimately want to achieve is to provide a service or a sidecar controller to do this automatically
   
   this doesn't mean that we will include the purge logic into the main controller. it clearly states that it will be a service or a sidecar controller. and i believe we all agreed that this would be the direction.
   
   the reason we are shifting the responsibility to the user is for the ease of implementation. ultimately, we want to automate this in our holistic cluster management environment (HaaS).
   
   > adding one more read/write per instance in a short time period (e.g. 10 or 20 seconds) for lock makes it worse without adding extra benefit for them.
   
   I'm not sure if i completely agree with the statement. 10-20 sec is a very generous duration where zk server will be handle many read/writes. also I don't agree that there's no extra benefit. the benefit we're talking about here is concurrency control in a distributed setting? i'm not suggesting we add read/write operations for nothing. 
   
   if you prefer to leave out locking for this API (which is totally fine with the right assumptions), i suggest we give some context in the JavaDoc and clearly state that this operation is not thread-safe and that it _should_ lock in the future when called automatically.

##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -250,6 +251,13 @@ public void dropInstance(String clusterName, InstanceConfig instanceConfig) {
     }
   }
 
+  @Override
+  public void purgeOfflineInstances(String clusterName, Long timeout) {
+    Map<String, InstanceConfig> timeoutOfflineInstances = findTimeoutOfflineInstances(clusterName
+        , timeout);
+    timeoutOfflineInstances.values().forEach(instance -> dropInstance(clusterName, instance));

Review comment:
       @zhangmeng916 let's read that one a little more carefully. 
   
   > as discussed, i think what we ultimately want to achieve is to provide a service or a sidecar controller to do this automatically
   
   this doesn't mean that we will include the purge logic into the main controller. it clearly states that it will be a service or a sidecar controller. and i believe we all agreed that this would be the direction.
   
   the reason we are shifting the responsibility to the user is for the ease of implementation. ultimately, we want to automate this in our holistic cluster management environment (HaaS).
   
   > adding one more read/write per instance in a short time period (e.g. 10 or 20 seconds) for lock makes it worse without adding extra benefit for them.
   
   I'm not sure if i completely agree with the statement. 10-20 sec is a very generous duration where zk server will be handle many read/writes. also I don't agree that there's no extra benefit. the benefit we're talking about here is concurrency control in a distributed setting? it's not like we'd be locking for nothing. 
   
   if you prefer to leave out locking for this API (which is totally fine with the right assumptions), i suggest we give some context in the JavaDoc and clearly state that this operation is not thread-safe and that it _should_ lock in the future when called automatically.

##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -250,6 +251,13 @@ public void dropInstance(String clusterName, InstanceConfig instanceConfig) {
     }
   }
 
+  @Override
+  public void purgeOfflineInstances(String clusterName, Long timeout) {
+    Map<String, InstanceConfig> timeoutOfflineInstances = findTimeoutOfflineInstances(clusterName
+        , timeout);
+    timeoutOfflineInstances.values().forEach(instance -> dropInstance(clusterName, instance));

Review comment:
       @zhangmeng916 let's read that one a little more carefully. 
   
   > as discussed, i think what we ultimately want to achieve is to provide a service or a sidecar controller to do this automatically
   
   this doesn't mean that we will include the purge logic into the main controller. it clearly states that it will be a _service_ or a _sidecar_ controller. and i believe we all agreed that this would be the direction.
   
   the reason we are shifting the responsibility to the user is for the ease of implementation. ultimately, we want to automate this in our holistic cluster management environment (HaaS).
   
   > adding one more read/write per instance in a short time period (e.g. 10 or 20 seconds) for lock makes it worse without adding extra benefit for them.
   
   I'm not sure if i completely agree with the statement. 10-20 sec is a very generous duration where zk server will be handle many read/writes. also I don't agree that there's no extra benefit. the benefit we're talking about here is concurrency control in a distributed setting? it's not like we'd be locking for nothing. 
   
   if you prefer to leave out locking for this API (which is totally fine with the right assumptions), i suggest we give some context in the JavaDoc and clearly state that this operation is not thread-safe and that it _should_ lock in the future when called automatically.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] narendly commented on a change in pull request #1494: Add admin and Rest API to purge instances that have been offline more than specified period of time

Posted by GitBox <gi...@apache.org>.

narendly commented on a change in pull request #1494:
URL: https://github.com/apache/helix/pull/1494#discussion_r514896828



##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -2061,4 +2069,43 @@ public ZKHelixAdmin build() {
               _zkAddress), false);
     }
   }
+
+  private Map<String, InstanceConfig> findTimeoutOfflineInstances(String clusterName,

Review comment:
       again, `timeout` might be a bit of a misnomer here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] narendly commented on a change in pull request #1494: Add admin and Rest API to purge instances that have been offline more than specified period of time

Posted by GitBox <gi...@apache.org>.

narendly commented on a change in pull request #1494:
URL: https://github.com/apache/helix/pull/1494#discussion_r514898152



##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -2061,4 +2069,43 @@ public ZKHelixAdmin build() {
               _zkAddress), false);
     }
   }
+
+  private Map<String, InstanceConfig> findTimeoutOfflineInstances(String clusterName,
+      Long timeout) {
+    Map<String, InstanceConfig> instanceConfigMap = new HashMap<>();
+    // in case there is no customized timeout value, use the one defined in cluster config
+    if (timeout == null) {
+      timeout = _configAccessor.getClusterConfig(clusterName).getOfflineNodeTimeOutForPurge();
+    }
+    if (timeout < 0) {
+      return instanceConfigMap;
+    }
+
+    String path = PropertyPathBuilder.instanceConfig(clusterName);
+    BaseDataAccessor<ZNRecord> baseAccessor = new ZkBaseDataAccessor<>(_zkClient);
+    List<ZNRecord> znRecords = baseAccessor.getChildren(path, null, 0, 0, 0);
+    for (ZNRecord record : znRecords) {
+      if (record != null) {
+        InstanceConfig instanceConfig = new InstanceConfig(record);
+        instanceConfigMap.put(instanceConfig.getInstanceName(), instanceConfig);
+      }
+    }
+
+    path = PropertyPathBuilder.liveInstance(clusterName);
+    List<String> liveNodes = baseAccessor.getChildNames(path, 0);
+    instanceConfigMap.keySet().removeAll(liveNodes);
+
+    Set<String> toRemoveInstances = new HashSet<>();
+    for (String instanceName : instanceConfigMap.keySet()) {
+      String historyPath = PropertyPathBuilder.instanceHistory(clusterName, instanceName);
+      ZNRecord znRecord = baseAccessor.get(historyPath, null, 0);
+      ParticipantHistory participantHistory = new ParticipantHistory(znRecord);
+      long lastOfflineTime = participantHistory.getLastOfflineTime();
+      if (lastOfflineTime == -1 || System.currentTimeMillis() - lastOfflineTime < timeout) {
+        toRemoveInstances.add(instanceName);
+      }
+    }
+    instanceConfigMap.keySet().removeAll(toRemoveInstances);
+    return instanceConfigMap;

Review comment:
       nit: is there a strong reason we're using a `BaseDataAccessor` instance here? for helix-related i/o, it is better taste to use `HelixDataAccessor` because you can avoid having to cast it to helix data model from znrecord. 
   
   that way, code will be cleaner and easier to read, shorter, and more concise.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] lei-xia commented on a change in pull request #1494: Implement offline nodes purging

Posted by GitBox <gi...@apache.org>.

lei-xia commented on a change in pull request #1494:
URL: https://github.com/apache/helix/pull/1494#discussion_r513897459



##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -250,6 +252,13 @@ public void dropInstance(String clusterName, InstanceConfig instanceConfig) {
     }
   }
 
+  @Override
+  public void purgeOfflineInstances(String clusterName, Map<String, String> customizedPurgeMap) {

Review comment:
       Why a map here, what could be specified other than timeout?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] narendly commented on a change in pull request #1494: Add admin and Rest API to purge instances that have been offline more than specified period of time

Posted by GitBox <gi...@apache.org>.

narendly commented on a change in pull request #1494:
URL: https://github.com/apache/helix/pull/1494#discussion_r514897034



##########
File path: helix-core/src/main/java/org/apache/helix/HelixAdmin.java
##########
@@ -228,6 +228,13 @@ void addResource(String clusterName, String resourceName, int numPartitions, Str
    */
   void dropInstance(String clusterName, InstanceConfig instanceConfig);
 
+  /**
+   * Purge offline instances from a cluster
+   * @param clusterName
+   * @param timeout
+   */
+  void purgeOfflineInstances(String clusterName, Long timeout);

Review comment:
       also is there a strong reason why we don't use a primitive `long`? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] lei-xia commented on a change in pull request #1494: Implement offline nodes purging

Posted by GitBox <gi...@apache.org>.

lei-xia commented on a change in pull request #1494:
URL: https://github.com/apache/helix/pull/1494#discussion_r513896977



##########
File path: helix-core/src/main/java/org/apache/helix/HelixAdmin.java
##########
@@ -228,6 +228,13 @@ void addResource(String clusterName, String resourceName, int numPartitions, Str
    */
   void dropInstance(String clusterName, InstanceConfig instanceConfig);
 
+  /**
+   * Purge offline instances from a cluster
+   * @param clusterName
+   * @param customizedPurgeMap
+   */
+  void purgeOfflineInstances(String clusterName, Map<String, String> customizedPurgeMap);

Review comment:
       Should not be the API looks like:  purgeOfflineInstances(clusterName, long timeout), which will delete all instances that have been offline longer than timeout?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] narendly commented on a change in pull request #1494: Add admin and Rest API to purge instances that have been offline more than specified period of time

Posted by GitBox <gi...@apache.org>.

narendly commented on a change in pull request #1494:
URL: https://github.com/apache/helix/pull/1494#discussion_r514896634



##########
File path: helix-core/src/main/java/org/apache/helix/HelixAdmin.java
##########
@@ -228,6 +228,13 @@ void addResource(String clusterName, String resourceName, int numPartitions, Str
    */
   void dropInstance(String clusterName, InstanceConfig instanceConfig);
 
+  /**
+   * Purge offline instances from a cluster
+   * @param clusterName
+   * @param timeout
+   */
+  void purgeOfflineInstances(String clusterName, Long timeout);

Review comment:
       i think the javadoc here could use more work. we should try to add more context on why and when the interface should be used. 
   
   also i'm not sure `timeout` is the right term for this. perhaps rename it to something like `offlineDuration`? also if you could please add a description as to how this `offlineDuration` is measured/calculated.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] NealSun96 commented on a change in pull request #1494: Implement offline nodes purging

Posted by GitBox <gi...@apache.org>.

NealSun96 commented on a change in pull request #1494:
URL: https://github.com/apache/helix/pull/1494#discussion_r513845377



##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -2061,4 +2070,45 @@ public ZKHelixAdmin build() {
               _zkAddress), false);
     }
   }
+
+  private Map<String, InstanceConfig> findTimeoutOfflineInstances(String clusterName,
+      Map<String, String> customizedPurgeMap) {
+    String timeout = customizedPurgeMap.get(OFFLINE_NODE_PURGE_TIMEOUT);
+    Long timeoutValue;
+    // in case there is no customized timeout value, use the one defined in cluster config
+    if (timeout == null) {
+      timeoutValue = _configAccessor.getClusterConfig(clusterName).getOfflineNodeTimeOutForPurge();
+    } else {
+      timeoutValue = Long.valueOf(timeout);
+    }
+
+    Map<String, InstanceConfig> instanceConfigMap = new HashMap<>();
+    String path = PropertyPathBuilder.instanceConfig(clusterName);
+    BaseDataAccessor<ZNRecord> baseAccessor = new ZkBaseDataAccessor<>(_zkClient);
+    List<ZNRecord> znRecords = baseAccessor.getChildren(path, null, 0, 0, 0);
+    for (ZNRecord record : znRecords) {
+      if (record != null) {
+        InstanceConfig instanceConfig = new InstanceConfig(record);
+        instanceConfigMap.put(instanceConfig.getInstanceName(), instanceConfig);
+      }
+    }
+
+    path = PropertyPathBuilder.liveInstance(clusterName);
+    List<String> liveNodes = baseAccessor.getChildNames(path, 0);
+    liveNodes.forEach(liveNode -> instanceConfigMap.remove(liveNode));
+
+    Set<String> toRemoveInstances = new HashSet<>();
+    for (String instanceName : instanceConfigMap.keySet()) {
+      String historyPath = PropertyPathBuilder.instanceHistory(clusterName, instanceName);
+      ZNRecord znRecord = baseAccessor.get(historyPath, null, 0);
+      ParticipantHistory participantHistory = new ParticipantHistory(znRecord);
+      long lastOfflineTime = participantHistory.getLastOfflineTime();
+      if (lastOfflineTime == -1 || timeoutValue < 0

Review comment:
       If `timeoutValue < 0`, can we return earlier with an empty map?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] narendly commented on a change in pull request #1494: Add admin and Rest API to purge instances that have been offline more than specified period of time

Posted by GitBox <gi...@apache.org>.

narendly commented on a change in pull request #1494:
URL: https://github.com/apache/helix/pull/1494#discussion_r514898910



##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -250,6 +251,13 @@ public void dropInstance(String clusterName, InstanceConfig instanceConfig) {
     }
   }
 
+  @Override
+  public void purgeOfflineInstances(String clusterName, Long timeout) {
+    Map<String, InstanceConfig> timeoutOfflineInstances = findTimeoutOfflineInstances(clusterName
+        , timeout);
+    timeoutOfflineInstances.values().forEach(instance -> dropInstance(clusterName, instance));

Review comment:
       two high-level comments:
   1. are we not using the lock? as discussed offline, i think to complete this feature, it would be a good idea to use the distributed lock here.
   2. what kind of failure handling are we doing in this call? we're making multiple dropInstance calls here and there's no guarantee that all would go through. plus, the method itself isn't returning anything so there's no way for the user to determine whether the purge operation was successful or not




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] zhangmeng916 commented on a change in pull request #1494: Implement offline nodes purging

Posted by GitBox <gi...@apache.org>.

zhangmeng916 commented on a change in pull request #1494:
URL: https://github.com/apache/helix/pull/1494#discussion_r514451102



##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -105,6 +106,7 @@
   public static final String CONNECTION_TIMEOUT = "helixAdmin.timeOutInSec";
   private static final String MAINTENANCE_ZNODE_ID = "maintenance";
   private static final int DEFAULT_SUPERCLUSTER_REPLICA = 3;
+  public static final String OFFLINE_NODE_PURGE_TIMEOUT = "purge.timeout";

Review comment:
       We provide an option for users to define the default timeout in cluster config, so that they do not need to provide time in every call. If they do give a timeout in purge call, the value will override the default one in cluster config.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] narendly commented on a change in pull request #1494: Add admin and Rest API to purge instances that have been offline more than specified period of time

Posted by GitBox <gi...@apache.org>.

narendly commented on a change in pull request #1494:
URL: https://github.com/apache/helix/pull/1494#discussion_r514897137



##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -2061,4 +2069,43 @@ public ZKHelixAdmin build() {
               _zkAddress), false);
     }
   }
+
+  private Map<String, InstanceConfig> findTimeoutOfflineInstances(String clusterName,
+      Long timeout) {

Review comment:
       i'd prefer a primitive `long` here




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] zhangmeng916 commented on a change in pull request #1494: Add admin and Rest API to purge instances that have been offline more than specified period of time

Posted by GitBox <gi...@apache.org>.

zhangmeng916 commented on a change in pull request #1494:
URL: https://github.com/apache/helix/pull/1494#discussion_r514555208



##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -2061,4 +2070,45 @@ public ZKHelixAdmin build() {
               _zkAddress), false);
     }
   }
+
+  private Map<String, InstanceConfig> findTimeoutOfflineInstances(String clusterName,
+      Map<String, String> customizedPurgeMap) {
+    String timeout = customizedPurgeMap.get(OFFLINE_NODE_PURGE_TIMEOUT);
+    Long timeoutValue;
+    // in case there is no customized timeout value, use the one defined in cluster config
+    if (timeout == null) {
+      timeoutValue = _configAccessor.getClusterConfig(clusterName).getOfflineNodeTimeOutForPurge();
+    } else {
+      timeoutValue = Long.valueOf(timeout);

Review comment:
       With the input changed to Long, now it's not an issue.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] lei-xia commented on a change in pull request #1494: Implement offline nodes purging

Posted by GitBox <gi...@apache.org>.

lei-xia commented on a change in pull request #1494:
URL: https://github.com/apache/helix/pull/1494#discussion_r513896405



##########
File path: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java
##########
@@ -105,6 +106,7 @@
   public static final String CONNECTION_TIMEOUT = "helixAdmin.timeOutInSec";
   private static final String MAINTENANCE_ZNODE_ID = "maintenance";
   private static final int DEFAULT_SUPERCLUSTER_REPLICA = 3;
+  public static final String OFFLINE_NODE_PURGE_TIMEOUT = "purge.timeout";

Review comment:
       If this is an API, should not the timeout be specified in the purgeOfflineInstances() call? Why still need to keep this in clusterConfig?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] narendly merged pull request #1494: Add admin and Rest API to purge instances that have been offline more than specified period of time

Posted by GitBox <gi...@apache.org>.

narendly merged pull request #1494:
URL: https://github.com/apache/helix/pull/1494


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org