You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@helix.apache.org by GitBox <gi...@apache.org> on 2021/10/04 20:08:55 UTC

[GitHub] [helix] subprotocol opened a new issue #1888: Ephemeral Instance Support

subprotocol opened a new issue #1888:
URL: https://github.com/apache/helix/issues/1888


   **Feature request**
   I have a use case that involves ephemeral instances that come and go.
   
   The issue is on cleanup (say an instance dies) the instance information is retained under zookeeper /CLUSTER/INSTANCES/instanceID.
   
   
   **Describe the solution you'd like**
   
   Would it be feasible to optionally allow instance znodes to be ephemeral?  Instead of using helixAdmin.addInstance to create an instance, perhaps something like getZKHelixManager could also take an InstanceConfig for InstanceType.PARTICIPANT—and the manager would create the ephemeral znode.  When the Manager dies, the znode would be automatically cleaned up.
   
   I was curious if anyone else thought this make sense?  As I see things, this would be very useful for supporting clusters that have elastic instances.
   
   Thanks
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] subprotocol edited a comment on issue #1888: Ephemeral Instance Support

Posted by GitBox <gi...@apache.org>.

subprotocol edited a comment on issue #1888:
URL: https://github.com/apache/helix/issues/1888#issuecomment-940121753


   @jiajunwang thank you for the suggestion. I added a periodic check to the leading controller and this works great!
   
   I see one possible improvement to purgeOfflineInstances.  I am using helix instance tags to tag my spot instances.  It would be nice if purgeOfflineInstances could take a set of tags, and only purge those tagged instances.  This would allow me to purge the spot instances only, while also mitigating the race condition for other instances.
   
   I would like to propose adding a second purgeOfflineInstances method as follows:
   
   ```java
   // proposed addition
   void purgeOfflineInstances(String clusterName, long offlineDuration, Set<String> instanceTags);
   ```
   
   Does this make sense—and would this functionality be welcome as a small contribution?
   
   Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] jiajunwang commented on issue #1888: Ephemeral Instance Support

Posted by GitBox <gi...@apache.org>.

jiajunwang commented on issue #1888:
URL: https://github.com/apache/helix/issues/1888#issuecomment-938284694


   @subprotocol Thanks for the idea. If my memory is accurate, I discussed a similar design with my team before. There are some concerns regarding race condition that makes the implementation more complicated than we expected. So we end up implement HelixAdmin.purgeOfflineInstances for the immediate requests which cleans up offline&expired instances. Unfortunately, it is not an automatic mechanism so it won't fit your needs 100%. Moreover, "Please note that the purge function should only be called when there is no new instance joining happening in the cluster."


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] subprotocol edited a comment on issue #1888: Ephemeral Instance Support

Posted by GitBox <gi...@apache.org>.

subprotocol edited a comment on issue #1888:
URL: https://github.com/apache/helix/issues/1888#issuecomment-962260090


   @jiajunwang I agree, a purge instance criteria builder is a great idea here. Below is a rough sketch of the required change as I see things.
   
   
   ```java
   // Add criteria builder
   public class HelixPurgeInstanceCriteriaBuilder {
     public HelixPurgeInstanceCriteriaBuilder withSecondsOffline(int seconds);
     public HelixPurgeInstanceCriteriaBuilder withTags(Set<String> tags);
     public HelixPurgeInstanceCriteria build();
   }
   
   // Add purge instance criteria
   public class HelixPurgeInstanceCriteria {
     int getSecondsOffline();
     Set<String> getTags();
   }
   ```
   
   **HelixAdmin:**
   ```java
   // Add to HelixAdmin interface
   void purgeOfflineInstances(String clusterName, HelixPurgeInstanceCriteria purgeCriteria);
   ```
   
   **ZkHelixAdmin:**
   ```java
   // Override and implement in ZkHelixAdmin
   void purgeOfflineInstances(String clusterName, HelixPurgeInstanceCriteria purgeCriteria);
   
   // For backwards compatibility—keep old purgeOfflineInstances method signature as-is, replace the
   // implementation with new builder pattern that calls purgeOfflineInstances(clusterName, purgeCriteria)
   ```
   
   
   Example Usages:
   ```java
   // Purge instances that have been offline for at least 1 hour and tagged
   // with both: "worker" and "spot-instance"
   HelixPurgeInstanceCriteria purgeCriteria = new HelixPurgeInstanceCriteriaBuilder()
     .withSecondsOffline(3600)
     .withTags(Set.of("worker", "spot-instance"))
     .build();
   
   admin.purgeOfflineInstances("MyCluster", purgeCriteria);
   ```
   
   ```java
   // Purge instances that have been offline for 1 day
   HelixPurgeInstanceCriteria purgeCriteria = new HelixPurgeInstanceCriteriaBuilder()
     .withSecondsOffline(86400)
     .build();
   
   admin.purgeOfflineInstances("MyCluster", purgeCriteria);
   ```
   
   ```java
   // Helix does not allow dropping instances that are online
   HelixPurgeInstanceCriteria purgeCriteria = new HelixPurgeInstanceCriteriaBuilder()
     .build(); // Throws exception with message stating that withSecondsOffline is required
   ```
   
   
   
   **Clarifying Point**
   `long offlineDuration` is currently used on purgeOfflineInstances; It's not clear on the current method signature or docs that this is in milliseconds. I thought `withSecondsOffline` would be more clear and intuitive.
   
   
   **Drop vs. Purge**
   HelixAdmin currently has these methods for dropping/purging:
   * `void dropInstance(String clusterName, InstanceConfig instanceConfig);` - singular, fails if instance is online
   * `void purgeOfflineInstances(String clusterName, long offlineDuration);` - plural, only applied to offline instances
   
   Though dropping a live instance is not supported now, in the future it could be?  If so `dropInstances(clusterName, dropCriteria)` may make more sense than `purgeOfflineInstances(clusterName, purgeCriteria)`.  In this world, if/when Helix supports dropping live instances, the builder would no longer throw an exception when withSecondsOffline isn't specified.
   
   In my mind this feels more future proof.  I wanted to float this idea out there.  If this makes more sense I can refine the example code above.
   
   
   I welcome any suggestions.  Once aligned, would a PR be a sufficient starting point? Or should a JIRA be created to track this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] subprotocol commented on issue #1888: Ephemeral Instance Support

Posted by GitBox <gi...@apache.org>.

subprotocol commented on issue #1888:
URL: https://github.com/apache/helix/issues/1888#issuecomment-940121753


   @junkaixue thank you for the suggestion. I added a periodic check to the leading controller and this works great!
   
   I see one possible improvement to purgeOfflineInstances.  I am using helix instance tags to tag my spot instances.  It would be nice if purgeOfflineInstances could take a set of tags, and only purge those tagged instances.  This would allow me to purge the spot instances only, while also mitigating the race condition for other instances.
   
   I would like to propose adding a second purgeOfflineInstances method as follows:
   
   ```java
   // proposed addition
   void purgeOfflineInstances(String clusterName, long offlineDuration, Set<String> instanceTags);
   ```
   
   Does this make sense—and would this functionality be welcome as a small contribution?
   
   Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] junkaixue commented on issue #1888: Ephemeral Instance Support

Posted by GitBox <gi...@apache.org>.

junkaixue commented on issue #1888:
URL: https://github.com/apache/helix/issues/1888#issuecomment-933837813


   @subprotocol Under /CLUSTER/INSTANCE/ folder are the ones for heavy metadata of continuous connection. It will be a huge traffic to create those all for every connection and lost historical data for that instance if there is any application related operation, such as deployment, taking off for repair. The example for historical data such as connection history, or last time current state and something else.
   
   Can we understand why you need this? We can think about other solutions to support you requirement.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] jiajunwang commented on issue #1888: Ephemeral Instance Support

Posted by GitBox <gi...@apache.org>.

jiajunwang commented on issue #1888:
URL: https://github.com/apache/helix/issues/1888#issuecomment-952199651


   I vote up this idea.
   
   But it might be concerning if we introduce more parameters to the purge method. There will be a backward compatibility concern as well. It might help to define a "criteria" class with the builder pattern so the interface itself and the user code whoever needs this API is more future proof. Just my 2 cents, FYI. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] subprotocol commented on issue #1888: Ephemeral Instance Support

Posted by GitBox <gi...@apache.org>.

subprotocol commented on issue #1888:
URL: https://github.com/apache/helix/issues/1888#issuecomment-962260090


   @jiajunwang  I agree, a purge instance criteria builder is a great idea here. Below is a rough sketch of the required change as I see things.
   
   
   ```java
   // Add criteria builder
   public class HelixPurgeInstanceCriteriaBuilder {
     public HelixPurgeInstanceCriteriaBuilder withSecondsOffline(int seconds);
     public HelixPurgeInstanceCriteriaBuilder withTags(Set<String> tags);
     public HelixPurgeInstanceCriteria build();
   }
   
   // Add purge instance criteria
   public class HelixPurgeInstanceCriteria {
     int getSecondsOffline();
     Set<String> getTags();
   }
   ```
   
   **HelixAdmin:**
   ```java
   // Add to HelixAdmin interface
   void purgeOfflineInstances(String clusterName, HelixPurgeInstanceCriteria purgeCriteria);
   ```
   
   **ZkHelixAdmin:**
   ```java
   // Override and implement in ZkHelixAdmin
   void purgeOfflineInstances(String clusterName, HelixPurgeInstanceCriteria purgeCriteria);
   
   // For backwards compatibility—keep old purgeOfflineInstances method signature as-is, replace the
   // implementation with new builder pattern that calls purgeOfflineInstances(clusterName, purgeCriteria)
   ```
   
   
   Example Usages:
   ```java
   // Purge instances that have been offline for at least 1 hour and tagged
   // with both: "worker" and "spot-instance"
   HelixPurgeInstanceCriteria purgeCriteria = new HelixPurgeInstanceCriteriaBuilder()
     .withSecondsOffline(3600)
     .withTags(Set.of("worker", "spot-instance"))
     .build();
   
   admin.purgeOfflineInstances("MyCluster", purgeCriteria);
   ```
   
   ```java
   // Purge instances that have been offline for 1 day
   HelixPurgeInstanceCriteria purgeCriteria = new HelixPurgeInstanceCriteriaBuilder()
     .withSecondsOffline(86400)
     .build();
   
   admin.purgeOfflineInstances("MyCluster", purgeCriteria);
   ```
   
   ```java
   // Helix does not allow dropping instances that are online
   HelixPurgeInstanceCriteria purgeCriteria = new HelixPurgeInstanceCriteriaBuilder()
     .build(); // Throws exception with message stating that withSecondsOffline is required
   ```
   
   
   
   **Clarifying Point**
   `long offlineDuration` is currently used on purgeOfflineInstances; It's not clear on the current method signature or docs that this is in milliseconds. I thought `withSecondsOffline` would be more clear and intuitive.
   
   
   **Drop vs. Purge**
   HelixAdmin currently has these methods for dropping/purging:
   * `void dropInstance(String clusterName, InstanceConfig instanceConfig);` - singular, fails if instance is online
   * `void purgeOfflineInstances(String clusterName, long offlineDuration);` - plural, only applied to offline instances
   
   Though dropping a live instance is not supported now, in the future it could be?  If so `dropInstances(clusterName, dropCriteria)` may make more sense than `purgeOfflineInstances(clusterName, purgeCriteria)`.  In this world, if/when Helix supports dropping live instances, the builder would no longer throw an exception when withSecondsOffline isn't specified.
   
   In my mind this feels more future proof.  I wanted to float this idea out there.  If this makes more sense I can refine the example code above.
   
   
   I welcome any suggestions.  Once aligned, would a PR be a sufficient starting point? Or should a JIRA be created to track this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] subprotocol commented on issue #1888: Ephemeral Instance Support

Posted by GitBox <gi...@apache.org>.

subprotocol commented on issue #1888:
URL: https://github.com/apache/helix/issues/1888#issuecomment-934605069

> Can we understand why you need this? We can think about other solutions to support you requirement.

Absolutely! I would like to run a portion of my Helix participants on spot instances. Spot instances allow one to cheaply scale processing when there is spare capacity. The downside is that spot instances come and go, and when exactly is beyond my control. Storage is disaggregated, so rebalancing helix resources is not expensive.

I am not sure how best to configure helix for this model. If host machines are fixed, when a host first starts up it can register itself as a helix instance and use that as the participant.

When the hosts are not fixed (spot instances), as the spot instances come and go, I would end up with a forever increasing set of helix instances in zookeeper.

I understand and share your concern about high traffic on the zookeeper path. Would a better strategy be to try and re-use helix instances that are offline when a spot instance first starts up? Or is there a better strategy that I'm not seeing?

Thank you @junkaixue!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org