Posted to user@helix.apache.org by William Morgan <wi...@morgan-fam.com> on 2023/01/29 23:15:18 UTC

Intermediate Partition Assignments Create Invalid State

Hi,

I noticed some odd behavior in a Helix cluster when PersistIntermediateAssignment is set to true.

The oddity is as follows:

We have a Helix cluster with ~250 instances and ~2000 shards, each with 1 Master and 1 Slave, and the max partitions per instance is set to 20 for both the cluster and the resource. When I add an instance to the cluster while it is stable, Helix calculates and persists an intermediate assignment as part of the Ideal State, and we end up with an instance that has more than 20 partitions assigned to it. This throws Helix into an error loop: the persisted ideal state is invalid (an instance has > 20 partitions), and Helix then fails indefinitely to compute a valid one.

The only way I was able to fix this (without completely rebuilding the cluster) was to raise the max partitions per instance to 30 for both the resource and the cluster, let the cluster become stable, and then set it back. We were lucky that we had the extra capacity hardware-wise, but this is obviously concerning because Helix is bringing itself into a state from which it cannot recover without manual intervention. Any help on why this is happening, or on whether this is a bug that has been fixed in newer versions of Helix (we're on a patched version of 1.0.2), would be appreciated.
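To make the invalid state concrete, here is a minimal sketch of the check I have in mind. It does not use any Helix API: the function name, the host and shard names, and the dict layout (partition -> {instance: state}, loosely modelled on an ideal-state map) are all assumptions for illustration. For scale, ~2000 shards with 2 replicas each is ~4000 replicas, against ~250 x 20 = ~5000 slots, so the limit should be satisfiable on paper.

```python
from collections import Counter


def overloaded_instances(assignment, max_partitions_per_instance):
    """Return {instance: replica_count} for every instance above the limit.

    `assignment` maps partition name -> {instance name: state}. This layout
    is a hypothetical stand-in for an ideal-state map, not a Helix type.
    """
    counts = Counter()
    for replicas in assignment.values():
        for instance in replicas:
            counts[instance] += 1
    return {
        instance: count
        for instance, count in counts.items()
        if count > max_partitions_per_instance
    }


# Toy data with a limit of 2 instead of 20, to keep the example short.
assignment = {
    "shard_0": {"host_a": "MASTER", "host_b": "SLAVE"},
    "shard_1": {"host_a": "MASTER", "host_c": "SLAVE"},
    "shard_2": {"host_c": "MASTER", "host_a": "SLAVE"},
}

violations = overloaded_instances(assignment, 2)
print(violations)  # host_a holds 3 replicas, which exceeds the limit of 2
```

In our case the persisted intermediate assignment is what ends up with a non-empty result for a check like this, even though the cluster was within the limit before the new instance was added.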

A secondary question: we use the Helix REST API in some cases and have noticed that it will not mark an instance as Healthy unless PersistIntermediateAssignment is set to true. But the behavior we want is PersistBestPossibleAssignment. Looking into the source code, the only place I can find that health check being used is the Helix REST API; it is not used in the WAGED rebalancer's planning. If we don't care about the health check in the REST API, should we just continue to use PersistBestPossibleAssignment? Is there any other place where that health check matters?

Thanks,

Will