Posted to user@helix.apache.org by "Phong X. Nguyen" <p....@yahooinc.com> on 2022/06/08 01:49:56 UTC

Changed behavior for WAGED in Helix 1.0.3?

Helix Team,

We're testing an upgrade to Helix 1.0.3 from Helix 1.0.1 primarily for the
log4j2 fixes. As we test it, we're discovering that WAGED seems to be
rebalancing in a slightly different way than before:

Our configuration has 32 instances and 32 partitions. The simpleFields
configuration is as follows:

"simpleFields" : {
    "HELIX_ENABLED" : "true",
    "NUM_PARTITIONS" : "32",
    "MAX_PARTITIONS_PER_INSTANCE" : "4",
    "DELAY_REBALANCE_ENABLE" : "true",
    "DELAY_REBALANCE_TIME" : "30000",
    "REBALANCE_MODE" : "FULL_AUTO",
    "REBALANCER_CLASS_NAME" :
"org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
    "REPLICAS" : "1",
    "STATE_MODEL_DEF_REF" : "OnlineOffline",
    "STATE_MODEL_FACTORY_NAME" : "DEFAULT"
  }
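
For reference, a minimal Java sketch of how a resource with these simpleFields
might be defined through the Helix admin API. The ZooKeeper address, cluster
name ("helix"), and resource name ("myResource") are placeholders, and exact
setter names can vary between Helix versions:

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;
import org.apache.helix.controller.rebalancer.waged.WagedRebalancer;

public class WagedResourceSetup {
  public static void main(String[] args) {
    // Placeholder ZooKeeper address and cluster/resource names.
    HelixAdmin admin = new ZKHelixAdmin("localhost:2181");

    IdealState idealState = new IdealState("myResource");
    idealState.setNumPartitions(32);
    idealState.setMaxPartitionsPerInstance(4);
    idealState.setReplicas("1");
    idealState.setStateModelDefRef("OnlineOffline");
    idealState.setRebalanceMode(IdealState.RebalanceMode.FULL_AUTO);
    idealState.setRebalancerClassName(WagedRebalancer.class.getName());

    // Delayed-rebalance settings, written as raw simpleFields to mirror the JSON above.
    idealState.getRecord().setSimpleField("DELAY_REBALANCE_ENABLE", "true");
    idealState.getRecord().setSimpleField("DELAY_REBALANCE_TIME", "30000");

    admin.addResource("helix", "myResource", idealState);
    admin.close();
  }
}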

Out of the 32 instances, we have 2 production test servers, e.g. 'server01'
and 'server02'.

Previously, if we restarted the application on 'server01' in order to
deploy some test code, Helix would move one of the partitions over to
another host, and when 'server01' came back online the partition would be
rebalanced back. Currently we are not seeing this behavior; the partition
stays with the other host and does not go back. While this is within the
constraints of the max partitions, we're confused as to why this might
happen now.

Have there been any changes to WAGED that might account for this? The
release notes mentioned that both 1.0.2 and 1.0.3 made some changes to
Helix.

Thanks,
- Phong X. Nguyen

Re: [E] Re: Changed behavior for WAGED in Helix 1.0.3?

Posted by "Phong X. Nguyen" <p....@yahooinc.com>.
[zk: localhost:2181(CONNECTED) 15] get /helix/CONFIGS/CLUSTER/helix
{
  "id" : "helix",
  "mapFields" : {
  },
  "listFields" : {
  },
  "simpleFields" : {
  }
}
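
For what it's worth, a minimal Java sketch of reading the same ClusterConfig
through the Helix API and printing the map field of disabled instances that
Junkai suggests checking. The ZooKeeper address is a placeholder, "helix"
matches the cluster name in the znode path above, and the getDisabledInstances()
getter is assumed to be available in this Helix version:

import java.util.Map;
import org.apache.helix.ConfigAccessor;
import org.apache.helix.model.ClusterConfig;

public class CheckDisabledInstances {
  public static void main(String[] args) {
    // Placeholder ZK address; "helix" is the cluster name from the znode path above.
    ConfigAccessor configAccessor = new ConfigAccessor("localhost:2181");
    ClusterConfig clusterConfig = configAccessor.getClusterConfig("helix");

    // Batch enable/disable records disabled instances in a ClusterConfig map
    // field; getDisabledInstances() is assumed to expose that map here.
    Map<String, String> disabled = clusterConfig.getDisabledInstances();
    System.out.println("Disabled instances in ClusterConfig: " + disabled);
  }
}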

On Tue, Jun 7, 2022 at 7:29 PM Junkai Xue <ju...@gmail.com> wrote:

> What I mean is the ZNode inside the Zookeeper under path of /[your cluster
> name]/CONFIGS/CLUSTER/[your cluster name]
>
> Best,
>
> Junkai
>
> On Tue, Jun 7, 2022 at 7:24 PM Phong X. Nguyen <p....@yahooinc.com>
> wrote:
>
>> Yes, it involves enable/disable operations as the server comes up and
>> down. In the logs we would sometimes not see the host in the "Current quota
>> capacity" log message, either.
>>
>> When you refer to Cluster Config, did you mean what's accessible by
>> "-listClusterInfo helix-ctrl" ?
>>
>> Thanks,
>> Phong X. Nguyen
>>
>> On Tue, Jun 7, 2022 at 7:19 PM Junkai Xue <ju...@gmail.com> wrote:
>>
>>> Hi Phong,
>>>
>>> Thanks for leveraging Helix 1.0.3. I have a question for your testing.
>>> Will this test involve enable/disable operation? If yes, it could be a bug
>>> that was caused in 1.0.3, which leads to the instance being disabled
>>> through batch enable/disable. One thing you can verify: check the Cluster
>>> Config to see in map fields of disabled instances whether they contain the
>>> instance coming back.
>>>
>>> We are working on the 1.0.4 version to fix that.
>>>
>>> Best,
>>>
>>> Junkai
>>>
>>>
>>>
>>> On Tue, Jun 7, 2022 at 6:50 PM Phong X. Nguyen <p....@yahooinc.com>
>>> wrote:
>>>
>>>> Helix Team,
>>>>
>>>> We're testing an upgrade to Helix 1.0.3 from Helix 1.0.1 primarily for
>>>> the log4j2 fixes. As we test it, we're discovering that WAGED seems to be
>>>> rebalancing in a slightly different way than before:
>>>>
>>>> Our configuration has 32 instances and 32 partitions. The simpleFields
>>>> configuration is as follows:
>>>>
>>>> "simpleFields" : {
>>>>     "HELIX_ENABLED" : "true",
>>>>     "NUM_PARTITIONS" : "32",
>>>>     "MAX_PARTITIONS_PER_INSTANCE" : "4",
>>>>     "DELAY_REBALANCE_ENABLE" : "true",
>>>>     "DELAY_REBALANCE_TIME" : "30000",
>>>>     "REBALANCE_MODE" : "FULL_AUTO",
>>>>     "REBALANCER_CLASS_NAME" :
>>>> "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>>>>     "REPLICAS" : "1",
>>>>     "STATE_MODEL_DEF_REF" : "OnlineOffline",
>>>>     "STATE_MODEL_FACTORY_NAME" : "DEFAULT"
>>>>   }
>>>>
>>>> Out of the 32 instances, we have 2 production test servers, e.g.
>>>> 'server01' and 'server02'.
>>>>
>>>> Previously, if we restarted the application on 'server01' in order to
>>>> deploy some test code, Helix would move one of the partitions over to
>>>> another host, and when 'server01' came back online the partition would be
>>>> rebalanced back. Currently we are not seeing this behavior; the partition
>>>> stays with the other host and does not go back. While this is within the
>>>> constraints of the max partitions, we're confused as to why this might
>>>> happen now.
>>>>
>>>> Have there been any changes to WAGED that might account for this? The
>>>> release notes mentioned that both 1.0.2 and 1.0.3 made some changes to
>>>> Helix.
>>>>
>>>> Thanks,
>>>> - Phong X. Nguyen
>>>>
>>>
>>>
>>> --
>>> Junkai Xue
>>>
>>
>
> --
> Junkai Xue
>

Re: [E] Re: Changed behavior for WAGED in Helix 1.0.3?

Posted by "Phong X. Nguyen" <p....@yahooinc.com>.
I don't normally have direct access to the ZooKeeper cluster itself; I'll
see if we can get our production engineers to dump that ZNode when we're
testing it again.

On Tue, Jun 7, 2022 at 7:29 PM Junkai Xue <ju...@gmail.com> wrote:

> What I mean is the ZNode inside the Zookeeper under path of /[your cluster
> name]/CONFIGS/CLUSTER/[your cluster name]
>
> Best,
>
> Junkai
>
> On Tue, Jun 7, 2022 at 7:24 PM Phong X. Nguyen <p....@yahooinc.com>
> wrote:
>
>> Yes, it involves enable/disable operations as the server comes up and
>> down. In the logs we would sometimes not see the host in the "Current quota
>> capacity" log message, either.
>>
>> When you refer to Cluster Config, did you mean what's accessible by
>> "-listClusterInfo helix-ctrl" ?
>>
>> Thanks,
>> Phong X. Nguyen
>>
>> On Tue, Jun 7, 2022 at 7:19 PM Junkai Xue <ju...@gmail.com> wrote:
>>
>>> Hi Phong,
>>>
>>> Thanks for leveraging Helix 1.0.3. I have a question for your testing.
>>> Will this test involve enable/disable operation? If yes, it could be a bug
>>> that was caused in 1.0.3, which leads to the instance being disabled
>>> through batch enable/disable. One thing you can verify: check the Cluster
>>> Config to see in map fields of disabled instances whether they contain the
>>> instance coming back.
>>>
>>> We are working on the 1.0.4 version to fix that.
>>>
>>> Best,
>>>
>>> Junkai
>>>
>>>
>>>
>>> On Tue, Jun 7, 2022 at 6:50 PM Phong X. Nguyen <p....@yahooinc.com>
>>> wrote:
>>>
>>>> Helix Team,
>>>>
>>>> We're testing an upgrade to Helix 1.0.3 from Helix 1.0.1 primarily for
>>>> the log4j2 fixes. As we test it, we're discovering that WAGED seems to be
>>>> rebalancing in a slightly different way than before:
>>>>
>>>> Our configuration has 32 instances and 32 partitions. The simpleFields
>>>> configuration is as follows:
>>>>
>>>> "simpleFields" : {
>>>>     "HELIX_ENABLED" : "true",
>>>>     "NUM_PARTITIONS" : "32",
>>>>     "MAX_PARTITIONS_PER_INSTANCE" : "4",
>>>>     "DELAY_REBALANCE_ENABLE" : "true",
>>>>     "DELAY_REBALANCE_TIME" : "30000",
>>>>     "REBALANCE_MODE" : "FULL_AUTO",
>>>>     "REBALANCER_CLASS_NAME" :
>>>> "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>>>>     "REPLICAS" : "1",
>>>>     "STATE_MODEL_DEF_REF" : "OnlineOffline",
>>>>     "STATE_MODEL_FACTORY_NAME" : "DEFAULT"
>>>>   }
>>>>
>>>> Out of the 32 instances, we have 2 production test servers, e.g.
>>>> 'server01' and 'server02'.
>>>>
>>>> Previously, if we restarted the application on 'server01' in order to
>>>> deploy some test code, Helix would move one of the partitions over to
>>>> another host, and when 'server01' came back online the partition would be
>>>> rebalanced back. Currently we are not seeing this behavior; the partition
>>>> stays with the other host and does not go back. While this is within the
>>>> constraints of the max partitions, we're confused as to why this might
>>>> happen now.
>>>>
>>>> Have there been any changes to WAGED that might account for this? The
>>>> release notes mentioned that both 1.0.2 and 1.0.3 made some changes to
>>>> Helix.
>>>>
>>>> Thanks,
>>>> - Phong X. Nguyen
>>>>
>>>
>>>
>>> --
>>> Junkai Xue
>>>
>>
>
> --
> Junkai Xue
>

Re: [E] Re: Changed behavior for WAGED in Helix 1.0.3?

Posted by Junkai Xue <ju...@gmail.com>.
What I mean is the ZNode inside ZooKeeper under the path /[your cluster
name]/CONFIGS/CLUSTER/[your cluster name].
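
As a sketch of dumping that ZNode programmatically (plain ZooKeeper client in
Java, with "helix" standing in for the cluster name and localhost:2181 for the
connect string, both placeholders):

import org.apache.zookeeper.ZooKeeper;

public class DumpClusterConfigZNode {
  public static void main(String[] args) throws Exception {
    // Placeholder connect string; the path is /<cluster>/CONFIGS/CLUSTER/<cluster>.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
    byte[] data = zk.getData("/helix/CONFIGS/CLUSTER/helix", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}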

Best,

Junkai

On Tue, Jun 7, 2022 at 7:24 PM Phong X. Nguyen <p....@yahooinc.com>
wrote:

> Yes, it involves enable/disable operations as the server comes up and
> down. In the logs we would sometimes not see the host in the "Current quota
> capacity" log message, either.
>
> When you refer to Cluster Config, did you mean what's accessible by
> "-listClusterInfo helix-ctrl" ?
>
> Thanks,
> Phong X. Nguyen
>
> On Tue, Jun 7, 2022 at 7:19 PM Junkai Xue <ju...@gmail.com> wrote:
>
>> Hi Phong,
>>
>> Thanks for leveraging Helix 1.0.3. I have a question for your testing.
>> Will this test involve enable/disable operation? If yes, it could be a bug
>> that was caused in 1.0.3, which leads to the instance being disabled
>> through batch enable/disable. One thing you can verify: check the Cluster
>> Config to see in map fields of disabled instances whether they contain the
>> instance coming back.
>>
>> We are working on the 1.0.4 version to fix that.
>>
>> Best,
>>
>> Junkai
>>
>>
>>
>> On Tue, Jun 7, 2022 at 6:50 PM Phong X. Nguyen <p....@yahooinc.com>
>> wrote:
>>
>>> Helix Team,
>>>
>>> We're testing an upgrade to Helix 1.0.3 from Helix 1.0.1 primarily for
>>> the log4j2 fixes. As we test it, we're discovering that WAGED seems to be
>>> rebalancing in a slightly different way than before:
>>>
>>> Our configuration has 32 instances and 32 partitions. The simpleFields
>>> configuration is as follows:
>>>
>>> "simpleFields" : {
>>>     "HELIX_ENABLED" : "true",
>>>     "NUM_PARTITIONS" : "32",
>>>     "MAX_PARTITIONS_PER_INSTANCE" : "4",
>>>     "DELAY_REBALANCE_ENABLE" : "true",
>>>     "DELAY_REBALANCE_TIME" : "30000",
>>>     "REBALANCE_MODE" : "FULL_AUTO",
>>>     "REBALANCER_CLASS_NAME" :
>>> "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>>>     "REPLICAS" : "1",
>>>     "STATE_MODEL_DEF_REF" : "OnlineOffline",
>>>     "STATE_MODEL_FACTORY_NAME" : "DEFAULT"
>>>   }
>>>
>>> Out of the 32 instances, we have 2 production test servers, e.g.
>>> 'server01' and 'server02'.
>>>
>>> Previously, if we restarted the application on 'server01' in order to
>>> deploy some test code, Helix would move one of the partitions over to
>>> another host, and when 'server01' came back online the partition would be
>>> rebalanced back. Currently we are not seeing this behavior; the partition
>>> stays with the other host and does not go back. While this is within the
>>> constraints of the max partitions, we're confused as to why this might
>>> happen now.
>>>
>>> Have there been any changes to WAGED that might account for this? The
>>> release notes mentioned that both 1.0.2 and 1.0.3 made some changes to
>>> Helix.
>>>
>>> Thanks,
>>> - Phong X. Nguyen
>>>
>>
>>
>> --
>> Junkai Xue
>>
>

-- 
Junkai Xue

Re: [E] Re: Changed behavior for WAGED in Helix 1.0.3?

Posted by "Phong X. Nguyen" <p....@yahooinc.com>.
Yes, it involves enable/disable operations as the server goes down and comes back up.
In the logs we would sometimes not see the host in the "Current quota
capacity" log message, either.

When you refer to Cluster Config, did you mean what's accessible by
"-listClusterInfo helix-ctrl" ?

Thanks,
Phong X. Nguyen

On Tue, Jun 7, 2022 at 7:19 PM Junkai Xue <ju...@gmail.com> wrote:

> Hi Phong,
>
> Thanks for leveraging Helix 1.0.3. I have a question for your testing.
> Will this test involve enable/disable operation? If yes, it could be a bug
> that was caused in 1.0.3, which leads to the instance being disabled
> through batch enable/disable. One thing you can verify: check the Cluster
> Config to see in map fields of disabled instances whether they contain the
> instance coming back.
>
> We are working on the 1.0.4 version to fix that.
>
> Best,
>
> Junkai
>
>
>
> On Tue, Jun 7, 2022 at 6:50 PM Phong X. Nguyen <p....@yahooinc.com>
> wrote:
>
>> Helix Team,
>>
>> We're testing an upgrade to Helix 1.0.3 from Helix 1.0.1 primarily for
>> the log4j2 fixes. As we test it, we're discovering that WAGED seems to be
>> rebalancing in a slightly different way than before:
>>
>> Our configuration has 32 instances and 32 partitions. The simpleFields
>> configuration is as follows:
>>
>> "simpleFields" : {
>>     "HELIX_ENABLED" : "true",
>>     "NUM_PARTITIONS" : "32",
>>     "MAX_PARTITIONS_PER_INSTANCE" : "4",
>>     "DELAY_REBALANCE_ENABLE" : "true",
>>     "DELAY_REBALANCE_TIME" : "30000",
>>     "REBALANCE_MODE" : "FULL_AUTO",
>>     "REBALANCER_CLASS_NAME" :
>> "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>>     "REPLICAS" : "1",
>>     "STATE_MODEL_DEF_REF" : "OnlineOffline",
>>     "STATE_MODEL_FACTORY_NAME" : "DEFAULT"
>>   }
>>
>> Out of the 32 instances, we have 2 production test servers, e.g.
>> 'server01' and 'server02'.
>>
>> Previously, if we restarted the application on 'server01' in order to
>> deploy some test code, Helix would move one of the partitions over to
>> another host, and when 'server01' came back online the partition would be
>> rebalanced back. Currently we are not seeing this behavior; the partition
>> stays with the other host and does not go back. While this is within the
>> constraints of the max partitions, we're confused as to why this might
>> happen now.
>>
>> Have there been any changes to WAGED that might account for this? The
>> release notes mentioned that both 1.0.2 and 1.0.3 made some changes to
>> Helix.
>>
>> Thanks,
>> - Phong X. Nguyen
>>
>
>
> --
> Junkai Xue
>

Re: Changed behavior for WAGED in Helix 1.0.3?

Posted by Junkai Xue <ju...@gmail.com>.
Hi Phong,

Thanks for trying out Helix 1.0.3. I have a question about your testing:
does it involve any enable/disable operations? If yes, this could be a bug
introduced in 1.0.3 that leaves an instance marked as disabled after a
batch enable/disable. One thing you can verify: check the Cluster Config
and see whether its map field of disabled instances still contains the
instance that came back.
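
For context, the batch enable/disable call in question looks roughly like the
Java sketch below, with placeholder ZooKeeper address, cluster, and instance
names; the exact HelixAdmin overloads may differ between versions. The
suspected bug is that after such a cycle the instance can remain listed in the
Cluster Config's disabled-instances map field even though it is back online:

import java.util.Collections;
import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;

public class BatchEnableDisableExample {
  public static void main(String[] args) {
    // Placeholder ZK address, cluster, and instance names.
    HelixAdmin admin = new ZKHelixAdmin("localhost:2181");

    // Batch-disable the instance while it is being restarted...
    admin.enableInstance("helix", Collections.singletonList("server01"), false);

    // ...and batch re-enable it once it is back online. The suspected 1.0.3
    // issue is that the disabled-instances map field in the ClusterConfig is
    // not cleaned up by this call, so the instance stays logically disabled.
    admin.enableInstance("helix", Collections.singletonList("server01"), true);

    admin.close();
  }
}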

We are working on the 1.0.4 version to fix that.

Best,

Junkai



On Tue, Jun 7, 2022 at 6:50 PM Phong X. Nguyen <p....@yahooinc.com>
wrote:

> Helix Team,
>
> We're testing an upgrade to Helix 1.0.3 from Helix 1.0.1 primarily for the
> log4j2 fixes. As we test it, we're discovering that WAGED seems to be
> rebalancing in a slightly different way than before:
>
> Our configuration has 32 instances and 32 partitions. The simpleFields
> configuration is as follows:
>
> "simpleFields" : {
>     "HELIX_ENABLED" : "true",
>     "NUM_PARTITIONS" : "32",
>     "MAX_PARTITIONS_PER_INSTANCE" : "4",
>     "DELAY_REBALANCE_ENABLE" : "true",
>     "DELAY_REBALANCE_TIME" : "30000",
>     "REBALANCE_MODE" : "FULL_AUTO",
>     "REBALANCER_CLASS_NAME" :
> "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>     "REPLICAS" : "1",
>     "STATE_MODEL_DEF_REF" : "OnlineOffline",
>     "STATE_MODEL_FACTORY_NAME" : "DEFAULT"
>   }
>
> Out of the 32 instances, we have 2 production test servers, e.g.
> 'server01' and 'server02'.
>
> Previously, if we restarted the application on 'server01' in order to
> deploy some test code, Helix would move one of the partitions over to
> another host, and when 'server01' came back online the partition would be
> rebalanced back. Currently we are not seeing this behavior; the partition
> stays with the other host and does not go back. While this is within the
> constraints of the max partitions, we're confused as to why this might
> happen now.
>
> Have there been any changes to WAGED that might account for this? The
> release notes mentioned that both 1.0.2 and 1.0.3 made some changes to
> Helix.
>
> Thanks,
> - Phong X. Nguyen
>


-- 
Junkai Xue