Posted to user@helix.apache.org by "Phong X. Nguyen" <p....@yahooinc.com> on 2022/06/08 02:23:50 UTC

Re: [E] Re: Changed behavior for WAGED in Helix 1.0.3?

Yes, it involves enable/disable operations as the server goes down and comes
back up. In the logs we would sometimes not see the host in the "Current quota
capacity" log message, either.

When you refer to Cluster Config, did you mean what's accessible via
"-listClusterInfo helix-ctrl"?

Thanks,
Phong X. Nguyen
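
For reference, a minimal sketch of inspecting the Cluster Config's
disabled-instances map through the Helix Java API rather than the CLI. The ZK
address, cluster name, and instance name below are placeholders, and it assumes
the 1.0.x ConfigAccessor / ClusterConfig.getDisabledInstances() methods:

import java.util.Map;

import org.apache.helix.ConfigAccessor;
import org.apache.helix.model.ClusterConfig;
import org.apache.helix.model.InstanceConfig;

public class DisabledInstanceCheck {
  public static void main(String[] args) {
    // Placeholders: use the real ZK address and cluster name.
    ConfigAccessor configAccessor = new ConfigAccessor("localhost:2181");
    ClusterConfig clusterConfig = configAccessor.getClusterConfig("helix");

    // Instances disabled via the batch enable/disable path are tracked in a
    // ClusterConfig map field; an instance stuck in this map would stay
    // excluded from the assignment even after it comes back online.
    Map<String, String> batchDisabled = clusterConfig.getDisabledInstances();
    System.out.println("Batch-disabled instances: " + batchDisabled);

    // The per-instance enabled flag lives in InstanceConfig.
    InstanceConfig instanceConfig =
        configAccessor.getInstanceConfig("helix", "server01_12345");
    System.out.println("server01 enabled: " + instanceConfig.getInstanceEnabled());
  }
}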

On Tue, Jun 7, 2022 at 7:19 PM Junkai Xue <ju...@gmail.com> wrote:

> Hi Phong,
>
> Thanks for leveraging Helix 1.0.3. I have a question about your testing:
> does this test involve enable/disable operations? If yes, it could be a bug
> introduced in 1.0.3 that leaves the instance disabled through batch
> enable/disable. One thing you can verify: check the Cluster Config to see
> whether its map field of disabled instances still contains the instance
> that came back.
>
> We are working on the 1.0.4 release to fix that.
>
> Best,
>
> Junkai
>
>
>
> On Tue, Jun 7, 2022 at 6:50 PM Phong X. Nguyen <p....@yahooinc.com>
> wrote:
>
>> Helix Team,
>>
>> We're testing an upgrade to Helix 1.0.3 from Helix 1.0.1 primarily for
>> the log4j2 fixes. As we test it, we're discovering that WAGED seems to be
>> rebalancing in a slightly different way than before:
>>
>> Our configuration has 32 instances and 32 partitions. The simpleFields
>> configuration is as follows:
>>
>> "simpleFields" : {
>>     "HELIX_ENABLED" : "true",
>>     "NUM_PARTITIONS" : "32",
>>     "MAX_PARTITIONS_PER_INSTANCE" : "4",
>>     "DELAY_REBALANCE_ENABLE" : "true",
>>     "DELAY_REBALANCE_TIME" : "30000",
>>     "REBALANCE_MODE" : "FULL_AUTO",
>>     "REBALANCER_CLASS_NAME" :
>> "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>>     "REPLICAS" : "1",
>>     "STATE_MODEL_DEF_REF" : "OnlineOffline",
>>     "STATE_MODEL_FACTORY_NAME" : "DEFAULT"
>>   }
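
For context, a minimal sketch of reading these fields back from the resource's
IdealState through the Java API, to confirm what the controller actually sees.
The ZK address, cluster name, and resource name "myResource" are placeholders:

import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;

public class InspectDelayedRebalance {
  public static void main(String[] args) {
    // Placeholders: use the real ZK address, cluster name, and resource name.
    ZKHelixAdmin admin = new ZKHelixAdmin("localhost:2181");
    IdealState idealState = admin.getResourceIdealState("helix", "myResource");

    // Rebalancer class and delayed-rebalance settings as stored in the
    // IdealState simple fields shown above.
    System.out.println("Rebalancer: " + idealState.getRebalancerClassName());
    System.out.println("Delay enabled: "
        + idealState.getRecord().getSimpleField("DELAY_REBALANCE_ENABLE"));
    System.out.println("Delay (ms): "
        + idealState.getRecord().getSimpleField("DELAY_REBALANCE_TIME"));
    System.out.println("Max partitions per instance: "
        + idealState.getMaxPartitionsPerInstance());

    admin.close();
  }
}
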
>>
>> Out of the 32 instances, we have two production test servers, call them
>> 'server01' and 'server02'.
>>
>> Previously, if we restarted the application on 'server01' in order to
>> deploy some test code, Helix would move one of the partitions over to
>> another host, and when 'server01' came back online the partition would be
>> rebalanced back. Currently we are not seeing this behavior; the partition
>> stays with the other host and does not go back. While this is within the
>> MAX_PARTITIONS_PER_INSTANCE constraint, we're confused as to why this might
>> happen now.
>>
>> Have there been any changes to WAGED that might account for this? The
>> release notes mentioned that both 1.0.2 and 1.0.3 made some changes to
>> Helix.
>>
>> Thanks,
>> - Phong X. Nguyen
>>
>
>
> --
> Junkai Xue
>

Re: [E] Re: Changed behavior for WAGED in Helix 1.0.3?

Posted by "Phong X. Nguyen" <p....@yahooinc.com>.
[zk: localhost:2181(CONNECTED) 15] get /helix/CONFIGS/CLUSTER/helix
{
  "id" : "helix",
  "mapFields" : {
  },
  "listFields" : {
  },
  "simpleFields" : {
  }
}

On Tue, Jun 7, 2022 at 7:29 PM Junkai Xue <ju...@gmail.com> wrote:

> What I mean is the ZNode inside ZooKeeper under the path /[your cluster
> name]/CONFIGS/CLUSTER/[your cluster name].
>
> Best,
>
> Junkai
>
> --
> Junkai Xue
>

Re: [E] Re: Changed behavior for WAGED in Helix 1.0.3?

Posted by "Phong X. Nguyen" <p....@yahooinc.com>.
I don't normally have direct access to the ZooKeeper cluster itself; I'll
see if we can get our production engineers to dump that ZNode when we're
testing it again.
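
If direct zkCli access stays hard to get, a minimal sketch of dumping the same
ClusterConfig record through the Helix Java API instead; the ZK address and
cluster name are placeholders:

import org.apache.helix.ConfigAccessor;
import org.apache.helix.model.ClusterConfig;

public class DumpClusterConfig {
  public static void main(String[] args) {
    // Placeholders: use the real ZK address and cluster name.
    ConfigAccessor configAccessor = new ConfigAccessor("localhost:2181");
    ClusterConfig clusterConfig = configAccessor.getClusterConfig("helix");

    // The backing ZNRecord mirrors /<cluster>/CONFIGS/CLUSTER/<cluster>:
    // simpleFields, listFields, and mapFields (including any batch-disabled
    // instances) are all printed here.
    System.out.println(clusterConfig.getRecord());
  }
}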

On Tue, Jun 7, 2022 at 7:29 PM Junkai Xue <ju...@gmail.com> wrote:

> What I mean is the ZNode inside ZooKeeper under the path /[your cluster
> name]/CONFIGS/CLUSTER/[your cluster name].
>
> Best,
>
> Junkai
>
> --
> Junkai Xue
>

Re: [E] Re: Changed behavior for WAGED in Helix 1.0.3?

Posted by Junkai Xue <ju...@gmail.com>.
What I mean is the ZNode inside ZooKeeper under the path /[your cluster
name]/CONFIGS/CLUSTER/[your cluster name].

Best,

Junkai

On Tue, Jun 7, 2022 at 7:24 PM Phong X. Nguyen <p....@yahooinc.com>
wrote:

> Yes, it involves enable/disable operations as the server goes down and comes
> back up. In the logs we would sometimes not see the host in the "Current quota
> capacity" log message, either.
>
> When you refer to Cluster Config, did you mean what's accessible via
> "-listClusterInfo helix-ctrl"?
>
> Thanks,
> Phong X. Nguyen

-- 
Junkai Xue