Posted to dev@helix.apache.org by Wang Jiajun <er...@gmail.com> on 2019/01/07 22:54:04 UTC

Regarding Helix releasing 0.8.3

Hi Kishore,

Hope you are doing well.
Since we last met to discuss potential ZkClient improvements in Helix,
we have completed the fix for one issue. Resolving the whole list will
take more time, however. Given that Pinot is still waiting for the new
release, I'd like to hear your opinion on whether we should release 0.8.3
in its current state.

Fixed issues:

   1. For an ephemeral node, the source of truth should be the owner
   session ID instead of the node content.
   This fixes the leader election issue we found in the Pinot cluster.
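
A minimal sketch of the ownership check described above, using plain Java
stand-ins rather than the real Helix or ZooKeeper classes. EphemeralNode,
LeaderCheck, and both method names are hypothetical; in ZooKeeper itself the
owner session is exposed as Stat.getEphemeralOwner(). The scenario (a node
left behind by an earlier session of the same instance) is an assumption
about how a content-based check can go wrong.

```java
// Hypothetical stand-in for an ephemeral znode; not a Helix or ZooKeeper class.
final class EphemeralNode {
    final long ownerSessionId; // plays the role of Stat.getEphemeralOwner()
    final String content;      // payload written by the creator, e.g. an instance name

    EphemeralNode(long ownerSessionId, String content) {
        this.ownerSessionId = ownerSessionId;
        this.content = content;
    }
}

final class LeaderCheck {
    // Content-based check: trusts whatever name is stored in the node.
    static boolean isLeaderByContent(EphemeralNode node, String instanceName) {
        return node != null && instanceName.equals(node.content);
    }

    // Owner-based check: the node counts only if the current session created it.
    static boolean isLeaderByOwner(EphemeralNode node, long currentSessionId) {
        return node != null && node.ownerSessionId == currentSessionId;
    }
}
```

With a node created by session 0x1 and a client now running in session 0x2
under the same instance name, the content check still reports leadership
while the owner check does not.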

Pending issues:

   1. ZkClient should not interrupt callback handling during session
   reestablishment or other reset logic. The interrupt used for shutdown
   should happen only when the client is closed. To fix this problem, we
   need to think about how to handle thread leaks.
   2. ZkConnection.getZookeeper() == null can cause retryUntilConnect to
   terminate earlier than expected. It should keep waiting when it hits
   this error.
   3. Each ZkClient event should carry a session ID, so that the event
   processor can discard expired events.
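
To make pending issue 3 concrete, here is a rough sketch of an event that
records the session it was produced in, and a processor that drops events
from any other session. SessionTaggedEvent and SessionAwareProcessor are
hypothetical names, not existing ZkClient types, and the real change would
also touch the event handler interfaces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical event type: remembers which session produced it.
final class SessionTaggedEvent {
    final long sessionId;
    final String description;

    SessionTaggedEvent(long sessionId, String description) {
        this.sessionId = sessionId;
        this.description = description;
    }
}

// Hypothetical processor: handles only events from the current session.
final class SessionAwareProcessor {
    private final List<String> handled = new ArrayList<>();

    // Returns true if the event was handled, false if it was discarded.
    boolean process(SessionTaggedEvent event, long currentSessionId) {
        if (event.sessionId != currentSessionId) {
            return false; // produced by an expired session: discard
        }
        handled.add(event.description);
        return true;
    }

    List<String> handled() {
        return handled;
    }
}
```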

Best Regards,
Jiajun

Re: Regarding Helix releasing 0.8.3

Posted by kishore g <g....@gmail.com>.
Thanks a lot. Will look into it.

On Fri, Jan 11, 2019 at 6:18 PM Wang Jiajun <er...@gmail.com> wrote:

> Hi Kishore,
>
> I have sent a pull request to fix the first 2 issues.
> https://github.com/apache/helix/pull/297
> The 3rd one requires a much larger change. It also no longer breaks any
> logic now that we have fixed the ephemeral node owner validation. We
> think it can be scheduled for a future release.
>
> Best Regards,
> Jiajun

Re: Regarding Helix releasing 0.8.3

Posted by Wang Jiajun <er...@gmail.com>.
Hi Kishore,

I have sent a pull request to fix the first 2 issues.
https://github.com/apache/helix/pull/297
The 3rd one requires a much larger change. It also no longer breaks any
logic now that we have fixed the ephemeral node owner validation. We think
it can be scheduled for a future release.

Best Regards,
Jiajun


On Mon, Jan 7, 2019 at 3:57 PM Wang Jiajun <er...@gmail.com> wrote:

> Resending. Reply to all.
>
> We can probably fix the first 2 issues within 2 weeks, allowing for the
> additional testing and validation required.
> For issue 1, we can split the original reset into 2 methods. When
> handling a new session, we should not interrupt. When closing the
> client, we should interrupt the thread and shut down.
> For issue 2, we additionally need a try-catch for the ZooKeeper NPE.
>
> Issue 3 will take more time since we need to change both ZkClient and
> the event handler. Some interfaces may need to be updated. Moreover, it
> changes the current ZkClient behavior, so we should run it in the test
> environment for a longer time.
>
> With the ephemeral node's owner fixed, the 3rd issue does not impact
> correctness. So maybe we can fix the first 2 issues first and plan the
> 3rd issue for the next release? If so, we should have a release
> candidate in 2 weeks.
>
> Best Regards,
> Jiajun

Re: Regarding Helix releasing 0.8.3

Posted by Wang Jiajun <er...@gmail.com>.
Resending. Reply to all.

We can probably fix the first 2 issues within 2 weeks, allowing for the
additional testing and validation required.
For issue 1, we can split the original reset into 2 methods. When handling
a new session, we should not interrupt. When closing the client, we should
interrupt the thread and shut down.
For issue 2, we additionally need a try-catch for the ZooKeeper NPE.
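
As a rough illustration of the issue 2 fix, the sketch below keeps waiting
while the connection handle is null and also catches an NPE raised when the
handle disappears between the check and the call. RetrySketch and
retryUntilHandleAvailable are hypothetical names, the handle supplier stands
in for ZkConnection.getZookeeper(), and the bounded attempt count is an
assumption made only so the sketch always terminates. Note that catching
NullPointerException this broadly would also hide an NPE thrown by the
operation itself.

```java
import java.util.function.Supplier;

final class RetrySketch {
    // Hypothetical helper; 'handle' stands in for ZkConnection.getZookeeper().
    static <T> T retryUntilHandleAvailable(Supplier<?> handle, Supplier<T> op,
                                           int maxAttempts, long waitMs) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (handle.get() != null) {
                    return op.get();
                }
            } catch (NullPointerException e) {
                // The handle went away between the check and the call; retry.
            }
            pause(waitMs);
        }
        throw new IllegalStateException(
            "no connection after " + maxAttempts + " attempts");
    }

    // Sleeps between attempts, preserving the interrupt flag if interrupted.
    private static void pause(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }
}
```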

Issue 3 will take more time since we need to change both ZkClient and the
event handler. Some interfaces may need to be updated. Moreover, it
changes the current ZkClient behavior, so we should run it in the test
environment for a longer time.

With the ephemeral node's owner fixed, the 3rd issue does not impact
correctness. So maybe we can fix the first 2 issues first and plan the 3rd
issue for the next release? If so, we should have a release candidate in
2 weeks.

Best Regards,
Jiajun


On Mon, Jan 7, 2019 at 3:14 PM kishore g <g....@gmail.com> wrote:

> I think the pending issues are the ones that are affecting us. What does
> it take to fix those issues?

Re: Regarding Helix releasing 0.8.3

Posted by kishore g <g....@gmail.com>.
I think the pending issues are the ones that are affecting us. What does it
take to fix those issues?

On Mon, Jan 7, 2019 at 2:54 PM Wang Jiajun <er...@gmail.com> wrote:

> Hi Kishore,
>
> Hope you are doing well.
> Since we last met to discuss potential ZkClient improvements in Helix,
> we have completed the fix for one issue. Resolving the whole list will
> take more time, however. Given that Pinot is still waiting for the new
> release, I'd like to hear your opinion on whether we should release
> 0.8.3 in its current state.
>
> Fixed issues:
>
>    1. For an ephemeral node, the source of truth should be the owner
>    session ID instead of the node content.
>    This fixes the leader election issue we found in the Pinot cluster.
>
> Pending issues:
>
>    1. ZkClient should not interrupt callback handling during session
>    reestablishment or other reset logic. The interrupt used for shutdown
>    should happen only when the client is closed. To fix this problem, we
>    need to think about how to handle thread leaks.
>    2. ZkConnection.getZookeeper() == null can cause retryUntilConnect to
>    terminate earlier than expected. It should keep waiting when it hits
>    this error.
>    3. Each ZkClient event should carry a session ID, so that the event
>    processor can discard expired events.
>
> Best Regards,
> Jiajun
>