Posted to user@helix.apache.org by Bo Liu <ne...@gmail.com> on 2018/04/18 17:44:01 UTC

reduce error rate during service deploy

Hi folks,

We are running a service managed by Helix. When we do a rolling restart of
the service, we first disable instances through Helix before restarting the
service processes, in the hope of minimizing read errors.

However, the instances being restarted may get Online->Offline messages
before clients get the latest version of the external view. I am wondering
whether there is any way to delay the Online->Offline messages generated by
the instance "disable" command.

-- 
Best regards,
Bo

Re: reduce error rate during service deploy

Posted by Bo Liu <ne...@gmail.com>.

That totally makes sense. Thank you!

On Wed, Apr 18, 2018 at 6:49 PM, kishore g <g....@gmail.com> wrote:

> Sorry for not being clear in the previous email. The solution I described
> is not something Helix provides out of the box, but it is built on top of
> primitives that Helix already provides.
>
> In your case, you need to listen to both external view and instance config
> changes and compose the shard mapping config file from them. The difference
> is that you ignore the instances where shutdownInProgress=true.


-- 
Best regards,
Bo

Re: reduce error rate during service deploy

Posted by kishore g <g....@gmail.com>.

Sorry for not being clear in the previous email. The solution I described
is not something Helix provides out of the box, but it is built on top of
primitives that Helix already provides.

In your case, you need to listen to both external view and instance config
changes and compose the shard mapping config file from them. The difference
is that you ignore the instances where shutdownInProgress=true.
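
As a rough illustration of the filtering step described above, here is a
minimal Java sketch. It is not Helix API code: the class name
ShardMapComposer, the method compose, the nested-map layout used for the
external view, and the set of serving state names are all assumptions made
for this example. In a real spectator, the two inputs would be rebuilt from
the external view and instance config change callbacks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper: builds a client-facing shard map from a snapshot of
// the external view, skipping instances that have announced a shutdown.
class ShardMapComposer {
    // Assumed names of the states in which a replica can serve reads.
    static final Set<String> SERVING_STATES =
            new HashSet<>(Arrays.asList("ONLINE", "MASTER", "SLAVE"));

    // externalView: partition name -> (instance name -> current state)
    // shuttingDown: instances whose config has shutdownInProgress=true
    static Map<String, List<String>> compose(
            Map<String, Map<String, String>> externalView,
            Set<String> shuttingDown) {
        Map<String, List<String>> shardMap = new TreeMap<>();
        for (Map.Entry<String, Map<String, String>> partition : externalView.entrySet()) {
            List<String> hosts = new ArrayList<>();
            for (Map.Entry<String, String> replica : partition.getValue().entrySet()) {
                boolean serving = SERVING_STATES.contains(replica.getValue());
                // Drop the replica if it is not serving or is about to restart.
                if (serving && !shuttingDown.contains(replica.getKey())) {
                    hosts.add(replica.getKey());
                }
            }
            Collections.sort(hosts); // stable output for the generated file
            shardMap.put(partition.getKey(), hosts);
        }
        return shardMap;
    }
}
```

Because the flag is read from instance config rather than from the external
view, the composer has to be re-run on either kind of change, which is why
both listeners are needed.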



On Wed, Apr 18, 2018 at 6:24 PM, Bo Liu <ne...@gmail.com> wrote:

> We listen to external view changes and compose a config file of shard
> mappings, in our internal format, that is used by our clients.
>
> Will shutdownInProgress=true in InstanceConfig get reflected in the external
> view immediately (that is, will all partitions on hosts with
> shutdownInProgress=true be marked as being in a state other than
> Online/Master/Slave)?

Re: reduce error rate during service deploy

Posted by Bo Liu <ne...@gmail.com>.

We listen to external view changes and compose a config file of shard
mappings, in our internal format, that is used by our clients.

Will shutdownInProgress=true in InstanceConfig get reflected in the external
view immediately (that is, will all partitions on hosts with
shutdownInProgress=true be marked as being in a state other than
Online/Master/Slave)?

On Wed, Apr 18, 2018 at 6:14 PM, kishore g <g....@gmail.com> wrote:

> This is a good question. We had the same problem in Pinot and solved it
> using a shutdownInProgress flag in the instanceConfig znode. The spectator
> looks at this flag and stops routing queries to that node. We avoided the
> disable-instance approach.
>
> The solution is as follows:
> - The participant sets shutdownInProgress=true in InstanceConfig in its
> shutdownHook.
> - The broker routing table gets updated because it listens to changes in
> instanceConfig.
> - The routingTableProvider treats this node as disabled if it sees this
> flag (you need to extend RoutingTableProvider).
> - When the participant restarts, as part of registering itself in Helix,
> it sets shutdownInProgress=false.
>
> This is a valid feature and could potentially be added to Helix.
>
> thanks,
> Kishore G


-- 
Best regards,
Bo

Re: reduce error rate during service deploy

Posted by kishore g <g....@gmail.com>.

This is a good question. We had the same problem in Pinot and solved it
using a shutdownInProgress flag in the instanceConfig znode. The spectator
looks at this flag and stops routing queries to that node. We avoided the
disable-instance approach.

The solution is as follows:
- The participant sets shutdownInProgress=true in InstanceConfig in its
shutdownHook.
- The broker routing table gets updated because it listens to changes in
instanceConfig.
- The routingTableProvider treats this node as disabled if it sees this
flag (you need to extend RoutingTableProvider).
- When the participant restarts, as part of registering itself in Helix, it
sets shutdownInProgress=false.

This is a valid feature and could potentially be added to Helix.
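
The participant side of the steps above can be sketched as follows. This is
a minimal, self-contained Java illustration, not Pinot's or Helix's actual
code: InstanceConfigStore is a hypothetical in-memory stand-in for the
InstanceConfig znode, and Participant, register, beginShutdown, and
installShutdownHook are names invented for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for per-instance config fields.
// In a real cluster this data would live in the InstanceConfig znode.
class InstanceConfigStore {
    static final String SHUTDOWN_FLAG = "shutdownInProgress";

    private final Map<String, Map<String, String>> fieldsByInstance =
            new ConcurrentHashMap<>();

    void setShutdownInProgress(String instance, boolean value) {
        fieldsByInstance
                .computeIfAbsent(instance, k -> new ConcurrentHashMap<>())
                .put(SHUTDOWN_FLAG, Boolean.toString(value));
    }

    // A missing instance or a missing flag is treated as "not shutting down".
    boolean isShutdownInProgress(String instance) {
        Map<String, String> fields = fieldsByInstance.get(instance);
        return fields != null && Boolean.parseBoolean(fields.get(SHUTDOWN_FLAG));
    }
}

class Participant {
    private final String name;
    private final InstanceConfigStore store;

    Participant(String name, InstanceConfigStore store) {
        this.name = name;
        this.store = store;
    }

    // On (re)start, clear the flag as part of registering with the cluster.
    void register() {
        store.setShutdownInProgress(name, false);
    }

    // Set the flag so routers can stop sending traffic before the process exits.
    void beginShutdown() {
        store.setShutdownInProgress(name, true);
    }

    // Wire beginShutdown into the JVM shutdown sequence.
    void installShutdownHook() {
        Runtime.getRuntime().addShutdownHook(new Thread(this::beginShutdown));
    }
}
```

A router would check isShutdownInProgress before picking a replica. One
caveat with this design is that a JVM shutdown hook does not run if the
process is killed abruptly (for example with SIGKILL), so the restart
procedure has to stop the process gracefully for the flag to be set.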

thanks,
Kishore G


On Wed, Apr 18, 2018 at 10:44 AM, Bo Liu <ne...@gmail.com> wrote:

> Hi folks,
>
> We are running a service managed by Helix. When we do a rolling restart of
> the service, we first disable instances through Helix before restarting the
> service processes, in the hope of minimizing read errors.
>
> However, the instances being restarted may get Online->Offline messages
> before clients get the latest version of the external view. I am wondering
> whether there is any way to delay the Online->Offline messages generated by
> the instance "disable" command.
>
> --
> Best regards,
> Bo
>
>