Posted to user@helix.apache.org by Hang Qi <ha...@gmail.com> on 2014/06/27 20:14:41 UTC

Question about spectator behavior whenever it is under zookeeper flapping

Hi folks,

We are using helix 0.6.3 to build our storage system, and our clients rely
on the spectator to route traffic to corresponding node.

It works very well; however, we recently ran into an issue where almost
all of the clients failed to route traffic, and the log shows:

ERROR org.apache.helix.manager.zk.ZKHelixManager) - instanceName: hostname
is flapping. diconnect it.  maxDisconnectThreshold: 5 disconnects in
300000ms.

Looking at the code, there is a flapping-detection mechanism in
ZKHelixManager: in case of Zookeeper flapping, it disconnects itself, which
in turn calls resetHandlers in the disconnect() method, resulting in the
routingTableProvider being reset and the routingTable becoming empty.
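To make that failure mode concrete, here is a simplified, self-contained model (the class names are illustrative, not the actual Helix internals): a routing table that blindly applies every external-view callback is wiped by the empty view delivered on reset, whereas a last-known-good wrapper would keep serving the previous view.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Simplified model of the behavior described above. On a flapping-triggered
// disconnect, the handler reset delivers an empty external view, and a naive
// routing table loses every route.
class NaiveRoutingTable {
    private volatile Map<String, String> partitionToHost = Collections.emptyMap();

    void onExternalViewChange(Map<String, String> view) {
        partitionToHost = new HashMap<>(view); // an empty view wipes the table
    }

    String lookup(String partition) {
        return partitionToHost.get(partition);
    }
}

// Hypothetical wrapper (not part of Helix 0.6.3): ignore empty views so the
// spectator keeps serving the last non-empty, possibly stale, routing table.
class LastKnownGoodRoutingTable extends NaiveRoutingTable {
    @Override
    void onExternalViewChange(Map<String, String> view) {
        if (!view.isEmpty()) {
            super.onExternalViewChange(view);
        }
    }
}
```

Serving a stale view is a trade-off: routes may point at moved partitions, but an empty table fails every request unconditionally.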

When browsing the Jira, I found that this feature was introduced by
HELIX-31 and HELIX-32. I like the idea of detecting Zookeeper flapping and
disconnecting when it happens for a participant or controller; that makes
the whole cluster more stable.

However, from the spectator's perspective, the more reasonable behavior,
in my opinion, is to keep using the most up-to-date state it has from
Zookeeper even if Zookeeper is down. Besides, it should keep retrying to
connect to Zookeeper, or provide some callback to let the client know. What
do you think?

So my question is: what is the most practical way to handle this on the
client? Currently we use the workaround of increasing the value of
helixmanager.maxDisconnectThreshold. Is there any callback I could register
to get notified about the disconnect event, and does polling
HelixManager#isConnected work?
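For what it's worth, the polling approach can be sketched like this (a hedged sketch: the connectivity check is abstracted as a BooleanSupplier standing in for HelixManager#isConnected, and the class name is hypothetical). One caveat: a disconnect and reconnect that both happen within a single polling interval are invisible to this check.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Polls a connectivity check on a fixed interval and invokes a callback on
// the connected -> disconnected transition. The check is a BooleanSupplier
// so the sketch stays self-contained; in practice it would wrap
// HelixManager#isConnected.
class ConnectionWatchdog {
    private final BooleanSupplier isConnected;
    private final Runnable onDisconnect;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private volatile boolean lastSeenConnected = true;

    ConnectionWatchdog(BooleanSupplier isConnected, Runnable onDisconnect) {
        this.isConnected = isConnected;
        this.onDisconnect = onDisconnect;
    }

    // One poll; factored out so it can be driven directly in tests.
    void checkOnce() {
        boolean now = isConnected.getAsBoolean();
        if (lastSeenConnected && !now) {
            onDisconnect.run(); // e.g. stop routing traffic, alert, reconnect
        }
        lastSeenConnected = now;
    }

    void start(long intervalSeconds) {
        scheduler.scheduleAtFixedRate(this::checkOnce, intervalSeconds,
                intervalSeconds, TimeUnit.SECONDS);
    }
}
```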

Thanks
Hang Qi

Re: Question about spectator behavior whenever it is under zookeeper flapping

Posted by Zhen Zhang <ne...@gmail.com>.
Hi Hang, may I know why the connections between the routers and Zookeeper
are flapping? Is it caused by GC on the routers?

Thanks,
Jason


On Fri, Jun 27, 2014 at 11:40 AM, kishore g <g....@gmail.com> wrote:

> Hi Hang,
>
> Good point, I agree that the handling of flapping should be different
> based on the role. For now, we have focused on the participant, but as you
> have explained, it's not the right thing to do for a spectator.
>
> Keeping the latest information is the right thing to do in the Spectator.
> We should probably create a JIRA and go over the possible solutions.
>
> So a couple of things we need to decide:
> -- keep the latest information
> -- retry connecting to Zookeeper
> -- how to provide a callback to clients that need custom logic.
>
> Polling HelixManager.isConnected should work, but it's possible to miss an
> event: for example, if your polling interval is 10 seconds and the
> disconnect and reconnect happen within that interval, the client may not
> notice it.
>
> Ideally we want to avoid clients having to understand Zookeeper
> state/internals. In the long term this will allow us to plug in a different
> backend for storing state information.
>
> Thanks,
> Kishore G
>
>
> On Fri, Jun 27, 2014 at 11:14 AM, Hang Qi <ha...@gmail.com> wrote:
>
>> Hi folks,
>>
>> We are using helix 0.6.3 to build our storage system, and our clients
>> rely on the spectator to route traffic to corresponding node.
>>
>> It works very well; however, we recently ran into an issue where almost
>> all of the clients failed to route traffic, and the log shows:
>>
>> ERROR org.apache.helix.manager.zk.ZKHelixManager) - instanceName:
>> hostname is flapping. diconnect it.  maxDisconnectThreshold: 5 disconnects
>> in 300000ms.
>>
>> Looking at the code, there is a flapping-detection mechanism in
>> ZKHelixManager: in case of Zookeeper flapping, it disconnects itself,
>> which in turn calls resetHandlers in the disconnect() method, resulting in
>> the routingTableProvider being reset and the routingTable becoming empty.
>>
>> When browsing the Jira, I found that this feature was introduced by
>> HELIX-31 and HELIX-32. I like the idea of detecting Zookeeper flapping and
>> disconnecting when it happens for a participant or controller; that makes
>> the whole cluster more stable.
>>
>> However, from the spectator's perspective, the more reasonable behavior,
>> in my opinion, is to keep using the most up-to-date state it has from
>> Zookeeper even if Zookeeper is down. Besides, it should keep retrying to
>> connect to Zookeeper, or provide some callback to let the client know.
>> What do you think?
>>
>> So my question is: what is the most practical way to handle this on the
>> client? Currently we use the workaround of increasing the value of
>> helixmanager.maxDisconnectThreshold. Is there any callback I could register
>> to get notified about the disconnect event, and does polling
>> HelixManager#isConnected work?
>>
>> Thanks
>> Hang Qi
>>
>
>

Re: Question about spectator behavior whenever it is under zookeeper flapping

Posted by kishore g <g....@gmail.com>.
Hi Hang,

Good point, I agree that the handling of flapping should be different based
on the role. For now, we have focused on the participant, but as you have
explained, it's not the right thing to do for a spectator.

Keeping the latest information is the right thing to do in the Spectator.
We should probably create a JIRA and go over the possible solutions.

So a couple of things we need to decide:
-- keep the latest information
-- retry connecting to Zookeeper
-- how to provide a callback to clients that need custom logic.

Polling HelixManager.isConnected should work, but it's possible to miss an
event: for example, if your polling interval is 10 seconds and the
disconnect and reconnect happen within that interval, the client may not
notice it.
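One possible way around that missed-event window (a design sketch, not an existing Helix API) is to poll a monotonically increasing disconnect counter instead of a boolean: a flap that both starts and ends between two polls still bumps the counter, so the client can tell a disconnect happened even though the connection looks healthy at both sample points.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: the connection layer increments the counter on every
// disconnect, and clients compare snapshots instead of sampling a boolean.
class ConnectionEpoch {
    private final AtomicLong disconnects = new AtomicLong();

    // Called by the connection layer each time the session drops.
    void onDisconnected() {
        disconnects.incrementAndGet();
    }

    // Clients take a snapshot, do work, then check for changes.
    long snapshot() {
        return disconnects.get();
    }

    boolean changedSince(long previousSnapshot) {
        return disconnects.get() != previousSnapshot;
    }
}
```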

Ideally we want to avoid clients having to understand Zookeeper
state/internals. In the long term this will allow us to plug in a different
backend for storing state information.
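Such an abstraction might look like a small state-store interface that clients program against, so the Zookeeper-backed implementation can be swapped later (the names below are illustrative, not an actual Helix API):

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical client-facing abstraction: clients only ask "which host
// serves this partition?" and never see Zookeeper sessions or watches.
interface ClusterStateStore {
    Optional<String> hostFor(String partition);
}

// An in-memory implementation; a Zookeeper-backed one (or any other
// backend) could be substituted without changing client code.
class InMemoryStateStore implements ClusterStateStore {
    private final Map<String, String> routes = new ConcurrentHashMap<>();

    void put(String partition, String host) {
        routes.put(partition, host);
    }

    @Override
    public Optional<String> hostFor(String partition) {
        return Optional.ofNullable(routes.get(partition));
    }
}
```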

Thanks,
Kishore G


On Fri, Jun 27, 2014 at 11:14 AM, Hang Qi <ha...@gmail.com> wrote:

> Hi folks,
>
> We are using helix 0.6.3 to build our storage system, and our clients rely
> on the spectator to route traffic to corresponding node.
>
> It works very well; however, we recently ran into an issue where almost
> all of the clients failed to route traffic, and the log shows:
>
> ERROR org.apache.helix.manager.zk.ZKHelixManager) - instanceName: hostname
> is flapping. diconnect it.  maxDisconnectThreshold: 5 disconnects in
> 300000ms.
>
> Looking at the code, there is a flapping-detection mechanism in
> ZKHelixManager: in case of Zookeeper flapping, it disconnects itself, which
> in turn calls resetHandlers in the disconnect() method, resulting in the
> routingTableProvider being reset and the routingTable becoming empty.
>
> When browsing the Jira, I found that this feature was introduced by
> HELIX-31 and HELIX-32. I like the idea of detecting Zookeeper flapping and
> disconnecting when it happens for a participant or controller; that makes
> the whole cluster more stable.
>
> However, from the spectator's perspective, the more reasonable behavior,
> in my opinion, is to keep using the most up-to-date state it has from
> Zookeeper even if Zookeeper is down. Besides, it should keep retrying to
> connect to Zookeeper, or provide some callback to let the client know. What
> do you think?
>
> So my question is: what is the most practical way to handle this on the
> client? Currently we use the workaround of increasing the value of
> helixmanager.maxDisconnectThreshold. Is there any callback I could register
> to get notified about the disconnect event, and does polling
> HelixManager#isConnected work?
>
> Thanks
> Hang Qi
>