You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@helix.apache.org by DImuthu Upeksha <di...@gmail.com> on 2019/03/30 12:53:57 UTC

ZkHelixManager disconnection hangs

Hi Folks,

In helix controller, we have seen below log line and by looking at the
code, I understood that it is due to ZkHelixManager is failing to connect
to zookeeper for 5 times. So I tried to stop the controller and in the stop
logic, we have a call to ZkHelixManager.disconnect() method and it hangs. I
got a thread dump and you can see where it is waiting. Can you please
advice as better approach to solve this?

I noticed that ZkHelixManager disconnects [1] it self when a flapping is
detected. Is calling disconnect() twice the reason for that?

2019-03-29 15:19:56,832 [ZkClient-EventThread-14-api.staging.scigap.org:2181]
ERROR o.a.h.m.zk.ZKHelixManager  - instanceName: helixcontroller is
flapping. disconnect it.  maxDisconnectThreshold: 5 disconnects in 300000ms.

Thread-5 - priority:5 - threadId:0x00007f5c740023f0 - nativeId:0x63f1 -
nativeId (decimal):25585 - state:BLOCKED
stackTrace:
java.lang.Thread.State: BLOCKED (on object monitor)
at
org.apache.helix.manager.zk.ZKHelixManager.resetHandlers(ZKHelixManager.java:903)
- waiting to lock <0x00000006c7e08110> (a
org.apache.helix.manager.zk.ZKHelixManager)
at
org.apache.helix.manager.zk.ZKHelixManager.disconnect(ZKHelixManager.java:693)
at
org.apache.airavata.helix.impl.controller.HelixController.disconnect(HelixController.java:103)
at
org.apache.airavata.helix.impl.controller.HelixController$$Lambda$2/846492085.run(Unknown
Source)
at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
- None

[1]
https://github.com/apache/helix/blob/helix-0.8.2/helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixManager.java#L991
Thanks
Dimuthu

Re: ZkHelixManager disconnection hangs

Posted by DImuthu Upeksha <di...@gmail.com>.
Hi Jiajun, Kishore and others,

Thanks for looking into this. We used 0.8.1 in that setup and now we
upgraded it to 0.8.2. However this was a frequently occurring issue causing
controller to manually kill and restart and we expect it to happen again if
it isn't fixed in 0.8.2. Main reason for that is we are relying on a single
Zookeper node across a fairly unreliable network for our staging
environment. We will let you know if we see that issue again. In the
meantime, I will try to reproduce it in 0.8.2 in my local helix deployment.

Thanks
Dimuthu

On Tue, Apr 2, 2019 at 1:58 AM ericwang1985 <er...@gmail.com> wrote:

> Could you please confirm the Helix version that is used, Dimuthu?
> The thing is that we have fixed several potential ZkHelixManager
> concurrency issues in 0.8.2. Basically, that was a race condition in which
> the disconnect method could get a disconnected non-null zkclient. In this
> case, reset handler will never finish.
>
> Please let us know if you are already using 0.8.2 or a later version. That
> probably means we have a new bug to fix.
>
> Cheers,
> -Jiajun
>
> On Apr 1, 2019, at 13:15, kishore g <g....@gmail.com> wrote:
>
> This is a good catch. @Wang Jiajun <er...@gmail.com> the stack
> trace is good enough to fix this right. We just have to look at all the
> paths we can get into this method and make sure resetHandler is thread safe
> and validates the state of the zkConnection and handlers.
>
> On Mon, Apr 1, 2019 at 12:41 PM Wang Jiajun <er...@gmail.com>
> wrote:
>
>> Hi Dimuthu,
>>
>> Did you stop the controller when the connection is flapping or when it is
>> normal?
>> Could you please list all the steps that you have done in order?
>>
>> Best Regards,
>> Jiajun
>>
>>
>> On Sat, Mar 30, 2019 at 5:54 AM DImuthu Upeksha <
>> dimuthu.upeksha2@gmail.com>
>> wrote:
>>
>> > Hi Folks,
>> >
>> > In helix controller, we have seen below log line and by looking at the
>> > code, I understood that it is due to ZkHelixManager is failing to
>> connect
>> > to zookeeper for 5 times. So I tried to stop the controller and in the
>> stop
>> > logic, we have a call to ZkHelixManager.disconnect() method and it
>> hangs. I
>> > got a thread dump and you can see where it is waiting. Can you please
>> > advice as better approach to solve this?
>> >
>> > I noticed that ZkHelixManager disconnects [1] it self when a flapping is
>> > detected. Is calling disconnect() twice the reason for that?
>> >
>> > 2019-03-29 15:19:56,832 [
>> > ZkClient-EventThread-14-api.staging.scigap.org:2181
>> <http://zkclient-eventthread-14-api.staging.scigap.org:2181/>]
>> > ERROR o.a.h.m.zk.ZKHelixManager  - instanceName: helixcontroller is
>> > flapping. disconnect it.  maxDisconnectThreshold: 5 disconnects in
>> > 300000ms.
>> >
>> > Thread-5 - priority:5 - threadId:0x00007f5c740023f0 - nativeId:0x63f1 -
>> > nativeId (decimal):25585 - state:BLOCKED
>> > stackTrace:
>> > java.lang.Thread.State: BLOCKED (on object monitor)
>> > at
>> >
>> >
>> org.apache.helix.manager.zk.ZKHelixManager.resetHandlers(ZKHelixManager.java:903)
>> > - waiting to lock <0x00000006c7e08110> (a
>> > org.apache.helix.manager.zk.ZKHelixManager)
>> > at
>> >
>> >
>> org.apache.helix.manager.zk.ZKHelixManager.disconnect(ZKHelixManager.java:693)
>> > at
>> >
>> >
>> org.apache.airavata.helix.impl.controller.HelixController.disconnect(HelixController.java:103)
>> > at
>> >
>> >
>> org.apache.airavata.helix.impl.controller.HelixController$$Lambda$2/846492085.run(Unknown
>> > Source)
>> > at java.lang.Thread.run(Thread.java:748)
>> > Locked ownable synchronizers:
>> > - None
>> >
>> > [1]
>> >
>> >
>> https://github.com/apache/helix/blob/helix-0.8.2/helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixManager.java#L991
>> > Thanks
>> > Dimuthu
>> >
>>
>
>

Re: ZkHelixManager disconnection hangs

Posted by DImuthu Upeksha <di...@gmail.com>.
Hi Jiajun, Kishore and others,

Thanks for looking into this. We used 0.8.1 in that setup and now we
upgraded it to 0.8.2. However this was a frequently occurring issue causing
controller to manually kill and restart and we expect it to happen again if
it isn't fixed in 0.8.2. Main reason for that is we are relying on a single
Zookeper node across a fairly unreliable network for our staging
environment. We will let you know if we see that issue again. In the
meantime, I will try to reproduce it in 0.8.2 in my local helix deployment.

Thanks
Dimuthu

On Tue, Apr 2, 2019 at 1:58 AM ericwang1985 <er...@gmail.com> wrote:

> Could you please confirm the Helix version that is used, Dimuthu?
> The thing is that we have fixed several potential ZkHelixManager
> concurrency issues in 0.8.2. Basically, that was a race condition in which
> the disconnect method could get a disconnected non-null zkclient. In this
> case, reset handler will never finish.
>
> Please let us know if you are already using 0.8.2 or a later version. That
> probably means we have a new bug to fix.
>
> Cheers,
> -Jiajun
>
> On Apr 1, 2019, at 13:15, kishore g <g....@gmail.com> wrote:
>
> This is a good catch. @Wang Jiajun <er...@gmail.com> the stack
> trace is good enough to fix this right. We just have to look at all the
> paths we can get into this method and make sure resetHandler is thread safe
> and validates the state of the zkConnection and handlers.
>
> On Mon, Apr 1, 2019 at 12:41 PM Wang Jiajun <er...@gmail.com>
> wrote:
>
>> Hi Dimuthu,
>>
>> Did you stop the controller when the connection is flapping or when it is
>> normal?
>> Could you please list all the steps that you have done in order?
>>
>> Best Regards,
>> Jiajun
>>
>>
>> On Sat, Mar 30, 2019 at 5:54 AM DImuthu Upeksha <
>> dimuthu.upeksha2@gmail.com>
>> wrote:
>>
>> > Hi Folks,
>> >
>> > In helix controller, we have seen below log line and by looking at the
>> > code, I understood that it is due to ZkHelixManager is failing to
>> connect
>> > to zookeeper for 5 times. So I tried to stop the controller and in the
>> stop
>> > logic, we have a call to ZkHelixManager.disconnect() method and it
>> hangs. I
>> > got a thread dump and you can see where it is waiting. Can you please
>> > advice as better approach to solve this?
>> >
>> > I noticed that ZkHelixManager disconnects [1] it self when a flapping is
>> > detected. Is calling disconnect() twice the reason for that?
>> >
>> > 2019-03-29 15:19:56,832 [
>> > ZkClient-EventThread-14-api.staging.scigap.org:2181
>> <http://zkclient-eventthread-14-api.staging.scigap.org:2181/>]
>> > ERROR o.a.h.m.zk.ZKHelixManager  - instanceName: helixcontroller is
>> > flapping. disconnect it.  maxDisconnectThreshold: 5 disconnects in
>> > 300000ms.
>> >
>> > Thread-5 - priority:5 - threadId:0x00007f5c740023f0 - nativeId:0x63f1 -
>> > nativeId (decimal):25585 - state:BLOCKED
>> > stackTrace:
>> > java.lang.Thread.State: BLOCKED (on object monitor)
>> > at
>> >
>> >
>> org.apache.helix.manager.zk.ZKHelixManager.resetHandlers(ZKHelixManager.java:903)
>> > - waiting to lock <0x00000006c7e08110> (a
>> > org.apache.helix.manager.zk.ZKHelixManager)
>> > at
>> >
>> >
>> org.apache.helix.manager.zk.ZKHelixManager.disconnect(ZKHelixManager.java:693)
>> > at
>> >
>> >
>> org.apache.airavata.helix.impl.controller.HelixController.disconnect(HelixController.java:103)
>> > at
>> >
>> >
>> org.apache.airavata.helix.impl.controller.HelixController$$Lambda$2/846492085.run(Unknown
>> > Source)
>> > at java.lang.Thread.run(Thread.java:748)
>> > Locked ownable synchronizers:
>> > - None
>> >
>> > [1]
>> >
>> >
>> https://github.com/apache/helix/blob/helix-0.8.2/helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixManager.java#L991
>> > Thanks
>> > Dimuthu
>> >
>>
>
>

Re: ZkHelixManager disconnection hangs

Posted by ericwang1985 <er...@gmail.com>.
Could you please confirm the Helix version that is used, Dimuthu?
The thing is that we have fixed several potential ZkHelixManager concurrency issues in 0.8.2. Basically, that was a race condition in which the disconnect method could get a disconnected non-null zkclient. In this case, reset handler will never finish.

Please let us know if you are already using 0.8.2 or a later version. That probably means we have a new bug to fix.

Cheers,
-Jiajun

> On Apr 1, 2019, at 13:15, kishore g <g....@gmail.com> wrote:
> 
> This is a good catch. @Wang Jiajun <ma...@gmail.com> the stack trace is good enough to fix this right. We just have to look at all the paths we can get into this method and make sure resetHandler is thread safe and validates the state of the zkConnection and handlers.
> 
> On Mon, Apr 1, 2019 at 12:41 PM Wang Jiajun <ericwang1985@gmail.com <ma...@gmail.com>> wrote:
> Hi Dimuthu,
> 
> Did you stop the controller when the connection is flapping or when it is
> normal?
> Could you please list all the steps that you have done in order?
> 
> Best Regards,
> Jiajun
> 
> 
> On Sat, Mar 30, 2019 at 5:54 AM DImuthu Upeksha <dimuthu.upeksha2@gmail.com <ma...@gmail.com>>
> wrote:
> 
> > Hi Folks,
> >
> > In helix controller, we have seen below log line and by looking at the
> > code, I understood that it is due to ZkHelixManager is failing to connect
> > to zookeeper for 5 times. So I tried to stop the controller and in the stop
> > logic, we have a call to ZkHelixManager.disconnect() method and it hangs. I
> > got a thread dump and you can see where it is waiting. Can you please
> > advice as better approach to solve this?
> >
> > I noticed that ZkHelixManager disconnects [1] it self when a flapping is
> > detected. Is calling disconnect() twice the reason for that?
> >
> > 2019-03-29 15:19:56,832 [
> > ZkClient-EventThread-14-api.staging.scigap.org:2181 <http://zkclient-eventthread-14-api.staging.scigap.org:2181/>]
> > ERROR o.a.h.m.zk.ZKHelixManager  - instanceName: helixcontroller is
> > flapping. disconnect it.  maxDisconnectThreshold: 5 disconnects in
> > 300000ms.
> >
> > Thread-5 - priority:5 - threadId:0x00007f5c740023f0 - nativeId:0x63f1 -
> > nativeId (decimal):25585 - state:BLOCKED
> > stackTrace:
> > java.lang.Thread.State: BLOCKED (on object monitor)
> > at
> >
> > org.apache.helix.manager.zk.ZKHelixManager.resetHandlers(ZKHelixManager.java:903)
> > - waiting to lock <0x00000006c7e08110> (a
> > org.apache.helix.manager.zk.ZKHelixManager)
> > at
> >
> > org.apache.helix.manager.zk.ZKHelixManager.disconnect(ZKHelixManager.java:693)
> > at
> >
> > org.apache.airavata.helix.impl.controller.HelixController.disconnect(HelixController.java:103)
> > at
> >
> > org.apache.airavata.helix.impl.controller.HelixController$$Lambda$2/846492085.run(Unknown
> > Source)
> > at java.lang.Thread.run(Thread.java:748)
> > Locked ownable synchronizers:
> > - None
> >
> > [1]
> >
> > https://github.com/apache/helix/blob/helix-0.8.2/helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixManager.java#L991 <https://github.com/apache/helix/blob/helix-0.8.2/helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixManager.java#L991>
> > Thanks
> > Dimuthu
> >


Re: ZkHelixManager disconnection hangs

Posted by kishore g <g....@gmail.com>.
This is a good catch. @Wang Jiajun <er...@gmail.com> the stack trace
is good enough to fix this right. We just have to look at all the paths we
can get into this method and make sure resetHandler is thread safe and
validates the state of the zkConnection and handlers.

On Mon, Apr 1, 2019 at 12:41 PM Wang Jiajun <er...@gmail.com> wrote:

> Hi Dimuthu,
>
> Did you stop the controller when the connection is flapping or when it is
> normal?
> Could you please list all the steps that you have done in order?
>
> Best Regards,
> Jiajun
>
>
> On Sat, Mar 30, 2019 at 5:54 AM DImuthu Upeksha <
> dimuthu.upeksha2@gmail.com>
> wrote:
>
> > Hi Folks,
> >
> > In helix controller, we have seen below log line and by looking at the
> > code, I understood that it is due to ZkHelixManager is failing to connect
> > to zookeeper for 5 times. So I tried to stop the controller and in the
> stop
> > logic, we have a call to ZkHelixManager.disconnect() method and it
> hangs. I
> > got a thread dump and you can see where it is waiting. Can you please
> > advice as better approach to solve this?
> >
> > I noticed that ZkHelixManager disconnects [1] it self when a flapping is
> > detected. Is calling disconnect() twice the reason for that?
> >
> > 2019-03-29 15:19:56,832 [
> > ZkClient-EventThread-14-api.staging.scigap.org:2181]
> > ERROR o.a.h.m.zk.ZKHelixManager  - instanceName: helixcontroller is
> > flapping. disconnect it.  maxDisconnectThreshold: 5 disconnects in
> > 300000ms.
> >
> > Thread-5 - priority:5 - threadId:0x00007f5c740023f0 - nativeId:0x63f1 -
> > nativeId (decimal):25585 - state:BLOCKED
> > stackTrace:
> > java.lang.Thread.State: BLOCKED (on object monitor)
> > at
> >
> >
> org.apache.helix.manager.zk.ZKHelixManager.resetHandlers(ZKHelixManager.java:903)
> > - waiting to lock <0x00000006c7e08110> (a
> > org.apache.helix.manager.zk.ZKHelixManager)
> > at
> >
> >
> org.apache.helix.manager.zk.ZKHelixManager.disconnect(ZKHelixManager.java:693)
> > at
> >
> >
> org.apache.airavata.helix.impl.controller.HelixController.disconnect(HelixController.java:103)
> > at
> >
> >
> org.apache.airavata.helix.impl.controller.HelixController$$Lambda$2/846492085.run(Unknown
> > Source)
> > at java.lang.Thread.run(Thread.java:748)
> > Locked ownable synchronizers:
> > - None
> >
> > [1]
> >
> >
> https://github.com/apache/helix/blob/helix-0.8.2/helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixManager.java#L991
> > Thanks
> > Dimuthu
> >
>

Re: ZkHelixManager disconnection hangs

Posted by Wang Jiajun <er...@gmail.com>.
Hi Dimuthu,

Did you stop the controller when the connection is flapping or when it is
normal?
Could you please list all the steps that you have done in order?

Best Regards,
Jiajun


On Sat, Mar 30, 2019 at 5:54 AM DImuthu Upeksha <di...@gmail.com>
wrote:

> Hi Folks,
>
> In helix controller, we have seen below log line and by looking at the
> code, I understood that it is due to ZkHelixManager is failing to connect
> to zookeeper for 5 times. So I tried to stop the controller and in the stop
> logic, we have a call to ZkHelixManager.disconnect() method and it hangs. I
> got a thread dump and you can see where it is waiting. Can you please
> advice as better approach to solve this?
>
> I noticed that ZkHelixManager disconnects [1] it self when a flapping is
> detected. Is calling disconnect() twice the reason for that?
>
> 2019-03-29 15:19:56,832 [
> ZkClient-EventThread-14-api.staging.scigap.org:2181]
> ERROR o.a.h.m.zk.ZKHelixManager  - instanceName: helixcontroller is
> flapping. disconnect it.  maxDisconnectThreshold: 5 disconnects in
> 300000ms.
>
> Thread-5 - priority:5 - threadId:0x00007f5c740023f0 - nativeId:0x63f1 -
> nativeId (decimal):25585 - state:BLOCKED
> stackTrace:
> java.lang.Thread.State: BLOCKED (on object monitor)
> at
>
> org.apache.helix.manager.zk.ZKHelixManager.resetHandlers(ZKHelixManager.java:903)
> - waiting to lock <0x00000006c7e08110> (a
> org.apache.helix.manager.zk.ZKHelixManager)
> at
>
> org.apache.helix.manager.zk.ZKHelixManager.disconnect(ZKHelixManager.java:693)
> at
>
> org.apache.airavata.helix.impl.controller.HelixController.disconnect(HelixController.java:103)
> at
>
> org.apache.airavata.helix.impl.controller.HelixController$$Lambda$2/846492085.run(Unknown
> Source)
> at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
> - None
>
> [1]
>
> https://github.com/apache/helix/blob/helix-0.8.2/helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixManager.java#L991
> Thanks
> Dimuthu
>