Posted to user@curator.apache.org by Brian Phillips <br...@etinternational.com> on 2014/03/25 20:38:01 UTC
Curator barriers missing watch events
Hi guys,
I’ve been integrating curator into my project and have recently run into an issue I just can’t seem to make sense of.
I’m running two JVMs on the same host machine, each with their own curator connection. At the beginning of my program I’m using the DistributedDoubleBarrier recipe, and once again at the end of my program. A bunch of work is done in-between, including zookeeper set/get/watches of other nodes.
I’m finding that at the first double barrier, everyone always makes it through. At the job-end barrier, sometimes everyone gets through, but more often than not one of the programs hangs in enter's wait() and never gets the watch event for the ready path that notifies it to proceed. If I look in zookeeper, I can see that the ready path is actually set in there.
It would seem that the watch for one of the programs just never triggers.
To simplify debugging, I’ve set both double barriers to only ever call enter() and not leave(). Both barriers have their own separate path.
Also, the program never shuts down or disconnects from zookeeper. It just sleeps infinitely after it gets out of the final barrier.
Any idea on how to debug this issue? I don’t mind hacking up zookeeper/curator code to insert my own debugging statements if it comes to that.
_Brian
Re: Curator barriers missing watch events
Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
Please open an issue and, if you can, provide a pull request with the fix.
-JZ
From: Brian Phillips brian@etinternational.com
Reply: user@curator.apache.org user@curator.apache.org
Date: March 27, 2014 at 12:37:17 PM
To: user@curator.apache.org user@curator.apache.org
Subject: Re: Curator barriers missing watch events
So I guess I'm going to go back to using the double barrier recipe.
Jordan, are you a Curator contributor? Are you going to check in that race condition fix you found for the next version of Curator?
From: Jordan Zimmerman [mailto:jordan@jordanzimmerman.com]
To: Brian Phillips [mailto:brian@etinternational.com], user@curator.apache.org
Sent: Thu, 27 Mar 2014 13:30:26 -0500
Subject: Re: Curator barriers missing watch events
https://cwiki.apache.org/confluence/display/CURATOR/TN1
:)
From: Brian Phillips brian@etinternational.com
Reply: user@curator.apache.org user@curator.apache.org
Date: March 27, 2014 at 12:26:32 PM
To: user@curator.apache.org user@curator.apache.org
Subject: Re: Curator barriers missing watch events
I finally figured out my problem, and it was my fault. Hopefully someone else can learn from this.
What was happening was that I was using a zookeeper watch event to kick off a bunch of code, which then ended up in the zookeeper/curator barrier. Since the watch thread was the one that executed the barrier, it blocked itself from receiving any additional watch events from zookeeper, including the ones that the barrier depended upon.
So as a general rule, DON'T BLOCK YOUR WATCH THREADS. I feel stupid for not realizing this sooner.
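The effect described above can be reproduced without ZooKeeper at all. The following is a minimal sketch, not Curator or ZooKeeper code: it assumes the client delivers callbacks one at a time on a single event thread, and models that with a single-thread executor. All names in it (WatchThreadDemo, barrierReleased, the latches) are hypothetical stand-ins. The "barrier" is just a latch that is released by a second event queued behind the first callback.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class WatchThreadDemo {
    // Runs one scenario and reports whether the stand-in barrier was released in time.
    static boolean barrierReleased(boolean offload) {
        // Stand-in for the client's single event thread: callbacks run one at a time.
        ExecutorService eventThread = Executors.newSingleThreadExecutor();
        // Separate pool for application work that is allowed to block.
        ExecutorService workers = Executors.newCachedThreadPool();
        CountDownLatch ready = new CountDownLatch(1);  // stand-in for the barrier's "ready" event
        CountDownLatch passed = new CountDownLatch(1); // counted down once the waiter gets through

        // Stand-in for barrier.enter(): blocks until the "ready" event is delivered.
        Runnable enterBarrier = () -> {
            try {
                if (ready.await(500, TimeUnit.MILLISECONDS)) {
                    passed.countDown();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        // First callback: application code kicked off by a watch event.
        if (offload) {
            eventThread.execute(() -> workers.execute(enterBarrier)); // hand off, return at once
        } else {
            eventThread.execute(enterBarrier); // blocks the event thread itself
        }
        // Second callback: the event the barrier needs, queued behind the first one.
        eventThread.execute(ready::countDown);

        boolean result;
        try {
            result = passed.await(2, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            result = false;
        }
        eventThread.shutdownNow();
        workers.shutdownNow();
        return result;
    }

    public static void main(String[] args) {
        System.out.println("blocking in callback: released=" + barrierReleased(false));
        System.out.println("offloaded to worker:  released=" + barrierReleased(true));
    }
}
```

When the callback blocks, the "ready" event sits in the queue behind it and the wait times out. When the callback hands the blocking call to another thread and returns, the event thread is free to deliver the "ready" event and the waiter gets through.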
_Brian
From: Brian Phillips [mailto:brian@etinternational.com]
To: user@curator.apache.org
Sent: Wed, 26 Mar 2014 15:32:13 -0500
Subject: Re: Curator barriers missing watch events
So I'm still working on this issue. I grabbed a zookeeper only barrier implementation from here:
http://zookeeper.apache.org/doc/r3.3.3/zookeeperTutorial.html
This barrier makes its own zookeeper connection, separate from the curator connection that my program uses. When I put this barrier into my program, everything works as it should, and nobody gets stuck on the barriers. I then modified the barrier to use curator's connection, passing in CuratorFramework.getZookeeperClient().getZooKeeper() instead of connecting separately. Once I did this, it broke exactly as it did before when using the curator barrier.
This seems to indicate to me that something else I've done in the program has 'broken' the zookeeper session associated with my curator connection, to the point where some watch events no longer work.
I'm going to embark on the arduous process of trying to figure out what I'm doing that's breaking my session's watches. Watches not working properly is disturbing, and will certainly prevent other parts of my program from functioning correctly, probably in less obvious ways.
_Brian
From: Brian Phillips [mailto:brian@etinternational.com]
To: user@curator.apache.org [mailto:user@curator.apache.org]
Sent: Tue, 25 Mar 2014 20:39:31 -0500
Subject: Re: Curator barriers missing watch events
Yes, there's two barrier sessions. But different barrier instances, and different barrier paths. ):
Sent from my iPhone
On Mar 25, 2014, at 8:34 PM, "Jordan Zimmerman" <jo...@jordanzimmerman.com> wrote:
Are you saying there are two barrier sessions? The first one works, but the second doesn’t? Are you re-using the same path? I wonder if there are znodes left in the path or something. Before running the second barrier session, double check that the path is empty (do a getChildren on it). If it’s not empty that could be the problem.
-JZ
From: Brian Phillips brian@etinternational.com
Reply: user@curator.apache.org user@curator.apache.org
Date: March 25, 2014 at 6:10:46 PM
To: user@curator.apache.org user@curator.apache.org
Subject: Re: Curator barriers missing watch events
I’ve tried, but it seems to be timing specific. It’s in a rather large, complicated program, where the first barrier always works but the one at the end of the program usually gets stuck. I’ve spent all day trying to make sense of it, as my project really needs it to work.
I’d like to be able to figure out if the zookeeper server is actually sending my clients the watch events.
_B
On Mar 25, 2014, at 6:53 PM, "Jordan Zimmerman" <jo...@jordanzimmerman.com> wrote:
There’s no way you can distill your usage into a test?
-JZ
From: Brian Phillips brian@etinternational.com
Reply: user@curator.apache.org user@curator.apache.org
Date: March 25, 2014 at 5:51:37 PM
To: user@curator.apache.org user@curator.apache.org
Subject: Re: Curator barriers missing watch events
Hmm, I made that change, but it didn't seem to help. The first program made it to the barrier enter, then the second program entered, exited, and the first program never left the barrier.
The second program got a node created event, but the first program never got any event from its watcher.
I appreciate the help! Must be something else.
_B
On Mar 25, 2014, at 6:28 PM, "Jordan Zimmerman" <jo...@jordanzimmerman.com> wrote:
Look at line 313 and line 331. The noarg version of enter() causes internalEnter() to call wait even though the watcher may have already notified. I believe line 331 should be:
else if ( !hasBeenNotified.get() )
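The kind of race this guards against can be sketched in isolation. The following is an illustrative example only, not the Curator source: the field name hasBeenNotified is taken from the line quoted above, while the class and method names (NotifyGuardDemo, onEvent, awaitEvent) are hypothetical. It shows the general pattern of checking a flag under the same monitor before calling wait(), so that a notification which arrived before the waiter started is not lost.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class NotifyGuardDemo {
    private final AtomicBoolean hasBeenNotified = new AtomicBoolean(false);

    // Called from the callback thread when the awaited event arrives.
    synchronized void onEvent() {
        hasBeenNotified.set(true);
        notifyAll();
    }

    // Waits up to maxWaitMs for the event; returns true if it has arrived.
    synchronized boolean awaitEvent(long maxWaitMs) {
        long deadline = System.currentTimeMillis() + maxWaitMs;
        // Only wait while the flag is still unset. An unguarded wait() here would
        // sleep forever if onEvent() had already run before this method was called.
        while (!hasBeenNotified.get()) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return false;
            }
            try {
                wait(remaining);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        NotifyGuardDemo early = new NotifyGuardDemo();
        early.onEvent(); // the event fires before anyone waits
        System.out.println("event before wait: " + early.awaitEvent(1000));

        NotifyGuardDemo never = new NotifyGuardDemo();
        System.out.println("no event at all:   " + never.awaitEvent(50));
    }
}
```

Because the flag is set and read under the same monitor, a waiter that arrives late sees the flag and returns immediately instead of blocking on a notification that has already been delivered.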
-JZ
From: Brian Phillips brian@etinternational.com
Reply: user@curator.apache.org user@curator.apache.org
Date: March 25, 2014 at 5:25:48 PM
To: user@curator.apache.org user@curator.apache.org
Subject: Re: Curator barriers missing watch events
I am using the no arg version! What's the bug?
_B
On Mar 25, 2014, at 6:23 PM, "Jordan Zimmerman" <jo...@jordanzimmerman.com> wrote:
Which version of enter() are you using? I see a potential bug when the no arg version of enter() is used.
From: Brian Phillips brian@etinternational.com
Reply: Brian Phillips brian@etinternational.com
Date: March 25, 2014 at 4:19:36 PM
To: Jordan Zimmerman jordan@jordanzimmerman.com
Subject: Re: Curator barriers missing watch events
Good idea, but yes I am. The connection state doesn’t change while I’m executing the barrier code. It seems to be some kind of race condition, I think, as sometimes it works and sometimes it doesn’t. I’ve looked through the recipe code and it looks good as far as I can tell, though. I’m practically pulling my hair out at this point.
I may try a non-curator, zookeeper-only barrier tomorrow and see if that works. Or I may start trying to debug the zookeeper client, to see if it’s actually getting the watches but not delivering them.
_B
On Mar 25, 2014, at 4:54 PM, Jordan Zimmerman <jo...@jordanzimmerman.com> wrote:
Are you setting a ConnectionStateListener? If the connection gets SUSPENDED or LOST then you’d need to reinitialize your barrier.
-JZ
From: Brian Phillips brian@etinternational.com
Reply: user@curator.apache.org user@curator.apache.org
Date: March 25, 2014 at 2:51:42 PM
To: user@curator.apache.org user@curator.apache.org
Subject: Re: Curator barriers missing watch events
I have tried writing a test program which launches two programs in the same manner; each makes a connection, then loops over barriers with a Thread.sleep(random) in between. This runs indefinitely and everything works out fine.
I have also tried writing my own barrier, which uses a SharedCount, where each guy tries to increment it until it hits a memberQty. This too misses watch events and does not work properly.
It’s almost as if something else that I’ve done during the running of my program has broken zookeeper's watch events somehow. Is there any good way to debug watch events in general? I’ve tried to look at the DEBUG output for my zookeeper server log, but it looks the same for the working vs non-working barriers...
_B
On Mar 25, 2014, at 3:42 PM, Jordan Zimmerman <jo...@jordanzimmerman.com> wrote:
Unfortunately, the barrier recipes aren’t widely used (from what I know). So, there may well be a bug. If you could get a test to show the problem that would be ideal.
-JZ
Re: Curator barriers missing watch events
Posted by Brian Phillips <br...@etinternational.com>.
So I guess I'm going to go back to using the double barrier recipe.
Jordan, are you a Curator contributor? Are you going to check in that race condition fix you found for the next version of Curator?
Re: Curator barriers missing watch events
Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
https://cwiki.apache.org/confluence/display/CURATOR/TN1
:)
Re: Curator barriers missing watch events
Posted by Brian Phillips <br...@etinternational.com>.
I finally figured out my problem, and it was my fault. Hopefully someone else can learn from this.
What was happening was that I was using a zookeeper watch event to kick off a bunch of code, which then ended up in the zookeeper/curator barrier. Since the watch thread was the one that executed the barrier, it blocked itself from receiving any additional watch events from zookeeper, including the ones that the barrier depended upon.
So as a general rule, DON'T BLOCK YOUR WATCH THREADS. I feel stupid for not realizing this sooner.
_Brian
_____
From: Brian Phillips [mailto:brian@etinternational.com]
To: user@curator.apache.org
Sent: Wed, 26 Mar 2014 15:32:13 -0500
Subject: Re: Curator barriers missing watch events
So I'm still working on this issue. I grabbed a zookeeper only barrier implementation from here:
http://zookeeper.apache.org/doc/r3.3.3/zookeeperTutorial.html
This barrier makes it's own zookeeper connection separately from the curator connection that my program uses. When I put this barrier into my program, everything works as it should, and nobody gets stuck on the barriers. I then modified the barrier to use curators connection, passing in CuratorFramework.getZookeeperClient().getZooKeeper() instead of connecting separately. Once I did this, it breaks exactly as it did before when using the curator barrier.
This seems to indicate to me that something else I've done in the program has 'broken' the zookeeper session associated with my curator connection, to the point where some watch events no longer work.
I'm going to embark on the arduous process of trying to figure out what I'm doing thats breaking my sessions watches. Watches not working properly is disturbing, and will certainly prevent other parts of my program from functioning correctly, probably in less obvious ways.
_Brian=
_____
From: Brian Phillips [mailto:brian@etinternational.com]
To: user@curator.apache.org [mailto:user@curator.apache.org]
Sent: Tue, 25 Mar 2014 20:39:31 -0500
Subject: Re: Curator barriers missing watch events
Yes, there's two barrier sessions. But different barrier instances, and different barrier paths. ):
Sent from my iPhone
On Mar 25, 2014, at 8:34 PM, "Jordan Zimmerman" <jo...@jordanzimmerman.com> wrote:
Are you saying there are two barrier sessions? The first one works, but the second doesn’t? Are you re-using the same path? I wonder if there are znodes left in the path or something. Before running the second barrier session, double check that the path is empty (do a getChildren on it). If it’s not empty that could be the problem.
-JZ
From: Brian Phillips brian@etinternational.com
Reply: user@curator.apache.org user@curator.apache.org
Date: March 25, 2014 at 6:10:46 PM
To: user@curator.apache.org user@curator.apache.org
Subject: Re: Curator barriers missing watch events
I’ve tried, but it seems to be timing specific. It’s in a rather large, complicated program, where the first barrier always works but the one at the end of the program usually gets stuck. I’ve spent all day trying to make sense of it, as my project really needs it to work.
I’d like to be able to figure out if the zookeeper server is actually sending my clients the watch events.
_B
On Mar 25, 2014, at 6:53 PM, "Jordan Zimmerman" <jo...@jordanzimmerman.com> wrote:
There’s no way you can distill your usage into a test?
-JZ
From: Brian Phillips brian@etinternational.com
Reply: user@curator.apache.org user@curator.apache.org
Date: March 25, 2014 at 5:51:37 PM
To: user@curator.apache.org user@curator.apache.org
Subject: Re: Curator barriers missing watch events
Hmm, I made that change, but it didn't seem to help. The first program made it to the barrier enter, then the second program entered, exited, and the first program never left the barrier.
The second program got a node created event, but the first program never got any event from its watcher.
I appreciate the help! Must be something else.
_B
On Mar 25, 2014, at 6:28 PM, "Jordan Zimmerman" <jo...@jordanzimmerman.com> wrote:
Look at line 313 and line 331. The no-arg version of enter() causes internalEnter() to call wait() even though the watcher may have already notified. I believe line 331 should be:
else if ( !hasBeenNotified.get() )
-JZ
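The race described above is the classic lost-wakeup pattern: if the watcher's notification arrives before the waiting thread calls wait(), an unconditional wait() never returns. A minimal sketch of the guarded form follows. It is plain Java, not the Curator source; the class GuardedWait and its method names are hypothetical, and only the hasBeenNotified flag name is borrowed from the message above.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a barrier's wait logic; NOT Curator source code.
final class GuardedWait {
    private final AtomicBoolean hasBeenNotified = new AtomicBoolean(false);

    // Called from the watcher thread when the "ready" event arrives.
    synchronized void onEvent() {
        hasBeenNotified.set(true);
        notifyAll();
    }

    // Blocks until onEvent() has been called, even if it was called before
    // await() started. Returns false if timeoutMs elapses first.
    synchronized boolean await(long timeoutMs) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        // Re-check the flag before every wait so an early notification is not lost.
        while (!hasBeenNotified.get()) {
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                return false;
            }
            wait(remainingMs);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        GuardedWait early = new GuardedWait();
        early.onEvent();                        // notification arrives first
        System.out.println(early.await(1000));  // does not block on the missed notify

        GuardedWait late = new GuardedWait();
        Thread watcher = new Thread(late::onEvent);
        watcher.start();                        // notification arrives concurrently
        System.out.println(late.await(1000));
        watcher.join();
    }
}
```

Checking the flag in a loop also covers spurious wakeups, which Object.wait() is allowed to produce.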
From: Brian Phillips brian@etinternational.com
Reply: user@curator.apache.org user@curator.apache.org
Date: March 25, 2014 at 5:25:48 PM
To: user@curator.apache.org user@curator.apache.org
Subject: Re: Curator barriers missing watch events
I am using the no arg version! What's the bug?
_B
On Mar 25, 2014, at 6:23 PM, "Jordan Zimmerman" <jo...@jordanzimmerman.com> wrote:
Which version of enter() are you using? I see a potential bug when the no arg version of enter() is used.
From: Brian Phillips brian@etinternational.com
Reply: Brian Phillips brian@etinternational.com
Date: March 25, 2014 at 4:19:36 PM
To: Jordan Zimmerman jordan@jordanzimmerman.com
Subject: Re: Curator barriers missing watch events
Good idea, but yes I am. The connection state doesn’t change while I’m executing the barrier code. It seems to be some kind of race condition, I think, as sometimes it works and sometimes it doesn’t. I’ve looked through the recipe code and it looks good as far as I can tell, though. I’m practically pulling my hair out at this point.
I may try a non-curator, zookeeper-only barrier tomorrow to see if that works. Or I may start trying to debug the zookeeper client to see if it’s actually getting the watches but not delivering them.
_B
On Mar 25, 2014, at 4:54 PM, Jordan Zimmerman <jo...@jordanzimmerman.com> wrote:
Are you setting a ConnectionStateListener? If the connection gets SUSPENDED or LOST then you’d need to reinitialize your barrier.
-JZ
From: Brian Phillips brian@etinternational.com
Reply: user@curator.apache.org user@curator.apache.org
Date: March 25, 2014 at 2:51:42 PM
To: user@curator.apache.org user@curator.apache.org
Subject: Re: Curator barriers missing watch events
I have tried writing a test program which launches two programs in the same manner; each makes a connection, then loops over barriers with a Thread.sleep(random) in between. This runs indefinitely and everything works out fine.
I have also tried writing my own barrier, which uses a SharedCount, where each guy tries to increment it until it hits a memberQty. This too missed watch events and did not work properly.
It’s almost as if something else that I’ve done during the running of my program has broken zookeeper’s watch events somehow. Is there any good way to debug watch events in general? I’ve tried to look at the DEBUG output for my zookeeper server log, but it looks the same for the working vs non-working barriers...
_B
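For readers following along, the counter-based barrier described above boils down to a compare-and-set retry loop followed by a wait until the count reaches memberQty. The sketch below is a rough in-process analogue using AtomicInteger, not Curator's SharedCount and not the code from this thread; the class CountingBarrier and its method names are assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical in-process analogue of a counter-based barrier; NOT Curator code.
final class CountingBarrier {
    private final AtomicInteger count = new AtomicInteger(0);
    private final int memberQty;
    private final Object lock = new Object();

    CountingBarrier(int memberQty) {
        this.memberQty = memberQty;
    }

    // Registers this member, then waits until memberQty members have arrived.
    // Returns false if timeoutMs elapses first.
    boolean enter(long timeoutMs) throws InterruptedException {
        // Compare-and-set retry loop: re-read and try again if another member won.
        int seen;
        do {
            seen = count.get();
        } while (!count.compareAndSet(seen, seen + 1));

        synchronized (lock) {
            lock.notifyAll(); // wake earlier arrivals so they re-check the count
            long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
            // Decide on the counter's state, not on whether a notification was seen.
            while (count.get() < memberQty) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= 0) {
                    return false;
                }
                lock.wait(remainingMs);
            }
            return true;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CountingBarrier barrier = new CountingBarrier(2);
        Runnable member = () -> {
            try {
                System.out.println(Thread.currentThread().getName()
                        + " passed: " + barrier.enter(2000));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        Thread a = new Thread(member, "member-a");
        Thread b = new Thread(member, "member-b");
        a.start();
        b.start();
        a.join();
        b.join();
    }
}
```

Because each waiter re-reads the counter after every wakeup and before its first wait, a notification that fires early does not leave anyone stuck, which is the property the distributed version appeared to be missing.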
On Mar 25, 2014, at 3:42 PM, Jordan Zimmerman <jo...@jordanzimmerman.com> wrote:
Unfortunately, the barrier recipes aren’t widely used (from what I know). So, there may well be a bug. If you could get a test to show the problem that would be ideal.
-JZ
From: Brian Phillips brian@etinternational.com
Reply: user@curator.apache.org user@curator.apache.org
Date: March 25, 2014 at 2:38:40 PM
To: user@curator.apache.org user@curator.apache.org
Subject: Curator barriers missing watch events
Hi guys,
I’ve been integrating curator into my project and have recently run into an issue I just can’t seem to make sense of.
I’m running two JVMs on the same host machine, each with their own curator connection. At the beginning of my program I’m using the DistributedDoubleBarrier recipe, and once again at the end of my program. A bunch of work is done in-between, including zookeeper set/get/watches of other nodes.
I’m finding that at the first double barrier, everyone always makes it through. At the job-end barrier, sometimes everyone gets through, but more often than not one of the programs hangs in enter's wait() and never gets the watch event for the ready path that notifies it to proceed. If I look in zookeeper, I can see that the ready path is actually set in there.
It would seem that the watch for one of the programs just never triggers.
To simplify debugging, I’ve set both double barriers to only ever call enter() and not leave(). Both barriers have their own separate path.
Also, the program never shuts down or disconnects from zookeeper. It just sleeps infinitely after it gets out of the final barrier.
Any idea on how to debug this issue? I don’t mind hacking up zookeeper/curator code to insert my own debugging statements if it comes to that.
_Brian=
Re: Curator barriers missing watch events
Posted by Brian Phillips <br...@etinternational.com>.
So I'm still working on this issue. I grabbed a zookeeper-only barrier implementation from here:
http://zookeeper.apache.org/doc/r3.3.3/zookeeperTutorial.html
This barrier makes its own zookeeper connection, separate from the curator connection that my program uses. When I put this barrier into my program, everything works as it should, and nobody gets stuck on the barriers. I then modified the barrier to use curator's connection, passing in CuratorFramework.getZookeeperClient().getZooKeeper() instead of connecting separately. Once I did this, it broke exactly as it did before when using the curator barrier.
This seems to indicate to me that something else I've done in the program has 'broken' the zookeeper session associated with my curator connection, to the point where some watch events no longer work.
I'm going to embark on the arduous process of trying to figure out what I'm doing that's breaking my session's watches. Watches not working properly is disturbing, and will certainly prevent other parts of my program from functioning correctly, probably in less obvious ways.
_Brian=
Re: Curator barriers missing watch events
Posted by Brian Phillips <br...@etinternational.com>.
Yes, there are two barrier sessions, but with different barrier instances and different barrier paths. ):
Re: Curator barriers missing watch events
Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
Are you saying there are two barrier sessions? The first one works, but the second doesn’t? Are you re-using the same path? I wonder if there are znodes left in the path or something. Before running the second barrier session, double check that the path is empty (do a getChildren on it). If it’s not empty that could be the problem.
-JZ
Re: Curator barriers missing watch events
Posted by Brian Phillips <br...@etinternational.com>.
I’ve tried, but it seems to be timing specific. It’s in a rather large, complicated program, where the first barrier always works but the one at the end of the program usually gets stuck. I’ve spent all day trying to make sense of it, as my project really needs it to work.
I’d like to be able to figure out if the zookeeper server is actually sending my clients the watch events.
_B
Re: Curator barriers missing watch events
Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
There’s no way you can distill your usage into a test?
-JZ
From: Brian Phillips brian@etinternational.com
Reply: user@curator.apache.org user@curator.apache.org
Date: March 25, 2014 at 5:51:37 PM
To: user@curator.apache.org user@curator.apache.org
Subject: Re: Curator barriers missing watch events
Hmm, I made that change, but it didn't seem to help. The first program made it to the barrier enter, then the second program entered, exited, and the first program never left the barrier.
The second program got a node created event, but the first program never got any event from its watcher.
I appreciate the help! Must be something else.
_B
Re: Curator barriers missing watch events
Posted by Brian Phillips <br...@etinternational.com>.
Hmm, I made that change, but it didn't seem to help. The first program made it to the barrier enter, then the second program entered, exited, and the first program never left the barrier.
The second program got a node created event, but the first program never got any event from its watcher.
I appreciate the help! Must be something else.
_B
> On Mar 25, 2014, at 6:28 PM, "Jordan Zimmerman" <jo...@jordanzimmerman.com> wrote:
>
> Look at line 313 and line 331. The no-arg version of enter() causes internalEnter() to call wait() even though the watcher may have already notified. I believe line 331 should be:
>
> else if ( !hasBeenNotified.get() )
>
> -JZ
Re: Curator barriers missing watch events
Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
Look at line 313 and line 331. The no-arg version of enter() causes internalEnter() to call wait() even though the watcher may have already notified. I believe line 331 should be:
else if ( !hasBeenNotified.get() )
-JZ
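To illustrate the kind of race described above, here is a minimal, self-contained sketch in plain Java. It is not Curator's actual code: the class and method names are hypothetical, and only the hasBeenNotified flag is taken from the suggestion above. It shows why the flag check matters: if the watcher fires before the waiting thread reaches wait(), an unconditional wait() would block forever, whereas checking the flag first returns immediately.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a barrier's wait/notify logic; not Curator source.
public class GuardedWait {
    private final AtomicBoolean hasBeenNotified = new AtomicBoolean(false);

    // Would be called from the watcher callback when the ready node appears.
    public synchronized void onWatchEvent() {
        hasBeenNotified.set(true);
        notifyAll();
    }

    // Blocks only while no notification has been recorded, so a notification
    // that arrived before this call is not lost.
    public synchronized void awaitNotification() throws InterruptedException {
        while (!hasBeenNotified.get()) {
            wait();
        }
    }

    public boolean isNotified() {
        return hasBeenNotified.get();
    }

    public static void main(String[] args) throws InterruptedException {
        GuardedWait barrier = new GuardedWait();
        barrier.onWatchEvent();       // the watcher fires first...
        barrier.awaitNotification();  // ...and the late waiter still returns
        System.out.println("notified = " + barrier.isNotified());
    }
}
```

Looping on the flag rather than using a single if also covers spurious wakeups, which Java's wait() is allowed to produce.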
From: Brian Phillips brian@etinternational.com
Reply: user@curator.apache.org user@curator.apache.org
Date: March 25, 2014 at 5:25:48 PM
To: user@curator.apache.org user@curator.apache.org
Subject: Re: Curator barriers missing watch events
I am using the no arg version! What's the bug?
_B
Re: Curator barriers missing watch events
Posted by Brian Phillips <br...@etinternational.com>.
I am using the no arg version! What's the bug?
_B
> On Mar 25, 2014, at 6:23 PM, "Jordan Zimmerman" <jo...@jordanzimmerman.com> wrote:
>
> Which version of enter() are you using? I see a potential bug when the no arg version of enter() is used.
Re: Curator barriers missing watch events
Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
Which version of enter() are you using? I see a potential bug when the no arg version of enter() is used.
From: Brian Phillips brian@etinternational.com
Reply: Brian Phillips brian@etinternational.com
Date: March 25, 2014 at 4:19:36 PM
To: Jordan Zimmerman jordan@jordanzimmerman.com
Subject: Re: Curator barriers missing watch events
Good idea, but yes I am. The connection state doesn’t change while I’m executing the barrier code. I think it’s some kind of race condition, as sometimes it works and sometimes it doesn’t. I’ve looked through the recipe code and it looks good as far as I can tell, though. I’m practically pulling my hair out at this point.
I may try a non-Curator, ZooKeeper-only barrier tomorrow and see if that works. Or I may start debugging the ZooKeeper client to see if it’s actually getting the watches but not delivering them.
_B
Re: Curator barriers missing watch events
Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
Are you setting a ConnectionStateListener? If the connection gets SUSPENDED or LOST then you’d need to reinitialize your barrier.
-JZ
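As a rough sketch of the idea above: the logic a connection state listener might carry is simply "remember that the barrier needs rebuilding once the connection has been SUSPENDED or LOST". The enum below is a local stand-in for the states named above, and the class and method names are hypothetical; they are assumptions made so the example runs without Curator on the classpath. In a real program this logic would live in a Curator ConnectionStateListener.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper; the State enum is a local stand-in, not Curator's type.
public class BarrierConnectionTracker {
    public enum State { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    private final AtomicBoolean needsReinit = new AtomicBoolean(false);

    // Would be invoked from the connection state listener callback.
    public void stateChanged(State newState) {
        if (newState == State.SUSPENDED || newState == State.LOST) {
            needsReinit.set(true);
        }
    }

    // Returns true once if the barrier should be rebuilt, then clears the flag.
    public boolean consumeReinitFlag() {
        return needsReinit.getAndSet(false);
    }

    public static void main(String[] args) {
        BarrierConnectionTracker tracker = new BarrierConnectionTracker();
        tracker.stateChanged(State.CONNECTED);
        System.out.println(tracker.consumeReinitFlag()); // false
        tracker.stateChanged(State.SUSPENDED);
        tracker.stateChanged(State.RECONNECTED);
        System.out.println(tracker.consumeReinitFlag()); // true
    }
}
```

The flag stays set across a later RECONNECTED event on purpose: a watch registered before the interruption may not be trustworthy afterwards, so the caller decides when to rebuild.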
From: Brian Phillips brian@etinternational.com
Reply: user@curator.apache.org user@curator.apache.org
Date: March 25, 2014 at 2:51:42 PM
To: user@curator.apache.org user@curator.apache.org
Subject: Re: Curator barriers missing watch events
I have tried writing a test program that launches two programs in the same manner: each makes a connection, then loops over barriers with a Thread.sleep(random) in between. This runs indefinitely and everything works out fine.
I have also tried writing my own barrier, which uses a SharedCount, where each guy tries to increment it until it hits memberQty. This too missed watch events and did not work properly.
It’s almost as if something else that I’ve done during the running of my program has broken ZooKeeper’s watch events somehow. Is there any good way to debug watch events in general? I’ve tried to look at the DEBUG output for my ZooKeeper server log, but it looks the same for the working vs. non-working barriers...
_B
Re: Curator barriers missing watch events
Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
One thing to know is that it’s not possible to get every ZK event. I don’t know if that helps. If it’s not too big, I can do a code review on your code. Of course, let’s not rule out a Curator bug. I’ll take another look at the Barrier code when I get a chance.
-JZ
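One likely reading of the remark about not getting every event is that ZooKeeper watches are one-time triggers: a change that happens after a watch has fired and before it is re-registered produces no event. The toy in-memory model below is only an illustration of that behavior under this assumption. It does not use the ZooKeeper API, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Toy in-memory model of a node with one-shot watches; not the ZooKeeper API.
public class OneShotWatchDemo {
    private int value = 0;
    private final List<Runnable> watchers = new ArrayList<>();

    // Reads the value and optionally registers a watch that fires at most once.
    public int getData(Runnable watcher) {
        if (watcher != null) {
            watchers.add(watcher);
        }
        return value;
    }

    // Changes the value, then fires and discards any registered watches.
    public void setData(int newValue) {
        value = newValue;
        List<Runnable> toFire = new ArrayList<>(watchers);
        watchers.clear();
        for (Runnable w : toFire) {
            w.run();
        }
    }

    public static void main(String[] args) {
        OneShotWatchDemo node = new OneShotWatchDemo();
        AtomicInteger events = new AtomicInteger();
        node.getData(events::incrementAndGet); // register a single watch
        node.setData(1);                       // fires the watch
        node.setData(2);                       // no watch left: change is unobserved
        System.out.println("events seen: " + events.get());
        System.out.println("current value: " + node.getData(null));
    }
}
```

The practical consequence is that a client should treat an event as a hint to re-read the current state, and should check the state again right after registering a watch, rather than counting events.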
From: Brian Phillips brian@etinternational.com
Reply: user@curator.apache.org user@curator.apache.org
Date: March 25, 2014 at 2:51:42 PM
To: user@curator.apache.org user@curator.apache.org
Subject: Re: Curator barriers missing watch events
I have tried writing a test program that launches two programs in the same manner: each makes a connection, then loops over barriers with a Thread.sleep(random) in between. This runs indefinitely and everything works out fine.
I have also tried writing my own barrier, which uses a SharedCount, where each guy tries to increment it until it hits memberQty. This too missed watch events and did not work properly.
It’s almost as if something else that I’ve done during the running of my program has broken ZooKeeper’s watch events somehow. Is there any good way to debug watch events in general? I’ve tried to look at the DEBUG output for my ZooKeeper server log, but it looks the same for the working vs. non-working barriers...
_B
Re: Curator barriers missing watch events
Posted by Brian Phillips <br...@etinternational.com>.
I have tried writing a test program that launches two programs in the same manner: each makes a connection, then loops over barriers with a Thread.sleep(random) in between. This runs indefinitely and everything works out fine.
I have also tried writing my own barrier, which uses a SharedCount, where each guy tries to increment it until it hits memberQty. This too missed watch events and did not work properly.
It’s almost as if something else that I’ve done during the running of my program has broken ZooKeeper’s watch events somehow. Is there any good way to debug watch events in general? I’ve tried to look at the DEBUG output for my ZooKeeper server log, but it looks the same for the working vs. non-working barriers...
_B
On Mar 25, 2014, at 3:42 PM, Jordan Zimmerman <jo...@jordanzimmerman.com> wrote:
> Unfortunately, the barrier recipes aren’t widely used (from what I know). So, there may well be a bug. If you could get a test to show the problem that would be ideal.
>
> -JZ
Re: Curator barriers missing watch events
Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
Unfortunately, the barrier recipes aren’t widely used (from what I know). So, there may well be a bug. If you could get a test to show the problem that would be ideal.
-JZ
From: Brian Phillips brian@etinternational.com
Reply: user@curator.apache.org user@curator.apache.org
Date: March 25, 2014 at 2:38:40 PM
To: user@curator.apache.org user@curator.apache.org
Subject: Curator barriers missing watch events
Hi guys,
I’ve been integrating curator into my project and have recently run into an issue I just can’t seem to make sense of.
I’m running two JVMs on the same host machine, each with their own curator connection. At the beginning of my program I’m using the DistributedDoubleBarrier recipe, and once again at the end of my program. A bunch of work is done in-between, including zookeeper set/get/watches of other nodes.
I’m finding that at the first double barrier, everyone always makes it through. At the job-end barrier, sometimes everyone gets through, but more often than not one of the programs hangs in enter()'s wait() and never gets the watch event for the ready path that notifies it to proceed. If I look in zookeeper, I can see that the ready path is actually set in there.
It would seem that the watch for one of the programs just never triggers.
To simplify debugging, I’ve set both double barriers to only ever call enter() and not leave(). Both barriers have their own separate path.
Also, the program never shuts down or disconnects from zookeeper. It just sleeps infinitely after it gets out of the final barrier.
Any idea on how to debug this issue? I don’t mind hacking up zookeeper/curator code to insert my own debugging statements if it comes to that.
_Brian