Posted to user@curator.apache.org by Steve Boyle <sb...@connexity.com> on 2016/08/17 17:43:35 UTC

Leader Latch question

I'm using the Leader Latch recipe.  I can successfully bring up two instances of my app and have one become 'active' and one become 'standby'.  Almost everything works as expected.  We had an issue in production: while adding a ZooKeeper to our existing quorum, both instances of the app became 'active'.  Unfortunately, the log files rolled over before we could check for exceptions.

I've been trying to reproduce this issue in a test environment.  There, I have two instances of my app configured to use a single ZooKeeper - this ZooKeeper is part of a 5-node quorum and is not currently the leader.  I can trigger both instances of the app to become 'active' if I use zkCli and manually delete the latch path from the single ZooKeeper to which my apps are connected.

When I manually delete the latch path, I can see via debug logging that the instance that was previously 'standby' gets a notification from ZooKeeper: "Got WatchedEvent state:SyncConnected type:NodeDeleted".  However, the instance that had already been active gets no notification at all.  Is it expected that manually removing the latch path would only generate notifications to some instances of my app?
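
For reference, a minimal sketch of the kind of setup described above - the class name, latch path, and connect string are illustrative placeholders, not the actual app's code (Curator 2.x, Java):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class ActiveStandbySketch
    {
        public static void main(String[] args) throws Exception
        {
            // Placeholder connect string; the real app lists the quorum members.
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181",
                    new ExponentialBackoffRetry(1000, 3));
            client.start();

            // Hypothetical latch path; both app instances must use the same path.
            LeaderLatch latch = new LeaderLatch(client, "/myapp/leader");
            latch.addListener(new LeaderLatchListener()
            {
                @Override
                public void isLeader()
                {
                    System.out.println("became active");   // promote this instance
                }

                @Override
                public void notLeader()
                {
                    System.out.println("became standby");  // demote this instance
                }
            });
            latch.start();

            Thread.currentThread().join();   // keep the process alive for the demo
        }
    }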

Thanks,
Steve Boyle

Re: Leader Latch question

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
No - notLeader() will not get called automatically when there’s a network partition. Please see:

http://curator.apache.org/errors.html

and

http://curator.apache.org/curator-recipes/leader-latch.html - Error Handling
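
For illustration only (the linked pages are authoritative), a connection-state listener along these lines is one way to treat SUSPENDED/LOST as losing leadership; becomeStandby() is a hypothetical stand-in for the app's demotion logic:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.state.ConnectionState;
    import org.apache.curator.framework.state.ConnectionStateListener;

    public class LatchConnectionListener implements ConnectionStateListener
    {
        @Override
        public void stateChanged(CuratorFramework client, ConnectionState newState)
        {
            if ( newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST )
            {
                // Conservatively assume leadership is gone until the session is
                // re-established; RECONNECTED can be used to resume or re-enter the latch.
                becomeStandby();
            }
        }

        private void becomeStandby()
        {
            // application-specific demotion logic (hypothetical)
        }
    }

    // Registration (sketch):
    //   client.getConnectionStateListenable().addListener(new LatchConnectionListener());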

-Jordan

> On Aug 17, 2016, at 3:14 PM, Steve Boyle <sb...@connexity.com> wrote:
> 
> I should note that we are using version 2.9.1.  I believe we rely on Curator to handle the Lost and Suspended cases, looks like we’d expect calls to leaderLatchListener.isLeader and leaderLatchListener.notLeader.  We’ve never seen long GCs with this app, I’ll start logging that.
>  
> Thanks,
> Steve
> 
> From: Jordan Zimmerman [mailto:jordan@jordanzimmerman.com] 
> Sent: Wednesday, August 17, 2016 11:23 AM
> To: user@curator.apache.org
> Subject: Re: Leader Latch question
>  
> * How do you handle CONNECTION_SUSPENDED and CONNECTION_LOST? 
> * Was there possibly a very long gc? See https://cwiki.apache.org/confluence/display/CURATOR/TN10
>  
> -Jordan
>  
> On Aug 17, 2016, at 1:07 PM, Steve Boyle <sboyle@connexity.com> wrote:
>  
> I appreciate your response.  Any thoughts on how the issue may have occurred in production?  Or thoughts on how to reproduce that scenario?
>  
> In the production case, there were two instances of the app – both configured for a list of 5 zookeepers.
>  
> Thanks,
> Steve
>  
> From: Jordan Zimmerman [mailto:jordan@jordanzimmerman.com]
> Sent: Wednesday, August 17, 2016 11:03 AM
> To: user@curator.apache.org
> Subject: Re: Leader Latch question
>  
> Manual removal of the latch node isn’t supported. It would require the latch to add a watch on its own node and that has performance/runtime overhead. The recommended behavior is to watch for connection loss/suspended events and exit your latch when that happens. 
>  
> -Jordan
>  
> On Aug 17, 2016, at 12:43 PM, Steve Boyle <sboyle@connexity.com> wrote:
>  
> I’m using the Leader Latch recipe.  I can successfully bring up two instances of my app and have one become ‘active’ and one become ‘standby’.  Most everything works as expected.  We had an issue, in production, when adding a zookeeper to our existing quorum, both instances of the app became ‘active’.  Unfortunately, the log files rolled over before we could check for exceptions.  I’ve been trying to reproduce this issue in a test environment.  In my test environment, I have two instances of my app configured to use a single zookeeper – this zookeeper is part of a 5 node quorum and is not currently the leader.  I can trigger both instances of the app to become ‘active’ if I use zkCli and manually delete the latch path from the single zookeeper to which my apps are connected.  When I manually delete the latch path, I can see via debug logging that the instance that was previously ‘standby’ gets a notification from zookeeper “Got WatchedEvent state:SyncConnected type:NodeDeleted”.  However, the instance that had already been active gets no notification at all.  Is it expected that manually removing the latch path would only generate notifications to some instances of my app?
>  
> Thanks,
> Steve Boyle


RE: Leader Latch question

Posted by Steve Boyle <sb...@connexity.com>.
OK, I tried the ConnectionStateListener and called notLeader for SUSPENDED; it looked like notLeader got called twice when I did that.  Regarding TN10, when you say “VM pauses might exceed your client heartbeat and cause a client misperception about its state for a short period of time once the VM un-pauses”, what happens after that ‘short period’ of time to change the client’s misperception?
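
A possible guard, sketched here purely as an assumption about how an app might avoid the duplicate call - promotion and demotion made idempotent, so the latch listener and the connection-state listener can each trigger at most one transition:

    import java.util.concurrent.atomic.AtomicBoolean;

    public class RoleGuard
    {
        private final AtomicBoolean active = new AtomicBoolean(false);

        // Called from LeaderLatchListener.isLeader()
        public void becomeActive()
        {
            if ( active.compareAndSet(false, true) )
            {
                // promote exactly once per transition
            }
        }

        // Called from LeaderLatchListener.notLeader() and from the
        // ConnectionStateListener on SUSPENDED/LOST
        public void becomeStandby()
        {
            if ( active.compareAndSet(true, false) )
            {
                // demote exactly once per transition
            }
        }
    }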

Thanks,
Steve

From: Jordan Zimmerman [mailto:jordan@jordanzimmerman.com]
Sent: Wednesday, August 17, 2016 1:23 PM
To: user@curator.apache.org
Subject: Re: Leader Latch question

i apologize - I was thinking of a different recipe. LeaderLatch does handle partitions internally. Maybe it’s a gc

On Aug 17, 2016, at 3:14 PM, Steve Boyle <sb...@connexity.com> wrote:

I should note that we are using version 2.9.1.  I believe we rely on Curator to handle the Lost and Suspended cases, looks like we’d expect calls to leaderLatchListener.isLeader and leaderLatchListener.notLeader.  We’ve never seen long GCs with this app, I’ll start logging that.

Thanks,
Steve

From: Jordan Zimmerman [mailto:jordan@jordanzimmerman.com]
Sent: Wednesday, August 17, 2016 11:23 AM
To: user@curator.apache.org
Subject: Re: Leader Latch question

* How do you handle CONNECTION_SUSPENDED and CONNECTION_LOST?
* Was there possibly a very long gc? See https://cwiki.apache.org/confluence/display/CURATOR/TN10

-Jordan

On Aug 17, 2016, at 1:07 PM, Steve Boyle <sb...@connexity.com> wrote:

I appreciate your response.  Any thoughts on how the issue may have occurred in production?  Or thoughts on how to reproduce that scenario?

In the production case, there were two instances of the app – both configured for a list of 5 zookeepers.

Thanks,
Steve

From: Jordan Zimmerman [mailto:jordan@jordanzimmerman.com]
Sent: Wednesday, August 17, 2016 11:03 AM
To: user@curator.apache.org
Subject: Re: Leader Latch question

Manual removal of the latch node isn’t supported. It would require the latch to add a watch on its own node and that has performance/runtime overhead. The recommended behavior is to watch for connection loss/suspended events and exit your latch when that happens.

-Jordan

On Aug 17, 2016, at 12:43 PM, Steve Boyle <sb...@connexity.com> wrote:

I’m using the Leader Latch recipe.  I can successfully bring up two instances of my app and have one become ‘active’ and one become ‘standby’.  Most everything works as expected.  We had an issue, in production, when adding a zookeeper to our existing quorum, both instances of the app became ‘active’.  Unfortunately, the log files rolled over before we could check for exceptions.  I’ve been trying to reproduce this issue in a test environment.  In my test environment, I have two instances of my app configured to use a single zookeeper – this zookeeper is part of a 5 node quorum and is not currently the leader.  I can trigger both instances of the app to become ‘active’ if I use zkCli and manually delete the latch path from the single zookeeper to which my apps are connected.  When I manually delete the latch path, I can see via debug logging that the instance that was previously ‘standby’ gets a notification from zookeeper “Got WatchedEvent state:SyncConnected type:NodeDeleted”.  However, the instance that had already been active gets no notification at all.  Is it expected that manually removing the latch path would only generate notifications to some instances of my app?

Thanks,
Steve Boyle


Re: Leader Latch question

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
I apologize - I was thinking of a different recipe. LeaderLatch does handle partitions internally. Maybe it’s a GC pause.

> On Aug 17, 2016, at 3:14 PM, Steve Boyle <sb...@connexity.com> wrote:
> 
> I should note that we are using version 2.9.1.  I believe we rely on Curator to handle the Lost and Suspended cases, looks like we’d expect calls to leaderLatchListener.isLeader and leaderLatchListener.notLeader.  We’ve never seen long GCs with this app, I’ll start logging that.
>  
> Thanks,
> Steve
> 
> From: Jordan Zimmerman [mailto:jordan@jordanzimmerman.com] 
> Sent: Wednesday, August 17, 2016 11:23 AM
> To: user@curator.apache.org
> Subject: Re: Leader Latch question
>  
> * How do you handle CONNECTION_SUSPENDED and CONNECTION_LOST? 
> * Was there possibly a very long gc? See https://cwiki.apache.org/confluence/display/CURATOR/TN10
>  
> -Jordan
>  
> On Aug 17, 2016, at 1:07 PM, Steve Boyle <sboyle@connexity.com> wrote:
>  
> I appreciate your response.  Any thoughts on how the issue may have occurred in production?  Or thoughts on how to reproduce that scenario?
>  
> In the production case, there were two instances of the app – both configured for a list of 5 zookeepers.
>  
> Thanks,
> Steve
>  
> From: Jordan Zimmerman [mailto:jordan@jordanzimmerman.com]
> Sent: Wednesday, August 17, 2016 11:03 AM
> To: user@curator.apache.org
> Subject: Re: Leader Latch question
>  
> Manual removal of the latch node isn’t supported. It would require the latch to add a watch on its own node and that has performance/runtime overhead. The recommended behavior is to watch for connection loss/suspended events and exit your latch when that happens. 
>  
> -Jordan
>  
> On Aug 17, 2016, at 12:43 PM, Steve Boyle <sboyle@connexity.com> wrote:
>  
> I’m using the Leader Latch recipe.  I can successfully bring up two instances of my app and have one become ‘active’ and one become ‘standby’.  Most everything works as expected.  We had an issue, in production, when adding a zookeeper to our existing quorum, both instances of the app became ‘active’.  Unfortunately, the log files rolled over before we could check for exceptions.  I’ve been trying to reproduce this issue in a test environment.  In my test environment, I have two instances of my app configured to use a single zookeeper – this zookeeper is part of a 5 node quorum and is not currently the leader.  I can trigger both instances of the app to become ‘active’ if I use zkCli and manually delete the latch path from the single zookeeper to which my apps are connected.  When I manually delete the latch path, I can see via debug logging that the instance that was previously ‘standby’ gets a notification from zookeeper “Got WatchedEvent state:SyncConnected type:NodeDeleted”.  However, the instance that had already been active gets no notification at all.  Is it expected that manually removing the latch path would only generate notifications to some instances of my app?
>  
> Thanks,
> Steve Boyle


RE: Leader Latch question

Posted by Steve Boyle <sb...@connexity.com>.
I should note that we are using version 2.9.1.  I believe we rely on Curator to handle the Lost and Suspended cases; it looks like we’d expect calls to leaderLatchListener.isLeader and leaderLatchListener.notLeader.  We’ve never seen long GCs with this app, but I’ll start logging GC activity.

Thanks,
Steve

From: Jordan Zimmerman [mailto:jordan@jordanzimmerman.com]
Sent: Wednesday, August 17, 2016 11:23 AM
To: user@curator.apache.org
Subject: Re: Leader Latch question

* How do you handle CONNECTION_SUSPENDED and CONNECTION_LOST?
* Was there possibly a very long gc? See https://cwiki.apache.org/confluence/display/CURATOR/TN10

-Jordan

On Aug 17, 2016, at 1:07 PM, Steve Boyle <sb...@connexity.com> wrote:

I appreciate your response.  Any thoughts on how the issue may have occurred in production?  Or thoughts on how to reproduce that scenario?

In the production case, there were two instances of the app – both configured for a list of 5 zookeepers.

Thanks,
Steve

From: Jordan Zimmerman [mailto:jordan@jordanzimmerman.com]
Sent: Wednesday, August 17, 2016 11:03 AM
To: user@curator.apache.org
Subject: Re: Leader Latch question

Manual removal of the latch node isn’t supported. It would require the latch to add a watch on its own node and that has performance/runtime overhead. The recommended behavior is to watch for connection loss/suspended events and exit your latch when that happens.

-Jordan

On Aug 17, 2016, at 12:43 PM, Steve Boyle <sb...@connexity.com> wrote:

I’m using the Leader Latch recipe.  I can successfully bring up two instances of my app and have one become ‘active’ and one become ‘standby’.  Most everything works as expected.  We had an issue, in production, when adding a zookeeper to our existing quorum, both instances of the app became ‘active’.  Unfortunately, the log files rolled over before we could check for exceptions.  I’ve been trying to reproduce this issue in a test environment.  In my test environment, I have two instances of my app configured to use a single zookeeper – this zookeeper is part of a 5 node quorum and is not currently the leader.  I can trigger both instances of the app to become ‘active’ if I use zkCli and manually delete the latch path from the single zookeeper to which my apps are connected.  When I manually delete the latch path, I can see via debug logging that the instance that was previously ‘standby’ gets a notification from zookeeper “Got WatchedEvent state:SyncConnected type:NodeDeleted”.  However, the instance that had already been active gets no notification at all.  Is it expected that manually removing the latch path would only generate notifications to some instances of my app?

Thanks,
Steve Boyle


Re: Leader Latch question

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
* How do you handle CONNECTION_SUSPENDED and CONNECTION_LOST? 
* Was there possibly a very long gc? See https://cwiki.apache.org/confluence/display/CURATOR/TN10
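
For context on the GC question: a pause only matters relative to the ZooKeeper session timeout the client negotiates. A sketch of where that timeout is set (connect string and values are illustrative, not from the app):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class SessionTimeoutSketch
    {
        public static void main(String[] args)
        {
            // A stop-the-world pause longer than the (negotiated) session timeout
            // can cause ZooKeeper to expire the session while the JVM is frozen.
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181",   // placeholder connect string
                    60000,                          // sessionTimeoutMs (illustrative)
                    15000,                          // connectionTimeoutMs (illustrative)
                    new ExponentialBackoffRetry(1000, 3));
            client.start();
        }
    }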

-Jordan

> On Aug 17, 2016, at 1:07 PM, Steve Boyle <sb...@connexity.com> wrote:
> 
> I appreciate your response.  Any thoughts on how the issue may have occurred in production?  Or thoughts on how to reproduce that scenario?
>  
> In the production case, there were two instances of the app – both configured for a list of 5 zookeepers.
>  
> Thanks,
> Steve
> 
> From: Jordan Zimmerman [mailto:jordan@jordanzimmerman.com] 
> Sent: Wednesday, August 17, 2016 11:03 AM
> To: user@curator.apache.org
> Subject: Re: Leader Latch question
>  
> Manual removal of the latch node isn’t supported. It would require the latch to add a watch on its own node and that has performance/runtime overhead. The recommended behavior is to watch for connection loss/suspended events and exit your latch when that happens. 
>  
> -Jordan
>  
> On Aug 17, 2016, at 12:43 PM, Steve Boyle <sboyle@connexity.com> wrote:
>  
> I’m using the Leader Latch recipe.  I can successfully bring up two instances of my app and have one become ‘active’ and one become ‘standby’.  Most everything works as expected.  We had an issue, in production, when adding a zookeeper to our existing quorum, both instances of the app became ‘active’.  Unfortunately, the log files rolled over before we could check for exceptions.  I’ve been trying to reproduce this issue in a test environment.  In my test environment, I have two instances of my app configured to use a single zookeeper – this zookeeper is part of a 5 node quorum and is not currently the leader.  I can trigger both instances of the app to become ‘active’ if I use zkCli and manually delete the latch path from the single zookeeper to which my apps are connected.  When I manually delete the latch path, I can see via debug logging that the instance that was previously ‘standby’ gets a notification from zookeeper “Got WatchedEvent state:SyncConnected type:NodeDeleted”.  However, the instance that had already been active gets no notification at all.  Is it expected that manually removing the latch path would only generate notifications to some instances of my app?
>  
> Thanks,
> Steve Boyle


RE: Leader Latch question

Posted by Steve Boyle <sb...@connexity.com>.
I appreciate your response.  Any thoughts on how the issue may have occurred in production?  Or thoughts on how to reproduce that scenario?

In the production case, there were two instances of the app – both configured for a list of 5 zookeepers.

Thanks,
Steve

From: Jordan Zimmerman [mailto:jordan@jordanzimmerman.com]
Sent: Wednesday, August 17, 2016 11:03 AM
To: user@curator.apache.org
Subject: Re: Leader Latch question

Manual removal of the latch node isn’t supported. It would require the latch to add a watch on its own node and that has performance/runtime overhead. The recommended behavior is to watch for connection loss/suspended events and exit your latch when that happens.

-Jordan

On Aug 17, 2016, at 12:43 PM, Steve Boyle <sb...@connexity.com> wrote:

I’m using the Leader Latch recipe.  I can successfully bring up two instances of my app and have one become ‘active’ and one become ‘standby’.  Most everything works as expected.  We had an issue, in production, when adding a zookeeper to our existing quorum, both instances of the app became ‘active’.  Unfortunately, the log files rolled over before we could check for exceptions.  I’ve been trying to reproduce this issue in a test environment.  In my test environment, I have two instances of my app configured to use a single zookeeper – this zookeeper is part of a 5 node quorum and is not currently the leader.  I can trigger both instances of the app to become ‘active’ if I use zkCli and manually delete the latch path from the single zookeeper to which my apps are connected.  When I manually delete the latch path, I can see via debug logging that the instance that was previously ‘standby’ gets a notification from zookeeper “Got WatchedEvent state:SyncConnected type:NodeDeleted”.  However, the instance that had already been active gets no notification at all.  Is it expected that manually removing the latch path would only generate notifications to some instances of my app?

Thanks,
Steve Boyle


Re: Leader Latch question

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
Manual removal of the latch node isn’t supported. It would require the latch to add a watch on its own node and that has performance/runtime overhead. The recommended behavior is to watch for connection loss/suspended events and exit your latch when that happens. 

-Jordan

> On Aug 17, 2016, at 12:43 PM, Steve Boyle <sb...@connexity.com> wrote:
> 
> I’m using the Leader Latch recipe.  I can successfully bring up two instances of my app and have one become ‘active’ and one become ‘standby’.  Most everything works as expected.  We had an issue, in production, when adding a zookeeper to our existing quorum, both instances of the app became ‘active’.  Unfortunately, the log files rolled over before we could check for exceptions.  I’ve been trying to reproduce this issue in a test environment.  In my test environment, I have two instances of my app configured to use a single zookeeper – this zookeeper is part of a 5 node quorum and is not currently the leader.  I can trigger both instances of the app to become ‘active’ if I use zkCli and manually delete the latch path from the single zookeeper to which my apps are connected.  When I manually delete the latch path, I can see via debug logging that the instance that was previously ‘standby’ gets a notification from zookeeper “Got WatchedEvent state:SyncConnected type:NodeDeleted”.  However, the instance that had already been active gets no notification at all.  Is it expected that manually removing the latch path would only generate notifications to some instances of my app?
>  
> Thanks,
> Steve Boyle