Posted to user@zookeeper.apache.org by "Figura, Mark" <mf...@empirix.com> on 2016/06/22 13:18:54 UTC

Handling of xid rollover

Hi,

We are using ZooKeeper 3.4.5 along with Curator to perform leader elections and also store some application data on a 3-node ensemble. Our application is not hard-realtime, but glitches in stream processing do get noticed and may raise support tickets.

Yesterday, we had such a glitch and, looking through the logs, I found there was an XID rollover. When this happened, a new election within the ensemble was triggered and all client connections were closed. From our application's point of view (possibly filtered through Curator), we saw the session expire and then the connection was lost. This caused our application to shut down each component, re-perform leader elections, and eventually start back up.

We do have an issue where our application is making many more writes than it should, but once this is fixed, we'll still run into an XID rollover sooner or later.

Is there something our application can do to handle this situation better? Are there any plans for ZooKeeper to handle this situation without closing client connections?

Thanks!
Mark
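
[Editorial sketch] To put a rough number on "sooner or later": the rollover discussed in this thread is of the transaction counter held in the low 32 bits of the zxid (see ZOOKEEPER-1277, cited in the replies), so one leader epoch has room for 2^32 transactions. The Java sketch below estimates the time to rollover under simplifying assumptions: a constant transaction rate and no leader election in between. The class name, method name, and rates are hypothetical and chosen only for illustration.

```java
import java.time.Duration;

// Rough estimate only: assumes a constant transaction rate and no leader
// election in between (a new epoch starts the counter over).
final class RolloverEstimate {
    // The low 32 bits of a zxid hold the per-epoch transaction counter.
    static final long COUNTER_SPACE = 1L << 32;

    static Duration timeToRollover(long counterNow, double txnsPerSecond) {
        if (txnsPerSecond <= 0) {
            throw new IllegalArgumentException("txnsPerSecond must be positive");
        }
        long remaining = COUNTER_SPACE - counterNow;
        return Duration.ofSeconds((long) (remaining / txnsPerSecond));
    }

    public static void main(String[] args) {
        // Hypothetical rates, not measurements from the application above.
        for (double rate : new double[] {100, 1_000, 10_000}) {
            Duration d = timeToRollover(0, rate);
            System.out.printf("%.0f txns/s -> about %d days%n", rate, d.toDays());
        }
    }
}
```

Note that the counter advances on every transaction the ensemble commits, not only on application data writes, so a real estimate should be based on the observed zxid growth rather than on the application's own write count.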

Re: Handling of xid rollover

Posted by Patrick Hunt <ph...@apache.org>.

Local sessions are only available in 3.5+, so I don't think that's an issue
for Mark (3.4.5). However, it's a really good point and I'm not sure myself
what would happen. Thanks, Dan!

Patrick

RE: Handling of xid rollover

Posted by "Figura, Mark" <mf...@empirix.com>.

Thanks for the responses.

Patrick: Thanks for the example of shutting down ZK for an hour. That makes a lot of sense.

Looking further at our application logs, I see that only SOME instances actually saw a lost session, not ALL as I had thought. Other instances saw the lost connection but were able to reestablish it within a short time. The instances that saw a session loss also have an unexpected gap in application log timestamps, so I'm assuming this is something on my end.

This caused our processing glitch because we handle connection loss and session loss the same way, as recommended in the Curator LeaderSelector docs. I'll look into whether we should handle those two cases separately. I suppose the ultimate solution would be for our app to recover from a leader change more quickly, though...

Thank you!
Mark
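
[Editorial sketch] One way to picture handling the two cases separately is a small state handler that pauses on a suspended connection and does the full teardown only when the session is actually lost. The Java sketch below is a minimal illustration under stated assumptions: `ConnState` is a local stand-in that mirrors some of Curator's `ConnectionState` values so the code compiles without Curator, and `StreamComponent` with its action names is entirely hypothetical. It is not the approach from the Curator docs, and a real implementation would still have to confirm after reconnecting that it holds leadership before resuming leader-only work.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in for Curator's ConnectionState values, declared here so the
// sketch compiles without Curator on the classpath.
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

// Hypothetical application component; names are illustrative only.
final class StreamComponent {
    enum Mode { STOPPED, RUNNING, PAUSED }

    private Mode mode = Mode.STOPPED;
    private final List<String> actions = new ArrayList<>();

    void onStateChange(ConnState state) {
        switch (state) {
            case CONNECTED:
            case RECONNECTED:
                if (mode == Mode.PAUSED) {
                    // Connection came back before the session was declared
                    // lost: resume without a full teardown.
                    mode = Mode.RUNNING;
                    actions.add("resume");
                } else if (mode == Mode.STOPPED) {
                    mode = Mode.RUNNING;
                    actions.add("full-start");
                }
                break;
            case SUSPENDED:
                if (mode == Mode.RUNNING) {
                    // Leadership can no longer be assumed, so stop acting on
                    // it, but keep state in case the session survives.
                    mode = Mode.PAUSED;
                    actions.add("pause");
                }
                break;
            case LOST:
                if (mode != Mode.STOPPED) {
                    // Session is gone: ephemeral nodes and leadership with it.
                    mode = Mode.STOPPED;
                    actions.add("full-stop");
                }
                break;
        }
    }

    Mode mode() { return mode; }

    List<String> actions() { return actions; }

    public static void main(String[] args) {
        StreamComponent c = new StreamComponent();
        ConnState[] events = {
            ConnState.CONNECTED, ConnState.SUSPENDED, ConnState.RECONNECTED,
            ConnState.LOST, ConnState.RECONNECTED
        };
        for (ConnState s : events) {
            c.onStateChange(s);
        }
        System.out.println(c.actions());
    }
}
```

The design choice here is that a short connection blip costs only a pause and a resume, while the expensive shutdown and restart path is reserved for a lost session.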

Re: Handling of xid rollover

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.

Curator 3.0 will simulate a session expiration when there’s a network partition, but Curator 2.0 does not. If you’re using ZK 3.4.5 you’d be using Curator 2.0 so the only way you’d see a session expiration is when you successfully reconnect to the ensemble.

-JZ
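
[Editorial sketch] To make "simulate a session expiration" concrete: while partitioned, a client cannot hear from the ensemble, so the only way to conclude locally that the session is probably gone is to time how long the connection has been suspended. The Java sketch below shows that idea in its simplest form. It is an assumption-laden illustration, not Curator's implementation; the class `ExpirationGuard` and its methods are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```java
// Hypothetical client-side guard: treats the session as expired once the
// connection has been suspended for at least the session timeout.
final class ExpirationGuard {
    private final long sessionTimeoutMs;
    private long suspendedAtMs = -1; // -1 means "not currently suspended"

    ExpirationGuard(long sessionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    void onSuspended(long nowMs) {
        // Keep the first suspension time if we are already suspended.
        if (suspendedAtMs < 0) {
            suspendedAtMs = nowMs;
        }
    }

    void onReconnected() {
        suspendedAtMs = -1;
    }

    boolean shouldTreatAsExpired(long nowMs) {
        return suspendedAtMs >= 0 && nowMs - suspendedAtMs >= sessionTimeoutMs;
    }

    public static void main(String[] args) {
        ExpirationGuard guard = new ExpirationGuard(30_000);
        guard.onSuspended(1_000);
        System.out.println(guard.shouldTreatAsExpired(10_000));
        System.out.println(guard.shouldTreatAsExpired(31_000));
    }
}
```

Without a guard of this kind, the client only learns about an expiration from the server, which matches the statement above that it is seen on a successful reconnect.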

Re: Handling of xid rollover

Posted by Dan Benediktson <db...@twitter.com.INVALID>.

I believe if local sessions are in use and the session in question hasn't
been upgraded to global by creating an ephemeral node, it would see session
expiration after a leader election (unless maybe if it lands on the same
peer - I do not remember if the session table gets recycled completely in
that case).

Re: Handling of xid rollover

Posted by Patrick Hunt <ph...@apache.org>.

Hi Mark. See this jira for background:
https://issues.apache.org/jira/browse/ZOOKEEPER-1277

However, what you describe is correct behavior from our perspective. When
the lower 32 bits of the zxid roll over we now (that was the fix) force a
re-election of the leader. Leader re-election causes the quorum to stop
serving clients until a new quorum forms.
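
[Editorial sketch] For readers unfamiliar with the layout: a zxid is a 64-bit number whose high 32 bits are the leader epoch and whose low 32 bits are a counter within that epoch. The Java sketch below splits a zxid into those two parts and shows why an exhausted counter is a problem: one more increment would carry into the epoch bits. The class and method names are hypothetical helpers written for this illustration.

```java
// Hypothetical helpers for inspecting a zxid; not ZooKeeper's own classes.
final class Zxid {
    private static final long COUNTER_MASK = 0xFFFFFFFFL;

    // High 32 bits: the leader epoch.
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    // Low 32 bits: the transaction counter within the epoch.
    static long counter(long zxid) {
        return zxid & COUNTER_MASK;
    }

    static long make(long epoch, long counter) {
        return (epoch << 32) | (counter & COUNTER_MASK);
    }

    // True when the next increment would carry out of the counter bits.
    static boolean counterExhausted(long zxid) {
        return counter(zxid) == COUNTER_MASK;
    }

    public static void main(String[] args) {
        long last = make(5, COUNTER_MASK);
        System.out.printf("epoch=%d counter=%d exhausted=%b%n",
                epoch(last), counter(last), counterExhausted(last));
        // A plain increment silently bumps the epoch field.
        System.out.printf("after +1: epoch=%d counter=%d%n",
                epoch(last + 1), counter(last + 1));
    }
}
```

Forcing a leader election at that point lets the ensemble move to a new epoch through the normal election path instead of by an accidental carry.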

Leader re-election is normal behavior for the ZK service. It happens
whenever the current leader is lost and a new quorum, with a (possibly new)
leader, needs to form, for example if the current leader process is
restarted. Your clients need to be able to handle this situation (typically
the client library does this for you).

That said, you should not be seeing session expiration as a result of this.
Client timeouts certainly, but not session expiration. It might happen for
other reasons, but the leader is the one responsible for expiring sessions.
If there is no leader (e.g. one is being re-elected) there is no session
expiration. When the new leader is elected it resets the clock on session
expiration, for all sessions, from the time it's elected. For example, you
can shut down the entire ZK server ensemble, start it back up an hour later,
and the clients should all be able to rejoin. Hm, that said, I'm not sure
whether Curator is doing some special magic; that's the behavior of the
stock client that we ship.

Patrick
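
[Editorial sketch] The "shut down for an hour" example follows from two points made above: only the leader expires sessions, and a newly elected leader restarts every session's expiration clock. The toy model below illustrates that reasoning in Java. It is a deliberately simplified sketch with hypothetical names (`ToySessionTracker` and its methods), not ZooKeeper's session tracking code, and time is passed in as plain millisecond values.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of leader-side session expiry. While there is no leader, nothing
// calls isExpired, so no session can expire during an outage.
final class ToySessionTracker {
    private final Map<Long, Long> timeoutMs = new HashMap<>();  // id -> timeout
    private final Map<Long, Long> deadlineMs = new HashMap<>(); // id -> deadline

    void addSession(long id, long timeout, long nowMs) {
        timeoutMs.put(id, timeout);
        deadlineMs.put(id, nowMs + timeout);
    }

    // A heartbeat from the client pushes the deadline out again.
    void touch(long id, long nowMs) {
        Long timeout = timeoutMs.get(id);
        if (timeout != null) {
            deadlineMs.put(id, nowMs + timeout);
        }
    }

    // A new leader restarts the clock for every known session.
    void onLeaderElected(long nowMs) {
        for (Map.Entry<Long, Long> e : timeoutMs.entrySet()) {
            deadlineMs.put(e.getKey(), nowMs + e.getValue());
        }
    }

    boolean isExpired(long id, long nowMs) {
        Long deadline = deadlineMs.get(id);
        return deadline != null && nowMs >= deadline;
    }

    public static void main(String[] args) {
        ToySessionTracker tracker = new ToySessionTracker();
        tracker.addSession(1L, 30_000, 0);      // 30 s session timeout
        long oneHour = 3_600_000;
        tracker.onLeaderElected(oneHour);       // ensemble back after an hour
        System.out.println(tracker.isExpired(1L, oneHour + 1));
    }
}
```

In this model the client has a full session timeout, counted from the election, in which to reconnect and heartbeat before the session can expire.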
