You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@zookeeper.apache.org by Scott Fines <sc...@gmail.com> on 2011/04/21 15:37:26 UTC

Unexpected behavior with Session Timeouts in Java Client

Hello everybody,

I've been working on a situation with ZooKeeper leader elections and there
seems like an odd rule about Session Timeouts is currently in place in the
java client.

The task I'm trying to accomplish is the following:

I have multiple clients running the same body of code. For consistency in
the "everything works" case, I need to have only one machine actually doing
work at a time. However, when that machine fails for any reason, I need
another machine to pick up the work as quickly as possible. So far, so good:
do a leader election in ZooKeeper and we're set.

However, these processes are long-running, and themselves do not need to
interact with ZooKeeper in any way. What I've been seeing is that, if
ZooKeeper itself goes down, or if the leader becomes partitioned from the ZK
cluster for a long period of time ( > session timeout),  then the ZooKeeper
client seems to infinitely attempt to reconnect, in the background, but the
application code itself never receives a session expired event.

>From reading the mailing list archives, I saw this
thread<http://www.mail-archive.com/zookeeper-user@hadoop.apache.org/msg01274.html>which
indicates that a Session Expired event does not get fired until the
ZooKeeper client reconnects to the cluster. However, for my use-case this is
not ideal, since once the Leader has been elected, it may run for months (if
all goes well) without needing to contact zookeeper again. So, in this
world, I could end up in an inconsistent state, where the process is running
on two separate clients because the one leader has been ousted, but doesn't
know it.

This isn't terribly difficult to work around: I can create a background
thread that pings ZooKeeper every N milliseconds, and if there's a
connection loss for a time greater than the session timeout, fire a
SessionExpired event back to the application so it can kill itself, but it
made me wonder why this particular choice was made. It seems like, from
looking at the log output of a client, that ClientCnxn will basically fall
into an infinite loop of trying to reconnect to ZK until it succeeds, at
which point (or some point soon after--perhaps the next time somebody tries
to use that client instance?) a SessionExpiration will be dealt with. It
seems to me, though, that all the information is there already--after the
session timeout has been exceeded without connecting to ZK, we know that
that instance is shot, so why wait until we've reconnected to fire the
Session expiration? Why not just fire it right away and then give up trying?
Is there a performance or consistency reason why that wouldn't work?

Thanks for the help,

Scott

Re: Unexpected behavior with Session Timeouts in Java Client

Posted by Ted Dunning <te...@gmail.com>.

Nice summary.  Your situation is definitely not the normal case and I
completely misunderstood your original question.

Since you can survive double connections, I would recommend something a bit
different than before.

Your master can keep downloading during a connection loss and stop
downloading at t_loss + session-expiration + epsilon if connectivity with ZK
is not re-established or whenever an expiration event is received.  For
momentary connection losses, the master would continue downloads with no
interruption and nobody would be tempted to take over.  For longer
connection losses, you may have epsilon time period with the old master
still downloading.

In the worst case, you may have a network partition followed by ZK restart
which would expire the session early and cause somebody else to take over
the download early.  This is a fairly bizarre scenario with only moderate
cost and pretty low likelihood.  Certain power loss scenarios are the most
common way for this to happen.

It should pointed out that you can have connection loss in normal operation
due to the standard procedure of doing a rolling restart during an upgrade.
 This will cause a transient connection loss without session expiration as
the server you are connected to restarts.  You should plan for these events
because otherwise you can't upgrade ZK.

On Thu, Apr 21, 2011 at 3:41 PM, Scott Fines <sc...@gmail.com> wrote:

> Perhaps I am not being clear in my description.
>
> I'm building a system, that receives data events from an external source.
> This external source is not necessarily under my control, and starting up a
> connection to that source is highly expensive, as it entails a
> high-latency,
> low-bandwith transfer of data. It's more stable than packet radio, but not
> a
> whole lot faster. This system retains the ability to recreate events, and
> can do so upon request, but the cost to recreate is extremely high.
>
> On the other end, the system pushes these events on to distributed
> processing and (eventually) a long-term storage situation, like Cassandra
> or
> Hadoop. Each event that is received can be idempotently applied, so the
> system can safely process duplicate messages if they come in.
>
> If the world were perfect and we had Infiniband connections to all of our
> external sources, then there would be no reason for a leader-election
> protocol in this scenario. I would just boot the system on every node, and
> have them do their thing, and why worry? Idempotency is a beautiful thing.
> Sadly, the world is not perfect, and trying to deal with an already slow
> external connection by asking it to send the same data 10 or 15 times is
> not
> a great idea, performance-wise. In addition to slowing everything down on
> the receiving end, it also has an adverse affect on the source's
> performance; the source, it must be noted, has other things to do besides
> just feeding my system data.
>
> So my solution is to limit the number of external connection to 1, and use
> ZooKeeper leader-elections to manage which machine is running at which
> time.
> This way, we keep the number of external connections down as low as we go,
> we can guarantee that messages are received and processed idempotently, and
> in the normal situation where there is no trouble at all, life is fine.
>
> What I am trying to deal with right now is how to manage the corner cases
> of
> when communication with ZooKeeper breaks down.
>
> To answer your question about the ZooKeeper cluster installation: no, it is
> not located in multiple data centers. It is, however, co-located with other
> processes. For about 90-95% (we have an actual measurement, but I can't
> remember it off the top of my head) of the time, the resource utilization
> is
> low enough and ZooKeeper is lightweight enough that it makes sense to
> co-locate. Occasionally, however, we do see a spike in an individual
> machine's utilization. Even more occasionally, that spike can result in
> clients being disconnected from that ZooKeeper node. Since almost all the
> remainder of the cluster is reachable and appropriately utilized, clients
> typically reconnect to another node, and all is well. Of course, with this
> arrangement, there is a small chance that the entire cluster could blow out
> in usage, which would result in catastrophic ZooKeeper failure. This hasn't
> happened yet, but I'm sure it's only a matter of time. However, if this
> were
> to happen, the co-located services would also be at the tipping point, and
> would probably crash soon after. Since that system is mission-critical, we
> have lots of monitoring in place, as well as a couple of spare nodes to
> throw at the cluster in case the overall usage is high. So, in short, it
> isn't terribly surprising that we might occasionally get a connection loss
> from ZooKeeper, but in almost all cases, that connection loss can be
> resolved very quickly.
>
> What can NOT be resolved quickly is our leader failing. This is not
> catastrophic, but it does negatively impact performance. Therefore, I wish
> to avoid needlessly creating external connections. If we shut down the
> external connection upon every connection loss, we would pay a price in
> minutes what could easily be resolved in milliseconds. If, however, we
> wait,
> keeping a connection open, knowing that we are still probabilistically the
> leader, we can continue to receive events. Once the leadership has been
> passed on, however, we should shut down to avoid excess connections.
>
> What's puzzling to me here is the client expectation. On the one hand,
> there
> is a SessionExpired Event, which clearly indicates that a session has
> expired. However, the session only expires if ZooKeeper fails to hear from
> you within a specified time frame. This specified time is agreed upon, and
> is known by both ZK and the client. Therefore, one expectation would be
> that
> the SessionExpired event would be fired when that timeframe has been
> exhausted before a connection with ZK could be reestablished. On the other
> hand, Session Expired events only happen AFTER a reconnection to the
> cluster. This is strange to me, because it seems that the client has all
> the
> information it needs to know that it's session has been expired by
> ZooKeeper, and it doesn't really need to first talk to the Server to pass
> that information back to the application. It's made particularly strange to
> me when I think that, for most cases, a failure to talk to ZooKeeper within
> the session timeout probably means that you aren't going to be able to
> reconnect with it for a lot longer yet, and client is probably going to
> just
> keep failing. If you were talking about total machine failure, then who
> cares? The application is dead anyway. But if you're talking about a
> network
> partition, where the node could conceivably still do its job successfully,
> and is only prevented from doing so by a connection to ZooKeeper, as in my
> use case, you want that information in a reasonably timely manner.
>
> Judging from my questions, the relative number of questions about
> SessionExpiration versus disconnect that have been answered, and the fact
> that Twitter and Travis (and presumably others as well) have felt the need
> to implement this behavior outside of the system, it seems like I'm not the
> only one confused about this. So my question is: Why does the ZooKeeper
> client do notification only upon reconnect? Is there a
> performance/consistency reason that I'm not seeing?
>
> Thanks, and sorry for the uber-long message,
>
> Scott
>
> On Thu, Apr 21, 2011 at 4:47 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Scott,
> >
> > Having your master enter a suspended state is fine, but it cannot act as
> > master during this time (because somebody else may have become master
> > during
> > this time).
> >
> > It is fine to enter a suspended mode, but the suspended master cannot
> > commit
> > to any actions as a master.  Any transactions that it accepts must be
> > considered as expendable.  Usually that means that whoever sent the
> > transactions must retain them until the suspended master regains its
> senses
> > or relinquishes its master state.
> >
> > The other question that comes up from your description is how your ZK
> > cluster works.  Do you have zookeeper split across data centers?
> >
> > On Thu, Apr 21, 2011 at 1:45 PM, Scott Fines <sc...@gmail.com>
> wrote:
> >
> > > Ryan,
> > >
> > > That is a fair point in that I would have consistency of services--that
> > is,
> > > that I would be pretty sure I'd only have one service running at a
> time.
> > > However, my particular application demands are such that just stopping
> > and
> > > re-starting on disconnected events is not a good idea.
> > >
> > > What I'm writing is a connector between two data centers, where the
> > > measured
> > > latency is on the order of seconds, and each time a service connects,
> it
> > > must transfer (hopefully only a few) megabytes of data, which I've
> > measured
> > > to take on the order of minutes. On the other hand, it is not unusual
> for
> > > us
> > > to receive a disconnected event every now and then, which is generally
> > > resolved on the order of milliseconds. Clearly, I don't want to
> recreate
> > a
> > > minutes-long process every time we get a milliseconds-long
> disconnection
> > > which does not remove the service's existing leadership.
> > >
> > > So, when the leader receives a disconnected event, it queues up events
> to
> > > process, but holds on to its connections and continues to receive
> events
> > > while it waits for a connection to ZK to be re-established. If the
> > > connection to ZK comes back online within the session timeout window,
> > then
> > > it will just turn processing back on as if nothing happened. However,
> if
> > > the
> > > session timeout happens, then the client must cut all of its
> connections
> > > and
> > > kill itself with fire, rather than overwrite what the next leader does.
> > > Then
> > > the next leader has to go through the expensive process of starting the
> > > service back up.
> > >
> > > Hopefully that will give some color for why I'm concerned about this
> > > situation.
> > >
> > > Thanks,
> > >
> > > Scott
> > >
> > > On Thu, Apr 21, 2011 at 2:53 PM, Ryan Kennedy <rc...@gmail.com>
> > wrote:
> > >
> > > > Scott:
> > > >
> > > >  the right answer in this case is for the leader to watch for the
> > > > "disconnected" event and shut down. If the connection re-establishes,
> > > > the leader should still be the leader (their ephemeral sequential
> node
> > > > should still be there), in which case it can go back to work. If the
> > > > connection doesn't re-establish, one of two things may happen…
> > > >
> > > > 1) Your leader stays in the disconnected state (because it's unable
> to
> > > > reconnect), meanwhile the zookeeper server expires the session
> > > > (because it hasn't seen a heartbeat), deletes the ephemeral
> sequential
> > > > node and a new worker is promoted to leader.
> > > >
> > > > 2) Your leader quickly transitions to the expired state, the
> ephemeral
> > > > node is lost and a new worker is promoted to leader.
> > > >
> > > > In both cases, your initial leader should see a disconnected event
> > > > first. If it shuts down when it sees that event, you should be
> > > > relatively safe in thinking that you only have one worker going at a
> > > > time.
> > > >
> > > > Once your initial leader sees the expiration event, it can try to
> > > > reconnect to the ensemble, create the new ephemeral sequential node
> > > > and get back into the queue for being a leader.
> > > >
> > > > Ryan
> > > >
> > >
> >
>

Re: Unexpected behavior with Session Timeouts in Java Client

Posted by Ted Dunning <te...@gmail.com>.

I wouldn't want to change current behavior but rather to augment it and
comply with Ben's dictum
that ZK report what it knows as it knows it.  Thus, it doesn't know about
expiration until it reconnects,
but it really does know that the expiration period has passed since loss of
connection.  Thus, the name
should reflect what is known.

It does seem good to make current behavior the default.

On Fri, Apr 22, 2011 at 5:10 PM, Scott Fines <sc...@gmail.com> wrote:

> That is one option. It seems like it might complicate what is already a
> fairly subtle system of considerations, though.
>
> An alternative might be to have an option like "fireOnProbableExpiration"
> in
> the ZooKeeper instance.The default would then have to be set to false,
> which
> would preserve the current behavior, but setting it to true would provide
> an
> option for when we absolutely NEED the behavior.
>
> Scott
>
> On Fri, Apr 22, 2011 at 5:38 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Ben, Everybody,
> >
> > What would you think if there were additional events such as
> > "PossibleSessionExpiration", "EstimatedSessionExpiration" and
> > "ProbableSessionExpiration"?  This event would be delivered
> > by the client at a time based on the last successful heartbeat, an
> > intermediate point or the connection loss event respectively.
> >
> > Does this sound interesting?
> >
> > On Fri, Apr 22, 2011 at 3:06 PM, Dave Wright <wr...@gmail.com> wrote:
> >
> > > We ran into this exact scenario, and while it would have been nice to
> > > have the timer option implemented internally by ZK, we ended up
> > > implementing it externally ourself. We start a timer on the
> > > disconnected event, and when it gets "close" to the session timeout,
> > > we trigger the session lost behavior on the master.
> > > We may be without a master for a second or two, but that's OK in our
> > > case. As Ted mentioned, without a connection to ZK, there is no way to
> > > time it exactly anyway.
> > >
> > > The one advantage of having the session-lost timer running within
> > > zkclient instead of our app, is that it could track the timer from the
> > > last actual heartbeat, rather than the disconnected event. Depending
> > > on the network conditions that caused the disconnection, it may have
> > > been a while from when we actually lost connectivity to ZK to when the
> > > disconnection event triggers, so our own timer may not be super
> > > accurate. Having zkclient set a timer based on the last heartbeat, and
> > > triggering the session lost event when that timer expires would be
> > > more accurate.
> > >
> > > -Dave
> > >
> > >
> > > On Fri, Apr 22, 2011 at 10:03 AM, Ted Dunning <te...@gmail.com>
> > > wrote:
> > > > Well there are real limits about what knowledge you can have in a
> split
> > > > brain and how much coordination there can be.
> > > >
> > > > Having exactly one master in such situation is impossible.  You get
> to
> > > pick
> > > > your error scenario, however.  One option is to have one master
> almost
> > > all
> > > > the time with a failure mode of having zero acting masters a bit of
> the
> > > > time.  The other option is to have one master almost all the time
> with
> > a
> > > > failure mode that has two masters a bit of the time.  You get to pick
> > > which
> > > > one.
> > > >
> > > > As Ben stated, the philosophy of ZK is to report facts that can be
> > > > demonstrated.  Your application will work pretty well with a timer
> even
> > > > though that could result in momentary double master situations.  Of
> > > course,
> > > > it can also result in periods of zero master as well since a master
> cut
> > > off
> > > > from ZK may well be cut off from the clients who want to be served.
> > > >
> > > > So the API isn't making a promise it can't keep.  It is promising to
> > > report
> > > > to you as soon as it is certain of things.  And it does.
> > > >
> > > > On Fri, Apr 22, 2011 at 6:51 AM, Scott Fines <sc...@gmail.com>
> > > wrote:
> > > >
> > > >> I guess my objection would be that the API is making a promise that
> it
> > > can
> > > >> only deliver part of the time. If the client can't reconnect to
> > > ZooKeeper,
> > > >> then the client hasn't expired, which is an unusual state to find
> > > oneself
> > > >> in, and in leader-election systems like mine could result in having
> > two
> > > >> practical leaders, while ZooKeeper is insisting that there is only
> > one.
> > > >> This
> > > >> kind of split-brain scenario seems unavoidable in the absence of
> > > >> probabilistic failure checking (like timeouts).
> > > >>
> > > >> The FAQ, I've noticed, does make mention of this phenomenon. Perhaps
> > > >> something should be indicated there regarding the why and not just
> the
> > > >> mechanics. Otherwise, developers such as myself might find
> themselves
> > > >> unduly
> > > >> confused by it :)
> > > >>
> > > >> Thanks for all your help,
> > > >>
> > > >
> > >
> >
>

Re: Unexpected behavior with Session Timeouts in Java Client

Posted by Scott Fines <sc...@gmail.com>.

That is one option. It seems like it might complicate what is already a
fairly subtle system of considerations, though.

An alternative might be to have an option like "fireOnProbableExpiration" in
the ZooKeeper instance.The default would then have to be set to false, which
would preserve the current behavior, but setting it to true would provide an
option for when we absolutely NEED the behavior.

Scott

On Fri, Apr 22, 2011 at 5:38 PM, Ted Dunning <te...@gmail.com> wrote:

> Ben, Everybody,
>
> What would you think if there were additional events such as
> "PossibleSessionExpiration", "EstimatedSessionExpiration" and
> "ProbableSessionExpiration"?  This event would be delivered
> by the client at a time based on the last successful heartbeat, an
> intermediate point or the connection loss event respectively.
>
> Does this sound interesting?
>
> On Fri, Apr 22, 2011 at 3:06 PM, Dave Wright <wr...@gmail.com> wrote:
>
> > We ran into this exact scenario, and while it would have been nice to
> > have the timer option implemented internally by ZK, we ended up
> > implementing it externally ourself. We start a timer on the
> > disconnected event, and when it gets "close" to the session timeout,
> > we trigger the session lost behavior on the master.
> > We may be without a master for a second or two, but that's OK in our
> > case. As Ted mentioned, without a connection to ZK, there is no way to
> > time it exactly anyway.
> >
> > The one advantage of having the session-lost timer running within
> > zkclient instead of our app, is that it could track the timer from the
> > last actual heartbeat, rather than the disconnected event. Depending
> > on the network conditions that caused the disconnection, it may have
> > been a while from when we actually lost connectivity to ZK to when the
> > disconnection event triggers, so our own timer may not be super
> > accurate. Having zkclient set a timer based on the last heartbeat, and
> > triggering the session lost event when that timer expires would be
> > more accurate.
> >
> > -Dave
> >
> >
> > On Fri, Apr 22, 2011 at 10:03 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> > > Well there are real limits about what knowledge you can have in a split
> > > brain and how much coordination there can be.
> > >
> > > Having exactly one master in such situation is impossible.  You get to
> > pick
> > > your error scenario, however.  One option is to have one master almost
> > all
> > > the time with a failure mode of having zero acting masters a bit of the
> > > time.  The other option is to have one master almost all the time with
> a
> > > failure mode that has two masters a bit of the time.  You get to pick
> > which
> > > one.
> > >
> > > As Ben stated, the philosophy of ZK is to report facts that can be
> > > demonstrated.  Your application will work pretty well with a timer even
> > > though that could result in momentary double master situations.  Of
> > course,
> > > it can also result in periods of zero master as well since a master cut
> > off
> > > from ZK may well be cut off from the clients who want to be served.
> > >
> > > So the API isn't making a promise it can't keep.  It is promising to
> > report
> > > to you as soon as it is certain of things.  And it does.
> > >
> > > On Fri, Apr 22, 2011 at 6:51 AM, Scott Fines <sc...@gmail.com>
> > wrote:
> > >
> > >> I guess my objection would be that the API is making a promise that it
> > can
> > >> only deliver part of the time. If the client can't reconnect to
> > ZooKeeper,
> > >> then the client hasn't expired, which is an unusual state to find
> > oneself
> > >> in, and in leader-election systems like mine could result in having
> two
> > >> practical leaders, while ZooKeeper is insisting that there is only
> one.
> > >> This
> > >> kind of split-brain scenario seems unavoidable in the absence of
> > >> probabilistic failure checking (like timeouts).
> > >>
> > >> The FAQ, I've noticed, does make mention of this phenomenon. Perhaps
> > >> something should be indicated there regarding the why and not just the
> > >> mechanics. Otherwise, developers such as myself might find themselves
> > >> unduly
> > >> confused by it :)
> > >>
> > >> Thanks for all your help,
> > >>
> > >
> >
>

Re: Unexpected behavior with Session Timeouts in Java Client

Posted by Ted Dunning <te...@gmail.com>.

Ben, Everybody,

What would you think if there were additional events such as
"PossibleSessionExpiration", "EstimatedSessionExpiration" and
"ProbableSessionExpiration"?  This event would be delivered
by the client at a time based on the last successful heartbeat, an
intermediate point or the connection loss event respectively.

Does this sound interesting?

On Fri, Apr 22, 2011 at 3:06 PM, Dave Wright <wr...@gmail.com> wrote:

> We ran into this exact scenario, and while it would have been nice to
> have the timer option implemented internally by ZK, we ended up
> implementing it externally ourself. We start a timer on the
> disconnected event, and when it gets "close" to the session timeout,
> we trigger the session lost behavior on the master.
> We may be without a master for a second or two, but that's OK in our
> case. As Ted mentioned, without a connection to ZK, there is no way to
> time it exactly anyway.
>
> The one advantage of having the session-lost timer running within
> zkclient instead of our app, is that it could track the timer from the
> last actual heartbeat, rather than the disconnected event. Depending
> on the network conditions that caused the disconnection, it may have
> been a while from when we actually lost connectivity to ZK to when the
> disconnection event triggers, so our own timer may not be super
> accurate. Having zkclient set a timer based on the last heartbeat, and
> triggering the session lost event when that timer expires would be
> more accurate.
>
> -Dave
>
>
> On Fri, Apr 22, 2011 at 10:03 AM, Ted Dunning <te...@gmail.com>
> wrote:
> > Well there are real limits about what knowledge you can have in a split
> > brain and how much coordination there can be.
> >
> > Having exactly one master in such situation is impossible.  You get to
> pick
> > your error scenario, however.  One option is to have one master almost
> all
> > the time with a failure mode of having zero acting masters a bit of the
> > time.  The other option is to have one master almost all the time with a
> > failure mode that has two masters a bit of the time.  You get to pick
> which
> > one.
> >
> > As Ben stated, the philosophy of ZK is to report facts that can be
> > demonstrated.  Your application will work pretty well with a timer even
> > though that could result in momentary double master situations.  Of
> course,
> > it can also result in periods of zero master as well since a master cut
> off
> > from ZK may well be cut off from the clients who want to be served.
> >
> > So the API isn't making a promise it can't keep.  It is promising to
> report
> > to you as soon as it is certain of things.  And it does.
> >
> > On Fri, Apr 22, 2011 at 6:51 AM, Scott Fines <sc...@gmail.com>
> wrote:
> >
> >> I guess my objection would be that the API is making a promise that it
> can
> >> only deliver part of the time. If the client can't reconnect to
> ZooKeeper,
> >> then the client hasn't expired, which is an unusual state to find
> oneself
> >> in, and in leader-election systems like mine could result in having two
> >> practical leaders, while ZooKeeper is insisting that there is only one.
> >> This
> >> kind of split-brain scenario seems unavoidable in the absence of
> >> probabilistic failure checking (like timeouts).
> >>
> >> The FAQ, I've noticed, does make mention of this phenomenon. Perhaps
> >> something should be indicated there regarding the why and not just the
> >> mechanics. Otherwise, developers such as myself might find themselves
> >> unduly
> >> confused by it :)
> >>
> >> Thanks for all your help,
> >>
> >
>

Re: Unexpected behavior with Session Timeouts in Java Client

Posted by Dave Wright <wr...@gmail.com>.

We ran into this exact scenario, and while it would have been nice to
have the timer option implemented internally by ZK, we ended up
implementing it externally ourself. We start a timer on the
disconnected event, and when it gets "close" to the session timeout,
we trigger the session lost behavior on the master.
We may be without a master for a second or two, but that's OK in our
case. As Ted mentioned, without a connection to ZK, there is no way to
time it exactly anyway.

The one advantage of having the session-lost timer running within
zkclient instead of our app, is that it could track the timer from the
last actual heartbeat, rather than the disconnected event. Depending
on the network conditions that caused the disconnection, it may have
been a while from when we actually lost connectivity to ZK to when the
disconnection event triggers, so our own timer may not be super
accurate. Having zkclient set a timer based on the last heartbeat, and
triggering the session lost event when that timer expires would be
more accurate.

-Dave

On Fri, Apr 22, 2011 at 10:03 AM, Ted Dunning <te...@gmail.com> wrote:
> Well there are real limits about what knowledge you can have in a split
> brain and how much coordination there can be.
>
> Having exactly one master in such situation is impossible.  You get to pick
> your error scenario, however.  One option is to have one master almost all
> the time with a failure mode of having zero acting masters a bit of the
> time.  The other option is to have one master almost all the time with a
> failure mode that has two masters a bit of the time.  You get to pick which
> one.
>
> As Ben stated, the philosophy of ZK is to report facts that can be
> demonstrated.  Your application will work pretty well with a timer even
> though that could result in momentary double master situations.  Of course,
> it can also result in periods of zero master as well since a master cut off
> from ZK may well be cut off from the clients who want to be served.
>
> So the API isn't making a promise it can't keep.  It is promising to report
> to you as soon as it is certain of things.  And it does.
>
> On Fri, Apr 22, 2011 at 6:51 AM, Scott Fines <sc...@gmail.com> wrote:
>
>> I guess my objection would be that the API is making a promise that it can
>> only deliver part of the time. If the client can't reconnect to ZooKeeper,
>> then the client hasn't expired, which is an unusual state to find oneself
>> in, and in leader-election systems like mine could result in having two
>> practical leaders, while ZooKeeper is insisting that there is only one.
>> This
>> kind of split-brain scenario seems unavoidable in the absence of
>> probabilistic failure checking (like timeouts).
>>
>> The FAQ, I've noticed, does make mention of this phenomenon. Perhaps
>> something should be indicated there regarding the why and not just the
>> mechanics. Otherwise, developers such as myself might find themselves
>> unduly
>> confused by it :)
>>
>> Thanks for all your help,
>>
>

Re: Unexpected behavior with Session Timeouts in Java Client

Posted by Ted Dunning <te...@gmail.com>.

Well there are real limits about what knowledge you can have in a split
brain and how much coordination there can be.

Having exactly one master in such situation is impossible.  You get to pick
your error scenario, however.  One option is to have one master almost all
the time with a failure mode of having zero acting masters a bit of the
time.  The other option is to have one master almost all the time with a
failure mode that has two masters a bit of the time.  You get to pick which
one.

As Ben stated, the philosophy of ZK is to report facts that can be
demonstrated.  Your application will work pretty well with a timer even
though that could result in momentary double master situations.  Of course,
it can also result in periods of zero master as well since a master cut off
from ZK may well be cut off from the clients who want to be served.

So the API isn't making a promise it can't keep.  It is promising to report
to you as soon as it is certain of things.  And it does.

On Fri, Apr 22, 2011 at 6:51 AM, Scott Fines <sc...@gmail.com> wrote:

> I guess my objection would be that the API is making a promise that it can
> only deliver part of the time. If the client can't reconnect to ZooKeeper,
> then the client hasn't expired, which is an unusual state to find oneself
> in, and in leader-election systems like mine could result in having two
> practical leaders, while ZooKeeper is insisting that there is only one.
> This
> kind of split-brain scenario seems unavoidable in the absence of
> probabilistic failure checking (like timeouts).
>
> The FAQ, I've noticed, does make mention of this phenomenon. Perhaps
> something should be indicated there regarding the why and not just the
> mechanics. Otherwise, developers such as myself might find themselves
> unduly
> confused by it :)
>
> Thanks for all your help,
>

Re: Unexpected behavior with Session Timeouts in Java Client

Posted by Scott Fines <sc...@gmail.com>.

Okay, I can buy that it's not in keeping with the ZooKeeper design
philosophy to use timeouts in the way that I am describing. I'm guessing
this is so that it avoids situations where clients preemptively time
themselves out and leave sessions hanging?

I guess my objection would be that the API is making a promise that it can
only deliver part of the time. If the client can't reconnect to ZooKeeper,
then the client hasn't expired, which is an unusual state to find oneself
in, and in leader-election systems like mine could result in having two
practical leaders, while ZooKeeper is insisting that there is only one. This
kind of split-brain scenario seems unavoidable in the absence of
probabilistic failure checking (like timeouts).

The FAQ, I've noticed, does make mention of this phenomenon. Perhaps
something should be indicated there regarding the why and not just the
mechanics. Otherwise, developers such as myself might find themselves unduly
confused by it :)

Thanks for all your help,

Scott

On Thu, Apr 21, 2011 at 11:51 PM, Ted Dunning <te...@gmail.com> wrote:

> I like philosophical design points.  This is a good one.
>
> On Thu, Apr 21, 2011 at 5:46 PM, Benjamin Reed <br...@apache.org> wrote:
>
> > i think the perspective to have is that zookeeper tries to deal with
> > facts, and when it doesn't have the facts, it tells you so.
> >
>

Re: Unexpected behavior with Session Timeouts in Java Client

Posted by Ted Dunning <te...@gmail.com>.

I like philosophical design points.  This is a good one.

On Thu, Apr 21, 2011 at 5:46 PM, Benjamin Reed <br...@apache.org> wrote:

> i think the perspective to have is that zookeeper tries to deal with
> facts, and when it doesn't have the facts, it tells you so.
>

Re: Unexpected behavior with Session Timeouts in Java Client

Posted by Benjamin Reed <br...@apache.org>.

i think the perspective to have is that zookeeper tries to deal with
facts, and when it doesn't have the facts, it tells you so. when a
client loses a connection to zookeeper it does a callback to let your
application know that it doesn't know the state of the system anymore.
when it reconnects, it tells you that it now knows the system state
and informs of changes.

we have to use timeouts for failure detection, but we don't use
timeouts for figuring out facts. that is why we wait to reconnect for
the session timeout. if you want to use timeouts for session
expiration, you can do it yourself by starting a timer on the
disconnect event and then do an explicit close() when the timer fires.

one thing to keep in mind this delayed session expiration event is
only relevant to the disconnected client. the session itself will get
killed and a new leader elected in the meantime.

for your situation, i wouldn't kill the external connection until the
session expired event, but i would stop consuming data from the
connection while disconnected. i imagine you have some sort of
acknowledgement or consume mechanism for flow control and packet loss
that you can use.

ben

On Thu, Apr 21, 2011 at 3:41 PM, Scott Fines <sc...@gmail.com> wrote:
> Perhaps I am not being clear in my description.
>
> I'm building a system, that receives data events from an external source.
> This external source is not necessarily under my control, and starting up a
> connection to that source is highly expensive, as it entails a high-latency,
> low-bandwith transfer of data. It's more stable than packet radio, but not a
> whole lot faster. This system retains the ability to recreate events, and
> can do so upon request, but the cost to recreate is extremely high.
>
> On the other end, the system pushes these events on to distributed
> processing and (eventually) a long-term storage situation, like Cassandra or
> Hadoop. Each event that is received can be idempotently applied, so the
> system can safely process duplicate messages if they come in.
>
> If the world were perfect and we had Infiniband connections to all of our
> external sources, then there would be no reason for a leader-election
> protocol in this scenario. I would just boot the system on every node, and
> have them do their thing, and why worry? Idempotency is a beautiful thing.
> Sadly, the world is not perfect, and trying to deal with an already slow
> external connection by asking it to send the same data 10 or 15 times is not
> a great idea, performance-wise. In addition to slowing everything down on
> the receiving end, it also has an adverse affect on the source's
> performance; the source, it must be noted, has other things to do besides
> just feeding my system data.
>
> So my solution is to limit the number of external connection to 1, and use
> ZooKeeper leader-elections to manage which machine is running at which time.
> This way, we keep the number of external connections down as low as we go,
> we can guarantee that messages are received and processed idempotently, and
> in the normal situation where there is no trouble at all, life is fine.
>
> What I am trying to deal with right now is how to manage the corner cases of
> when communication with ZooKeeper breaks down.
>
> To answer your question about the ZooKeeper cluster installation: no, it is
> not located in multiple data centers. It is, however, co-located with other
> processes. For about 90-95% (we have an actual measurement, but I can't
> remember it off the top of my head) of the time, the resource utilization is
> low enough and ZooKeeper is lightweight enough that it makes sense to
> co-locate. Occasionally, however, we do see a spike in an individual
> machine's utilization. Even more occasionally, that spike can result in
> clients being disconnected from that ZooKeeper node. Since almost all the
> remainder of the cluster is reachable and appropriately utilized, clients
> typically reconnect to another node, and all is well. Of course, with this
> arrangement, there is a small chance that the entire cluster could blow out
> in usage, which would result in catastrophic ZooKeeper failure. This hasn't
> happened yet, but I'm sure it's only a matter of time. However, if this were
> to happen, the co-located services would also be at the tipping point, and
> would probably crash soon after. Since that system is mission-critical, we
> have lots of monitoring in place, as well as a couple of spare nodes to
> throw at the cluster in case the overall usage is high. So, in short, it
> isn't terribly surprising that we might occasionally get a connection loss
> from ZooKeeper, but in almost all cases, that connection loss can be
> resolved very quickly.
>
> What can NOT be resolved quickly is our leader failing. This is not
> catastrophic, but it does negatively impact performance. Therefore, I wish
> to avoid needlessly creating external connections. If we shut down the
> external connection upon every connection loss, we would pay a price in
> minutes what could easily be resolved in milliseconds. If, however, we wait,
> keeping a connection open, knowing that we are still probabilistically the
> leader, we can continue to receive events. Once the leadership has been
> passed on, however, we should shut down to avoid excess connections.
>
> What's puzzling to me here is the client expectation. On the one hand, there
> is a SessionExpired Event, which clearly indicates that a session has
> expired. However, the session only expires if ZooKeeper fails to hear from
> you within a specified time frame. This specified time is agreed upon, and
> is known by both ZK and the client. Therefore, one expectation would be that
> the SessionExpired event would be fired when that timeframe has been
> exhausted before a connection with ZK could be reestablished. On the other
> hand, Session Expired events only happen AFTER a reconnection to the
> cluster. This is strange to me, because it seems that the client has all the
> information it needs to know that it's session has been expired by
> ZooKeeper, and it doesn't really need to first talk to the Server to pass
> that information back to the application. It's made particularly strange to
> me when I think that, for most cases, a failure to talk to ZooKeeper within
> the session timeout probably means that you aren't going to be able to
> reconnect with it for a lot longer yet, and client is probably going to just
> keep failing. If you were talking about total machine failure, then who
> cares? The application is dead anyway. But if you're talking about a network
> partition, where the node could conceivably still do its job successfully,
> and is only prevented from doing so by a connection to ZooKeeper, as in my
> use case, you want that information in a reasonably timely manner.
>
> Judging from my questions, the relative number of questions about
> SessionExpiration versus disconnect that have been answered, and the fact
> that Twitter and Travis (and presumably others as well) have felt the need
> to implement this behavior outside of the system, it seems like I'm not the
> only one confused about this. So my question is: Why does the ZooKeeper
> client do notification only upon reconnect? Is there a
> performance/consistency reason that I'm not seeing?
>
> Thanks, and sorry for the uber-long message,
>
> Scott
>
> On Thu, Apr 21, 2011 at 4:47 PM, Ted Dunning <te...@gmail.com> wrote:
>
>> Scott,
>>
>> Having your master enter a suspended state is fine, but it cannot act as
>> master during this time (because somebody else may have become master
>> during
>> this time).
>>
>> It is fine to enter a suspended mode, but the suspended master cannot
>> commit
>> to any actions as a master.  Any transactions that it accepts must be
>> considered as expendable.  Usually that means that whoever sent the
>> transactions must retain them until the suspended master regains its senses
>> or relinquishes its master state.
>>
>> The other question that comes up from your description is how your ZK
>> cluster works.  Do you have zookeeper split across data centers?
>>
>> On Thu, Apr 21, 2011 at 1:45 PM, Scott Fines <sc...@gmail.com> wrote:
>>
>> > Ryan,
>> >
>> > That is a fair point in that I would have consistency of services--that
>> is,
>> > that I would be pretty sure I'd only have one service running at a time.
>> > However, my particular application demands are such that just stopping
>> and
>> > re-starting on disconnected events is not a good idea.
>> >
>> > What I'm writing is a connector between two data centers, where the
>> > measured
>> > latency is on the order of seconds, and each time a service connects, it
>> > must transfer (hopefully only a few) megabytes of data, which I've
>> measured
>> > to take on the order of minutes. On the other hand, it is not unusual for
>> > us
>> > to receive a disconnected event every now and then, which is generally
>> > resolved on the order of milliseconds. Clearly, I don't want to recreate
>> a
>> > minutes-long process every time we get a milliseconds-long disconnection
>> > which does not remove the service's existing leadership.
>> >
>> > So, when the leader receives a disconnected event, it queues up events to
>> > process, but holds on to its connections and continues to receive events
>> > while it waits for a connection to ZK to be re-established. If the
>> > connection to ZK comes back online within the session timeout window,
>> then
>> > it will just turn processing back on as if nothing happened. However, if
>> > the
>> > session timeout happens, then the client must cut all of its connections
>> > and
>> > kill itself with fire, rather than overwrite what the next leader does.
>> > Then
>> > the next leader has to go through the expensive process of starting the
>> > service back up.
>> >
>> > Hopefully that will give some color for why I'm concerned about this
>> > situation.
>> >
>> > Thanks,
>> >
>> > Scott
>> >
>> > On Thu, Apr 21, 2011 at 2:53 PM, Ryan Kennedy <rc...@gmail.com>
>> wrote:
>> >
>> > > Scott:
>> > >
>> > >  the right answer in this case is for the leader to watch for the
>> > > "disconnected" event and shut down. If the connection re-establishes,
>> > > the leader should still be the leader (their ephemeral sequential node
>> > > should still be there), in which case it can go back to work. If the
>> > > connection doesn't re-establish, one of two things may happen…
>> > >
>> > > 1) Your leader stays in the disconnected state (because it's unable to
>> > > reconnect), meanwhile the zookeeper server expires the session
>> > > (because it hasn't seen a heartbeat), deletes the ephemeral sequential
>> > > node and a new worker is promoted to leader.
>> > >
>> > > 2) Your leader quickly transitions to the expired state, the ephemeral
>> > > node is lost and a new worker is promoted to leader.
>> > >
>> > > In both cases, your initial leader should see a disconnected event
>> > > first. If it shuts down when it sees that event, you should be
>> > > relatively safe in thinking that you only have one worker going at a
>> > > time.
>> > >
>> > > Once your initial leader sees the expiration event, it can try to
>> > > reconnect to the ensemble, create the new ephemeral sequential node
>> > > and get back into the queue for being a leader.
>> > >
>> > > Ryan
>> > >
>> >
>>
>

Re: Unexpected behavior with Session Timeouts in Java Client

Posted by Scott Fines <sc...@gmail.com>.

Perhaps I am not being clear in my description.

I'm building a system, that receives data events from an external source.
This external source is not necessarily under my control, and starting up a
connection to that source is highly expensive, as it entails a high-latency,
low-bandwith transfer of data. It's more stable than packet radio, but not a
whole lot faster. This system retains the ability to recreate events, and
can do so upon request, but the cost to recreate is extremely high.

On the other end, the system pushes these events on to distributed
processing and (eventually) a long-term storage situation, like Cassandra or
Hadoop. Each event that is received can be idempotently applied, so the
system can safely process duplicate messages if they come in.

If the world were perfect and we had Infiniband connections to all of our
external sources, then there would be no reason for a leader-election
protocol in this scenario. I would just boot the system on every node, and
have them do their thing, and why worry? Idempotency is a beautiful thing.
Sadly, the world is not perfect, and trying to deal with an already slow
external connection by asking it to send the same data 10 or 15 times is not
a great idea, performance-wise. In addition to slowing everything down on
the receiving end, it also has an adverse affect on the source's
performance; the source, it must be noted, has other things to do besides
just feeding my system data.

So my solution is to limit the number of external connection to 1, and use
ZooKeeper leader-elections to manage which machine is running at which time.
This way, we keep the number of external connections down as low as we go,
we can guarantee that messages are received and processed idempotently, and
in the normal situation where there is no trouble at all, life is fine.

What I am trying to deal with right now is how to manage the corner cases of
when communication with ZooKeeper breaks down.

To answer your question about the ZooKeeper cluster installation: no, it is
not located in multiple data centers. It is, however, co-located with other
processes. For about 90-95% (we have an actual measurement, but I can't
remember it off the top of my head) of the time, the resource utilization is
low enough and ZooKeeper is lightweight enough that it makes sense to
co-locate. Occasionally, however, we do see a spike in an individual
machine's utilization. Even more occasionally, that spike can result in
clients being disconnected from that ZooKeeper node. Since almost all the
remainder of the cluster is reachable and appropriately utilized, clients
typically reconnect to another node, and all is well. Of course, with this
arrangement, there is a small chance that the entire cluster could blow out
in usage, which would result in catastrophic ZooKeeper failure. This hasn't
happened yet, but I'm sure it's only a matter of time. However, if this were
to happen, the co-located services would also be at the tipping point, and
would probably crash soon after. Since that system is mission-critical, we
have lots of monitoring in place, as well as a couple of spare nodes to
throw at the cluster in case the overall usage is high. So, in short, it
isn't terribly surprising that we might occasionally get a connection loss
from ZooKeeper, but in almost all cases, that connection loss can be
resolved very quickly.

What can NOT be resolved quickly is our leader failing. This is not
catastrophic, but it does negatively impact performance. Therefore, I wish
to avoid needlessly creating external connections. If we shut down the
external connection upon every connection loss, we would pay a price in
minutes what could easily be resolved in milliseconds. If, however, we wait,
keeping a connection open, knowing that we are still probabilistically the
leader, we can continue to receive events. Once the leadership has been
passed on, however, we should shut down to avoid excess connections.

What's puzzling to me here is the client expectation. On the one hand, there
is a SessionExpired Event, which clearly indicates that a session has
expired. However, the session only expires if ZooKeeper fails to hear from
you within a specified time frame. This specified time is agreed upon, and
is known by both ZK and the client. Therefore, one expectation would be that
the SessionExpired event would be fired when that timeframe has been
exhausted before a connection with ZK could be reestablished. On the other
hand, Session Expired events only happen AFTER a reconnection to the
cluster. This is strange to me, because it seems that the client has all the
information it needs to know that it's session has been expired by
ZooKeeper, and it doesn't really need to first talk to the Server to pass
that information back to the application. It's made particularly strange to
me when I think that, for most cases, a failure to talk to ZooKeeper within
the session timeout probably means that you aren't going to be able to
reconnect with it for a lot longer yet, and client is probably going to just
keep failing. If you were talking about total machine failure, then who
cares? The application is dead anyway. But if you're talking about a network
partition, where the node could conceivably still do its job successfully,
and is only prevented from doing so by a connection to ZooKeeper, as in my
use case, you want that information in a reasonably timely manner.

Judging from my questions, the relative number of questions about
SessionExpiration versus disconnect that have been answered, and the fact
that Twitter and Travis (and presumably others as well) have felt the need
to implement this behavior outside of the system, it seems like I'm not the
only one confused about this. So my question is: Why does the ZooKeeper
client do notification only upon reconnect? Is there a
performance/consistency reason that I'm not seeing?

Thanks, and sorry for the uber-long message,

Scott

On Thu, Apr 21, 2011 at 4:47 PM, Ted Dunning <te...@gmail.com> wrote:

> Scott,
>
> Having your master enter a suspended state is fine, but it cannot act as
> master during this time (because somebody else may have become master
> during
> this time).
>
> It is fine to enter a suspended mode, but the suspended master cannot
> commit
> to any actions as a master.  Any transactions that it accepts must be
> considered as expendable.  Usually that means that whoever sent the
> transactions must retain them until the suspended master regains its senses
> or relinquishes its master state.
>
> The other question that comes up from your description is how your ZK
> cluster works.  Do you have zookeeper split across data centers?
>
> On Thu, Apr 21, 2011 at 1:45 PM, Scott Fines <sc...@gmail.com> wrote:
>
> > Ryan,
> >
> > That is a fair point in that I would have consistency of services--that
> is,
> > that I would be pretty sure I'd only have one service running at a time.
> > However, my particular application demands are such that just stopping
> and
> > re-starting on disconnected events is not a good idea.
> >
> > What I'm writing is a connector between two data centers, where the
> > measured
> > latency is on the order of seconds, and each time a service connects, it
> > must transfer (hopefully only a few) megabytes of data, which I've
> measured
> > to take on the order of minutes. On the other hand, it is not unusual for
> > us
> > to receive a disconnected event every now and then, which is generally
> > resolved on the order of milliseconds. Clearly, I don't want to recreate
> a
> > minutes-long process every time we get a milliseconds-long disconnection
> > which does not remove the service's existing leadership.
> >
> > So, when the leader receives a disconnected event, it queues up events to
> > process, but holds on to its connections and continues to receive events
> > while it waits for a connection to ZK to be re-established. If the
> > connection to ZK comes back online within the session timeout window,
> then
> > it will just turn processing back on as if nothing happened. However, if
> > the
> > session timeout happens, then the client must cut all of its connections
> > and
> > kill itself with fire, rather than overwrite what the next leader does.
> > Then
> > the next leader has to go through the expensive process of starting the
> > service back up.
> >
> > Hopefully that will give some color for why I'm concerned about this
> > situation.
> >
> > Thanks,
> >
> > Scott
> >
> > On Thu, Apr 21, 2011 at 2:53 PM, Ryan Kennedy <rc...@gmail.com>
> wrote:
> >
> > > Scott:
> > >
> > >  the right answer in this case is for the leader to watch for the
> > > "disconnected" event and shut down. If the connection re-establishes,
> > > the leader should still be the leader (their ephemeral sequential node
> > > should still be there), in which case it can go back to work. If the
> > > connection doesn't re-establish, one of two things may happen…
> > >
> > > 1) Your leader stays in the disconnected state (because it's unable to
> > > reconnect), meanwhile the zookeeper server expires the session
> > > (because it hasn't seen a heartbeat), deletes the ephemeral sequential
> > > node and a new worker is promoted to leader.
> > >
> > > 2) Your leader quickly transitions to the expired state, the ephemeral
> > > node is lost and a new worker is promoted to leader.
> > >
> > > In both cases, your initial leader should see a disconnected event
> > > first. If it shuts down when it sees that event, you should be
> > > relatively safe in thinking that you only have one worker going at a
> > > time.
> > >
> > > Once your initial leader sees the expiration event, it can try to
> > > reconnect to the ensemble, create the new ephemeral sequential node
> > > and get back into the queue for being a leader.
> > >
> > > Ryan
> > >
> >
>

Re: Unexpected behavior with Session Timeouts in Java Client

Posted by Ted Dunning <te...@gmail.com>.

Scott,

Having your master enter a suspended state is fine, but it cannot act as
master during this time (because somebody else may have become master during
this time).

It is fine to enter a suspended mode, but the suspended master cannot commit
to any actions as a master.  Any transactions that it accepts must be
considered as expendable.  Usually that means that whoever sent the
transactions must retain them until the suspended master regains its senses
or relinquishes its master state.

The other question that comes up from your description is how your ZK
cluster works.  Do you have zookeeper split across data centers?

On Thu, Apr 21, 2011 at 1:45 PM, Scott Fines <sc...@gmail.com> wrote:

> Ryan,
>
> That is a fair point in that I would have consistency of services--that is,
> that I would be pretty sure I'd only have one service running at a time.
> However, my particular application demands are such that just stopping and
> re-starting on disconnected events is not a good idea.
>
> What I'm writing is a connector between two data centers, where the
> measured
> latency is on the order of seconds, and each time a service connects, it
> must transfer (hopefully only a few) megabytes of data, which I've measured
> to take on the order of minutes. On the other hand, it is not unusual for
> us
> to receive a disconnected event every now and then, which is generally
> resolved on the order of milliseconds. Clearly, I don't want to recreate a
> minutes-long process every time we get a milliseconds-long disconnection
> which does not remove the service's existing leadership.
>
> So, when the leader receives a disconnected event, it queues up events to
> process, but holds on to its connections and continues to receive events
> while it waits for a connection to ZK to be re-established. If the
> connection to ZK comes back online within the session timeout window, then
> it will just turn processing back on as if nothing happened. However, if
> the
> session timeout happens, then the client must cut all of its connections
> and
> kill itself with fire, rather than overwrite what the next leader does.
> Then
> the next leader has to go through the expensive process of starting the
> service back up.
>
> Hopefully that will give some color for why I'm concerned about this
> situation.
>
> Thanks,
>
> Scott
>
> On Thu, Apr 21, 2011 at 2:53 PM, Ryan Kennedy <rc...@gmail.com> wrote:
>
> > Scott:
> >
> >  the right answer in this case is for the leader to watch for the
> > "disconnected" event and shut down. If the connection re-establishes,
> > the leader should still be the leader (their ephemeral sequential node
> > should still be there), in which case it can go back to work. If the
> > connection doesn't re-establish, one of two things may happen…
> >
> > 1) Your leader stays in the disconnected state (because it's unable to
> > reconnect), meanwhile the zookeeper server expires the session
> > (because it hasn't seen a heartbeat), deletes the ephemeral sequential
> > node and a new worker is promoted to leader.
> >
> > 2) Your leader quickly transitions to the expired state, the ephemeral
> > node is lost and a new worker is promoted to leader.
> >
> > In both cases, your initial leader should see a disconnected event
> > first. If it shuts down when it sees that event, you should be
> > relatively safe in thinking that you only have one worker going at a
> > time.
> >
> > Once your initial leader sees the expiration event, it can try to
> > reconnect to the ensemble, create the new ephemeral sequential node
> > and get back into the queue for being a leader.
> >
> > Ryan
> >
>

Re: Unexpected behavior with Session Timeouts in Java Client

Posted by Scott Fines <sc...@gmail.com>.

Ryan,

That is a fair point in that I would have consistency of services--that is,
that I would be pretty sure I'd only have one service running at a time.
However, my particular application demands are such that just stopping and
re-starting on disconnected events is not a good idea.

What I'm writing is a connector between two data centers, where the measured
latency is on the order of seconds, and each time a service connects, it
must transfer (hopefully only a few) megabytes of data, which I've measured
to take on the order of minutes. On the other hand, it is not unusual for us
to receive a disconnected event every now and then, which is generally
resolved on the order of milliseconds. Clearly, I don't want to recreate a
minutes-long process every time we get a milliseconds-long disconnection
which does not remove the service's existing leadership.

So, when the leader receives a disconnected event, it queues up events to
process, but holds on to its connections and continues to receive events
while it waits for a connection to ZK to be re-established. If the
connection to ZK comes back online within the session timeout window, then
it will just turn processing back on as if nothing happened. However, if the
session timeout happens, then the client must cut all of its connections and
kill itself with fire, rather than overwrite what the next leader does. Then
the next leader has to go through the expensive process of starting the
service back up.

Hopefully that will give some color for why I'm concerned about this
situation.

Thanks,

Scott

On Thu, Apr 21, 2011 at 2:53 PM, Ryan Kennedy <rc...@gmail.com> wrote:

> Scott:
>
>  the right answer in this case is for the leader to watch for the
> "disconnected" event and shut down. If the connection re-establishes,
> the leader should still be the leader (their ephemeral sequential node
> should still be there), in which case it can go back to work. If the
> connection doesn't re-establish, one of two things may happen…
>
> 1) Your leader stays in the disconnected state (because it's unable to
> reconnect), meanwhile the zookeeper server expires the session
> (because it hasn't seen a heartbeat), deletes the ephemeral sequential
> node and a new worker is promoted to leader.
>
> 2) Your leader quickly transitions to the expired state, the ephemeral
> node is lost and a new worker is promoted to leader.
>
> In both cases, your initial leader should see a disconnected event
> first. If it shuts down when it sees that event, you should be
> relatively safe in thinking that you only have one worker going at a
> time.
>
> Once your initial leader sees the expiration event, it can try to
> reconnect to the ensemble, create the new ephemeral sequential node
> and get back into the queue for being a leader.
>
> Ryan
>

Re: Unexpected behavior with Session Timeouts in Java Client

Posted by Ryan Kennedy <rc...@gmail.com>.

Scott:

 the right answer in this case is for the leader to watch for the
"disconnected" event and shut down. If the connection re-establishes,
the leader should still be the leader (their ephemeral sequential node
should still be there), in which case it can go back to work. If the
connection doesn't re-establish, one of two things may happen…

1) Your leader stays in the disconnected state (because it's unable to
reconnect), meanwhile the zookeeper server expires the session
(because it hasn't seen a heartbeat), deletes the ephemeral sequential
node and a new worker is promoted to leader.

2) Your leader quickly transitions to the expired state, the ephemeral
node is lost and a new worker is promoted to leader.

In both cases, your initial leader should see a disconnected event
first. If it shuts down when it sees that event, you should be
relatively safe in thinking that you only have one worker going at a
time.

Once your initial leader sees the expiration event, it can try to
reconnect to the ensemble, create the new ephemeral sequential node
and get back into the queue for being a leader.

Ryan