You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@zookeeper.apache.org by Martin Serrano <ma...@attivio.com> on 2011/09/13 13:25:49 UTC

disconnects and auto renewal

Hi,

We have added code to our application to reconnect and re-establish watches when we receive a Disconnected event.  I am running tests on a heavily loaded system where the zookeeper server and clients are all impacted.  On this test system we regularly experience session timeouts and appropriately react to reconnect and set up our watches.  There is an uncommon case that I am having trouble puzzling out.  When running one of our tests in a loop about 1% of the time we hit a case where on the client side we think the session has expired but on the server side it has been renewed.   We will then fail to be able to create an ephemeral node because it already exists and does not ever get cleaned up (since the previous session is still valid).  I'm trying to figure out if we are misusing the API or if we have encountered a bug.   I'm happy to provide more details.  One thing I am wondering is if it is inappropriate to create a new session within the event thread of another session which has received the disconnected event.

Thanks,
Martin Serrano

  The logs are filtered a little just to keep this email a reasonable size, but everything relevant to the session in question is here.

Associated client application logs (the reconnecting message is ours where we re-establish watches, etc using a new session):

2011-09-12 18:49:55,729 INFO [ZooKeeperClientCnxn--EventThread-1325fd35663000d] - Session level event, session 1325fd35663000d: WatchedEvent state:Disconnected type:None path:null
2011-09-12 18:49:55,729 INFO [ZooKeeperClientCnxn--EventThread-1325fd35663000d] - Reconnecting, current session 1325fd35663000d
2011-09-12 18:49:55,729 INFO [ZooKeeperClientCnxn--EventThread-1325fd35663000d] - Connecting to zookeeper on localhost:16015
2011-09-12 18:50:08,823 INFO [main_master_10.2.1.100_16040] - obj[76a9b9c] Reestablishing agent watches
2011-09-12 18:50:09,120 WARN [ZooKeeperClientCnxn--EventThread-1325fd35663000d] - obj[1e5b04ae] Failed to write status node multinode-monitor/master.  Saw previous value null, retrying
  org.apache.zookeeper.KeeperException$NodeExistsException - KeeperErrorCode = NodeExists for multinode-monitor/master

Associated client zookeeper logs:

2011-09-12 18:48:15,995 INFO  ClientCnxn [ZooKeeperClientCnxn--SendThread-1325fd35663000d] - Session establishment complete on server localhost/127.0.0.1:16015, sessionid = 0x1325fd35663000d, negotiated timeout = 40000
2011-09-12 18:49:31,245 INFO  ClientCnxn [ZooKeeperClientCnxn--SendThread-1325fd35663000d] - Client session timed out, have not heard from server in 26672ms for sessionid 0x1325fd35663000d, closing socket connection and attempting reconnect
2011-09-12 18:49:55,729 INFO  ZooKeeper [ZooKeeperClientCnxn--EventThread-1325fd35663000d] - Initiating client connection, connectString=localhost:16015 sessionTimeout=600000 watcher=com.attivio.global.AieZooKeeper@4f536eec<ma...@4f536eec>
2011-09-12 18:50:08,057 INFO  ClientCnxn [ZooKeeperClientCnxn--SendThread-1325fd35663000d] - Session establishment complete on server localhost/127.0.0.1:16015, sessionid = 0x1325fd35663000d, negotiated timeout = 40000

Associated server zookeeper logs:

2011-09-12 18:48:16,010 INFO  NIOServerCnxn [SyncThread:0] - Established session 0x1325fd35663000d with negotiated timeout 40000 for client /127.0.0.1:4257
2011-09-12 18:49:54,120 WARN  NIOServerCnxn [NIOServerCxn.Factory:localhost/127.0.0.1:16015] - Exception causing close of session 0x1325fd35663000d due to java.io.IOException: An established connection was aborted by the software in your host machine
2011-09-12 18:49:54,120 INFO  NIOServerCnxn [NIOServerCxn.Factory:localhost/127.0.0.1:16015] - Closed socket connection for client /127.0.0.1:4257 which had sessionid 0x1325fd35663000d
2011-09-12 18:50:08,057 INFO  NIOServerCnxn [NIOServerCxn.Factory:localhost/127.0.0.1:16015] - Client attempting to renew session 0x1325fd35663000d at /127.0.0.1:4566
2011-09-12 18:50:08,057 INFO  NIOServerCnxn [NIOServerCxn.Factory:localhost/127.0.0.1:16015] - Established session 0x1325fd35663000d with negotiated timeout 40000 for client /127.0.0.1:4566

Re: disconnects and auto renewal

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Sep 14, 2011 at 8:16 PM, Martin Serrano <ma...@attivio.com> wrote:

> ....  The original reason the code was there was to handle the case where a
> client is mainly used for listening to remote events.  So once it starts, it
> sets up a few watches and really doesn't interact with the server after
> that.  The thought was that if such a client was disconnected and did not
> handle that case, we'd never know about it and it would seem like no remote
> events occurred.


The issue here is that disconnects can occur for many reasons and are
generally rather benign.

For instance, if you are doing a rolling upgrade of the Zookeeper servers,
you will get a disconnect and a quick reconnect.  There can be other reasons
for this as well.

In general, any client should be designed with the realization that during a
disconnected period, it is not clear what the situation is.  For instance,
if you are doing leader elect and need to have at least one server running,
then the client should continue serving in whatever roll it was before it
was disconnected.  If, on the other hand, you must avoid having more than
one such server then the client should stop serving in whatever capacity it
is serving in until it either gets a reconnect or an expiration.  If you
want to have your best estimate and minimize time with zero or more than one
server running but put equal weight on both kinds of errors, then the client
should start a timer on disconnect and only stop serving when the client
estimates that the quorum would have marked it as expired.

Clearly, all such strategies are subject to error.  For instance, the timer
approach is subject to error if the clock does not operate the way you
expect or if the entire Zookeeper quorum is down.  These problems are
inherent in distributed systems with fallible communications.  Deal with it.

Most importantly, do not expect that a disconnected client will receive a
session expiration event while it is still disconnected.  These expiration
events come from the server and if you are disconnected, you cannot receive
them.  If and when you reconnect you will receive the right event, but not
until then.


> I have since changed this code to loop trying to check existence of some
> znode upon receipt of a disconnect.  If a session expiration occurs in this
> loop then I trigger the reconnect logic.  Otherwise once we reconnect, the
> check will succeed and the loop will exit.  Does this sound like a
> reasonable way to handle the issue?
>
> Thanks,
> Martin
>
> >
> > Ted,
> >
> > Sorry to trouble you on this one.  I do understand the difference, but at
> > some point I did not.  :)
> >
> > Your question inspired me to look deeper at our code (to see if we were
> > confused) and I found one case that was triggering our reconnect response
> > from Disconnected event.  Everywhere else we only do this in response to
> a
> > SessionExpiredException.
> >
> > Thanks for the quick response and your work on ZooKeeper in general!  I
> > have also run into the "can't create ephemeral yet case" and our code
> > generally loops until successful.
> >
> > -Martin
>
> > > -----Original Message-----
> > > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > >
> > > Martin,
> > >
> > > From your email, it sounds like there might be a bit of confusion
> > > between disconnection and session expiration.  Are you sure you are
> > > clear on the difference between these?
> > >
> > > Also, I have seen cases in my own code where I confused myself by
> > > trying to re-create ephemeral files after a client program crashed.  I
> > > knew that the client had crashed as soon as it happened, but the
> > > Zookeeper servers could only determine this after a bit of time.  My
> > > new program tried to recreate the ephemerals to indicate that it was
> > > back but since the old ephemerals were still there, that failed.  Then
> > > a short time later when the ZK cluster understood that the old client
> > > was gone, the ephemerals disappeared even though the new client was
> > > humming along nicely.  My solution was to delete the ephemerals when
> > creating them.
> > >
> > > Is it possible you have a similar confusion?
> > >
> > > On Tue, Sep 13, 2011 at 11:25 AM, Martin Serrano <ma...@attivio.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > We have added code to our application to reconnect and re-establish
> > > > watches when we receive a Disconnected event.  I am running tests on
> > > > a heavily loaded system where the zookeeper server and clients are
> > > > all impacted.  On this test system we regularly experience session
> > > > timeouts and appropriately react to reconnect and set up our watches.
> > > > There is an uncommon case that I am having trouble puzzling out.
> > > > When running one of our tests in a loop about 1% of the time we hit
> > > > a case where
> > > on the client side we think the
> > > > session has expired but on the server side it has been renewed.   We
> will
> > > > then fail to be able to create an ephemeral node because it already
> > > > exists and does not ever get cleaned up (since the previous session
> > > > is still valid).  I'm trying to figure out if we are misusing the API
> or if we
> > have
> > > > encountered a bug.   I'm happy to provide more details.  One thing I
> am
> > > > wondering is if it is inappropriate to create a new session within
> > > > the event thread of another session which has received the
> > > > disconnected
> > > event.
> > > >
> > > > Thanks,
> > > > Martin Serrano
> > > > ...
>
>

RE: disconnects and auto renewal

Posted by Martin Serrano <ma...@attivio.com>.
Ted,

Once I modified the code to not respond to disconnects like they were session expirations my issue is resolved.  But it did bring up a new question.  The original reason the code was there was to handle the case where a client is mainly used for listening to remote events.  So once it starts, it sets up a few watches and really doesn't interact with the server after that.  The thought was that if such a client was disconnected and did not handle that case, we'd never know about it and it would seem like no remote events occurred.  I have since changed this code to loop trying to check existence of some znode upon receipt of a disconnect.  If a session expiration occurs in this loop then I trigger the reconnect logic.  Otherwise once we reconnect, the check will succeed and the loop will exit.  Does this sound like a reasonable way to handle the issue?

Thanks,
Martin

> 
> Ted,
> 
> Sorry to trouble you on this one.  I do understand the difference, but at
> some point I did not.  :)
> 
> Your question inspired me to look deeper at our code (to see if we were
> confused) and I found one case that was triggering our reconnect response
> from Disconnected event.  Everywhere else we only do this in response to a
> SessionExpiredException.
> 
> Thanks for the quick response and your work on ZooKeeper in general!  I
> have also run into the "can't create ephemeral yet case" and our code
> generally loops until successful.
> 
> -Martin
 
> > -----Original Message-----
> > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> >
> > Martin,
> >
> > From your email, it sounds like there might be a bit of confusion
> > between disconnection and session expiration.  Are you sure you are
> > clear on the difference between these?
> >
> > Also, I have seen cases in my own code where I confused myself by
> > trying to re-create ephemeral files after a client program crashed.  I
> > knew that the client had crashed as soon as it happened, but the
> > Zookeeper servers could only determine this after a bit of time.  My
> > new program tried to recreate the ephemerals to indicate that it was
> > back but since the old ephemerals were still there, that failed.  Then
> > a short time later when the ZK cluster understood that the old client
> > was gone, the ephemerals disappeared even though the new client was
> > humming along nicely.  My solution was to delete the ephemerals when
> creating them.
> >
> > Is it possible you have a similar confusion?
> >
> > On Tue, Sep 13, 2011 at 11:25 AM, Martin Serrano <ma...@attivio.com>
> > wrote:
> >
> > > Hi,
> > >
> > > We have added code to our application to reconnect and re-establish
> > > watches when we receive a Disconnected event.  I am running tests on
> > > a heavily loaded system where the zookeeper server and clients are
> > > all impacted.  On this test system we regularly experience session
> > > timeouts and appropriately react to reconnect and set up our watches.
> > > There is an uncommon case that I am having trouble puzzling out.
> > > When running one of our tests in a loop about 1% of the time we hit
> > > a case where
> > on the client side we think the
> > > session has expired but on the server side it has been renewed.   We will
> > > then fail to be able to create an ephemeral node because it already
> > > exists and does not ever get cleaned up (since the previous session
> > > is still valid).  I'm trying to figure out if we are misusing the API or if we
> have
> > > encountered a bug.   I'm happy to provide more details.  One thing I am
> > > wondering is if it is inappropriate to create a new session within
> > > the event thread of another session which has received the
> > > disconnected
> > event.
> > >
> > > Thanks,
> > > Martin Serrano
> > > ...


RE: disconnects and auto renewal

Posted by Martin Serrano <ma...@attivio.com>.
Ted,

Sorry to trouble you on this one.  I do understand the difference, but at some point I did not.  :)

Your question inspired me to look deeper at our code (to see if we were confused) and I found one case that was triggering our reconnect response from Disconnected event.  Everywhere else we only do this in response to a SessionExpiredException.

Thanks for the quick response and your work on ZooKeeper in general!  I have also run into the "can't create ephemeral yet case" and our code generally loops until successful.

-Martin

> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> 
> Martin,
> 
> From your email, it sounds like there might be a bit of confusion between
> disconnection and session expiration.  Are you sure you are clear on the
> difference between these?
> 
> Also, I have seen cases in my own code where I confused myself by trying to
> re-create ephemeral files after a client program crashed.  I knew that the
> client had crashed as soon as it happened, but the Zookeeper servers could
> only determine this after a bit of time.  My new program tried to recreate the
> ephemerals to indicate that it was back but since the old ephemerals were
> still there, that failed.  Then a short time later when the ZK cluster
> understood that the old client was gone, the ephemerals disappeared even
> though the new client was humming along nicely.  My solution was to delete
> the ephemerals when creating them.
> 
> Is it possible you have a similar confusion?
> 
> On Tue, Sep 13, 2011 at 11:25 AM, Martin Serrano <ma...@attivio.com>
> wrote:
> 
> > Hi,
> >
> > We have added code to our application to reconnect and re-establish
> > watches when we receive a Disconnected event.  I am running tests on a
> > heavily loaded system where the zookeeper server and clients are all
> > impacted.  On this test system we regularly experience session
> > timeouts and appropriately react to reconnect and set up our watches.
> > There is an uncommon case that I am having trouble puzzling out.  When
> > running one of our tests in a loop about 1% of the time we hit a case where
> on the client side we think the
> > session has expired but on the server side it has been renewed.   We will
> > then fail to be able to create an ephemeral node because it already
> > exists and does not ever get cleaned up (since the previous session is
> > still valid).  I'm trying to figure out if we are misusing the API or if we have
> > encountered a bug.   I'm happy to provide more details.  One thing I am
> > wondering is if it is inappropriate to create a new session within the
> > event thread of another session which has received the disconnected
> event.
> >
> > Thanks,
> > Martin Serrano
> > ...


Re: disconnects and auto renewal

Posted by Ted Dunning <te...@gmail.com>.
Martin,

>From your email, it sounds like there might be a bit of confusion between
disconnection and session expiration.  Are you sure you are clear on the
difference between these?

Also, I have seen cases in my own code where I confused myself by trying to
re-create ephemeral files after a client program crashed.  I knew that the
client had crashed as soon as it happened, but the Zookeeper servers could
only determine this after a bit of time.  My new program tried to recreate
the ephemerals to indicate that it was back but since the old ephemerals
were still there, that failed.  Then a short time later when the ZK cluster
understood that the old client was gone, the ephemerals disappeared even
though the new client was humming along nicely.  My solution was to delete
the ephemerals when creating them.

Is it possible you have a similar confusion?

On Tue, Sep 13, 2011 at 11:25 AM, Martin Serrano <ma...@attivio.com> wrote:

> Hi,
>
> We have added code to our application to reconnect and re-establish watches
> when we receive a Disconnected event.  I am running tests on a heavily
> loaded system where the zookeeper server and clients are all impacted.  On
> this test system we regularly experience session timeouts and appropriately
> react to reconnect and set up our watches.  There is an uncommon case that I
> am having trouble puzzling out.  When running one of our tests in a loop
> about 1% of the time we hit a case where on the client side we think the
> session has expired but on the server side it has been renewed.   We will
> then fail to be able to create an ephemeral node because it already exists
> and does not ever get cleaned up (since the previous session is still
> valid).  I'm trying to figure out if we are misusing the API or if we have
> encountered a bug.   I'm happy to provide more details.  One thing I am
> wondering is if it is inappropriate to create a new session within the event
> thread of another session which has received the disconnected event.
>
> Thanks,
> Martin Serrano
>
>  The logs are filtered a little just to keep this email a reasonable size,
> but everything relevant to the session in question is here.
>
> Associated client application logs (the reconnecting message is ours where
> we re-establish watches, etc using a new session):
>
> 2011-09-12 18:49:55,729 INFO
> [ZooKeeperClientCnxn--EventThread-1325fd35663000d] - Session level event,
> session 1325fd35663000d: WatchedEvent state:Disconnected type:None path:null
> 2011-09-12 18:49:55,729 INFO
> [ZooKeeperClientCnxn--EventThread-1325fd35663000d] - Reconnecting, current
> session 1325fd35663000d
> 2011-09-12 18:49:55,729 INFO
> [ZooKeeperClientCnxn--EventThread-1325fd35663000d] - Connecting to zookeeper
> on localhost:16015
> 2011-09-12 18:50:08,823 INFO [main_master_10.2.1.100_16040] - obj[76a9b9c]
> Reestablishing agent watches
> 2011-09-12 18:50:09,120 WARN
> [ZooKeeperClientCnxn--EventThread-1325fd35663000d] - obj[1e5b04ae] Failed to
> write status node multinode-monitor/master.  Saw previous value null,
> retrying
>  org.apache.zookeeper.KeeperException$NodeExistsException - KeeperErrorCode
> = NodeExists for multinode-monitor/master
>
> Associated client zookeeper logs:
>
> 2011-09-12 18:48:15,995 INFO  ClientCnxn
> [ZooKeeperClientCnxn--SendThread-1325fd35663000d] - Session establishment
> complete on server localhost/127.0.0.1:16015, sessionid =
> 0x1325fd35663000d, negotiated timeout = 40000
> 2011-09-12 18:49:31,245 INFO  ClientCnxn
> [ZooKeeperClientCnxn--SendThread-1325fd35663000d] - Client session timed
> out, have not heard from server in 26672ms for sessionid 0x1325fd35663000d,
> closing socket connection and attempting reconnect
> 2011-09-12 18:49:55,729 INFO  ZooKeeper
> [ZooKeeperClientCnxn--EventThread-1325fd35663000d] - Initiating client
> connection, connectString=localhost:16015 sessionTimeout=600000
> watcher=com.attivio.global.AieZooKeeper@4f536eec<mailto:watcher
> =com.attivio.global.AieZooKeeper@4f536eec>
> 2011-09-12 18:50:08,057 INFO  ClientCnxn
> [ZooKeeperClientCnxn--SendThread-1325fd35663000d] - Session establishment
> complete on server localhost/127.0.0.1:16015, sessionid =
> 0x1325fd35663000d, negotiated timeout = 40000
>
> Associated server zookeeper logs:
>
> 2011-09-12 18:48:16,010 INFO  NIOServerCnxn [SyncThread:0] - Established
> session 0x1325fd35663000d with negotiated timeout 40000 for client /
> 127.0.0.1:4257
> 2011-09-12 18:49:54,120 WARN  NIOServerCnxn
> [NIOServerCxn.Factory:localhost/127.0.0.1:16015] - Exception causing close
> of session 0x1325fd35663000d due to java.io.IOException: An established
> connection was aborted by the software in your host machine
> 2011-09-12 18:49:54,120 INFO  NIOServerCnxn
> [NIOServerCxn.Factory:localhost/127.0.0.1:16015] - Closed socket
> connection for client /127.0.0.1:4257 which had sessionid
> 0x1325fd35663000d
> 2011-09-12 18:50:08,057 INFO  NIOServerCnxn
> [NIOServerCxn.Factory:localhost/127.0.0.1:16015] - Client attempting to
> renew session 0x1325fd35663000d at /127.0.0.1:4566
> 2011-09-12 18:50:08,057 INFO  NIOServerCnxn
> [NIOServerCxn.Factory:localhost/127.0.0.1:16015] - Established session
> 0x1325fd35663000d with negotiated timeout 40000 for client /127.0.0.1:4566
>