Posted to user@zookeeper.apache.org by Yuriy Lopotun <yu...@gmail.com> on 2015/04/15 19:46:21 UTC

Zookeeper-Zoodiscovery auto reconnect issue

Hi guys,


In our client-server OSGi application we use the ECF Zoodiscovery provider
for remote service discovery, which uses Zookeeper (v3.3.3) under the hood.
While testing the application's resiliency, we noticed that after unplugging
the network cable and plugging it back in, the client in some cases doesn’t
get the remote OSGi services back from the server.


I started debugging this use case and found that, on a session timeout,
both Zookeeper internally and Zoodiscovery try to reconnect at the same
time:

1) Zookeeper internally:

In ClientCnxn.SendThread.run(), on a SessionTimeoutException, it closes the
socket connection in cleanup(), sends the disconnect event to the watchers,
and reconnects in startConnect().

2) Zoodiscovery:

The watcher receives the disconnect event from Zookeeper, closes the
current connection, and opens a new one:

// discard the current stale reader
this.readKeeper.close();
// try reconnecting
this.readKeeper = new ZooKeeper(this.ip, 3000, this);



This results in a connect-disconnect-connect sequence instead of a single
connect, since Zoodiscovery closes the connection that Zookeeper has just
reopened and creates a new one. Moreover, it sometimes leaves the client in
an inconsistent state: the connection is eventually re-established, but the
client doesn’t ask the server for the remote services.


I think the issue here is on the Zoodiscovery side: it should not trigger a
hard disconnect/reconnect when Zookeeper reconnects internally. However,
I’m not sure how it could distinguish these cases, because Zookeeper sends
an identical disconnect event regardless of whether it is going to
reconnect internally:

eventThread.queueEvent(new WatchedEvent(
                       Event.EventType.None,
                       Event.KeeperState.Disconnected,
                       null));

This call appears both in the catch block inside the while loop of
ClientCnxn.SendThread and just after the loop.


So I wanted to ask for your suggestions on how to handle disconnects
better, so that double reconnects are avoided and Zoodiscovery initiates a
hard reconnect only when Zookeeper doesn’t reconnect internally.


Thanks,

Yuriy

Re: Zookeeper-Zoodiscovery auto reconnect issue

Posted by Yuriy Lopotun <yu...@gmail.com>.
Ah, maybe I didn't understand your suggestion correctly.

If you meant that zooKeeper.getState().isAlive() should be checked on the
Zoodiscovery side before triggering a reconnect, then this should indeed
fix the issue.
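
A minimal sketch of such a check, written as a self-contained model rather
than against the real ZooKeeper or ZooDiscovery classes. The State enum is a
hypothetical stand-in for ZooKeeper.States, and its isAlive() is assumed to
be false only for CLOSED and AUTH_FAILED. ReconnectGuard, Action and
onDisconnected are illustrative names, not existing API.

```java
// Minimal model of the guard; not the ZooDiscovery or ZooKeeper source.
final class ReconnectGuard {

    // Hypothetical stand-in for ZooKeeper.States. isAlive() is assumed to be
    // false only for the terminal states CLOSED and AUTH_FAILED.
    enum State {
        CONNECTING, CONNECTED, CLOSED, AUTH_FAILED;

        boolean isAlive() {
            return this != CLOSED && this != AUTH_FAILED;
        }
    }

    // What the watcher should do with a Disconnected event.
    enum Action { WAIT_FOR_CLIENT, HARD_RECONNECT }

    private ReconnectGuard() {}

    // stateAtEventTime is the state read when the event is handled,
    // e.g. the value getState() returned inside the watcher callback.
    static Action onDisconnected(State stateAtEventTime) {
        if (stateAtEventTime.isAlive()) {
            // The client is still retrying on its own; closing it here would
            // tear down the connection it is about to re-establish.
            return Action.WAIT_FOR_CLIENT;
        }
        // Terminal state: this handle will not recover, so replace it.
        return Action.HARD_RECONNECT;
    }
}
```

In the watcher, the existing readKeeper.close() / new ZooKeeper(...) pair
would then run only when the result is HARD_RECONNECT.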

Thanks,
Yuriy

2015-04-15 16:49 GMT-04:00 Yuriy Lopotun <yu...@gmail.com>:


Re: Zookeeper-Zoodiscovery auto reconnect issue

Posted by Yuriy Lopotun <yu...@gmail.com>.
Thanks for your reply.
I agree that zooKeeper.getState().isAlive() is a good way to check the
state.

But notice that after sending the Disconnected event (inside the while
loop), the send thread proceeds almost immediately to the next loop
iteration. So "while (zooKeeper.state.isAlive())" is likely to still
evaluate to true at that moment, because Zoodiscovery is at the same time
triggering a chain of method invocations:
ZooKeeper.close() -> ClientCnxn.close() -> disconnect() ->
sendThread.close() -> zooKeeper.state = States.CLOSED
which is likely to take longer to execute than the evaluation of the loop
condition.

So ZooKeeper will invoke startConnect() at least once, which triggers a
reconnect. At the same time ZooDiscovery, as I mentioned, has triggered
ZooKeeper.close(), which will try to close the new ZooKeeper connection.
I'm trying to find a way to avoid this situation...
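
To make the interleaving concrete, here is a small deterministic replay of
it. This is only a model under stated assumptions: it fixes the ordering
described above (the loop condition is evaluated before the close() chain
reaches "state = CLOSED") and does not run real threads or touch the
ZooKeeper client. ReconnectTrace and replay are hypothetical names used for
illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ReconnectTrace {

    private ReconnectTrace() {}

    // Replays one fixed interleaving of the send thread and the watcher and
    // returns the connection-level operations in the order they happen.
    static List<String> replay(boolean watcherClosesOnDisconnect) {
        List<String> ops = new ArrayList<>();
        boolean alive = true; // stands in for zooKeeper.state.isAlive()

        // Send thread: session timeout, cleanup(), Disconnected event queued.
        ops.add("disconnect");

        // Send thread: the loop condition is evaluated before the watcher's
        // close() chain has set the state to CLOSED, so it still sees alive.
        if (alive) {
            ops.add("connect (internal)");
        }

        if (watcherClosesOnDisconnect) {
            // Watcher thread: close() finally marks the client CLOSED and
            // tears down the connection that was just re-established...
            alive = false;
            ops.add("disconnect");
            // ...and then a brand-new client connects again.
            ops.add("connect (new client)");
        }
        return ops;
    }
}
```

With replay(true) the model produces the connect-disconnect-connect pattern
from the original report; with replay(false) only the internal reconnect
remains.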

Yuriy

2015-04-15 16:15 GMT-04:00 Camille Fournier <ca...@apache.org>:


Re: Zookeeper-Zoodiscovery auto reconnect issue

Posted by Camille Fournier <ca...@apache.org>.
So we have the notion of state that you can check.
zooKeeper.getState().isAlive() will tell you if the client is actually
alive or not.

Looking through the code I'm not 100% sure why we are sending the
Disconnected state change after the while loop, or if the code ever would,
since the state should not be alive at that point (or else it wouldn't have
left the while loop).

In general, though, it sounds like a bug on the discovery side, as you
said. A check for the state liveness (are we closed/auth_failed or just
disconnected) should fix this, I think.

C
