Posted to user@zookeeper.apache.org by Yuriy Lopotun <yu...@gmail.com> on 2015/04/23 00:55:38 UTC

Zookeeper-based discovery provider: infinite re-connect loop after server restart

Hi guys,

In our client-server OSGi application we are using the ECF ZooKeeper-based
discovery provider for remote service discovery (based on ZooKeeper
v3.3.6).

In standalone mode the plugin opens a dedicated ZooKeeper connection from
the client to each of the servers.


While testing the application's resiliency, we noticed that after we restart
the server, the connection is never re-established. In the server logs I
found the following:

2015-04-22 18:20:53,763 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2001] INFO  org.apac.zook.serv.NIOServerCnxn - Accepted socket connection from /10.36.64.250:53022

2015-04-22 18:20:53,763 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2001] DEBUG org.apac.zook.serv.NIOServerCnxn - Session establishment request from client /10.36.64.250:53022 client's lastZxid is 0x8

2015-04-22 18:20:53,764 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2001] INFO  org.apac.zook.serv.NIOServerCnxn - Refusing session request for client /10.36.64.250:53022 as it has seen zxid 0x8 our last zxid is 0x7 client must try another server

2015-04-22 18:20:53,764 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2001] INFO  org.apac.zook.serv.NIOServerCnxn - Closed socket connection for client /10.36.64.250:53022 (no session established for client)



As far as I understand, this is expected behaviour, since the server
(due to the restart) cleaned up its DB and reset its transaction id.
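
The refusal in the server log boils down to comparing two transaction ids: the last zxid the client has seen and the last zxid the server has processed. Below is a minimal sketch of that comparison. It is not ZooKeeper's source code, and the class and method names are made up for illustration; the only values taken from the logs are 0x8 and 0x7.

```java
// Hypothetical model of the check behind the "Refusing session request"
// log line. Not ZooKeeper source code; names are invented for illustration.
final class ZxidCheck {
    /** Returns true when the server would accept the session request. */
    static boolean acceptsSession(long clientLastZxid, long serverLastZxid) {
        // A client that has already seen a newer transaction than the server
        // has processed would observe state going backwards, so it is refused.
        return clientLastZxid <= serverLastZxid;
    }

    public static void main(String[] args) {
        // Values from the server log: the client has seen 0x8, the server is at 0x7.
        System.out.println(acceptsSession(0x8L, 0x7L)); // prints "false"
        // Assumption: a newly created client handle has not seen any zxid yet (0).
        System.out.println(acceptsSession(0x0L, 0x7L)); // prints "true"
    }
}
```

As long as the same client handle keeps presenting 0x8 and the restarted server stays below that value, every attempt is refused in the same way.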


The problem in this case is that the client session keeps trying to
reconnect to this one and only server, which results in an infinite loop:

2015-04-22 18:21:02,760 [pool-2-thread-3-SendThread(ca-rd-mbernard.miranda.com:2001)] INFO  org.apac.zook.ClientCnxn - Opening socket connection to server ca-rd-mbernard.miranda.com/10.36.64.250:2001

2015-04-22 18:21:02,761 [pool-2-thread-3-SendThread(ca-rd-mbernard.miranda.com:2001)] INFO  org.apac.zook.ClientCnxn - Socket connection established to ca-rd-mbernard.miranda.com/10.36.64.250:2001, initiating session

2015-04-22 18:21:02,761 [pool-2-thread-3-SendThread(ca-rd-mbernard.miranda.com:2001)] DEBUG org.apac.zook.ClientCnxn - Session establishment request sent on ca-rd-mbernard.miranda.com/10.36.64.250:2001

2015-04-22 18:21:02,762 [pool-2-thread-3-SendThread(ca-rd-mbernard.miranda.com:2001)] INFO  org.apac.zook.ClientCnxn - Unable to read additional data from server sessionid 0x14ce32e178c0002, likely server has closed socket, closing socket connection and attempting reconnect



Again, I think this is correct behaviour when there are several servers, but
in our case there is always exactly one.

So I wanted to ask for a suggestion: what do you think we can do in this
case to achieve an automatic reconnect?

I thought that maybe, when there is only one server, the client could close
the connection on such an error instead of retrying? Perhaps this enhancement
has already been made in more recent versions and could be back-ported?
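
To make the idea concrete, here is a rough application-side sketch of "close instead of retrying forever". It assumes the application polls the connection state periodically. `Handle` is a hypothetical stand-in for the client handle (in a real application that role would be played by `org.apache.zookeeper.ZooKeeper`, using `getState()` and `close()`), and `ReconnectWatchdog` and its time limit are invented for this example. It is not something ZooKeeper or ECF provides.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a client handle; a real application would wrap
// org.apache.zookeeper.ZooKeeper here (getState() and close()).
interface Handle {
    boolean isConnected();
    void close();
}

// Replaces the handle once it has stayed disconnected longer than a limit.
final class ReconnectWatchdog {
    private final Supplier<Handle> factory;
    private final long maxDisconnectedMillis;
    private Handle current;
    private long disconnectedSince = -1; // -1: not currently tracking an outage

    ReconnectWatchdog(Supplier<Handle> factory, long maxDisconnectedMillis) {
        this.factory = factory;
        this.maxDisconnectedMillis = maxDisconnectedMillis;
        this.current = factory.get();
    }

    Handle handle() {
        return current;
    }

    /** Call periodically; the clock is passed in so the logic is easy to test. */
    void check(long nowMillis) {
        if (current.isConnected()) {
            disconnectedSince = -1;
            return;
        }
        if (disconnectedSince < 0) {
            disconnectedSince = nowMillis; // first observation of the outage
            return;
        }
        if (nowMillis - disconnectedSince >= maxDisconnectedMillis) {
            current.close();         // give up on the old session
            current = factory.get(); // assumption: a fresh handle has seen no zxid yet
            disconnectedSince = -1;
        }
    }
}
```

The trade-off is that the old session is abandoned, so anything tied to it (ephemeral nodes, watches) would have to be re-created by the application on the new handle.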



Thanks,

Yuriy

Re: Zookeeper-based discovery provider: infinite re-connect loop after server restart

Posted by Yuriy Lopotun <yu...@gmail.com>.

Looks like there's an open bug for the described issue:
https://issues.apache.org/jira/browse/ZOOKEEPER-832

There was some discussion in the comments, but it looks like the best
solution hasn't been found yet.

Yuriy
