Posted to user@zookeeper.apache.org by "Carroll James (Nokia-LC/Malvern)" <ja...@nokia.com> on 2012/11/28 07:18:59 UTC

Unrecoverable ConnectionLossException after server restart

I'm seeing what I think is incorrect behavior from ZooKeeper.

When I start a client, connect to a server, and then restart the server, I thought the client was supposed to eventually reconnect. It doesn't. It throws a ConnectionLossException on every use, the ZooKeeper client's isAlive is true, I never get a SESSION_EXPIRATION, and I can see the client-side ephemeral ports listed in the error message counting up as if it's continually attempting to reconnect.

If I recreate the ZooKeeper client, the new client connects and I can use it.

So I could simply react as if I had received a SESSION_EXPIRATION exception and rebuild the client state, except that a ConnectionLossException is something I ALSO get during a network partition. When I periodically recreate the entire client from scratch in response to a ConnectionLossException, I eventually run out of file descriptors and my entire process is hosed. This seems to be related to the use of NIO and the repeated opening of pipes and anon_inodes (which show up in lsof output).

Am I doing something wrong? Any suggestions?
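
A note on the file descriptor exhaustion described above. The thread never identifies its cause. One assumption worth checking is that the replaced clients are never closed, so their sockets and selectors stay open. The following is a minimal Java sketch of closing the old handle before building a new one. ReplaceableClient and its factory argument are hypothetical names, and AutoCloseable stands in for the real client, where the corresponding call would be ZooKeeper.close().

```java
import java.util.function.Supplier;

// Hypothetical holder that never keeps more than one live client handle.
// Assumption: the descriptor leak comes from old handles not being closed.
final class ReplaceableClient<T extends AutoCloseable> {
    private final Supplier<T> factory;
    private T current;

    ReplaceableClient(Supplier<T> factory) {
        this.factory = factory;
        this.current = factory.get();
    }

    synchronized T get() {
        return current;
    }

    /** Closes the old handle first, then builds and returns a new one. */
    synchronized T recreate() throws Exception {
        T old = current;
        try {
            if (old != null) {
                old.close(); // release sockets/selectors held by the old handle
            }
        } finally {
            // Build the replacement even if close() failed.
            current = factory.get();
        }
        return current;
    }
}
```

With a holder like this, the application would call recreate() instead of constructing a second client next to the first one, so the number of open handles stays at one.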

The information contained in this communication may be CONFIDENTIAL and is intended only for the use of the recipient(s) named above.  If you are not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication, or any of its contents, is strictly prohibited.  If you have received this communication in error, please notify the sender and delete/destroy the original message and any copy of it from your computer or paper files.

RE: Unrecoverable ConnectionLossException after server restart

Posted by "Carroll James (Nokia-LC/Malvern)" <ja...@nokia.com>.
Yes. This is running in a unit test with an embedded server that bounces and starts from scratch, deleting the data directory between restarts. I guess "don't do that" would be the advice. :-)

-----Original Message-----
From: Patrick Hunt [mailto:phunt@apache.org]
Sent: Wednesday, November 28, 2012 12:55 PM
To: user@zookeeper.apache.org
Subject: Re: Unrecoverable ConnectionLossException after server restart

Are you running a standalone server or an ensemble? Any chance that your datadir is getting cleared between runs of the server? (For example, having data in /tmp and restarting the OS?)

Basically this error message is saying that the client has talked to a server that was at version 2 (zxid 0x2), but when it reconnects, the server is at version 0 (zxid 0x0).

I've seen this before when people clear the datadir when restarting the server. I've also seen cases where the user has an ensemble that's misconfigured, e.g. 3 servers that are running standalone rather than as a single ensemble.

Patrick


Re: Unrecoverable ConnectionLossException after server restart

Posted by Patrick Hunt <ph...@apache.org>.
Are you running a standalone server or an ensemble? Any chance that
your datadir is getting cleared between runs of the server? (For
example, having data in /tmp and restarting the OS?)

Basically this error message is saying that the client has talked to a
server that was at version 2 (zxid 0x2), but when it reconnects, the
server is at version 0 (zxid 0x0).

I've seen this before when people clear the datadir when restarting
the server. I've also seen cases where the user has an ensemble that's
misconfigured, e.g. 3 servers that are running standalone rather than
as a single ensemble.

Patrick
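
The rule behind that log message can be written as a one-line comparison. This is a sketch of the behavior described in this thread, not a copy of the server code; SessionZxidCheck and shouldRefuse are hypothetical names.

```java
// Hypothetical model of the check described above: a server refuses a
// session when the client has already seen a newer transaction id (zxid)
// than the server itself holds.
final class SessionZxidCheck {
    private SessionZxidCheck() {}

    /**
     * @param clientLastSeenZxid last zxid the client saw before reconnecting
     * @param serverLastZxid     last zxid the server has in its data
     * @return true if the server would tell the client to try another server
     */
    static boolean shouldRefuse(long clientLastSeenZxid, long serverLastZxid) {
        return clientLastSeenZxid > serverLastZxid;
    }
}
```

With the values from the log line in this thread, shouldRefuse(0x2L, 0x0L) is true. That is the situation a server ends up in when its datadir was wiped while the client kept its old session.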

On Wed, Nov 28, 2012 at 9:17 AM, Carroll James (Nokia-LC/Malvern)
<ja...@nokia.com> wrote:
> This is apparently happening because the session establishment is being rejected on the server side:
>
> 2012-11-28 12:13:04,102 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:54551] INFO  ZooKeeperServer - Refusing session request for client /127.0.0.1:38095 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server
>
> Unfortunately I can't see any indication on the client side that this is the problem. The server just decides to sever the connection and the client just keeps retrying (hence the counting up on the ephemeral ports). I could deal with this in the application if I could tell why the server decided to close the connection. Is there a way for me to do this?
>
> Thanks
> Jim

RE: Unrecoverable ConnectionLossException after server restart

Posted by "Carroll James (Nokia-LC/Malvern)" <ja...@nokia.com>.
OK. So the only difference I can see from the client side between a network partition and a ZooKeeper server cluster bounce is that in the former case the ConnectionLossException happens on a ZooKeeper client whose state is CONNECTED, and in the latter it's CONNECTING. Is this a reliable means of determining that I should recreate the client state from scratch?
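
The heuristic asked about above can be sketched as follows. This only illustrates the idea and is not a confirmed answer: nobody in the thread says the CONNECTED/CONNECTING distinction is reliable, and a client may also sit in CONNECTING for a while during an ordinary partition. For that reason the sketch adds a threshold of consecutive failures, which is an assumption of this example. ClientState, Action and onConnectionLoss are hypothetical names; in a real client the state would come from ZooKeeper.getState().

```java
// Hypothetical decision helper for the heuristic discussed in this thread.
final class ConnectionLossHeuristic {
    private ConnectionLossHeuristic() {}

    /** Stand-in for the two client states mentioned in the thread. */
    enum ClientState { CONNECTING, CONNECTED }

    enum Action { KEEP_CLIENT_AND_RETRY, CONSIDER_RECREATING_CLIENT }

    /**
     * @param stateAtFailure client state observed when the exception was caught
     * @param consecutiveLossesWhileConnecting failures seen in a row while CONNECTING
     * @param threshold assumed cut-off before giving up on the existing client
     */
    static Action onConnectionLoss(ClientState stateAtFailure,
                                   int consecutiveLossesWhileConnecting,
                                   int threshold) {
        if (stateAtFailure == ClientState.CONNECTED) {
            // Looks like a transient loss on an established session.
            return Action.KEEP_CLIENT_AND_RETRY;
        }
        // Still CONNECTING: only escalate after repeated failures.
        return consecutiveLossesWhileConnecting >= threshold
                ? Action.CONSIDER_RECREATING_CLIENT
                : Action.KEEP_CLIENT_AND_RETRY;
    }
}
```

The threshold keeps a short partition from triggering a rebuild. Any value chosen for it is a guess that would need tuning against the session timeout.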


RE: Unrecoverable ConnectionLossException after server restart

Posted by "Carroll James (Nokia-LC/Malvern)" <ja...@nokia.com>.
This is apparently happening because the session establishment is being rejected on the server side:

2012-11-28 12:13:04,102 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:54551] INFO  ZooKeeperServer - Refusing session request for client /127.0.0.1:38095 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server

Unfortunately I can't see any indication on the client side that this is the problem. The server just decides to sever the connection and the client just keeps retrying (hence the counting up on the ephemeral ports). I could deal with this in the application if I could tell why the server decided to close the connection. Is there a way for me to do this?

Thanks
Jim
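
Since the reason for the refusal only shows up in the server log, a test harness that runs an embedded server could scan that log for it. This is a minimal sketch, assuming the message arrives as a single line in the format quoted above; RefusalLogLine is a hypothetical helper, not part of ZooKeeper.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the client address and both zxids out of a
// "Refusing session request" server log line.
final class RefusalLogLine {
    private static final Pattern REFUSAL = Pattern.compile(
            "Refusing session request for client (\\S+) as it has seen zxid "
            + "0x([0-9a-fA-F]+) our last zxid is 0x([0-9a-fA-F]+)");

    final String client;    // client address as printed by the server
    final long clientZxid;  // last zxid the client has seen
    final long serverZxid;  // last zxid the server has

    private RefusalLogLine(String client, long clientZxid, long serverZxid) {
        this.client = client;
        this.clientZxid = clientZxid;
        this.serverZxid = serverZxid;
    }

    /** Returns the parsed fields, or empty if the line is not a refusal. */
    static Optional<RefusalLogLine> parse(String logLine) {
        Matcher m = REFUSAL.matcher(logLine);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new RefusalLogLine(
                m.group(1),
                Long.parseUnsignedLong(m.group(2), 16),
                Long.parseUnsignedLong(m.group(3), 16)));
    }
}
```

For the line quoted above, parse returns client /127.0.0.1:38095 with clientZxid 2 and serverZxid 0.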
