You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@zookeeper.apache.org by Brian Tarbox <br...@gmail.com> on 2013/02/25 17:51:40 UTC

Server seems not to be sending keep-alives so I lose my session ("have not heard from server....")

I am getting the dreaded message:

 10:59:45,871 INFO [org.apache.zookeeper.ClientCnxn] - <Client session
timed out, have not heard from server in 31482ms for sessionid
0x13d11dd08160007, closing socket connection and attempting reconnect>

and from looking at the logs it certainly seems that the keep alive
messages are sometime just not being sent.

In my case I see a bunch of these:
10:58:00,164 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping response
for sessionid: 0x13d11dd08160007 after 0ms>
10:58:13,511 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping response
for sessionid: 0x13d11dd08160007 after 0ms>
10:58:26,857 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping response
for sessionid: 0x13d11dd08160007 after 0ms>
10:58:40,205 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping response
for sessionid: 0x13d11dd08160007 after 0ms>
10:59:14,140 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping response
for sessionid: 0x13d11dd08160007 after 0ms>

But then nothing from 10:59:14 until 10:59:45 when my client decides its
been too long and so times out.

I'm running 3.4.5 on EC2 ...any suggestions welcome.

Thanks.

Brian Tarbox
-- 
http://about.me/BrianTarbox

Re: Server seems not to be sending keep-alives so I lose my session ("have not heard from server....")

Posted by Camille Fournier <ca...@apache.org>.
I'm not sure that having more pings will help you with a lossy network. The
right thing to do here I think is to increase the timeout if you want to be
able to stay connected through a lossy network, not send even more pings.
I'm not a networking expect but TCP should be retrying the send of those
packets long before you reach the client liveness timeout. I would also
just in general question how much you want your ZK clients thinking they
are alive when the network is so lossy that 2 pings are unable to be acked
in a reasonable period. The control needs to be the timeouts in your
network both at the ZK layer and the tcp layer (if you have access to
that), not the number of pings you send.

C


On Mon, Feb 25, 2013 at 1:05 PM, Brian Tarbox <br...@gmail.com> wrote:

> The server logs don't say anything.  I do have a theory based on reading
> the code, specifically the SendThread class within ClientCnxn.java
>
> It took me a while to figure that its the client that sends the ping due to
> the error message being "have not heard from the *server *..."
> Once I got past that the key line in the code is:
>
> int timeToNextPing = readTimeout / 2  - clientCnxnSocket.getIdleSend()
>
> This basically means that the client will get at most 2 tries to send the
> ping within the timeout interval, no matter what you set the timeout value
> to.
> In a lossy network this may be insufficient...as can be seen from my client
> logs where I can go 30 seconds without sending a ping.
>
> I'm running a test now where I've changed the "2" to a "4".  I trade a tiny
> increase in network traffic for a much higher chance of getting a
> successful ping even in a bad network environment.
>
> Brian
>
>
> On Mon, Feb 25, 2013 at 11:56 AM, Camille Fournier <camille@apache.org
> >wrote:
>
> > What do your server logs say during this time?
> >
> >
> > On Mon, Feb 25, 2013 at 11:51 AM, Brian Tarbox <briantarbox@gmail.com
> > >wrote:
> >
> > > I am getting the dreaded message:
> > >
> > >  10:59:45,871 INFO [org.apache.zookeeper.ClientCnxn] - <Client session
> > > timed out, have not heard from server in 31482ms for sessionid
> > > 0x13d11dd08160007, closing socket connection and attempting reconnect>
> > >
> > > and from looking at the logs it certainly seems that the keep alive
> > > messages are sometime just not being sent.
> > >
> > > In my case I see a bunch of these:
> > > 10:58:00,164 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping
> response
> > > for sessionid: 0x13d11dd08160007 after 0ms>
> > > 10:58:13,511 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping
> response
> > > for sessionid: 0x13d11dd08160007 after 0ms>
> > > 10:58:26,857 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping
> response
> > > for sessionid: 0x13d11dd08160007 after 0ms>
> > > 10:58:40,205 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping
> response
> > > for sessionid: 0x13d11dd08160007 after 0ms>
> > > 10:59:14,140 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping
> response
> > > for sessionid: 0x13d11dd08160007 after 0ms>
> > >
> > > But then nothing from 10:59:14 until 10:59:45 when my client decides
> its
> > > been too long and so times out.
> > >
> > > I'm running 3.4.5 on EC2 ...any suggestions welcome.
> > >
> > > Thanks.
> > >
> > > Brian Tarbox
> > > --
> > > http://about.me/BrianTarbox
> > >
> >
>
>
>
> --
> http://about.me/BrianTarbox
>

Re: Server seems not to be sending keep-alives so I lose my session ("have not heard from server....")

Posted by Brian Tarbox <br...@gmail.com>.
The server logs don't say anything.  I do have a theory based on reading
the code, specifically the SendThread class within ClientCnxn.java

It took me a while to figure that its the client that sends the ping due to
the error message being "have not heard from the *server *..."
Once I got past that the key line in the code is:

int timeToNextPing = readTimeout / 2  - clientCnxnSocket.getIdleSend()

This basically means that the client will get at most 2 tries to send the
ping within the timeout interval, no matter what you set the timeout value
to.
In a lossy network this may be insufficient...as can be seen from my client
logs where I can go 30 seconds without sending a ping.

I'm running a test now where I've changed the "2" to a "4".  I trade a tiny
increase in network traffic for a much higher chance of getting a
successful ping even in a bad network environment.

Brian


On Mon, Feb 25, 2013 at 11:56 AM, Camille Fournier <ca...@apache.org>wrote:

> What do your server logs say during this time?
>
>
> On Mon, Feb 25, 2013 at 11:51 AM, Brian Tarbox <briantarbox@gmail.com
> >wrote:
>
> > I am getting the dreaded message:
> >
> >  10:59:45,871 INFO [org.apache.zookeeper.ClientCnxn] - <Client session
> > timed out, have not heard from server in 31482ms for sessionid
> > 0x13d11dd08160007, closing socket connection and attempting reconnect>
> >
> > and from looking at the logs it certainly seems that the keep alive
> > messages are sometime just not being sent.
> >
> > In my case I see a bunch of these:
> > 10:58:00,164 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping response
> > for sessionid: 0x13d11dd08160007 after 0ms>
> > 10:58:13,511 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping response
> > for sessionid: 0x13d11dd08160007 after 0ms>
> > 10:58:26,857 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping response
> > for sessionid: 0x13d11dd08160007 after 0ms>
> > 10:58:40,205 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping response
> > for sessionid: 0x13d11dd08160007 after 0ms>
> > 10:59:14,140 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping response
> > for sessionid: 0x13d11dd08160007 after 0ms>
> >
> > But then nothing from 10:59:14 until 10:59:45 when my client decides its
> > been too long and so times out.
> >
> > I'm running 3.4.5 on EC2 ...any suggestions welcome.
> >
> > Thanks.
> >
> > Brian Tarbox
> > --
> > http://about.me/BrianTarbox
> >
>



-- 
http://about.me/BrianTarbox

Re: Server seems not to be sending keep-alives so I lose my session ("have not heard from server....")

Posted by Camille Fournier <ca...@apache.org>.
What do your server logs say during this time?


On Mon, Feb 25, 2013 at 11:51 AM, Brian Tarbox <br...@gmail.com>wrote:

> I am getting the dreaded message:
>
>  10:59:45,871 INFO [org.apache.zookeeper.ClientCnxn] - <Client session
> timed out, have not heard from server in 31482ms for sessionid
> 0x13d11dd08160007, closing socket connection and attempting reconnect>
>
> and from looking at the logs it certainly seems that the keep alive
> messages are sometime just not being sent.
>
> In my case I see a bunch of these:
> 10:58:00,164 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping response
> for sessionid: 0x13d11dd08160007 after 0ms>
> 10:58:13,511 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping response
> for sessionid: 0x13d11dd08160007 after 0ms>
> 10:58:26,857 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping response
> for sessionid: 0x13d11dd08160007 after 0ms>
> 10:58:40,205 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping response
> for sessionid: 0x13d11dd08160007 after 0ms>
> 10:59:14,140 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping response
> for sessionid: 0x13d11dd08160007 after 0ms>
>
> But then nothing from 10:59:14 until 10:59:45 when my client decides its
> been too long and so times out.
>
> I'm running 3.4.5 on EC2 ...any suggestions welcome.
>
> Thanks.
>
> Brian Tarbox
> --
> http://about.me/BrianTarbox
>