You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@zookeeper.apache.org by Stack <st...@duboce.net> on 2010/02/22 19:18:42 UTC

Bit of help debugging a TIMED OUT session please

Hey Lads:

Any chance of some pointers debugging a session TIMED OUT?

Client is hosted inside an hbase regionserver.  Usually session
timeout is because of some fat GC pause that is longer than session
timeout but thats not the case here.  It seems to be a connectivity
problem.  Let me post a few logs.  You fellas might have seen it
before or at least can give me pointers or where to go look next.

Alls well for hours -- I see pings on the problem session reported at
some regularity with 1ms latency usually -- and then:

2010-02-21 18:23:50,158 [main-SendThread] WARN
org.apache.zookeeper.ClientCnxn: Exception closing session
0x26ed968d880001 to sun.nio.ch.SelectionKeyImpl@788ab708
java.io.IOException: Read error rc = -1
java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)

and....

2010-02-21 18:23:54,240 [main-SendThread] WARN
org.apache.zookeeper.ClientCnxn: Exception closing session
0x26ed968d880001 to sun.nio.ch.SelectionKeyImpl@33bcd1d5
java.io.IOException: Read error rc = -1
java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)

There is one more of the above and then:

2010-02-21 18:23:55,095 [main-SendThread] WARN
org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown
input
java.net.SocketException: Transport endpoint is not connected
        at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
        at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
        at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
        at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)

Then another of the above follows immediately.

Then I see:

2010-02-21 18:23:55,583 [main-SendThread] INFO
org.apache.zookeeper.ClientCnxn: Attempting connection to server
14.u.XXX.com/X.X.X.141:2181
2010-02-21 18:24:00,423
[regionserver/208.76.44.140:60020.compactor-SendThread] WARN
org.apache.zookeeper.ClientCnxn: Exception closing session
0x226ed968a270003 to sun.nio.ch.SelectionKeyImpl@2a50e9a3
java.io.IOException: TIMED OUT
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)

The server 14.u.XXX.com was likely under load at the time.  The
ensemble member does not have a dedicated disk (this is a test
cluster).  Could that have been it?  I have logs from under the
14.u.XXX.com zk ensemble member but nothing in there, no exceptions at
least.

Here is a zk dump:

  Quorum Server Statistics:
    - 14.u.XXX.com:2181
        Zookeeper version: 3.2.2-888565, built on 12/08/2009 21:51 GMT
        Clients:
         /X.X.X.139:49535[1](queued=0,recved=194,sent=0)
         /X.X.X.141:48336[1](queued=0,recved=0,sent=0)

        Latency min/avg/max: 2/125/345
        Received: 194
        Sent: 0
        Outstanding: 0
        Zxid: 0x300000004
        Mode: follower
        Node count: 11
    - 13.u.XXX.com:2181
        Zookeeper version: 3.2.2-888565, built on 12/08/2009 21:51 GMT
        Clients:
         /fdfb:c1e8:7640:2:2e0:81ff:fe41:d9ae:41308[1](queued=0,recved=10,sent=0)
         /X.X.X.141:39900[1](queued=0,recved=3773,sent=0)
         /X.X.X.141:36583[1](queued=0,recved=0,sent=0)
         /X.X.X.139:59343[1](queued=0,recved=3773,sent=0)

        Latency min/avg/max: 0/7/81
        Received: 7556
        Sent: 0
        Outstanding: 0
        Zxid: 0x300000004
        Mode: leader
        Node count: 11
    - 15.u.XXX.com:2181
        Zookeeper version: 3.2.2-888565, built on 12/08/2009 21:51 GMT
        Clients:
         /fdfb:c1e8:7640:2:2e0:81ff:fe41:d9ae:50573[1](queued=0,recved=3773,sent=0)
         /X.X.X.141:57536[1](queued=0,recved=0,sent=0)

        Latency min/avg/max: 1/1/1
        Received: 3773
        Sent: 0
        Outstanding: 0
        Zxid: 0x300000004
        Mode: follower
        Node count: 11

Thanks for any pointers on how to debug.

St.Ack

Re: Bit of help debugging a TIMED OUT session please

Posted by Patrick Hunt <ph...@apache.org>.

Hard to say based on the bits/pieces of the log we have access to. I'd 
have to see the full log, preferably from both the server and client, to 
gain more insight.

re low numbers, this is the received count for the server, this should 
always increase never decrease. The fact that it is so low either 
indicates that the server recently restarted, or clients are not 
attaching to it. Seems like it should be near the other servers but 
again, hard to tell based on the small aperture we have via mail.

Patrick

Stack wrote:
> Thanks Patrick.  See below.
> 
> On Tue, Feb 23, 2010 at 1:19 PM, Patrick Hunt <ph...@apache.org> wrote:
>> Stack you might look at the following:
>>
>> 1) why does server 14 have such a low recv count?
>>
>>        Received: 194
>>
>> while the other servers are at 3.7k + received. Did server 14 fail at some
>> point? Or it's network? This may have caused the timeout seen by the client:
>>
> 
> Ok.  Will check into this the next time.  I did take the dump after
> the observed TIMED_OUT, a good while after.  Could this be why the
> numbers are low?
> 
>> ------snippet-----
>> 2010-02-21 18:23:55,583 [main-SendThread] INFO
>> org.apache.zookeeper.ClientCnxn: Attempting connection to server
>> 14.u.XXX.com/X.X.X.141:2181
>> 2010-02-21 18:24:00,423
>> [regionserver/208.76.44.140:60020.compactor-SendThread] WARN
>> org.apache.zookeeper.ClientCnxn: Exception closing session
>> 0x226ed968a270003 to sun.nio.ch.SelectionKeyImpl@2a50e9a3
>> java.io.IOException: TIMED OUT
>>        at
>> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
>> -----------
>>
>> 2) connection timeout is different from session timeout. connection timeout
>> is the amount of time we allow for connection establishment (socket open)
>> until the server accepts the connection, this value is the session timeout
>> (as requested by the client) divided by the number of hosts in the host
>> list. This could account for why the timeout (above snippet) occurred after
>> 5 seconds. What timeout value is this client using? 15 seconds?
>>
> We ask for a session timeout of 60 seconds -- the hbase default -- and
> our ticktime is 3 seconds.
> 
> You are not troubled at all by the exceptions closing sessions above?
>  Are these just noise?
> 
> Thanks for the input,
> St.Ack

Re: Bit of help debugging a TIMED OUT session please

Posted by Stack <st...@duboce.net>.

Thanks Patrick.  See below.

On Tue, Feb 23, 2010 at 1:19 PM, Patrick Hunt <ph...@apache.org> wrote:
> Stack you might look at the following:
>
> 1) why does server 14 have such a low recv count?
>
>        Received: 194
>
> while the other servers are at 3.7k + received. Did server 14 fail at some
> point? Or it's network? This may have caused the timeout seen by the client:
>

Ok.  Will check into this the next time.  I did take the dump after
the observed TIMED_OUT, a good while after.  Could this be why the
numbers are low?

> ------snippet-----
> 2010-02-21 18:23:55,583 [main-SendThread] INFO
> org.apache.zookeeper.ClientCnxn: Attempting connection to server
> 14.u.XXX.com/X.X.X.141:2181
> 2010-02-21 18:24:00,423
> [regionserver/208.76.44.140:60020.compactor-SendThread] WARN
> org.apache.zookeeper.ClientCnxn: Exception closing session
> 0x226ed968a270003 to sun.nio.ch.SelectionKeyImpl@2a50e9a3
> java.io.IOException: TIMED OUT
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> -----------
>
> 2) connection timeout is different from session timeout. connection timeout
> is the amount of time we allow for connection establishment (socket open)
> until the server accepts the connection, this value is the session timeout
> (as requested by the client) divided by the number of hosts in the host
> list. This could account for why the timeout (above snippet) occurred after
> 5 seconds. What timeout value is this client using? 15 seconds?
>
We ask for a session timeout of 60 seconds -- the hbase default -- and
our ticktime is 3 seconds.

You are not troubled at all by the exceptions closing sessions above?
 Are these just noise?

Thanks for the input,
St.Ack

Re: Bit of help debugging a TIMED OUT session please

Posted by Patrick Hunt <ph...@apache.org>.

Stack you might look at the following:

1) why does server 14 have such a low recv count?

         Received: 194

while the other servers are at 3.7k + received. Did server 14 fail at 
some point? Or it's network? This may have caused the timeout seen by 
the client:

------snippet-----
2010-02-21 18:23:55,583 [main-SendThread] INFO
org.apache.zookeeper.ClientCnxn: Attempting connection to server
14.u.XXX.com/X.X.X.141:2181
2010-02-21 18:24:00,423
[regionserver/208.76.44.140:60020.compactor-SendThread] WARN
org.apache.zookeeper.ClientCnxn: Exception closing session
0x226ed968a270003 to sun.nio.ch.SelectionKeyImpl@2a50e9a3
java.io.IOException: TIMED OUT
         at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
-----------

2) connection timeout is different from session timeout. connection 
timeout is the amount of time we allow for connection establishment 
(socket open) until the server accepts the connection, this value is the 
session timeout (as requested by the client) divided by the number of 
hosts in the host list. This could account for why the timeout (above 
snippet) occurred after 5 seconds. What timeout value is this client 
using? 15 seconds?

Patrick

Stack wrote:
> Dang.  Didn't save the log.  Pardon me.
> 
> I pasted exceptions only and thought it all about 0x26ed968d880001
> session but now I see that what I posted above has TIMED_OUT on
> another session altogether.  Above I skipped pasting exceptions
> thinking them on the same session but now it seems they probably were
> not.
> 
> I'm trying to track a case where zk seems of a sudden, client-side, to
> give up the ghost w/ exceptions like those pasted above --
> connectivity probs.  There has been pollution in here where long gc
> pauses that are > session timeout would trigger TIMED_OUT but those
> have been tamed.
> 
> I'll be back if I get another instance on my hook.
> 
> Meantime, thanks for the comments.
> 
> St.Ack
> 
> On Mon, Feb 22, 2010 at 6:43 PM, Mahadev Konar <ma...@yahoo-inc.com> wrote:
>> HI stack,
>>  the other interesting part is with the session:
>> 0x26ed968d880001
>>
>> Looks like it gets disconnected from one of the servers (TIMEOUT). DO you
>> see any of these messages: "Attempting connection to server" in the logs
>> before you see all the consecutive
>>
>> org.apache.zookeeper.ClientCnxn: Exception closing session
>> 0x26ed968d880001 to sun.nio.ch.SelectionKeyImpl@788ab708
>> java.io.IOException: Read error rc = -1
>> java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
>>        at
>> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
>>        at
>> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
>>
>> and....
>>
>>
>> From the cient 0x26ed968d880001?
>>
>> Thanks
>> mahadev
>>
>>
>> On 2/22/10 11:42 AM, "Stack" <st...@duboce.net> wrote:
>>
>>> The thing that seems odd to me is that the connectivity complaints are
>>> out of the zk client, right?, why is it failing getting to member 14
>>> and why not move to another ensemble member if issue w/ 14?, and if
>>> there were a general connectivity issue, I'd think that the running
>>> hbase cluster would be complaining at about the same time (its talking
>>> to datanodes and masters at this time).
>>>
>>> (Thanks for the input lads)
>>>
>>> St.Ack
>>>
>>>
>>> On Mon, Feb 22, 2010 at 11:26 AM, Mahadev Konar <ma...@yahoo-inc.com> wrote:
>>>> I also looked at the logs. Ted might have a point. It does look like that
>>>> zookeeper server's are doing fine (though as ted mentions the skew is a
>>>> little concerning, though that might be due to very few packets served by
>>>> the first server). Other than that the latencies of 300 ms at max should not
>>>> cause any timeouts.
>>>> Also, the number of packets received is pretty low - meaning that it wasn't
>>>> serving huge traffic. Is there anyway we can check if the network connection
>>>> from the client to the server is not flaky?
>>>>
>>>> Thanks
>>>> mahadev
>>>>
>>>>
>>>> On 2/22/10 10:40 AM, "Ted Dunning" <te...@gmail.com> wrote:
>>>>
>>>>> Not sure this helps at all, but these times are remarkably asymmetrical.  I
>>>>> would expect members of a ZK  cluster to have very comparable times.
>>>>>
>>>>> Additionally, 345 ms is nowhere near large enough to cause a session to
>>>>> expire.  My take is that ZK doesn't think it caused the timeout.
>>>>>
>>>>> On Mon, Feb 22, 2010 at 10:18 AM, Stack <st...@duboce.net> wrote:
>>>>>
>>>>>>        Latency min/avg/max: 2/125/345
>>>>>> ...
>>>>>>        Latency min/avg/max: 0/7/81
>>>>>> ...
>>>>>>        Latency min/avg/max: 1/1/1
>>>>>>
>>>>>> Thanks for any pointers on how to debug.
>>>>>>
>>>>
>>

Re: Bit of help debugging a TIMED OUT session please

Posted by Stack <st...@duboce.net>.

Dang.  Didn't save the log.  Pardon me.

I pasted exceptions only and thought it all about 0x26ed968d880001
session but now I see that what I posted above has TIMED_OUT on
another session altogether.  Above I skipped pasting exceptions
thinking them on the same session but now it seems they probably were
not.

I'm trying to track a case where zk seems of a sudden, client-side, to
give up the ghost w/ exceptions like those pasted above --
connectivity probs.  There has been pollution in here where long gc
pauses that are > session timeout would trigger TIMED_OUT but those
have been tamed.

I'll be back if I get another instance on my hook.

Meantime, thanks for the comments.

St.Ack

On Mon, Feb 22, 2010 at 6:43 PM, Mahadev Konar <ma...@yahoo-inc.com> wrote:
> HI stack,
>  the other interesting part is with the session:
> 0x26ed968d880001
>
> Looks like it gets disconnected from one of the servers (TIMEOUT). DO you
> see any of these messages: "Attempting connection to server" in the logs
> before you see all the consecutive
>
> org.apache.zookeeper.ClientCnxn: Exception closing session
> 0x26ed968d880001 to sun.nio.ch.SelectionKeyImpl@788ab708
> java.io.IOException: Read error rc = -1
> java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
>
> and....
>
>
> From the cient 0x26ed968d880001?
>
> Thanks
> mahadev
>
>
> On 2/22/10 11:42 AM, "Stack" <st...@duboce.net> wrote:
>
>> The thing that seems odd to me is that the connectivity complaints are
>> out of the zk client, right?, why is it failing getting to member 14
>> and why not move to another ensemble member if issue w/ 14?, and if
>> there were a general connectivity issue, I'd think that the running
>> hbase cluster would be complaining at about the same time (its talking
>> to datanodes and masters at this time).
>>
>> (Thanks for the input lads)
>>
>> St.Ack
>>
>>
>> On Mon, Feb 22, 2010 at 11:26 AM, Mahadev Konar <ma...@yahoo-inc.com> wrote:
>>> I also looked at the logs. Ted might have a point. It does look like that
>>> zookeeper server's are doing fine (though as ted mentions the skew is a
>>> little concerning, though that might be due to very few packets served by
>>> the first server). Other than that the latencies of 300 ms at max should not
>>> cause any timeouts.
>>> Also, the number of packets received is pretty low - meaning that it wasn't
>>> serving huge traffic. Is there anyway we can check if the network connection
>>> from the client to the server is not flaky?
>>>
>>> Thanks
>>> mahadev
>>>
>>>
>>> On 2/22/10 10:40 AM, "Ted Dunning" <te...@gmail.com> wrote:
>>>
>>>> Not sure this helps at all, but these times are remarkably asymmetrical.  I
>>>> would expect members of a ZK  cluster to have very comparable times.
>>>>
>>>> Additionally, 345 ms is nowhere near large enough to cause a session to
>>>> expire.  My take is that ZK doesn't think it caused the timeout.
>>>>
>>>> On Mon, Feb 22, 2010 at 10:18 AM, Stack <st...@duboce.net> wrote:
>>>>
>>>>>        Latency min/avg/max: 2/125/345
>>>>> ...
>>>>>        Latency min/avg/max: 0/7/81
>>>>> ...
>>>>>        Latency min/avg/max: 1/1/1
>>>>>
>>>>> Thanks for any pointers on how to debug.
>>>>>
>>>
>>>
>
>

Re: Bit of help debugging a TIMED OUT session please

Posted by Mahadev Konar <ma...@yahoo-inc.com>.

HI stack,
 the other interesting part is with the session:
0x26ed968d880001

Looks like it gets disconnected from one of the servers (TIMEOUT). DO you
see any of these messages: "Attempting connection to server" in the logs
before you see all the consecutive

org.apache.zookeeper.ClientCnxn: Exception closing session
0x26ed968d880001 to sun.nio.ch.SelectionKeyImpl@788ab708
java.io.IOException: Read error rc = -1
java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
        at 
org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
        at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)

and....


>From the cient 0x26ed968d880001?

Thanks
mahadev


On 2/22/10 11:42 AM, "Stack" <st...@duboce.net> wrote:

> The thing that seems odd to me is that the connectivity complaints are
> out of the zk client, right?, why is it failing getting to member 14
> and why not move to another ensemble member if issue w/ 14?, and if
> there were a general connectivity issue, I'd think that the running
> hbase cluster would be complaining at about the same time (its talking
> to datanodes and masters at this time).
> 
> (Thanks for the input lads)
> 
> St.Ack
> 
> 
> On Mon, Feb 22, 2010 at 11:26 AM, Mahadev Konar <ma...@yahoo-inc.com> wrote:
>> I also looked at the logs. Ted might have a point. It does look like that
>> zookeeper server's are doing fine (though as ted mentions the skew is a
>> little concerning, though that might be due to very few packets served by
>> the first server). Other than that the latencies of 300 ms at max should not
>> cause any timeouts.
>> Also, the number of packets received is pretty low - meaning that it wasn't
>> serving huge traffic. Is there anyway we can check if the network connection
>> from the client to the server is not flaky?
>> 
>> Thanks
>> mahadev
>> 
>> 
>> On 2/22/10 10:40 AM, "Ted Dunning" <te...@gmail.com> wrote:
>> 
>>> Not sure this helps at all, but these times are remarkably asymmetrical.  I
>>> would expect members of a ZK  cluster to have very comparable times.
>>> 
>>> Additionally, 345 ms is nowhere near large enough to cause a session to
>>> expire.  My take is that ZK doesn't think it caused the timeout.
>>> 
>>> On Mon, Feb 22, 2010 at 10:18 AM, Stack <st...@duboce.net> wrote:
>>> 
>>>>        Latency min/avg/max: 2/125/345
>>>> ...
>>>>        Latency min/avg/max: 0/7/81
>>>> ...
>>>>        Latency min/avg/max: 1/1/1
>>>> 
>>>> Thanks for any pointers on how to debug.
>>>> 
>> 
>>

Re: Bit of help debugging a TIMED OUT session please

Posted by Stack <st...@duboce.net>.

The thing that seems odd to me is that the connectivity complaints are
out of the zk client, right?, why is it failing getting to member 14
and why not move to another ensemble member if issue w/ 14?, and if
there were a general connectivity issue, I'd think that the running
hbase cluster would be complaining at about the same time (its talking
to datanodes and masters at this time).

(Thanks for the input lads)

St.Ack


On Mon, Feb 22, 2010 at 11:26 AM, Mahadev Konar <ma...@yahoo-inc.com> wrote:
> I also looked at the logs. Ted might have a point. It does look like that
> zookeeper server's are doing fine (though as ted mentions the skew is a
> little concerning, though that might be due to very few packets served by
> the first server). Other than that the latencies of 300 ms at max should not
> cause any timeouts.
> Also, the number of packets received is pretty low - meaning that it wasn't
> serving huge traffic. Is there anyway we can check if the network connection
> from the client to the server is not flaky?
>
> Thanks
> mahadev
>
>
> On 2/22/10 10:40 AM, "Ted Dunning" <te...@gmail.com> wrote:
>
>> Not sure this helps at all, but these times are remarkably asymmetrical.  I
>> would expect members of a ZK  cluster to have very comparable times.
>>
>> Additionally, 345 ms is nowhere near large enough to cause a session to
>> expire.  My take is that ZK doesn't think it caused the timeout.
>>
>> On Mon, Feb 22, 2010 at 10:18 AM, Stack <st...@duboce.net> wrote:
>>
>>>        Latency min/avg/max: 2/125/345
>>> ...
>>>        Latency min/avg/max: 0/7/81
>>> ...
>>>        Latency min/avg/max: 1/1/1
>>>
>>> Thanks for any pointers on how to debug.
>>>
>
>

Re: Bit of help debugging a TIMED OUT session please

Posted by Mahadev Konar <ma...@yahoo-inc.com>.

I also looked at the logs. Ted might have a point. It does look like that
zookeeper server's are doing fine (though as ted mentions the skew is a
little concerning, though that might be due to very few packets served by
the first server). Other than that the latencies of 300 ms at max should not
cause any timeouts.
Also, the number of packets received is pretty low - meaning that it wasn't
serving huge traffic. Is there anyway we can check if the network connection
from the client to the server is not flaky?

Thanks
mahadev

On 2/22/10 10:40 AM, "Ted Dunning" <te...@gmail.com> wrote:

> Not sure this helps at all, but these times are remarkably asymmetrical.  I
> would expect members of a ZK  cluster to have very comparable times.
> 
> Additionally, 345 ms is nowhere near large enough to cause a session to
> expire.  My take is that ZK doesn't think it caused the timeout.
> 
> On Mon, Feb 22, 2010 at 10:18 AM, Stack <st...@duboce.net> wrote:
> 
>>        Latency min/avg/max: 2/125/345
>> ...
>>        Latency min/avg/max: 0/7/81
>> ...
>>        Latency min/avg/max: 1/1/1
>> 
>> Thanks for any pointers on how to debug.
>>

Re: Bit of help debugging a TIMED OUT session please

Posted by Ted Dunning <te...@gmail.com>.

Not sure this helps at all, but these times are remarkably asymmetrical.  I
would expect members of a ZK  cluster to have very comparable times.

Additionally, 345 ms is nowhere near large enough to cause a session to
expire.  My take is that ZK doesn't think it caused the timeout.

On Mon, Feb 22, 2010 at 10:18 AM, Stack <st...@duboce.net> wrote:

>        Latency min/avg/max: 2/125/345
> ...
>        Latency min/avg/max: 0/7/81
> ...
>        Latency min/avg/max: 1/1/1
>
> Thanks for any pointers on how to debug.
>