You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@zookeeper.apache.org by Muhammad Gelbana <m....@gmail.com> on 2019/04/03 11:42:18 UTC

Unexpected delay between pings sent from the client to server

I'm trying to debug a problem where our client application suddenly loses
its Zookeeper session. I concluded that by looking at the Zookeeper server
logs.

I increased the logging details for the client and found the following log
messages

> DEBUG: [07:33:33] [demo | HA | Manager] Got ping response for sessionid:
> 0x3000da76fa904b6 after 0ms
> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
> DEBUG: [07:34:07] [demo | HA | Manager] Got ping response for sessionid:
> 0x3000da76fa904b6 after 0ms
> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
> DEBUG: [07:34:40] [demo | HA | Manager] Got ping response for sessionid:
> 0x3000da76fa904b6 after 0ms
> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
> DEBUG: [07:35:13] [demo | HA | Manager] Got ping response for sessionid:
> 0x3000da76fa904b6 after 0ms
> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
> DEBUG: [07:35:47] [demo | HA | Manager] Got ping response for sessionid:
> 0x3000da76fa904b6 after 0ms
> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
> DEBUG: [07:36:20] [demo | HA | Manager] Got ping response for sessionid:
> 0x3000da76fa904b6 after 0ms
> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
>

I noticed that the duration between each log message is ~33 seconds while
on another environment (my laptop), the duration goes down to ~1 second.
What could be causing this huge difference ? I doubt that whatever is
causing this effect causes the delay to increase significantly at some
point to the extend that makes my client lose its session.

Re: Unexpected delay between pings sent from the client to server

Posted by Patrick Hunt <ph...@apache.org>.
Try turning on verbose GC logging and look if it's GC. In my experience
that's often the cause. "client application suddenly loses
its Zookeeper session" - might be client GC, or it could be server GC. See
if either of those shed light.

Patrick

On Sat, Apr 6, 2019 at 1:10 PM Norbert Kalmar <nk...@cloudera.com.invalid>
wrote:

> Hi Gelbana,
>
> max_latency tells you the time elapsed between creating the request and
> FinalRequestProcessor processing it. So the cause for being that high could
> be basically anything.
> Turning on debug log for ZooKeeper server could help pinpoint at what point
> the request get stuck for so long.
>
> Regards,
> Norbert
>
> On Wed, Apr 3, 2019 at 3:07 PM Muhammad Gelbana <m....@gmail.com>
> wrote:
>
> > Another couple of things I found:
> >
> > *A couple of Zookeeper client threads are stuck at these stacktraces for
> > ~30 seconds*
> > "pool-2-thread-1-EventThread" #1218 daemon prio=5 os_prio=0
> > tid=0x00007ff3f5e23800 nid=0x5cd8 waiting on condition
> [0x00007ff3ef803000]
> >    java.lang.Thread.State: WAITING (parking)
> >     at sun.misc.Unsafe.park(Native Method)
> >     - parking to wait for  <0x000000018d6d8ed8> (a
> > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> >     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> >     at
> >
> >
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> >     at
> >
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> >     at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:504)
> >
> > "pool-2-thread-1-SendThread(72.55.136.25:2181)" #1217 daemon prio=5
> > os_prio=0 tid=0x00007ff3f5e23000 nid=0x5cd7 runnable [0x00007ff3ef904000]
> >    java.lang.Thread.State: RUNNABLE
> >     at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> >     at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
> >     at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
> >     at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
> >     - locked <0x000000018d68a730> (a sun.nio.ch.Util$3)
> >     - locked <0x000000018d68a720> (a
> java.util.Collections$UnmodifiableSet)
> >     - locked <0x000000018d68a258> (a sun.nio.ch.EPollSelectorImpl)
> >     at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
> >     at
> >
> >
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:349)
> >     at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1145)
> >
> > *Running the mntr command returned the following stats*
> > zk_version    3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03, built on
> > 06/29/2018 04:05 GMT
> > zk_avg_latency    0
> > zk_max_latency    *17657*
> > zk_min_latency    0
> > zk_packets_received    1427134
> > zk_packets_sent    1596974
> > zk_num_alive_connections    64
> > zk_outstanding_requests    0
> > zk_server_state    follower
> > zk_znode_count    1394
> > zk_watch_count    592
> > zk_ephemerals_count    192
> > zk_approximate_data_size    181257
> > zk_open_file_descriptor_count    94
> > zk_max_file_descriptor_count    1048576
> > zk_fsync_threshold_exceed_count    1
> >
> > I find the *zk_max_latency* extremely hight. I'm wondering what kind of
> > latency is that ? How can I debug the reason for this value ?
> >
> > Thanks,
> > Gelbana
> >
> >
> >
> > On Wed, Apr 3, 2019 at 1:42 PM Muhammad Gelbana <m....@gmail.com>
> > wrote:
> >
> > > I'm trying to debug a problem where our client application suddenly
> loses
> > > its Zookeeper session. I concluded that by looking at the Zookeeper
> > server
> > > logs.
> > >
> > > I increased the logging details for the client and found the following
> > log
> > > messages
> > >
> > >> DEBUG: [07:33:33] [demo | HA | Manager] Got ping response for
> sessionid:
> > >> 0x3000da76fa904b6 after 0ms
> > >> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
> > >> DEBUG: [07:34:07] [demo | HA | Manager] Got ping response for
> sessionid:
> > >> 0x3000da76fa904b6 after 0ms
> > >> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
> > >> DEBUG: [07:34:40] [demo | HA | Manager] Got ping response for
> sessionid:
> > >> 0x3000da76fa904b6 after 0ms
> > >> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
> > >> DEBUG: [07:35:13] [demo | HA | Manager] Got ping response for
> sessionid:
> > >> 0x3000da76fa904b6 after 0ms
> > >> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
> > >> DEBUG: [07:35:47] [demo | HA | Manager] Got ping response for
> sessionid:
> > >> 0x3000da76fa904b6 after 0ms
> > >> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
> > >> DEBUG: [07:36:20] [demo | HA | Manager] Got ping response for
> sessionid:
> > >> 0x3000da76fa904b6 after 0ms
> > >> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
> > >>
> > >
> > > I noticed that the duration between each log message is ~33 seconds
> while
> > > on another environment (my laptop), the duration goes down to ~1
> second.
> > > What could be causing this huge difference ? I doubt that whatever is
> > > causing this effect causes the delay to increase significantly at some
> > > point to the extend that makes my client lose its session.
> > >
> >
>

Re: Unexpected delay between pings sent from the client to server

Posted by Norbert Kalmar <nk...@cloudera.com.INVALID>.
Hi Gelbana,

max_latency tells you the time elapsed between creating the request and
FinalRequestProcessor processing it. So the cause for being that high could
be basically anything.
Turning on debug log for ZooKeeper server could help pinpoint at what point
the request get stuck for so long.

Regards,
Norbert

On Wed, Apr 3, 2019 at 3:07 PM Muhammad Gelbana <m....@gmail.com> wrote:

> Another couple of things I found:
>
> *A couple of Zookeeper client threads are stuck at these stacktraces for
> ~30 seconds*
> "pool-2-thread-1-EventThread" #1218 daemon prio=5 os_prio=0
> tid=0x00007ff3f5e23800 nid=0x5cd8 waiting on condition [0x00007ff3ef803000]
>    java.lang.Thread.State: WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for  <0x000000018d6d8ed8> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>     at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:504)
>
> "pool-2-thread-1-SendThread(72.55.136.25:2181)" #1217 daemon prio=5
> os_prio=0 tid=0x00007ff3f5e23000 nid=0x5cd7 runnable [0x00007ff3ef904000]
>    java.lang.Thread.State: RUNNABLE
>     at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>     at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>     at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>     at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>     - locked <0x000000018d68a730> (a sun.nio.ch.Util$3)
>     - locked <0x000000018d68a720> (a java.util.Collections$UnmodifiableSet)
>     - locked <0x000000018d68a258> (a sun.nio.ch.EPollSelectorImpl)
>     at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>     at
>
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:349)
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1145)
>
> *Running the mntr command returned the following stats*
> zk_version    3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03, built on
> 06/29/2018 04:05 GMT
> zk_avg_latency    0
> zk_max_latency    *17657*
> zk_min_latency    0
> zk_packets_received    1427134
> zk_packets_sent    1596974
> zk_num_alive_connections    64
> zk_outstanding_requests    0
> zk_server_state    follower
> zk_znode_count    1394
> zk_watch_count    592
> zk_ephemerals_count    192
> zk_approximate_data_size    181257
> zk_open_file_descriptor_count    94
> zk_max_file_descriptor_count    1048576
> zk_fsync_threshold_exceed_count    1
>
> I find the *zk_max_latency* extremely hight. I'm wondering what kind of
> latency is that ? How can I debug the reason for this value ?
>
> Thanks,
> Gelbana
>
>
>
> On Wed, Apr 3, 2019 at 1:42 PM Muhammad Gelbana <m....@gmail.com>
> wrote:
>
> > I'm trying to debug a problem where our client application suddenly loses
> > its Zookeeper session. I concluded that by looking at the Zookeeper
> server
> > logs.
> >
> > I increased the logging details for the client and found the following
> log
> > messages
> >
> >> DEBUG: [07:33:33] [demo | HA | Manager] Got ping response for sessionid:
> >> 0x3000da76fa904b6 after 0ms
> >> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
> >> DEBUG: [07:34:07] [demo | HA | Manager] Got ping response for sessionid:
> >> 0x3000da76fa904b6 after 0ms
> >> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
> >> DEBUG: [07:34:40] [demo | HA | Manager] Got ping response for sessionid:
> >> 0x3000da76fa904b6 after 0ms
> >> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
> >> DEBUG: [07:35:13] [demo | HA | Manager] Got ping response for sessionid:
> >> 0x3000da76fa904b6 after 0ms
> >> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
> >> DEBUG: [07:35:47] [demo | HA | Manager] Got ping response for sessionid:
> >> 0x3000da76fa904b6 after 0ms
> >> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
> >> DEBUG: [07:36:20] [demo | HA | Manager] Got ping response for sessionid:
> >> 0x3000da76fa904b6 after 0ms
> >> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
> >>
> >
> > I noticed that the duration between each log message is ~33 seconds while
> > on another environment (my laptop), the duration goes down to ~1 second.
> > What could be causing this huge difference ? I doubt that whatever is
> > causing this effect causes the delay to increase significantly at some
> > point to the extend that makes my client lose its session.
> >
>

Re: Unexpected delay between pings sent from the client to server

Posted by Muhammad Gelbana <m....@gmail.com>.
Another couple of things I found:

*A couple of Zookeeper client threads are stuck at these stacktraces for
~30 seconds*
"pool-2-thread-1-EventThread" #1218 daemon prio=5 os_prio=0
tid=0x00007ff3f5e23800 nid=0x5cd8 waiting on condition [0x00007ff3ef803000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x000000018d6d8ed8> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:504)

"pool-2-thread-1-SendThread(72.55.136.25:2181)" #1217 daemon prio=5
os_prio=0 tid=0x00007ff3f5e23000 nid=0x5cd7 runnable [0x00007ff3ef904000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
    - locked <0x000000018d68a730> (a sun.nio.ch.Util$3)
    - locked <0x000000018d68a720> (a java.util.Collections$UnmodifiableSet)
    - locked <0x000000018d68a258> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
    at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:349)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1145)

*Running the mntr command returned the following stats*
zk_version    3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03, built on
06/29/2018 04:05 GMT
zk_avg_latency    0
zk_max_latency    *17657*
zk_min_latency    0
zk_packets_received    1427134
zk_packets_sent    1596974
zk_num_alive_connections    64
zk_outstanding_requests    0
zk_server_state    follower
zk_znode_count    1394
zk_watch_count    592
zk_ephemerals_count    192
zk_approximate_data_size    181257
zk_open_file_descriptor_count    94
zk_max_file_descriptor_count    1048576
zk_fsync_threshold_exceed_count    1

I find the *zk_max_latency* extremely hight. I'm wondering what kind of
latency is that ? How can I debug the reason for this value ?

Thanks,
Gelbana



On Wed, Apr 3, 2019 at 1:42 PM Muhammad Gelbana <m....@gmail.com> wrote:

> I'm trying to debug a problem where our client application suddenly loses
> its Zookeeper session. I concluded that by looking at the Zookeeper server
> logs.
>
> I increased the logging details for the client and found the following log
> messages
>
>> DEBUG: [07:33:33] [demo | HA | Manager] Got ping response for sessionid:
>> 0x3000da76fa904b6 after 0ms
>> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
>> DEBUG: [07:34:07] [demo | HA | Manager] Got ping response for sessionid:
>> 0x3000da76fa904b6 after 0ms
>> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
>> DEBUG: [07:34:40] [demo | HA | Manager] Got ping response for sessionid:
>> 0x3000da76fa904b6 after 0ms
>> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
>> DEBUG: [07:35:13] [demo | HA | Manager] Got ping response for sessionid:
>> 0x3000da76fa904b6 after 0ms
>> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
>> DEBUG: [07:35:47] [demo | HA | Manager] Got ping response for sessionid:
>> 0x3000da76fa904b6 after 0ms
>> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
>> DEBUG: [07:36:20] [demo | HA | Manager] Got ping response for sessionid:
>> 0x3000da76fa904b6 after 0ms
>> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
>>
>
> I noticed that the duration between each log message is ~33 seconds while
> on another environment (my laptop), the duration goes down to ~1 second.
> What could be causing this huge difference ? I doubt that whatever is
> causing this effect causes the delay to increase significantly at some
> point to the extend that makes my client lose its session.
>