You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@zookeeper.apache.org by "Viktor Feklin (Jira)" <ji...@apache.org> on 2022/09/19 11:28:00 UTC
[jira] [Created] (ZOOKEEPER-4613) Zookeper client suddenly stops sending ping requests in idle state

Viktor Feklin created ZOOKEEPER-4613:
----------------------------------------

             Summary: Zookeper client suddenly stops sending ping requests in idle state
                 Key: ZOOKEEPER-4613
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4613
             Project: ZooKeeper
          Issue Type: Bug
          Components: java client
    Affects Versions: 3.5.7
            Reporter: Viktor Feklin
         Attachments: clint-lost-ping.log

Expected behavior: from documentation: "The session is kept alive by requests sent by the client. If the session is idle for a period of time that would timeout the session, the client will send a PING request to keep the session alive."
Actual behavior: periodically client not sending any ping after last successfuly request.

 

Client operates as expected (handling server notifications and sending pings during idle periods). But after handle another notification it suddenly stops sending any ping (no matter how long idle interval last). Client sleeps until external events is happend (server notification, or program call).
After wakeup client starts to check session timeouts, detects long pause and triggers connection to state SUSPENDED. Client "closing socket connection and attempting reconnect". If period of inactivity was smaller than the session timeout (session is alive) connection triggers to state RECONNECTED, otherwise new seession created and connection goes to state LOST and then to state RECONNECTED. 

 

In the attached log:
1) We have multiple clients working in one JVM.
2) There is a number of zookeeper servers in a cluster.
3) Zookeeper client with seession 0x3008266486f2172 connected to 10.36.193.111:2181 with negotiated timeout = 30_000
4) Client handles last notification at:

 
{code:java}
15:19:48,874 (mt-[15H15M11S3617]-[module-3]-SendThread(10.36.193.111:2181)) DEBUG org.apache.zookeeper.ClientCnxn$SendThread:923 - Reading reply sessionid:0x3008266486f2172, packet:: clientPath:/tc/default/supervisors/wf_1057 serverPath:/tc/default/supervisors/wf_1057 finished:false header:: 6131,4  replyHeader:: 6131,42964550082,0  request:: '/tc/default/supervisors/wf_1057,T  response:: #***,s{42964550082,42964550082,1663244388854,1663244388854,0,0,0,144258549042716267,897,0,42964550082}{code}
 

5) Client shows no activity from this moment until watcher notification from server (client wakeups and detects session timeout):

 
{code:java}
15:20:08,884 INFO  [stdout] (mt-[15H15M11S3617]-[module-3]-SendThread(10.36.193.111:2181))  15:20:08 WARN  org.apache.zookeeper.ClientCnxn$SendThread:1190 - Client session timed out, have not heard from server in 20010ms for sessionid 0x3008266486f2172
15:20:08,885 INFO  [stdout] (mt-[15H15M11S3617]-[module-3]-SendThread(10.36.193.111:2181))  15:20:08 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1238 - Client session timed out, have not heard from server in 20010ms for sessionid 0x3008266486f2172, closing socket connection and attempting reconnect{code}
 

6) Ping period = 10 seconds (actualy there were no pings in 20 seconds).
7) Annother clients (running in the same jvm and connected to the same zk server) operate normaly while broken client do nothing.

 
{code:java}
15:19:58,885 INFO  [stdout] (mt-[15H15M11S3617]-[module-5]-SendThread(10.36.193.111:2181))  15:19:58 DEBUG org.apache.zookeeper.ClientCnxn$SendThread:821 - Got ping response for sessionid: 0x3008266486f216b after 0ms
15:19:58,975 INFO  [stdout] (mt-[15H15M11S3617]-[module-4]-SendThread(10.36.193.111:2181))  15:19:58 DEBUG org.apache.zookeeper.ClientCnxn$SendThread:821 - Got ping response for sessionid: 0x3008266486f2171 after 1ms{code}
 

So we can claim: 
 * there is no long GC pauses on client (we also has GC log for period of incident without any issues)
 * there is no problem with zk server (no reboots etc, cause it successfuly handles another clints requests)    

We continuously see this behavior after zk version upgrade:
zk client 3.4.8 ->  3.5.7 (curator 2.12.0 -> 4.3.0)
zk server upgrade to 3.5.8 (runing with default configuration)

We have state recovery after session exiration - so connection loss trigger nodes recreation. Another clients react to add/remove events (doing some computations). Having thousands of clients repeadly reconnecting - we get anomaly work load even in idle state...

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)