Posted to dev@zookeeper.apache.org by "karl (Jira)" <ji...@apache.org> on 2021/04/29 03:43:00 UTC

[jira] [Created] (ZOOKEEPER-4288) zkServer 'ruok' responds 'imok' but any request (including zkCli.sh) fails with 'client session timed out'

karl created ZOOKEEPER-4288:
-------------------------------

             Summary: zkServer 'ruok' responds 'imok' but any request (including zkCli.sh) fails with 'client session timed out'
                 Key: ZOOKEEPER-4288
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4288
             Project: ZooKeeper
          Issue Type: Bug
          Components: server
    Affects Versions: 3.5.5
            Reporter: karl
         Attachments: itoa-zookeeper-dp1-5bc799c97f-t7wpz.jstack.txt, itoa-zookeeper-dp1-5bc799c97f-t7wpz.log, itoa-zookeeper-dp2-6d89756876-k2ds2.log

I'm running ZooKeeper from the 3.5.5 Docker image in k8s. The 3 pods are located on three nodes and communicate with each other through k8s ClusterIP services. I did the following: shut down node2 for about 3 hours and then brought it back into the cluster. After that, I found that ZooKeeper could not respond to *{color:#FF0000}any requests (including zkCli.sh){color}*. I used the ruok and mntr commands to test the servers and found only one anomaly: outstanding_requests is abnormally high and keeps growing (I have a Kafka cluster using this ensemble).
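A minimal sketch of how each server's role can be confirmed after node2 rejoins, assuming the commands are run from a pod that can resolve the k8s service names:

{code:java}
# minimal sketch: check each server's quorum role after node2 rejoins
# (assumes nc is available and the service names below resolve inside the cluster)
for svc in itoa-zookeeper-service1 itoa-zookeeper-service2 itoa-zookeeper-service3; do
  echo "== $svc"
  echo srvr | nc "$svc" 2181 | grep Mode
done
{code}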

 

my cluster:

 
{code:java}
//my zookeeper pods
[root@node1 ~]# kubectl get po -nsa -o wide | grep zookeeper-dp
itoa-zookeeper-dp1-5bc799c97f-t7wpz                        1/1     Running            0          17h   177.177.166.244   node1   <none>           <none>
itoa-zookeeper-dp2-6d89756876-k2ds2                        1/1     Running            0          13h   177.177.104.35    node2   <none>           <none>
itoa-zookeeper-dp3-b55ddd6c7-8h5p7                         1/1     Running            0          16h   177.177.135.54    node3   <none>           <none>

//my zookeeper cluster svc
[root@node1 ~]# kubectl get svc -nsa -o wide | grep zookeeper-dp
itoa-zookeeper-service1                          ClusterIP   10.96.235.75    <none>        2181/TCP,2888/TCP,3888/TCP                                                                          44h   node=itoa-zookeeper-dp1
itoa-zookeeper-service2                          ClusterIP   10.96.161.122   <none>        2181/TCP,2888/TCP,3888/TCP                                                                          44h   node=itoa-zookeeper-dp2
itoa-zookeeper-service3                          ClusterIP   10.96.67.1      <none>        2181/TCP,2888/TCP,3888/TCP                                                                          44h   node=itoa-zookeeper-dp3{code}
 

 

my cluster conf:

 
{code:java}
[root@node1 ~]# echo conf | nc 177.177.166.244 2181
clientPort=2181
secureClientPort=-1
dataDir=/opt/zookeeper/data/version-2
dataDirSize=135540411
dataLogDir=/opt/zookeeper/data/version-2
dataLogSize=135540411
tickTime=2000
maxClientCnxns=60
minSessionTimeout=4000
maxSessionTimeout=40000
serverId=1
initLimit=10
syncLimit=5
electionAlg=3
electionPort=3888
quorumPort=2888
peerType=0
membership:
server.1=0.0.0.0:2888:3888:participant
server.2=itoa-zookeeper-service2:2888:3888:participant
server.3=itoa-zookeeper-service3:2888:3888:participant
version=0
{code}
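The membership above should correspond to a static zoo.cfg roughly like this on server.1 (a reconstruction from the conf output, not the actual file in the pod):

{code:java}
# hypothetical zoo.cfg for server.1, reconstructed from the conf output above
tickTime=2000
initLimit=10
syncLimit=5
clientPort=2181
maxClientCnxns=60
dataDir=/opt/zookeeper/data
server.1=0.0.0.0:2888:3888
server.2=itoa-zookeeper-service2:2888:3888
server.3=itoa-zookeeper-service3:2888:3888
{code}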
 

ruok & mntr responses

 
{code:java}
//177.177.166.244

[root@node1 ~]# echo ruok | nc 177.177.166.244 2181 && echo mntr | nc 177.177.166.244 2181
imokzk_version  3.5.5-390fe37ea45dee01bf87dc1c042b5e3dcce88653, built on 05/03/2019 12:07 GMT
zk_avg_latency  2
zk_max_latency  2
zk_min_latency  2
zk_packets_received     740
zk_packets_sent 105
zk_num_alive_connections        1
zk_outstanding_requests 634
zk_server_state follower
zk_znode_count  8092
zk_watch_count  0
zk_ephemerals_count     5
zk_approximate_data_size        708815
zk_open_file_descriptor_count   100
zk_max_file_descriptor_count    1048576

//177.177.135.54
[root@node1 ~]# echo ruok | nc 177.177.135.54 2181 && echo mntr | nc 177.177.135.54 2181
imokzk_version  3.5.5-390fe37ea45dee01bf87dc1c042b5e3dcce88653, built on 05/03/2019 12:07 GMT
zk_avg_latency  0
zk_max_latency  0
zk_min_latency  0
zk_packets_received     727
zk_packets_sent 107
zk_num_alive_connections        1
zk_outstanding_requests 619
zk_server_state leader
zk_znode_count  8092
zk_watch_count  0
zk_ephemerals_count     5
zk_approximate_data_size        708815
zk_open_file_descriptor_count   101
zk_max_file_descriptor_count    1048576
zk_followers    1
zk_synced_followers     1
zk_pending_syncs        0
zk_last_proposal_size   32
zk_max_proposal_size    162
zk_min_proposal_size    32

{code}
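A rough sketch for sampling zk_outstanding_requests over time, assuming nc can reach the pod IPs above:

{code:java}
# rough sketch: sample zk_outstanding_requests on the follower and the leader every 10 seconds
while true; do
  for ip in 177.177.166.244 177.177.135.54; do
    printf '%s %s ' "$(date +%T)" "$ip"
    echo mntr | nc "$ip" 2181 | grep zk_outstanding_requests
  done
  sleep 10
done
{code}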
 

zkCli.sh responds with 'client session timed out'

 
{code:java}
[zk: itoa-zookeeper-service1:2181,itoa-zookeeper-service2:2181,itoa-zookeeper-service3:2181(CONNECTING) 0] ls /
2021-04-29 11:08:22,459 [myid:itoa-zookeeper-service3:2181] - WARN  [main-SendThread(itoa-zookeeper-service3:2181):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server in 10007ms for sessionid 0x0
2021-04-29 11:08:22,459 [myid:itoa-zookeeper-service3:2181] - INFO  [main-SendThread(itoa-zookeeper-service3:2181):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server in 10007ms for sessionid 0x0, closing socket connection and attempting reconnect
KeeperErrorCode = ConnectionLoss for /
[zk: itoa-zookeeper-service1:2181,itoa-zookeeper-service2:2181,itoa-zookeeper-service3:2181(CONNECTING) 1] 2021-04-29 11:08:22,743 [myid:itoa-zookeeper-service2:2181] - INFO  [main-SendThread(itoa-zookeeper-service2:2181):ClientCnxn$SendThread@1112] - Opening socket connection to server itoa-zookeeper-service2/10.96.161.122:2181. Will not attempt to authenticate using SASL (unknown error)
2021-04-29 11:08:22,744 [myid:itoa-zookeeper-service2:2181] - INFO  [main-SendThread(itoa-zookeeper-service2:2181):ClientCnxn$SendThread@959] - Socket connection established, initiating session, client: /177.177.166.244:45456, server: itoa-zookeeper-service2/10.96.161.122:2181
2021-04-29 11:08:22,747 [myid:itoa-zookeeper-service2:2181] - INFO  [main-SendThread(itoa-zookeeper-service2:2181):ClientCnxn$SendThread@1240] - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2021-04-29 11:08:23,761 [myid:itoa-zookeeper-service1:2181] - INFO  [main-SendThread(itoa-zookeeper-service1:2181):ClientCnxn$SendThread@1112] - Opening socket connection to server itoa-zookeeper-service1/10.96.235.75:2181. Will not attempt to authenticate using SASL (unknown error)
2021-04-29 11:08:23,762 [myid:itoa-zookeeper-service1:2181] - INFO  [main-SendThread(itoa-zookeeper-service1:2181):ClientCnxn$SendThread@959] - Socket connection established, initiating session, client: /177.177.166.244:42428, server: itoa-zookeeper-service1/10.96.235.75:2181
{code}
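As far as I understand, four-letter-word commands such as ruok and mntr are answered directly by the connection layer and never enter the request processor pipeline, which would explain why 'imok' still comes back while every real request times out and zk_outstanding_requests keeps growing. A sketch of the contrast, using the follower's pod IP from above:

{code:java}
# four-letter words are still answered
echo ruok | nc 177.177.166.244 2181      # imok
echo srvr | nc 177.177.166.244 2181      # Mode: follower, latency stats, ...

# but a normal client request hangs and eventually times out
zkCli.sh -server itoa-zookeeper-service1:2181 ls /
{code}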
 

I shut down node2 at 2021-04-28 17:12 and brought it back up at 20:40. My pod logs are in the attachments.

Other information I collected:
 * The TCP connection information of one of the pods (no abnormalities in any of the 3 pods):

 
{code:java}
[root@node1 ~]# nsenter -n -t $(docker inspect -f '{{.State.Pid}}' $(docker ps | grep zookeeper-dp1 | grep -v pause | awk '{print $1}')) netstat -plan
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:35883 0.0.0.0:* LISTEN 109187/java
tcp 0 0 0.0.0.0:3888 0.0.0.0:* LISTEN 109187/java
tcp 0 0 0.0.0.0:8080 0.0.0.0:* LISTEN 109187/java
tcp 0 0 0.0.0.0:2181 0.0.0.0:* LISTEN 109187/java
tcp 0 0 177.177.166.244:3888 177.177.104.35:57886 ESTABLISHED 109187/java
tcp 0 0 177.177.166.244:42204 10.96.67.1:2888 ESTABLISHED 109187/java
tcp 0 0 177.177.166.244:3888 177.177.135.54:44462 ESTABLISHED 109187/java
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags Type State I-Node PID/Program name Path
unix 2 [ ACC ] STREAM LISTENING 2342381971 109187/java /tmp/.java_pid1.tmp
unix 2 [ ] STREAM CONNECTED 2243039212 109187/java
unix 2 [ ] STREAM CONNECTED 2243034965 109187/java
{code}
 
 * Jstack information of one of the pods (see attachment; the other two pods look similar). I suspect the large number of threads stuck in 'waiting on condition' is part of the problem.
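One way to tally the thread states in that dump (the file name is from the attachments):

{code:java}
# count JVM thread states in the attached jstack dump
grep 'java.lang.Thread.State' itoa-zookeeper-dp1-5bc799c97f-t7wpz.jstack.txt | sort | uniq -c | sort -rn
{code}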

--
This message was sent by Atlassian Jira
(v8.3.4#803005)