Posted to issues@zookeeper.apache.org by "Sebastian Schmitz (Jira)" <ji...@apache.org> on 2020/05/15 01:20:00 UTC

[jira] [Issue Comment Deleted] (ZOOKEEPER-3822) Zookeeper 3.6.1 EndOfStreamException

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schmitz updated ZOOKEEPER-3822:
-----------------------------------------
    Comment: was deleted

(was: Interestingly this didn't happen again even though I deployed everything around six more times in both environments.)

> Zookeeper 3.6.1 EndOfStreamException
> ------------------------------------
>
>                 Key: ZOOKEEPER-3822
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3822
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.6.1
>            Reporter: Sebastian Schmitz
>            Priority: Critical
>         Attachments: kafka.log, kafka_test.log, zookeeper.log, zookeeper_test.log
>
>
> Hello,
> after Zookeeper 3.6.1 solved the issue with leader-election containing the IP and so causing it to fail in separate networks, like in our docker-setup I updated from 3.4.14 to 3.6.1 in Dev- and Test-Environments. It all went smoothly and ran for one day. This night I had a new Update of the environment as we deploy as a whole package of all containers (Kafka, Zookeeper, Mirrormaker etc.) we also replace the Zookeeper-Containers with latest ones. In this case, there was no change, the containers were just removed and deployed again. As the config and data of zookeeper is not stored inside the containers that's not a problem but this night it broke the whole clusters of Zookeeper and so also Kafka was down.
>  * zookeeper_node_1 was stopped and the container removed and created again
>  * zookeeper_node_1 starts up and the election takes place
>  * zookeeper_node_2 is elected as leader again
>  * zookeeper_node_2 is stopped and the container removed and created again
>  * zookeeper_node_3 is elected as the leader while zookeeper_node_2 is down
>  * zookeeper_node_2 starts up and zookeeper_node_3 remains leader
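> (For reference, which role each node holds after each of these steps can be checked with the "srvr" four-letter command on the client port; "srvr" should be on the default 4lw whitelist in 3.6. The snippet below is only a minimal sketch using the placeholder IPs from this report and the client port 2181, not our actual tooling:)
>
>     import java.io.ByteArrayOutputStream;
>     import java.io.IOException;
>     import java.net.Socket;
>     import java.nio.charset.StandardCharsets;
>
>     // Minimal sketch: ask each ensemble member for its "srvr" output and print its Mode line
>     // (Mode: leader / follower), the same information "echo srvr | nc <host> 2181" would give.
>     public class WhoIsLeader {
>         static String srvr(String host, int port) throws IOException {
>             try (Socket s = new Socket(host, port)) {
>                 s.getOutputStream().write("srvr".getBytes(StandardCharsets.US_ASCII));
>                 s.shutdownOutput();
>                 ByteArrayOutputStream out = new ByteArrayOutputStream();
>                 s.getInputStream().transferTo(out);
>                 return out.toString(StandardCharsets.US_ASCII);
>             }
>         }
>
>         public static void main(String[] args) throws IOException {
>             for (String host : new String[] {"x.x.x.x", "y.y.y.y", "z.z.z.z"}) {
>                 String mode = srvr(host, 2181).lines()
>                         .filter(l -> l.startsWith("Mode:"))
>                         .findFirst().orElse("no Mode line in reply");
>                 System.out.println(host + " -> " + mode);
>             }
>         }
>     }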
> And from there all servers just report
> 2020-05-07 14:07:57,187 [myid:3] - WARN  [NIOWorkerThread-2:NIOServerCnxn@364] - Unexpected exception
> EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /z.z.z.z:46060, session = 0x2014386bbde0000
>     at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
>     at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
>     at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>     at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.base/java.lang.Thread.run(Unknown Source)
> and don't recover.
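> (For context, that exception on its own only means the remote end closed its socket while the server was reading; illustratively, and this is just a sketch of the failed-read check, not the actual org.apache.zookeeper.server.NIOServerCnxn code:)
>
>     import java.io.EOFException;
>     import java.io.IOException;
>     import java.nio.ByteBuffer;
>     import java.nio.channels.SocketChannel;
>
>     // Illustrative sketch only: end-of-stream is simply read() returning -1, i.e. the peer
>     // (a client, possibly one of the Kafka brokers on the same host) closed its socket.
>     class FailedReadSketch {
>         static void readPacket(SocketChannel channel, ByteBuffer buf) throws IOException {
>             int rc = channel.read(buf);
>             if (rc < 0) {
>                 throw new EOFException(
>                         "Unable to read additional data from client, it probably closed the socket");
>             }
>         }
>     }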
> I was able to recover the cluster in the Test environment by stopping and starting all of the ZooKeeper nodes. The cluster in Dev is still in that state, and I'm checking the logs to find out more...
> The full logs of the deployment of ZooKeeper and Kafka that started at 02:00 are attached. The first time is in local NZ time and the second one is in UTC. The IPs I replaced are x.x.x.x for node_1, y.y.y.y for node_2, and z.z.z.z for node_3.
> The Kafka servers are running on the same machine, which means the EndOfStreamExceptions could also be connections from Kafka, as I don't think that zookeeper_node_3 would establish a session with itself?
>  
> Edit:
>  I just found some interesting log entries from the Test environment:
>  zookeeper_node_1: 2020-05-07 14:10:29,418 [myid:1] INFO  [NIOWorkerThread-6:ZooKeeperServer@1375] Refusing session request for client /f.f.f.f:42012 as it has seen zxid 0xc600000000 our last zxid is 0xc5000028f8 client must try another server
>  zookeeper_node_2: 2020-05-07 14:10:29,680 [myid:2] INFO  [NIOWorkerThread-4:ZooKeeperServer@1375] Refusing session request for client /f.f.f.f:51506 as it has seen zxid 0xc600000000 our last zxid is 0xc5000028f8 client must try another server
>  These entries are repeated there before the EndOfStreamException shows up...
>  I found that this zxid was set by zookeeper_node_3:
>  zookeeper_node_3: 2020-05-07 14:09:44,495 [myid:3] INFO  [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):Leader@1501] Have quorum of supporters, sids: [[1, 3],[1, 3]]; starting up and setting last processed zxid: 0xc600000000
>  zookeeper_node_3: 2020-05-07 14:10:12,587 [myid:3] INFO  [LearnerHandler-/z.z.z.z:60156:LearnerHandler@800] Synchronizing with Learner sid: 2 maxCommittedLog=0xc5000028f8 minCommittedLog=0xc500002704 lastProcessedZxid=0xc600000000 peerLastZxid=0xc5000028f8
>  It looks like this zxid update didn't reach nodes 1 and 2, and so they refuse connections from clients?
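> (To make those numbers concrete: a zxid is a 64-bit value with the leader epoch in the high 32 bits and a counter in the low 32 bits, so 0xc600000000 is epoch 0xc6 with counter 0, the epoch node 3 set when it became leader, while 0xc5000028f8 is still epoch 0xc5. A minimal decoding sketch:)
>
>     // Minimal sketch: decode the zxids from the log lines above.
>     // A zxid packs the leader epoch into its high 32 bits and a counter into its low 32 bits.
>     public class ZxidDecode {
>         static String decode(long zxid) {
>             return String.format("epoch=0x%x counter=0x%x", zxid >>> 32, zxid & 0xffffffffL);
>         }
>
>         public static void main(String[] args) {
>             System.out.println("seen by client:   " + decode(0xc600000000L)); // epoch=0xc6 counter=0x0
>             System.out.println("nodes 1 and 2 at: " + decode(0xc5000028f8L)); // epoch=0xc5 counter=0x28f8
>         }
>     }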
> I have also added the full logs for our Test environment now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)