You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by "Gareth Humphries (JIRA)" <ji...@apache.org> on 2016/02/15 11:12:18 UTC

[jira] [Updated] (ZOOKEEPER-2357) Unhandled errors propagating through cluster

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gareth Humphries updated ZOOKEEPER-2357:
----------------------------------------
    Priority: Major  (was: Minor)

> Unhandled errors propagating through cluster
> --------------------------------------------
>
>                 Key: ZOOKEEPER-2357
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2357
>             Project: ZooKeeper
>          Issue Type: Task
>          Components: leaderElection, quorum, server
>    Affects Versions: 3.4.6
>            Reporter: Gareth Humphries
>
> Hi,
> I need some help understanding a recurring problem we're seeing with our zookeeper cluster.  It's a five node cluster that ordinarily runs fine.  Occasionally we see an error from which the cluster recovers, but it causes a lot of grief and I'm sure is representative of an unhealthy situation.
> To my eye it looks like an invalid bit of data getting into the system and not being handled gracefully; I'm the first to say my eye is not expert though, so I humbly submit an annotated log exert in the hope some who knows more than me can provide some illumination.
> The cluster seems to be ticking along fine, until we get errors on 2 of the 5 nodes like so:
> 2016-01-19 13:12:49,698 - WARN  [QuorumPeer[myid=1]/0.0.0.0:2181:Follower@89] - Exception when following the leader
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>         at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2016-01-19 13:12:49,698 - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:Follower@166] - shutdown called
> java.lang.Exception: shutdown Follower
>         at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:790)
> This is immediately followed by 380 occurences of:
> 2016-01-19 13:12:49,699 - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /X.Y.Z.56:59028 which had sessionid 0x151b01ee8330234
> and a:
> 2016-01-19 13:12:49,766 - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:FollowerZooKeeperServer@139] - Shutting down
> 2016-01-19 13:12:49,766 - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@441] - shutting down
> 2016-01-19 13:12:49,766 - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:FollowerRequestProcessor@105] - Shutting down
> 2016-01-19 13:12:49,766 - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:CommitProcessor@181] - Shutting down
> 2016-01-19 13:12:49,766 - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:FinalRequestProcessor@415] - shutdown of request processor complete
> 2016-01-19 13:12:49,767 - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:SyncRequestProcessor@209] - Shutting down
> 2016-01-19 13:12:49,767 - INFO  [CommitProcessor:1:CommitProcessor@150] - CommitProcessor exited loop!
> 2016-01-19 13:12:49,767 - INFO  [FollowerRequestProcessor:1:FollowerRequestProcessor@95] - FollowerRequestProcessor exited loop!
> 2016-01-19 13:13:09,418 - WARN  [SyncThread:1:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:1 took 30334ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
> 2016-01-19 13:13:09,427 - WARN  [SyncThread:1:SendAckRequestProcessor@64] - Closing connection to leader, exception during packet send
> java.net.SocketException: Socket closed
>         at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:121)
>         at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>         at org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:139)
>         at org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:62)
>         at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:204)
>         at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:131)
> 2016-01-19 13:13:09,428 - INFO  [SyncThread:1:SyncRequestProcessor@187] - SyncRequestProcessor exited!
> As a small aside, the fsync log errors for the first two servers to be hit are:
> 2016-01-19 13:13:09,418 - WARN  [SyncThread:1:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:1
> took 30334ms which will adversely effect operation latency.
> 2016-01-19 13:13:09,539 - WARN  [SyncThread:2:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:2
> took 30456ms which will adversely effect operation latency.
> If rewind from date of the entry the milliseconds given, you arrive within one millisecond of the same time on each server.
> But I digress.
> For the next 12 minutes or so, the logs are full of the below sort of exceptions, in seemingly no consistent order or frequency:
> 2016-01-19 13:13:09,440 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2016-01-19 13:13:09,441 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /X.Y.Z.181:51381 (no session established for client)
> 2016-01-19 13:13:09,443 - WARN  [QuorumPeer[myid=1]/0.0.0.0:2181:SendAckRequestProcessor@64] - Closing connection to leader, exception during packet send
> java.net.SocketException: Socket closed
>         at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:121)
>         at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>         at org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:139)
>         at org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:62)
>         at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:204)
>         at org.apache.zookeeper.server.SyncRequestProcessor.shutdown(SyncRequestProcessor.java:216)
>         at org.apache.zookeeper.server.quorum.FollowerZooKeeperServer.shutdown(FollowerZooKeeperServer.java:147)
>         at org.apache.zookeeper.server.quorum.Learner.shutdown(Learner.java:546)
>         at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:167)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:790)
> 2016-01-19 13:13:09,443 - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumPeer@714] - LOOKING
> 2016-01-19 13:13:11,782 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid 0x1525a047dc20005, likely client has closed socket
>         at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
>         at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>         at java.lang.Thread.run(Thread.java:744)
> 2016-01-19 13:13:11,783 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /X.Y.Z.1:59576 which had sessionid 0x1525a047dc20005
> 2016-01-19 13:13:11,784 - ERROR [CommitProcessor:1:NIOServerCnxn@178] - Unexpected Exception:
> java.nio.channels.CancelledKeyException
>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
>         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
>         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
>         at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
> 2016-01-19 13:25:43,898 - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message format version), 2 (n.leader), 0x2a001d352d (n.zxid), 0xb (n.round), LOOKING (n.state), 2 (n.sid), 0x2a (n.peerEpoch) FOLLOWING (my state)
> 2016-01-19 13:25:43,901 - WARN  [QuorumPeer[myid=1]/0.0.0.0:2181:Follower@89] - Exception when following the leader
> java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(SocketInputStream.java:152)
>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>         at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2016-01-19 13:25:43,901 - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:Follower@166] - shutdown called
> java.lang.Exception: shutdown Follower
>         at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:790)
> Until eventually we get to:
> 2016-01-19 13:26:05,099 - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumPeer@784] - FOLLOWING
> 2016-01-19 13:26:05,099 - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@162] - Created server with tickTime 2000 minSessionTimeout 4000
> maxSessionTimeout 40000 datadir /var/lib/zookeeper_1/data/version-2 snapdir /var/lib/zookeeper_1/data/version-2
> 2016-01-19 13:26:05,099 - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:Follower@63] - FOLLOWING - LEADER ELECTION TOOK - 21179
> 2016-01-19 13:26:05,100 - WARN  [QuorumPeer[myid=1]/0.0.0.0:2181:Learner@233] - Unexpected exception, tries=0, connecting to zoo005/X.Y.Z.71:2888
> java.net.ConnectException: Connection refused
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>         at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:579)
>         at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:225)
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> And things start to come right.  Right about now, the three member which had so far escaped begin to exhibit the same behaviour.
> Again, if we look at the fsync messages:
> 2016-01-19 13:26:06,192 - WARN  [SyncThread:3:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:3
> took 51394ms which will adversely effect operation latency.
> 2016-01-19 13:26:05,960 - WARN  [SyncThread:4:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:4
> took 51162ms which will adversely effect operation latency.
> 2016-01-19 13:26:04,524 - WARN  [SyncThread:5:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:5
> took 49726ms which will adversely effect operation latency.
> And we rewind the number of milliseconds from the log entry timestamp, we arrive at exactly 13:25:14,798 for all three events.
> So, it looks for all the world like something entered the system at 13:12:39,084, caused havoc on two nodes for 12.5 minutes, then at 13:25:14,798 it got off those and made to the other three, where it again caused havoc, before things eventually recovered and the world kept on ticking, only a medium sized log explosion worse for it.
> There is nothing in any of the logs within a second of either of those times.
> I'm hoping someone familiar with the code can look at those stack traces and understand what might cause such an incident.
> I'm to help anyway I can.  I have more complete logs, and we see this every couple of weeks or so, so can setup some additional logging if it would be of value.  Let me know.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)