You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@zookeeper.apache.org by mattgordon <ma...@crowdflower.com> on 2012/11/08 02:48:14 UTC

Connection closed exceptions with slow fsync and CancelledKeyExceptions

We have been trying to understand why our ZooKeeper cluster will occasionally
have a wave of connection closed exceptions. We have switched to
-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode for garbage collection with
no noticeable improvements.

The symptoms are:

(1) All nodes show messages like "fsync-ing the write ahead log in
SyncThread:0 took 6309ms which will adversely effect operation latency. See
the ZooKeeper troubleshooting guide" with times typically around 5 seconds.
At least once, this fsync appeared in the leaders log immediately before a
wave of:

ERROR [CommitProcessor:0:NIOServerCnxn@445] - Unexpected Exception: 
java.nio.channels.CancelledKeyException
	at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
	at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
	at
org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
	at
org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
	at
org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:171)
	at
org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)

Our clients received ZookeeperConnectionClosed exceptions at this time and
all traffic on the ZooKeeper cluster essentially went to zero for a moment
before resuming normal operation with new connections.

(2) Probably unrelated since I haven't correlated it temporally with the
client errors, but running "sudo strace -r -T -f -p 9574 -e
trace=fsync,fdatasync -o trace.txt" turns up some messages like "10581
0.000246 — SIGSEGV (Segmentation fault) @ 0 (0) —"


ZK Version: 3.3.4
Cluster has 5 nodes running in EC2

Here is a screenshot showing ZooKeeper network traffic going to zero at the
time of the connection closed exceptions: http://i.imgur.com/dfNh0.png



Anyone have ideas on what the cause of these "waves" of
CancelledKeyExceptions could be from?



--
View this message in context: http://zookeeper-user.578899.n2.nabble.com/Connection-closed-exceptions-with-slow-fsync-and-CancelledKeyExceptions-tp7578166.html
Sent from the zookeeper-user mailing list archive at Nabble.com.

Re: Connection closed exceptions with slow fsync and CancelledKeyExceptions

Posted by Vitalii Tymchyshyn <ti...@gmail.com>.

What's your filesystem & vm.dirty_ratio? May be your OS flushes a lot
sometimes. See e.g.
http://www.sysxperts.com/home/announce/vmdirtyratioandvmdirtybackgroundratio


2012/11/7 mattgordon <ma...@crowdflower.com>

> We have been trying to understand why our ZooKeeper cluster will
> occasionally
> have a wave of connection closed exceptions. We have switched to
> -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode for garbage collection with
> no noticeable improvements.
>
> The symptoms are:
>
> (1) All nodes show messages like "fsync-ing the write ahead log in
> SyncThread:0 took 6309ms which will adversely effect operation latency. See
> the ZooKeeper troubleshooting guide" with times typically around 5 seconds.
> At least once, this fsync appeared in the leaders log immediately before a
> wave of:
>
> ERROR [CommitProcessor:0:NIOServerCnxn@445] - Unexpected Exception:
> java.nio.channels.CancelledKeyException
>         at
> sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>         at
> sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>         at
>
> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
>         at
>
> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
>         at
>
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:171)
>         at
>
> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
>
> Our clients received ZookeeperConnectionClosed exceptions at this time and
> all traffic on the ZooKeeper cluster essentially went to zero for a moment
> before resuming normal operation with new connections.
>
> (2) Probably unrelated since I haven't correlated it temporally with the
> client errors, but running "sudo strace -r -T -f -p 9574 -e
> trace=fsync,fdatasync -o trace.txt" turns up some messages like "10581
> 0.000246 — SIGSEGV (Segmentation fault) @ 0 (0) —"
>
>
> ZK Version: 3.3.4
> Cluster has 5 nodes running in EC2
>
> Here is a screenshot showing ZooKeeper network traffic going to zero at the
> time of the connection closed exceptions: http://i.imgur.com/dfNh0.png
>
>
>
> Anyone have ideas on what the cause of these "waves" of
> CancelledKeyExceptions could be from?
>
>
-- 
Best regards,
 Vitalii Tymchyshyn