Posted to user@zookeeper.apache.org by Jeremy Stribling <st...@nicira.com> on 2012/04/04 01:51:13 UTC
uncaught exception handler
I'm curious about the origin of the uncaught exception handler that sits
in NIOServerCnxn (looking at ZK 3.3.5). It just logs the exception via
log.error. I wonder if it would make more sense to call System.exit(1)
when the exception is an OutOfMemoryError (or perhaps any
java.lang.Error, since those are not meant to be caught).
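For illustration, here is a minimal sketch of the kind of handler I mean
(the class and method names are mine, not from the ZooKeeper source; the
existing handler only logs). The exit decision is split into a separate
method purely so the logic is easy to see:

```java
// Hypothetical sketch: an uncaught exception handler that halts the JVM
// on java.lang.Error (e.g. OutOfMemoryError) instead of only logging.
// Names are illustrative, not the actual ZooKeeper implementation.
class ExitOnErrorHandler implements Thread.UncaughtExceptionHandler {

    // Decision logic kept separate from the exit call.
    static boolean shouldExit(Throwable t) {
        return t instanceof Error;
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        System.err.println("Thread " + t.getName() + " died: " + e);
        if (shouldExit(e)) {
            // Die loudly so a supervising process can restart the JVM.
            // Runtime.getRuntime().halt(1) is an alternative that skips
            // shutdown hooks, which may themselves fail under OOM.
            System.exit(1);
        }
    }
}
```

It would be installed with
Thread.setDefaultUncaughtExceptionHandler(new ExitOnErrorHandler()).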
I ask because we embed ZooKeeper in a process where other code can cause
the JVM to hit its memory limit. Instead of trying to soldier on in the
face of adversity like this, it seems better for the whole process to
come crashing down, so that whatever monitor process is in place can
restart the JVM. When the process just logs and ignores errors like
this, the ZK servers can end up unable to form a quorum even though they
are nominally up and running.
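The monitor side of that argument can be as simple as a restart loop. A
sketch (the function name is a placeholder; a real deployment would more
likely use something like systemd's Restart=on-failure, but the
principle is the same):

```shell
#!/bin/sh
# Illustrative supervisor: rerun a command until it exits cleanly
# (status 0), pausing briefly between crashes.
supervise() {
    while true; do
        "$@"
        status=$?
        [ "$status" -eq 0 ] && break     # clean shutdown: stop looping
        echo "command exited with status $status; restarting" >&2
        sleep 1                          # avoid a tight crash loop
    done
}
```

With the handler exiting on Error, supervise would relaunch the JVM with
a fresh heap rather than leaving a wedged server in the ensemble.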
Here's a sample backtrace I've seen:
2012-04-03 19:40:03,643 600695063 [QuorumPeer:/172.29.1.220:2888] ERROR org.apache.zookeeper.server.NIOServerCnxn - Thread Thread[QuorumPeer:/172.29.1.220:2888,5,main] died
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:102)
    at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:232)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:602)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:529)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:504)
    at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:341)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:131)
    at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:222)
    at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:242)
    at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:279)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:658)
Any thoughts? Happy to create a JIRA and possibly a patch if there's
interest. Thanks,
Jeremy
Re: uncaught exception handler
Posted by Jeremy Stribling <st...@nicira.com>.
Done: https://issues.apache.org/jira/browse/ZOOKEEPER-1442. I'll try
to get a patch together in the near future. Thanks.
Jeremy
On 04/03/2012 06:32 PM, Michi Mutsuzaki wrote:
> I agree we shouldn't swallow java.lang.Error. Please go ahead and open a jira.
>
> Thanks!
> --Michi
RE: uncaught exception handler
Posted by Michi Mutsuzaki <mi...@yahoo-inc.com>.
I agree we shouldn't swallow java.lang.Error. Please go ahead and open a jira.
Thanks!
--Michi