Posted to user@zookeeper.apache.org by Jeremy Stribling <st...@nicira.com> on 2012/04/04 01:51:13 UTC

uncaught exception handler

I'm curious about the origin of the uncaught exception handler that sits 
in NIOServerCnxn (looking at ZK 3.3.5).  It just logs the exception at 
log.error.  I wonder if it would make more sense to call System.exit(1) 
when the exception is an OutOfMemoryError (or perhaps any 
java.lang.Error, since Errors are generally not meant to be caught).
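
For concreteness, here's a minimal sketch of the kind of handler I have 
in mind.  The class name, the slf4j logger, and the installation point 
are my own illustration, not ZooKeeper's actual code:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Hypothetical handler: log every uncaught throwable, but exit the
    // JVM on java.lang.Error so an external monitor can restart it.
    public class ExitOnErrorHandler implements Thread.UncaughtExceptionHandler {
        private static final Logger LOG =
            LoggerFactory.getLogger(ExitOnErrorHandler.class);

        @Override
        public void uncaughtException(Thread t, Throwable e) {
            LOG.error("Thread " + t.getName() + " died", e);
            if (e instanceof Error) {
                // Errors like OutOfMemoryError leave the JVM in an
                // unreliable state; exit with a nonzero status so a
                // monitor process knows to restart us.
                System.exit(1);
            }
        }
    }

    // Installed once at startup:
    //     Thread.setDefaultUncaughtExceptionHandler(new ExitOnErrorHandler());

One caveat: System.exit(1) runs shutdown hooks, which may themselves 
fail after an OutOfMemoryError; Runtime.getRuntime().halt(1) would skip 
them.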

I ask because we embed ZooKeeper in a process where other code can 
cause the JVM to hit its memory limit.  Rather than trying to soldier 
on in the face of adversity like this, it seems better for the whole 
process to come crashing down, so that whatever monitor process is in 
place can restart the JVM.  When the process just logs and ignores 
errors like this, the ZK servers can end up unable to form a quorum 
even though they are still up and running.

Here's a sample backtrace I've seen:

2012-04-03 19:40:03,643 600695063 [QuorumPeer:/172.29.1.220:2888] ERROR org.apache.zookeeper.server.NIOServerCnxn  - Thread Thread[QuorumPeer:/172.29.1.220:2888,5,main] died
java.lang.OutOfMemoryError: GC overhead limit exceeded
         at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:102)
         at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:232)
         at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:602)
         at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:529)
         at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:504)
         at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:341)
         at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:131)
         at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:222)
         at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:242)
         at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:279)
         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:658)

Any thoughts?  Happy to create a JIRA and possibly a patch if there's 
interest.  Thanks,

Jeremy

Re: uncaught exception handler

Posted by Jeremy Stribling <st...@nicira.com>.
Done: https://issues.apache.org/jira/browse/ZOOKEEPER-1442 — I'll try 
to get a patch together in the near future.  Thanks.

Jeremy

On 04/03/2012 06:32 PM, Michi Mutsuzaki wrote:
> I agree we shouldn't swallow java.lang.Error. Please go ahead and open a jira.
>
> Thanks!
> --Michi

RE: uncaught exception handler

Posted by Michi Mutsuzaki <mi...@yahoo-inc.com>.
I agree we shouldn't swallow java.lang.Error. Please go ahead and open a jira.

Thanks!
--Michi