You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by "Hudson (JIRA)" <ji...@apache.org> on 2014/07/01 13:12:27 UTC

[jira] [Commented] (ZOOKEEPER-1900) NullPointerException in truncate

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048769#comment-14048769 ] 

Hudson commented on ZOOKEEPER-1900:
-----------------------------------

FAILURE: Integrated in ZooKeeper-trunk #2352 (See [https://builds.apache.org/job/ZooKeeper-trunk/2352/])
ZOOKEEPER-1900 (shralex: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1606841)
* /zookeeper/trunk/CHANGES.txt
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/persistence/FileTxnLog.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Observer.java
* /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/TruncateTest.java


>  NullPointerException in truncate
> ---------------------------------
>
>                 Key: ZOOKEEPER-1900
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1900
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.5, 3.4.6
>         Environment: linux java 1.6
>            Reporter: Steven Bower
>            Assignee: Camille Fournier
>            Priority: Blocker
>             Fix For: 3.4.7, 3.5.0
>
>         Attachments: ZOOKEEPER-1900-34.patch, ZOOKEEPER-1900.patch, ZOOKEEPER-190034v2.patch, ZOOKEEPER-1900v2.patch
>
>
> The other day we started up a ZK instance that had been down for a bit (1day) and started getting NPEs all over the place...
> {noformat}
> 2014-20-03 11:15:42.320 INFO  QuorumPeerConfig [main] - Reading configuration from: /xxx/bin/zk/etc/zk.cfg
> 2014-20-03 11:15:42.350 INFO  QuorumPeerConfig [main] - Defaulting to majority quorums
> 2014-20-03 11:15:42.353 INFO  DatadirCleanupManager [main] - autopurge.snapRetainCount set to 3
> 2014-20-03 11:15:42.353 INFO  DatadirCleanupManager [main] - autopurge.purgeInterval set to 0
> 2014-20-03 11:15:42.353 INFO  DatadirCleanupManager [main] - Purge task is not scheduled.
> 2014-20-03 11:15:42.385 INFO  QuorumPeerMain [main] - Starting quorum peer
> 2014-20-03 11:15:42.399 INFO  NIOServerCnxnFactory [main] - binding to port 0.0.0.0/0.0.0.0:5555
> 2014-20-03 11:15:42.413 INFO  QuorumPeer [main] - tickTime set to 2000
> 2014-20-03 11:15:42.413 INFO  QuorumPeer [main] - minSessionTimeout set to -1
> 2014-20-03 11:15:42.413 INFO  QuorumPeer [main] - maxSessionTimeout set to -1
> 2014-20-03 11:15:42.413 INFO  QuorumPeer [main] - initLimit set to 10
> 2014-20-03 11:15:42.456 INFO  FileSnap [main] - Reading snapshot /xxx/zk_data/version-2/snapshot.2c00000000
> 2014-20-03 11:15:42.463 INFO  QuorumCnxManager [Thread-3] - My election bind port: 0.0.0.0/0.0.0.0:7555
> 2014-20-03 11:15:42.470 INFO  QuorumPeer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - LOOKING
> 2014-20-03 11:15:42.471 INFO  FastLeaderElection [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - New election. My id =  3, proposed zxid=0x8000000000000000
> 2014-20-03 11:15:42.479 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), FOLLOWING (n.state), 1 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 11:15:42.479 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), FOLLOWING (n.state), 1 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 11:15:42.482 INFO  QuorumCnxManager [WorkerSender[myid=3]] - Have smaller server identifier, so dropping the connection: (5, 3)
> 2014-20-03 11:15:42.482 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), LEADING (n.state), 2 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 11:15:42.482 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), LEADING (n.state), 2 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 11:15:42.482 INFO  QuorumPeer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - OBSERVING
> 2014-20-03 11:15:42.486 INFO  Learner [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - TCP NoDelay set to: true
> 2014-20-03 11:15:42.488 INFO  QuorumCnxManager [host1/###.###.###.###:7555] - Received connection request /###.###.###.###:64528
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:host.name=host1
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:java.version=1.6.0_20
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:java.vendor=Sun Microsystems Inc.
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:java.home=/xxx/util/common/jdk1.6.0_20_64bit/jre
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:java.class.path=/xxx/bin/zk/etc:/xxx/bin/zk/lib/slf4j-log4j12-1.7.2.jar:/xxx/bin/zk/lib/jline-0.9.94.jar:/xxx/bin/zk/lib/jul-to-slf4j-1.7.2.jar:/xxx/bin/zk/lib/ZooInspector-3.4.5.jar:/xxx/bin/zk/lib/jcl-over-slf4j-1.7.2.jar:/xxx/bin/zk/lib/log4j-1.2.17.jar:/xxx/bin/zk/lib/zookeeper-3.4.5.jar:/xxx/bin/zk/lib/slf4j-api-1.7.2.jar:/xxx/bin/zk/lib/netty-3.2.2.Final.jar
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:java.library.path=/xxx/util/common/jdk1.6.0_20_64bit/jre/lib/amd64/server:/xxx/util/common/jdk1.6.0_20_64bit/jre/lib/amd64:/xxx/util/common/jdk1.6.0_20_64bit/jre/../lib/amd64:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:java.io.tmpdir=/tmp
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:java.compiler=<NA>
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:os.name=Linux
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:os.arch=amd64
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:os.version=2.6.32-220.2.1.el6.x86_64
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:user.name=op
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:user.home=/xxx/bin
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:user.dir=/xxx/bin
> 2014-20-03 11:15:42.491 INFO  ZooKeeperServer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /xxx/zk_log/version-2 snapdir /xxx/zk_data/version-2
> 2014-20-03 11:15:42.493 INFO  Learner [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Observing host4/###.###.###.###:6555
> 2014-20-03 11:15:42.495 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), FOLLOWING (n.state), 5 (n.sid), 0x2b (n.peerEPoch), OBSERVING (my state)
> 2014-20-03 11:15:42.498 WARN  Learner [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Truncating log to get in sync with the leader 0x2b00000002
> 2014-20-03 11:15:42.499 WARN  QuorumPeer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Unexpected exception
> java.lang.NullPointerException
>         at org.apache.zookeeper.server.persistence.FileTxnLog.truncate(FileTxnLog.java:352)
>         at org.apache.zookeeper.server.persistence.FileTxnSnapLog.truncateLog(FileTxnSnapLog.java:259)
>         at org.apache.zookeeper.server.ZKDatabase.truncateLog(ZKDatabase.java:438)
>         at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:339)
>         at org.apache.zookeeper.server.quorum.Observer.observeLeader(Observer.java:72)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:727)
> 2014-20-03 11:15:42.500 INFO  Learner [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - shutdown called
> java.lang.Exception: shutdown Observer
>         at org.apache.zookeeper.server.quorum.Observer.shutdown(Observer.java:137)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:731)
> 2014-20-03 11:15:42.500 INFO  ZooKeeperServer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - shutting down
> 2014-20-03 11:15:42.500 INFO  QuorumPeer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - LOOKING
> 2014-20-03 11:15:42.501 INFO  FastLeaderElection [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - New election. My id =  3, proposed zxid=0x8000000000000000
> 2014-20-03 11:15:42.503 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), FOLLOWING (n.state), 1 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 11:15:42.503 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), LEADING (n.state), 2 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 11:15:42.503 INFO  QuorumPeer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - OBSERVING
> 2014-20-03 11:15:42.503 INFO  ZooKeeperServer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /xxx/zk_log/version-2 snapdir /xxx/zk_data/version-2
> 2014-20-03 11:15:42.504 INFO  Learner [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Observing host4/###.###.###.###:6555
> 2014-20-03 11:15:42.504 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), FOLLOWING (n.state), 5 (n.sid), 0x2b (n.peerEPoch), OBSERVING (my state)
> 2014-20-03 11:15:42.514 INFO  FileSnap [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Reading snapshot /xxx/zk_data/version-2/snapshot.2c00000000
> 2014-20-03 11:15:42.517 WARN  Learner [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Truncating log to get in sync with the leader 0x2b00000002
> 2014-20-03 11:15:42.518 WARN  QuorumPeer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Unexpected exception
> java.lang.NullPointerException
>         at org.apache.zookeeper.server.persistence.FileTxnLog.truncate(FileTxnLog.java:352)
>         at org.apache.zookeeper.server.persistence.FileTxnSnapLog.truncateLog(FileTxnSnapLog.java:259)
>         at org.apache.zookeeper.server.ZKDatabase.truncateLog(ZKDatabase.java:438)
>         at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:339)
>         at org.apache.zookeeper.server.quorum.Observer.observeLeader(Observer.java:72)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:727)
> {noformat}
> This exception went on and on over and over again (more than 1M times in a day) until it then began spewing this exception:
> {noformat}
> 2014-20-03 13:45:32.843 INFO  QuorumPeer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - LOOKING
> 2014-20-03 13:45:32.843 INFO  FastLeaderElection [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - New election. My id =  3, proposed zxid=0x8000000000000000
> 2014-20-03 13:45:32.844 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), FOLLOWING (n.state), 1 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 13:45:32.845 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), LEADING (n.state), 2 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 13:45:32.845 INFO  QuorumPeer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - OBSERVING
> 2014-20-03 13:45:32.845 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), FOLLOWING (n.state), 5 (n.sid), 0x2b (n.peerEPoch), OBSERVING (my state)
> 2014-20-03 13:45:32.845 INFO  ZooKeeperServer [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /xxx/zk_log/version-2 snapdir /xxx/zk_data/version-2
> 2014-20-03 13:45:32.845 INFO  Learner [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Observing host4/###.###.###.###:6555
> 2014-20-03 13:45:32.853 WARN  Learner [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Unexpected exception, tries=0, connecting to host4/###.###.###.###:6555
> java.net.ConnectException: Cannot assign requested address
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
>         at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>         at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>         at java.net.Socket.connect(Socket.java:529)
>         at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:224)
>         at org.apache.zookeeper.server.quorum.Observer.observeLeader(Observer.java:69)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:727)
> 2014-20-03 13:45:33.863 INFO  FileSnap [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Reading snapshot /xxx/zk_data/version-2/snapshot.2c00000000
> {noformat}
> This exception for a while was interspersed with the NPEs but eventually it just was spewing the ConnectionException.
> Looking through the code a bit it seems if the FileTxnIterator when initialized cannot find any log files the {{inputStream}} is set to null which causes truncate() to NPE.. I see in 3.4.6 this has been wrapped in a try/finally which closes the iterator.. but i presume that this issue would still remain.
> Looking at the system in this state there were 29k+ sockets in CLOSE_WAIT state on the system and looking at a heap dump there were tons of Socket objects waiting for GC (ie not getting properly closed).. this eventually ran the system out of ephemeral ports and hence the ConnectionExceptions..
> It would seem that a quick check of {{itr.next()}} prior to attempting truncation would resolve the NPE, but it seems somewhere a connection is not being closed properly when an exception occurs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)