You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@curator.apache.org by "Daniel Klessing (JIRA)" <ji...@apache.org> on 2017/10/06 15:07:01 UTC

[jira] [Created] (CURATOR-435) ZNodes are created with incompatible CreateMode under racing conditions

Daniel Klessing created CURATOR-435:
---------------------------------------

             Summary: ZNodes are created with incompatible CreateMode under racing conditions
                 Key: CURATOR-435
                 URL: https://issues.apache.org/jira/browse/CURATOR-435
             Project: Apache Curator
          Issue Type: Bug
          Components: Client, Recipes
    Affects Versions: 2.12.0
         Environment: Apache Storm 1.1.1 (uses Zookeeper 3.4.6 and Curator 2.12.0)
Apache Zookeeper 3.4.10
OpenJDK 1.8.0_131
Debian 8
Docker 17.09.0-ce
Rancher 1.6.9
            Reporter: Daniel Klessing
         Attachments: zk_leader-lock_nimbus.log, zk_leader-lock_SUCCESS_nimbus.log, zk_leader-lock_SUCCESS_zookeeper.log, zk_leader-lock_zookeeper-1.log, zk_leader-lock_zookeeper-2.log, zk_leader-lock_zookeeper-3.log

We have a microservice-based software stack that starts its core services relatively simultaneously. For this we rely on Docker and Rancher.

We noticed that occasionally the Apache Storm Nimbus, that relies on Apache Curator's LeaderLatch, cannot agree on a leader. This comes due a missing ZNode:
{code}
org.apache.storm.shade.org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /storm/leader-lock
	at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:111) ~[storm-core-1.1.1.jar:1.1.1]
	at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[storm-core-1.1.1.jar:1.1.1]
	at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1590) ~[storm-core-1.1.1.jar:1.1.1]
	at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:230) ~[storm-core-1.1.1.jar:1.1.1]
	at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:219) ~[storm-core-1.1.1.jar:1.1.1]
	at org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109) ~[storm-core-1.1.1.jar:1.1.1]
	at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:216) ~[storm-core-1.1.1.jar:1.1.1]
	at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:207) ~[storm-core-1.1.1.jar:1.1.1]
	at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:40) ~[storm-core-1.1.1.jar:1.1.1]
	at org.apache.storm.shade.org.apache.curator.framework.recipes.locks.LockInternals.getSortedChildren(LockInternals.java:151) ~[storm-core-1.1.1.jar:1.1.1]
	at org.apache.storm.shade.org.apache.curator.framework.recipes.locks.LockInternals.getParticipantNodes(LockInternals.java:133) ~[storm-core-1.1.1.jar:1.1.1]
	at org.apache.storm.shade.org.apache.curator.framework.recipes.leader.LeaderLatch.getLeader(LeaderLatch.java:453) ~[storm-core-1.1.1.jar:1.1.1]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_131]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_131]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_131]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_131]
	at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[clojure-1.7.0.jar:?]
	at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:313) ~[clojure-1.7.0.jar:?]
	at org.apache.storm.zookeeper$zk_leader_elector$reify__1043.getLeader(zookeeper.clj:296) ~[storm-core-1.1.1.jar:1.1.1]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_131]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_131]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_131]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_131]
	at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[clojure-1.7.0.jar:?]
	at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:313) ~[clojure-1.7.0.jar:?]
	at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1544) ~[storm-core-1.1.1.jar:1.1.1]
	at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__10780.getClusterInfo(nimbus.clj:2006) ~[storm-core-1.1.1.jar:1.1.1]
	at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3920) ~[storm-core-1.1.1.jar:1.1.1]
	at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3904) ~[storm-core-1.1.1.jar:1.1.1]
	at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[storm-core-1.1.1.jar:1.1.1]
	at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[storm-core-1.1.1.jar:1.1.1]
	at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:162) ~[storm-core-1.1.1.jar:1.1.1]
	at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:518) ~[storm-core-1.1.1.jar:1.1.1]
	at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) ~[storm-core-1.1.1.jar:1.1.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
11:29:13.229 [timer] INFO  org.apache.storm.daemon.nimbus - not a leader, skipping assignments
11:29:13.229 [timer] INFO  org.apache.storm.daemon.nimbus - not a leader, skipping cleanup
{code}
even though a create request has been issued for this ZNode:
{code}
2017-10-06 11:27:04,387 [myid:3] - INFO  [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x35ef170475e0001 type:create cxid:0x1 zxid:0x100000013 txntype:-1 reqpath:n/a Error Path:/storm/leader-lock Error:KeeperErrorCode = NoNode for /storm/leader-lock
{code}
Verifying the Zookeeper data via {{zkCli.sh}} shows that the ZNode is in fact not present:
{code}
[zk: localhost:2181(CONNECTED) 0] ls /storm
[assignments, backpressure, nimbuses, logconfigs, storms, errors, supervisors, workerbeats, blobstore]
{code}

We noticed, that in such a case, the following log entry in the Nimbus log is *missing*
{code}
[Curator-Framework-0] WARN  org.apache.storm.shade.org.apache.curator.utils.ZKPaths - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
{code}
and if everything works, the log entry is *present*.

We assume that the correct {{CreateMode}} for the ZNode someone cannot be determined (as "Container" ZNodes are only available in Zookeeper 3.5.1 onwards).
Please notice the startup phase in which several attempts to connect to Zookeeper are not successful until it is fully reachable.

Full logs are attached. Zookeeper is run as a 3-node cluster in this setup. Happens with a single node instance also.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)