Posted to dev@curator.apache.org by "Matthias Pohl (Jira)" <ji...@apache.org> on 2022/06/30 13:57:00 UTC
[jira] [Updated] (CURATOR-645) LeaderLatch generates infinite loop with two LeaderLatch instances competing for the leadership
[ https://issues.apache.org/jira/browse/CURATOR-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthias Pohl updated CURATOR-645:
----------------------------------
Description:
We experienced a strange behavior of the {{LeaderLatch}} in a test case in Apache Flink (see FLINK-28078) where two {{LeaderLatch}} instances compete for the leadership, resulting in an infinite loop.
The test includes three instances of a wrapper class that has a {{LeaderLatch}} as a member. The test in question is [ZooKeeperMultipleComponentLeaderElectionDriverTest::testLeaderElectionWithMultipleDrivers|https://github.com/apache/flink/blob/7d85b273ccdbd5a2242e05e5d645ea82280f5eea/flink-runtime/src/test/java/org/apache/flink/runtime/leaderelection/ZooKeeperMultipleComponentLeaderElectionDriverTest.java#L236]. In the test, the first {{LeaderLatch}} acquires the leadership, which results in that {{LeaderLatch}} being closed and, as a consequence, losing the leadership. The odd thing is that the two left-over {{LeaderLatch}} instances then end up in an infinite loop, as shown in the ZooKeeper server logs:
{code:java}
16:17:07,864 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch
16:17:07,864 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch
16:17:07,866 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc txntype:2 reqpath:n/a
16:17:07,866 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc txntype:2 reqpath:n/a
16:17:07,869 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd txntype:15 reqpath:n/a
16:17:07,869 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd txntype:15 reqpath:n/a
16:17:07,869 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getData cxid:0x24 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch/_c_6eb174e9-bb77-4a73-9604-531242c11c0e-latch-0000000001
{code}
It looks like the close call of the {{LeaderLatch}} holding the initial leadership races with the watcher: the close deletes the corresponding ZNode, while the watcher triggers {{reset()}} on the left-over {{LeaderLatch}} instances instead of retrieving the remaining children:
# The {{reset()}} triggers [getChildren|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L629] through [LeaderLatch#getChildren|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L525] after a new child is created. (I would expect the {{create2}} entry to appear before the {{getChildren}} entry in the logs, which is not the case, so I might be wrong in this observation.)
# The callback of {{getChildren}} triggers [checkLeadership|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L625].
# In the meantime, the predecessor gets deleted (presumably because of the deterministic ordering of events in ZooKeeper). This causes the [callback in checkLeadership|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L607] to fail with a {{NONODE}} event, triggering a reset of the current {{LeaderLatch}} instance. That reset again deletes the current {{LeaderLatch}}'s child ZNode, and that deletion is executed on the server later on.
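The cycle described above can be sketched with a toy model (plain Java, no Curator dependency). To be clear, this is not Curator's actual implementation: the node names, the in-memory {{store}}, and the simplified {{reset}}/{{checkLeadership}} logic are assumptions made only to illustrate how a {{NONODE}} on a stale predecessor can feed back into a reset, re-staling the other latch's view in turn:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model of the suspected CURATOR-645 loop. The "store" stands in for the
// sorted children of the latch path; reset() is modeled as delete-own-node
// plus create-new-node, as in steps 1-3 above.
public class LatchLoopSketch {
    static final TreeSet<String> store = new TreeSet<>(); // fake ZK children, sorted
    static int seq = 0;

    // Models the ephemeral-sequential create done by reset().
    static String createNode() {
        String node = String.format("latch-%010d", seq++);
        store.add(node);
        return node;
    }

    // Models checkLeadership() working on a possibly stale children snapshot:
    // if the predecessor from the snapshot is already gone (NONODE on getData),
    // the latch resets (deletes its node and creates a new one) instead of
    // re-reading the current children.
    static String checkLeadership(String ourNode, List<String> snapshot) {
        int idx = snapshot.indexOf(ourNode);
        if (idx <= 0) return ourNode;            // leader, or not in snapshot
        String predecessor = snapshot.get(idx - 1);
        if (!store.contains(predecessor)) {      // getData -> NONODE
            store.remove(ourNode);               // reset(): delete own node...
            return createNode();                 // ...and create a fresh one
        }
        return ourNode;                          // predecessor exists; stable
    }

    public static void main(String[] args) {
        String a = createNode();                 // latch A (initial leader)
        String b = createNode();                 // latch B
        String c = createNode();                 // latch C

        List<String> snapshot = new ArrayList<>(store); // B and C read children
        store.remove(a);                         // A is closed, its node deleted

        // B sees NONODE for its predecessor (A) and resets; C's stale snapshot
        // still names B's old node as predecessor, so C also hits NONODE and
        // resets - and each reset re-stales the other's view.
        String b2 = checkLeadership(b, snapshot);
        String c2 = checkLeadership(c, snapshot);
        System.out.println(!b2.equals(b) && !c2.equals(c)); // prints "true"
    }
}
{code}
In this model the loop only terminates if a reset latch re-reads the children *after* all concurrent deletions have been applied; as long as each latch acts on a snapshot invalidated by another latch's reset, the delete/create cycle continues, matching the repeating {{delete}}/{{create2}}/{{getData}} pattern in the server logs.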
was:We experienced a strange behavior of the LeaderLatch in a test case in Apache Flink (see FLINK-28078) where two {{LeaderLatch}} instances are competing for the leadership resulting in an infinite loop.
> LeaderLatch generates infinite loop with two LeaderLatch instances competing for the leadership
> -----------------------------------------------------------------------------------------------
>
> Key: CURATOR-645
> URL: https://issues.apache.org/jira/browse/CURATOR-645
> Project: Apache Curator
> Issue Type: Bug
> Components: Recipes
> Affects Versions: 5.2.0
> Reporter: Matthias Pohl
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)