Posted to dev@curator.apache.org by "Matthias Pohl (Jira)" <ji...@apache.org> on 2022/06/30 13:57:00 UTC

[jira] [Updated] (CURATOR-645) LeaderLatch generates infinite loop with two LeaderLatch instances competing for the leadership

     [ https://issues.apache.org/jira/browse/CURATOR-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Pohl updated CURATOR-645:
----------------------------------
    Description: 
We experienced strange behavior of the {{LeaderLatch}} in a test case in Apache Flink (see FLINK-28078), where two {{LeaderLatch}} instances compete for the leadership, resulting in an infinite loop.

The test includes three instances of a wrapper class that has a {{LeaderLatch}} as a member. This is about [ZooKeeperMultipleComponentLeaderElectionDriverTest::testLeaderElectionWithMultipleDrivers|https://github.com/apache/flink/blob/7d85b273ccdbd5a2242e05e5d645ea82280f5eea/flink-runtime/src/test/java/org/apache/flink/runtime/leaderelection/ZooKeeperMultipleComponentLeaderElectionDriverTest.java#L236]. In the test, the first {{LeaderLatch}} to acquire the leadership is closed and, as a consequence, loses the leadership. The odd thing is that the two left-over {{LeaderLatch}} instances then end up in an infinite loop, as shown in the ZooKeeper server logs:
{code:java}
16:17:07,864 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch
16:17:07,864 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch
16:17:07,866 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc txntype:2 reqpath:n/a
16:17:07,866 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc txntype:2 reqpath:n/a
16:17:07,869 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd txntype:15 reqpath:n/a
16:17:07,869 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd txntype:15 reqpath:n/a
16:17:07,869 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getData cxid:0x24 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch/_c_6eb174e9-bb77-4a73-9604-531242c11c0e-latch-0000000001
{code}
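For reference, here is a condensed, stand-alone version of the test pattern. This is only a sketch: the connection string, contender ids and class name are placeholders, not the actual Flink test code (only the latch path {{/flink/default/latch}} is taken from the logs above).
{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class LeaderLatchCloseOnGrantRepro {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; the test runs against an in-process ZooKeeper server.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(100, 3));
        client.start();

        // Three contenders on the same latch path, mirroring the three wrapper instances.
        List<LeaderLatch> latches = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            latches.add(new LeaderLatch(client, "/flink/default/latch", "contender-" + i));
        }
        for (LeaderLatch latch : latches) {
            latch.start();
        }

        // Wait until one latch has acquired leadership, then close it right away,
        // roughly what the Flink test does with the first leader.
        LeaderLatch firstLeader = null;
        while (firstLeader == null) {
            for (LeaderLatch latch : latches) {
                if (latch.hasLeadership()) {
                    firstLeader = latch;
                    break;
                }
            }
            TimeUnit.MILLISECONDS.sleep(10);
        }
        firstLeader.close();   // deletes its latch ZNode; the remaining two should re-elect

        // Give the remaining latches time to re-elect and report what they see.
        TimeUnit.SECONDS.sleep(5);
        for (LeaderLatch latch : latches) {
            if (latch != firstLeader) {
                System.out.println(latch.getId() + " hasLeadership=" + latch.hasLeadership());
            }
        }
    }
}
{code}
Closing the first leader should simply hand the leadership over to one of the two remaining latches; in the failing runs, neither of them settles, and the ZooKeeper server logs show the delete/create2/getData cycle above repeating indefinitely.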
It looks like the {{close()}} call of the {{LeaderLatch}} that held the initial leadership is in some kind of race: the close deletes the corresponding ZNode while the watchers of the left-over {{LeaderLatch}} instances end up triggering {{reset()}} instead of retrieving the left-over children (a toy model of the suspected loop is sketched after the list):
 # The {{reset()}} triggers [getChildren|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L629] through [LeaderLatch#getChildren|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L525] after a new child is created (I would expect the {{create2}} entry to appear in the logs before the {{getChildren}} entry, which is not the case; so I might be wrong in this observation).
 # The callback of {{getChildren}} triggers [checkLeadership|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L625].
 # In the meantime, the predecessor gets deleted (I'd assume because of the deterministic ordering of the events in ZK). This causes the [callback in checkLeadership|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L607] to fail with a {{NONODE}} event, which triggers a reset of the current {{LeaderLatch}} instance; that reset in turn deletes the current {{LeaderLatch}}'s child ZNode, and the deletion is executed on the server later on.
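To make the suspected mechanism easier to follow, here is a toy model of the two left-over latches (plain Java, not Curator code; all names are made up): every {{reset()}} deletes a node that the other latch has just picked as its predecessor, so each {{getData}} hits {{NONODE}} and triggers the next {{reset()}}.
{code:java}
import java.util.TreeSet;

// Toy model (NOT Curator code) of the suspected loop between the two remaining latches.
public class LatchResetLoopSketch {
    static final TreeSet<String> children = new TreeSet<>(); // stand-in for /flink/default/latch
    static int seq = 0;

    // Stand-in for creating an ephemeral-sequential latch node (the create2 calls in the logs).
    static String createNode() {
        String node = String.format("latch-%010d", seq++);
        children.add(node);
        return node;
    }

    static class Latch {
        String ourNode = createNode();
        String predecessor; // chosen from a possibly stale view of the children

        // getChildren + checkLeadership: pick the node directly before ours.
        void checkLeadership() {
            TreeSet<String> snapshot = new TreeSet<>(children);
            predecessor = snapshot.first().equals(ourNode)
                    ? null
                    : snapshot.headSet(ourNode).last();
        }

        // getData callback on the predecessor: if it vanished in the meantime (NONODE), reset.
        void onPredecessorGetData() {
            if (predecessor != null && !children.contains(predecessor)) {
                System.out.println(ourNode + ": predecessor " + predecessor + " is gone (NONODE) -> reset()");
                reset();
            }
        }

        // reset(): delete our node, create a fresh one, and check leadership again.
        void reset() {
            children.remove(ourNode);
            ourNode = createNode();
            checkLeadership();
        }
    }

    public static void main(String[] args) {
        Latch leader = new Latch();      // the latch that wins first
        Latch a = new Latch();
        Latch b = new Latch();
        a.checkLeadership();             // a's predecessor is the leader's node
        b.checkLeadership();             // b's predecessor is a's node
        children.remove(leader.ourNode); // the winning latch is closed -> its node is deleted

        // Each reset deletes exactly the node the other latch chose as its predecessor,
        // so the same two steps repeat forever (bounded here for the demo).
        for (int round = 0; round < 3; round++) {
            a.onPredecessorGetData();
            b.onPredecessorGetData();
        }
    }
}
{code}
Running the model for a few rounds prints an unbounded series of "predecessor ... is gone (NONODE) -> reset()" messages with ever-increasing sequence numbers, which matches the repeating delete/create2/getData pattern in the server logs.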

  was:We experienced strange behavior of the {{LeaderLatch}} in a test case in Apache Flink (see FLINK-28078), where two {{LeaderLatch}} instances compete for the leadership, resulting in an infinite loop.


> LeaderLatch generates infinite loop with two LeaderLatch instances competing for the leadership
> -----------------------------------------------------------------------------------------------
>
>                 Key: CURATOR-645
>                 URL: https://issues.apache.org/jira/browse/CURATOR-645
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 5.2.0
>            Reporter: Matthias Pohl
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)