Posted to issues@flink.apache.org by "Matthias Pohl (Jira)" <ji...@apache.org> on 2022/06/21 14:37:00 UTC

[jira] [Comment Edited] (FLINK-28078) ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers runs into timeout

    [ https://issues.apache.org/jira/browse/FLINK-28078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556863#comment-17556863 ] 

Matthias Pohl edited comment on FLINK-28078 at 6/21/22 2:36 PM:
----------------------------------------------------------------

{code}
16:17:07,802 [ForkJoinPool-45-worker-25] INFO  org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl [] - Starting
16:17:07,804 [ForkJoinPool-45-worker-25] INFO  org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl [] - Default schema
16:17:07,814 [ForkJoinPool-45-worker-25-EventThread] INFO  org.apache.flink.shaded.curator5.org.apache.curator.framework.state.ConnectionStateManager [] - State change: CONNECTED
16:17:07,817 [ForkJoinPool-45-worker-25-EventThread] INFO  org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker [] - New config event received: {}
16:17:07,824 [Curator-ConnectionStateManager-0] DEBUG org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Connected to ZooKeeper quorum. Leader election can start.
16:17:07,824 [Curator-ConnectionStateManager-0] DEBUG org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Connected to ZooKeeper quorum. Leader election can start.
16:17:07,826 [ForkJoinPool-45-worker-25-EventThread] INFO  org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker [] - New config event received: {}
16:17:07,848 [ForkJoinPool-45-worker-25-EventThread] DEBUG org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - ZooKeeperMultipleComponentLeaderElectionDriver obtained the leadership.
16:17:07,860 [ForkJoinPool-45-worker-25] INFO  org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Closing ZooKeeperMultipleComponentLeaderElectionDriver.
{code}

The test creates three {{ElectionDriver}} instances and removes them one by one in a for loop. The logs of the failed run show that only two of the three established the quorum connection (i.e. the log message {{Connected to ZooKeeper quorum. Leader election can start.}} is printed only twice). The first iteration picks the first instance, checks its leadership and closes it.

The {{anyOf}} call in the next iteration should still succeed because one {{ElectionDriver}} with an established connection remains. But the resulting {{anyOf}} composite future doesn't complete, i.e. none of the remaining leadership futures completes, which leaves the test stuck in the subsequent {{join}} call.
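
To illustrate the symptom in isolation, here is a minimal sketch using plain {{CompletableFuture}} instances. It is not the Flink test code: the class {{AnyOfStuckSketch}}, the helper {{completesWithin}} and the bare futures standing in for the drivers' leadership futures are all assumptions made for this example. It only shows that {{anyOf}} never completes when none of its input futures completes, so an unbounded {{join}} on it would block forever.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for the test loop; not the actual Flink test code. */
public class AnyOfStuckSketch {

    /** Returns true if anyOf over the given futures completes within the timeout. */
    public static boolean completesWithin(
            List<CompletableFuture<Void>> futures, long timeoutMillis) throws Exception {
        CompletableFuture<Object> any =
                CompletableFuture.anyOf(futures.toArray(new CompletableFuture[0]));
        try {
            // A bounded get() is used here instead of join() so the sketch terminates.
            any.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Three plain futures standing in for the leadership futures of three drivers.
        List<CompletableFuture<Void>> leadershipFutures = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            leadershipFutures.add(new CompletableFuture<>());
        }

        // First iteration: one driver obtains leadership, is checked and removed.
        leadershipFutures.get(0).complete(null);
        System.out.println(completesWithin(leadershipFutures, 200)); // prints "true"
        leadershipFutures.remove(0);

        // Second iteration: no remaining future is ever completed, so anyOf
        // stays incomplete. With join() this is where the caller would hang.
        System.out.println(completesWithin(leadershipFutures, 200)); // prints "false"
    }
}
```

Waiting with a timeout, as the sketch does, would turn the hang into a visible failure, but it would not explain why none of the remaining leadership futures completes.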


was (Author: mapohl):
{code}
16:17:07,802 [ForkJoinPool-45-worker-25] INFO  org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl [] - Starting
16:17:07,804 [ForkJoinPool-45-worker-25] INFO  org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl [] - Default schema
16:17:07,814 [ForkJoinPool-45-worker-25-EventThread] INFO  org.apache.flink.shaded.curator5.org.apache.curator.framework.state.ConnectionStateManager [] - State change: CONNECTED
16:17:07,817 [ForkJoinPool-45-worker-25-EventThread] INFO  org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker [] - New config event received: {}
16:17:07,824 [Curator-ConnectionStateManager-0] DEBUG org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Connected to ZooKeeper quorum. Leader election can start.
16:17:07,824 [Curator-ConnectionStateManager-0] DEBUG org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Connected to ZooKeeper quorum. Leader election can start.
16:17:07,826 [ForkJoinPool-45-worker-25-EventThread] INFO  org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker [] - New config event received: {}
16:17:07,848 [ForkJoinPool-45-worker-25-EventThread] DEBUG org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - ZooKeeperMultipleComponentLeaderElectionDriver obtained the leadership.
16:17:07,860 [ForkJoinPool-45-worker-25] INFO  org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Closing ZooKeeperMultipleComponentLeaderElectionDriver.
{code}

The test itself usually creates three {{ElectionDriver}} instances and removes them one by one through a for loop. The logs of the failed test reveal that only two out of the three have the quorum connection established (i.e. the log message {{Connected to ZooKeeper quorum. Leader election can start.}} is printed). The first iteration picks the first instance, checks its leadership and closes it. It looks like the second iteration picks the instance for which the quorum connection is still not established. The leadership future could therefore never be completed which results in the test getting stuck in the {{join}} call.

> ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers runs into timeout
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-28078
>                 URL: https://issues.apache.org/jira/browse/FLINK-28078
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.16.0
>            Reporter: Matthias Pohl
>            Assignee: Matthias Pohl
>            Priority: Major
>              Labels: test-stability
>
> [Build #36189|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=36189&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=10455] got stuck in {{ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers}}
> {code}
> "ForkJoinPool-45-worker-25" #525 daemon prio=5 os_prio=0 tid=0x00007fc74d9e3800 nid=0x62c8 waiting on condition [0x00007fc6ff2f2000]
> May 30 16:36:10    java.lang.Thread.State: WAITING (parking)
> May 30 16:36:10 	at sun.misc.Unsafe.park(Native Method)
> May 30 16:36:10 	- parking to wait for  <0x00000000c2571b80> (a java.util.concurrent.CompletableFuture$Signaller)
> May 30 16:36:10 	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> May 30 16:36:10 	at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
> May 30 16:36:10 	at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313)
> May 30 16:36:10 	at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
> May 30 16:36:10 	at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
> May 30 16:36:10 	at org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers(ZooKeeperMultipleComponentLeaderElectionDriverTest.java:256)
> May 30 16:36:10 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> May 30 16:36:10 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> May 30 16:36:10 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> May 30 16:36:10 	at java.lang.reflect.Method.invoke(Method.java:498)
> [...]
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)