You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2020/08/03 15:04:01 UTC

[GitHub] [flink] XComp opened a new pull request #13055: FLINK-18677: [fix] Added handling of suspended or lost connections wi…

XComp opened a new pull request #13055:
URL: https://github.com/apache/flink/pull/13055


   …thin the ZooKeeperLeaderRetrievalService.
   
   The listener needs to be notified in case of a connection loss so that it is able to initiate necessary actions on its side.
   
   <!--
   *Thank you very much for contributing to Apache Flink - we are happy that you want to help us improve Flink. To help the community review your contribution in the best possible way, please go through the checklist below, which will get the contribution into a shape in which it can be best reviewed.*
   
   *Please understand that we do not do this to make contributions to Flink a hassle. In order to uphold a high standard of quality for code contributions, while at the same time managing a large number of contributions, we need contributors to prepare the contributions well, and give reviewers enough contextual information for the review. Please also understand that contributions that do not follow this guide will take longer to review and thus typically be picked up with lower priority by the community.*
   
   ## Contribution Checklist
   
     - Make sure that the pull request corresponds to a [JIRA issue](https://issues.apache.org/jira/projects/FLINK/issues). Exceptions are made for typos in JavaDoc or documentation files, which need no JIRA issue.
     
     - Name the pull request in the form "[FLINK-XXXX] [component] Title of the pull request", where *FLINK-XXXX* should be replaced by the actual issue number. Skip *component* if you are unsure about which is the best component.
     Typo fixes that have no associated JIRA issue should be named following this pattern: `[hotfix] [docs] Fix typo in event time introduction` or `[hotfix] [javadocs] Expand JavaDoc for PuncuatedWatermarkGenerator`.
   
     - Fill out the template below to describe the changes contributed by the pull request. That will give reviewers the context they need to do the review.
     
     - Make sure that the change passes the automated tests, i.e., `mvn clean verify` passes. You can set up Azure Pipelines CI to do that following [this guide](https://cwiki.apache.org/confluence/display/FLINK/Azure+Pipelines#AzurePipelines-Tutorial:SettingupAzurePipelinesforaforkoftheFlinkrepository).
   
     - Each pull request should address only one issue, not mix up code from multiple issues.
     
     - Each commit in the pull request has a meaningful commit message (including the JIRA id)
   
     - Once all items of the checklist are addressed, remove the above text and this checklist, leaving only the filled out template below.
   
   
   **(The sections below can be removed for hotfixes of typos)**
   -->
   
   ## What is the purpose of the change
   
   *(For example: This pull request makes task deployment go through the blob server, rather than through RPC. That way we avoid re-transferring them on each deployment (during recovery).)*
   
   
   ## Brief change log
   
   *(for example:)*
     - *The TaskInfo is stored in the blob store on job creation time as a persistent artifact*
     - *Deployments RPC transmits only the blob storage reference*
     - *TaskManagers retrieve the TaskInfo from the blob cache*
   
   
   ## Verifying this change
   
   *(Please pick either of the following options)*
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This change is already covered by existing tests, such as *(please describe tests)*.
   
   *(or)*
   
   This change added tests and can be verified as follows:
   
   *(example:)*
     - *Added integration tests for end-to-end deployment with large payloads (100MB)*
     - *Extended integration test for recovery after master (JobManager) failure*
     - *Added test that validates that TaskInfo is transferred only once across recoveries*
     - *Manually verified the change by running a 4 node cluser with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.*
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (yes / no)
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / no)
     - The serializers: (yes / no / don't know)
     - The runtime per-record code paths (performance sensitive): (yes / no / don't know)
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / no / don't know)
     - The S3 file system connector: (yes / no / don't know)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (yes / no)
     - If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] tillrohrmann commented on a change in pull request #13055: [FLINK-18677] Added handling of suspended or lost connections

Posted by GitBox <gi...@apache.org>.
tillrohrmann commented on a change in pull request #13055:
URL: https://github.com/apache/flink/pull/13055#discussion_r464889065



##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionConnectionHandlingTest.java
##########
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.leaderelection;
+
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.configuration.HighAvailabilityOptions;
+import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalListener;
+import org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService;
+import org.apache.flink.runtime.util.ZooKeeperUtils;
+import org.apache.flink.util.TestLogger;
+
+import org.apache.flink.shaded.curator4.org.apache.curator.framework.CuratorFramework;
+
+import org.apache.curator.test.TestingServer;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.time.Duration;
+import java.util.UUID;
+import java.util.concurrent.ArrayBlockingQueue;
+import java.util.concurrent.BlockingQueue;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.TimeUnit;
+
+import static org.hamcrest.CoreMatchers.is;
+import static org.hamcrest.CoreMatchers.not;
+import static org.hamcrest.CoreMatchers.nullValue;
+import static org.junit.Assert.assertThat;
+
+/**
+ * Tests for the error handling in case of a suspended connection to the ZooKeeper instance.
+ */
+public class ZooKeeperLeaderElectionConnectionHandlingTest extends TestLogger {
+
+    private TestingServer testingServer;
+
+    private Configuration config;
+
+    private CuratorFramework zooKeeperClient;
+
+    @Before
+    public void before() throws Exception {
+        testingServer = new TestingServer();
+
+        config = new Configuration();
+        config.setString(HighAvailabilityOptions.HA_MODE, "zookeeper");
+        config.setString(HighAvailabilityOptions.HA_ZOOKEEPER_QUORUM, testingServer.getConnectString());
+
+        zooKeeperClient = ZooKeeperUtils.startCuratorFramework(config);
+    }
+
+    @After
+    public void after() throws Exception {
+        stopTestServer();
+
+        if (zooKeeperClient != null) {
+            zooKeeperClient.close();
+            zooKeeperClient = null;
+        }
+    }
+
+    @Test
+    public void testConnectionSuspendedHandlingDuringInitialization() throws Exception {
+        // initialize LeaderRetrievalService
+        QueueLeaderElectionListener queueLeaderElectionListener = new QueueLeaderElectionListener(1, Duration.ofSeconds(1));
+
+        ZooKeeperLeaderRetrievalService testInstance = ZooKeeperUtils.createLeaderRetrievalService(zooKeeperClient, config);
+        testInstance.start(queueLeaderElectionListener);
+
+        // do the testing
+        CompletableFuture<String> firstAddress = queueLeaderElectionListener.next();
+        assertThat("No results are expected, yet, since no leader was elected.", firstAddress, is(nullValue()));
+
+        stopTestServer();
+
+        CompletableFuture<String> secondAddress = queueLeaderElectionListener.next();
+        assertThat("No result is expected since there was no leader elected before stopping the server, yet.", secondAddress, is(nullValue()));
+    }
+
+    @Test
+    public void testConnectionSuspendedHandling() throws Exception {
+        // initialize LeaderElection-related instances
+        String leaderAddress = "localhost";
+        LeaderElectionService leaderElectionService = ZooKeeperUtils.createLeaderElectionService(zooKeeperClient, config);
+        TestingContender contender = new TestingContender(leaderAddress, leaderElectionService);
+        leaderElectionService.start(contender);
+
+        // initialize LeaderRetrievalService

Review comment:
       This is somewhat obvious.

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java
##########
@@ -209,4 +211,15 @@ protected void handleStateChange(ConnectionState newState) {
 	public void unhandledError(String s, Throwable throwable) {
 		leaderListener.handleError(new FlinkException("Unhandled error in ZooKeeperLeaderRetrievalService:" + s, throwable));
 	}
+
+	private void notifyLeaderLoss() {
+		if (lastLeaderAddress != null || lastLeaderSessionID != null) {
+			LOG.debug(
+					"No leader information could be retrieved. Any listeners will be notified.");
+
+			lastLeaderAddress = null;
+			lastLeaderSessionID = null;
+			leaderListener.notifyLeaderAddress(null, null);
+		}
+	}

Review comment:
       This method is duplicating logic which is already contained in `ZooKeeperLeaderRetrievalService.nodeChanged`. I would suggest to factor out a method `notifyIfNewLeaderAddress(String newLeaderAddress, UUID newLeaderSessionID)`. `notifyLeaderLoss` could call this method with `notifyIfNewLeaderAddress(null, null)`.

##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionConnectionHandlingTest.java
##########
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.leaderelection;
+
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.configuration.HighAvailabilityOptions;
+import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalListener;
+import org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService;
+import org.apache.flink.runtime.util.ZooKeeperUtils;
+import org.apache.flink.util.TestLogger;
+
+import org.apache.flink.shaded.curator4.org.apache.curator.framework.CuratorFramework;
+
+import org.apache.curator.test.TestingServer;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.time.Duration;
+import java.util.UUID;
+import java.util.concurrent.ArrayBlockingQueue;
+import java.util.concurrent.BlockingQueue;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.TimeUnit;
+
+import static org.hamcrest.CoreMatchers.is;
+import static org.hamcrest.CoreMatchers.not;
+import static org.hamcrest.CoreMatchers.nullValue;
+import static org.junit.Assert.assertThat;
+
+/**
+ * Tests for the error handling in case of a suspended connection to the ZooKeeper instance.
+ */
+public class ZooKeeperLeaderElectionConnectionHandlingTest extends TestLogger {

Review comment:
       The class uses whitespaces for indentation. The Flink community uses tabs. Here is a pointer on how to set up your IDE to import Flink's checkstyle definition: https://ci.apache.org/projects/flink/flink-docs-stable/flinkDev/ide_setup.html.

##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionConnectionHandlingTest.java
##########
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.leaderelection;
+
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.configuration.HighAvailabilityOptions;
+import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalListener;
+import org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService;
+import org.apache.flink.runtime.util.ZooKeeperUtils;
+import org.apache.flink.util.TestLogger;
+
+import org.apache.flink.shaded.curator4.org.apache.curator.framework.CuratorFramework;
+
+import org.apache.curator.test.TestingServer;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.time.Duration;
+import java.util.UUID;
+import java.util.concurrent.ArrayBlockingQueue;
+import java.util.concurrent.BlockingQueue;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.TimeUnit;
+
+import static org.hamcrest.CoreMatchers.is;
+import static org.hamcrest.CoreMatchers.not;
+import static org.hamcrest.CoreMatchers.nullValue;
+import static org.junit.Assert.assertThat;
+
+/**
+ * Tests for the error handling in case of a suspended connection to the ZooKeeper instance.
+ */
+public class ZooKeeperLeaderElectionConnectionHandlingTest extends TestLogger {
+
+    private TestingServer testingServer;
+
+    private Configuration config;
+
+    private CuratorFramework zooKeeperClient;
+
+    @Before
+    public void before() throws Exception {
+        testingServer = new TestingServer();
+
+        config = new Configuration();
+        config.setString(HighAvailabilityOptions.HA_MODE, "zookeeper");
+        config.setString(HighAvailabilityOptions.HA_ZOOKEEPER_QUORUM, testingServer.getConnectString());
+
+        zooKeeperClient = ZooKeeperUtils.startCuratorFramework(config);
+    }
+
+    @After
+    public void after() throws Exception {
+        stopTestServer();
+
+        if (zooKeeperClient != null) {
+            zooKeeperClient.close();
+            zooKeeperClient = null;
+        }
+    }
+
+    @Test
+    public void testConnectionSuspendedHandlingDuringInitialization() throws Exception {
+        // initialize LeaderRetrievalService
+        QueueLeaderElectionListener queueLeaderElectionListener = new QueueLeaderElectionListener(1, Duration.ofSeconds(1));
+
+        ZooKeeperLeaderRetrievalService testInstance = ZooKeeperUtils.createLeaderRetrievalService(zooKeeperClient, config);
+        testInstance.start(queueLeaderElectionListener);
+
+        // do the testing
+        CompletableFuture<String> firstAddress = queueLeaderElectionListener.next();
+        assertThat("No results are expected, yet, since no leader was elected.", firstAddress, is(nullValue()));
+
+        stopTestServer();
+
+        CompletableFuture<String> secondAddress = queueLeaderElectionListener.next();
+        assertThat("No result is expected since there was no leader elected before stopping the server, yet.", secondAddress, is(nullValue()));
+    }
+
+    @Test
+    public void testConnectionSuspendedHandling() throws Exception {
+        // initialize LeaderElection-related instances
+        String leaderAddress = "localhost";
+        LeaderElectionService leaderElectionService = ZooKeeperUtils.createLeaderElectionService(zooKeeperClient, config);
+        TestingContender contender = new TestingContender(leaderAddress, leaderElectionService);
+        leaderElectionService.start(contender);
+
+        // initialize LeaderRetrievalService
+        QueueLeaderElectionListener queueLeaderElectionListener = new QueueLeaderElectionListener(2, Duration.ofSeconds(1));
+
+        ZooKeeperLeaderRetrievalService testInstance = ZooKeeperUtils.createLeaderRetrievalService(zooKeeperClient, config);
+        testInstance.start(queueLeaderElectionListener);
+
+        // do the testing
+        CompletableFuture<String> firstAddress = queueLeaderElectionListener.next();
+        assertThat("The first result is expected to be the initially set leader address.", firstAddress.get(), is(leaderAddress));
+
+        stopTestServer();
+
+        CompletableFuture<String> secondAddress = queueLeaderElectionListener.next();
+        assertThat("The next result must not be missing.", secondAddress, not(is(nullValue())));
+        assertThat("The next result is expected to be null.", secondAddress.get(), is(nullValue()));
+    }
+
+    private void stopTestServer() throws IOException {
+        if (testingServer != null) {
+            testingServer.stop();
+            testingServer = null;
+        }
+    }
+
+    private static class QueueLeaderElectionListener implements LeaderRetrievalListener {
+
+        private final BlockingQueue<CompletableFuture<String>> queue;
+        private final Duration timeout;
+
+        public QueueLeaderElectionListener(int expectedCalls, Duration timeout) {
+            this.queue = new ArrayBlockingQueue<>(expectedCalls);
+            this.timeout = timeout;
+        }
+
+        @Override
+        public void notifyLeaderAddress(String leaderAddress, UUID leaderSessionID) {
+            try {
+                this.queue.offer(CompletableFuture.completedFuture(leaderAddress), timeout.toMillis(), TimeUnit.MILLISECONDS);
+            } catch (InterruptedException e) {
+                throw new IllegalStateException(e);
+            }
+        }
+
+        public CompletableFuture<String> next() {
+            try {
+                return this.queue.poll(timeout.toMillis(), TimeUnit.MILLISECONDS);
+            } catch (InterruptedException e) {
+                throw new IllegalStateException(e);
+            }
+        }
+
+        @Override
+        public void handleError(Exception exception) {
+            // nothing to do

Review comment:
       Let's throw the exception at least. Otherwise we might swallow it which could hide a problem and which makes debugging harder.

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java
##########
@@ -209,4 +211,15 @@ protected void handleStateChange(ConnectionState newState) {
 	public void unhandledError(String s, Throwable throwable) {
 		leaderListener.handleError(new FlinkException("Unhandled error in ZooKeeperLeaderRetrievalService:" + s, throwable));
 	}
+
+	private void notifyLeaderLoss() {
+		if (lastLeaderAddress != null || lastLeaderSessionID != null) {
+			LOG.debug(
+					"No leader information could be retrieved. Any listeners will be notified.");
+
+			lastLeaderAddress = null;
+			lastLeaderSessionID = null;

Review comment:
       The state changes are not happening under the `lock`. If you introduce `notifyIfNewLeaderAddress`, then this call should happen under the `lock` and we could add `@GuardedBy("lock")` to `notifyIfNewLeaderAddress` to state this contract.

##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionConnectionHandlingTest.java
##########
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.leaderelection;
+
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.configuration.HighAvailabilityOptions;
+import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalListener;
+import org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService;
+import org.apache.flink.runtime.util.ZooKeeperUtils;
+import org.apache.flink.util.TestLogger;
+
+import org.apache.flink.shaded.curator4.org.apache.curator.framework.CuratorFramework;
+
+import org.apache.curator.test.TestingServer;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.time.Duration;
+import java.util.UUID;
+import java.util.concurrent.ArrayBlockingQueue;
+import java.util.concurrent.BlockingQueue;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.TimeUnit;
+
+import static org.hamcrest.CoreMatchers.is;
+import static org.hamcrest.CoreMatchers.not;
+import static org.hamcrest.CoreMatchers.nullValue;
+import static org.junit.Assert.assertThat;
+
+/**
+ * Tests for the error handling in case of a suspended connection to the ZooKeeper instance.
+ */
+public class ZooKeeperLeaderElectionConnectionHandlingTest extends TestLogger {
+
+    private TestingServer testingServer;
+
+    private Configuration config;
+
+    private CuratorFramework zooKeeperClient;
+
+    @Before
+    public void before() throws Exception {
+        testingServer = new TestingServer();
+
+        config = new Configuration();
+        config.setString(HighAvailabilityOptions.HA_MODE, "zookeeper");
+        config.setString(HighAvailabilityOptions.HA_ZOOKEEPER_QUORUM, testingServer.getConnectString());
+
+        zooKeeperClient = ZooKeeperUtils.startCuratorFramework(config);
+    }
+
+    @After
+    public void after() throws Exception {
+        stopTestServer();
+
+        if (zooKeeperClient != null) {
+            zooKeeperClient.close();
+            zooKeeperClient = null;
+        }
+    }
+
+    @Test
+    public void testConnectionSuspendedHandlingDuringInitialization() throws Exception {
+        // initialize LeaderRetrievalService
+        QueueLeaderElectionListener queueLeaderElectionListener = new QueueLeaderElectionListener(1, Duration.ofSeconds(1));
+
+        ZooKeeperLeaderRetrievalService testInstance = ZooKeeperUtils.createLeaderRetrievalService(zooKeeperClient, config);
+        testInstance.start(queueLeaderElectionListener);
+
+        // do the testing
+        CompletableFuture<String> firstAddress = queueLeaderElectionListener.next();
+        assertThat("No results are expected, yet, since no leader was elected.", firstAddress, is(nullValue()));
+
+        stopTestServer();
+
+        CompletableFuture<String> secondAddress = queueLeaderElectionListener.next();
+        assertThat("No result is expected since there was no leader elected before stopping the server, yet.", secondAddress, is(nullValue()));
+    }
+
+    @Test
+    public void testConnectionSuspendedHandling() throws Exception {
+        // initialize LeaderElection-related instances

Review comment:
       Somewhat obvious.

##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionConnectionHandlingTest.java
##########
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.leaderelection;
+
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.configuration.HighAvailabilityOptions;
+import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalListener;
+import org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService;
+import org.apache.flink.runtime.util.ZooKeeperUtils;
+import org.apache.flink.util.TestLogger;
+
+import org.apache.flink.shaded.curator4.org.apache.curator.framework.CuratorFramework;
+
+import org.apache.curator.test.TestingServer;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.time.Duration;
+import java.util.UUID;
+import java.util.concurrent.ArrayBlockingQueue;
+import java.util.concurrent.BlockingQueue;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.TimeUnit;
+
+import static org.hamcrest.CoreMatchers.is;
+import static org.hamcrest.CoreMatchers.not;
+import static org.hamcrest.CoreMatchers.nullValue;
+import static org.junit.Assert.assertThat;
+
+/**
+ * Tests for the error handling in case of a suspended connection to the ZooKeeper instance.
+ */
+public class ZooKeeperLeaderElectionConnectionHandlingTest extends TestLogger {
+
+    private TestingServer testingServer;
+
+    private Configuration config;
+
+    private CuratorFramework zooKeeperClient;
+
+    @Before
+    public void before() throws Exception {
+        testingServer = new TestingServer();
+
+        config = new Configuration();
+        config.setString(HighAvailabilityOptions.HA_MODE, "zookeeper");
+        config.setString(HighAvailabilityOptions.HA_ZOOKEEPER_QUORUM, testingServer.getConnectString());
+
+        zooKeeperClient = ZooKeeperUtils.startCuratorFramework(config);
+    }
+
+    @After
+    public void after() throws Exception {
+        stopTestServer();
+
+        if (zooKeeperClient != null) {
+            zooKeeperClient.close();
+            zooKeeperClient = null;
+        }
+    }
+
+    @Test
+    public void testConnectionSuspendedHandlingDuringInitialization() throws Exception {
+        // initialize LeaderRetrievalService
+        QueueLeaderElectionListener queueLeaderElectionListener = new QueueLeaderElectionListener(1, Duration.ofSeconds(1));
+
+        ZooKeeperLeaderRetrievalService testInstance = ZooKeeperUtils.createLeaderRetrievalService(zooKeeperClient, config);
+        testInstance.start(queueLeaderElectionListener);
+
+        // do the testing
+        CompletableFuture<String> firstAddress = queueLeaderElectionListener.next();
+        assertThat("No results are expected, yet, since no leader was elected.", firstAddress, is(nullValue()));
+
+        stopTestServer();
+
+        CompletableFuture<String> secondAddress = queueLeaderElectionListener.next();
+        assertThat("No result is expected since there was no leader elected before stopping the server, yet.", secondAddress, is(nullValue()));
+    }
+
+    @Test
+    public void testConnectionSuspendedHandling() throws Exception {
+        // initialize LeaderElection-related instances
+        String leaderAddress = "localhost";
+        LeaderElectionService leaderElectionService = ZooKeeperUtils.createLeaderElectionService(zooKeeperClient, config);
+        TestingContender contender = new TestingContender(leaderAddress, leaderElectionService);
+        leaderElectionService.start(contender);
+
+        // initialize LeaderRetrievalService
+        QueueLeaderElectionListener queueLeaderElectionListener = new QueueLeaderElectionListener(2, Duration.ofSeconds(1));

Review comment:
       A timeout of 1 second can be too low for slow testing machines. I would suggest to not use a timeout here since we are expecting that the future will be delivered.

##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionConnectionHandlingTest.java
##########
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.leaderelection;
+
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.configuration.HighAvailabilityOptions;
+import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalListener;
+import org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService;
+import org.apache.flink.runtime.util.ZooKeeperUtils;
+import org.apache.flink.util.TestLogger;
+
+import org.apache.flink.shaded.curator4.org.apache.curator.framework.CuratorFramework;
+
+import org.apache.curator.test.TestingServer;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.time.Duration;
+import java.util.UUID;
+import java.util.concurrent.ArrayBlockingQueue;
+import java.util.concurrent.BlockingQueue;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.TimeUnit;
+
+import static org.hamcrest.CoreMatchers.is;
+import static org.hamcrest.CoreMatchers.not;
+import static org.hamcrest.CoreMatchers.nullValue;
+import static org.junit.Assert.assertThat;
+
+/**
+ * Tests for the error handling in case of a suspended connection to the ZooKeeper instance.
+ */
+public class ZooKeeperLeaderElectionConnectionHandlingTest extends TestLogger {
+
+    private TestingServer testingServer;
+
+    private Configuration config;
+
+    private CuratorFramework zooKeeperClient;
+
+    @Before
+    public void before() throws Exception {
+        testingServer = new TestingServer();
+
+        config = new Configuration();
+        config.setString(HighAvailabilityOptions.HA_MODE, "zookeeper");
+        config.setString(HighAvailabilityOptions.HA_ZOOKEEPER_QUORUM, testingServer.getConnectString());
+
+        zooKeeperClient = ZooKeeperUtils.startCuratorFramework(config);
+    }
+
+    @After
+    public void after() throws Exception {
+        stopTestServer();
+
+        if (zooKeeperClient != null) {
+            zooKeeperClient.close();
+            zooKeeperClient = null;
+        }
+    }
+
+    @Test
+    public void testConnectionSuspendedHandlingDuringInitialization() throws Exception {
+        // initialize LeaderRetrievalService
+        QueueLeaderElectionListener queueLeaderElectionListener = new QueueLeaderElectionListener(1, Duration.ofSeconds(1));
+
+        ZooKeeperLeaderRetrievalService testInstance = ZooKeeperUtils.createLeaderRetrievalService(zooKeeperClient, config);
+        testInstance.start(queueLeaderElectionListener);
+
+        // do the testing
+        CompletableFuture<String> firstAddress = queueLeaderElectionListener.next();

Review comment:
       This call adds a delay of 1s. Since we are calling it twice it adds 2s to the test duration. Since it is generally hard to test the absence of something I would suggest to decrease the timeout to 50 ms.

##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionConnectionHandlingTest.java
##########
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.leaderelection;
+
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.configuration.HighAvailabilityOptions;
+import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalListener;
+import org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService;
+import org.apache.flink.runtime.util.ZooKeeperUtils;
+import org.apache.flink.util.TestLogger;
+
+import org.apache.flink.shaded.curator4.org.apache.curator.framework.CuratorFramework;
+
+import org.apache.curator.test.TestingServer;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.time.Duration;
+import java.util.UUID;
+import java.util.concurrent.ArrayBlockingQueue;
+import java.util.concurrent.BlockingQueue;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.TimeUnit;
+
+import static org.hamcrest.CoreMatchers.is;
+import static org.hamcrest.CoreMatchers.not;
+import static org.hamcrest.CoreMatchers.nullValue;
+import static org.junit.Assert.assertThat;
+
+/**
+ * Tests for the error handling in case of a suspended connection to the ZooKeeper instance.
+ */
+public class ZooKeeperLeaderElectionConnectionHandlingTest extends TestLogger {
+
+    private TestingServer testingServer;
+
+    private Configuration config;
+
+    private CuratorFramework zooKeeperClient;
+
+    @Before
+    public void before() throws Exception {
+        testingServer = new TestingServer();
+
+        config = new Configuration();
+        config.setString(HighAvailabilityOptions.HA_MODE, "zookeeper");
+        config.setString(HighAvailabilityOptions.HA_ZOOKEEPER_QUORUM, testingServer.getConnectString());
+
+        zooKeeperClient = ZooKeeperUtils.startCuratorFramework(config);
+    }
+
+    @After
+    public void after() throws Exception {
+        stopTestServer();
+
+        if (zooKeeperClient != null) {
+            zooKeeperClient.close();
+            zooKeeperClient = null;
+        }
+    }
+
+    @Test
+    public void testConnectionSuspendedHandlingDuringInitialization() throws Exception {
+        // initialize LeaderRetrievalService

Review comment:
       Somewhat obvious.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13055: [FLINK-18677] Added handling of suspended or lost connections

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13055:
URL: https://github.com/apache/flink/pull/13055#issuecomment-668083003


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ed6fb2c77645701eb585d2700c852657b1f66fc9",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5133",
       "triggerID" : "ed6fb2c77645701eb585d2700c852657b1f66fc9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6acb8dae3e8cefea998c5ec78683a4dab16a1f78",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6acb8dae3e8cefea998c5ec78683a4dab16a1f78",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f5df4b17aae9f1abda5c07430bf24c38f08de3e6",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5173",
       "triggerID" : "f5df4b17aae9f1abda5c07430bf24c38f08de3e6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ed6fb2c77645701eb585d2700c852657b1f66fc9 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5133) 
   * 6acb8dae3e8cefea998c5ec78683a4dab16a1f78 UNKNOWN
   * f5df4b17aae9f1abda5c07430bf24c38f08de3e6 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5173) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot commented on pull request #13055: FLINK-18677: [fix] Added handling of suspended or lost connections

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #13055:
URL: https://github.com/apache/flink/pull/13055#issuecomment-668083003


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ed6fb2c77645701eb585d2700c852657b1f66fc9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "ed6fb2c77645701eb585d2700c852657b1f66fc9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ed6fb2c77645701eb585d2700c852657b1f66fc9 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] tisonkun commented on a change in pull request #13055: [FLINK-18677] Added handling of suspended or lost connections

Posted by GitBox <gi...@apache.org>.
tisonkun commented on a change in pull request #13055:
URL: https://github.com/apache/flink/pull/13055#discussion_r616812510



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java
##########
@@ -193,14 +185,20 @@ protected void handleStateChange(ConnectionState newState) {
 				break;
 			case SUSPENDED:
 				LOG.warn("Connection to ZooKeeper suspended. Can no longer retrieve the leader from " +
-					"ZooKeeper.");
+						"ZooKeeper.");
+				synchronized (lock) {
+					notifyLeaderLoss();

Review comment:
       This work does no harm with FLINK-10052. I have no other concern.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] XComp commented on pull request #13055: FLINK-18677: [fix] Added handling of suspended or lost connections

Posted by GitBox <gi...@apache.org>.
XComp commented on pull request #13055:
URL: https://github.com/apache/flink/pull/13055#issuecomment-668074502


   @tillrohrmann the PR is ready to be reviewed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] tillrohrmann commented on a change in pull request #13055: [FLINK-18677] Added handling of suspended or lost connections

Posted by GitBox <gi...@apache.org>.
tillrohrmann commented on a change in pull request #13055:
URL: https://github.com/apache/flink/pull/13055#discussion_r616501354



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java
##########
@@ -193,14 +185,20 @@ protected void handleStateChange(ConnectionState newState) {
 				break;
 			case SUSPENDED:
 				LOG.warn("Connection to ZooKeeper suspended. Can no longer retrieve the leader from " +
-					"ZooKeeper.");
+						"ZooKeeper.");
+				synchronized (lock) {
+					notifyLeaderLoss();

Review comment:
       I think we trigger an explicit `ZooKeeperLeaderRetrievalDriver.retrieveLeaderInformationFromZooKeeper` after we reconnect. Do you think that this won't be enough?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot commented on pull request #13055: FLINK-18677: [fix] Added handling of suspended or lost connections wi…

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #13055:
URL: https://github.com/apache/flink/pull/13055#issuecomment-668073840


   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit ed6fb2c77645701eb585d2700c852657b1f66fc9 (Mon Aug 03 15:05:33 UTC 2020)
   
   **Warnings:**
    * No documentation files were touched! Remember to keep the Flink docs up to date!
   
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into to Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.<details>
    The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] tillrohrmann closed pull request #13055: [FLINK-18677] Added handling of suspended or lost connections

Posted by GitBox <gi...@apache.org>.
tillrohrmann closed pull request #13055:
URL: https://github.com/apache/flink/pull/13055


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13055: [FLINK-18677] Added handling of suspended or lost connections

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13055:
URL: https://github.com/apache/flink/pull/13055#issuecomment-668083003


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ed6fb2c77645701eb585d2700c852657b1f66fc9",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5133",
       "triggerID" : "ed6fb2c77645701eb585d2700c852657b1f66fc9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6acb8dae3e8cefea998c5ec78683a4dab16a1f78",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6acb8dae3e8cefea998c5ec78683a4dab16a1f78",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ed6fb2c77645701eb585d2700c852657b1f66fc9 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5133) 
   * 6acb8dae3e8cefea998c5ec78683a4dab16a1f78 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] tisonkun commented on a change in pull request #13055: [FLINK-18677] Added handling of suspended or lost connections

Posted by GitBox <gi...@apache.org>.
tisonkun commented on a change in pull request #13055:
URL: https://github.com/apache/flink/pull/13055#discussion_r615871963



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java
##########
@@ -193,14 +185,20 @@ protected void handleStateChange(ConnectionState newState) {
 				break;
 			case SUSPENDED:
 				LOG.warn("Connection to ZooKeeper suspended. Can no longer retrieve the leader from " +
-					"ZooKeeper.");
+						"ZooKeeper.");
+				synchronized (lock) {
+					notifyLeaderLoss();

Review comment:
       @tillrohrmann what if a suspend happens while the leader info node content actually not changed?
   
   It causes the task manager cannot find the job manager anymore. Because we actually filter same leader info inside `DefaultLeaderRetrievalService`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13055: FLINK-18677: [fix] Added handling of suspended or lost connections

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13055:
URL: https://github.com/apache/flink/pull/13055#issuecomment-668083003


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ed6fb2c77645701eb585d2700c852657b1f66fc9",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5133",
       "triggerID" : "ed6fb2c77645701eb585d2700c852657b1f66fc9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ed6fb2c77645701eb585d2700c852657b1f66fc9 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5133) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] tillrohrmann commented on a change in pull request #13055: [FLINK-18677] Added handling of suspended or lost connections

Posted by GitBox <gi...@apache.org>.
tillrohrmann commented on a change in pull request #13055:
URL: https://github.com/apache/flink/pull/13055#discussion_r465074689



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java
##########
@@ -192,15 +184,21 @@ protected void handleStateChange(ConnectionState newState) {
 				LOG.debug("Connected to ZooKeeper quorum. Leader retrieval can start.");
 				break;
 			case SUSPENDED:
-				LOG.warn("Connection to ZooKeeper suspended. Can no longer retrieve the leader from " +
-					"ZooKeeper.");
+				synchronized (lock) {
+					LOG.warn("Connection to ZooKeeper suspended. Can no longer retrieve the leader from " +
+							"ZooKeeper.");

Review comment:
       Logging statements can happen outside of the `lock`.

##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionConnectionHandlingTest.java
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.leaderelection;
+
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.configuration.HighAvailabilityOptions;
+import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalListener;
+import org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService;
+import org.apache.flink.runtime.util.ZooKeeperUtils;
+import org.apache.flink.util.TestLogger;
+
+import org.apache.flink.shaded.curator4.org.apache.curator.framework.CuratorFramework;
+
+import org.apache.curator.test.TestingServer;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.time.Duration;
+import java.util.UUID;
+import java.util.concurrent.ArrayBlockingQueue;
+import java.util.concurrent.BlockingQueue;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.TimeUnit;
+
+import static org.hamcrest.CoreMatchers.is;
+import static org.hamcrest.CoreMatchers.not;
+import static org.hamcrest.CoreMatchers.nullValue;
+import static org.junit.Assert.assertThat;
+
+/**
+ * Tests for the error handling in case of a suspended connection to the ZooKeeper instance.
+ */
+public class ZooKeeperLeaderElectionConnectionHandlingTest extends TestLogger {
+
+	private TestingServer testingServer;
+
+	private Configuration config;
+
+	private CuratorFramework zooKeeperClient;
+
+	@Before
+	public void before() throws Exception {
+		testingServer = new TestingServer();
+
+		config = new Configuration();
+		config.setString(HighAvailabilityOptions.HA_MODE, "zookeeper");
+		config.setString(HighAvailabilityOptions.HA_ZOOKEEPER_QUORUM, testingServer.getConnectString());
+
+		zooKeeperClient = ZooKeeperUtils.startCuratorFramework(config);
+	}
+
+	@After
+	public void after() throws Exception {
+		stopTestServer();
+
+		if (zooKeeperClient != null) {
+			zooKeeperClient.close();
+			zooKeeperClient = null;
+		}
+	}
+
+	@Test
+	public void testConnectionSuspendedHandlingDuringInitialization() throws Exception {
+		QueueLeaderElectionListener queueLeaderElectionListener = new QueueLeaderElectionListener(1, Duration.ofMillis(50));
+
+		ZooKeeperLeaderRetrievalService testInstance = ZooKeeperUtils.createLeaderRetrievalService(zooKeeperClient, config);
+		testInstance.start(queueLeaderElectionListener);
+
+		// do the testing
+		CompletableFuture<String> firstAddress = queueLeaderElectionListener.next();
+		assertThat("No results are expected, yet, since no leader was elected.", firstAddress, is(nullValue()));
+
+		stopTestServer();
+
+		CompletableFuture<String> secondAddress = queueLeaderElectionListener.next();
+		assertThat("No result is expected since there was no leader elected before stopping the server, yet.", secondAddress, is(nullValue()));
+	}
+
+	@Test
+	public void testConnectionSuspendedHandling() throws Exception {
+		String leaderAddress = "localhost";
+		LeaderElectionService leaderElectionService = ZooKeeperUtils.createLeaderElectionService(zooKeeperClient, config);
+		TestingContender contender = new TestingContender(leaderAddress, leaderElectionService);
+		leaderElectionService.start(contender);
+
+		QueueLeaderElectionListener queueLeaderElectionListener = new QueueLeaderElectionListener(2);
+
+		ZooKeeperLeaderRetrievalService testInstance = ZooKeeperUtils.createLeaderRetrievalService(zooKeeperClient, config);
+		testInstance.start(queueLeaderElectionListener);
+
+		// do the testing
+		CompletableFuture<String> firstAddress = queueLeaderElectionListener.next();
+		assertThat("The first result is expected to be the initially set leader address.", firstAddress.get(), is(leaderAddress));
+
+		stopTestServer();
+
+		CompletableFuture<String> secondAddress = queueLeaderElectionListener.next();
+		assertThat("The next result must not be missing.", secondAddress, not(is(nullValue())));

Review comment:
       ```suggestion
   		assertThat("The next result must not be missing.", secondAddress, is(not(nullValue())));
   ```

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java
##########
@@ -192,15 +184,21 @@ protected void handleStateChange(ConnectionState newState) {
 				LOG.debug("Connected to ZooKeeper quorum. Leader retrieval can start.");
 				break;
 			case SUSPENDED:
-				LOG.warn("Connection to ZooKeeper suspended. Can no longer retrieve the leader from " +
-					"ZooKeeper.");
+				synchronized (lock) {
+					LOG.warn("Connection to ZooKeeper suspended. Can no longer retrieve the leader from " +
+							"ZooKeeper.");
+					notifyLeaderLoss();
+				}
 				break;
 			case RECONNECTED:
 				LOG.info("Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.");
 				break;
 			case LOST:
-				LOG.warn("Connection to ZooKeeper lost. Can no longer retrieve the leader from " +
-					"ZooKeeper.");
+				synchronized (lock) {
+					LOG.warn("Connection to ZooKeeper lost. Can no longer retrieve the leader from " +
+							"ZooKeeper.");

Review comment:
       Same here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13055: [FLINK-18677] Added handling of suspended or lost connections

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13055:
URL: https://github.com/apache/flink/pull/13055#issuecomment-668073840


   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit f5df4b17aae9f1abda5c07430bf24c38f08de3e6 (Sat Aug 28 11:16:01 UTC 2021)
   
   **Warnings:**
    * No documentation files were touched! Remember to keep the Flink docs up to date!
   
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into to Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.<details>
    The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13055: FLINK-18677: [fix] Added handling of suspended or lost connections

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13055:
URL: https://github.com/apache/flink/pull/13055#issuecomment-668083003


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ed6fb2c77645701eb585d2700c852657b1f66fc9",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5133",
       "triggerID" : "ed6fb2c77645701eb585d2700c852657b1f66fc9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ed6fb2c77645701eb585d2700c852657b1f66fc9 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5133) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13055: [FLINK-18677] Added handling of suspended or lost connections

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13055:
URL: https://github.com/apache/flink/pull/13055#issuecomment-668083003


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ed6fb2c77645701eb585d2700c852657b1f66fc9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5133",
       "triggerID" : "ed6fb2c77645701eb585d2700c852657b1f66fc9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6acb8dae3e8cefea998c5ec78683a4dab16a1f78",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6acb8dae3e8cefea998c5ec78683a4dab16a1f78",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f5df4b17aae9f1abda5c07430bf24c38f08de3e6",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5173",
       "triggerID" : "f5df4b17aae9f1abda5c07430bf24c38f08de3e6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6acb8dae3e8cefea998c5ec78683a4dab16a1f78 UNKNOWN
   * f5df4b17aae9f1abda5c07430bf24c38f08de3e6 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5173) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13055: [FLINK-18677] Added handling of suspended or lost connections

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13055:
URL: https://github.com/apache/flink/pull/13055#issuecomment-668083003


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ed6fb2c77645701eb585d2700c852657b1f66fc9",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5133",
       "triggerID" : "ed6fb2c77645701eb585d2700c852657b1f66fc9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6acb8dae3e8cefea998c5ec78683a4dab16a1f78",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6acb8dae3e8cefea998c5ec78683a4dab16a1f78",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f5df4b17aae9f1abda5c07430bf24c38f08de3e6",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "f5df4b17aae9f1abda5c07430bf24c38f08de3e6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ed6fb2c77645701eb585d2700c852657b1f66fc9 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5133) 
   * 6acb8dae3e8cefea998c5ec78683a4dab16a1f78 UNKNOWN
   * f5df4b17aae9f1abda5c07430bf24c38f08de3e6 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] tisonkun commented on a change in pull request #13055: [FLINK-18677] Added handling of suspended or lost connections

Posted by GitBox <gi...@apache.org>.
tisonkun commented on a change in pull request #13055:
URL: https://github.com/apache/flink/pull/13055#discussion_r615882602



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java
##########
@@ -193,14 +185,20 @@ protected void handleStateChange(ConnectionState newState) {
 				break;
 			case SUSPENDED:
 				LOG.warn("Connection to ZooKeeper suspended. Can no longer retrieve the leader from " +
-					"ZooKeeper.");
+						"ZooKeeper.");
+				synchronized (lock) {
+					notifyLeaderLoss();

Review comment:
       Will if the leadership changed, it will generate a new id




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org