Posted to notifications@ignite.apache.org by GitBox <gi...@apache.org> on 2020/08/11 14:18:41 UTC

[GitHub] [ignite] Vladsz83 opened a new pull request #8142: Fix measuring timers in discovery tests

Vladsz83 opened a new pull request #8142:
URL: https://github.com/apache/ignite/pull/8142


   The discovery test measures the node failure detection delay inaccurately. The node failure event timers should be captured immediately.
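
   For illustration only, a minimal sketch of the timing fix described above. It reuses the monotonic()
   and await_event_on_node() calls visible in the review hunks below; the surrounding variables are
   assumptions, not code from this PR:

       # Hypothetical sketch: record the timestamp immediately after the awaited
       # log event is observed, before any further processing, so the measured
       # detection delay is not inflated by unrelated work.
       start = self.monotonic()

       self.servers.await_event_on_node(pattern, survived_node, 10,
                                        from_the_beginning=True, backoff_sec=0.01)

       detected = self.monotonic()  # fixate the timer right away
       data['Failure detection time (s)'] = round(detected - start, 3)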


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on a change in pull request #8142: Fix measuring timers in discovery tests

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on a change in pull request #8142:
URL: https://github.com/apache/ignite/pull/8142#discussion_r469199342



##########
File path: modules/ducktests/tests/ignitetest/tests/discovery_test.py
##########
@@ -32,41 +37,58 @@
 # pylint: disable=W0223
 class DiscoveryTest(IgniteTest):
     """
-    Test basic discovery scenarious (TCP and Zookeeper).
+    Test various node failure scenarios (TCP and ZooKeeper).
     1. Start of ignite cluster.
     2. Kill random node.
     3. Wait that survived node detects node failure.
     """
     NUM_NODES = 7
 
+    FAILURE_DETECTION_TIMEOUT = 2000
+
+    __LOG_PATH = os.path.join(IgniteAwareService.PERSISTENT_ROOT, "console.log")

Review comment:
       Fixed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on a change in pull request #8142: Fix measuring timers in discovery tests

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on a change in pull request #8142:
URL: https://github.com/apache/ignite/pull/8142#discussion_r469225365



##########
File path: modules/ducktests/tests/ignitetest/tests/discovery_test.py
##########
@@ -81,31 +103,68 @@ def teardown(self):
     @cluster(num_nodes=NUM_NODES)
     @parametrize(version=str(DEV_BRANCH))
     @parametrize(version=str(LATEST_2_7))
-    def test_tcp(self, version):
+    def test_tcp_not_coordinator_single(self, version):
+        """
+        Test single-node-failure scenario (not the coordinator) with TcpDiscoverySpi.
+        """
+        return self.__simulate_nodes_failure(version, self.__properties(), 1)

Review comment:
       Yes, counter-intuitive. But you don't see that now: the test names are intuitive. It would become counter-intuitive with @matrix, and that parametrization requires extra test organization. This PR does not include a case for failing the coordinator together with some other node. That could be done with a few small changes and an extra parameter, but it looks like another ticket and PR to me.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on a change in pull request #8142: Fix measuring timers in discovery tests

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on a change in pull request #8142:
URL: https://github.com/apache/ignite/pull/8142#discussion_r469208452



##########
File path: modules/ducktests/tests/ignitetest/tests/discovery_test.py
##########
@@ -81,31 +103,68 @@ def teardown(self):
     @cluster(num_nodes=NUM_NODES)
     @parametrize(version=str(DEV_BRANCH))
     @parametrize(version=str(LATEST_2_7))
-    def test_tcp(self, version):
+    def test_tcp_not_coordinator_single(self, version):
+        """
+        Test single-node-failure scenario (not the coordinator) with TcpDiscoverySpi.
+        """
+        return self.__simulate_nodes_failure(version, self.__properties(), 1)

Review comment:
       > @matrix
   
   We wouldn't easily distinguish which test failed because, for example, 0 nodes to kill means killing the coordinator in the internal routines. This reduces the parameter count. Now we can see it clearly from the test name. That is already a lot of parametrization for me. We have 6 test functions; that doesn't look like too many. Makes sense?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] anton-vinogradov commented on a change in pull request #8142: Fix measuring timers in discovery tests

Posted by GitBox <gi...@apache.org>.
anton-vinogradov commented on a change in pull request #8142:
URL: https://github.com/apache/ignite/pull/8142#discussion_r469220569



##########
File path: modules/ducktests/tests/ignitetest/tests/discovery_test.py
##########
@@ -81,31 +103,68 @@ def teardown(self):
     @cluster(num_nodes=NUM_NODES)
     @parametrize(version=str(DEV_BRANCH))
     @parametrize(version=str(LATEST_2_7))
-    def test_tcp(self, version):
+    def test_tcp_not_coordinator_single(self, version):
+        """
+        Test single-node-failure scenario (not the coordinator) with TcpDiscoverySpi.
+        """
+        return self.__simulate_nodes_failure(version, self.__properties(), 1)

Review comment:
       I see code duplication, and we should reduce it if possible.
   @matrix is made for this.
   
   Using 0 to stop the coordinator looks counter-intuitive to me. What about the case where we kill the coordinator and some other node?
   How do we kill a node after the coordinator in one case and before it in another?
   Wouldn't a matrix with a list of nodes to kill be clearer? E.g. [0, 3, 7], [5, 0, 2], [0, 5, 2].
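
   For illustration, a sketch (in ducktape style) of the kind of @matrix parametrization suggested above.
   The parameter name nodes_to_kill, the index convention (0 = coordinator) and the test body are
   assumptions, not code from this PR:

       # Hypothetical: a single test method parametrized over versions and kill lists.
       from ducktape.mark import matrix
       from ducktape.mark.resource import cluster

       @cluster(num_nodes=NUM_NODES)
       @matrix(version=[str(DEV_BRANCH), str(LATEST_2_7)],
               nodes_to_kill=[[3, 7], [5, 2], [0, 5, 2]])  # 0 would mean the coordinator
       def test_tcp_kill_nodes(self, version, nodes_to_kill):
           """Kill the listed nodes and measure failure detection on a survivor."""
           return self.__simulate_nodes_failure(version, nodes_to_kill=nodes_to_kill)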




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] anton-vinogradov commented on a change in pull request #8142: Fix measuring timers in discovery tests

Posted by GitBox <gi...@apache.org>.
anton-vinogradov commented on a change in pull request #8142:
URL: https://github.com/apache/ignite/pull/8142#discussion_r469192064



##########
File path: modules/ducktests/tests/ignitetest/tests/discovery_test.py
##########
@@ -32,41 +37,58 @@
 # pylint: disable=W0223
 class DiscoveryTest(IgniteTest):
     """
-    Test basic discovery scenarious (TCP and Zookeeper).
+    Test various node failure scenarios (TCP and ZooKeeper).
     1. Start of ignite cluster.
     2. Kill random node.
     3. Wait that survived node detects node failure.
     """
     NUM_NODES = 7
 
+    FAILURE_DETECTION_TIMEOUT = 2000
+
+    __LOG_PATH = os.path.join(IgniteAwareService.PERSISTENT_ROOT, "console.log")

Review comment:
       Why not reuse this from IgniteAwareService?
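
   One possible reading of this comment, sketched below: define the console log path once on
   IgniteAwareService and let the test reference it. The CONSOLE_LOG name, the PERSISTENT_ROOT value
   and the base class are placeholders for illustration; they are not shown in this hunk:

       import os

       class IgniteAwareService(object):  # real base class omitted in this sketch
           PERSISTENT_ROOT = "/mnt/service"  # illustrative value only
           CONSOLE_LOG = os.path.join(PERSISTENT_ROOT, "console.log")

       # the test would then simply do:
       #     __LOG_PATH = IgniteAwareService.CONSOLE_LOG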




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on a change in pull request #8142: Fix measuring timers in discovery tests

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on a change in pull request #8142:
URL: https://github.com/apache/ignite/pull/8142#discussion_r469208452



##########
File path: modules/ducktests/tests/ignitetest/tests/discovery_test.py
##########
@@ -81,31 +103,68 @@ def teardown(self):
     @cluster(num_nodes=NUM_NODES)
     @parametrize(version=str(DEV_BRANCH))
     @parametrize(version=str(LATEST_2_7))
-    def test_tcp(self, version):
+    def test_tcp_not_coordinator_single(self, version):
+        """
+        Test single-node-failure scenario (not the coordinator) with TcpDiscoverySpi.
+        """
+        return self.__simulate_nodes_failure(version, self.__properties(), 1)

Review comment:
       > @matrix
   
   We wouldn't easily distinguish which test failed because, for example, 0 nodes to kill means killing the coordinator in the internal routines. This reduces the parameter count. Now we can see it clearly from the test name. That is already a lot of parametrization for me. We have only 4 test functions; that doesn't look like too many. Makes sense?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] anton-vinogradov commented on pull request #8142: Fix measuring timers in discovery tests

Posted by GitBox <gi...@apache.org>.
anton-vinogradov commented on pull request #8142:
URL: https://github.com/apache/ignite/pull/8142#issuecomment-672852933


   LGTM


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on a change in pull request #8142: Fix measuring timers in discovery tests

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on a change in pull request #8142:
URL: https://github.com/apache/ignite/pull/8142#discussion_r469141540



##########
File path: modules/ducktests/tests/ignitetest/tests/discovery_test.py
##########
@@ -81,28 +90,65 @@ def teardown(self):
     @cluster(num_nodes=NUM_NODES)
     @parametrize(version=str(DEV_BRANCH))
     @parametrize(version=str(LATEST_2_7))
-    def test_tcp(self, version):
+    def test_tcp_not_coordinator_single(self, version):
+        """
+        Test single-node-failure scenario (not the coordinator) with TcpDiscoverySpi.
+        """
+        return self.__simulate_nodes_failure(version, nodes_to_kill=1)
+
+    @cluster(num_nodes=NUM_NODES)
+    @parametrize(version=str(DEV_BRANCH))
+    @parametrize(version=str(LATEST_2_7))
+    def test_tcp_not_coordinator_two(self, version):
+        """
+        Test two-node-failure scenario (not the coordinator) with TcpDiscoverySpi.
+        """
+        return self.__simulate_nodes_failure(version, nodes_to_kill=2)
+
+    @cluster(num_nodes=NUM_NODES)
+    @parametrize(version=str(DEV_BRANCH))
+    @parametrize(version=str(LATEST_2_7))
+    def test_tcp_coordinator(self, version):
+        """
+        Test coordinator-failure scenario with TcpDiscoverySpi.
+        """
+        return self.__simulate_nodes_failure(version, True)
+
+    @cluster(num_nodes=NUM_NODES + 3)
+    @parametrize(version=str(DEV_BRANCH))
+    @parametrize(version=str(LATEST_2_7))
+    def test_zk_not_coordinator_single(self, version):
         """
-        Test basic discovery scenario with TcpDiscoverySpi.
+        Test single node failure scenario (not the coordinator) with ZooKeeper.
         """
-        return self.__basic_test__(version, False)
+        return self.__simulate_nodes_failure(version, nodes_to_kill=1, coordinator=False, with_zk=True)
 
     @cluster(num_nodes=NUM_NODES + 3)
     @parametrize(version=str(DEV_BRANCH))
     @parametrize(version=str(LATEST_2_7))
-    def test_zk(self, version):
+    def test_zk_not_coordinator_two(self, version):
         """
-        Test basic discovery scenario with ZookeeperDiscoverySpi.
+        Test two-node-failure scenario (not the coordinator) with ZooKeeper.
         """
-        return self.__basic_test__(version, True)
+        return self.__simulate_nodes_failure(version, nodes_to_kill=2, coordinator=False, with_zk=True)
 
-    def __basic_test__(self, version, with_zk=False):
+    @cluster(num_nodes=NUM_NODES+3)
+    @parametrize(version=str(DEV_BRANCH))
+    @parametrize(version=str(LATEST_2_7))
+    def test_zk_coordinator(self, version):
+        """
+        Test coordinator-failure scenario with ZooKeeper.
+        """
+        return self.__simulate_nodes_failure(version, coordinator=True, with_zk=True)
+
+    # pylint: disable=R0913,R0914

Review comment:
       Fixed

##########
File path: modules/ducktests/tests/ignitetest/tests/discovery_test.py
##########
@@ -117,46 +163,81 @@ def __basic_test__(self, version, with_zk=False):
 
         start = self.monotonic()
         self.servers.start()
-        data = {'Ignite cluster start time (s)': self.monotonic() - start}
+        data = {'Ignite cluster start time (s)': round(self.monotonic() - start, 1)}
         self.stage("Topology is ready")
 
-        # Node failure detection
-        fail_node, survived_node = self.choose_random_node_to_kill(self.servers)
+        if nodes_to_kill > self.servers.num_nodes - 1 or coordinator and nodes_to_kill > 1:
+            raise Exception("Too many nodes to kill: " + str(nodes_to_kill))
+
+        if coordinator:
+            node_chooser = lambda nodes: \
+                next(node for node in nodes if node.discovery_info().node_id == nodes[0].discovery_info().coordinator)
+        else:
+            node_chooser = lambda nodes: \
+                random.sample([n for n in self.servers.nodes if n.discovery_info().node_id !=
+                               self.servers.nodes[0].discovery_info().coordinator], nodes_to_kill)
 
-        data["nodes"] = [node.node_id() for node in self.servers.nodes]
+        failed_nodes, survived_node = self.choose_node_to_kill(self.servers.nodes, node_chooser)
 
-        disco_infos = []
-        for node in self.servers.nodes:
-            disco_info = node.discovery_info()
-            disco_infos.append({
-                "id": disco_info.node_id,
-                "consistent_id": disco_info.consistent_id,
-                "coordinator": disco_info.coordinator,
-                "order": disco_info.order,
-                "int_order": disco_info.int_order,
-                "is_client": disco_info.is_client
-            })
+        ids_to_wait = [node.discovery_info().node_id for node in failed_nodes]
 
-        data["node_disco_info"] = disco_infos
+        self.stage("Stopping " + str(len(failed_nodes)) + " nodes.")
 
-        self.servers.stop_node(fail_node, clean_shutdown=False)
+        first_terminated = self.servers.stop_nodes_async(failed_nodes, clean_shutdown=False, wait_for_stop=False)
 
-        start = self.monotonic()
-        self.servers.await_event_on_node("Node FAILED", random.choice(survived_node), 60, True)
+        self.stage("Waiting for failure detection of " + str(len(failed_nodes)) + " nodes.")
+
+        # Keeps dates of logged node failures.
+        last_failure_detected = 0
+        logged_timestamps = []
+
+        for failed_id in ids_to_wait:
+            pattern = "Node FAILED: " + ("ZookeeperClusterNode" if with_zk else "TcpDiscoveryNode") + " \\[id=" \
+                      + failed_id
+
+            self.servers.await_event_on_node(pattern, survived_node, 10, from_the_beginning=True, backoff_sec=0.01)
 
-        data['Failure of node detected in time (s)'] = self.monotonic() - start
+            last_failure_detected = self.monotonic()
+
+            self.stage("Failure detection measured.")
+
+        for failed_id in ids_to_wait:
+            pattern = "Node FAILED: " + ("ZookeeperClusterNode" if with_zk else "TcpDiscoveryNode") + " \\[id=" \

Review comment:
       Fixed

##########
File path: modules/ducktests/tests/ignitetest/tests/discovery_test.py
##########
@@ -81,28 +90,65 @@ def teardown(self):
     @cluster(num_nodes=NUM_NODES)
     @parametrize(version=str(DEV_BRANCH))
     @parametrize(version=str(LATEST_2_7))
-    def test_tcp(self, version):
+    def test_tcp_not_coordinator_single(self, version):
+        """
+        Test single-node-failure scenario (not the coordinator) with TcpDiscoverySpi.
+        """
+        return self.__simulate_nodes_failure(version, nodes_to_kill=1)
+
+    @cluster(num_nodes=NUM_NODES)
+    @parametrize(version=str(DEV_BRANCH))
+    @parametrize(version=str(LATEST_2_7))
+    def test_tcp_not_coordinator_two(self, version):
+        """
+        Test two-node-failure scenario (not the coordinator) with TcpDiscoverySpi.
+        """
+        return self.__simulate_nodes_failure(version, nodes_to_kill=2)
+
+    @cluster(num_nodes=NUM_NODES)
+    @parametrize(version=str(DEV_BRANCH))
+    @parametrize(version=str(LATEST_2_7))
+    def test_tcp_coordinator(self, version):
+        """
+        Test coordinator-failure scenario with TcpDiscoverySpi.
+        """
+        return self.__simulate_nodes_failure(version, True)
+
+    @cluster(num_nodes=NUM_NODES + 3)
+    @parametrize(version=str(DEV_BRANCH))
+    @parametrize(version=str(LATEST_2_7))
+    def test_zk_not_coordinator_single(self, version):
         """
-        Test basic discovery scenario with TcpDiscoverySpi.
+        Test single node failure scenario (not the coordinator) with ZooKeeper.
         """
-        return self.__basic_test__(version, False)
+        return self.__simulate_nodes_failure(version, nodes_to_kill=1, coordinator=False, with_zk=True)
 
     @cluster(num_nodes=NUM_NODES + 3)
     @parametrize(version=str(DEV_BRANCH))
     @parametrize(version=str(LATEST_2_7))
-    def test_zk(self, version):
+    def test_zk_not_coordinator_two(self, version):
         """
-        Test basic discovery scenario with ZookeeperDiscoverySpi.
+        Test two-node-failure scenario (not the coordinator) with ZooKeeper.
         """
-        return self.__basic_test__(version, True)
+        return self.__simulate_nodes_failure(version, nodes_to_kill=2, coordinator=False, with_zk=True)
 
-    def __basic_test__(self, version, with_zk=False):
+    @cluster(num_nodes=NUM_NODES+3)
+    @parametrize(version=str(DEV_BRANCH))
+    @parametrize(version=str(LATEST_2_7))
+    def test_zk_coordinator(self, version):
+        """
+        Test coordinator-failure scenario with ZooKeeper.
+        """
+        return self.__simulate_nodes_failure(version, coordinator=True, with_zk=True)
+
+    # pylint: disable=R0913,R0914
+    def __simulate_nodes_failure(self, version, coordinator=False, with_zk=False, nodes_to_kill=1):
         if with_zk:

Review comment:
       Fixed

##########
File path: modules/ducktests/tests/ignitetest/services/utils/concurrent.py
##########
@@ -0,0 +1,74 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+This module contains concurrent utils.
+"""
+
+import threading
+
+
+class CountDownLatch(object):
+    """
+    A count-down latch.
+    """
+    def __init__(self, count=1):
+        self.count = count
+        self.cond_var = threading.Condition()
+
+    def count_down(self):
+        """
+        Decreases the latch counter.
+        """
+        with self.cond_var:
+            if self.count > 0:
+                self.count -= 1
+            if self.count == 0:
+                self.cond_var.notifyAll()
+
+    def wait(self):
+        """
+        Blocks current thread if the latch is not free.
+        """
+        with self.cond_var:
+            while self.count > 0:
+                self.cond_var.wait()
+
+
+# pylint: disable=C0116
+class AtomicValue:
+    """
+    An atomic reference holder.
+    """
+    def __init__(self, value=None):
+        self.value = value
+        self.cond_var = threading.Lock()

Review comment:
       Fixed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] anton-vinogradov merged pull request #8142: Fix measuring timers in discovery tests

Posted by GitBox <gi...@apache.org>.
anton-vinogradov merged pull request #8142:
URL: https://github.com/apache/ignite/pull/8142


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] ivandasch commented on a change in pull request #8142: Fix measuring timers in discovery tests

Posted by GitBox <gi...@apache.org>.
ivandasch commented on a change in pull request #8142:
URL: https://github.com/apache/ignite/pull/8142#discussion_r468646361



##########
File path: modules/ducktests/tests/ignitetest/tests/discovery_test.py
##########
@@ -117,46 +163,81 @@ def __basic_test__(self, version, with_zk=False):
 
         start = self.monotonic()
         self.servers.start()
-        data = {'Ignite cluster start time (s)': self.monotonic() - start}
+        data = {'Ignite cluster start time (s)': round(self.monotonic() - start, 1)}
         self.stage("Topology is ready")
 
-        # Node failure detection
-        fail_node, survived_node = self.choose_random_node_to_kill(self.servers)
+        if nodes_to_kill > self.servers.num_nodes - 1 or coordinator and nodes_to_kill > 1:
+            raise Exception("Too many nodes to kill: " + str(nodes_to_kill))
+
+        if coordinator:
+            node_chooser = lambda nodes: \
+                next(node for node in nodes if node.discovery_info().node_id == nodes[0].discovery_info().coordinator)
+        else:
+            node_chooser = lambda nodes: \
+                random.sample([n for n in self.servers.nodes if n.discovery_info().node_id !=
+                               self.servers.nodes[0].discovery_info().coordinator], nodes_to_kill)
 
-        data["nodes"] = [node.node_id() for node in self.servers.nodes]
+        failed_nodes, survived_node = self.choose_node_to_kill(self.servers.nodes, node_chooser)
 
-        disco_infos = []
-        for node in self.servers.nodes:
-            disco_info = node.discovery_info()
-            disco_infos.append({
-                "id": disco_info.node_id,
-                "consistent_id": disco_info.consistent_id,
-                "coordinator": disco_info.coordinator,
-                "order": disco_info.order,
-                "int_order": disco_info.int_order,
-                "is_client": disco_info.is_client
-            })
+        ids_to_wait = [node.discovery_info().node_id for node in failed_nodes]
 
-        data["node_disco_info"] = disco_infos
+        self.stage("Stopping " + str(len(failed_nodes)) + " nodes.")
 
-        self.servers.stop_node(fail_node, clean_shutdown=False)
+        first_terminated = self.servers.stop_nodes_async(failed_nodes, clean_shutdown=False, wait_for_stop=False)
 
-        start = self.monotonic()
-        self.servers.await_event_on_node("Node FAILED", random.choice(survived_node), 60, True)
+        self.stage("Waiting for failure detection of " + str(len(failed_nodes)) + " nodes.")
+
+        # Keeps dates of logged node failures.
+        last_failure_detected = 0
+        logged_timestamps = []
+
+        for failed_id in ids_to_wait:
+            pattern = "Node FAILED: " + ("ZookeeperClusterNode" if with_zk else "TcpDiscoveryNode") + " \\[id=" \

Review comment:
       Maybe try passing a regexp to grep?
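
   One possible reading of this suggestion, sketched below: build a single grep regexp that matches the
   "Node FAILED" line for any of the killed node ids instead of concatenating strings in the loop. The
   helper name and its usage are assumptions, not code from this PR:

       import re

       def failed_pattern(failed_ids, with_zk):
           """Build one regexp matching a 'Node FAILED' log line for any of the given ids."""
           node_cls = "ZookeeperClusterNode" if with_zk else "TcpDiscoveryNode"
           ids = "|".join(re.escape(node_id) for node_id in failed_ids)
           return "Node FAILED: " + node_cls + " \\[id=(" + ids + ")"

       # usage sketch:
       #     self.servers.await_event_on_node(failed_pattern(ids_to_wait, with_zk),
       #                                      survived_node, 10, from_the_beginning=True)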

##########
File path: modules/ducktests/tests/ignitetest/tests/discovery_test.py
##########
@@ -81,28 +90,65 @@ def teardown(self):
     @cluster(num_nodes=NUM_NODES)
     @parametrize(version=str(DEV_BRANCH))
     @parametrize(version=str(LATEST_2_7))
-    def test_tcp(self, version):
+    def test_tcp_not_coordinator_single(self, version):
+        """
+        Test single-node-failure scenario (not the coordinator) with TcpDiscoverySpi.
+        """
+        return self.__simulate_nodes_failure(version, nodes_to_kill=1)
+
+    @cluster(num_nodes=NUM_NODES)
+    @parametrize(version=str(DEV_BRANCH))
+    @parametrize(version=str(LATEST_2_7))
+    def test_tcp_not_coordinator_two(self, version):
+        """
+        Test two-node-failure scenario (not the coordinator) with TcpDiscoverySpi.
+        """
+        return self.__simulate_nodes_failure(version, nodes_to_kill=2)
+
+    @cluster(num_nodes=NUM_NODES)
+    @parametrize(version=str(DEV_BRANCH))
+    @parametrize(version=str(LATEST_2_7))
+    def test_tcp_coordinator(self, version):
+        """
+        Test coordinator-failure scenario with TcpDiscoverySpi.
+        """
+        return self.__simulate_nodes_failure(version, True)
+
+    @cluster(num_nodes=NUM_NODES + 3)
+    @parametrize(version=str(DEV_BRANCH))
+    @parametrize(version=str(LATEST_2_7))
+    def test_zk_not_coordinator_single(self, version):
         """
-        Test basic discovery scenario with TcpDiscoverySpi.
+        Test single node failure scenario (not the coordinator) with ZooKeeper.
         """
-        return self.__basic_test__(version, False)
+        return self.__simulate_nodes_failure(version, nodes_to_kill=1, coordinator=False, with_zk=True)
 
     @cluster(num_nodes=NUM_NODES + 3)
     @parametrize(version=str(DEV_BRANCH))
     @parametrize(version=str(LATEST_2_7))
-    def test_zk(self, version):
+    def test_zk_not_coordinator_two(self, version):
         """
-        Test basic discovery scenario with ZookeeperDiscoverySpi.
+        Test two-node-failure scenario (not the coordinator) with ZooKeeper.
         """
-        return self.__basic_test__(version, True)
+        return self.__simulate_nodes_failure(version, nodes_to_kill=2, coordinator=False, with_zk=True)
 
-    def __basic_test__(self, version, with_zk=False):
+    @cluster(num_nodes=NUM_NODES+3)
+    @parametrize(version=str(DEV_BRANCH))
+    @parametrize(version=str(LATEST_2_7))
+    def test_zk_coordinator(self, version):
+        """
+        Test coordinator-failure scenario with ZooKeeper.
+        """
+        return self.__simulate_nodes_failure(version, coordinator=True, with_zk=True)
+
+    # pylint: disable=R0913,R0914
+    def __simulate_nodes_failure(self, version, coordinator=False, with_zk=False, nodes_to_kill=1):
         if with_zk:

Review comment:
       We could move this part to a separate function and call it in each ZooKeeper step.
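
   A sketch of that refactoring, for illustration. The helper name and the ZookeeperService constructor
   arguments are assumptions; the actual ZooKeeper start-up code is not quoted in this hunk:

       # Hypothetical helper called from each ZooKeeper test path.
       def __start_zookeeper(self, version):
           """Start the ZooKeeper quorum required by ZookeeperDiscoverySpi tests."""
           self.zk_quorum = ZookeeperService(self.test_context, num_nodes=3, version=version)
           self.zk_quorum.start()

       # __simulate_nodes_failure would then begin with:
       #     if with_zk:
       #         self.__start_zookeeper(version)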

##########
File path: modules/ducktests/tests/ignitetest/tests/discovery_test.py
##########
@@ -81,28 +90,65 @@ def teardown(self):
     @cluster(num_nodes=NUM_NODES)
     @parametrize(version=str(DEV_BRANCH))
     @parametrize(version=str(LATEST_2_7))
-    def test_tcp(self, version):
+    def test_tcp_not_coordinator_single(self, version):
+        """
+        Test single-node-failure scenario (not the coordinator) with TcpDiscoverySpi.
+        """
+        return self.__simulate_nodes_failure(version, nodes_to_kill=1)
+
+    @cluster(num_nodes=NUM_NODES)
+    @parametrize(version=str(DEV_BRANCH))
+    @parametrize(version=str(LATEST_2_7))
+    def test_tcp_not_coordinator_two(self, version):
+        """
+        Test two-node-failure scenario (not the coordinator) with TcpDiscoverySpi.
+        """
+        return self.__simulate_nodes_failure(version, nodes_to_kill=2)
+
+    @cluster(num_nodes=NUM_NODES)
+    @parametrize(version=str(DEV_BRANCH))
+    @parametrize(version=str(LATEST_2_7))
+    def test_tcp_coordinator(self, version):
+        """
+        Test coordinator-failure scenario with TcpDiscoverySpi.
+        """
+        return self.__simulate_nodes_failure(version, True)
+
+    @cluster(num_nodes=NUM_NODES + 3)
+    @parametrize(version=str(DEV_BRANCH))
+    @parametrize(version=str(LATEST_2_7))
+    def test_zk_not_coordinator_single(self, version):
         """
-        Test basic discovery scenario with TcpDiscoverySpi.
+        Test single node failure scenario (not the coordinator) with ZooKeeper.
         """
-        return self.__basic_test__(version, False)
+        return self.__simulate_nodes_failure(version, nodes_to_kill=1, coordinator=False, with_zk=True)
 
     @cluster(num_nodes=NUM_NODES + 3)
     @parametrize(version=str(DEV_BRANCH))
     @parametrize(version=str(LATEST_2_7))
-    def test_zk(self, version):
+    def test_zk_not_coordinator_two(self, version):
         """
-        Test basic discovery scenario with ZookeeperDiscoverySpi.
+        Test two-node-failure scenario (not the coordinator) with ZooKeeper.
         """
-        return self.__basic_test__(version, True)
+        return self.__simulate_nodes_failure(version, nodes_to_kill=2, coordinator=False, with_zk=True)
 
-    def __basic_test__(self, version, with_zk=False):
+    @cluster(num_nodes=NUM_NODES+3)
+    @parametrize(version=str(DEV_BRANCH))
+    @parametrize(version=str(LATEST_2_7))
+    def test_zk_coordinator(self, version):
+        """
+        Test coordinator-failure scenario with ZooKeeper.
+        """
+        return self.__simulate_nodes_failure(version, coordinator=True, with_zk=True)
+
+    # pylint: disable=R0913,R0914

Review comment:
       I suppose we could rewrite the test a bit; it's somewhat overcomplicated.
   Maybe it is not a good idea to turn off the linter warnings here?
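
   For illustration, one way to bring the argument and local-variable counts below the R0913/R0914
   thresholds would be to bundle the scenario knobs into a small value object. Everything below is a
   hypothetical sketch, not code from this PR:

       from collections import namedtuple

       # Hypothetical: group the scenario parameters so the helper takes two arguments.
       FailureScenario = namedtuple('FailureScenario', ['coordinator', 'with_zk', 'nodes_to_kill'])

       def __simulate_nodes_failure(self, version, scenario):
           """Kill scenario.nodes_to_kill nodes and measure failure detection."""
           if scenario.with_zk:
               ...  # start ZooKeeper (elided)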

##########
File path: modules/ducktests/tests/ignitetest/tests/discovery_test.py
##########
@@ -117,46 +163,81 @@ def __basic_test__(self, version, with_zk=False):
 
         start = self.monotonic()
         self.servers.start()
-        data = {'Ignite cluster start time (s)': self.monotonic() - start}
+        data = {'Ignite cluster start time (s)': round(self.monotonic() - start, 1)}
         self.stage("Topology is ready")
 
-        # Node failure detection
-        fail_node, survived_node = self.choose_random_node_to_kill(self.servers)
+        if nodes_to_kill > self.servers.num_nodes - 1 or coordinator and nodes_to_kill > 1:
+            raise Exception("Too many nodes to kill: " + str(nodes_to_kill))
+
+        if coordinator:
+            node_chooser = lambda nodes: \
+                next(node for node in nodes if node.discovery_info().node_id == nodes[0].discovery_info().coordinator)
+        else:
+            node_chooser = lambda nodes: \
+                random.sample([n for n in self.servers.nodes if n.discovery_info().node_id !=
+                               self.servers.nodes[0].discovery_info().coordinator], nodes_to_kill)
 
-        data["nodes"] = [node.node_id() for node in self.servers.nodes]
+        failed_nodes, survived_node = self.choose_node_to_kill(self.servers.nodes, node_chooser)
 
-        disco_infos = []
-        for node in self.servers.nodes:
-            disco_info = node.discovery_info()
-            disco_infos.append({
-                "id": disco_info.node_id,
-                "consistent_id": disco_info.consistent_id,
-                "coordinator": disco_info.coordinator,
-                "order": disco_info.order,
-                "int_order": disco_info.int_order,
-                "is_client": disco_info.is_client
-            })
+        ids_to_wait = [node.discovery_info().node_id for node in failed_nodes]
 
-        data["node_disco_info"] = disco_infos
+        self.stage("Stopping " + str(len(failed_nodes)) + " nodes.")
 
-        self.servers.stop_node(fail_node, clean_shutdown=False)
+        first_terminated = self.servers.stop_nodes_async(failed_nodes, clean_shutdown=False, wait_for_stop=False)
 
-        start = self.monotonic()
-        self.servers.await_event_on_node("Node FAILED", random.choice(survived_node), 60, True)
+        self.stage("Waiting for failure detection of " + str(len(failed_nodes)) + " nodes.")
+
+        # Keeps dates of logged node failures.
+        last_failure_detected = 0
+        logged_timestamps = []
+
+        for failed_id in ids_to_wait:
+            pattern = "Node FAILED: " + ("ZookeeperClusterNode" if with_zk else "TcpDiscoveryNode") + " \\[id=" \
+                      + failed_id
+
+            self.servers.await_event_on_node(pattern, survived_node, 10, from_the_beginning=True, backoff_sec=0.01)
 
-        data['Failure of node detected in time (s)'] = self.monotonic() - start
+            last_failure_detected = self.monotonic()
+
+            self.stage("Failure detection measured.")
+
+        for failed_id in ids_to_wait:
+            pattern = "Node FAILED: " + ("ZookeeperClusterNode" if with_zk else "TcpDiscoveryNode") + " \\[id=" \

Review comment:
       Same as above. Seems like a good candidate for a separate function, doesn't it?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on a change in pull request #8142: Fix measuring timers in discovery tests

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on a change in pull request #8142:
URL: https://github.com/apache/ignite/pull/8142#discussion_r469149421



##########
File path: modules/ducktests/tests/ignitetest/services/utils/concurrent.py
##########
@@ -0,0 +1,74 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+This module contains concurrent utils.
+"""
+
+import threading
+
+
+class CountDownLatch(object):
+    """
+    A count-down latch.
+    """
+    def __init__(self, count=1):
+        self.count = count
+        self.cond_var = threading.Condition()
+
+    def count_down(self):
+        """
+        Decreases the latch counter.
+        """
+        with self.cond_var:
+            if self.count > 0:
+                self.count -= 1
+            if self.count == 0:
+                self.cond_var.notifyAll()
+
+    def wait(self):
+        """
+        Blocks current thread if the latch is not free.
+        """
+        with self.cond_var:
+            while self.count > 0:
+                self.cond_var.wait()
+
+
+# pylint: disable=C0116

Review comment:
       Fixed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] anton-vinogradov commented on a change in pull request #8142: Fix measuring timers in discovery tests

Posted by GitBox <gi...@apache.org>.
anton-vinogradov commented on a change in pull request #8142:
URL: https://github.com/apache/ignite/pull/8142#discussion_r469236353



##########
File path: modules/ducktests/tests/ignitetest/tests/discovery_test.py
##########
@@ -81,31 +103,68 @@ def teardown(self):
     @cluster(num_nodes=NUM_NODES)
     @parametrize(version=str(DEV_BRANCH))
     @parametrize(version=str(LATEST_2_7))
-    def test_tcp(self, version):
+    def test_tcp_not_coordinator_single(self, version):
+        """
+        Test single-node-failure scenario (not the coordinator) with TcpDiscoverySpi.
+        """
+        return self.__simulate_nodes_failure(version, self.__properties(), 1)

Review comment:
       Ok for now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] ivandasch commented on a change in pull request #8142: Fix measuring timers in discovery tests

Posted by GitBox <gi...@apache.org>.
ivandasch commented on a change in pull request #8142:
URL: https://github.com/apache/ignite/pull/8142#discussion_r469142943



##########
File path: modules/ducktests/tests/ignitetest/services/utils/concurrent.py
##########
@@ -0,0 +1,74 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+This module contains concurrent utils.
+"""
+
+import threading
+
+
+class CountDownLatch(object):
+    """
+    A count-down latch.
+    """
+    def __init__(self, count=1):
+        self.count = count
+        self.cond_var = threading.Condition()
+
+    def count_down(self):
+        """
+        Decreases the latch counter.
+        """
+        with self.cond_var:
+            if self.count > 0:
+                self.count -= 1
+            if self.count == 0:
+                self.cond_var.notifyAll()
+
+    def wait(self):
+        """
+        Blocks current thread if the latch is not free.
+        """
+        with self.cond_var:
+            while self.count > 0:
+                self.cond_var.wait()
+
+
+# pylint: disable=C0116

Review comment:
       I suppose it would be better to write some docs here, wouldn't it?
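
   For illustration, the kind of method docstrings that would make the C0116 suppression unnecessary.
   Only __init__ is quoted in this hunk, so the method names below are assumptions:

       class AtomicValue:
           """An atomic reference holder."""

           def set(self, value):
               """Atomically replace the stored value."""

           def get(self):
               """Return the currently stored value."""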




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] anton-vinogradov commented on a change in pull request #8142: Fix measuring timers in discovery tests

Posted by GitBox <gi...@apache.org>.
anton-vinogradov commented on a change in pull request #8142:
URL: https://github.com/apache/ignite/pull/8142#discussion_r469203660



##########
File path: modules/ducktests/tests/ignitetest/tests/discovery_test.py
##########
@@ -81,31 +103,68 @@ def teardown(self):
     @cluster(num_nodes=NUM_NODES)
     @parametrize(version=str(DEV_BRANCH))
     @parametrize(version=str(LATEST_2_7))
-    def test_tcp(self, version):
+    def test_tcp_not_coordinator_single(self, version):
+        """
+        Test single-node-failure scenario (not the coordinator) with TcpDiscoverySpi.
+        """
+        return self.__simulate_nodes_failure(version, self.__properties(), 1)

Review comment:
       Can we pass the number of nodes to kill via @matrix to reduce the number of methods?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] anton-vinogradov commented on a change in pull request #8142: Fix measuring timers in discovery tests

Posted by GitBox <gi...@apache.org>.
anton-vinogradov commented on a change in pull request #8142:
URL: https://github.com/apache/ignite/pull/8142#discussion_r469203660



##########
File path: modules/ducktests/tests/ignitetest/tests/discovery_test.py
##########
@@ -81,31 +103,68 @@ def teardown(self):
     @cluster(num_nodes=NUM_NODES)
     @parametrize(version=str(DEV_BRANCH))
     @parametrize(version=str(LATEST_2_7))
-    def test_tcp(self, version):
+    def test_tcp_not_coordinator_single(self, version):
+        """
+        Test single-node-failure scenario (not the coordinator) with TcpDiscoverySpi.
+        """
+        return self.__simulate_nodes_failure(version, self.__properties(), 1)

Review comment:
       Can we pass the number of nodes to kill via @parametrize to reduce the number of methods?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] ivandasch commented on a change in pull request #8142: Fix measuring timers in discovery tests

Posted by GitBox <gi...@apache.org>.
ivandasch commented on a change in pull request #8142:
URL: https://github.com/apache/ignite/pull/8142#discussion_r469302686



##########
File path: modules/ducktests/tests/ignitetest/services/ignite.py
##########
@@ -100,6 +107,59 @@ def stop_node(self, node, clean_shutdown=True, timeout_sec=60):
             self.thread_dump(node)
             raise
 
+    def stop_nodes_async(self, nodes, delay_ms=0, clean_shutdown=True, timeout_sec=20, wait_for_stop=False):
+        """
+        Stops the nodes asynchronously.
+        """
+        sig = signal.SIGTERM if clean_shutdown else signal.SIGKILL
+
+        sem = CountDownLatch(len(nodes))
+        time_holder = AtomicValue()

Review comment:
       Actually, you can use a simple dict here; it is thread-safe thanks to the GIL.
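
   For illustration, the dict-based alternative this refers to: individual dict operations on string keys
   run atomically under CPython's GIL, so a plain dict can hold the first-termination timestamp without an
   explicit lock. The key name is an assumption:

       # Hypothetical sketch replacing AtomicValue with a plain dict.
       time_holder = {}

       def record_first_kill(monotonic_ts):
           # For str keys, dict.setdefault executes atomically under the GIL,
           # so only the first caller's timestamp is kept.
           time_holder.setdefault('first_terminated', monotonic_ts)

       # later:
       #     first_terminated = time_holder.get('first_terminated')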




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on a change in pull request #8142: Fix measuring timers in discovery tests

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on a change in pull request #8142:
URL: https://github.com/apache/ignite/pull/8142#discussion_r469141458



##########
File path: modules/ducktests/tests/ignitetest/tests/discovery_test.py
##########
@@ -117,46 +163,81 @@ def __basic_test__(self, version, with_zk=False):
 
         start = self.monotonic()
         self.servers.start()
-        data = {'Ignite cluster start time (s)': self.monotonic() - start}
+        data = {'Ignite cluster start time (s)': round(self.monotonic() - start, 1)}
         self.stage("Topology is ready")
 
-        # Node failure detection
-        fail_node, survived_node = self.choose_random_node_to_kill(self.servers)
+        if nodes_to_kill > self.servers.num_nodes - 1 or coordinator and nodes_to_kill > 1:
+            raise Exception("Too many nodes to kill: " + str(nodes_to_kill))
+
+        if coordinator:
+            node_chooser = lambda nodes: \
+                next(node for node in nodes if node.discovery_info().node_id == nodes[0].discovery_info().coordinator)
+        else:
+            node_chooser = lambda nodes: \
+                random.sample([n for n in self.servers.nodes if n.discovery_info().node_id !=
+                               self.servers.nodes[0].discovery_info().coordinator], nodes_to_kill)
 
-        data["nodes"] = [node.node_id() for node in self.servers.nodes]
+        failed_nodes, survived_node = self.choose_node_to_kill(self.servers.nodes, node_chooser)
 
-        disco_infos = []
-        for node in self.servers.nodes:
-            disco_info = node.discovery_info()
-            disco_infos.append({
-                "id": disco_info.node_id,
-                "consistent_id": disco_info.consistent_id,
-                "coordinator": disco_info.coordinator,
-                "order": disco_info.order,
-                "int_order": disco_info.int_order,
-                "is_client": disco_info.is_client
-            })
+        ids_to_wait = [node.discovery_info().node_id for node in failed_nodes]
 
-        data["node_disco_info"] = disco_infos
+        self.stage("Stopping " + str(len(failed_nodes)) + " nodes.")
 
-        self.servers.stop_node(fail_node, clean_shutdown=False)
+        first_terminated = self.servers.stop_nodes_async(failed_nodes, clean_shutdown=False, wait_for_stop=False)
 
-        start = self.monotonic()
-        self.servers.await_event_on_node("Node FAILED", random.choice(survived_node), 60, True)
+        self.stage("Waiting for failure detection of " + str(len(failed_nodes)) + " nodes.")
+
+        # Keeps dates of logged node failures.
+        last_failure_detected = 0
+        logged_timestamps = []
+
+        for failed_id in ids_to_wait:
+            pattern = "Node FAILED: " + ("ZookeeperClusterNode" if with_zk else "TcpDiscoveryNode") + " \\[id=" \

Review comment:
       Fixed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] ivandasch commented on a change in pull request #8142: Fix measuring timers in discovery tests

Posted by GitBox <gi...@apache.org>.
ivandasch commented on a change in pull request #8142:
URL: https://github.com/apache/ignite/pull/8142#discussion_r468642939



##########
File path: modules/ducktests/tests/ignitetest/services/utils/concurrent.py
##########
@@ -0,0 +1,74 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+This module contains concurrent utils.
+"""
+
+import threading
+
+
+class CountDownLatch(object):
+    """
+    A count-down latch.
+    """
+    def __init__(self, count=1):
+        self.count = count
+        self.cond_var = threading.Condition()
+
+    def count_down(self):
+        """
+        Decreases the latch counter.
+        """
+        with self.cond_var:
+            if self.count > 0:
+                self.count -= 1
+            if self.count == 0:
+                self.cond_var.notifyAll()
+
+    def wait(self):
+        """
+        Blocks current thread if the latch is not free.
+        """
+        with self.cond_var:
+            while self.count > 0:
+                self.cond_var.wait()
+
+
+# pylint: disable=C0116
+class AtomicValue:
+    """
+    An atomic reference holder.
+    """
+    def __init__(self, value=None):
+        self.value = value
+        self.cond_var = threading.Lock()

Review comment:
       This is a mutex/lock, not a cond_var.
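
   A sketch of the rename being asked for. Only __init__ is quoted in this hunk, so the accessor below is
   a hypothetical addition for illustration:

       import threading

       class AtomicValue:
           """An atomic reference holder."""

           def __init__(self, value=None):
               self.value = value
               self.lock = threading.Lock()  # a plain lock, not a condition variable

           def get_and_set(self, value):
               """Hypothetical accessor: atomically swap the value and return the old one."""
               with self.lock:
                   old, self.value = self.value, value
                   return old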




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org