You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@ignite.apache.org by "rpuch (via GitHub)" <gi...@apache.org> on 2023/01/31 07:35:33 UTC

[GitHub] [ignite-3] rpuch opened a new pull request, #1604: IGNITE-18630 Try to deliver a message until receiver drops out from logical topology

rpuch opened a new pull request, #1604:
URL: https://github.com/apache/ignite-3/pull/1604

   https://issues.apache.org/jira/browse/IGNITE-18630


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@ignite.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ignite-3] rpuch commented on a diff in pull request #1604: IGNITE-18630 Try to deliver a message until receiver drops out from logical topology

Posted by "rpuch (via GitHub)" <gi...@apache.org>.
rpuch commented on code in PR #1604:
URL: https://github.com/apache/ignite-3/pull/1604#discussion_r1094420348


##########
modules/cli/src/integrationTest/java/org/apache/ignite/internal/cli/IntegrationTestBase.java:
##########
@@ -117,7 +117,8 @@ public class IntegrationTestBase extends BaseIgniteAbstractTest {
             + "        gossipInterval: 10\n"
             + "      },\n"
             + "    }\n"
-            + "  }\n"
+            + "  },"
+            + "  cluster.failoverTimeout: 0\n"

Review Comment:
   This is for tests that wait for a node to be expelled from the logical topology. We don't want to wait for 30 seconds per such test. Do you suggest to set a non-zero value, like 1 second?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@ignite.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ignite-3] rpuch commented on a diff in pull request #1604: IGNITE-18630 Try to deliver a message until receiver drops out from logical topology

Posted by "rpuch (via GitHub)" <gi...@apache.org>.
rpuch commented on code in PR #1604:
URL: https://github.com/apache/ignite-3/pull/1604#discussion_r1097001887


##########
modules/runner/src/integrationTest/java/org/apache/ignite/internal/AbstractClusterIntegrationTest.java:
##########
@@ -71,7 +75,8 @@ public abstract class AbstractClusterIntegrationTest extends BaseIgniteAbstractT
             + "        gossipInterval: 10\n"
             + "      },\n"
             + "    }\n"
-            + "  }\n"
+            + "  },"
+            + "  cluster.failoverTimeout: 0\n"

Review Comment:
   Well, it seems to be about CMG, not about the network. You already suggested that we revisit all configuration properties, we could handle this property there as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@ignite.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ignite-3] rpuch closed pull request #1604: IGNITE-18630 Try to deliver a message until receiver drops out from logical topology

Posted by "rpuch (via GitHub)" <gi...@apache.org>.
rpuch closed pull request #1604: IGNITE-18630 Try to deliver a message until receiver drops out from logical topology
URL: https://github.com/apache/ignite-3/pull/1604


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@ignite.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ignite-3] rpuch commented on a diff in pull request #1604: IGNITE-18630 Try to deliver a message until receiver drops out from logical topology

Posted by "rpuch (via GitHub)" <gi...@apache.org>.
rpuch commented on code in PR #1604:
URL: https://github.com/apache/ignite-3/pull/1604#discussion_r1094423789


##########
modules/cluster-management/src/main/java/org/apache/ignite/internal/cluster/management/ClusterManagementGroupManager.java:
##########
@@ -664,6 +679,26 @@ private void sendWithRetry(ClusterNode node, NetworkMessage msg, CompletableFutu
                 });
     }
 
+    @Override
+    public void beforeNodeStop() {
+        CompletableFuture<CmgRaftService> serviceFuture = raftService;
+
+        if (serviceFuture != null) {
+            tryToLeaveLogicalTopologyGracefully(serviceFuture);
+        }
+    }
+
+    private void tryToLeaveLogicalTopologyGracefully(CompletableFuture<CmgRaftService> serviceFuture) {
+        serviceFuture.thenCompose(service -> service.removeFromCluster(Set.of(clusterService.topologyService().localMember())))
+                .orTimeout(GRACEFUL_LEAVE_TIMEOUT_MILLIS, TimeUnit.MILLISECONDS)
+                .exceptionally(ex -> {
+                    LOG.info("Was not able to send a bye-bye in time", ex);
+
+                    return null;
+                })
+                .join();

Review Comment:
   We probably want to give the 'bye-bye' message a chance to be delivered. If we don't wait, we could cut ourselves from RAFT services (or even network), which would prevent it to be delivered. We'll wait for at most 1 second, this does not seem too bad.
   
   WDYT?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@ignite.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ignite-3] rpuch commented on pull request #1604: IGNITE-18630 Try to deliver a message until receiver drops out from logical topology

Posted by "rpuch (via GitHub)" <gi...@apache.org>.
rpuch commented on PR #1604:
URL: https://github.com/apache/ignite-3/pull/1604#issuecomment-1423764415

   Closed as irrelevant


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@ignite.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ignite-3] ibessonov commented on a diff in pull request #1604: IGNITE-18630 Try to deliver a message until receiver drops out from logical topology

Posted by "ibessonov (via GitHub)" <gi...@apache.org>.
ibessonov commented on code in PR #1604:
URL: https://github.com/apache/ignite-3/pull/1604#discussion_r1094194205


##########
modules/network/src/integrationTest/java/org/apache/ignite/network/scalecube/ItScaleCubeNetworkMessagingTest.java:
##########
@@ -320,6 +322,39 @@ public void testMessageGroupsHandlers(TestInfo testInfo) throws Exception {
         assertThat(networkMessageFuture, willBe(equalTo(networkMessage)));
     }
 
+    /**
+     * Tests removal from physical topology.
+     *
+     * @param testInfo Test info.
+     * @throws Exception If failed.
+     */
+    @Test
+    public void removalFromPhysicalTopologyRemovesAndFiresEvent(TestInfo testInfo) throws Exception {

Review Comment:
   I don't understand this test. There's a latch, but no one is waiting for it. The test is probably wrong or incomplete, please revisit it



##########
modules/cli/src/integrationTest/java/org/apache/ignite/internal/cli/IntegrationTestBase.java:
##########
@@ -117,7 +117,8 @@ public class IntegrationTestBase extends BaseIgniteAbstractTest {
             + "        gossipInterval: 10\n"
             + "      },\n"
             + "    }\n"
-            + "  }\n"
+            + "  },"
+            + "  cluster.failoverTimeout: 0\n"

Review Comment:
   Will this make tests less stable? What's the reason?



##########
modules/network/src/main/java/org/apache/ignite/network/scalecube/ScaleCubeTopologyService.java:
##########
@@ -77,41 +87,30 @@ void onMembershipEvent(MembershipEvent event) {
         ClusterNode member = fromMember(event.member(), metadata);
 
         if (event.isAdded()) {
-            members.put(member.address(), member);
-            consistentIdToMemberMap.put(member.name(), member);
+            // When a node appears in DT, it immediately appears in PT.
+            consistentIdsToDiscoveredMembers.put(member.name(), member);
+
+            physicalTopologyMembers.put(member.address(), member);
+            consistentIdsToPhysicalTopologyMembers.put(member.name(), member);
 
             LOG.info("Node joined [node={}]", member);
 
             fireAppearedEvent(member);
         } else if (event.isUpdated()) {
-            members.put(member.address(), member);
-            consistentIdToMemberMap.put(member.name(), member);
+            consistentIdsToDiscoveredMembers.put(member.name(), member);
+
+            physicalTopologyMembers.put(member.address(), member);
+            consistentIdsToPhysicalTopologyMembers.put(member.name(), member);
         } else if (event.isRemoved()) {
-            members.compute(member.address(), (addr, node) -> {
-                // Ignore stale remove event.
-                if (node == null || node.id().equals(member.id())) {
-                    return null;
-                } else {
-                    return node;
-                }
-            });
-
-            consistentIdToMemberMap.compute(member.name(), (consId, node) -> {
-                // Ignore stale remove event.
-                if (node == null || node.id().equals(member.id())) {
-                    return null;
-                } else {
-                    return node;
-                }
-            });
-
-            LOG.info("Node left [member={}]", member);
-
-            fireDisappearedEvent(member);
+            removeFromConsistentIdsToMembers(member, consistentIdsToDiscoveredMembers);
+
+            LOG.info("Node left Discovery Topology [member={}]", member);

Review Comment:
   I don't think that it makes sense to call a set of swim nodes a "discovery topology", let's discuss this. From my point of view, something like "fast-fail" flag in "send" method would be enough, client would just get an exception if there's no channel at this point in time, and retry the request later. No need for extra topology service



##########
modules/network/src/main/java/org/apache/ignite/network/scalecube/ScaleCubeTopologyService.java:
##########
@@ -132,19 +134,45 @@ void updateLocalMetadata(@Nullable NodeMetadata metadata) {
      * @param member Appeared cluster member.
      */
     private void fireAppearedEvent(ClusterNode member) {
+        // Appearing in the Discovery Topology is equivalent to appearing in the Physical topology, so we trigger
+        // PT.onAppeared() right away.
+
         for (TopologyEventHandler handler : getEventHandlers()) {
-            handler.onAppeared(member);
+            try {
+                handler.onAppeared(member);
+            } catch (Exception | AssertionError e) {

Review Comment:
   Wow, why do we specifically catch assertion errors? Why not "Throwable" then?



##########
modules/runner/src/integrationTest/java/org/apache/ignite/internal/AbstractClusterIntegrationTest.java:
##########
@@ -71,7 +75,8 @@ public abstract class AbstractClusterIntegrationTest extends BaseIgniteAbstractT
             + "        gossipInterval: 10\n"
             + "      },\n"
             + "    }\n"
-            + "  }\n"
+            + "  },"
+            + "  cluster.failoverTimeout: 0\n"

Review Comment:
   Looks out of place, to be honest. Maybe it should be a part of network configuration, let's consider that a possibility



##########
modules/runner/src/integrationTest/java/org/apache/ignite/internal/compute/ItLogicalTopologyTest.java:
##########
@@ -170,6 +175,96 @@ public void onDisappeared(ClusterNode disappearedNode, LogicalTopologySnapshot n
         assertTrue(secondIgniteDisappeared.await(10, TimeUnit.SECONDS), "Did not see second node leaving in time");
     }
 
+    @Test
+    @BootstrapConfigTemplateMethod("templateWithVeryLongDelayToRemoveFromLogicalTopology")

Review Comment:
   Is there a shorter name for "DelayToRemoveFromLogicalTopology"? I bet there is



##########
modules/runner/src/integrationTest/java/org/apache/ignite/internal/BootstrapConfigTemplateMethod.java:
##########
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal;
+
+import java.lang.annotation.ElementType;
+import java.lang.annotation.Retention;
+import java.lang.annotation.RetentionPolicy;
+import java.lang.annotation.Target;
+
+/**
+ * Defines name of a static method defined in a test class that will be invoked to produce a bootstrap config
+ * template to use intead of a default template.
+ */
+@Retention(RetentionPolicy.RUNTIME)
+@Target({ElementType.METHOD})
+public @interface BootstrapConfigTemplateMethod {

Review Comment:
   I wonder if this annotation and the abstract test should be in test fixtures? Just a thought, you don't need to move them



##########
modules/network/src/main/java/org/apache/ignite/internal/network/discovery/DiscoveryTopologyEventListener.java:
##########
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.network.discovery;
+
+import org.apache.ignite.network.ClusterNode;
+
+/**
+ * Interface for handling events related to discovery topology changes. Discovery Topology (DT) is the set of nodes
+ * that were discovered by a discovery mechanism (like a SWIM protocol implementation) and which did not leave (from
+ * the point of view of the discovery mechanism) yet.
+ */
+public interface DiscoveryTopologyEventListener {

Review Comment:
   So, the terminology is all messed up. We have:
   - TopologyEventHandler
   - LogicalTopologyEventListener
   - DiscoveryTopologyEventListener
   and they all share the same set of names for events. This is confusing, and we should probably fix it in the future.
   Anyway, why do we need a third class? What's the difference between this event and "TopologyEventHandler"? Should there be a difference? Too many entities is bad for understanding them, let's try to make it simpler



##########
modules/cluster-management/src/main/java/org/apache/ignite/internal/cluster/management/ClusterManagementGroupManager.java:
##########
@@ -664,6 +679,26 @@ private void sendWithRetry(ClusterNode node, NetworkMessage msg, CompletableFutu
                 });
     }
 
+    @Override
+    public void beforeNodeStop() {
+        CompletableFuture<CmgRaftService> serviceFuture = raftService;
+
+        if (serviceFuture != null) {
+            tryToLeaveLogicalTopologyGracefully(serviceFuture);
+        }
+    }
+
+    private void tryToLeaveLogicalTopologyGracefully(CompletableFuture<CmgRaftService> serviceFuture) {
+        serviceFuture.thenCompose(service -> service.removeFromCluster(Set.of(clusterService.topologyService().localMember())))
+                .orTimeout(GRACEFUL_LEAVE_TIMEOUT_MILLIS, TimeUnit.MILLISECONDS)
+                .exceptionally(ex -> {
+                    LOG.info("Was not able to send a bye-bye in time", ex);
+
+                    return null;
+                })
+                .join();

Review Comment:
   Should we really wait for it?



##########
modules/raft/src/test/java/org/apache/ignite/raft/jraft/rpc/IgniteRpcTest.java:
##########
@@ -107,6 +110,15 @@ public IgniteRpcTest(TestInfo testInfo) {
 
         service.start();
 
+        // Remove a node from Physical Topology as soon as it gets lost by SWIM.
+        InternalTopologyService internalTopologyService = (InternalTopologyService) service.topologyService();

Review Comment:
   This is interesting, why did you have to explicitly add this to a test? Is there a way to have this listener "by default" when starting all necessary components?



##########
modules/runner/src/integrationTest/java/org/apache/ignite/internal/compute/ItLogicalTopologyTest.java:
##########
@@ -170,6 +175,96 @@ public void onDisappeared(ClusterNode disappearedNode, LogicalTopologySnapshot n
         assertTrue(secondIgniteDisappeared.await(10, TimeUnit.SECONDS), "Did not see second node leaving in time");
     }
 
+    @Test
+    @BootstrapConfigTemplateMethod("templateWithVeryLongDelayToRemoveFromLogicalTopology")
+    void nodeDoesNotLeaveLogicalTopologyImmediatelyAfterBeingLostBySwim() throws Exception {
+        IgniteImpl entryNode = node(0);
+
+        startNode(1);
+
+        entryNode.logicalTopologyService().addEventListener(listener);
+
+        // Knock the node out without allowing it to say good bye.
+        cluster.knockOutNode(1, NodeKnockout.PARTITION_NETWORK);
+
+        assertFalse(waitForCondition(() -> !events.isEmpty(), 3_000));

Review Comment:
   Can we somehow make it faster than 3 seconds?
   Please add a comment why this value is chosen, it must depend on other settings, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@ignite.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ignite-3] SammyVimes commented on a diff in pull request #1604: IGNITE-18630 Try to deliver a message until receiver drops out from logical topology

Posted by "SammyVimes (via GitHub)" <gi...@apache.org>.
SammyVimes commented on code in PR #1604:
URL: https://github.com/apache/ignite-3/pull/1604#discussion_r1092922968


##########
modules/api/src/main/java/org/apache/ignite/network/TopologyService.java:
##########
@@ -22,6 +22,14 @@
 
 /**
  * Entry point for obtaining information about a cluster's physical topology.
+ *
+ * <p>Physical topology (PT) is the set of nodes that are thought to be reachable (from the point of view of this node)
+ * for message sending. More precisely, a node appears in the PT when it gets discovered by an underlying
+ * discovery mechanism (like SWIM) and disappears from the PT when it disappears from the Logical Topology (LT).
+ * This means that a node might still be in the PT when it is not actually reachable anymore (i.e. due to a network partition).
+ *
+ * <p>A node that was removed from the PT due to being removed from the LT cannot reappear in the PT with same node ID

Review Comment:
   We should probably decide on the naming (I mean node id vs launch id)



##########
modules/runner/src/integrationTest/java/org/apache/ignite/internal/compute/ItLogicalTopologyTest.java:
##########
@@ -170,6 +175,96 @@ public void onDisappeared(ClusterNode disappearedNode, LogicalTopologySnapshot n
         assertTrue(secondIgniteDisappeared.await(10, TimeUnit.SECONDS), "Did not see second node leaving in time");
     }
 
+    @Test
+    @BootstrapConfigTemplateMethod("templateWithVeryLongDelayToRemoveFromLogicalTopology")

Review Comment:
   I personally think that it's ok for test. The clearer -- the better. 



##########
modules/network/src/integrationTest/java/org/apache/ignite/network/scalecube/ItScaleCubeNetworkMessagingTest.java:
##########
@@ -320,6 +322,39 @@ public void testMessageGroupsHandlers(TestInfo testInfo) throws Exception {
         assertThat(networkMessageFuture, willBe(equalTo(networkMessage)));
     }
 
+    /**
+     * Tests removal from physical topology.
+     *
+     * @param testInfo Test info.
+     * @throws Exception If failed.
+     */
+    @Test
+    public void removalFromPhysicalTopologyRemovesAndFiresEvent(TestInfo testInfo) throws Exception {
+        testCluster = new Cluster(2, testInfo);
+        testCluster.startAwait();
+
+        ClusterService alice = testCluster.members.get(0);
+        ClusterService bob = testCluster.members.get(1);
+        String aliceName = alice.localConfiguration().getName();
+
+        var aliceShutdownLatch = new CountDownLatch(1);

Review Comment:
   It seems like there is now await on this latch



##########
modules/raft/src/test/java/org/apache/ignite/raft/jraft/rpc/IgniteRpcTest.java:
##########
@@ -107,6 +110,15 @@ public IgniteRpcTest(TestInfo testInfo) {
 
         service.start();
 
+        // Remove a node from Physical Topology as soon as it gets lost by SWIM.
+        InternalTopologyService internalTopologyService = (InternalTopologyService) service.topologyService();
+        internalTopologyService.addDiscoveryEventListener(new DiscoveryTopologyEventListener() {
+            @Override
+            public void onDisappeared(ClusterNode member) {
+                internalTopologyService.removeFromPhysicalTopology(member);

Review Comment:
   I don't get it. We set a listener in the internalTopologyService to update internalTopologyService? Why can't internalTopologyService do it by itself?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@ignite.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org