You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@ignite.apache.org by GitBox <gi...@apache.org> on 2020/05/22 10:08:33 UTC

[GitHub] [ignite] Vladsz83 opened a new pull request #7835: IGNITE-13021 Make node connection checking rely on the configuration. Simplify node ping routine.

Vladsz83 opened a new pull request #7835:
URL: https://github.com/apache/ignite/pull/7835


   Do not review yet. Testing.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on a change in pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on a change in pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#discussion_r433220223



##########
File path: modules/core/src/test/java/org/apache/ignite/internal/GridFailFastNodeFailureDetectionSelfTest.java
##########
@@ -107,7 +107,7 @@ public void testFailFast() throws Exception {
 
         failNode(ignite1);
 
-        assert failLatch.await(1500, MILLISECONDS);
+        assert failLatch.await(ignite1.configuration().getFailureDetectionTimeout() + 50, MILLISECONDS);

Review comment:
       We always can be hit by GC pause, timer granulation, delays in other code. Not giving additional time to wait can lead to flaky tests. Makes sense?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] rkondakov commented on pull request #7835: IGNITE-13021 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
rkondakov commented on pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#issuecomment-632623736


   Please check the correctness of the jira issue number in the PR heading. It looks like it is incorrect. [IGNITE-13021](https://issues.apache.org/jira/browse/IGNITE-13021) is about the new SQL engine, not about the connectivity.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] anton-vinogradov commented on a change in pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
anton-vinogradov commented on a change in pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#discussion_r433121095



##########
File path: modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/ServerImpl.java
##########
@@ -1916,6 +1924,12 @@ private void clearNodeAddedMessage(TcpDiscoveryAbstractMessage msg) {
         return threads;
     }
 
+    /** @return Total timeout on complete message exchange in network over established connection. */
+    protected long effectiveExchangeTimeout() {

Review comment:
       Do we need a method with a single usage?
   Will it be more suitable to inline it?

##########
File path: modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/ServerImpl.java
##########
@@ -3469,6 +3456,8 @@ else if (log.isTraceEnabled())
                                         }
                                     }
 
+                                    updateLastSentMessageTime();

Review comment:
       what the reason to set sent time here?

##########
File path: modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/ServerImpl.java
##########
@@ -3574,6 +3563,8 @@ else if (!spi.failureDetectionTimeoutEnabled() && (e instanceof
 
                                     int res = spi.readReceipt(sock, timeoutHelper.nextTimeoutChunk(ackTimeout0));
 
+                                    updateLastSentMessageTime();

Review comment:
       what the reason to set sent time here?

##########
File path: modules/core/src/test/java/org/apache/ignite/spi/discovery/tcp/TcpDiscoverySelfTest.java
##########
@@ -1976,7 +1976,8 @@ private void checkFailedCoordinatorNode(SegmentationPolicy segPlc) throws Except
 
             ignite1.configuration().getDiscoverySpi().failNode(coordId, null);
 
-            assertTrue(failedLatch.await(2000, MILLISECONDS));
+            // Wait for the configured timeout + other possible code delays.
+            assertTrue(failedLatch.await(ignite1.configuration().getFailureDetectionTimeout() + 50, MILLISECONDS));

Review comment:
       magic number?

##########
File path: modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/TcpDiscoveryImpl.java
##########
@@ -303,7 +303,7 @@ public int boundPort() throws IgniteSpiException {
     /**
      * @return connection check interval.
      */
-    public long connectionCheckInterval() {
+    long connectionCheckInterval() {

Review comment:
       why changed?

##########
File path: modules/core/src/test/java/org/apache/ignite/internal/GridFailFastNodeFailureDetectionSelfTest.java
##########
@@ -107,7 +107,7 @@ public void testFailFast() throws Exception {
 
         failNode(ignite1);
 
-        assert failLatch.await(1500, MILLISECONDS);
+        assert failLatch.await(ignite1.configuration().getFailureDetectionTimeout() + 50, MILLISECONDS);

Review comment:
       magic number?

##########
File path: modules/core/src/test/java/org/apache/ignite/spi/discovery/tcp/ConnectionCheckTest.java
##########
@@ -0,0 +1,310 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.spi.discovery.tcp;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.net.Socket;
+import java.util.concurrent.Exchanger;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.concurrent.atomic.AtomicReference;
+import org.apache.ignite.IgniteCheckedException;
+import org.apache.ignite.configuration.IgniteConfiguration;
+import org.apache.ignite.internal.util.GridTestClockTimer;
+import org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryAbstractMessage;
+import org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryConnectionCheckMessage;
+import org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryNodeFailedMessage;
+import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;
+import org.junit.Test;
+
+import static com.sun.tools.javac.util.Assert.check;
+
+/**
+ * Checks pinging next node in the ring relies on configured timeouts.
+ */
+public class ConnectionCheckTest extends GridCommonAbstractTest {
+    /** Number of the ping messages to ensure node pinging works well. */
+    private static final int PING_MESSAGES_CNT_TO_ENSURE = 15;
+
+    /** Timer granulation in milliseconds. See {@link GridTestClockTimer}. */
+    private static final int TIMER_GRANULATION = 10;
+
+    /**
+     * Maximal additional delay before sending the ping message including timer granulation in and other delays
+     * like code delays and/or GC.
+     */
+    private static final int ACCEPTABLE_ADDITIONAL_DELAY = TIMER_GRANULATION + 50;

Review comment:
       magic number

##########
File path: modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/ServerImpl.java
##########
@@ -3611,15 +3602,26 @@ else if (!spi.failureDetectionTimeoutEnabled() && (e instanceof
                                 if (latencyCheck && log.isInfoEnabled())
                                     log.info("Latency check message has been written to socket: " + msg.id());
 
+                                boolean ping = msg instanceof TcpDiscoveryConnectionCheckMessage;
+
+                                long timeout = timeoutHelper.nextTimeoutChunk(spi.getSocketTimeout());
+
+                                // For the ping we take half of actual failure detection. Another half is the interval.
                                 spi.writeToSocket(newNextNode ? newNext : next,
                                     sock,
                                     out,
                                     msg,
-                                    timeoutHelper.nextTimeoutChunk(spi.getSocketTimeout()));
+                                    ping && spi.failureDetectionTimeoutEnabled() ? timeout / 2 : timeout
+                                );
+
+                                timeout = timeoutHelper.nextTimeoutChunk(ackTimeout0);
 
                                 long tsNanos0 = System.nanoTime();
 
-                                int res = spi.readReceipt(sock, timeoutHelper.nextTimeoutChunk(ackTimeout0));
+                                int res = spi.readReceipt(sock, ping && spi.failureDetectionTimeoutEnabled() ?
+                                    timeout / 2 : timeout);
+
+                                updateLastSentMessageTime();

Review comment:
       1) We should wait FDT - (cutTime - lastMesssentTime) here, not full FDT.
   
   2) This code will fail if we wait for 0.8 of FDT in the first case and 0.1 in the second, but everything will be fine in general.
   It seems you should wait for the rest of the timeout at the first case, but for the rest of the rest it at second.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on a change in pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on a change in pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#discussion_r433228158



##########
File path: modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/ServerImpl.java
##########
@@ -3469,6 +3456,8 @@ else if (log.isTraceEnabled())
                                         }
                                     }
 
+                                    updateLastSentMessageTime();

Review comment:
       We’ve just initialized successful connection to next node. We just sent TcpDiscoveryHandshakeRequest message. This means: this connection is definitely ok and should not be checked right now. Also, we just sent a message. We rely on the time of last message sent and we fixate it here. Looks correct to me.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on a change in pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on a change in pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#discussion_r433225213



##########
File path: modules/core/src/test/java/org/apache/ignite/spi/discovery/tcp/TcpDiscoverySelfTest.java
##########
@@ -1976,7 +1976,8 @@ private void checkFailedCoordinatorNode(SegmentationPolicy segPlc) throws Except
 
             ignite1.configuration().getDiscoverySpi().failNode(coordId, null);
 
-            assertTrue(failedLatch.await(2000, MILLISECONDS));
+            // Wait for the configured timeout + other possible code delays.
+            assertTrue(failedLatch.await(ignite1.configuration().getFailureDetectionTimeout() + 50, MILLISECONDS));

Review comment:
       Same here. 
   
   There is a comment: "Wait for the configured timeout + other possible code delays."
   
   We always can be hit by GC pause, timer granulation, delays in other code. Not giving additional time to wait can lead to flaky tests. Makes sense?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on a change in pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on a change in pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#discussion_r433247975



##########
File path: modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/ServerImpl.java
##########
@@ -3611,15 +3602,26 @@ else if (!spi.failureDetectionTimeoutEnabled() && (e instanceof
                                 if (latencyCheck && log.isInfoEnabled())
                                     log.info("Latency check message has been written to socket: " + msg.id());
 
+                                boolean ping = msg instanceof TcpDiscoveryConnectionCheckMessage;
+
+                                long timeout = timeoutHelper.nextTimeoutChunk(spi.getSocketTimeout());
+
+                                // For the ping we take half of actual failure detection. Another half is the interval.
                                 spi.writeToSocket(newNextNode ? newNext : next,
                                     sock,
                                     out,
                                     msg,
-                                    timeoutHelper.nextTimeoutChunk(spi.getSocketTimeout()));
+                                    ping && spi.failureDetectionTimeoutEnabled() ? timeout / 2 : timeout
+                                );
+
+                                timeout = timeoutHelper.nextTimeoutChunk(ackTimeout0);
 
                                 long tsNanos0 = System.nanoTime();
 
-                                int res = spi.readReceipt(sock, timeoutHelper.nextTimeoutChunk(ackTimeout0));
+                                int res = spi.readReceipt(sock, ping && spi.failureDetectionTimeoutEnabled() ?
+                                    timeout / 2 : timeout);
+
+                                updateLastSentMessageTime();

Review comment:
       1. This would require us to check connection often. Like `connCheckInterval = spi.failureDetectionTimeout() / 4.`. Otherwise we can leave too short timeout to any message if it is being sent just before next ping: quite long time passed since the last message. Is it OK?
   
   2. Didn't catch you. We do wait for rest of the rest in second case:
   
   `timeout = timeoutHelper.nextTimeoutChunk(ackTimeout0);` - this is rest of the rest. Makes sense?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on a change in pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on a change in pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#discussion_r444171500



##########
File path: modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/TcpDiscoveryImpl.java
##########
@@ -303,7 +303,7 @@ public int boundPort() throws IgniteSpiException {
     /**
      * @return connection check interval.
      */
-    public long connectionCheckInterval() {
+    long connectionCheckInterval() {

Review comment:
       > We shouldn't reduce visibility of methods in public API. It is highly unlikely of course that there are users calling this method directly but concealing of method is definitely a breaking change. We could target it to 3.0 version but in minor releases it is not acceptable.
   
   Fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on a change in pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on a change in pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#discussion_r444189002



##########
File path: modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/ServerImpl.java
##########
@@ -6192,40 +6188,21 @@ private void checkMetricsReceiving() {
         }
 
         /**
-         * Check connection aliveness status.
+         * Check connection to next node in the ring.
          */
         private void checkConnection() {
             Boolean hasRemoteSrvNodes = null;
 
-            if (spi.failureDetectionTimeoutEnabled() && !failureThresholdReached &&
-                U.millisSinceNanos(locNode.lastExchangeTimeNanos()) >= connCheckThreshold &&
-                spiStateCopy() == CONNECTED &&
-                (hasRemoteSrvNodes = ring.hasRemoteServerNodes())) {
-
-                if (log.isInfoEnabled())
-                    log.info("Local node seems to be disconnected from topology (failure detection timeout " +
-                        "is reached) [failureDetectionTimeout=" + spi.failureDetectionTimeout() +
-                        ", connCheckInterval=" + CON_CHECK_INTERVAL + ']');
-
-                failureThresholdReached = true;
-
-                // Reset sent time deliberately to force sending connection check message.
-                lastTimeConnCheckMsgSent = 0;
-            }
-
-            long elapsed = (lastTimeConnCheckMsgSent + CON_CHECK_INTERVAL) - U.currentTimeMillis();
+            long elapsed = (lastRingMsgSentTime + U.millisToNanos(connCheckInterval)) - System.nanoTime();
 
             if (elapsed > 0)
                 return;
 
             if (hasRemoteSrvNodes == null)
                 hasRemoteSrvNodes = ring.hasRemoteServerNodes();
 
-            if (hasRemoteSrvNodes) {
+            if (hasRemoteSrvNodes)

Review comment:
       > Why not to call `updateLastSentMessageTime` method here as well?
   
   We hasn't successfully sent message here, we hasn't received RES_OK. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] sergey-chugunov-1985 commented on a change in pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
sergey-chugunov-1985 commented on a change in pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#discussion_r444162526



##########
File path: modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/TcpDiscoveryImpl.java
##########
@@ -303,7 +303,7 @@ public int boundPort() throws IgniteSpiException {
     /**
      * @return connection check interval.
      */
-    public long connectionCheckInterval() {
+    long connectionCheckInterval() {

Review comment:
       We shouldn't reduce visibility of methods in public API. It is highly unlikely of course that there are users calling this method directly but concealing of method is definitely a breaking change. We could target it to 3.0 version but in minor releases it is not acceptable.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on a change in pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on a change in pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#discussion_r433220223



##########
File path: modules/core/src/test/java/org/apache/ignite/internal/GridFailFastNodeFailureDetectionSelfTest.java
##########
@@ -107,7 +107,7 @@ public void testFailFast() throws Exception {
 
         failNode(ignite1);
 
-        assert failLatch.await(1500, MILLISECONDS);
+        assert failLatch.await(ignite1.configuration().getFailureDetectionTimeout() + 50, MILLISECONDS);

Review comment:
       Removed.
   
   There is a comment:  // Wait for the configured timeout + other possible code delays.
   
   Isn't it enough?
   
   We always can be hit by GC pause, timer granulation, delays in other code. Not giving additional time to wait can lead to flaky tests.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#issuecomment-632632637


   > Please check the correctness of the jira issue number in the PR heading. It looks like it is incorrect. [IGNITE-13021](https://issues.apache.org/jira/browse/IGNITE-13021) is about the new SQL engine, not about the connectivity.
   
   Fixed on IGNITE-13012. Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#issuecomment-642014375


   @sergey-chugunov-1985 , I find you have good experience in TcpDiscoverySpi. Could you take a look at this ticket too.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 edited a comment on pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
Vladsz83 edited a comment on pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#issuecomment-642014375


   @sergey-chugunov-1985 , I find you have good experience in TcpDiscoverySpi. Could you take a look at this ticket too?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on a change in pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on a change in pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#discussion_r433219526



##########
File path: modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/ServerImpl.java
##########
@@ -1916,6 +1924,12 @@ private void clearNodeAddedMessage(TcpDiscoveryAbstractMessage msg) {
         return threads;
     }
 
+    /** @return Total timeout on complete message exchange in network over established connection. */
+    protected long effectiveExchangeTimeout() {

Review comment:
       Removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on a change in pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on a change in pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#discussion_r444189002



##########
File path: modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/ServerImpl.java
##########
@@ -6192,40 +6188,21 @@ private void checkMetricsReceiving() {
         }
 
         /**
-         * Check connection aliveness status.
+         * Check connection to next node in the ring.
          */
         private void checkConnection() {
             Boolean hasRemoteSrvNodes = null;
 
-            if (spi.failureDetectionTimeoutEnabled() && !failureThresholdReached &&
-                U.millisSinceNanos(locNode.lastExchangeTimeNanos()) >= connCheckThreshold &&
-                spiStateCopy() == CONNECTED &&
-                (hasRemoteSrvNodes = ring.hasRemoteServerNodes())) {
-
-                if (log.isInfoEnabled())
-                    log.info("Local node seems to be disconnected from topology (failure detection timeout " +
-                        "is reached) [failureDetectionTimeout=" + spi.failureDetectionTimeout() +
-                        ", connCheckInterval=" + CON_CHECK_INTERVAL + ']');
-
-                failureThresholdReached = true;
-
-                // Reset sent time deliberately to force sending connection check message.
-                lastTimeConnCheckMsgSent = 0;
-            }
-
-            long elapsed = (lastTimeConnCheckMsgSent + CON_CHECK_INTERVAL) - U.currentTimeMillis();
+            long elapsed = (lastRingMsgSentTime + U.millisToNanos(connCheckInterval)) - System.nanoTime();
 
             if (elapsed > 0)
                 return;
 
             if (hasRemoteSrvNodes == null)
                 hasRemoteSrvNodes = ring.hasRemoteServerNodes();
 
-            if (hasRemoteSrvNodes) {
+            if (hasRemoteSrvNodes)

Review comment:
       > Why not to call `updateLastSentMessageTime` method here as well?
   
   We hasn't successfully sent message here, we hasn't received RES_OK. We just put it in the queue. So, we cannot say here the connection is ok.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on a change in pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on a change in pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#discussion_r433229555



##########
File path: modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/ServerImpl.java
##########
@@ -3574,6 +3563,8 @@ else if (!spi.failureDetectionTimeoutEnabled() && (e instanceof
 
                                     int res = spi.readReceipt(sock, timeoutHelper.nextTimeoutChunk(ackTimeout0));
 
+                                    updateLastSentMessageTime();

Review comment:
       We just sent one or several pending messages. Every message can take time, also every message checks connection. We rely on the time of last sent message. We should fixate this time when do really send message. We do it in this part of code too.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on a change in pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on a change in pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#discussion_r444192325



##########
File path: modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/ServerImpl.java
##########
@@ -6192,40 +6188,21 @@ private void checkMetricsReceiving() {
         }
 
         /**
-         * Check connection aliveness status.
+         * Check connection to next node in the ring.
          */
         private void checkConnection() {
             Boolean hasRemoteSrvNodes = null;
 
-            if (spi.failureDetectionTimeoutEnabled() && !failureThresholdReached &&
-                U.millisSinceNanos(locNode.lastExchangeTimeNanos()) >= connCheckThreshold &&
-                spiStateCopy() == CONNECTED &&
-                (hasRemoteSrvNodes = ring.hasRemoteServerNodes())) {
-
-                if (log.isInfoEnabled())
-                    log.info("Local node seems to be disconnected from topology (failure detection timeout " +
-                        "is reached) [failureDetectionTimeout=" + spi.failureDetectionTimeout() +
-                        ", connCheckInterval=" + CON_CHECK_INTERVAL + ']');
-
-                failureThresholdReached = true;
-
-                // Reset sent time deliberately to force sending connection check message.
-                lastTimeConnCheckMsgSent = 0;
-            }
-
-            long elapsed = (lastTimeConnCheckMsgSent + CON_CHECK_INTERVAL) - U.currentTimeMillis();
+            long elapsed = (lastRingMsgSentTime + U.millisToNanos(connCheckInterval)) - System.nanoTime();
 
             if (elapsed > 0)
                 return;
 
             if (hasRemoteSrvNodes == null)
                 hasRemoteSrvNodes = ring.hasRemoteServerNodes();
 
-            if (hasRemoteSrvNodes) {
+            if (hasRemoteSrvNodes)

Review comment:
       As you can see, we call updateLastSentMessageTime() after successful reading spi.readReceipt or proper TcpDiscoveryHandshakeResponse. These are the places where we are sure the message was sent and connection is OK.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] anton-vinogradov merged pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
anton-vinogradov merged pull request #7835:
URL: https://github.com/apache/ignite/pull/7835


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on a change in pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on a change in pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#discussion_r433224852



##########
File path: modules/core/src/test/java/org/apache/ignite/spi/discovery/tcp/ConnectionCheckTest.java
##########
@@ -0,0 +1,310 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.spi.discovery.tcp;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.net.Socket;
+import java.util.concurrent.Exchanger;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.concurrent.atomic.AtomicReference;
+import org.apache.ignite.IgniteCheckedException;
+import org.apache.ignite.configuration.IgniteConfiguration;
+import org.apache.ignite.internal.util.GridTestClockTimer;
+import org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryAbstractMessage;
+import org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryConnectionCheckMessage;
+import org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryNodeFailedMessage;
+import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;
+import org.junit.Test;
+
+import static com.sun.tools.javac.util.Assert.check;
+
+/**
+ * Checks pinging next node in the ring relies on configured timeouts.
+ */
+public class ConnectionCheckTest extends GridCommonAbstractTest {
+    /** Number of the ping messages to ensure node pinging works well. */
+    private static final int PING_MESSAGES_CNT_TO_ENSURE = 15;
+
+    /** Timer granulation in milliseconds. See {@link GridTestClockTimer}. */
+    private static final int TIMER_GRANULATION = 10;
+
+    /**
+     * Maximal additional delay before sending the ping message including timer granulation in and other delays
+     * like code delays and/or GC.
+     */
+    private static final int ACCEPTABLE_ADDITIONAL_DELAY = TIMER_GRANULATION + 50;

Review comment:
       > magic number
   
   There is a comment: "Maximal additional delay before sending the ping message including timer granulation in and other delays like code delays and/or GC"
   
   We always can be hit by GC pause, timer granulation, delays in other code. Not giving additional time to wait can lead to flaky tests. Makes sense?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on a change in pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on a change in pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#discussion_r433247975



##########
File path: modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/ServerImpl.java
##########
@@ -3611,15 +3602,26 @@ else if (!spi.failureDetectionTimeoutEnabled() && (e instanceof
                                 if (latencyCheck && log.isInfoEnabled())
                                     log.info("Latency check message has been written to socket: " + msg.id());
 
+                                boolean ping = msg instanceof TcpDiscoveryConnectionCheckMessage;
+
+                                long timeout = timeoutHelper.nextTimeoutChunk(spi.getSocketTimeout());
+
+                                // For the ping we take half of actual failure detection. Another half is the interval.
                                 spi.writeToSocket(newNextNode ? newNext : next,
                                     sock,
                                     out,
                                     msg,
-                                    timeoutHelper.nextTimeoutChunk(spi.getSocketTimeout()));
+                                    ping && spi.failureDetectionTimeoutEnabled() ? timeout / 2 : timeout
+                                );
+
+                                timeout = timeoutHelper.nextTimeoutChunk(ackTimeout0);
 
                                 long tsNanos0 = System.nanoTime();
 
-                                int res = spi.readReceipt(sock, timeoutHelper.nextTimeoutChunk(ackTimeout0));
+                                int res = spi.readReceipt(sock, ping && spi.failureDetectionTimeoutEnabled() ?
+                                    timeout / 2 : timeout);
+
+                                updateLastSentMessageTime();

Review comment:
       1. This would require us to check connection often. Like `connCheckInterval = spi.failureDetectionTimeout() / 4.`. Is it OK?
   
   2. Didn't catch you. We do wait for rest of the rest in second case:
   
   `timeout = timeoutHelper.nextTimeoutChunk(ackTimeout0);` - this is rest of the rest. Makes sense?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] Vladsz83 commented on a change in pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on a change in pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#discussion_r433231148



##########
File path: modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/TcpDiscoveryImpl.java
##########
@@ -303,7 +303,7 @@ public int boundPort() throws IgniteSpiException {
     /**
      * @return connection check interval.
      */
-    public long connectionCheckInterval() {
+    long connectionCheckInterval() {

Review comment:
       To increase encapsulation once we touched it. Internal usages inly, not an interface.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] sergey-chugunov-1985 commented on a change in pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
sergey-chugunov-1985 commented on a change in pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#discussion_r444162526



##########
File path: modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/TcpDiscoveryImpl.java
##########
@@ -303,7 +303,7 @@ public int boundPort() throws IgniteSpiException {
     /**
      * @return connection check interval.
      */
-    public long connectionCheckInterval() {
+    long connectionCheckInterval() {

Review comment:
       We shouldn't reduce visibility of methods in public API. It is highly unlikely of course that there are users calling this method directly but this concealing of method is definitely a breaking change, it could be targeted only to major releases not a minor ones.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ignite] sergey-chugunov-1985 commented on a change in pull request #7835: IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine.

Posted by GitBox <gi...@apache.org>.
sergey-chugunov-1985 commented on a change in pull request #7835:
URL: https://github.com/apache/ignite/pull/7835#discussion_r444183900



##########
File path: modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/ServerImpl.java
##########
@@ -6192,40 +6188,21 @@ private void checkMetricsReceiving() {
         }
 
         /**
-         * Check connection aliveness status.
+         * Check connection to next node in the ring.
          */
         private void checkConnection() {
             Boolean hasRemoteSrvNodes = null;
 
-            if (spi.failureDetectionTimeoutEnabled() && !failureThresholdReached &&
-                U.millisSinceNanos(locNode.lastExchangeTimeNanos()) >= connCheckThreshold &&
-                spiStateCopy() == CONNECTED &&
-                (hasRemoteSrvNodes = ring.hasRemoteServerNodes())) {
-
-                if (log.isInfoEnabled())
-                    log.info("Local node seems to be disconnected from topology (failure detection timeout " +
-                        "is reached) [failureDetectionTimeout=" + spi.failureDetectionTimeout() +
-                        ", connCheckInterval=" + CON_CHECK_INTERVAL + ']');
-
-                failureThresholdReached = true;
-
-                // Reset sent time deliberately to force sending connection check message.
-                lastTimeConnCheckMsgSent = 0;
-            }
-
-            long elapsed = (lastTimeConnCheckMsgSent + CON_CHECK_INTERVAL) - U.currentTimeMillis();
+            long elapsed = (lastRingMsgSentTime + U.millisToNanos(connCheckInterval)) - System.nanoTime();
 
             if (elapsed > 0)
                 return;
 
             if (hasRemoteSrvNodes == null)
                 hasRemoteSrvNodes = ring.hasRemoteServerNodes();
 
-            if (hasRemoteSrvNodes) {
+            if (hasRemoteSrvNodes)

Review comment:
       Why not to call `updateLastSentMessageTime` method here as well?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org