Posted to dev@ignite.apache.org by "Vladimir Steshin (Jira)" <ji...@apache.org> on 2020/05/14 22:43:00 UTC

[jira] [Created] (IGNITE-13012) Make node connection checking rely on the configuration. Simplify node ping routine.

Vladimir Steshin created IGNITE-13012:
-----------------------------------------

             Summary: Make node connection checking rely on the configuration. Simplify node ping routine.
                 Key: IGNITE-13012
                 URL: https://issues.apache.org/jira/browse/IGNITE-13012
             Project: Ignite
          Issue Type: Improvement
            Reporter: Vladimir Steshin
            Assignee: Vladimir Steshin



Current node-to-node connection checking has several drawbacks:
1)	The minimal connection checking interval is not bound to the failure detection parameters:
static int ServerImpl.CON_CHECK_INTERVAL = 500;
2)	Connection checking is implemented as periodic sending of a dedicated message (TcpDiscoveryConnectionCheckMessage). It is bound to its own timestamp (ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent), not to the common time of the last sent message. This is odd because any discovery message actually checks the connection, and TcpDiscoveryConnectionCheckMessage is only an addition for when the message queue has been empty for a long time.
3)	The period of node-to-node connection checking can sometimes be shortened for a strange reason: if no message is sent or received within failureDetectionTimeout. Here, although we have a minimal period of connection checking (ServerImpl.CON_CHECK_INTERVAL), we can also send TcpDiscoveryConnectionCheckMessage before this period has elapsed. Moreover, this premature node ping also relies on the time of the last received message. Imagine: if node 2 receives no message from node 1 within some time, it decides to do an extra ping of node 3 without waiting for the regular ping interval. Such behavior causes confusion and gives no additional guarantees.
4)	If #3 happens, the node writes to the log at INFO level: “Local node seems to be disconnected from topology …” whereas it is not actually disconnected. A user can see this message if failureDetectionTimeout is set below 500 ms. I wouldn’t like seeing an INFO message in a log saying that a node might be disconnected. It sounds like there is trouble in the network, not like everything is OK.
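
The timing logic described in points 1-3 can be modelled roughly as follows. This is a simplified sketch, not the actual ServerImpl code: the class CurrentCheckModel, the method shouldSendCheck and the single lastSentOrReceived timestamp are hypothetical. Only CON_CHECK_INTERVAL, lastTimeConnCheckMsgSent and failureDetectionTimeout come from the description above.

```java
/** Simplified model of the behaviour described above; NOT the real Ignite code. */
final class CurrentCheckModel {
    /** Fixed interval from point 1, in milliseconds. */
    static final long CON_CHECK_INTERVAL = 500;

    /**
     * Hypothetical helper: decides whether a connection check message
     * would be sent at time {@code now} (all values in milliseconds).
     */
    static boolean shouldSendCheck(long now, long lastTimeConnCheckMsgSent,
        long lastSentOrReceived, long failureDetectionTimeout) {
        // Regular check (point 2): bound to the dedicated timestamp of the
        // last check message, ignoring other discovery messages sent since.
        boolean regular = now - lastTimeConnCheckMsgSent >= CON_CHECK_INTERVAL;

        // Premature check (point 3): nothing sent or received within the timeout.
        boolean premature = now - lastSentOrReceived >= failureDetectionTimeout;

        return regular || premature;
    }
}
```

Under this model, with failureDetectionTimeout = 300 and no traffic since time 0, shouldSendCheck(300, 0, 0, 300) is true although only 300 ms of the 500 ms interval have passed. That is the premature ping from point 3.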

Suggestions:
1)	Base the connection check interval on failureDetectionTimeout or similar parameters.
2)	Make the connection check interval rely on the common time of the last sent message, not on a dedicated timestamp.
3)	Remove the additional, random, premature connection checking.
4)	Do not worry the user with “Node disconnected” when everything is OK.
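
One possible shape of suggestions 1-3, as a minimal sketch only. The class ProposedCheckModel, its methods, and the divisor CHECKS_PER_TIMEOUT = 4 are assumptions made for illustration; the issue does not fix a concrete formula for the interval.

```java
/** Illustrative sketch of the suggestions; names and ratio are assumptions. */
final class ProposedCheckModel {
    /** Hypothetical divisor: check intervals per failure detection timeout. */
    static final long CHECKS_PER_TIMEOUT = 4;

    /** Suggestion 1: derive the interval from failureDetectionTimeout (ms). */
    static long checkInterval(long failureDetectionTimeout) {
        // Never return 0, so that a tiny timeout does not cause a busy loop.
        return Math.max(1, failureDetectionTimeout / CHECKS_PER_TIMEOUT);
    }

    /**
     * Suggestions 2 and 3: a single condition based on the time of the last
     * sent message of any kind, with no extra premature check.
     */
    static boolean shouldSendCheck(long now, long lastSentMsgTime,
        long failureDetectionTimeout) {
        return now - lastSentMsgTime >= checkInterval(failureDetectionTimeout);
    }
}
```

With this sketch, any discovery message that updates lastSentMsgTime postpones the next dedicated check, so the check message is sent only when the node has been silent for a full interval.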



--
This message was sent by Atlassian Jira
(v8.3.4#803005)