Posted to issues@ignite.apache.org by "Vladimir Steshin (Jira)" <ji...@apache.org> on 2020/06/09 09:38:00 UTC

[jira] [Updated] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

     [ https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vladimir Steshin updated IGNITE-13012:
--------------------------------------
    Summary: Fix failure detection timeout. Simplify node ping routine.  (was: Make node connection checking rely on the configuration. Simplify node ping routine.)

> Fix failure detection timeout. Simplify node ping routine.
> ----------------------------------------------------------
>
>                 Key: IGNITE-13012
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13012
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> Node-to-next-node connection checking has several related drawbacks. These drawbacks hindered understanding and catching the problems in IGNITE-13016. We should fix the following:
> 1. Failure detection timeout should take into account the last sent message. The connection check interval should also rely on this time. If we set the timeout on the current message only, we have no guarantee that a connection failure is detected within the failure detection timeout.
> The current ping is bound to its own timestamp:
> {code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
> This is odd because any discovery message checks the connection, and TcpDiscoveryConnectionCheckMessage is just an addition for when the message queue is empty for a long time.
> 2. Make the connection check interval depend on the failure detection timeout (FDT). The current value is a constant:
> {code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
> Let's set it to FDT/4 so that enough of the timeout remains after the last sent message.
> 3. Remove the additional, accelerated connection checking. Once we do fix 1, this will become even more useless.
> Although TCP discovery checks the connection periodically, it may send a ping before this period expires. This premature node ping relies on the time of any sent or even any received message. Imagine: if node 2 receives no message from node 1 within some time, it decides to send an extra ping to node 3 without waiting for the regular ping. Such behavior causes confusion and gives no considerable benefit.
> See {code:java}ServerImpl.RingMessageWorker.failureThresholdReached{code}
> 4. Do not worry the user with “Node disconnected” when everything is OK. Once we do fix 1 and 3, this message will become even more useless.
> The node may log at INFO: “Local node seems to be disconnected from topology …” even though it is not actually disconnected.
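
Fixes 1 and 2 above can be illustrated with a minimal sketch. This is not Ignite code: the class ConnCheckSketch and all of its members are hypothetical names chosen for illustration, and the only things taken from the description are the FDT/4 interval and the idea of measuring from the last sent message of any type.

```java
/**
 * Hypothetical sketch, not part of Ignite: connection-check timing driven by
 * the time of the last sent message of any type, with an interval of FDT/4.
 */
class ConnCheckSketch {
    /** Failure detection timeout (FDT) in milliseconds. */
    private final long failureDetectionTimeoutMs;

    /** Time the last message of any type was sent, in milliseconds. */
    private long lastSentMs;

    ConnCheckSketch(long failureDetectionTimeoutMs, long nowMs) {
        this.failureDetectionTimeoutMs = failureDetectionTimeoutMs;
        this.lastSentMs = nowMs;
    }

    /** Connection check interval derived from the FDT instead of a constant. */
    long connCheckIntervalMs() {
        return failureDetectionTimeoutMs / 4;
    }

    /** Every sent message resets the timer, not only connection-check messages. */
    void onMessageSent(long nowMs) {
        lastSentMs = nowMs;
    }

    /** A dedicated connection-check message is needed only after an idle interval. */
    boolean connCheckNeeded(long nowMs) {
        return nowMs - lastSentMs >= connCheckIntervalMs();
    }

    /** Part of the FDT still left, counted from the last sent message. */
    long remainingTimeoutMs(long nowMs) {
        return Math.max(0L, failureDetectionTimeoutMs - (nowMs - lastSentMs));
    }
}
```

With an assumed FDT of 10000 ms, the interval in this sketch is 2500 ms, so a dedicated check is due only once 2500 ms have passed since the last sent message, and at least three quarters of the timeout remain when it is sent.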



--
This message was sent by Atlassian Jira
(v8.3.4#803005)