Posted to issues@ignite.apache.org by "Vladimir Steshin (Jira)" <ji...@apache.org> on 2020/05/27 11:08:00 UTC

[jira] [Updated] (IGNITE-13014) Remove double checking of node availability.

     [ https://issues.apache.org/jira/browse/IGNITE-13014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vladimir Steshin updated IGNITE-13014:
--------------------------------------
    Description: 
Proposal:
Do not check a failed node a second time. Double-checking prolongs node failure detection and gives no additional benefit. The routine is also messy and relies on hardcoded values.

Currently, node availability is checked twice. Suppose node 2 stops responding. Node 1 becomes unable to ping node 2 and asks node 3 to establish a permanent connection in place of node 2. Node 3 may or may not check node 2 as well.

Node failure detection can take as long as ServerImpl.CON_CHECK_INTERVAL + 2 * IgniteConfiguration.failureDetectionTimeout + 300ms. See ‘WostCase.txt’.
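
To make the bound concrete, the sketch below plugs sample values into the formula. The 500ms check interval and 10000ms failure detection timeout are assumptions chosen for illustration and should be checked against the Ignite version in use. The class and method names are hypothetical and are not part of ServerImpl.

```java
/** Worked arithmetic for the worst-case bound quoted in the description. */
final class WorstCaseDetection {
    /** Fixed term from the formula in the description, in milliseconds. */
    static final long FIXED_EXTRA_MS = 300;

    /** Computes CON_CHECK_INTERVAL + 2 * failureDetectionTimeout + 300ms. */
    static long worstCaseMillis(long conCheckIntervalMs, long failureDetectionTimeoutMs) {
        return conCheckIntervalMs + 2 * failureDetectionTimeoutMs + FIXED_EXTRA_MS;
    }

    public static void main(String[] args) {
        // Assumed values, not taken from the issue text.
        long conCheckIntervalMs = 500;
        long failureDetectionTimeoutMs = 10_000;

        // 500 + 2 * 10000 + 300 = 20800
        System.out.println(worstCaseMillis(conCheckIntervalMs, failureDetectionTimeoutMs) + " ms");
    }
}
```

Under these assumed values, detection could take roughly twice the configured failureDetectionTimeout.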



  was:
Currently, node availability is checked twice. This prolongs node failure detection and gives no additional benefit. The routine is also messy and relies on hardcoded values.

Suppose node 2 stops responding. Node 1 becomes unable to ping node 2 and asks node 3 to establish a permanent connection in place of node 2. Node 3 may or may not check node 2 as well.

Proposal:
Do not check a failed node a second time. Keep failure detection within the expected timeout (IgniteConfiguration.failureDetectionTimeout).


Drawbacks:

1)	Node failure detection can take as long as ServerImpl.CON_CHECK_INTERVAL + 2 * IgniteConfiguration.failureDetectionTimeout + 300ms. See ‘WostCase.txt’.

2)	An unexpected, non-configurable decision to check the availability of the previous node, based on ‘2 * ServerImpl.CON_CHECK_INTERVAL‘:
{code:java}
// We got message from previous in less than double connection check interval.
boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; 
{code}

If ‘ok == true’, node 3 checks node 2.
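
The decision in item 2 can be sketched in isolation as follows. This is a minimal illustration, assuming a 500ms check interval; the class and method names are hypothetical, and only the comparison itself comes from the snippet quoted in item 2.

```java
/** Isolated sketch of the "recent message from previous node" decision. */
final class PreviousNodeCheck {
    /**
     * Returns true if the last message from the previous node arrived within
     * double the connection check interval, which is the case in which
     * node 3 goes on to check node 2.
     */
    static boolean shouldCheckPrevious(long rcvdTime, long now, long conCheckIntervalMs) {
        return rcvdTime + conCheckIntervalMs * 2 >= now;
    }

    public static void main(String[] args) {
        long interval = 500; // Assumed value for illustration.
        long now = 10_000;

        // Last message 900ms ago: inside the 1000ms window, so node 2 is checked.
        System.out.println(shouldCheckPrevious(now - 900, now, interval));

        // Last message 1500ms ago: outside the window, so node 2 is not checked.
        System.out.println(shouldCheckPrevious(now - 1_500, now, interval));
    }
}
```

Because the window is derived from a constant, the outcome cannot be tuned through configuration, which is the point item 2 makes.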

3)	Several non-configurable hardcoded delays:
Node 3 checks node 2 with a hardcoded timeout of 100ms. See ServerImpl.isConnectionRefused():
{code:java}
sock.connect(addr, 100);
{code}

Node 1 marks node 2 as alive again after a hardcoded 200ms delay. See ServerImpl.CrossRingMessageSendState.markLastFailedNodeAlive():
{code:java}
try {
   Thread.sleep(200);
}
catch (InterruptedException e) {
   Thread.currentThread().interrupt();
}
{code}

4) Checking the availability of the previous node treats any exception other than ConnectException (connection refused) as an existing connection, even a timeout. See ServerImpl.isConnectionRefused():

{code:java}
try (Socket sock = new Socket()) {
   sock.connect(addr, 100);
}
catch (ConnectException e) {
   return true;
}
catch (IOException e) {
   return false; // Considered OK.
}
{code}
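
To illustrate item 4, the sketch below separates the outcomes that the quoted check folds into a single boolean. It is a hypothetical illustration, not the Ignite implementation and not a proposed patch: the class, enum and method names are assumptions. It relies only on standard java.net behaviour, where Socket.connect(addr, timeout) throws SocketTimeoutException when the timeout expires and ConnectException typically signals a refused connection.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical probe that reports a timeout separately from a live connection. */
final class ProbeSketch {
    enum ProbeResult { CONNECTED, REFUSED, TIMED_OUT, OTHER_ERROR }

    /** Maps a connect failure to a distinct outcome instead of a boolean. */
    static ProbeResult classify(IOException e) {
        if (e instanceof ConnectException)
            return ProbeResult.REFUSED;

        if (e instanceof SocketTimeoutException)
            return ProbeResult.TIMED_OUT;

        return ProbeResult.OTHER_ERROR;
    }

    /** Tries to connect within timeoutMs and reports what happened. */
    static ProbeResult probe(InetSocketAddress addr, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);

            return ProbeResult.CONNECTED;
        }
        catch (IOException e) {
            return classify(e);
        }
    }
}
```

With the check quoted in item 4, both TIMED_OUT and OTHER_ERROR would be reported as "not refused", so the node is treated as still reachable.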


        Summary: Remove double checking of node availability.   (was: Remove double checking of node availability. Fix hardcoded values.)

> Remove double checking of node availability. 
> ---------------------------------------------
>
>                 Key: IGNITE-13014
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13014
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>         Attachments: WostCase.txt
>
>
> Proposal:
> Do not check a failed node a second time. Double-checking prolongs node failure detection and gives no additional benefit. The routine is also messy and relies on hardcoded values.
> Currently, node availability is checked twice. Suppose node 2 stops responding. Node 1 becomes unable to ping node 2 and asks node 3 to establish a permanent connection in place of node 2. Node 3 may or may not check node 2 as well.
> Node failure detection can take as long as ServerImpl.CON_CHECK_INTERVAL + 2 * IgniteConfiguration.failureDetectionTimeout + 300ms. See ‘WostCase.txt’.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)