Posted to issues@ignite.apache.org by "Vladimir Steshin (Jira)" <ji...@apache.org> on 2021/03/10 14:46:00 UTC

[jira] [Updated] (IGNITE-14068) Infinite node persistence in the ring while outgoing connections are lost

     [ https://issues.apache.org/jira/browse/IGNITE-14068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vladimir Steshin updated IGNITE-14068:
--------------------------------------
    Description: 
If a node loses its outgoing connections, it can decide that it is alone in the cluster and will not fail. This happens on small clusters, where the failed node is able to try, unsuccessfully, to connect to every other node before connRecoveryTimeout expires.

Consider:
- The cluster ring is n1 -> n2 -> n3 -> n4 -> n1.
- n4 loses all outgoing connections.
- n3 keeps pinging n4 successfully.
- n4 attempts to connect to n1, n2 and n3. Each attempt fails because of the outgoing network failure.
- spi.connRecoveryTimeout is not reached. n4 decides it is alone and continues working.
- n3 still sends messages to n4, so n4 does not lack incoming connections.
- The ring is actually broken because of n4, but n3 cannot detect the failure of n4.

Solution: a node can watch for incoming pings, which mean that an incoming connection exists. If all outgoing connections are lost while the node is still being pinged, the node must leave the grid so that it does not stop the message traffic.
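
The proposed check could be sketched roughly as follows. This is an illustration only, not code from Ignite or from the actual fix: the class RingNodeSketch, the Decision enum, the decide method and the pingFreshnessMillis window are all hypothetical names and parameters introduced here. The sketch assumes the node records the time of the last incoming ping and the outcome of its connection attempt to each other node.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; none of these names exist in Apache Ignite. */
final class RingNodeSketch {
    enum Decision { STAY_IN_RING, CONTINUE_ALONE, LEAVE_GRID }

    /**
     * @param outgoingAttemptSucceeded one flag per other node, true if the connection attempt succeeded
     * @param lastIncomingPingMillis   time of the last incoming ping, or -1 if none was seen
     * @param nowMillis                current time
     * @param pingFreshnessMillis      how long a ping counts as proof of an incoming connection (assumption)
     */
    static Decision decide(List<Boolean> outgoingAttemptSucceeded,
                           long lastIncomingPingMillis,
                           long nowMillis,
                           long pingFreshnessMillis) {
        // At least one outgoing connection works: the node can still pass messages on.
        if (outgoingAttemptSucceeded.contains(Boolean.TRUE))
            return Decision.STAY_IN_RING;

        // All outgoing attempts failed. A recent incoming ping shows that some other
        // node is alive and can reach us, so we are not alone.
        boolean recentlyPinged = lastIncomingPingMillis >= 0
            && nowMillis - lastIncomingPingMillis <= pingFreshnessMillis;

        // Pinged but unable to send: leave the grid instead of blocking the ring.
        return recentlyPinged ? Decision.LEAVE_GRID : Decision.CONTINUE_ALONE;
    }

    public static void main(String[] args) {
        // n4 in the scenario above: attempts to n1, n2 and n3 all fail, n3 pinged 100 ms ago.
        Decision d = decide(Arrays.asList(false, false, false), 9_900L, 10_000L, 1_000L);
        System.out.println(d); // prints LEAVE_GRID
    }
}
```

Without the ping check, the n4 case above would fall through to CONTINUE_ALONE, which is the behaviour this issue describes.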

  was:
If node loses outgoing connections, it can decide it is alone in the cluster and won't fail. Happens on small clusters where failed node is able to unsuccessfully try to connect to all other nodes before connRecoveryTimeout expires.

Consider:
- The cluster n1 -> n2 -> n3 -> n4 -> n1
- n4 loses all outgoing connections.
- n3 keeps successful ping to n4.
- n4 attempts to connect to n1, n2, n3. Fails with each due to outgoing network failure.
- spi.connRecoveryTimeout is not reached. n4 decides it is alone and continues working.
- n3 still sends messages to n4. n4 does not lack incoming connections.
- ring is actually broken because of n4. n3 cannot determine failure of n4.

Solution: node can watch incoming ping, which means there is incoming connection. If all outgoing connections were lost, being pinged node must leave the grid not to stop the message traffic.


> Infinite node persistence in the ring while outgoing connections are lost
> -------------------------------------------------------------------------
>
>                 Key: IGNITE-14068
>                 URL: https://issues.apache.org/jira/browse/IGNITE-14068
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> If a node loses its outgoing connections, it can decide that it is alone in the cluster and will not fail. This happens on small clusters, where the failed node is able to try, unsuccessfully, to connect to every other node before connRecoveryTimeout expires.
> Consider:
> - The cluster ring is n1 -> n2 -> n3 -> n4 -> n1.
> - n4 loses all outgoing connections.
> - n3 keeps pinging n4 successfully.
> - n4 attempts to connect to n1, n2 and n3. Each attempt fails because of the outgoing network failure.
> - spi.connRecoveryTimeout is not reached. n4 decides it is alone and continues working.
> - n3 still sends messages to n4, so n4 does not lack incoming connections.
> - The ring is actually broken because of n4, but n3 cannot detect the failure of n4.
> Solution: a node can watch for incoming pings, which mean that an incoming connection exists. If all outgoing connections are lost while the node is still being pinged, the node must leave the grid so that it does not stop the message traffic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)