Posted to issues@ignite.apache.org by "Roman Puchkovskiy (Jira)" <ji...@apache.org> on 2023/04/30 10:29:00 UTC

[jira] [Commented] (IGNITE-18712) Do not allow a node excluded from Physical Topology to enter the topology again

    [ https://issues.apache.org/jira/browse/IGNITE-18712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718029#comment-17718029 ] 

Roman Puchkovskiy commented on IGNITE-18712:
--------------------------------------------

Thanks!

> Do not allow a node excluded from Physical Topology to enter the topology again
> -------------------------------------------------------------------------------
>
>                 Key: IGNITE-18712
>                 URL: https://issues.apache.org/jira/browse/IGNITE-18712
>             Project: Ignite
>          Issue Type: New Feature
>            Reporter: Roman Puchkovskiy
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The following scenario is possible:
>  # Node X is part of the Physical Topology (PT)
>  # Its network cable gets unplugged, but node X stays alive
>  # After the relevant timeouts expire, the other nodes remove node X from the PT, so their {{MessagingServices}} drop any messages not yet delivered to node X
>  # The network cable gets plugged back in, so node X attempts to enter the PT with the same old ID (aka Launch ID)
> If we allow it to enter the PT again, some messages from other nodes to node X might be lost, and node X will never know about it. Its memory might still hold state belonging to a process that expects those messages to be delivered later, so some invariants might break.
> To prevent such a situation, the node must be refused entry; namely, the connection must be terminated during the handshake attempt. This has to be done both in {{RecoveryServerHandshakeManager}} and {{RecoveryClientHandshakeManager}}.
> When a node is refused a connection attempt, the refusing node must first send an explanatory message (like 'your ID is stale') and then close the physical connection.
> The refused node must take measures to refresh its identity (like initiating a critical failure using a Failure Handler to reboot).
> It seems that we do not need a consensus of the whole cluster (on the decision that a node has left and should never be allowed to join again) as messaging communications are point-to-point. SWIM 'half consensus' should be enough.
> A subtle thing is how we persist the fact that some node ID is stale. For starters, we could make this information volatile (only keep it in memory), but later we could record this information using CMG.
> Please do not confuse this issue with IGNITE-18685, which was caused by a rejected attempt to fix the same problem.
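
The refusal flow described in the issue can be sketched roughly as follows. This is a minimal illustration, not the actual Ignite code: StaleIdRegistry, FakeChannel, HandshakeGuard and the String-typed launch ID are hypothetical names invented for the example, and the registry is volatile (in-memory only), matching the "for starters" option from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory (volatile) registry of launch IDs that were removed from the PT. */
final class StaleIdRegistry {
    private final Set<String> staleIds = ConcurrentHashMap.newKeySet();

    /** Assumed to be called when the local node learns that a node has left the PT. */
    void markStale(String launchId) {
        staleIds.add(launchId);
    }

    boolean isStale(String launchId) {
        return staleIds.contains(launchId);
    }
}

/** Hypothetical stand-in for a physical connection; not an Ignite or Netty type. */
final class FakeChannel {
    final List<String> sent = new ArrayList<>();
    boolean open = true;

    void send(String message) {
        if (!open) {
            throw new IllegalStateException("channel is closed");
        }
        sent.add(message);
    }

    void close() {
        open = false;
    }
}

/** Hypothetical check that both the server and the client handshake sides could share. */
final class HandshakeGuard {
    static final String STALE_ID_MESSAGE = "your ID is stale";

    private final StaleIdRegistry registry;

    HandshakeGuard(StaleIdRegistry registry) {
        this.registry = registry;
    }

    /** Returns true if the handshake may proceed; otherwise explains the refusal and closes the channel. */
    boolean onHandshakeStart(String remoteLaunchId, FakeChannel channel) {
        if (registry.isStale(remoteLaunchId)) {
            channel.send(STALE_ID_MESSAGE); // first send the explanatory message
            channel.close();                // then close the physical connection
            return false;
        }
        return true;
    }
}
```

Because the decision is made locally from each node's own registry, this sketch needs no cluster-wide consensus, which is in line with the point-to-point reasoning in the description. How the refused node reacts to the message (for example, by triggering a failure handler to restart) is left out here.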



--
This message was sent by Atlassian Jira
(v8.20.10#820010)