Posted to issues@ignite.apache.org by "Roman Puchkovskiy (Jira)" <ji...@apache.org> on 2023/02/08 06:38:00 UTC

[jira] [Resolved] (IGNITE-18630) Try to deliver a message until receiver drops out from logical topology

     [ https://issues.apache.org/jira/browse/IGNITE-18630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roman Puchkovskiy resolved IGNITE-18630.
----------------------------------------
    Resolution: Invalid

Superseded by IGNITE-18712

> Try to deliver a message until receiver drops out from logical topology
> -----------------------------------------------------------------------
>
>                 Key: IGNITE-18630
>                 URL: https://issues.apache.org/jira/browse/IGNITE-18630
>             Project: Ignite
>          Issue Type: Improvement
>          Components: networking
>            Reporter: Roman Puchkovskiy
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently, there are two topologies: physical (bound 1:1 to ScaleCube events) and logical. When a node appears in the physical topology (PT), validation starts; if it succeeds, the node is added to the logical topology (LT). Dropping from the PT immediately removes the node from the LT.
> We use the PT as the set of nodes to which the current node can send messages. This means that if ScaleCube loses sight of a node because of a transient glitch (caused by a GC pause, for example) and the node then becomes visible again, we still remove the node from the PT in the meantime, making it impossible to deliver a message to it. Transient network glitches therefore harm the reliability of messaging.
> The suggestion is to switch to the following:
>  # We decouple the ScaleCube topology from the PT, so we now have 3 topologies: the ScaleCube topology, tracked via ScaleCube events (nodes that our node considers alive from the point of view of the SWIM protocol); the physical topology (nodes that we consider reachable and to which we can send messages); and the logical topology (nodes that passed validation and joined the cluster)
>  # A node enters PT when it appears in the ScaleCube topology (ST), but it leaves the PT when it leaves the LT
>  # Logical topology 'leave' events will be triggered by ST leave events, but with a delay, so that if a node returns to the ST with the same ScaleCube ID before the delay expires, the LT leave event is not fired
> Summing up:
>  # When a node appears in ST, it appears in PT
>  # When it appears in PT, validation process starts (which might lead to adding the node to LT)
>  # When a node leaves ST, a delayed removal from LT is scheduled. It is cancelled if the node appears in ST again
>  # When a node leaves LT, it leaves PT (making it impossible to send a message to it)
>  # When doing a graceful shutdown, a node should send a 'graceful LT leave' message so that it drops from LT and PT immediately, without the timeout defined in item 3.
>  # If a node is removed from LT, it cannot be admitted to PT again with the same ID (the ID is the 'launch ID', not the consistent ID); to re-enter, it must change its ID (this will be implemented in IGNITE-18685)
> As LT events are distributed using RAFT, if a node loses the ability to connect to the CMG leader, it will never drop other nodes from its PT, so it will keep trying to deliver messages to them indefinitely. This seems OK.
> One thing that should be considered is that {{TopologyService}} (for PT) and {{LogicalTopologyService}} are defined in different modules, which might cause difficulties when they subscribe to each other's events.
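
The rules in the 'Summing up' list of the description can be illustrated with a minimal Java sketch. It is only an illustration under stated assumptions, not Ignite's implementation: the class name TopologySketch and all of its methods are hypothetical, validation is reduced to an explicit onValidated call, LT events are applied locally instead of being distributed via RAFT, and a node that never reached the LT is assumed to be dropped from the PT after the same delay.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the proposed ST/PT/LT rules; not Ignite's actual API. */
class TopologySketch {
    private final Set<String> physical = new HashSet<>();   // PT: nodes we may send messages to
    private final Set<String> logical = new HashSet<>();    // LT: nodes that passed validation
    private final Set<String> retiredIds = new HashSet<>(); // launch IDs that already left the LT
    private final Map<String, Object> pendingRemovals = new HashMap<>();
    private final ScheduledExecutorService timer;
    private final long removalDelayMillis;

    TopologySketch(ScheduledExecutorService timer, long removalDelayMillis) {
        this.timer = timer;
        this.removalDelayMillis = removalDelayMillis;
    }

    /** Items 1-3: a node seen in the ST enters the PT; a pending removal is cancelled. */
    synchronized void onStAppeared(String launchId) {
        if (retiredIds.contains(launchId)) {
            return; // Item 6: the same launch ID is never admitted again.
        }
        pendingRemovals.remove(launchId); // Invalidates any scheduled removal token.
        physical.add(launchId);
        // Item 2: validation would start here and call onValidated() on success.
    }

    /** Item 2: successful validation adds the node to the LT. */
    synchronized void onValidated(String launchId) {
        if (physical.contains(launchId)) {
            logical.add(launchId);
        }
    }

    /** Item 3: leaving the ST only schedules a delayed removal. */
    synchronized void onStLeft(String launchId) {
        if (!physical.contains(launchId)) {
            return;
        }
        Object token = new Object(); // Distinguishes this schedule from earlier ones.
        pendingRemovals.put(launchId, token);
        timer.schedule(() -> expire(launchId, token), removalDelayMillis, TimeUnit.MILLISECONDS);
    }

    /** Item 5: a graceful leave skips the delay. */
    synchronized void onGracefulLeave(String launchId) {
        pendingRemovals.remove(launchId);
        removeFromLogical(launchId);
    }

    private synchronized void expire(String launchId, Object token) {
        // Act only if this exact removal is still pending (the node did not return).
        if (pendingRemovals.remove(launchId, token)) {
            removeFromLogical(launchId);
        }
    }

    /** Items 4 and 6: leaving the LT also means leaving the PT, and the ID is retired. */
    private void removeFromLogical(String launchId) {
        logical.remove(launchId);
        physical.remove(launchId);
        retiredIds.add(launchId);
    }

    synchronized boolean canSendTo(String launchId) {
        return physical.contains(launchId);
    }

    synchronized boolean inLogicalTopology(String launchId) {
        return logical.contains(launchId);
    }
}
```

In this sketch, a caller would feed ScaleCube membership events into onStAppeared and onStLeft and consult canSendTo before sending. The per-schedule token is used instead of cancelling the timer task, so a removal that has already started running cannot evict a node that has just returned to the ST.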



--
This message was sent by Atlassian Jira
(v8.20.10#820010)