You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ignite.apache.org by "Moti Nisenson-Ken (Jira)" <ji...@apache.org> on 2019/10/06 08:28:00 UTC
[jira] [Comment Edited] (IGNITE-12133) O(log n) partition exchange

    [ https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16945269#comment-16945269 ] 

Moti Nisenson-Ken edited comment on IGNITE-12133 at 10/6/19 8:27 AM:
---------------------------------------------------------------------

I'm not familiar enough with Ignite internals - from a message ordering perspective, since the coordinator is the one initiating this, and the coordinator doesn't change without a change to the major topology version, shouldn't it be simple enough to establish an ordering? The change to the next major topology version should be the last message sent under the previous topology version. Nodes can then hold on to messages received out of order (possibly up to some maximum size) for handling in order. I'd assume that when the coordinator goes down, the new coordinator will need to notify under the previous topology version.

I agree that the ability to mutate messages is lost; this requires sending the updates back to the coordinator which then has additional work to do merging them. 

Regarding [~ivan.glukos] I don't think that top-level node failure should be much of an issue. I did miss that we do need the forward and backwards communications. If you look in my diagram above 1 -> 3 and 1 -> 2. Thus, failure of 3 doesn't prevent 4 from getting a message, since 2 -> 4 (3 is not a near neighbour of 2 because it's max level is higher than 2's). Although without backward comms, 5 would get disconnected. However, if we include it, then each node is contacted by either: the coordinator or 4 other nodes. To see this, if a node isn't contacted by the coordinator, then it must have nodes preceding and succeeding it at its max-level. Both of those nodes will contact it. But then there must be additional nodes at a greater level between those nodes and it, which also will contact it. Note that nodes contacted by the coordinator may also be contacted by an additional node in the other direction, not that it matters.


was (Author: mnk):
I'm not familiar enough with Ignite internals - from a message ordering perspective, since the coordinator is the one initiating this, and the coordinator doesn't change without a change to the major topology version, shouldn't it be simple enough to establish an ordering? The change to the next major topology version should be the last message sent under the previous topology version. Nodes can then hold on to messages received out of order (possibly up to some maximum size) for handling in order. I'd assume that when the coordinator goes down, the new coordinator will need to notify under the previous topology version.

I agree that the ability to mutate messages is lost; this requires sending the updates back to the coordinator which then has additional work to do merging them. 

 

Regarding [~ivan.glukos] note - top level node failures are not an issue. The issue is around consecutive node failures at the bottom most layer. If you look in my diagram above 1 -> 3 and 1 -> 2. Thus, failure of 3 doesn't prevent 4 from getting a message, since 2 -> 4 (3 is not a near neighbour of 2 because it's max level is higher than 2's). Again, adding comms in both directions (where we have 1 -> 5 and 1 -> 6 as well), makes this even more robust.

> O(log n) partition exchange
> ---------------------------
>
>                 Key: IGNITE-12133
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12133
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Moti Nisenson-Ken
>            Priority: Major
>
> Currently, partition exchange leverages a ring. This means that communications is O\(n) in number of nodes. It also means that if non-coordinator nodes hang it can take much longer to successfully resolve the topology.
> Instead, why not use something like a skip-list where the coordinator is first. The coordinator can notify the first node at each level of the skip-list. Each node then notifies all of its "near-neighbours" in the skip-list, where node B is a near-neighbour of node-A, if max-level(nodeB) <= max-level(nodeA), and nodeB is the first node at its level when traversing from nodeA in the direction of nodeB, skipping over nodes C which have max-level(C) > max-level(A). 
> 1
> 1 .  .  .3
> 1        3 . .  . 5
> 1 . 2 . 3 . 4 . 5 . 6
> In the above 1 would notify 2 and 3, 3 would notify 4 and 5, 2 -> 4, and 4 -> 6, and 5 -> 6.
> One can achieve better redundancy by having each node traverse in both directions, and having the coordinator also notify the last node in the list at each level. This way in the above example if 2 and 3 were both down, 4 would still get notified from 5 and 6 (in the backwards direction).
>  
> The idea is that each individual node has O(log n) nodes to notify - so the overall time is reduced. Additionally, we can deal well with at least 1 node failure - if one includes the option of processing backwards, 2 consecutive node failures can be handled as well. By taking this kind of an approach, then the coordinator can basically treat any nodes it didn't receive a message from as not-connected, and update the topology as well (disconnecting any nodes that it didn't get a notification from). While there are some edge cases here (e.g. 2 disconnected nodes, then 1 connected node, then 2 disconnected nodes - the connected node would be wrongly ejected from the topology), these would generally be too rare to need explicit handling for.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)