Posted to issues@ignite.apache.org by "Alexei Scherbakov (Jira)" <ji...@apache.org> on 2019/09/04 20:00:18 UTC

[jira] [Comment Edited] (IGNITE-12133) O(log n) partition exchange

    [ https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922809#comment-16922809 ] 

Alexei Scherbakov edited comment on IGNITE-12133 at 9/4/19 8:00 PM:
--------------------------------------------------------------------

The PME protocol itself doesn't leverage the ring and uses direct node-to-node communication for sending partition maps (except for a special case). The ring is used by the discovery protocol, which "discovers" topology changes and delivers events to grid nodes; a topology change event, for example "node left" or "node added", is what triggers PME.
The discovery protocol also provides "guaranteed ordered message delivery", which is used extensively by Ignite internals and cannot be replaced easily.

Actually, PME consists of three phases:

1. Discovery phase, which has O(n) complexity for the default TcpDiscoverySpi implementation.
2. Topology unlock waiting (out of this post's scope).
3. PME phase, which has k * O(m) complexity, where m is the number of I/O sender threads and k depends on the topology size.

So the total PME complexity is the sum of phases 1 and 3.
To speed up PME we should improve both.

How to improve phase 1?
The ring was originally designed for small topologies and still works very well for such cases with the default settings.
ZooKeeper-based discovery was introduced specifically for large topologies and has better complexity.
So for small topologies I suggest sticking to the defaults.
For large topologies ZooKeeper discovery should be used (a configuration sketch follows below).
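
For illustration, here is a minimal Java sketch of switching a node to ZooKeeper-based discovery (the connection string and timeouts are placeholder values, and the ignite-zookeeper module is assumed to be on the classpath):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi;

public class ZkDiscoveryConfigExample {
    public static void main(String[] args) {
        // Replace the default TcpDiscoverySpi (ring) with ZooKeeper-based discovery.
        ZookeeperDiscoverySpi zkSpi = new ZookeeperDiscoverySpi();
        zkSpi.setZkConnectionString("zk1:2181,zk2:2181,zk3:2181"); // placeholder hosts
        zkSpi.setSessionTimeout(30_000);
        zkSpi.setJoinTimeout(10_000);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(zkSpi);

        Ignition.start(cfg);
    }
}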

How to improve phase 3?
For small topologies, same as phase 1: use the defaults.
For large topologies we could follow [~mnk]'s proposal and use a tree-like message propagation pattern to achieve O(log n) complexity (see the sketch below).
I agree with [~ivan.glukos] that this increases failover complexity, but I think it's doable.
NOTE: the same idea could be used to improve replicated cache performance on large topologies. We have a long-standing known issue with performance degradation when the topology is large.
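
To illustrate the tree-like pattern (a sketch only, not Ignite code; the node indexing and fan-out factor are assumptions for the example), each node forwards the partition map message to at most k children of a k-ary tree built over the ordered topology, so the number of forwarding hops from the coordinator is O(log n):

import java.util.ArrayList;
import java.util.List;

public class TreeFanoutSketch {
    /**
     * Topology positions that the node at position idx forwards the message to,
     * assuming nodes are sorted (e.g. by node order) and the coordinator is at
     * position 0. With fan-out k the tree depth is O(log_k n).
     */
    static List<Integer> childrenOf(int idx, int fanout, int topSize) {
        List<Integer> children = new ArrayList<>();
        for (int c = 1; c <= fanout; c++) {
            int child = idx * fanout + c;
            if (child < topSize)
                children.add(child);
        }
        return children;
    }

    public static void main(String[] args) {
        // 10 nodes, fan-out 2: 0 -> [1, 2], 1 -> [3, 4], 2 -> [5, 6], ...
        for (int i = 0; i < 10; i++)
            System.out.println(i + " -> " + childrenOf(i, 2, 10));
    }
}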

[~Jokser] 
The gossip idea looks interesting, but it seems like a complicated change and reinventing the wheel. Why not stick to ZooKeeper?







> O(log n) partition exchange
> ---------------------------
>
>                 Key: IGNITE-12133
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12133
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Moti Nisenson-Ken
>            Priority: Major
>
> Currently, partition exchange leverages a ring. This means that communication is O(n) in the number of nodes. It also means that if non-coordinator nodes hang, it can take much longer to successfully resolve the topology.
> Instead, why not use something like a skip list where the coordinator is first. The coordinator can notify the first node at each level of the skip list. Each node then notifies all of its "near-neighbours" in the skip list, where node B is a near-neighbour of node A if max-level(nodeB) <= max-level(nodeA), and nodeB is the first node at its level when traversing from nodeA in the direction of nodeB, skipping over nodes C which have max-level(C) > max-level(A). 
> 1
> 1 . . . 3
> 1 . . . 3 . . . 5
> 1 . 2 . 3 . 4 . 5 . 6
> In the above, 1 would notify 2 and 3, 3 would notify 4 and 5, 2 -> 4, 4 -> 6, and 5 -> 6.
> One can achieve better redundancy by having each node traverse in both directions, and having the coordinator also notify the last node in the list at each level. This way, in the above example, if 2 and 3 were both down, 4 would still get notified by 5 and 6 (in the backwards direction).
>  
> The idea is that each individual node has O(log n) nodes to notify - so the overall time is reduced. Additionally, we can deal well with at least 1 node failure - if one includes the option of processing backwards, 2 consecutive node failures can be handled as well. By taking this kind of an approach, then the coordinator can basically treat any nodes it didn't receive a message from as not-connected, and update the topology as well (disconnecting any nodes that it didn't get a notification from). While there are some edge cases here (e.g. 2 disconnected nodes, then 1 connected node, then 2 disconnected nodes - the connected node would be wrongly ejected from the topology), these would generally be too rare to need explicit handling for.
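
For illustration, here is a minimal Java sketch of one possible reading of the near-neighbour rule quoted above (the levels are taken from the example diagram; the class and method names are hypothetical and this is not Ignite code): for each level a node participates in, it notifies the first node to its right present at that level, skipping nodes whose maximum level exceeds its own.

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SkipListNotifySketch {
    /**
     * Positions that node a notifies when traversing to the right: for each level
     * lvl <= maxLevel[a], the first node to the right present at lvl, skipping
     * nodes whose max level exceeds maxLevel[a].
     */
    static Set<Integer> nearNeighbours(int a, int[] maxLevel) {
        Set<Integer> res = new LinkedHashSet<>();
        for (int lvl = 0; lvl <= maxLevel[a]; lvl++) {
            for (int b = a + 1; b < maxLevel.length; b++) {
                if (maxLevel[b] > maxLevel[a])
                    continue; // skip nodes with a higher max level
                if (maxLevel[b] >= lvl) {
                    res.add(b);
                    break;
                }
            }
        }
        return res;
    }

    public static void main(String[] args) {
        // Max levels of nodes 1..6 from the example diagram (index 0 = node 1).
        int[] maxLevel = {3, 0, 2, 0, 1, 0};
        for (int i = 0; i < maxLevel.length; i++) {
            List<Integer> ids = new ArrayList<>();
            for (int b : nearNeighbours(i, maxLevel))
                ids.add(b + 1); // map position back to the node number in the diagram
            System.out.println("node " + (i + 1) + " notifies " + ids);
        }
    }
}

Under this reading the output matches the quoted example: node 1 notifies 2 and 3, node 3 notifies 4 and 5, 2 -> 4, 4 -> 6, and 5 -> 6.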



--
This message was sent by Atlassian Jira
(v8.3.2#803003)