Posted to issues@ignite.apache.org by "Alexey Scherbakov (Jira)" <ji...@apache.org> on 2020/05/11 08:26:00 UTC

[jira] [Comment Edited] (IGNITE-12617) PME-free switch should wait for recovery only at affected nodes.

    [ https://issues.apache.org/jira/browse/IGNITE-12617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103757#comment-17103757 ] 

Alexey Scherbakov edited comment on IGNITE-12617 at 5/11/20, 8:25 AM:
----------------------------------------------------------------------

[~avinogradov]

Your solution has 2 drawbacks:

1. Double latch waiting if replicated caches are present in the topology.
2. It degrades to a no-op if backups are spread across grid nodes (the default behavior with rendezvous affinity).

I would like to propose an algorithm which should provide the same latency decrease as the best case of your approach, without the obvious drawbacks (a rough sketch of the message flow follows the steps below).

Assume a baseline node has failed.

1. As soon as all nodes have received the node-failed event, the coordinator sends a custom discovery message, CollectMaxCountersMessage, to collect the max update counter for every affected partition (all primaries of the failed node). No new transaction can reserve a counter at this point.
2. Each node that has a newly assigned primary partition waits for countersReadyFuture before finishing its exchange future.
3. All nodes that have no such partitions finish the exchange future immediately.
4. The coordinator, on receiving the CollectMaxCountersMessage, prepares a CountersReadyMessage and sends it directly to all affected nodes from step 2.
5. Each affected node, on receiving the CountersReadyMessage, applies the max counter to the partition and finishes the exchange future.
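
For illustration only, a rough self-contained Java sketch of this flow. CollectMaxCountersMessage and CountersReadyMessage are the messages proposed above (they do not exist in the codebase yet), and countersReadyFuture is modelled with a plain CompletableFuture:

{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/** Rough sketch of the proposed switch flow; not actual Ignite internals. */
class PmeFreeSwitchSketch {
    /** Steps 1 and 4: the coordinator aggregates the max update counters reported
     *  for the partitions that were primary on the failed node. */
    static Map<Integer, Long> collectMaxCounters(List<Map<Integer, Long>> perNodeCounters) {
        Map<Integer, Long> max = new HashMap<>();

        for (Map<Integer, Long> counters : perNodeCounters)
            counters.forEach((part, cntr) -> max.merge(part, cntr, Math::max));

        return max; // Payload of the CountersReadyMessage sent to the affected nodes.
    }

    /** Steps 2, 3 and 5: a node finishes the exchange immediately unless it received
     *  a newly assigned primary partition, in which case it waits for the counters. */
    static void onSwitch(boolean hasNewPrimaries,
                         CompletableFuture<Map<Integer, Long>> countersReadyFuture,
                         Runnable finishExchangeFuture) {
        if (!hasNewPrimaries) {
            finishExchangeFuture.run(); // Step 3: unaffected node, finish right away.

            return;
        }

        // Step 2: wait for the CountersReadyMessage, then apply the counters and finish (step 5).
        countersReadyFuture.thenAccept(maxCounters -> {
            applyMaxCounters(maxCounters);

            finishExchangeFuture.run();
        });
    }

    /** Placeholder for bumping local partition update counters to the collected max. */
    static void applyMaxCounters(Map<Integer, Long> maxCounters) {
        maxCounters.forEach((part, cntr) -> System.out.println("part " + part + " -> " + cntr));
    }
}
{code}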

The algorithm also benefits from "cells".
It can be further improved by integrating the max-counters collection into node-failure processing at the discovery layer.
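
Regarding "cells": one way to get cell-like placement is a custom affinity backup filter. A minimal sketch, assuming every node is started with a user attribute named CELL (the attribute name and cache name are made up for the example):

{code:java}
import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.configuration.CacheConfiguration;

/** Sketch: keep backups in the same "cell" as the primary. */
public class CellAffinityExample {
    public static CacheConfiguration<Object, Object> cellAwareCacheCfg() {
        RendezvousAffinityFunction aff = new RendezvousAffinityFunction(false, 1024);

        // A candidate backup is accepted only if it carries the same CELL
        // attribute as the already selected primary (first node in the list).
        aff.setAffinityBackupFilter((candidate, selected) ->
            candidate.attribute("CELL").equals(selected.get(0).attribute("CELL")));

        return new CacheConfiguration<>("tx-cache")
            .setBackups(2)
            .setAffinity(aff);
    }
}
{code}

As with any affinity configuration, the same filter has to be set on every node for the placement to be consistent.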







> PME-free switch should wait for recovery only at affected nodes.
> ----------------------------------------------------------------
>
>                 Key: IGNITE-12617
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12617
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Anton Vinogradov
>            Assignee: Anton Vinogradov
>            Priority: Major
>              Labels: iep-45
>             Fix For: 2.9
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Since IGNITE-9913, new-topology operations are allowed immediately after the cluster-wide recovery has finished.
> But is there any reason to wait for a cluster-wide recovery if only one node failed?
> In this case, we should recover only the failed node's backups.
> Unfortunately, {{RendezvousAffinityFunction}} tends to spread a node's backup partitions across the whole cluster. In that case we obviously have to wait for cluster-wide recovery on the switch.
> But what if only a limited set of nodes holds the backups for every primary?
> If nodes are combined into virtual cells where, for each partition, the backups are located in the same cell as the primary, it is possible to finish the switch outside the affected cell before tx recovery finishes.
> This optimization allows us to start and even finish new operations outside the failed cell without waiting for the cluster-wide switch finish (broken-cell recovery).
> In other words, the switch (node left/fail + in baseline + rebalanced) will have little effect on the latency of operations not related to the failed cell.
> More precisely (a small sketch of these rules follows the list):
> - We should wait for tx recovery before finishing the switch only on the broken cell.
> - We should wait for replicated caches' tx recovery everywhere, since every node is a backup of the failed one.
> - Upcoming operations related to the broken cell (including all replicated cache operations) will require the cluster-wide switch to finish before they can be processed.
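> A small sketch of the waiting rules above, assuming cell membership is known from a node attribute (the helper and its parameters are illustrative, not existing Ignite API):
> {code:java}
> /** Illustrative only: decides whether a node must wait for tx recovery on switch. */
> final class SwitchWaitRules {
>     static boolean mustWaitForRecovery(String localCell, String failedCell, boolean hasReplicatedCaches) {
>         // Replicated caches: every node is a backup of the failed one, so everyone waits.
>         if (hasReplicatedCaches)
>             return true;
>
>         // Partitioned caches with cell-based affinity: only the broken cell waits.
>         return localCell.equals(failedCell);
>     }
> }
> {code}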



--
This message was sent by Atlassian Jira
(v8.3.4#803005)