Posted to issues@ignite.apache.org by "Kirill Gusakov (Jira)" <ji...@apache.org> on 2023/05/05 13:53:00 UTC
[jira] [Updated] (IGNITE-19087) Cancel rebalance mechanism
[ https://issues.apache.org/jira/browse/IGNITE-19087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kirill Gusakov updated IGNITE-19087:
------------------------------------
Description:
Sometimes we must cancel the ongoing rebalance:
* We can receive an unrecoverable error from the replication group during the current rebalance
* We can decide to cancel it manually
[!https://github.com/apache/ignite-3/raw/c276a334c4c3742520494bccdc957f4530c8ed7a/modules/distribution-zones/tech-notes/images/cancelRebalance.svg!|https://github.com/apache/ignite-3/blob/c276a334c4c3742520494bccdc957f4530c8ed7a/modules/distribution-zones/tech-notes/images/cancelRebalance.svg]
h3. 1. Put rebalance intent to *.cancel key
To persist the cancel intent, we must save the (oldTopology, newTopology) pair of peer lists to the {{zoneId.assignment.cancel}} key. In addition, every invoke that updates the {{*.cancel}} key must carry the revision of the pending key whose rebalance is to be cancelled:
{{ if (zoneId.assignment.pending.revision == inputRevision):
     zoneId.assignment.cancel = cancelValue
     return true
 else:
     return false}}
This check prevents a race between rebalance completion and persisting the cancel intent; without it, we could try to cancel the wrong rebalance process.
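The revision check above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: {{InMemoryMetaStorage}}, {{Entry}} and {{put_cancel_intent}} are hypothetical names, not the Ignite meta storage API, and the in-memory store is single-threaded, whereas a real meta storage would have to evaluate the condition and perform the write atomically.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Entry:
    value: Any
    revision: int  # store-wide revision at which this key was last written


class InMemoryMetaStorage:
    """Hypothetical stand-in for a revisioned key-value store (not the Ignite API)."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}
        self._revision = 0

    def put(self, key: str, value: Any) -> int:
        # Every write bumps the store-wide revision and stamps the key with it.
        self._revision += 1
        self._entries[key] = Entry(value, self._revision)
        return self._revision

    def get(self, key: str) -> Optional[Entry]:
        return self._entries.get(key)

    def put_cancel_intent(self, zone_id: str, input_revision: int, cancel_value: Any) -> bool:
        # Write the cancel intent only if the pending key still has the revision
        # the caller observed, i.e. the same rebalance is still the pending one.
        pending = self.get(f"{zone_id}.assignment.pending")
        if pending is None or pending.revision != input_revision:
            return False
        self.put(f"{zone_id}.assignment.cancel", cancel_value)
        return True


storage = InMemoryMetaStorage()
old_topology, new_topology = ["A", "B"], ["A", "B", "C"]
pending_rev = storage.put("zone1.assignment.pending", new_topology)

# The pending key is unchanged, so the cancel intent is persisted.
accepted = storage.put_cancel_intent("zone1", pending_rev, (old_topology, new_topology))

# A later rebalance rewrites the pending key; the old revision is now stale.
storage.put("zone1.assignment.pending", ["A", "B", "D"])
rejected = storage.put_cancel_intent("zone1", pending_rev, (old_topology, new_topology))
```

In this sketch the first call is accepted and the second is rejected, because the pending key was rewritten between the two attempts and the revision supplied by the caller no longer matches.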
h3. 2. PrimaryReplica->ReplicationGroup cancel protocol
When the PrimaryReplica sends {{CancelRebalanceRequest(oldTopology, newTopology)}} to the ReplicationGroup, the following cases are possible:
* The replication group has an ongoing rebalance oldTopology->newTopology. It must be cancelled, and the configuration state of the replication group must be cleaned up back to oldTopology.
* The replication group has no ongoing rebalance and currentTopology==oldTopology. There is nothing to cancel, so it returns a success response.
* The replication group has no ongoing rebalance and currentTopology==newTopology. The cancel request can't be executed, so it returns a response saying so. The recipient of this response (placement driver) must log this fact and run the same routine as for a usual rebalanceDone.
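The three cases above can be sketched as a small decision function. This is only an illustration under stated assumptions: {{ReplicationGroup}}, {{CancelResult}}, {{ongoing_target}} and {{handle_cancel}} are hypothetical names rather than Ignite classes, and requests that match none of the three listed cases raise an error here only because the description does not say how they should be handled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

Topology = Tuple[str, ...]


class CancelResult(Enum):
    CANCELLED = "cancelled"                  # ongoing rebalance stopped, state reset to oldTopology
    NOTHING_TO_CANCEL = "nothing_to_cancel"  # no rebalance, group already on oldTopology
    ALREADY_DONE = "already_done"            # rebalance finished before the cancel arrived


@dataclass
class ReplicationGroup:
    """Hypothetical model of a replication group's configuration state."""
    current_topology: Topology
    ongoing_target: Optional[Topology] = None  # target of the ongoing rebalance, if any

    def handle_cancel(self, old: Topology, new: Topology) -> CancelResult:
        if self.ongoing_target is not None:
            if self.current_topology == old and self.ongoing_target == new:
                # Case 1: cancel the rebalance and keep the group on oldTopology.
                self.ongoing_target = None
                return CancelResult.CANCELLED
            # Not covered by the description; rejected here as an assumption.
            raise ValueError("ongoing rebalance does not match the cancel request")
        if self.current_topology == old:
            # Case 2: nothing to cancel, report success.
            return CancelResult.NOTHING_TO_CANCEL
        if self.current_topology == new:
            # Case 3: too late to cancel; the caller logs it and runs rebalanceDone.
            return CancelResult.ALREADY_DONE
        # Not covered by the description; rejected here as an assumption.
        raise ValueError("current topology matches neither side of the request")


old, new = ("A", "B"), ("A", "B", "C")

in_progress = ReplicationGroup(current_topology=old, ongoing_target=new)
case1 = in_progress.handle_cancel(old, new)

case2 = ReplicationGroup(current_topology=old).handle_cancel(old, new)
case3 = ReplicationGroup(current_topology=new).handle_cancel(old, new)
```

On receiving the third result, the placement driver would log the fact and continue with its usual rebalanceDone routine, as the last bullet describes.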
was:
Sometimes we must cancel the ongoing rebalance:
* We can receive an unrecoverable error from replication group during the current rebalance
* We can decide to cancel it manually
[!https://github.com/apache/ignite-3/raw/c276a334c4c3742520494bccdc957f4530c8ed7a/modules/distribution-zones/tech-notes/images/cancelRebalance.svg!|https://github.com/apache/ignite-3/blob/c276a334c4c3742520494bccdc957f4530c8ed7a/modules/distribution-zones/tech-notes/images/cancelRebalance.svg]
> Cancel rebalance mechanism
> --------------------------
>
> Key: IGNITE-19087
> URL: https://issues.apache.org/jira/browse/IGNITE-19087
> Project: Ignite
> Issue Type: Task
> Reporter: Kirill Gusakov
> Priority: Major
> Labels: ignite-3