Posted to issues@ignite.apache.org by "Kirill Gusakov (Jira)" <ji...@apache.org> on 2023/05/05 13:53:00 UTC
[jira] [Updated] (IGNITE-19087) Cancel rebalance mechanism

     [ https://issues.apache.org/jira/browse/IGNITE-19087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kirill Gusakov updated IGNITE-19087:
------------------------------------
    Description: 
Sometimes we must cancel the ongoing rebalance:
 * We can receive an unrecoverable error from the replication group during the current rebalance
 * We can decide to cancel it manually

[!https://github.com/apache/ignite-3/raw/c276a334c4c3742520494bccdc957f4530c8ed7a/modules/distribution-zones/tech-notes/images/cancelRebalance.svg!|https://github.com/apache/ignite-3/blob/c276a334c4c3742520494bccdc957f4530c8ed7a/modules/distribution-zones/tech-notes/images/cancelRebalance.svg]
 
h3. 1. Put the rebalance cancel intent into the *.cancel key

To persist the cancel intent, we must save the (oldTopology, newTopology) pair of peer lists to the {{zoneId.assignment.cancel}} key. In addition, every invoke that updates the {{*.cancel}} key must carry the revision of the pending key that is being cancelled:
 {{    if(zoneId.assignment.pending.revision == inputRevision):
        zoneId.assignment.cancel = cancelValue
        return true
    else:
        return false}}
This revision check prevents a race between the completion of the rebalance and the persisting of the cancel intent. Without it, we could try to cancel the wrong rebalance process.
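The conditional update above can be sketched as a small in-memory model. This is only an illustration under assumptions: {{ToyMetaStorage}}, {{Entry}} and {{put_cancel_intent}} are hypothetical names, not the Ignite meta storage API, and a real implementation would have to run the revision check and the write as one atomic invoke.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Entry:
    value: Any
    revision: int


@dataclass
class ToyMetaStorage:
    """In-memory stand-in for the meta storage (hypothetical, not the Ignite API)."""

    entries: Dict[str, Entry] = field(default_factory=dict)
    revision: int = 0

    def put(self, key: str, value: Any) -> int:
        """Store a value and return the new revision of the key."""
        self.revision += 1
        self.entries[key] = Entry(value, self.revision)
        return self.revision

    def put_cancel_intent(self, zone_id: str, input_revision: int, cancel_value: Any) -> bool:
        """Write the *.cancel key only if the pending key still has input_revision.

        A real meta storage would have to run the check and the write atomically.
        """
        pending = self.entries.get(f"{zone_id}.assignment.pending")
        if pending is None or pending.revision != input_revision:
            return False  # the pending key changed or is gone: do not cancel
        self.put(f"{zone_id}.assignment.cancel", cancel_value)
        return True


storage = ToyMetaStorage()
rev = storage.put("zone1.assignment.pending", ["A", "B", "C"])
cancel_value = (["A", "B"], ["A", "B", "C"])  # (oldTopology, newTopology)

first = storage.put_cancel_intent("zone1", rev, cancel_value)   # pending key unchanged
storage.put("zone1.assignment.pending", ["A", "C", "D"])        # a newer rebalance starts
second = storage.put_cancel_intent("zone1", rev, cancel_value)  # stale revision
```

In this sketch the first call succeeds because the pending key still has the revision the caller observed. The second call is rejected because the pending key has since been rewritten, so the cancel intent would target a rebalance that is no longer the current one.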
h3. 2. PrimaryReplica->ReplicationGroup cancel protocol

When PrimaryReplica sends {{CancelRebalanceRequest(oldTopology, newTopology)}} to the ReplicationGroup, the following cases are possible:
 * The replication group has an ongoing rebalance oldTopology->newTopology. It must be cancelled, and the configuration state of the replication group must be cleaned up back to oldTopology.
 * The replication group has no ongoing rebalance and currentTopology==oldTopology. There is nothing to cancel, so a success response is returned.
 * The replication group has no ongoing rebalance and currentTopology==newTopology. The cancel request can't be executed, so a response saying so is returned. The recipient of this response (the placement driver) must log this fact and run the same routine as for a usual rebalanceDone.
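The three cases above can be sketched as a single handler. This is a minimal model under assumptions: {{ToyReplicationGroup}}, {{CancelResult}} and {{handle_cancel}} are hypothetical names, topologies are modelled as tuples of peer names, and any state the description does not cover is reported as an error rather than guessed at.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

Topology = Tuple[str, ...]


class CancelResult(Enum):
    CANCELLED = "cancelled"                  # ongoing rebalance was rolled back
    NOTHING_TO_CANCEL = "nothing_to_cancel"  # group is already on oldTopology
    ALREADY_DONE = "already_done"            # rebalance finished before the cancel


@dataclass
class ToyReplicationGroup:
    """Hypothetical model of a replication group, not the Ignite implementation."""

    current: Topology
    ongoing: Optional[Tuple[Topology, Topology]] = None  # (old, new) if rebalancing

    def handle_cancel(self, old: Topology, new: Topology) -> CancelResult:
        if self.ongoing == (old, new):
            # Case 1: cancel the rebalance and reset the configuration to old.
            self.ongoing = None
            self.current = old
            return CancelResult.CANCELLED
        if self.ongoing is None and self.current == old:
            # Case 2: nothing to cancel, report success.
            return CancelResult.NOTHING_TO_CANCEL
        if self.ongoing is None and self.current == new:
            # Case 3: too late to cancel; the caller treats it as rebalanceDone.
            return CancelResult.ALREADY_DONE
        # Assumption: states outside the three described cases are an error.
        raise ValueError("cancel request does not match the group state")


old, new = ("A", "B"), ("A", "B", "C")

in_progress = ToyReplicationGroup(current=old, ongoing=(old, new))
idle_on_old = ToyReplicationGroup(current=old)
idle_on_new = ToyReplicationGroup(current=new)

results = [g.handle_cancel(old, new) for g in (in_progress, idle_on_old, idle_on_new)]
```

On {{ALREADY_DONE}} the placement driver would log the fact and continue with its usual rebalanceDone routine, which is outside the scope of this sketch.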

  was:
Sometimes we must cancel the ongoing rebalance:
 * We can receive an unrecoverable error from replication group during the current rebalance
 * We can decide to cancel it manually

[!https://github.com/apache/ignite-3/raw/c276a334c4c3742520494bccdc957f4530c8ed7a/modules/distribution-zones/tech-notes/images/cancelRebalance.svg!|https://github.com/apache/ignite-3/blob/c276a334c4c3742520494bccdc957f4530c8ed7a/modules/distribution-zones/tech-notes/images/cancelRebalance.svg]

> Cancel rebalance mechanism
> --------------------------
>
>                 Key: IGNITE-19087
>                 URL: https://issues.apache.org/jira/browse/IGNITE-19087
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Kirill Gusakov
>            Priority: Major
>              Labels: ignite-3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)