You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@ignite.apache.org by sk...@apache.org on 2022/06/27 11:24:09 UTC
[ignite-3] branch main updated: IGNITE-17143 Update rebalance.md documentation. Fixes #894
This is an automated email from the ASF dual-hosted git repository.
sk0x50 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/ignite-3.git
The following commit(s) were added to refs/heads/main by this push:
new 5cda17885 IGNITE-17143 Update rebalance.md documentation. Fixes #894
5cda17885 is described below
commit 5cda17885bbea824163cd4a3238af79c34564080
Author: Mirza Aliev <al...@gmail.com>
AuthorDate: Mon Jun 27 14:24:01 2022 +0300
IGNITE-17143 Update rebalance.md documentation. Fixes #894
Signed-off-by: Slava Koptilin <sl...@gmail.com>
---
modules/table/tech-notes/rebalance.md | 81 ++++++++++++++++-------------------
1 file changed, 36 insertions(+), 45 deletions(-)
diff --git a/modules/table/tech-notes/rebalance.md b/modules/table/tech-notes/rebalance.md
index 2f096b52c..d0198c8d6 100644
--- a/modules/table/tech-notes/rebalance.md
+++ b/modules/table/tech-notes/rebalance.md
@@ -1,73 +1,61 @@
# How to read this doc
Every algorithm phase has the following main sections:
-- Trigger - how current phase will be invoked
-- Steps/Pseudocode - the main logical steps of the current phase
-- Result (optional, if pseudocode provided) - events and system state changes, which this phase produces
+- Trigger – how current phase will be invoked
+- Steps/Pseudocode – the main logical steps of the current phase
+- Result (optional, if pseudocode provided) – events and system state changes, which this phase produces
# Rebalance algorithm
## Short algorithm description
- Operations, which can trigger rebalance occurred:
- Write new baseline to metastore (effectively from 1 node in cluster)
-
- OR
-
Write new replicas configuration number to table config (effectively from 1 node)
OR
Write new partitions configuration number to table config (effectively from 1 node)
- Write new assignments' intention to metastore (effectively from 1 node in cluster)
-- Start new raft nodes. Initiate/update change peer request to raft group (effectively from 1 node per partition)
+- Start new raft nodes. Initiate/update asynchronous change peer request to raft group (effectively from 1 node per partition)
- Stop all redundant nodes. Change stable partition assignment to the new one and finish rebalance process.
## New metastore keys
For further steps, we should introduce some new metastore keys:
-- `partition.assignments.stable` - the list of peers, which process operations for partition at the current moment.
+- `partition.assignments.stable` - the list of peers, which process operations for a partition at the current moment.
- `partition.assignments.pending` - the list of peers, where current rebalance move the partition.
- `partition.assignments.planned` - the list of peers, which will be used for new rebalance, when current will be finished.
Also, we will need the utility key:
-- `partition.assignments.change.trigger.revision` - the key, needed for processing the event about assignments' update trigger only once.
+- `partition.change.trigger.revision` - the key, needed for processing the event about assignments' update trigger only once.
## Operations, which can trigger rebalance
Three types of events can trigger the rebalance:
-- Change of baseline metastore key (1 for all tables for now, but maybe it should be separate per table in future)
- Configuration change through `org.apache.ignite.configuration.schemas.table.TableChange.changeReplicas` produce metastore update event
- Configuration change through `org.apache.ignite.configuration.schemas.table.TableChange.changePartitions` produce metastore update event (IMPORTANT: this type of trigger has additional difficulties because of cross raft group data migration and it is out of scope of this document)
-**Result**: So, one of three metastore keys' changes will trigger rebalance:
+**Result**: So, one of two metastore keys' changes will trigger rebalance:
```
-<global>.baseline
<tableScope>.replicas
<tableScope>.partitions // out of scope
```
## Write new pending assignments (1)
**Trigger**:
-- Metastore event about change in `<global>.baseline`
-- Metastore event about changes in `<tableScope>.replicas`
+- Metastore event about changes in `<tableScope>.replicas` (See `org.apache.ignite.internal.table.distributed.TableManager.onUpdateReplicas`)
**Pseudocode**:
-```
-onBaselineEvent:
- for table in tableCfg.tables():
- for partition in table.partitions:
- <inline metastoreInvoke>
-
+```
onReplicaNumberChange:
with table as event.table:
for partitoin in table.partitions:
<inline metastoreInvoke>
metastoreInvoke: // atomic metastore call through multi-invoke api
- if empty(partition.assignments.change.trigger.revision) || partition.assignments.change.trigger.revision < event.revision:
+ if empty(partition.change.trigger.revision) || partition.change.trigger.revision < event.revision:
if empty(partition.assignments.pending) && partition.assignments.stable != calcPartAssighments():
partition.assignments.pending = calcPartAssignments()
- partition.assignments.change.trigger.revision = event.revision
+ partition.change.trigger.revision = event.revision
else:
if partition.assignments.pending != calcPartAssignments
partition.assignments.planned = calcPartAssignments()
- partition.assignments.change.trigger.revision = event.revision
+ partition.change.trigger.revision = event.revision
else
remove(partition.assignments.planned)
else:
@@ -75,18 +63,18 @@ metastoreInvoke: // atomic metastore call through multi-invoke api
```
## Start new raft nodes and initiate change peers (2)
-**Trigger**: Metastore event about new `partition.assignments.pending` received
+**Trigger**: Metastore event about new `partition.assignments.pending` received (See corresponding listener for pending key in `org.apache.ignite.internal.table.distributed.TableManager.registerRebalanceListeners`)
**Steps**:
- Start all new needed nodes `partition.assignments.pending / partition.assignments.stable`
-- After successful starts - check if current node is the leader of raft group (leader response must be updated by current term) and `changePeers(leaderTerm, peers)`. `changePeers` from old terms must be skipped.
+- After successful starts - check if current node is the leader of raft group (leader response must be updated by current term) and run `RaftGroupService#changePeersAsync(leaderTerm, peers)`. `RaftGroupService#changePeersAsync` from old terms must be skipped.
**Result**:
- New needed raft nodes started
- Change peers state initiated for every raft group
-## When changePeers done inside the raft group - stop all redundant nodes
-**Trigger**: When leader applied new Configuration with list of resulting peers `<applied peer>`, it calls `onChangePeersCommitted(<applied peers>)`
+## When RaftGroupService#changePeersAsync done inside the raft group – update assignments, stable key and stop all redundant nodes
+**Trigger**: When leader applied new Configuration with list of resulting peers `<applied peer>`, it calls `RebalanceRaftGroupEventsListener.onNewPeersConfigurationApplied(<applied peers>)`
**Pseudocode**:
```
@@ -98,13 +86,16 @@ metastoreInvoke: \\ atomic
partition.assignments.pending = partition.assignments.planned
remove(partition.assignments.planned)
```
+`RebalanceRaftGroupEventsListener.onNewPeersConfigurationApplied(<applied peers>)` also is responsible for the updating of assignments of the table.
+When assignments are updated, corresponding listener for its updates will be triggered (see `org.apache.ignite.internal.table.distributed.TableManager.onUpdateAssignments`), so raft clients will be updated.
+
+After stable key is updated, corresponding listener for that change is called, so redundant raft nodes may be stopped (see corresponding listener for stable key in `org.apache.ignite.internal.table.distributed.TableManager.registerRebalanceListeners`)
-Failover helpers (detailed failover scenarious must be developed in future)
-- `onLeaderElected()` - must be executed from the new leader when raft group elected the new leader. Maybe we actually need to also check if a new lease is received.
-- `onChangePeersError()` - must be executed when any errors during changePeers occurred.
-- `onChangePeersCommitted(peers)` - must be executed with the list of new peers when changePeers has successfully done.
-At the moment, this set of listeners seems to enough for restart rebalance and/or notify node's failover mechanism about fatal issues. Failover scenarios will be explained with more details during first phase of implementation.
+Failover helpers
+- `RebalanceRaftGroupEventsListener.onLeaderElected` - must be executed from the new leader when raft group elected the new leader. Maybe we actually need to also check if a new lease is received.
+- `RebalanceRaftGroupEventsListener.onReconfigurationError` - must be executed when any errors during `RaftGroupService#changePeersAsync` occurred. For more info about change peers process, and more specifically, catch up process, see `modules/raft/tech-notes/changePeers.md` and `modules/raft/tech-notes/nodeCatchUp.md`
+- `RebalanceRaftGroupEventsListener.onNewPeersConfigurationApplied(peers)` - must be executed with the list of new peers when `RaftGroupService#changePeersAsync` has successfully done.
## Cleanup redundant raft nodes (3)
**Trigger**: Node receive update about partition stable assignments
@@ -119,18 +110,18 @@ At the moment, this set of listeners seems to enough for restart rebalance and/o
# Failover
We need to provide Failover thread, which can handle the following cases:
-- `changePeers` can't start even catchup process, because of any new raft nodes wasn't started yet for instance.
-- `changePeers` failed to complete catchup due to catchup timeout, for example.
+- `RaftGroupService#changePeersAsync` can't start even catchup process, because of any new raft nodes wasn't started yet for instance.
+- `RaftGroupService#changePeersAsync` failed to complete catchup due to catchup timeout, for example. To check all possible error cases during catch up stage, check `modules/raft/tech-notes/nodeCatchUp.md`
We have the following mechanisms for handling these cases:
-- `onChangerPeersError(errorContext)`, which must schedule retries, if needed
-- Separate special thread must process all needed retries on the current node
-- If current node is not the leader of partition raft group anymore - it will request `changePeers` with legacy term, receive appropriate answer from the leader and stop retries for this partition.
-- If leader changed, new node will receive `onLeaderElected()` invoke and start needed changePeers from the pending key. Also, it will create new failover thread for retry logic.
-- If failover exhaust maximum number for query retries - it must notify about the issue global node failover (details must be specified later)
+- `RebalanceRaftGroupEventsListener.onReconfigurationError`, which schedules retries, if needed
+- Separate special thread pool processes all needed retries on the current node
+- If a current node is not the leader of partition raft group anymore - it will request `RaftGroupService#changePeersAsync` with legacy term, receive appropriate answer from the leader and stop retries for this partition.
+- If a leader has been changed, new node receives `RebalanceRaftGroupEventsListener.onLeaderElected` invoke and start needed `RaftGroupService#changePeersAsync` from the pending key.
+- If failover exhaust maximum number for query retries - it must notify node's failure handler about the issue global (details must be specified later)
-# Metastore rebalance
+# Metastore rebalance (not implemented yet)
It seems, that rebalance of metastore can be handled the same process, because:
- during the any stage of `changePeers` raft group can handle any another entries
- any rebalance failures must not end up by raft group unavailability (if majority is kept)
@@ -152,20 +143,20 @@ This approach can be addressed with different implemetation details, but let's d
**Pseudocode**
```
metastoreInvoke: // atomic metastore call through multi-invoke api
- if empty(partition.assignments.change.trigger.revision) || partition.assignments.change.trigger.revision < event.revision:
+ if empty(partition.change.trigger.revision) || partition.change.trigger.revision < event.revision:
if empty(partition.assignments.pending):
if partition.assignments.stable != calcPartAssignments():
partition.assignments.pending = calcPartAssignments()
- partition.assignments.change.trigger.revision = event.revision
+ partition.change.trigger.revision = event.revision
else:
skip
else:
if empty(partition.assignments.pending.lock):
partition.assignments.pending = calcPartAssignments()
- partition.assignments.change.trigger.revision = event.revision
+ partition.change.trigger.revision = event.revision
else:
partition.assignments.planned = calcPartAssignments()
- partition.assignments.change.trigger.revision = event.revision
+ partition.change.trigger.revision = event.revision
else:
skip
```