Posted to commits@ignite.apache.org by sk...@apache.org on 2021/11/16 16:49:28 UTC

[ignite-3] branch main updated: IGNITE-15493 Added technical note about changePeers. Fixes #420

This is an automated email from the ASF dual-hosted git repository.

sk0x50 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/ignite-3.git


The following commit(s) were added to refs/heads/main by this push:
     new 5b9a948  IGNITE-15493 Added technical note about changePeers. Fixes #420
5b9a948 is described below

commit 5b9a948fa14cdc78db1585afa175dc9072899eb2
Author: Kirill Gusakov <kg...@gmail.com>
AuthorDate: Tue Nov 16 19:45:17 2021 +0300

    IGNITE-15493 Added technical note about changePeers. Fixes #420
    
    Signed-off-by: Slava Koptilin <sl...@gmail.com>
---
 modules/raft/tech-notes/changePeers.md | 47 ++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/modules/raft/tech-notes/changePeers.md b/modules/raft/tech-notes/changePeers.md
new file mode 100644
index 0000000..bad3b1c
--- /dev/null
+++ b/modules/raft/tech-notes/changePeers.md
@@ -0,0 +1,47 @@
+## Introduction
+ChangePeers is not a separate jraft log command, but an algorithm with several distinct phases (see the sketch after this list):
+- Start replicators for the new nodes on the leader and wait for those nodes to catch up.
+- Push a configuration LogEntry to the composite raft quorum (old quorum + new quorum) and apply ConfigurationEntry(conf=\<new peers configuration>, oldConf=\<previous peers configuration>) on the leader.
+- Once the joint configuration is committed by both quorums, push a configuration LogEntry to the new quorum only and apply ConfigurationEntry(conf=\<new peers configuration>, oldConf=null) on the leader. Legacy replicators are stopped, and if the current leader is not in the new topology, it steps down.
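+
+For orientation, here is a minimal sketch of how a caller triggers this algorithm through the jraft Node API. This is illustrative only: package names follow the upstream sofa-jraft layout, and the addresses and the wrapper class are made up.
+
+```java
+import com.alipay.sofa.jraft.Node;
+import com.alipay.sofa.jraft.Status;
+import com.alipay.sofa.jraft.conf.Configuration;
+import com.alipay.sofa.jraft.entity.PeerId;
+
+class ChangePeersExample {
+    // Hypothetical entry point: ask the group leader to move to a new peer set.
+    static void changeGroupPeers(Node node) {
+        Configuration newPeers = new Configuration();
+        newPeers.addPeer(new PeerId("192.168.0.4", 8081));
+        newPeers.addPeer(new PeerId("192.168.0.5", 8081));
+        newPeers.addPeer(new PeerId("192.168.0.6", 8081));
+
+        // The closure fires only after the whole STAGE_CATCHING_UP ->
+        // STAGE_JOINT -> STAGE_STABLE sequence described below completes.
+        node.changePeers(newPeers, (Status status) -> {
+            if (status.isOk()) {
+                System.out.println("changePeers finished");
+            } else {
+                System.err.println("changePeers failed: " + status);
+            }
+        });
+    }
+}
+```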
+
+## Catchup phase (STAGE_CATCHING_UP)
+On a changePeers request, the leader starts all the replicators needed for the new peers, and the NodeImpl#confCtx stage is set to STAGE_CATCHING_UP. We will use these stages as logical step names in the rest of this note.
+
+The catchup phase ends at the moment when all new peers have caught up with the leader, that is, when the difference between leader.last_log_index and peer.last_log_index is smaller than NodeOptions#catchupMargin (1000 by default).
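+
+A minimal sketch of that criterion (illustrative only; in jraft the real check lives inside the NodeImpl/replicator machinery, not in a helper like this):
+
+```java
+class CatchupCheck {
+    // A new peer is considered caught up once it is within catchupMargin
+    // entries of the leader's log.
+    static boolean isCaughtUp(long leaderLastLogIndex, long peerLastLogIndex, long catchupMargin) {
+        return leaderLastLogIndex - peerLastLogIndex < catchupMargin;
+    }
+}
+
+// With the default NodeOptions#catchupMargin of 1000:
+//   isCaughtUp(10_500, 9_700, 1000) -> true  (gap of 800)
+//   isCaughtUp(10_500, 9_200, 1000) -> false (gap of 1_300)
+```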
+
+When catchup finishes, the NodeImpl#confCtx stage moves to STAGE_JOINT and the first real configuration change starts.
+
+## Apply composite configuration with old and new peers (STAGE_JOINT)
+First, a LogEntry(type=ENTRY_TYPE_CONFIGURATION, peers=newConf.peers, oldPeers=oldConf.peers, ...) is pushed to the composite raft quorum (oldQuorum + newQuorum). A ConfigurationChangeDone listener is created, which will be invoked when both quorums accept the entry.
+
+The entry is also put into the leader's logManager, and this configuration is applied on the leader immediately.
+
+When ConfigurationChangeDone is invoked (so both quorums have accepted the configuration entry), the NodeImpl#confCtx stage moves to STAGE_STABLE.
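+
+The acceptance rule assumed here can be sketched as follows (hypothetical helper; jraft actually tracks acknowledgements through its ballot machinery rather than a method like this):
+
+```java
+import java.util.List;
+import java.util.Set;
+
+import com.alipay.sofa.jraft.entity.PeerId;
+
+class JointQuorumCheck {
+    // During STAGE_JOINT an entry is committed only when a strict majority
+    // of the old peers AND a strict majority of the new peers acknowledge it.
+    static boolean acceptedByJointQuorum(Set<PeerId> acks, List<PeerId> oldPeers, List<PeerId> newPeers) {
+        long oldAcks = oldPeers.stream().filter(acks::contains).count();
+        long newAcks = newPeers.stream().filter(acks::contains).count();
+        return oldAcks > oldPeers.size() / 2 && newAcks > newPeers.size() / 2;
+    }
+}
+```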
+
+## Apply configuration with only new peers (STAGE_STABLE)
+The last chord of this song.
+
+A LogEntry(type=ENTRY_TYPE_CONFIGURATION, peers=newConf.peers, oldPeers=null, ...) is pushed to the new quorum.
+
+The configuration of the leader is changed to the new peers only.
+
+When the new quorum accepts the new configuration, the legacy replicators are stopped, and the leader steps down if it was removed from the new configuration.
+
+After that, changePeers is finished and the client receives the response.
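+
+The tail of this phase can be summarized with the following sketch (hypothetical hooks standing in for logic that is really spread across NodeImpl and its replicator group):
+
+```java
+import com.alipay.sofa.jraft.conf.Configuration;
+import com.alipay.sofa.jraft.entity.PeerId;
+
+class StableStageSketch {
+    // Hypothetical hooks, named here only for illustration.
+    void stopReplicatorsNotIn(Configuration conf) { /* stop obsolete replicators */ }
+    void stepDown() { /* relinquish leadership; a new election follows */ }
+    void respondToClient() { /* complete the changePeers closure */ }
+
+    // Called once the new quorum has committed the stable configuration.
+    void onStableConfCommitted(Configuration newConf, PeerId leaderId) {
+        stopReplicatorsNotIn(newConf);
+
+        if (!newConf.contains(leaderId)) {
+            stepDown();
+        }
+
+        respondToClient();
+    }
+}
+```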
+
+## Questions
+>Is it possible to change peers for the case when the old and new sets of raft nodes do not intersect?
+
+Yes. According to the algorithm, non-intersecting old and new peer sets are not an issue.
+
+>When does changePeers() return to the client?
+
+At the end of the whole process (including data migration). This looks like a limitation not of the algorithm itself, but of ChangePeersRequestProcessor from the cli package. The group stays fully operational during the longest phase of changePeers, the data migration, so maybe we just need an async version of ChangePeersRequestProcessor.
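+
+A possible shape for such an async variant (purely a sketch; no such API exists today):
+
+```java
+import java.util.concurrent.CompletableFuture;
+
+import com.alipay.sofa.jraft.Node;
+import com.alipay.sofa.jraft.Status;
+import com.alipay.sofa.jraft.conf.Configuration;
+
+class AsyncChangePeersSketch {
+    // Hypothetical wrapper: return to the caller immediately and expose
+    // completion of the whole changePeers process as a future.
+    static CompletableFuture<Status> changePeersAsync(Node node, Configuration newPeers) {
+        CompletableFuture<Status> done = new CompletableFuture<>();
+        node.changePeers(newPeers, done::complete);
+        return done;
+    }
+}
+```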
+
+>Let’s check whether dataRebalance is a raft command that works just like any other raft command and does not expect index gaps.
+
+As mentioned earlier, dataRebalance is a background process, while configuration changes are processed through the usual raft LogEntry flow.
+
+>Let’s check that snapshots work the same way as log rebalance in the context of index movement.
+
+Snapshots can be used during the catchup phase along with log entries; this is not an issue, I think. Every snapshot carries its last entry index, so it only affects the internal mechanism of data movement during catchup.