Posted to commits@ignite.apache.org by ag...@apache.org on 2021/04/24 10:11:51 UTC

[ignite-3] branch ignite-14647 created (now c9fa5ec)

This is an automated email from the ASF dual-hosted git repository.

agoncharuk pushed a change to branch ignite-14647
in repository https://gitbox.apache.org/repos/asf/ignite-3.git.


      at c9fa5ec  IGNITE-14647 Describe Raft-based rebalance process

This branch includes the following new commits:

     new c9fa5ec  IGNITE-14647 Describe Raft-based rebalance process

The 1 revision listed above as "new" is entirely new to this
repository and will be described in a separate email.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


[ignite-3] 01/01: IGNITE-14647 Describe Raft-based rebalance process

Posted by ag...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

agoncharuk pushed a commit to branch ignite-14647
in repository https://gitbox.apache.org/repos/asf/ignite-3.git

commit c9fa5ec4dfd62ac9f0c7caa6350c4bcd5051f2e8
Author: Alexey Goncharuk <al...@gmail.com>
AuthorDate: Sat Apr 24 13:11:38 2021 +0300

    IGNITE-14647 Describe Raft-based rebalance process
---
 modules/affinity/README.md | 86 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)

diff --git a/modules/affinity/README.md b/modules/affinity/README.md
new file mode 100644
index 0000000..2c2db7c
--- /dev/null
+++ b/modules/affinity/README.md
@@ -0,0 +1,86 @@
+# Partitioning approach
+
+## Hash-based partitioning
+
+## Range-based partitioning
+
+# Data migration (rebalance)
+There is a significant difference between the rebalance approach in Ignite 2.x and the rebalance approach in Ignite 3.x.
+
+Ignite 2.x implemented the rebalance process with updates being applied to the storage concurrently with the data
+migration process. This results in a complex interaction between the rebalance process and the data update protocol
+(the necessity to compare key-value versions during data migration, different entry processor application paths
+depending on whether rebalance is active, uncertain partition state during recovery, etc.).
+
+Ignite 3.x relies on a common replication infrastructure for data replication between nodes, so rebalance should
+be handled by means of the replication protocols.
+
+## Raft
+The Raft consensus protocol does not have a concept of rebalance. Instead, it relies on two underlying mechanisms to
+catch offline nodes up to speed and bootstrap new Raft group members: the Raft log and snapshots. These mechanisms
+handle both the delta rebalance scenario (when a local node has recent enough local state, so it can be brought up to
+speed by replaying only the missing Raft log commands) and the full rebalance scenario (when the Raft group no longer
+retains enough Raft log to catch the node up, so a full state machine snapshot must be sent to the local node). The
+choice between snapshot- and log-based catch-up is based on Raft log availability; however, this logic can be adjusted
+to a more sophisticated heuristic. The underlying state machine only needs to provide the snapshot functionality, which
+differs for in-memory and persistent tables.
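+
+As an illustration, the choice boils down to comparing the follower's last applied index with the oldest index still
+retained in the group's Raft log. The sketch below shows only that check; the class and its fields are illustrative
+and are not part of any actual Raft implementation.
+
+```java
+/** Minimal sketch of the delta-vs-full rebalance decision. */
+final class CatchUpPlanner {
+    /** Index of the oldest entry still retained in the group's Raft log (older entries are compacted). */
+    private final long firstRetainedLogIndex;
+
+    CatchUpPlanner(long firstRetainedLogIndex) {
+        this.firstRetainedLogIndex = firstRetainedLogIndex;
+    }
+
+    /**
+     * Delta rebalance is possible only if every entry the follower is missing is still in the log;
+     * otherwise the full state machine snapshot has to be installed first.
+     */
+    boolean needsFullRebalance(long followerLastAppliedIndex) {
+        return followerLastAppliedIndex + 1 < firstRetainedLogIndex;
+    }
+}
+```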
+
+### In-memory tables
+In-memory tables do not store partitions in isolated memory regions. Instead, the partition data is written to a shared
+memory pool in order to provide efficient memory utilization (otherwise, an assigned memory chunk would remain assigned
+to a partition and would not be eligible for reuse by other partitions). This makes it impossible to create partition
+memory snapshots on the physical level, so a snapshot has to be maintained on a per-tuple basis.
+
+At any moment in time, at most one in-memory partition snapshot can be maintained.
+
+#### Alternative 1
+To create an in-memory snapshot, we use an MVCC-like approach with a copy-on-write technique. The partition tree is
+extended to support keeping two versions of a tuple for the same key: one is the up-to-date version, and the other is
+the snapshot version. The snapshot tuple contains the snapshot ID in addition to the regular tuple data. Snapshot tuples
+are only visible to the snapshot iterator and must be filtered out from regular data access paths.
+
+When a snapshot for an in-memory partition is requested, the partition state machine checks that there is no other
+active snapshot and assigns a new snapshot ID which will be used for copy-on-write tuples. When the snapshot iterator
+traverses the tree, it attempts to read both the up-to-date and the snapshot version of the key. If a snapshot version
+of the key with the current snapshot ID exists, it must be used by the iterator; otherwise, the up-to-date version of
+the tuple must be used.
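+
+The read path of the snapshot iterator could be sketched roughly as follows. The `KeyRow`, `Tuple`, and
+`PartitionStore` types below are hypothetical placeholders introduced only for illustration; they are not the actual
+Ignite 3 storage API.
+
+```java
+/** Placeholder key and tuple types, introduced only for this sketch. */
+interface KeyRow {}
+interface Tuple {}
+
+/** Hypothetical view of the partition storage used by the sketches in this section. */
+interface PartitionStore {
+    Tuple latestVersion(KeyRow key);                     // up-to-date tuple, or null if absent
+    Tuple snapshotVersion(KeyRow key, long snapshotId);  // copy-on-write tuple for this snapshot, or null
+    void putSnapshotVersion(KeyRow key, Tuple value, long snapshotId);
+    void removeSnapshotVersion(KeyRow key);              // drops any preserved snapshot version for the key
+    void put(KeyRow key, Tuple value);
+}
+
+/** Snapshot iterator read path: prefer the copy-on-write version, fall back to the live tuple. */
+final class SnapshotReadPath {
+    Tuple readForSnapshot(PartitionStore store, KeyRow key, long snapshotId) {
+        Tuple preserved = store.snapshotVersion(key, snapshotId);
+        if (preserved != null) {
+            // The key was modified after the snapshot started: use its preserved copy.
+            return preserved;
+        }
+        // The key is untouched since the snapshot started: the live version is snapshot-consistent.
+        return store.latestVersion(key);
+    }
+}
+```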
+
+Each partition state machine update checks whether a snapshot is being maintained. If there is no active snapshot, the
+update operation cleans up an old snapshot tuple version, if any, and performs the regular tuple update. If there is an
+active snapshot, the update operation must first clean up an old snapshot tuple version, if any. Then, if a snapshot
+tuple version with the current snapshot ID does not exist, the update operation copies the current tuple value to the
+snapshot version and then completes the update (it does not copy the current value if a relevant snapshot version
+already exists).
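+
+Using the same hypothetical `PartitionStore` placeholder as above, the update path could be sketched as follows; the
+`activeSnapshotId` field stands in for however the state machine tracks the currently maintained snapshot:
+
+```java
+/** Sketch of the copy-on-write update path; reuses the PartitionStore placeholder from the previous sketch. */
+final class SnapshotAwareUpdater {
+    /** ID of the currently maintained snapshot, or null if no snapshot is active. */
+    private Long activeSnapshotId;
+
+    void update(PartitionStore store, KeyRow key, Tuple newValue) {
+        Long snapshotId = activeSnapshotId;
+
+        if (snapshotId == null) {
+            // No active snapshot: drop any leftover snapshot version and update in place.
+            store.removeSnapshotVersion(key);
+            store.put(key, newValue);
+            return;
+        }
+
+        if (store.snapshotVersion(key, snapshotId) == null) {
+            // Any preserved version must belong to an older snapshot, so it can be dropped.
+            store.removeSnapshotVersion(key);
+
+            Tuple current = store.latestVersion(key);
+            if (current != null) {
+                // Copy-on-write: preserve the pre-update value for the snapshot iterator.
+                store.putSnapshotVersion(key, current, snapshotId);
+            }
+        }
+
+        // A preserved version for the current snapshot already exists or has just been written:
+        // complete the regular update.
+        store.put(key, newValue);
+    }
+}
+```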
+
+When the snapshot is no longer needed, an asynchronous process can clean up the snapshot versions from the partition.
+
+This approach does not incur any memory overhead when no snapshot is maintained, but may require up to 2x the partition
+size under heavy load (because the whole partition may be copied to the snapshot versions in the worst-case scenario).
+
+#### Alternative 2
+To create an in-memory snapshot, the snapshot data is written to a separate in-memory buffer. The buffer is populated 
+from the state machine update thread either by the update operations or by a snapshot advance mini-task which is 
+submitted to the state machine update thread as needed.
+
+To maintain a snapshot, the state machine needs to keep a snapshot iterator boundary key. If a key being updated is
+smaller than or equal to the boundary key, no additional action is needed because the snapshot iterator has already
+processed this key. If a key being updated is larger than the boundary key, the old version of the key is eagerly put
+to the snapshot buffer and the key is marked with the snapshot ID (so that the key is skipped during further
+iteration). The snapshot advance mini-task iterates over the next batch of keys starting from the boundary key and puts
+into the snapshot buffer only the keys that are not yet marked with the snapshot ID.
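+
+A self-contained sketch of both sides of this scheme is shown below. The string keys and values, the in-memory
+`snapshotBuffer` map, and the `markedKeys` set are simplifications introduced for illustration; the real
+implementation would mark tuples with the snapshot ID inside the storage.
+
+```java
+import java.util.HashSet;
+import java.util.LinkedHashMap;
+import java.util.Map;
+import java.util.NavigableMap;
+import java.util.Set;
+import java.util.TreeMap;
+
+/** Illustrative model of the boundary-key snapshot: keys and values are plain strings. */
+final class BoundarySnapshotSketch {
+    private final NavigableMap<String, String> partition = new TreeMap<>();
+    private final Map<String, String> snapshotBuffer = new LinkedHashMap<>();
+    private final Set<String> markedKeys = new HashSet<>();
+    private String boundaryKey = ""; // everything <= boundaryKey has already been written to the buffer
+
+    /** Called from the state machine update thread before a key is overwritten. */
+    void beforeUpdate(String key) {
+        if (key.compareTo(boundaryKey) <= 0) {
+            return; // the snapshot iterator has already passed this key, nothing to preserve
+        }
+        if (markedKeys.add(key)) {
+            // Eagerly preserve the pre-update value and mark the key so the advance task skips it.
+            String oldValue = partition.get(key);
+            if (oldValue != null) {
+                snapshotBuffer.put(key, oldValue);
+            }
+        }
+    }
+
+    /** Advance mini-task: copy the next batch of keys after the boundary into the buffer. */
+    void advance(int batchSize) {
+        int processed = 0;
+        for (Map.Entry<String, String> e : partition.tailMap(boundaryKey, false).entrySet()) {
+            if (processed++ == batchSize) {
+                break;
+            }
+            if (!markedKeys.contains(e.getKey())) {
+                snapshotBuffer.put(e.getKey(), e.getValue());
+            }
+            boundaryKey = e.getKey();
+        }
+    }
+}
+```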
+
+This approach has memory requirements similar to the first alternative, but does not require modifying the storage tree
+to store multiple versions of the same key. This approach, however, allows for transparently offloading the snapshot
+buffer to disk, which can reduce memory requirements. It is also simpler to implement because the code is essentially
+single-threaded and only requires synchronization for the in-memory buffer. The downside is that snapshot advance tasks
+will increase the tail latency of state machine update operations.
+
+### Persistent tables
+For persistent tables, we exploit the existing ability of Ignite native persistence to take partition snapshots and use
+the partition snapshot file as a Raft snapshot.
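+
+As a sketch of the wiring (both types below are hypothetical placeholders, not the actual Ignite 3 or Raft module
+API), the Raft snapshot of a persistent partition can simply reference the partition snapshot file:
+
+```java
+import java.nio.file.Path;
+
+/** Hypothetical hook into native persistence. */
+interface NativePersistence {
+    /** Takes a consistent snapshot of the given partition and returns the resulting snapshot file. */
+    Path takePartitionSnapshot(int partitionId);
+}
+
+/** Sketch: the Raft snapshot payload for a persistent partition is just the partition snapshot file. */
+final class PersistentPartitionSnapshotter {
+    private final NativePersistence persistence;
+
+    PersistentPartitionSnapshotter(NativePersistence persistence) {
+        this.persistence = persistence;
+    }
+
+    /** Returns the file to ship to a follower as the Raft snapshot for this partition. */
+    Path createRaftSnapshot(int partitionId) {
+        return persistence.takePartitionSnapshot(partitionId);
+    }
+}
+```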
+
+## Rebalance scheduling 
+Snapshot transfer between nodes should be run through a separate scheduling layer so that nodes can dynamically adjust
+to memory, CPU, IO, and network consumption. A certain quota of each resource should be allocated to the rebalance
+process; the scheduling layer should suspend, and possibly cancel, a rebalance session when the quota is exceeded, and
+resume or reschedule the session when resources are freed.
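+
+A minimal admission-control sketch is shown below; the quota values, the resource split, and the API are assumptions,
+and suspension or cancellation of in-flight sessions is omitted:
+
+```java
+import java.util.concurrent.Semaphore;
+
+/**
+ * Illustrative scheduling layer: rebalance (snapshot transfer) sessions must acquire permits from
+ * per-resource quotas before running and release them when done; a session that cannot get its
+ * permits waits until resources free up.
+ */
+final class RebalanceScheduler {
+    private final Semaphore networkQuota = new Semaphore(2); // concurrent snapshot transfers allowed
+    private final Semaphore ioQuota = new Semaphore(4);      // concurrent partition reads/writes allowed
+
+    /** Runs a rebalance session under the resource quotas, blocking until permits are available. */
+    void runSession(Runnable snapshotTransfer) throws InterruptedException {
+        networkQuota.acquire();
+        try {
+            ioQuota.acquire();
+            try {
+                snapshotTransfer.run();
+            } finally {
+                ioQuota.release();
+            }
+        } finally {
+            networkQuota.release();
+        }
+    }
+}
+```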
\ No newline at end of file