Posted to commits@ignite.apache.org by ag...@apache.org on 2021/03/24 17:46:07 UTC

[ignite-3] branch ignite-14393 created (now 5ba6651)

This is an automated email from the ASF dual-hosted git repository.

agoncharuk pushed a change to branch ignite-14393
in repository https://gitbox.apache.org/repos/asf/ignite-3.git.


      at 5ba6651  IGNITE-14393 Components interactions workflow draft

This branch includes the following new commits:

     new 5ba6651  IGNITE-14393 Components interactions workflow draft

The 1 revision listed above as "new" is entirely new to this
repository and will be described in a separate email.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


[ignite-3] 01/01: IGNITE-14393 Components interactions workflow draft

Posted by ag...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

agoncharuk pushed a commit to branch ignite-14393
in repository https://gitbox.apache.org/repos/asf/ignite-3.git

commit 5ba6651a0da7b4637fc901efe384796bf4b5a75e
Author: Alexey Goncharuk <al...@gmail.com>
AuthorDate: Wed Mar 24 20:36:38 2021 +0300

    IGNITE-14393 Components interactions workflow draft
---
 modules/runner/README.md | 145 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 145 insertions(+)

diff --git a/modules/runner/README.md b/modules/runner/README.md
new file mode 100644
index 0000000..a182dd9
--- /dev/null
+++ b/modules/runner/README.md
@@ -0,0 +1,145 @@
+# Ignite cluster & node lifecycle
+This document describes user-level and component-level cluster lifecycles and their mutual interaction.
+
+## Node lifecycle
+A node maintains its local state in a local persistent key-value storage called the vault. The data stored in the vault
+is semantically divided into the following categories (illustrative key examples follow the list):
+ * User-level local configuration properties (such as memory limits, network timeouts, etc). User-level configuration
+ properties can be written both at runtime (not all properties are applied at runtime, however: some of them require a
+ full node restart) and when a node is shut down (in order to be able to change properties that prevent the node from
+ starting up for some reason).
+ * System-level private properties (such as computed local statistics, node-local common paths, etc). System-level
+ private properties are computed locally based on the information available at the node (not based on watched
+ metastorage values).
+ * System-level distributed metastorage projected properties (such as paths to partition files, etc). System-level
+ projected properties are associated with one or more metastorage properties and are computed based on the local node
+ state and the metastorage property values. The values of projected properties are semantically bound to a particular
+ revision of the properties they depend on and must be recalculated when those properties change (see
+ [reliable watch processing](#reliable-watch-processing)).
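+
+To make the categories concrete, the listing below gives one illustrative vault key per category. Only the
+``local.tables.*`` projected keys appear later in this document; the other key names are assumptions made up for this
+example:
+
+```
+local.config.memory.limit=4g                                <- user-level local configuration (hypothetical key)
+local.stats.checkpoint.avgduration=120ms                    <- system-level private property (hypothetical key)
+local.tables.<ID>.<PARTITION_ID>.logpath=/path/to/raft/log  <- system-level projected property (see the partition flow below)
+```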
+
+The vault is created during the first node startup and optionally populated with the parameters from the configuration
+file passed to the ``ignite node start`` [command](TODO link to CLI readme). Only user-level properties can be
+written via the provided file.
+
+System-level properties are written to the vault by the node start process during the first vault initialization.
+Projected properties are not initialized during the initial node startup because at this point the local node is not
+aware of the distributed metastorage. The node remains in a 'zombie' state until it learns that there is an initialized
+metastorage, either via the ``ignite cluster init`` [command](TODO link to CLI readme) during the initial cluster
+initialization, or from the group membership service via gossip (implying that the group membership protocol is
+working at this point).
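+
+A minimal sketch of this startup decision is given below; all class, method and key names are hypothetical and do not
+refer to the actual Ignite 3 API:
+
+```java
+import java.util.Optional;
+
+enum NodeState { STARTING, ZOMBIE, RUNNING }
+
+class NodeStartup {
+    private NodeState state = NodeState.STARTING;
+
+    // Invoked at startup with the metastorage group members read from the vault, if any.
+    void start(Optional<String> metastorageMembersFromVault) {
+        if (metastorageMembersFromVault.isPresent()) {
+            // The metastorage was already initialized during a previous run.
+            joinCluster(metastorageMembersFromVault.get());
+        } else {
+            // Wait for `ignite cluster init` or for gossip to announce an initialized metastorage.
+            state = NodeState.ZOMBIE;
+        }
+    }
+
+    // Called when the metastorage group becomes known via the init command or via gossip.
+    void onMetastorageDiscovered(String members) {
+        joinCluster(members);
+    }
+
+    private void joinCluster(String members) {
+        state = NodeState.RUNNING;
+    }
+}
+```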
+
+### Node components startup
+For testability purposes, we require that component dependencies are defined upfront and provided at construction
+time. This additionally requires that component dependencies form no cycles. Therefore, components form a directed
+acyclic graph that is constructed in topological order with respect to the root.
+
+Components are also created and initialized in an order consistent with a topological sort of the component graph. This
+enforces several rules related to component interaction:
+ * Since metastorage watches can only be added during component startup, the watch notification order is consistent
+ with the component initialization order. That is, if a component `B` depends on a component `A`, then `A` receives
+ watch notifications prior to `B`.
+ * A dependent component can directly call an API method on a dependee component (because it can obtain the dependee
+ reference during construction). Direct inverse calls are prohibited (this is enforced by only acquiring component
+ references during component construction). Nevertheless, inverse calls can be implemented by means of listeners or
+ callbacks: the dependent component installs a listener on the dependee, which can be invoked later (see the sketch
+ below).
+ 
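+The following self-contained sketch illustrates the two rules above: the dependent component receives the dependee
+reference at construction time, and inverse calls go through a listener installed by the dependent component. The
+classes are purely illustrative and are not part of the Ignite 3 codebase:
+
+```java
+import java.util.ArrayList;
+import java.util.List;
+import java.util.function.Consumer;
+
+// Dependee component: it never holds a reference to its dependents.
+class ComponentA {
+    private final List<Consumer<String>> listeners = new ArrayList<>();
+
+    // Inverse calls from A to its dependents happen only through installed listeners.
+    void addListener(Consumer<String> listener) {
+        listeners.add(listener);
+    }
+
+    void fireEvent(String event) {
+        listeners.forEach(l -> l.accept(event));
+    }
+}
+
+// Dependent component: it receives the dependee reference at construction time.
+class ComponentB {
+    ComponentB(ComponentA a) {
+        // Direct calls go from B to A; the inverse direction is a callback installed here.
+        a.addListener(this::onEvent);
+    }
+
+    void onEvent(String event) {
+        System.out.println("B reacts to: " + event);
+    }
+}
+
+class ComponentWiringExample {
+    public static void main(String[] args) {
+        ComponentA a = new ComponentA(); // the dependee is constructed first (topological order)
+        ComponentB b = new ComponentB(a);
+        a.fireEvent("metastorage watch fired");
+    }
+}
+```
+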
+<!--
+Change /svg/... to /uml/... here to view the image UML.
+-->
+![Example components dependency graph](http://www.plantuml.com/plantuml/svg/TO_Fpi8W4CJlF0N7xpkKn3zUJ3HDZCTwqNZVjXkjKZ1qqTUND6rbCLw0tVbDXiax0aU-rSAMDwn87f1Urjt5SCkrjAv69pTo9aRc35wJwCz8dqzwWGGTMGSN5D4xOXSJUwoks4819W1Ei2dYbnD_WbBZY7y6Hgy4YysoxLYBENeX0h_4eMYw_lrdPaltF2kHLQq6N-W1ZsO7Ml_zinhA1oHfh9kEqA1JrkoVQ2XOSZIrR_KR)
+
+The diagram above shows an example component dependency graph and an order in which components may be
+initialized.
+
+## Cluster lifecycle
+For a cluster to become operational, the metastorage instance must be initialized first. The initialization command 
+chooses a set of nodes (normally 3 to 5 nodes) to host the distributed metastorage Raft group. When a node receives the 
+initialization command, it either creates a bootstrapped Raft instance with the given members (if this is a metastorage 
+group node), or writes the metastorage group member IDs to the vault as a private system-level property.
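+
+A sketch of how a node could react to the initialization command is shown below. The ``Vault`` interface and the
+``system.metastorage.members`` key are hypothetical stand-ins for whatever the implementation actually uses:
+
+```java
+import java.util.List;
+
+// Hypothetical local vault facade.
+interface Vault {
+    void put(String key, String value);
+}
+
+class InitCommandHandler {
+    private final Vault vault;
+    private final String localNodeId;
+
+    InitCommandHandler(Vault vault, String localNodeId) {
+        this.vault = vault;
+        this.localNodeId = localNodeId;
+    }
+
+    // Reaction to the `ignite cluster init` command carrying the chosen metastorage nodes.
+    void onClusterInit(List<String> metastorageNodeIds) {
+        if (metastorageNodeIds.contains(localNodeId)) {
+            // This node hosts the metastorage: bootstrap the Raft group with the given members.
+            bootstrapMetastorageRaftGroup(metastorageNodeIds);
+        } else {
+            // Otherwise only remember the group members as a private system-level vault property.
+            vault.put("system.metastorage.members", String.join(",", metastorageNodeIds));
+        }
+    }
+
+    private void bootstrapMetastorageRaftGroup(List<String> members) {
+        // Raft bootstrap details are out of scope for this sketch.
+    }
+
+    public static void main(String[] args) {
+        Vault vault = (key, value) -> System.out.println("vault write: " + key + "=" + value);
+        new InitCommandHandler(vault, "node-2").onClusterInit(List.of("node-1", "node-3", "node-5"));
+    }
+}
+```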
+
+After the metastorage is initialized, components start to receive and process watch events, updating the local state 
+according to the changes received from the watch.   
+
+## Reliable watch processing
+All cluster state is written and maintained in the metastorage. Nodes may update some state in the metastorage, which
+may require a recomputation of some other metastorage properties (for example, when the cluster baseline changes, Ignite
+needs to recalculate table affinity assignments). In other words, some properties in the metastorage depend on each
+other and we may need to reliably update one property in response to an update to another.
+
+To facilitate this pattern, Ignite uses the metastorage's ability to replay changes starting from a certain revision,
+called a [watch](TODO link to metastorage watch). To process watch updates reliably, we associate a special persistent
+value called ``applied revision`` (stored in the vault) with each watch. We rely on the following assumptions about
+reliable watch processing:
+ * Watch execution is idempotent (if the same watch event is processed twice, the second watch invocation will have no 
+ effect on the system). This is usually enforced by conditional multi-update operations for the metastorage and
+ deterministic projected property calculations. The conditional multi-update should check that the revision of the key
+ being updated matches the revision observed when reading with the watch event's revision as the upper bound.
+ * All properties read inside the watch must be read with the upper bound equal to the watch event revision.
+ * If watch processing initiates a metastorage update, the ``applied revision`` is propagated only after the metastorage
+ confirms that the proposed change is committed (note that in this case it does not matter whether this particular 
+ multi-update succeeds or not: we know that the first multi-update will succeed, and all further updates are idempotent,
+ so we need to make sure that at least one multi-update is committed).
+ * If watch processing initiates projected keys writes to the vault, the keys must be written atomically with the 
+ updated ``applied revision`` value.
+ * If a watch initiates metastorage properties update, it should only be deployed on the metastorage group members to
+ avoid massive identical updates being issued to the metastorage (TODO: should really be only the leader of the 
+ metastorage Raft group).
+
+In case of a crash, each watch is restarted from the revision stored in its corresponding ``applied revision`` variable,
+and the events that were not yet processed are replayed (as sketched below).
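+
+The sketch below condenses this bookkeeping into code. The interfaces are hypothetical simplifications, not the actual
+watch or vault API; the point is the replay-from-``applied revision`` contract and the idempotent event handler:
+
+```java
+import java.util.List;
+
+// Hypothetical view of a single metastorage watch event.
+interface WatchEvent {
+    long revision();
+}
+
+// Hypothetical persistent storage of the per-watch applied revision (backed by the vault).
+interface AppliedRevisionStore {
+    long appliedRevision(String watchId);
+
+    void saveAppliedRevision(String watchId, long revision);
+}
+
+abstract class ReliableWatch {
+    private final String watchId;
+    private final AppliedRevisionStore store;
+
+    ReliableWatch(String watchId, AppliedRevisionStore store) {
+        this.watchId = watchId;
+        this.store = store;
+    }
+
+    // After a crash the watch is restarted from the last applied revision, so processEvent
+    // is replayed for events that may have been partially handled and must be idempotent.
+    void restart() {
+        long from = store.appliedRevision(watchId);
+
+        for (WatchEvent event : replayEventsSince(from)) {
+            // Reads inside processEvent use the event revision as the upper bound.
+            processEvent(event);
+
+            // The applied revision advances only after the initiated update is known to be committed.
+            store.saveAppliedRevision(watchId, event.revision());
+        }
+    }
+
+    abstract List<WatchEvent> replayEventsSince(long revision);
+
+    abstract void processEvent(WatchEvent event);
+}
+```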
+
+### Example: `CREATE TABLE` flow
+We require that each Ignite table is assigned a globally unique ID (the ID must not repeat even after the table is
+dropped, so we use a monotonically growing `long` counter to assign table IDs).
+
+When a table is created, Ignite first checks that a table with the given name does not exist, chooses the next available
+table ID ``idNext`` and attempts to create the following key-value pairs in the metastorage via a conditional
+multi-update:
+
+```
+internal.tables.names.<name>=<idNext>
+internal.tables.<idNext>=<name>
+```  
+
+If the multi-update succeeds, Ignite considers the table created. If the multi-update fails, then either the table with
+the same name was concurrently created (the operation fails in this case) or the ``idNext`` was assigned to another 
+table with a different name (Ignite retries the operation in this case).
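+
+The retry loop described above could be sketched as follows. ``Metastorage`` here is a hypothetical facade over the
+conditional multi-update (the real metastorage API is richer); the key layout matches the pairs shown above:
+
+```java
+import java.util.Map;
+import java.util.Optional;
+
+// Hypothetical facade over the metastorage conditional multi-update.
+interface Metastorage {
+    Optional<String> get(String key);
+
+    // Atomically writes all entries iff none of the given keys exist yet.
+    boolean putAllIfAbsent(Map<String, String> entries);
+
+    // Next value of the growing table ID counter.
+    long nextTableId();
+}
+
+class CreateTableFlow {
+    private final Metastorage metastorage;
+
+    CreateTableFlow(Metastorage metastorage) {
+        this.metastorage = metastorage;
+    }
+
+    long createTable(String name) {
+        while (true) {
+            // The operation fails if a table with the same name already exists.
+            if (metastorage.get("internal.tables.names." + name).isPresent()) {
+                throw new IllegalStateException("Table already exists: " + name);
+            }
+
+            long idNext = metastorage.nextTableId();
+
+            boolean created = metastorage.putAllIfAbsent(Map.of(
+                "internal.tables.names." + name, Long.toString(idNext),
+                "internal.tables." + idNext, name));
+
+            if (created) {
+                return idNext; // the table is considered created
+            }
+
+            // The multi-update failed: either the name was taken concurrently (detected by the
+            // check above on the next iteration) or idNext was taken by another table, so retry.
+        }
+    }
+}
+```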
+
+In order to process affinity calculations and assignments, the affinity manager creates a reliable watch for the 
+following keys on metastorage group members:
+
+```
+internal.tables.<ID>
+internal.baseline
+``` 
+
+Whenever the watch is fired, the affinity manager checks which key was updated. If the watch is triggered for the
+``internal.tables.<ID>`` key, it calculates a new affinity for the table with the given ``ID``. If the watch is
+triggered for the ``internal.baseline`` key, the manager recalculates affinity for all tables existing at the watch
+revision (this can be done using the metastorage ``range(keys, upperBound)`` method, providing the watch event revision
+as the upper bound). The calculated affinity is written to the ``internal.tables.<ID>.affinity`` key.
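+
+A sketch of this dispatch logic is shown below; the listener class and its helper methods are hypothetical names used
+only for illustration:
+
+```java
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+// Hypothetical affinity watch listener.
+class AffinityWatchListener {
+    private static final Pattern TABLE_KEY = Pattern.compile("internal\\.tables\\.(\\d+)");
+
+    // Invoked for every watch event; `revision` is used as the read upper bound.
+    void onUpdate(String key, long revision) {
+        Matcher tableKey = TABLE_KEY.matcher(key);
+
+        if (tableKey.matches()) {
+            // internal.tables.<ID> changed: recalculate affinity for that table only.
+            recalculateAffinity(Long.parseLong(tableKey.group(1)), revision);
+        } else if ("internal.baseline".equals(key)) {
+            // The baseline changed: recalculate affinity for every table existing at this revision,
+            // e.g. obtained via range("internal.tables.", revision).
+            for (long tableId : tableIdsAt(revision)) {
+                recalculateAffinity(tableId, revision);
+            }
+        }
+    }
+
+    private void recalculateAffinity(long tableId, long revision) {
+        // The result is written to internal.tables.<ID>.affinity via a conditional update.
+    }
+
+    private long[] tableIdsAt(long revision) {
+        return new long[0]; // placeholder for a metastorage range read
+    }
+}
+```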
+
+> Note that ideally the watch should only be processed on the metastorage group leader, thus eliminating unnecessary
+> network trips. Theoretically, we could have embedded this logic into the state machine, but this would enormously
+> complicate cluster updates and put metastorage consistency at risk.
+
+To handle partition assignments, the partition manager creates a reliable watch for the affinity assignment key on all 
+nodes:
+
+```
+internal.tables.<ID>.affinity
+```
+
+Whenever the watch is fired, the node checks whether there are new partitions assigned to the local node, and if there
+are, the node bootstraps the corresponding Raft partition servers (i.e. allocates paths to Raft logs and storage files).
+The allocation information is written to projected vault keys:
+
+```
+local.tables.<ID>.<PARTITION_ID>.logpath=/path/to/raft/log
+local.tables.<ID>.<PARTITION_ID>.storagepath=/path/to/storage/file
+``` 
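+
+As a sketch of the atomicity requirement from the reliable watch rules above, the projected keys could be written to the
+vault in one batch together with the watch's ``applied revision``. The ``VaultBatch`` interface and the revision key
+name below are assumptions made for illustration:
+
+```java
+import java.util.Map;
+
+// Hypothetical batch vault API: all entries become visible atomically, or none do.
+interface VaultBatch {
+    void putAll(Map<String, String> entries);
+}
+
+class PartitionAssignmentListener {
+    private final VaultBatch vault;
+
+    PartitionAssignmentListener(VaultBatch vault) {
+        this.vault = vault;
+    }
+
+    // Called when the affinity watch assigns a new partition to the local node.
+    void onPartitionAssigned(long tableId, int partitionId, long watchRevision) {
+        String prefix = "local.tables." + tableId + "." + partitionId + ".";
+
+        // The projected keys are written together with the watch's applied revision, so a crash
+        // cannot advance the revision without persisting the allocation info (or vice versa).
+        vault.putAll(Map.of(
+            prefix + "logpath", "/path/to/raft/log",
+            prefix + "storagepath", "/path/to/storage/file",
+            "system.watches.partition-assignments.applied-revision", Long.toString(watchRevision)));
+    }
+}
+```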
+
+Once the projected keys are synced to the vault, the partition manager can create the partition Raft servers (initialize
+the log and storage, write hard state, register message handlers, etc). Upon startup, the node checks the existing
+projected keys (finishing the Raft log and storage initialization in case the node crashed in the middle of it) and
+starts the Raft servers.
\ No newline at end of file