You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@ignite.apache.org by as...@apache.org on 2021/06/30 13:55:15 UTC
[ignite-3] branch main updated: IGNITE-15013 The design for
transactional protocol. - Fixes #190.
This is an automated email from the ASF dual-hosted git repository.
ascherbakov pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/ignite-3.git
The following commit(s) were added to refs/heads/main by this push:
new 7610b69 IGNITE-15013 The design for transactional protocol. - Fixes #190.
7610b69 is described below
commit 7610b692fd8a93cada968033bdf5881bb84312b4
Author: Alexey Scherbakov <al...@gmail.com>
AuthorDate: Wed Jun 30 16:54:30 2021 +0300
IGNITE-15013 The design for transactional protocol. - Fixes #190.
Signed-off-by: Alexey Scherbakov <al...@gmail.com>
---
modules/transactions/README.md | 124 +++++++++++++++++++++
modules/transactions/pom.xml | 93 ++++++++++++++++
.../org/apache/ignite/internal/tx/LockManager.java | 54 +++++++++
.../ignite/internal/tx/LockOrderException.java | 26 +++++
.../apache/ignite/internal/tx/LockManagerTest.java | 24 ++++
pom.xml | 1 +
6 files changed, 322 insertions(+)
diff --git a/modules/transactions/README.md b/modules/transactions/README.md
new file mode 100644
index 0000000..e840895
--- /dev/null
+++ b/modules/transactions/README.md
@@ -0,0 +1,124 @@
+# Ignite transactions
+This module provides transactions support for cross partition operations. Using the transactions, such operations are
+executed in atomic way (either all changes all applied, or nothing at all) with a strong isolation.
+
+Transactions support is supposed to be icremental. In the first approach, we are trying to put existing ideas from
+ignite 2 to the new replication infrastructure. In the next phases, MVCC support should be added to avoid blocking reads
+and some other optimization, like parallel commits from <sup id="a1">[1](#f1)</sup>
+
+# Transaction protocol design
+In high level, we utilize 2 phase locking (2PL) for a concurrency control, 2 phase commit (2PC) as an atomic commitment
+protocol, in conjunction with WAIT_DIE deadlock prevention, described in <sup id="a2">[2](#f2)</sup>.
+This implementation is very close to Ignite 2 optimistic serializable mode.
+Additional goals are:
+1) retain only strong isolation
+2) support for SQL
+3) utilize new common replication infrastructure based on RAFT.
+
+# Two phase commit
+This protocol is responsible for atomic commitment (all or nothing) tx guraranties.
+Each update is **pre-written** to a replication groups on first phase (and replicated to a majority).
+As soon as all updates are pre-written, it's safe to commit.
+This slightly differs from ignite 2, because there is no PREPARED state.
+
+# Two phase locking
+2PL states the transaction constist of growing phase, where locks are acquired, and shrinking phase where locks are released.
+A tx acquires shared locks for reads and exclusive locks for writes.
+It's possible to lock for read in exclusive mode (select for update semantics)
+Locks for different keys can be acquired in parallel.
+Shared lock is obtained on DN-read operation.
+Exclusive lock is obtained on DN-prewrite operation.
+
+(DN - data node)
+
+# Lock manager
+Locking functionality is implemented by LockManager.
+Each **leaseholder** for a partition replication group deploys an instance of LockManager.
+All reads and writes go through the **leaseholder**. Only raft leader for some term can become a **leaseholder**.
+It's important what no two different leaseholder intervals can overlap for the same group.
+Current leasholder map can be loaded from metastore (watches can be used for a fast notification about leaseholder change).
+
+The lockmanager should keep locks in the offheap to reduce GC pressure, but can be heap based in the first approach.
+
+# Locking precausion
+LockManager has a volatile state, so some precausions must be taken before locking the keys due to possible node restarts.
+Before taking a lock, LockManager should consult a tx state for the key (by reading it's metadata if present and looking into txstate map).
+If a key is enlisted in transaction and wait is possible according to tx priority, lock cannot be taken immediately.
+
+# Tx coordinator
+Tx coordinator is assigned in a random fashion from a list of allowed nodes. They can be dedicated nodes or same as data nodes.
+They are responsible for id assignment and failover handling if some nodes from tx topology have failed.
+Knows full tx topology.
+
+It's possible to live without dedicated coordinators, but it will make client tx logic more complex and prone to coordinator failures.
+The drawback of this is a additional hop for each tx op.
+But from the architecure point of view having the dedicated coordinators is definetely more robust.
+
+# Deadlock prevention
+Deadlock prevention in WAIT_DIE mode - uses priorities to decide which tx should be restarted.
+Each transaction is assigned a unique globally comparable timestamp (for example UUID), which defines tx priority.
+If T1 has lower priority when T2 (T1 is younger) it can wait for lock, otherwise it's restarted keeping it's timestamp.
+committing transaction can't be restarted.
+Deadlock detection is not an option due to huge computation resources requirement and no real-time guaranties.
+
+# Tx map
+Each node maintains a persistent tx map:
+
+txid -> timestamp|txstate(PENDING|ABORTED|COMMITED)
+
+This map is used for a failover. Oldest entries in txid map must be cleaned to avoid unlimited grow.
+
+# Data format
+A row is stored in key-value database with additional attached metadata for referencing associated tx.
+
+# Write tx example.
+Assume the current row is: key -> oldvalue
+
+The steps to update a row:
+
+1. acquire exclusive lock on key on prewrite
+
+2. remove key -> oldvalue<br/>
+ set key -> newvalue [txid] // Inserted row has a special metadata containing transaction id it's enlisted in.<br/>
+ set txid + key -> oldvalue (for aborting purposes)
+
+3. on commit:<br/>
+ set txid -> commited<br/>
+ release exclusive lock<br/>
+ async clear garbage
+
+4. on abort:<br/>
+ set txid -> aborted<br/>
+ remove key -> newvalue<br/>
+ set key -> oldvalue<br/>
+ release exclusive lock<br/>
+ async clear garbage
+
+# SQL and indexes.
+
+We assume only row level locking in the first approach, which gives us a repeatable_read isolation.
+
+When the SQL query is executed, it acquires locks while the the data is collected on data nodes.
+
+Locks are acquired lazily as result set is consumed.
+
+The locking rules are same as for get/put operations.
+
+Then values are removed from indexes on step 2, they are written as tombstones to avoid read inconsistency and should be
+cleaned up after tx finish.
+
+# Failover handling
+Failover protocol is similar to Ignite 2 with a main difference: until tx is sure it can commit or rollback, it holds
+its locks. This means in the case of split-brain, some keys will be locked until split-brain situation is resolved and
+tx recovery protocol will converge.
+
+## Leaserholder fail
+An enlisted node asks a coordinator if it can commit or not.
+
+## Coordinator fail
+Broadcast recovery is necessary (because we don't have full tx topology on each enlisted node).
+All nodes are requested about local txs state. If at least one is commiting, it's safe to commit.
+
+<em id="f1">[1]</em> CockroachDB: The Resilient Geo-Distributed SQL Database, 2020 [↩](#a1)<br/>
+<em id="f2">[2]</em> Concurrency Control in Distributed Database Systems, 1981 [↩](#a2)
+
\ No newline at end of file
diff --git a/modules/transactions/pom.xml b/modules/transactions/pom.xml
new file mode 100644
index 0000000..67fc4e6
--- /dev/null
+++ b/modules/transactions/pom.xml
@@ -0,0 +1,93 @@
+<?xml version="1.0" encoding="UTF-8"?>
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+ xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+ <modelVersion>4.0.0</modelVersion>
+
+ <parent>
+ <groupId>org.apache.ignite</groupId>
+ <artifactId>ignite-parent</artifactId>
+ <version>1</version>
+ <relativePath>../../parent/pom.xml</relativePath>
+ </parent>
+
+ <artifactId>ignite-transactions</artifactId>
+ <version>3.0.0-SNAPSHOT</version>
+
+ <dependencies>
+ <dependency>
+ <groupId>org.apache.ignite</groupId>
+ <artifactId>ignite-core</artifactId>
+ </dependency>
+
+ <dependency>
+ <groupId>org.apache.ignite</groupId>
+ <artifactId>ignite-network-api</artifactId>
+ </dependency>
+
+ <dependency>
+ <groupId>org.apache.ignite</groupId>
+ <artifactId>ignite-network-annotation-processor</artifactId>
+ <optional>true</optional>
+ </dependency>
+
+ <!-- 3rd party dependencies -->
+ <dependency>
+ <groupId>org.jetbrains</groupId>
+ <artifactId>annotations</artifactId>
+ </dependency>
+
+ <!-- Test dependencies -->
+ <dependency>
+ <groupId>org.junit.jupiter</groupId>
+ <artifactId>junit-jupiter-engine</artifactId>
+ <scope>test</scope>
+ </dependency>
+
+ <dependency>
+ <groupId>org.mockito</groupId>
+ <artifactId>mockito-junit-jupiter</artifactId>
+ <scope>test</scope>
+ </dependency>
+
+ <dependency>
+ <groupId>org.mockito</groupId>
+ <artifactId>mockito-core</artifactId>
+ <scope>test</scope>
+ </dependency>
+ </dependencies>
+
+ <build>
+ <resources>
+ <resource>
+ <directory>src/main/resources</directory>
+ <filtering>true</filtering>
+ </resource>
+ </resources>
+
+ <plugins>
+ <plugin>
+ <groupId>org.apache.maven.plugins</groupId>
+ <artifactId>maven-compiler-plugin</artifactId>
+ </plugin>
+ </plugins>
+ </build>
+</project>
diff --git a/modules/transactions/src/main/java/org/apache/ignite/internal/tx/LockManager.java b/modules/transactions/src/main/java/org/apache/ignite/internal/tx/LockManager.java
new file mode 100644
index 0000000..014a892
--- /dev/null
+++ b/modules/transactions/src/main/java/org/apache/ignite/internal/tx/LockManager.java
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.tx;
+
+import java.util.concurrent.CompletableFuture;
+
+/**
+ * Lock manager allows to acquire locks in shared and exclusive mode and supports deadlock prevention by timestamp
+ * ordering.
+ */
+public interface LockManager {
+ /**
+ * @param key The key.
+ * @param timestamp The timestamp.
+ * @return The future.
+ * @throws LockOrderException When a lock can't be taken due to wrong ordering.
+ */
+ public CompletableFuture<Void> tryAcquire(Object key, long timestamp) throws LockOrderException;
+
+ /**
+ * @param key The key.
+ * @return {@code True} if the lock was released.
+ */
+ public boolean tryRelease(Object key);
+
+ /**
+ * @param key The key.
+ * @param timestamp The timestamp.
+ * @return The future.
+ * @throws LockOrderException When a lock can't be taken due to wrong ordering.
+ */
+ public CompletableFuture<Void> tryAcquireShared(Object key, long timestamp) throws LockOrderException;
+
+ /**
+ * @param key The key.
+ * @return {@code True} if the lock was released.
+ */
+ public boolean tryReleaseShared(Object key);
+}
diff --git a/modules/transactions/src/main/java/org/apache/ignite/internal/tx/LockOrderException.java b/modules/transactions/src/main/java/org/apache/ignite/internal/tx/LockOrderException.java
new file mode 100644
index 0000000..6ec2068
--- /dev/null
+++ b/modules/transactions/src/main/java/org/apache/ignite/internal/tx/LockOrderException.java
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.tx;
+
+import org.apache.ignite.lang.IgniteInternalException;
+
+/**
+ * This exception is thrown when lock cannot be acquired due to wrong locking order.
+ */
+public class LockOrderException extends IgniteInternalException {
+}
diff --git a/modules/transactions/src/test/java/org/apache/ignite/internal/tx/LockManagerTest.java b/modules/transactions/src/test/java/org/apache/ignite/internal/tx/LockManagerTest.java
new file mode 100644
index 0000000..4ab4c13
--- /dev/null
+++ b/modules/transactions/src/test/java/org/apache/ignite/internal/tx/LockManagerTest.java
@@ -0,0 +1,24 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.tx;
+
+/**
+ * Tests a LockManager implementation.
+ */
+public class LockManagerTest {
+}
diff --git a/pom.xml b/pom.xml
index 102cf87..6b82b3e 100644
--- a/pom.xml
+++ b/pom.xml
@@ -61,6 +61,7 @@
<module>modules/storage-api</module>
<module>modules/storage-rocksdb</module>
<module>modules/table</module>
+ <module>modules/transactions</module>
<module>modules/vault</module>
</modules>