You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@ignite.apache.org by as...@apache.org on 2021/06/30 13:55:15 UTC

[ignite-3] branch main updated: IGNITE-15013 The design for transactional protocol. - Fixes #190.

This is an automated email from the ASF dual-hosted git repository.

ascherbakov pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/ignite-3.git


The following commit(s) were added to refs/heads/main by this push:
     new 7610b69  IGNITE-15013 The design for transactional protocol. - Fixes #190.
7610b69 is described below

commit 7610b692fd8a93cada968033bdf5881bb84312b4
Author: Alexey Scherbakov <al...@gmail.com>
AuthorDate: Wed Jun 30 16:54:30 2021 +0300

    IGNITE-15013 The design for transactional protocol. - Fixes #190.
    
    Signed-off-by: Alexey Scherbakov <al...@gmail.com>
---
 modules/transactions/README.md                     | 124 +++++++++++++++++++++
 modules/transactions/pom.xml                       |  93 ++++++++++++++++
 .../org/apache/ignite/internal/tx/LockManager.java |  54 +++++++++
 .../ignite/internal/tx/LockOrderException.java     |  26 +++++
 .../apache/ignite/internal/tx/LockManagerTest.java |  24 ++++
 pom.xml                                            |   1 +
 6 files changed, 322 insertions(+)

diff --git a/modules/transactions/README.md b/modules/transactions/README.md
new file mode 100644
index 0000000..e840895
--- /dev/null
+++ b/modules/transactions/README.md
@@ -0,0 +1,124 @@
+# Ignite transactions
+This module provides transactions support for cross partition operations. Using the transactions, such operations are
+executed in atomic way (either all changes all applied, or nothing at all) with a strong isolation.
+
+Transactions support is supposed to be icremental. In the first approach, we are trying to put existing ideas from
+ignite 2 to the new replication infrastructure. In the next phases, MVCC support should be added to avoid blocking reads
+and some other optimization, like parallel commits from <sup id="a1">[1](#f1)</sup>
+
+# Transaction protocol design
+In high level, we utilize 2 phase locking (2PL) for a concurrency control, 2 phase commit (2PC) as an atomic commitment 
+protocol, in conjunction with WAIT_DIE deadlock prevention, described in <sup id="a2">[2](#f2)</sup>. 
+This implementation is very close to Ignite 2 optimistic serializable mode. 
+Additional goals are: 
+1) retain only strong isolation 
+2) support for SQL 
+3) utilize new common replication infrastructure based on RAFT.
+
+# Two phase commit
+This protocol is responsible for atomic commitment (all or nothing) tx guraranties.
+Each update is **pre-written** to a replication groups on first phase (and replicated to a majority).
+As soon as all updates are pre-written, it's safe to commit.
+This slightly differs from ignite 2, because there is no PREPARED state.
+
+# Two phase locking
+2PL states the transaction constist of growing phase, where locks are acquired, and shrinking phase where locks are released.
+A tx acquires shared locks for reads and exclusive locks for writes.
+It's possible to lock for read in exclusive mode (select for update semantics)
+Locks for different keys can be acquired in parallel.
+Shared lock is obtained on DN-read operation.
+Exclusive lock is obtained on DN-prewrite operation.
+
+(DN - data node)
+
+# Lock manager
+Locking functionality is implemented by LockManager.
+Each **leaseholder** for a partition replication group deploys an instance of LockManager. 
+All reads and writes go through the **leaseholder**. Only raft leader for some term can become a **leaseholder**.
+It's important what no two different leaseholder intervals can overlap for the same group.
+Current leasholder map can be loaded from metastore (watches can be used for a fast notification about leaseholder change).
+
+The lockmanager should keep locks in the offheap to reduce GC pressure, but can be heap based in the first approach.
+
+# Locking precausion
+LockManager has a volatile state, so some precausions must be taken before locking the keys due to possible node restarts.
+Before taking a lock, LockManager should consult a tx state for the key (by reading it's metadata if present and looking into txstate map).
+If a key is enlisted in transaction and wait is possible according to tx priority, lock cannot be taken immediately.
+
+# Tx coordinator
+Tx coordinator is assigned in a random fashion from a list of allowed nodes. They can be dedicated nodes or same as data nodes.
+They are responsible for id assignment and failover handling if some nodes from tx topology have failed. 
+Knows full tx topology.
+
+It's possible to live without dedicated coordinators, but it will make client tx logic more complex and prone to coordinator failures.
+The drawback of this is a additional hop for each tx op.
+But from the architecure point of view having the dedicated coordinators is definetely more robust.
+
+# Deadlock prevention
+Deadlock prevention in WAIT_DIE mode - uses priorities to decide which tx should be restarted.
+Each transaction is assigned a unique globally comparable timestamp (for example UUID), which defines tx priority.
+If T1 has lower priority when T2 (T1 is younger) it can wait for lock, otherwise it's restarted keeping it's timestamp.
+committing transaction can't be restarted.
+Deadlock detection is not an option due to huge computation resources requirement and no real-time guaranties.
+
+# Tx map
+Each node maintains a persistent tx map:
+
+txid -> timestamp|txstate(PENDING|ABORTED|COMMITED)
+
+This map is used for a failover. Oldest entries in txid map must be cleaned to avoid unlimited grow.
+
+# Data format
+A row is stored in key-value database with additional attached metadata for referencing associated tx.
+
+# Write tx example.
+Assume the current row is: key -> oldvalue
+
+The steps to update a row:
+
+1. acquire exclusive lock on key on prewrite
+
+2. remove key -> oldvalue<br/>
+   set key -> newvalue [txid] // Inserted row has a special metadata containing transaction id it's enlisted in.<br/>
+   set txid + key -> oldvalue (for aborting purposes)
+
+3. on commit:<br/>
+   set txid -> commited<br/>
+   release exclusive lock<br/>
+   async clear garbage
+
+4. on abort:<br/>
+   set txid -> aborted<br/>
+   remove key -> newvalue<br/>
+   set key -> oldvalue<br/>
+   release exclusive lock<br/>
+   async clear garbage
+
+# SQL and indexes.
+
+We assume only row level locking in the first approach, which gives us a repeatable_read isolation.
+
+When the SQL query is executed, it acquires locks while the the data is collected on data nodes.
+
+Locks are acquired lazily as result set is consumed.
+
+The locking rules are same as for get/put operations.
+
+Then values are removed from indexes on step 2, they are written as tombstones to avoid read inconsistency and should be 
+cleaned up after tx finish.
+
+# Failover handling
+Failover protocol is similar to Ignite 2 with a main difference: until tx is sure it can commit or rollback, it holds
+its locks. This means in the case of split-brain, some keys will be locked until split-brain situation is resolved and
+tx recovery protocol will converge.
+
+## Leaserholder fail
+An enlisted node asks a coordinator if it can commit or not.
+
+## Coordinator fail
+Broadcast recovery is necessary (because we don't have full tx topology on each enlisted node).
+All nodes are requested about local txs state. If at least one is commiting, it's safe to commit.
+
+<em id="f1">[1]</em> CockroachDB: The Resilient Geo-Distributed SQL Database, 2020 [↩](#a1)<br/>
+<em id="f2">[2]</em> Concurrency Control in Distributed Database Systems, 1981 [↩](#a2)
+ 
\ No newline at end of file
diff --git a/modules/transactions/pom.xml b/modules/transactions/pom.xml
new file mode 100644
index 0000000..67fc4e6
--- /dev/null
+++ b/modules/transactions/pom.xml
@@ -0,0 +1,93 @@
+<?xml version="1.0" encoding="UTF-8"?>
+
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <modelVersion>4.0.0</modelVersion>
+
+    <parent>
+        <groupId>org.apache.ignite</groupId>
+        <artifactId>ignite-parent</artifactId>
+        <version>1</version>
+        <relativePath>../../parent/pom.xml</relativePath>
+    </parent>
+
+    <artifactId>ignite-transactions</artifactId>
+    <version>3.0.0-SNAPSHOT</version>
+
+    <dependencies>
+        <dependency>
+            <groupId>org.apache.ignite</groupId>
+            <artifactId>ignite-core</artifactId>
+        </dependency>
+
+        <dependency>
+            <groupId>org.apache.ignite</groupId>
+            <artifactId>ignite-network-api</artifactId>
+        </dependency>
+
+        <dependency>
+            <groupId>org.apache.ignite</groupId>
+            <artifactId>ignite-network-annotation-processor</artifactId>
+            <optional>true</optional>
+        </dependency>
+
+        <!-- 3rd party dependencies -->
+        <dependency>
+            <groupId>org.jetbrains</groupId>
+            <artifactId>annotations</artifactId>
+        </dependency>
+
+        <!-- Test dependencies -->
+        <dependency>
+            <groupId>org.junit.jupiter</groupId>
+            <artifactId>junit-jupiter-engine</artifactId>
+            <scope>test</scope>
+        </dependency>
+
+        <dependency>
+            <groupId>org.mockito</groupId>
+            <artifactId>mockito-junit-jupiter</artifactId>
+            <scope>test</scope>
+        </dependency>
+
+        <dependency>
+            <groupId>org.mockito</groupId>
+            <artifactId>mockito-core</artifactId>
+            <scope>test</scope>
+        </dependency>
+    </dependencies>
+
+    <build>
+        <resources>
+            <resource>
+                <directory>src/main/resources</directory>
+                <filtering>true</filtering>
+            </resource>
+        </resources>
+
+        <plugins>
+            <plugin>
+                <groupId>org.apache.maven.plugins</groupId>
+                <artifactId>maven-compiler-plugin</artifactId>
+            </plugin>
+        </plugins>
+    </build>
+</project>
diff --git a/modules/transactions/src/main/java/org/apache/ignite/internal/tx/LockManager.java b/modules/transactions/src/main/java/org/apache/ignite/internal/tx/LockManager.java
new file mode 100644
index 0000000..014a892
--- /dev/null
+++ b/modules/transactions/src/main/java/org/apache/ignite/internal/tx/LockManager.java
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.tx;
+
+import java.util.concurrent.CompletableFuture;
+
+/**
+ * Lock manager allows to acquire locks in shared and exclusive mode and supports deadlock prevention by timestamp
+ * ordering.
+ */
+public interface LockManager {
+    /**
+     * @param key The key.
+     * @param timestamp The timestamp.
+     * @return The future.
+     * @throws LockOrderException When a lock can't be taken due to wrong ordering.
+     */
+    public CompletableFuture<Void> tryAcquire(Object key, long timestamp) throws LockOrderException;
+
+    /**
+     * @param key The key.
+     * @return {@code True} if the lock was released.
+     */
+    public boolean tryRelease(Object key);
+
+    /**
+     * @param key The key.
+     * @param timestamp The timestamp.
+     * @return The future.
+     * @throws LockOrderException When a lock can't be taken due to wrong ordering.
+     */
+    public CompletableFuture<Void> tryAcquireShared(Object key, long timestamp) throws LockOrderException;
+
+    /**
+     * @param key The key.
+     * @return {@code True} if the lock was released.
+     */
+    public boolean tryReleaseShared(Object key);
+}
diff --git a/modules/transactions/src/main/java/org/apache/ignite/internal/tx/LockOrderException.java b/modules/transactions/src/main/java/org/apache/ignite/internal/tx/LockOrderException.java
new file mode 100644
index 0000000..6ec2068
--- /dev/null
+++ b/modules/transactions/src/main/java/org/apache/ignite/internal/tx/LockOrderException.java
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.tx;
+
+import org.apache.ignite.lang.IgniteInternalException;
+
+/**
+ * This exception is thrown when lock cannot be acquired due to wrong locking order.
+ */
+public class LockOrderException extends IgniteInternalException {
+}
diff --git a/modules/transactions/src/test/java/org/apache/ignite/internal/tx/LockManagerTest.java b/modules/transactions/src/test/java/org/apache/ignite/internal/tx/LockManagerTest.java
new file mode 100644
index 0000000..4ab4c13
--- /dev/null
+++ b/modules/transactions/src/test/java/org/apache/ignite/internal/tx/LockManagerTest.java
@@ -0,0 +1,24 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.tx;
+
+/**
+ * Tests a LockManager implementation.
+ */
+public class LockManagerTest {
+}
diff --git a/pom.xml b/pom.xml
index 102cf87..6b82b3e 100644
--- a/pom.xml
+++ b/pom.xml
@@ -61,6 +61,7 @@
         <module>modules/storage-api</module>
         <module>modules/storage-rocksdb</module>
         <module>modules/table</module>
+        <module>modules/transactions</module>
         <module>modules/vault</module>
     </modules>