You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ignite.apache.org by "Roman Puchkovskiy (Jira)" <ji...@apache.org> on 2023/05/04 17:52:00 UTC
[jira] [Commented] (IGNITE-19269) Introduce proper local RowId locks
[ https://issues.apache.org/jira/browse/IGNITE-19269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719454#comment-17719454 ]
Roman Puchkovskiy commented on IGNITE-19269:
--------------------------------------------
Sorry, I forgot to do it timely: LGTM!
> Introduce proper local RowId locks
> ----------------------------------
>
> Key: IGNITE-19269
> URL: https://issues.apache.org/jira/browse/IGNITE-19269
> Project: Ignite
> Issue Type: Improvement
> Reporter: Ivan Bessonov
> Assignee: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
> Fix For: 3.0.0-beta2
>
> Time Spent: 6h
> Remaining Estimate: 0h
>
> Currently, concurrency in MV storages is a bit convoluted. List of issues include:
> * Implicit assumption that one does not insert the same RowId concurrently. There's a chance that this assumption is already violated. This could lead to B+Tree corruptions, for example.
> * Hidden locks, that cause non-trivial bugs like this: https://issues.apache.org/jira/browse/IGNITE-19169
> * MvPartitionStorage#scanVersions implementation for B+Tree reads the entire chain at once, because we couldn't figure out how to do it properly in concurrent environment. This may lead to drastic performance drops in the future.
> * GC code is complicated and sub-optimal. We need to implement the same approach that's used in Ignite 2.x - read, lock, dequeue, cleanup, unlock*. We can also cache part of the queue beforehand, making the process faster. It is not possible in current implementation.
> * Maybe something else that I couldn't remember.
> h3. Proposed solution
> As wise man once said, "{_}Explicit is better than implicit{_}".
> I propose having an explicit concurrency model akin Ignite 2.x's. We need to introduce 2 new methods to MV partition storage interface.
> {code:java}
> /**
> * Synchronously locks passed RowId.
> */
> void lock(RowId rowId);
> /**
> * Tries to synchronously lock passed RowId.
> * Used to implement deadlock prevention.
> */
> boolean tryLock(RowId rowId);{code}
> These methods should delegate locking to the client, basically exposing the {{ReentrantLockByRowId}} API. Expected invariants:
> * Locks can only be allowed inside of the "runConsistently" closure.
> * Unlock is still supposed to happen at the end of "runConsistently", automatically. This is predicated by the way RocksDB implementation works (batches).
> Client must guarantee, that there are no deadlocks, either by sorting the data or by using "tryLock".
> * All read operations, except for "scanVersions", may be executed without locks.
> * "scanVersions" must have a lock, because it's not atomic. This is not a big deal, because the only two usages are:
> ** index cleanup, we already hold a lock
> ** rebalance, we already hold another lock, so adding a new one is not a problem
> * "scanVersions" implementation must also be fixed. Maybe we need to add a hint, like "loadAllVersions", but it's not required.
> This approach, with proper implementation, solves all mentioned problems, except for GC.
> h3. Changes to GC
> As stated before, GC flow must be changed to something like this:
> {code:java}
> var entry = partition.peek(lw);
> if (entry == null) {
> return;
> }
> if (!partition.tryLock(entry.rowId())) {
> return; // Must exit closure to avoid deadlocks.
> }
> if (!partition.dequeue(entry)) {
> continue; // Or return. We may or may not exit the closure
> // in this case, both options are valid.
> }
> cleanupIndexes(entry);
> continue;{code}
> Right now, all storage implementations have similar construction inside of them. All I propose is to move it outside.
> h3. Tangent changes
> org.apache.ignite.internal.table.distributed.raft.PartitionDataStorage is basically a copy of the original, except for a few methods. Maybe copied methods should exist in a shared super-interface. It's up to developer to decide.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)