You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ignite.apache.org by "Roman Puchkovskiy (Jira)" <ji...@apache.org> on 2023/05/04 17:52:00 UTC
[jira] [Commented] (IGNITE-19269) Introduce proper local RowId locks

    [ https://issues.apache.org/jira/browse/IGNITE-19269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719454#comment-17719454 ] 

Roman Puchkovskiy commented on IGNITE-19269:
--------------------------------------------

Sorry, I forgot to do it timely: LGTM!

> Introduce proper local RowId locks
> ----------------------------------
>
>                 Key: IGNITE-19269
>                 URL: https://issues.apache.org/jira/browse/IGNITE-19269
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Ivan Bessonov
>            Assignee: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>          Time Spent: 6h
>  Remaining Estimate: 0h
>
> Currently, concurrency in MV storages is a bit convoluted. List of issues include:
>  * Implicit assumption that one does not insert the same RowId concurrently. There's a chance that this assumption is already violated. This could lead to B+Tree corruptions, for example.
>  * Hidden locks, that cause non-trivial bugs like this: https://issues.apache.org/jira/browse/IGNITE-19169
>  * MvPartitionStorage#scanVersions implementation for B+Tree reads the entire chain at once, because we couldn't figure out how to do it properly in concurrent environment. This may lead to drastic performance drops in the future.
>  * GC code is complicated and sub-optimal. We need to implement the same approach that's used in Ignite 2.x - read, lock, dequeue, cleanup, unlock*. We can also cache part of the queue beforehand, making the process faster. It is not possible in current implementation.
>  * Maybe something else that I couldn't remember.
> h3. Proposed solution
> As wise man once said, "{_}Explicit is better than implicit{_}".
> I propose having an explicit concurrency model akin Ignite 2.x's. We need to introduce 2 new methods to MV partition storage interface.
> {code:java}
> /**
>  * Synchronously locks passed RowId.
>  */
> void lock(RowId rowId);
> /**
>  * Tries to synchronously lock passed RowId.
>  * Used to implement deadlock prevention.
>  */
> boolean tryLock(RowId rowId);{code}
> These methods should delegate locking to the client, basically exposing the {{ReentrantLockByRowId}} API. Expected invariants:
>  * Locks can only be allowed inside of the "runConsistently" closure.
>  * Unlock is still supposed to happen at the end of "runConsistently", automatically. This is predicated by the way RocksDB implementation works (batches).
> Client must guarantee, that there are no deadlocks, either by sorting the data or by using "tryLock".
>  * All read operations, except for "scanVersions", may be executed without locks.
>  * "scanVersions" must have a lock, because it's not atomic. This is not a big deal, because the only two usages are:
>  ** index cleanup, we already hold a lock
>  ** rebalance, we already hold another lock, so adding a new one is not a problem
>  * "scanVersions" implementation must also be fixed. Maybe we need to add a hint, like "loadAllVersions", but it's not required.
> This approach, with proper implementation, solves all mentioned problems, except for GC.
> h3. Changes to GC
> As stated before, GC flow must be changed to something like this:
> {code:java}
> var entry = partition.peek(lw);
> if (entry == null) {
>     return;
> }
> if (!partition.tryLock(entry.rowId())) {
>     return; // Must exit closure to avoid deadlocks.
> }
> if (!partition.dequeue(entry)) {
>     continue; // Or return. We may or may not exit the closure
>               // in this case, both options are valid.
> }
> cleanupIndexes(entry);
> continue;{code}
> Right now, all storage implementations have similar construction inside of them. All I propose is to move it outside.
> h3. Tangent changes
> org.apache.ignite.internal.table.distributed.raft.PartitionDataStorage is basically a copy of the original, except for a few methods. Maybe copied methods should exist in a shared super-interface. It's up to developer to decide.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)