
Posted to issues@ignite.apache.org by "Ivan Bessonov (Jira)" <ji...@apache.org> on 2022/09/07 07:44:00 UTC

[jira] [Assigned] (IGNITE-17611) Implement proper local storage recovery for transaction state store

     [ https://issues.apache.org/jira/browse/IGNITE-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Bessonov reassigned IGNITE-17611:
--------------------------------------

    Assignee: Ivan Bessonov

> Implement proper local storage recovery for transaction state store
> -------------------------------------------------------------------
>
>                 Key: IGNITE-17611
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17611
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Ivan Bessonov
>            Assignee: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>
> h3. Preliminaries
> The current design expects transaction states to be replicated using the same RAFT groups that process partition transactional data. In code, this means that there are two physical storages associated with a single state machine. This design is easy to achieve when the system is stable, but fault tolerance and basic node restarts might introduce some complications.
> h3. Partition storage design
> By itself, partition storage works this way:
>  * every write command writes the value of the RAFT log index associated with the command;
>  * this index value is written atomically with the data from the command;
>  * updates are accumulated in the memory buffer before being written to disk;
>  * upon restart, we read the value of the last applied index and continue the recovery process from it. This is done with the RAFT snapshot infrastructure.
> h3. Changes to tx state store
> Basically, the same scheme has to be repeated for the tx state storage:
>  * applied index value must be introduced to tx state storage;
>  * updates must be atomic;
>  * on restart, we should use the minimum of the last applied index values from the TX State and MvPartition storages ({{PartitionSnapshotStorage}} has to be changed).
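
The three requirements above can be illustrated with a minimal Java sketch. It is not Ignite code: `IndexedStorage` and `RecoveryIndex` are hypothetical names, and an in-memory map stands in for the real on-disk storage. It only shows the applied index being updated together with the data, replayed commands being skipped by a storage that already has them, and recovery starting from the smaller of the two indexes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for a storage that tracks its last applied RAFT index. */
class IndexedStorage {
    private final Map<String, String> data = new HashMap<>();
    private long lastAppliedIndex;

    /** Applies a write and its RAFT index together; returns false if the command was already applied. */
    synchronized boolean apply(long index, String key, String value) {
        if (index <= lastAppliedIndex) {
            return false; // This storage already contains the command, so replaying it is a no-op.
        }
        data.put(key, value);
        lastAppliedIndex = index; // Updated under the same lock as the data, modelling the atomic write.
        return true;
    }

    synchronized long lastAppliedIndex() {
        return lastAppliedIndex;
    }

    synchronized String get(String key) {
        return data.get(key);
    }
}

/** Hypothetical helper that picks the index after which the RAFT log has to be replayed. */
class RecoveryIndex {
    static long recoveryStartIndex(IndexedStorage mvPartitionStorage, IndexedStorage txStateStorage) {
        // The storage with the smaller index is missing more commands, so replay starts there.
        return Math.min(mvPartitionStorage.lastAppliedIndex(), txStateStorage.lastAppliedIndex());
    }
}
```

In this sketch the storage that is further ahead simply ignores commands it has already seen, which is one possible way to make replay from the minimum index safe for both storages.
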
> h3. Other necessary changes
>  * atomic flush must be set up for the tx state storage. WAL should be disabled;
>  * snapshot command must trigger the flush. Please refer to {{RocksDbFlushListener}} and {{RocksDbMvPartitionStorage#flush}} for implementation reference. The listener class can be made generic and reused;
>  * assertion in {{PartitionListener#onWrite}} should be removed or drastically improved;
>  * read operations on the storages must be prohibited until local recovery is completed - we should apply all commands up to the "commitIndex" value that was read at node start, otherwise the storages may contain data that is inconsistent with each other.
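
The last point, blocking reads until local recovery is complete, could be sketched as follows. This is an illustration under assumptions, not the planned implementation: `RecoveryReadGate` and its methods are hypothetical names, and the gate only compares the highest applied index with the "commitIndex" value captured at node start.

```java
/** Hypothetical gate that rejects reads until all commands up to the start-time commit index are applied. */
class RecoveryReadGate {
    private final long commitIndexAtStart;
    private long appliedIndex;

    RecoveryReadGate(long commitIndexAtStart, long initiallyAppliedIndex) {
        this.commitIndexAtStart = commitIndexAtStart;
        this.appliedIndex = initiallyAppliedIndex;
    }

    /** Called after a replayed command has been applied to both storages. */
    synchronized void onCommandApplied(long index) {
        appliedIndex = Math.max(appliedIndex, index);
    }

    synchronized boolean recoveryCompleted() {
        return appliedIndex >= commitIndexAtStart;
    }

    /** Read paths call this first; it throws while the storages may still be inconsistent. */
    synchronized void ensureReadable() {
        if (!recoveryCompleted()) {
            throw new IllegalStateException("Local recovery in progress: applied index " + appliedIndex
                    + " is below commit index " + commitIndexAtStart);
        }
    }
}
```

A real implementation might prefer to make readers wait rather than fail. The sketch throws only to keep the example short.
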


