You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ignite.apache.org by "Ivan Bessonov (Jira)" <ji...@apache.org> on 2022/09/01 11:51:00 UTC

[jira] [Updated] (IGNITE-17611) Implement proper local storage recovery for transaction state store

     [ https://issues.apache.org/jira/browse/IGNITE-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Bessonov updated IGNITE-17611:
-----------------------------------
    Description: 
h3. Preliminaries

Current design expects transaction states to be replicated using the same RAFT groups that process partition transactional data. In code this means that there are two physical storages associated with a single state machine. This design is easy to achieve when the system is stable, but fault tolerance and basic node restart might introduce some complications.
h3. Partition storage design

By itself, partition storage works this way:
 * every write command writes value of the RAFT log index, associated with the command;
 * this index value is written atomically with the data from the command;
 * updates are accumulated in the memory buffer before being written to disk.
 * upon restart, we read the value of the last applied index and proceed the recovery process from it. It's done with RAFT snapshots infrastructure.

h3. Changes to tx state store

Basically, everything has to be repeated:
 * applied index value must be introduced to tx state storage;
 * updates must be atomic;
 * on restart, we should use the minimal value of last applied index from both TX State and MvPartinion storages ({{{}PartitionSnapshotStorage{}}} has to be changed).

h3. Other necessary changes
 * atomic flush must be set up for the tx state storage. WAL should be disabled;
 * snapshot command must trigger the flush. Please refer to {{RocksDbFlushListener}} and {{RocksDbMvPartitionStorage#flush}} for implementation reference. Listener class can be generified and reused;
 * assertion in {{PartitionListener#onWrite}} should be removed or drastically improved;
 * read operation on storages must be prohibited until local recovery is completed - we should apply all command up to "commitIndex" value that's been read at the start of the node, otherwise storages may have data, inconsistent with each other.

  was:
h3. Preliminaries

Current design expects transaction states to be replicated using the same RAFT groups that process partition transactional data. In code this means that there are two physical storages associated with a single state machine. This design is easy to achieve when the system is stable, but fault tolerance and basic node restart might introduce some complications.
h3. Partition storage design

By itself, partition storage works this way:
 * every write command writes value of the RAFT log index, associated with the command;
 * this index value is written atomically with the data from the comment;
 * updates are accumulated in the memory buffer before being written to disk.
 * upon restart, we read the value of the last applied index and proceed the recovery process from it. It's done with RAFT snapshots infrastructure.

h3. Changes to tx state store

Basically, everything has to be repeated:
 * applied index value must be introduced to tx state storage;
 * updates must be atomic;
 * on restart, we should use the minimal value of last applied index from both TX State and MvPartinion storages ({{{}PartitionSnapshotStorage{}}} has to be changed).

h3. Other necessary changes
 * atomic flush must be set up for the tx state storage. WAL should be disabled;
 * snapshot command must trigger the flush. Please refer to {{RocksDbFlushListener}} and {{RocksDbMvPartitionStorage#flush}} for implementation reference. Listener class can be generified and reused;
 * assertion in {{PartitionListener#onWrite}} should be removed or drastically improved;
 * read operation on storages must be prohibited until local recovery is completed - we should apply all command up to "commitIndex" value that's been read at the start of the node, otherwise storages may have data, inconsistent with each other.


> Implement proper local storage recovery for transaction state store
> -------------------------------------------------------------------
>
>                 Key: IGNITE-17611
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17611
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>
> h3. Preliminaries
> Current design expects transaction states to be replicated using the same RAFT groups that process partition transactional data. In code this means that there are two physical storages associated with a single state machine. This design is easy to achieve when the system is stable, but fault tolerance and basic node restart might introduce some complications.
> h3. Partition storage design
> By itself, partition storage works this way:
>  * every write command writes value of the RAFT log index, associated with the command;
>  * this index value is written atomically with the data from the command;
>  * updates are accumulated in the memory buffer before being written to disk.
>  * upon restart, we read the value of the last applied index and proceed the recovery process from it. It's done with RAFT snapshots infrastructure.
> h3. Changes to tx state store
> Basically, everything has to be repeated:
>  * applied index value must be introduced to tx state storage;
>  * updates must be atomic;
>  * on restart, we should use the minimal value of last applied index from both TX State and MvPartinion storages ({{{}PartitionSnapshotStorage{}}} has to be changed).
> h3. Other necessary changes
>  * atomic flush must be set up for the tx state storage. WAL should be disabled;
>  * snapshot command must trigger the flush. Please refer to {{RocksDbFlushListener}} and {{RocksDbMvPartitionStorage#flush}} for implementation reference. Listener class can be generified and reused;
>  * assertion in {{PartitionListener#onWrite}} should be removed or drastically improved;
>  * read operation on storages must be prohibited until local recovery is completed - we should apply all command up to "commitIndex" value that's been read at the start of the node, otherwise storages may have data, inconsistent with each other.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)