Posted to issues@ignite.apache.org by "Sergey Uttsel (Jira)" <ji...@apache.org> on 2022/04/06 09:30:00 UTC

[jira] [Updated] (IGNITE-16723) TX Recovery protocol in Cockroach in case of a failure of enlisted leaseholder

     [ https://issues.apache.org/jira/browse/IGNITE-16723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Uttsel updated IGNITE-16723:
-----------------------------------
    Description: 
*Transaction recovery* is closely related to concurrency control.

Concurrency control is described in [https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#lock-free-distributed-transactions] (especially the chapter “Transaction interactions”) and [https://dl.acm.org/doi/pdf/10.1145/3318464.3386134].

 

*Read locks.*

Most reads in CRDB don’t take any locks. Instead, a read records its timestamp in the timestamp cache.

The timestamp cache is a bounded in-memory cache that records the maximum timestamp at which key ranges were read from and written to. The cache corresponds to the "status oracle" discussed in Yabandeh's "A Critique of Snapshot Isolation".

The cache is updated after the completion of each read operation with the range of all keys that the request was predicated upon. It is then consulted for each write operation, so that a write can detect a read-write violation, that is, an attempt to write "under" a read that has already been performed.
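
The read/write interaction above can be sketched as follows. This is a minimal illustration, not CockroachDB's implementation: it tracks single keys instead of key ranges, uses plain integers as timestamps, and the names `ReadTimestampCache`, `record_read` and `check_write` are hypothetical. Pushing a conflicting write just above the read timestamp is an assumption made for the sketch.

```python
class ReadTimestampCache:
    """Per-key sketch: remembers the highest timestamp at which each key was read."""

    def __init__(self):
        self._max_read = {}  # key -> highest read timestamp seen so far

    def record_read(self, key, ts):
        # Called after a read completes; an entry only ever moves forward.
        self._max_read[key] = max(ts, self._max_read.get(key, ts))

    def check_write(self, key, ts):
        """Return the timestamp the write may use.

        A write at or below the latest read would land "under" that read,
        so this sketch pushes it just above the read timestamp (an assumption).
        """
        last_read = self._max_read.get(key)
        if last_read is not None and ts <= last_read:
            return last_read + 1
        return ts


cache = ReadTimestampCache()
cache.record_read("a", 10)
print(cache.check_write("a", 7))   # conflicting write is pushed above the read
print(cache.check_write("a", 12))  # write above the read keeps its timestamp
```

No lock is held between the read and the write: the conflict is detected only when the write arrives and consults the cache.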

The cache is size-limited, so to prevent read-write conflicts for arbitrarily old requests, it pessimistically maintains a “low water mark”. This value always ratchets with monotonic increases and is equivalent to the earliest timestamp of any key range that is present in the cache. If a write operation writes to a key not present in the cache, the “low water mark” is consulted instead to determine read-write conflicts. The low water mark is initialized to the current system time plus the maximum clock offset.
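
A rough sketch of how a size limit and a low water mark can work together is shown below. The class name, the integer timestamps and the eviction policy (oldest inserted entry first) are assumptions for illustration; the real cache works on key ranges and may evict differently.

```python
from collections import OrderedDict


class BoundedTimestampCache:
    """Size-limited read cache with a low water mark (hypothetical sketch)."""

    def __init__(self, capacity, now, max_clock_offset):
        self.capacity = capacity
        # Initialised as described: current time plus the maximum clock offset.
        self.low_water = now + max_clock_offset
        self._entries = OrderedDict()  # key -> max read timestamp, oldest first

    def record_read(self, key, ts):
        if ts <= self.low_water:
            return  # already covered by the low water mark
        self._entries[key] = max(ts, self._entries.get(key, ts))
        while len(self._entries) > self.capacity:
            # Evicting an entry must not lose its information, so the low
            # water mark ratchets up to the evicted timestamp.
            _, evicted_ts = self._entries.popitem(last=False)
            self.low_water = max(self.low_water, evicted_ts)

    def max_read_ts(self, key):
        # Keys that are not in the cache fall back to the low water mark.
        return max(self._entries.get(key, self.low_water), self.low_water)
```

The price of the bound is false conflicts: after an eviction, a write to a key that was never read is still treated as if it had been read at the low water mark.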

When the lease changes, the new leaseholder accepts a timestamp cache snapshot: a summary of the reads served on the range by prior leaseholders. The new leaseholder can use this summary to ensure that no future writes are allowed to invalidate prior reads. If a summary is not provided, for example after a leaseholder failure, the new leaseholder pessimistically assumes that prior leaseholders served reads all the way up to the start of the new lease.
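
The two cases of a lease change can be illustrated with a small sketch. It reduces the summary to a single number (the highest read timestamp served by prior leaseholders), which is a simplification, and the function name `initial_low_water` is hypothetical.

```python
def initial_low_water(new_lease_start, prior_read_summary=None):
    """Low water mark a new leaseholder starts with (simplified sketch).

    prior_read_summary is the highest read timestamp reported by prior
    leaseholders, or None when no summary was transferred, for example
    because the old leaseholder crashed.
    """
    if prior_read_summary is None:
        # Pessimistic: assume reads were served right up to the new lease start.
        return new_lease_start
    return prior_read_summary


# Planned lease transfer: the summary says reads were served up to timestamp 150.
print(initial_low_water(new_lease_start=200, prior_read_summary=150))
# Leaseholder failure: no summary, so everything before the lease start is protected.
print(initial_low_water(new_lease_start=200))
```

In the failure case correctness is preserved, but writes with timestamps below the new lease start conflict with reads that may never have happened.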

 

Some reads, such as SELECT FOR UPDATE, take read locks, but these locks are local to the leaseholder and are lost on leaseholder failure. In this case a “SELECT FOR UPDATE” request falls back to a regular “SELECT”.

 

The key and the value of a timestamp cache entry are structures:
{code:java}
key {startKey, endKey}
value {timestamp, txnID}
{code}
A range lock also uses a timestamp cache:
{code:java}
Add(start, end roachpb.Key, ts hlc.Timestamp, txnID uuid.UUID)
{code}
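
To make the range-keyed structure concrete, here is a small sketch of a cache whose entries cover key ranges. It is not the CockroachDB data structure: it keeps a plain list and scans it, treats the end key as exclusive (an assumption), and the `RangeTimestampCache` class with its `add` and `get_max` methods is hypothetical.

```python
class RangeTimestampCache:
    """List-based sketch of a cache keyed by {startKey, endKey}."""

    def __init__(self):
        self._entries = []  # (start, end, ts, txn_id); end is exclusive here

    def add(self, start, end, ts, txn_id):
        # Record that txn_id read the range [start, end) at timestamp ts.
        self._entries.append((start, end, ts, txn_id))

    def get_max(self, start, end):
        """Return (ts, txn_id) of the newest entry overlapping [start, end), or None."""
        best = None
        for s, e, ts, txn_id in self._entries:
            overlaps = s < end and start < e
            if overlaps and (best is None or ts > best[0]):
                best = (ts, txn_id)
        return best


cache = RangeTimestampCache()
cache.add("a", "f", 10, "txn-1")  # a scan over [a, f) at timestamp 10
print(cache.get_max("c", "d"))    # a write to a key inside the scanned range sees that read
```

Because a scan records its whole range, a later write to any key inside it is detected, including keys that did not exist when the scan ran.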
 

*Write locks.*

CockroachDB has distributed write locks, called write intents. An intent is a regular MVCC KV pair, except that it is preceded by metadata indicating that what follows is an intent. This metadata points to a transaction record, which is a special key (unique per transaction) that stores the current disposition of the transaction: pending, staging, committed or aborted. Because write intents and tx records are replicated, they persist even after the leaseholder fails.
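
The relationship between an intent and its transaction record can be sketched as below. This is a simplified model under stated assumptions: the `Intent` type, the `resolve_read` helper and the dictionary of transaction records are hypothetical, and timestamps and conflict handling are left out.

```python
from dataclasses import dataclass
from enum import Enum


class TxnStatus(Enum):
    PENDING = "pending"
    STAGING = "staging"
    COMMITTED = "committed"
    ABORTED = "aborted"


@dataclass
class Intent:
    key: str
    value: str
    txn_id: str  # metadata pointing at the transaction record


def resolve_read(intent, txn_records, committed_value):
    """Decide what a reader that encounters an intent may observe."""
    status = txn_records[intent.txn_id]
    if status is TxnStatus.COMMITTED:
        return intent.value      # the intent is effectively a committed value
    if status is TxnStatus.ABORTED:
        return committed_value   # the intent is ignored
    return None                  # pending or staging: the outcome is not known yet


records = {"txn-1": TxnStatus.COMMITTED}
print(resolve_read(Intent("k", "new", "txn-1"), records, "old"))
```

Since both the intent and the record are replicated, a new leaseholder can repeat this lookup after a failover, unlike the in-memory timestamp cache.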

 

There is a topic discussing this example on the CockroachDB forum: [https://forum.cockroachlabs.com/t/read-write-tx-conflicts-on-leaseholder-failover/5213]


> TX Recovery protocol in Cockroach in case of a failure of enlisted leaseholder
> ------------------------------------------------------------------------------
>
>                 Key: IGNITE-16723
>                 URL: https://issues.apache.org/jira/browse/IGNITE-16723
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Sergey Uttsel
>            Assignee: Sergey Uttsel
>            Priority: Major
>              Labels: ignite-3
>         Attachments: readlock_lost.jpg, writelock_and_tx_record_lost.jpg
>
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)