Posted to commits@cassandra.apache.org by "Jordan West (Jira)" <ji...@apache.org> on 2022/01/11 00:44:00 UTC

[jira] [Comment Edited] (CASSANDRA-17251) USING writetime + ttl is non-idempotent leading to non-deterministic merge iteration results

    [ https://issues.apache.org/jira/browse/CASSANDRA-17251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472374#comment-17472374 ] 

Jordan West edited comment on CASSANDRA-17251 at 1/11/22, 12:43 AM:
--------------------------------------------------------------------

https://github.com/apache/cassandra/compare/cassandra-3.0...jrwest:jwest/17251-3.0


was (Author: jrwest):
https://github.com/apache/cassandra/compare/trunk...jrwest:jwest/17251-3.0

> USING writetime + ttl is non-idempotent leading to non-deterministic merge iteration results
> --------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-17251
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17251
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Other
>            Reporter: Jordan West
>            Assignee: Jordan West
>            Priority: Normal
>             Fix For: 3.0.x, 3.11.x, 4.0.x
>
>
> The combination of {{USING writetime = timestamp and ttl = ttl}} can result in non-deterministic MergeIterator results, causing DigestMismatchExceptions and increased latencies. The increased latencies are caused by additional round trips due to the digest mismatch, as well as by read repair rewriting the data. The additional writes increase the number of sstables that the key is stored in and that must be scanned on read.
> The order of events is:
> 1. for a given partition a write is performed with {{USING timestamp = sometime and ttl = ttl1}}.
> 2. Cassandra records this write with timestamp = sometime, ttl = ttl1, expires_at = now + ttl1
> 3. after N seconds, for the same partition, another write is performed with {{USING timestamp = sometime and ttl = ttl2}}, where ttl2 = ttl1 - N. This write only makes it to a subset of replicas* for some reason (e.g. partial write, node down, etc).
> 4. Cassandra records this write with timestamp = sometime, ttl = ttl2, expires_at = now + ttl2. It's important to note that at this point, the expires_at from step 2 above is equal to the expires_at here. This is because expires_at is calculated relative to the current write time, not the provided timestamp, and the ttl has been reduced by exactly the time that has passed. This write also makes it to a subset of replicas*.
> 5. A read of the data is performed.
> 5a. The MergeIterator resolves conflicts locally (across sstables) using {{Conflicts.resolveRegular}} or {{Cells.resolveRegular}}. The resolution takes into account the write timestamp, the liveness of the cell, the values themselves, and how much time is left to live via the expires_at field. In this scenario, all of these fields are equal, leading to Cassandra picking the sstable "on the right". Which sstable that is depends on iteration order, so the result is non-deterministic. The only item that differs is the ttl itself.
> 5b. One node returns the non-deterministically chosen value for the row; the other two calculate and send a digest to the coordinator. The digest includes the relative ttl field, which may not match. This results in a DigestMismatchException at the coordinator.
> 6. Read repair is triggered 
> *NOTE: it's not strictly necessary for the write to make it to only a subset of replicas. sstables can also be returned from the live set in varying orders, for reasons like compaction or repair, which can lead to the same behavior. This also affects repair from what we can tell.
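
The order of events quoted above can be sketched in a few lines of Python. This is only an illustrative model, not Cassandra's code: the names Cell, write, resolve and digest are hypothetical, resolve is a simplified stand-in for the tie-breaking described in 5a (liveness is omitted because both cells are live), and the hash function is an arbitrary choice. The concrete numbers (ttl1 = 100, N = 30) are made up for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: bytes
    timestamp: int   # client-supplied write timestamp
    ttl: int         # relative ttl in seconds, as supplied by the client
    expires_at: int  # server clock at write time + ttl


def write(value, timestamp, ttl, now):
    # expires_at is derived from the current time, not from the timestamp.
    return Cell(value, timestamp, ttl, now + ttl)


def resolve(left, right):
    # Simplified stand-in: compare timestamp, value and expires_at.
    # The ttl is NOT consulted; on a full tie the right-hand cell wins.
    def key(c):
        return (c.timestamp, c.value, c.expires_at)
    return left if key(left) > key(right) else right


def digest(cell):
    # Assumption for this sketch: the digest covers the relative ttl too.
    payload = repr((cell.value, cell.timestamp, cell.ttl, cell.expires_at))
    return hashlib.sha256(payload.encode()).hexdigest()


sometime = 1_000
ttl1, n = 100, 30
ttl2 = ttl1 - n

a = write(b"v", sometime, ttl1, now=5_000)      # steps 1-2
b = write(b"v", sometime, ttl2, now=5_000 + n)  # steps 3-4

# Same expiry, different ttl.
print(a.expires_at == b.expires_at, a.ttl, b.ttl)

# The winner depends only on the order in which the sstables are merged.
print(resolve(a, b).ttl, resolve(b, a).ttl)

# Replicas that merged in different orders produce different digests.
print(digest(resolve(a, b)) == digest(resolve(b, a)))
```

In this model the two cells tie on every field the resolver looks at, so the result is whichever cell happens to be on the right, and the differing ttl then surfaces as a digest mismatch.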



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
