Posted to commits@cassandra.apache.org by "Andres de la Peña (Jira)" <ji...@apache.org> on 2020/03/25 13:37:00 UTC
[jira] [Comment Edited] (CASSANDRA-8272) 2ndary indexes can return stale data

    [ https://issues.apache.org/jira/browse/CASSANDRA-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066670#comment-17066670 ] 

Andres de la Peña edited comment on CASSANDRA-8272 at 3/25/20, 1:36 PM:
------------------------------------------------------------------------

It seems that the previous index-tombstone-based approach misses some cases: those in which the replica with the most recent version of a column has never seen the previous versions of that column that might exist on other replicas. For example:
{code:sql}
CREATE TABLE t (k int PRIMARY KEY, v text);
CREATE INDEX ON t(v);
INSERT INTO t(k, v) VALUES (0, 'old') USING TIMESTAMP 1;  // Only node 1 gets it
INSERT INTO t(k, v) VALUES (0, 'new') USING TIMESTAMP 2;  // Only node 2 gets it
SELECT * FROM t WHERE v = 'old'; // node 1 returns a stale result!
{code}
The attached PR proposes a different approach that is similar to short read protection, and also fixes CASSANDRA-8273.

When there is replica-side protection, we materialize and cache the query results, using a merge listener to take note of the primary keys of rows that lack a response from at least one of the involved replicas. We know that those silent replicas might have a more recent version of the row that wasn't included because it doesn't satisfy the filter. Once we have identified and collected those potentially stale rows, we ask the silent replicas for them, using {{SinglePartitionReadCommand}}s that don't apply any filtering. Then we complete the cached filtered results with the responses from the silent replicas, apply the row filter, and we are ready to go.
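
To make the steps above concrete, here is a minimal in-memory sketch of the idea. It is only a toy model under stated assumptions, not the code in the PR: each replica is a plain map from primary key to a single timestamped value, reconciliation is last-write-wins, and every name in it ({{Cell}}, {{filteredRead}}, {{naiveRead}}, {{protectedRead}}) is hypothetical and does not correspond to any Cassandra class or API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of replica filtering protection; all names here are hypothetical. */
final class ReplicaFilteringSketch {

    /** One version of a single-column row: a value plus its write timestamp. */
    static final class Cell {
        final String value;
        final long timestamp;
        Cell(String value, long timestamp) { this.value = value; this.timestamp = timestamp; }
    }

    /** Last-write-wins reconciliation between two versions of the same row. */
    static Cell newest(Cell a, Cell b) {
        return a.timestamp >= b.timestamp ? a : b;
    }

    /** Replica-side filtered read: the replica only returns rows matching the filter. */
    static Map<Integer, Cell> filteredRead(Map<Integer, Cell> replica, String wanted) {
        Map<Integer, Cell> out = new TreeMap<>();
        for (Map.Entry<Integer, Cell> e : replica.entrySet())
            if (e.getValue().value.equals(wanted)) out.put(e.getKey(), e.getValue());
        return out;
    }

    /** Coordinator-side row filter applied to the reconciled rows. */
    static Map<Integer, String> applyFilter(Map<Integer, Cell> merged, String wanted) {
        Map<Integer, String> out = new TreeMap<>();
        for (Map.Entry<Integer, Cell> e : merged.entrySet())
            if (e.getValue().value.equals(wanted)) out.put(e.getKey(), e.getValue().value);
        return out;
    }

    /** Naive coordinator: merges the filtered responses only, so stale rows can leak. */
    static Map<Integer, String> naiveRead(List<Map<Integer, Cell>> replicas, String wanted) {
        Map<Integer, Cell> merged = new TreeMap<>();
        for (Map<Integer, Cell> replica : replicas)
            filteredRead(replica, wanted)
                .forEach((k, c) -> merged.merge(k, c, ReplicaFilteringSketch::newest));
        return applyFilter(merged, wanted);
    }

    /** Protected coordinator: asks silent replicas for unfiltered versions before filtering. */
    static Map<Integer, String> protectedRead(List<Map<Integer, Cell>> replicas, String wanted) {
        // 1. Materialize and cache the filtered response of every replica.
        List<Map<Integer, Cell>> cached = new ArrayList<>();
        Map<Integer, Cell> merged = new TreeMap<>();
        for (Map<Integer, Cell> replica : replicas) {
            Map<Integer, Cell> response = filteredRead(replica, wanted);
            cached.add(response);
            response.forEach((k, c) -> merged.merge(k, c, ReplicaFilteringSketch::newest));
        }
        // 2. For each returned key, find the replicas that were silent about it and
        //    fetch their version of the row without any filtering.
        for (Integer key : new ArrayList<>(merged.keySet())) {
            for (int i = 0; i < replicas.size(); i++) {
                if (cached.get(i).containsKey(key)) continue; // this replica already answered
                Cell unfiltered = replicas.get(i).get(key);
                if (unfiltered != null) merged.merge(key, unfiltered, ReplicaFilteringSketch::newest);
            }
        }
        // 3. Apply the row filter to the completed, reconciled rows.
        return applyFilter(merged, wanted);
    }

    public static void main(String[] args) {
        Map<Integer, Cell> node1 = new TreeMap<>();
        node1.put(0, new Cell("old", 1)); // only node 1 got the first write
        Map<Integer, Cell> node2 = new TreeMap<>();
        node2.put(0, new Cell("new", 2)); // only node 2 got the second write
        List<Map<Integer, Cell>> replicas = List.of(node1, node2);

        System.out.println(naiveRead(replicas, "old"));     // {0=old}
        System.out.println(protectedRead(replicas, "old")); // {}
    }
}
```

With the two-node scenario from the example, the naive read returns the stale row because node 2 stays silent, while the protected read fetches node 2's newer version, reconciles it, and then filters the row out. The extra round trip is only needed for keys that some replica did not return, which is why the approach resembles short read protection.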

Another advantage of this approach over the previous one is that coordinators containing the fix can work with replicas that don't contain the fix.

A particular problem is that SASI results don't satisfy the requested row filter when an analyzer is used. This is something that we should fix so that the expressions can delegate their evaluation to the specific index implementation. I don't think this is especially problematic, but I think it should be done in a separate follow-up ticket. For now, the fix just skips replica filtering protection when SASI is used, keeping the old behaviour.

I'm attaching a PR for 3.11 and I'm working on the PR for trunk. The dtest PR is updated to include the new cases and queries using filtering instead of indexes.

Since this is a bug fix involving wrong query results, I think it would be great if we could ship it in 4.0.


was (Author: adelapena):
It seems that the previous index-tombstone-based approach misses some cases: those in which the replica with the most recent version of a column has never seen the previous versions of that column that might exist on other replicas. For example:

{code}
CREATE TABLE t (k int PRIMARY KEY, v text);
CREATE INDEX ON t(v);
INSERT INTO t(k, v) VALUES (0, 'old') USING TIMESTAMP 1;  // Only node 1 gets it
INSERT INTO t(k, v) VALUES (0, 'new') USING TIMESTAMP 2;  // Only node 2 gets it
SELECT * FROM t WHERE v = 'old'; // node 1 returns a stale result!
{code}

The attached PR proposes a different approach that is similar to short read protection, and also fixes CASSANDRA-8273.

When there is replica-side protection, we materialize and cache the query results, using a merge listener to take note of the primary keys of rows that lack a response from at least one of the involved replicas. We know that those silent replicas might have a more recent version of the row that wasn't included because it doesn't satisfy the filter. Once we have identified and collected those potentially stale rows, we ask the silent replicas for them, using {{SinglePartitionReadCommand}}s that don't apply any filtering. Then we complete the cached filtered results with the responses from the silent replicas, apply the row filter, and we are ready to go.

Another advantage of this approach over the previous one is that coordinators containing the fix can work with replicas that don't contain the fix. 

A particular problem is that SASI results don't satisfy the requested row filter when an analyzer is used. This is something that we should fix so that the expressions can delegate their evaluation to the specific index implementation. I don't think this is especially problematic, but I think it should be done in a separate follow-up ticket. For now, the fix just skips replica filtering protection when SASI is used, keeping the old behaviour.

I'm attaching a PR for 3.11 and I'm working on the PR for trunk. Since this is a bug fix involving wrong query results, I think it would be great if we could ship it in 4.0.


> 2ndary indexes can return stale data
> ------------------------------------
>
>                 Key: CASSANDRA-8272
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8272
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Feature/2i Index
>            Reporter: Sylvain Lebresne
>            Assignee: Andres de la Peña
>            Priority: Normal
>              Labels: pull-request-available
>             Fix For: 3.0.x
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> When replicas return 2ndary index results, it's possible for a single replica to return a stale result, and that result will be sent back to the user, potentially violating the CL contract.
> For instance, consider 3 replicas A, B and C, and the following situation:
> {noformat}
> CREATE TABLE test (k int PRIMARY KEY, v text);
> CREATE INDEX ON test(v);
> INSERT INTO test(k, v) VALUES (0, 'foo');
> {noformat}
> with every replica up to date. Now, suppose that the following queries are done at {{QUORUM}}:
> {noformat}
> UPDATE test SET v = 'bar' WHERE k = 0;
> SELECT * FROM test WHERE v = 'foo';
> {noformat}
> then, if A and B acknowledge the update but C responds to the read before having applied it, the now-stale result will be returned (since C will return it and A or B will return nothing).
> A potential solution would be that when we read a tombstone in the index (and provided we make the index inherit the gcGrace of its parent CF), instead of skipping that tombstone, we'd insert a corresponding range tombstone in the result.


