
Posted to commits@cassandra.apache.org by "Caleb Rackliffe (Jira)" <ji...@apache.org> on 2021/08/16 19:45:00 UTC

[jira] [Commented] (CASSANDRA-16721) Repaired data tracking on a read coordinator is susceptible to races between local and remote requests

    [ https://issues.apache.org/jira/browse/CASSANDRA-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17399972#comment-17399972 ] 

Caleb Rackliffe commented on CASSANDRA-16721:
---------------------------------------------

I've made a first pass at the patch, and I think it solves the problem laid out in the issue description. However, there are a few questions I'm struggling with:
 

1.) Why do we share any aspect of {{RepairedDataInfo}} across threads at all? If we didn't, both the problem above and a class of other possible problems (read on) would be sidestepped completely. More specifically, perhaps we could just indicate to the {{ReadExecutionController}} whether it should track repaired status?

2.) If we follow the scenario above, and two remote reads return and indicate a mismatch while the local read is still executing, is it possible that both the original local read (likely on a Native Transport thread, but possibly on a ReadStage thread) and the second local read started in {{startRepair()}} (now on a ReadStage thread) use the same {{RepairedDataInfo}} instance as they serialize their local data responses?

 
Even if the second item above isn't possible, it still seems like our implementation would be less brittle if we could find a minimally invasive way to make the change in the first item. I'm open to making a pass at it, but I want to make sure my starting assumptions are correct.
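
To make the first item concrete, here is a minimal sketch of the idea, not the actual patch. {{LocalRepairedInfo}} and {{Controller}} are hypothetical stand-ins, not Cassandra's {{RepairedDataInfo}} or {{ReadExecutionController}}. The only thing handed to each read is an immutable boolean, and each read creates its own tracking state, so a flag flipped later by another thread cannot be observed mid-execution.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class RepairedTrackingSketch {
    // Hypothetical per-read state; stands in for the repaired data tracking object.
    static final class LocalRepairedInfo {
        private int partitionsSeen;
        void extend() { partitionsSeen++; }
        int partitionsSeen() { return partitionsSeen; }
    }

    // Hypothetical controller: receives only a flag and owns its tracking state.
    static final class Controller {
        private final LocalRepairedInfo info; // null when this read does not track

        Controller(boolean trackRepairedStatus) {
            this.info = trackRepairedStatus ? new LocalRepairedInfo() : null;
        }

        // Iterates the (simulated) local results; returns -1 when not tracking.
        int executeRead(int partitions) {
            for (int i = 0; i < partitions; i++) {
                if (info != null) {
                    info.extend();
                }
            }
            return info == null ? -1 : info.partitionsSeen();
        }
    }

    // Runs an untracked initial read and a tracked repair read concurrently.
    static int[] runBoth() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<Integer> initial = pool.submit(() -> new Controller(false).executeRead(1000));
            Future<Integer> repair = pool.submit(() -> new Controller(true).executeRead(1000));
            return new int[] { initial.get(), repair.get() };
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] results = runBoth();
        System.out.println("initial=" + results[0] + " repair=" + results[1]);
    }
}
```

Because the decision to track is fixed at construction time in this sketch, neither read can find its tracking object swapped underneath it while iterating. Whether the real code paths can be restructured this way is exactly the assumption I'd like to confirm.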

> Repaired data tracking on a read coordinator is susceptible to races between local and remote requests
> ------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16721
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16721
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination
>            Reporter: Sam Tunnicliffe
>            Assignee: Sam Tunnicliffe
>            Priority: Normal
>             Fix For: 4.0.x
>
>
> At read time on a coordinator which is also a replica, the local and remote reads can race such that the remote responses are received while the local read is executing. If the remote responses are mismatching, triggering a {{DigestMismatchException}} and subsequent round of full data reads and read repair, the local runnable may find the {{isTrackingRepairedStatus}} flag flipped mid-execution. If this happens after a certain point in execution, it would mean that the {{RepairedDataInfo}} instance in use is the singleton null object {{RepairedDataInfo.NULL_REPAIRED_DATA_INFO}}. If this happens, it can lead to an NPE when calling {{RepairedDataInfo::extend}} when the local results are iterated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
