Posted to commits@cassandra.apache.org by "Sylvain Lebresne (Jira)" <ji...@apache.org> on 2020/08/13 12:52:00 UTC
[jira] [Commented] (CASSANDRA-15432) The "read defragmentation" optimization does not work

    [ https://issues.apache.org/jira/browse/CASSANDRA-15432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176982#comment-17176982 ] 

Sylvain Lebresne commented on CASSANDRA-15432:
----------------------------------------------

Back on this later than I meant, but attaching fairly trivial patches to remove said optimization on 3.0, 3.11 and trunk/4.0.
||patch||CI||
|[3.0|https://github.com/pcmanus/cassandra/commits/C-15432-3.0]|[#239|https://ci-cassandra.apache.org/job/Cassandra-devbranch/239/]|
|[3.11|https://github.com/pcmanus/cassandra/commits/C-15432-3.11]|[#240|https://ci-cassandra.apache.org/job/Cassandra-devbranch/240/]|
|[trunk|https://github.com/pcmanus/cassandra/commits/C-15432-trunk]|[#241|https://ci-cassandra.apache.org/job/Cassandra-devbranch/241/]|

[~aleksey] or [~benedict]: would one of you have cycles to review, by any chance? It is a pretty simple diff: it removes the {{if}} triggering the defrag, as well as tiny bits of incidental code that are now dead.



> The "read defragmentation" optimization does not work
> -----------------------------------------------------
>
>                 Key: CASSANDRA-15432
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15432
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Local Write-Read Paths
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>            Priority: Normal
>
> The so-called "read defragmentation" that was added way back with CASSANDRA-2503 actually does not work, and never has. That is, the defragmentation writes do happen, but they only add load on the nodes without helping anything, and are thus a clear negative.
> The "read defragmentation" (which only impacts so-called "names queries") kicks in when a read hits "too many" sstables (> 4 by default), and when it does, it writes down the result of that read. The assumption is that the next read for that data would only read the newly written data, which, if no longer in the memtable, would at least be in a single sstable, thus speeding up that next read.
> Unfortunately, this is not how this works. When we defrag and write the result of our original read, we do so with the timestamp of the data read (as we should, changing the timestamp would be plain wrong). As a result, following reads will read that data first, but will have no way to tell that no more sstables should be read. Technically, the [{{reduceFilter}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/SinglePartitionReadCommand.java#L830] call will not return {{null}} because the {{currentMaxTs}} will be higher than at least some of the data in the result, and this remains true until we've read from as many sstables as in the original read.
> I see no easy way to fix this. It might be possible to make it work with additional per-sstable metadata, but nothing sufficiently simple and cheap to be worth it comes to mind. And I thus suggest simply removing that code.
> For the record, I'll note that there is actually a 2nd problem with that code: currently, we "defrag" a read even if we didn't get data for everything that the query requests. This is also "wrong" even if we ignore the first issue: a following read that reads the defragmented data would also have no way to know not to read more sstables to try to get the missing parts. This problem would be fixable, but is obviously overshadowed by the previous one anyway.
> Anyway, as mentioned, I suggest just removing the "optimization" (which, again, never optimized anything) altogether, and I'm happy to provide the simple patch.
> The only question might be: in which versions? This impacts all versions, but it isn't a correctness bug either, "just" a performance one. So do we want 4.0 only, or is there appetite for earlier versions?
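
The timestamp argument in the description can be illustrated with a toy model. This is only a sketch and not Cassandra's code: the class {{DefragSketch}}, its {{SSTable}} type, and the {{stillNeeded}} check are hypothetical stand-ins for the real read path, where {{stillNeeded}} returning false plays the role of {{reduceFilter}} returning {{null}}. It assumes that sstables are visited by decreasing max timestamp, and that each sstable's max timestamp is slightly higher than the requested cell it holds (as it would be if the sstable also contained newer data for other partitions).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DefragSketch {
    /** Toy sstable: cell timestamps by column name, plus the sstable-wide max timestamp. */
    static final class SSTable {
        final long maxTimestamp;
        final Map<String, Long> cells;
        SSTable(long maxTimestamp, Map<String, Long> cells) {
            this.maxTimestamp = maxTimestamp;
            this.cells = cells;
        }
    }

    static final class ReadResult {
        final int sstablesRead;
        final Map<String, Long> cells;
        ReadResult(int sstablesRead, Map<String, Long> cells) {
            this.sstablesRead = sstablesRead;
            this.cells = cells;
        }
    }

    /** n sstables, sstable i holding only column "c<i>" at timestamp 10*i (max timestamp 10*i + 5). */
    static List<SSTable> fragmented(int n) {
        List<SSTable> sstables = new ArrayList<>();
        for (int i = 1; i <= n; i++)
            sstables.add(new SSTable(10L * i + 5, Map.of("c" + i, 10L * i)));
        return sstables;
    }

    /** An sstable can only be skipped if every requested column already has a strictly newer cell. */
    static boolean stillNeeded(Set<String> requested, Map<String, Long> result, long sstableMaxTs) {
        for (String column : requested) {
            Long ts = result.get(column);
            if (ts == null || ts <= sstableMaxTs)
                return true;
        }
        return false;
    }

    /** Names query: visit sstables by decreasing max timestamp, stopping once none can matter. */
    static ReadResult read(List<SSTable> sstables, Set<String> requested) {
        List<SSTable> ordered = new ArrayList<>(sstables);
        ordered.sort(Comparator.comparingLong((SSTable s) -> s.maxTimestamp).reversed());
        Map<String, Long> result = new HashMap<>();
        int read = 0;
        for (SSTable sstable : ordered) {
            if (!stillNeeded(requested, result, sstable.maxTimestamp))
                break;
            read++;
            for (Map.Entry<String, Long> e : sstable.cells.entrySet())
                if (requested.contains(e.getKey()))
                    result.merge(e.getKey(), e.getValue(), Long::max);
        }
        return new ReadResult(read, result);
    }

    /** "Defrag": write the read result back as a new sstable, keeping the original timestamps. */
    static SSTable defrag(ReadResult r) {
        long maxTs = Long.MIN_VALUE;
        for (long ts : r.cells.values())
            maxTs = Math.max(maxTs, ts);
        return new SSTable(maxTs, new HashMap<>(r.cells));
    }

    public static void main(String[] args) {
        Set<String> requested = Set.of("c1", "c2", "c3", "c4", "c5");
        List<SSTable> sstables = new ArrayList<>(fragmented(5));

        ReadResult before = read(sstables, requested);
        System.out.println("sstables read before defrag: " + before.sstablesRead); // 5

        sstables.add(defrag(before));
        ReadResult after = read(sstables, requested);
        System.out.println("sstables read after defrag:  " + after.sstablesRead);  // 6
    }
}
```

In this model the second read touches every original sstable plus the defragmented one, because the rewritten cells keep their old timestamps and so never look newer than the sstables they came from.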


