Posted to commits@cassandra.apache.org by "C. Scott Andreas (JIRA)" <ji...@apache.org> on 2018/11/18 18:19:02 UTC

[jira] [Updated] (CASSANDRA-8737) AdjacentDataCompactionStrategy

     [ https://issues.apache.org/jira/browse/CASSANDRA-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

C. Scott Andreas updated CASSANDRA-8737:
----------------------------------------
    Component/s: Compaction

> AdjacentDataCompactionStrategy
> ------------------------------
>
>                 Key: CASSANDRA-8737
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8737
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Compaction
>            Reporter: Benedict
>            Priority: Major
>             Fix For: 4.x
>
>
> The original ticket for dealing with timeseries data, which introduced DTCS, first suggested an approach that compacted adjacent data (by clustering columns) together until a single page (or some fixed multiple of pages) contained, on average, only one partition's worth of data. The idea would be to compact any sstables whose clustering components overlap, so that only one sstable (or a fixed number of sstables) needs to be queried for any clustering range. The upshot would be a tunable compaction burden that yields optimal read behaviour, more explicitly defined than the decay in DTCS.
> The basic idea would be to select boundary clustering prefixes based on the current data occupancy within those ranges, falling roughly along the boundaries of the existing sstables, but placed so that any overlapping tail falls on one side or the other. We then compact all overlapping sstables and split the results onto one side or the other of the boundary (or across multiple boundaries). If there are no historical updates, this gives pretty optimal behaviour: we only compact files until we reach our packing threshold (so that reads are known to be at the configured efficiency), and then stop. If updates to older records appear, they would be compacted into their boundary buckets and left there until a boundary had accumulated enough files (probably following normal STCS rules) to warrant compaction.
> The benefit is that such historical updates are still accounted for and bounded, by comparison to DTCS, and the configuration parameters give more tunable characteristics with explicit expectations (i.e. one seek per X bytes read in a partition; a higher X may imply more compaction, a lower X more merges and seeks on read). It may also permit some easy optimisations further up the stack, since we can guarantee the boundaries of overlap.
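
The selection step described above can be illustrated with a rough sketch. This is not Cassandra code and not a proposed implementation. It assumes clustering values are plain integers, that the boundaries are supplied by the caller rather than derived from data occupancy, and that a group is worth compacting once it holds `min_threshold` sstables (a stand-in for the STCS-style rule the description mentions). All names here (`SSTable`, `overlapping_groups`, `split_at_boundaries`, `plan_compactions`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SSTable:
    """Hypothetical stand-in: a name plus an inclusive clustering range."""
    name: str
    lo: int
    hi: int


def overlapping_groups(sstables: List[SSTable]) -> List[List[SSTable]]:
    """Group sstables whose clustering ranges overlap, directly or transitively."""
    groups: List[List[SSTable]] = []
    current: List[SSTable] = []
    current_hi = 0
    for t in sorted(sstables, key=lambda t: (t.lo, t.hi)):
        if current and t.lo <= current_hi:
            # Overlaps the running group: extend it.
            current.append(t)
            current_hi = max(current_hi, t.hi)
        else:
            if current:
                groups.append(current)
            current = [t]
            current_hi = t.hi
    if current:
        groups.append(current)
    return groups


def split_at_boundaries(lo: int, hi: int, boundaries: List[int]) -> List[Tuple[int, int]]:
    """Cut the merged range [lo, hi] so each output falls on one side of every boundary."""
    pieces: List[Tuple[int, int]] = []
    start = lo
    for b in sorted(boundaries):
        if start < b <= hi:
            pieces.append((start, b - 1))
            start = b
    pieces.append((start, hi))
    return pieces


def plan_compactions(sstables: List[SSTable], boundaries: List[int], min_threshold: int = 2):
    """Return (input sstable names, output ranges) for each group worth compacting."""
    plans = []
    for group in overlapping_groups(sstables):
        if len(group) < min_threshold:
            continue  # nothing overlaps this sstable yet; leave it alone
        lo = min(t.lo for t in group)
        hi = max(t.hi for t in group)
        plans.append(([t.name for t in group], split_at_boundaries(lo, hi, boundaries)))
    return plans


tables = [
    SSTable("a", 0, 40), SSTable("b", 30, 120),
    SSTable("c", 200, 260), SSTable("d", 250, 290),
    SSTable("e", 400, 450),
]
plans = plan_compactions(tables, boundaries=[100, 200, 300])
```

In this example, `a` and `b` overlap and straddle the boundary at 100, so their merged output is split into two ranges; `c` and `d` overlap within a single bucket; `e` overlaps nothing and is left untouched. After such a pass, no two output ranges overlap, which is the property that lets a read for a clustering range touch a bounded number of sstables.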



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org