Posted to commits@cassandra.apache.org by "Constance Eustace (JIRA)" <ji...@apache.org> on 2018/02/27 19:41:00 UTC
[jira] [Created] (CASSANDRA-14279) Row Tombstones in separate sstables / separate compaction path

Constance Eustace created CASSANDRA-14279:
---------------------------------------------

             Summary: Row Tombstones in separate sstables / separate compaction path
                 Key: CASSANDRA-14279
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14279
             Project: Cassandra
          Issue Type: Improvement
            Reporter: Constance Eustace


In my experience, if data is not well organized into time-windowed sstables, Cassandra has enormous difficulty actually deleting data that has a "medium term" lifetime. For example, you might have an active working set and be archiving "unused" data to other tables or clusters, or you may be purging data, or migrating/sharding data. Whatever the case, you want that disk space back.

In STCS and LCS, row tombstones are intermingled with column data and column tombstones. But a row tombstone represents a big event: a large amount of "droppable" data in an sstable, or even a way to short-circuit reading data from other sstables.

I am wondering whether isolating row tombstones in their own sstables, separately compacted and merged, might enable compaction to work more efficiently:

Reads can prioritize bloom filter lookups that indicate a row tombstone, getting the timestamp of the deletion first, and can then use it in the data sstables to filter data, or skip an sstable's row entirely if the row data carries an overall "most recent data timestamp" that is not newer than the deletion.
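As a rough sketch of the read path described above (plain Python with in-memory dicts standing in for sstables; `DataSSTable`, `read_row` and the data layout are hypothetical illustrations, not Cassandra APIs, and bloom filters are left out), the deletion timestamp is looked up first and then used to skip or filter data:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for a "data" sstable:
# row key -> list of (column, value, write timestamp).
@dataclass
class DataSSTable:
    rows: Dict[str, List[Tuple[str, str, int]]]

    def max_timestamp(self, key: str) -> Optional[int]:
        """The per-row "most recent data timestamp", or None if the row is absent."""
        cells = self.rows.get(key, [])
        return max((ts for _, _, ts in cells), default=None)


def read_row(key: str,
             tombstone_tables: List[Dict[str, int]],
             data_tables: List[DataSSTable]) -> Dict[str, str]:
    # Step 1: consult the row tombstone sstables first and keep the
    # newest deletion timestamp for this key, if any.
    deleted_at = max((t[key] for t in tombstone_tables if key in t), default=None)

    merged: Dict[str, Tuple[str, int]] = {}
    for table in data_tables:
        newest = table.max_timestamp(key)
        if newest is None:
            continue  # this sstable has no data for the row
        # Step 2: short-circuit when everything in this sstable for the
        # row is at or before the deletion.
        if deleted_at is not None and newest <= deleted_at:
            continue
        # Step 3: otherwise filter cell by cell, keeping the newest write.
        for col, val, ts in table.rows[key]:
            if deleted_at is not None and ts <= deleted_at:
                continue
            if col not in merged or ts > merged[col][1]:
                merged[col] = (val, ts)
    return {col: val for col, (val, _) in merged.items()}
```

The sketch assumes a deletion shadows any cell whose timestamp is less than or equal to the deletion timestamp.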

Compaction could be forced to reference all the row tombstone sstables, so that every time two or more "data" sstables are compacted, their contents are checked against the row tombstones to purge data.
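A minimal sketch of that compaction idea, again with hypothetical names and dicts in place of sstables: the row tombstones are a read-only input, and any cell they shadow is dropped while the data sstables are merged. The tombstones themselves are only read here, never dropped.

```python
from typing import Dict, List, Tuple

# Hypothetical shapes:
#   data sstable:      row key -> {column: (value, write timestamp)}
#   row tombstone map: row key -> deletion timestamp
DataTable = Dict[str, Dict[str, Tuple[str, int]]]


def compact(data_tables: List[DataTable],
            row_tombstones: Dict[str, int]) -> DataTable:
    merged: DataTable = {}
    for table in data_tables:
        for key, cells in table.items():
            deleted_at = row_tombstones.get(key)
            for col, (val, ts) in cells.items():
                # Purge anything the row tombstone shadows.
                if deleted_at is not None and ts <= deleted_at:
                    continue
                row = merged.setdefault(key, {})
                # Ordinary merge rule: the newest write per column wins.
                if col not in row or ts > row[col][1]:
                    row[col] = (val, ts)
    return merged
```

A row whose cells are all shadowed simply never appears in the output, which is where the disk space comes back.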

In LCS, this would be particularly useful in getting data out of the upper levels without having to wait for data to trickle up the tree. The row tombstones, being read-only inputs into the data sstable compactions, can be referenced in each of the LCS levels' parallel compactors. 

Based on discussions in the dev list, this would appear to require some sort of customization to the memtable->sstable flushing process, and perhaps a different set of bloom filters. 
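To illustrate the kind of flush customization I mean (a sketch only; the mutation tuples and the `flush` function are hypothetical and bear no relation to Cassandra's actual memtable code), a flush would emit two outputs instead of one:

```python
from typing import Dict, List, Tuple

# Hypothetical memtable contents, as a list of mutations:
#   ("write", key, column, value, timestamp)
#   ("delete_row", key, timestamp)
def flush(memtable: List[tuple]) -> Tuple[Dict[str, Dict[str, Tuple[str, int]]],
                                          Dict[str, int]]:
    data: Dict[str, Dict[str, Tuple[str, int]]] = {}
    tombstones: Dict[str, int] = {}
    for mutation in memtable:
        if mutation[0] == "delete_row":
            _, key, ts = mutation
            # Keep only the newest row deletion per key.
            tombstones[key] = max(ts, tombstones.get(key, ts))
        else:
            _, key, col, val, ts = mutation
            row = data.setdefault(key, {})
            if col not in row or ts > row[col][1]:
                row[col] = (val, ts)
    # Two separate outputs: a "data" sstable and a row tombstone sstable.
    return data, tombstones
```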

Since the row tombstone sstables contain only <rowkey>,<tombstone timestamp> pairs, they should be comparatively small and take less time to compact. They could be aggressively compacted on a different schedule than "data" sstables.
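Compacting the tombstone sstables among themselves could be as simple as the following sketch (hypothetical names; each dict maps a row key to a tombstone timestamp), where the newest deletion per row key wins:

```python
from typing import Dict, List


def merge_tombstone_tables(tables: List[Dict[str, int]]) -> Dict[str, int]:
    merged: Dict[str, int] = {}
    for table in tables:
        for key, deleted_at in table.items():
            # A later deletion supersedes an earlier one for the same row.
            if key not in merged or deleted_at > merged[key]:
                merged[key] = deleted_at
    # Keep the result ordered by row key, as an sstable would be.
    return dict(sorted(merged.items()))
```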

In addition, it may be easier to repair/synchronize row tombstones across the cluster if they have already been separated into their own sstables.

Column/range tombstones may also benefit from a similar separation, but my guess is that those are so much more numerous, larger, and finer-grained that they might as well coexist with the data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)