Posted to commits@cassandra.apache.org by "Joey Lynch (Jira)" <ji...@apache.org> on 2019/11/04 19:31:00 UTC

[jira] [Comment Edited] (CASSANDRA-15379) Make it possible to flush with a different compression strategy than we compact with

    [ https://issues.apache.org/jira/browse/CASSANDRA-15379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16966938#comment-16966938 ] 

Joey Lynch edited comment on CASSANDRA-15379 at 11/4/19 7:30 PM:
-----------------------------------------------------------------

My rationale for the {{EnumSet}} over a boolean member function is:
 # Versus the boolean function idea, it doesn't break the {{ICompressor}} abstraction by letting compressors know that flushes exist. That is, it is very easy for an {{ICompressor}} author to claim to be good at {{FAST_COMPRESSION}}, but they probably can't make the call on whether that should be used in flushes or other situations. I could have an {{isFastCompressor}} boolean function, but given that {{ICompressor}} is a public API interface, I think sets of capabilities will be more maintainable than a collection of boolean functions going forward, especially if we start adding more capabilities (see #2).
 # If we go down the path of _not_ making more knobs and just try to have the database figure out the best way to compress data for users, this is easier to maintain long term, since compressors can offer multiple types of hints to the database. For example, the database might refuse to use slow compressors in flushes, commitlogs, etc., or compaction strategies might opt into higher ratio compression strategies in higher "levels". If we do go down this path there are fewer interface changes (instead of adding and removing functions we just add {{ICompressor.Uses}} hints).
 # Versus the set of strings idea, it has compile-time checks that are useful (which is the primary argument against sets of strings afaik).
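
To make the comparison concrete, here is a minimal sketch of the capability-set idea. It is not the actual patch: {{SketchCompressor}} is a simplified stand-in for {{ICompressor}}, the {{HIGH_RATIO}} constant and both example classes are hypothetical, and only the names {{Uses}}, {{FAST_COMPRESSION}} and {{suitableUses}} come from the discussion above.

```java
import java.util.EnumSet;
import java.util.Set;

// Simplified stand-in for ICompressor; the real interface's compression methods are omitted.
interface SketchCompressor
{
    // Capabilities a compressor can advertise. FAST_COMPRESSION is named in this ticket;
    // HIGH_RATIO is a hypothetical second capability added only for illustration.
    enum Uses { FAST_COMPRESSION, HIGH_RATIO }

    // A compressor states what it is good at, not where the database should use it.
    Set<Uses> suitableUses();
}

// Hypothetical compressor that only claims to compress quickly.
class FastSketchCompressor implements SketchCompressor
{
    public Set<Uses> suitableUses() { return EnumSet.of(Uses.FAST_COMPRESSION); }
}

// Hypothetical compressor that only claims a good ratio.
class DenseSketchCompressor implements SketchCompressor
{
    public Set<Uses> suitableUses() { return EnumSet.of(Uses.HIGH_RATIO); }
}

class CapabilitySketch
{
    // The database, not the compressor, decides that flushes need a fast compressor.
    static boolean usableForFlush(SketchCompressor compressor)
    {
        return compressor.suitableUses().contains(SketchCompressor.Uses.FAST_COMPRESSION);
    }

    public static void main(String[] args)
    {
        System.out.println(usableForFlush(new FastSketchCompressor()));
        System.out.println(usableForFlush(new DenseSketchCompressor()));
    }
}
```

Adding a new capability in this shape means adding one enum constant rather than one more boolean method on a public interface, and a misspelled capability fails at compile time instead of being silently ignored as a string would be.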

After thinking about this problem space more, I'm no longer convinced that giving general users more knobs here (the table properties) is the right choice. By using a {{suitableUses}} hint, the database can, in future 4.x releases, internally optimize:
 * Flushes: "get this data off my heap as fast as possible". We don't care about ratio (since the products will be re-compacted shortly) or decompression speed; we only care about compression speed.
 * Commitlog: "some compression is nice but get this data off my heap fast". We mostly care about compression speed, and only slightly about ratio.
 * Compaction: "The older the data the more compressed it should be". We care a lot about decompression speed and ratio, but don't want to pick expensive compressors at the high churn points (L0 in LCS, small tables in STCS, before the time window bucket in TWCS).
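
The three cases above could be sketched roughly as follows. This is only an illustration under stated assumptions: {{WritePathSketch}}, {{WritePath}}, {{Candidate}}, {{choose}}, the {{highChurn}} flag and the {{HIGH_RATIO}} constant are all hypothetical names, and the capabilities assigned to {{LZ4Compressor}} and {{ZstdCompressor}} in {{main}} are assumptions based on the ticket description (Zstd is slower to compress than LZ4), not on anything the real classes declare.

```java
import java.util.EnumSet;
import java.util.Set;

class WritePathSketch
{
    enum Uses { FAST_COMPRESSION, HIGH_RATIO }       // HIGH_RATIO is hypothetical
    enum WritePath { FLUSH, COMMITLOG, COMPACTION }  // hypothetical names for the three cases

    // Hypothetical description of a compressor: its name plus the hints it advertises.
    static final class Candidate
    {
        final String name;
        final Set<Uses> suitableUses;

        Candidate(String name, Set<Uses> suitableUses)
        {
            this.name = name;
            this.suitableUses = suitableUses;
        }
    }

    // Keep the configured compressor unless the write path needs speed and the
    // configured compressor does not advertise FAST_COMPRESSION.
    static Candidate choose(WritePath path, boolean highChurn, Candidate configured, Candidate fastFallback)
    {
        boolean needsSpeed = path == WritePath.FLUSH
                          || path == WritePath.COMMITLOG
                          || (path == WritePath.COMPACTION && highChurn); // e.g. L0 in LCS
        if (needsSpeed && !configured.suitableUses.contains(Uses.FAST_COMPRESSION))
            return fastFallback;
        return configured;
    }

    public static void main(String[] args)
    {
        // Assumed capability hints, for illustration only.
        Candidate lz4 = new Candidate("LZ4Compressor", EnumSet.of(Uses.FAST_COMPRESSION));
        Candidate zstd = new Candidate("ZstdCompressor", EnumSet.of(Uses.HIGH_RATIO));
        for (WritePath path : WritePath.values())
            System.out.println(path + " -> " + choose(path, false, zstd, lz4).name);
    }
}
```

With a table configured for the Zstd candidate, this sketch picks the LZ4 candidate for flushes and the commitlog and keeps Zstd for low-churn compaction, which matches the internal workaround described in the ticket without adding a user-facing knob.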

The interface still gives advanced users a backdoor: they can extend the compressor whose behavior they want to change and override which capabilities it offers.
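
As a rough sketch of that backdoor, assuming the hint is exposed as an overridable method: {{StockCompressor}} and {{OperatorTunedCompressor}} are hypothetical stand-ins rather than real Cassandra classes, and {{HIGH_RATIO}} is again a made-up capability.

```java
import java.util.EnumSet;
import java.util.Set;

class BackdoorSketch
{
    enum Uses { FAST_COMPRESSION, HIGH_RATIO } // HIGH_RATIO is hypothetical

    // Stand-in for a shipped compressor that only advertises a good ratio.
    static class StockCompressor
    {
        Set<Uses> suitableUses() { return EnumSet.of(Uses.HIGH_RATIO); }
    }

    // An advanced user who has benchmarked their own workload extends the
    // compressor and widens the advertised capabilities.
    static class OperatorTunedCompressor extends StockCompressor
    {
        @Override
        Set<Uses> suitableUses()
        {
            Set<Uses> uses = EnumSet.copyOf(super.suitableUses());
            uses.add(Uses.FAST_COMPRESSION);
            return uses;
        }
    }

    public static void main(String[] args)
    {
        System.out.println(new StockCompressor().suitableUses());
        System.out.println(new OperatorTunedCompressor().suitableUses());
    }
}
```

The compression behavior itself is inherited unchanged; only the hints the database sees are different.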

edit: I pinged this ticket into [slack|https://the-asf.slack.com/archives/CK23JSY2K/p1572881897039500] to seek more feedback.


> Make it possible to flush with a different compression strategy than we compact with
> ------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15379
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15379
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local/Compaction, Local/Config, Local/Memtable
>            Reporter: Joey Lynch
>            Assignee: Joey Lynch
>            Priority: Normal
>
> [~josnyder] and I have been testing out CASSANDRA-14482 (Zstd compression) on some of our densest clusters and have been observing close to a 50% reduction in footprint with Zstd on some of our workloads! Unfortunately, we have been running into an issue where the flush might take so long (Zstd is slower to compress than LZ4) that we can actually block the next flush and cause instability.
> Internally we are working around this with a very simple patch which flushes SSTables with the default compression strategy (LZ4) regardless of the table params. This is a simple solution, but I think the ideal solution might be for the flush compression strategy to be configurable separately from the table compression strategy (while defaulting to the same thing). Instead of adding yet another compression option to the yaml (like hints and commitlog), I was thinking of just adding it to the table parameters and then adding a {{default_table_parameters}} yaml option like:
> {noformat}
> # Default table properties to apply on freshly created tables. The currently supported defaults are:
> # * compression       : How are SSTables compressed in general (flush, compaction, etc ...)
> # * flush_compression : How are SSTables compressed as they flush
> # supported
> default_table_parameters:
>   compression:
>     class_name: 'LZ4Compressor'
>     parameters:
>       chunk_length_in_kb: 16
>   flush_compression:
>     class_name: 'LZ4Compressor'
>     parameters:
>       chunk_length_in_kb: 4
> {noformat}
> This would also have the nice effect of giving our configuration a path forward to providing user-specified defaults for table creation (so e.g. if a particular user wanted to use a different default chunk_length_in_kb they could do that).
> So the proposed (~mandatory) scope is:
> * Flush with a faster compression strategy
> I'd like to implement the following at the same time:
> * Per table flush compression configuration
> * Ability to default the table flush and compaction compression in the yaml.


