Posted to commits@cassandra.apache.org by "Marcus Eriksson (Jira)" <ji...@apache.org> on 2022/12/22 18:54:00 UTC

[jira] [Commented] (CASSANDRA-18123) Reuse of metadata collector can break key count calculation

    [ https://issues.apache.org/jira/browse/CASSANDRA-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17651385#comment-17651385 ] 

Marcus Eriksson commented on CASSANDRA-18123:
---------------------------------------------

I don't think this affects any of our current {{SSTableMultiWriter}} implementations, right? {{RangeAwareSSTableWriter}} creates a new collector for every new sstable, and {{SimpleSSTableMultiWriter}} only ever creates a single sstable.
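
For illustration, here is a minimal sketch of the difference between reusing one collector and creating a new one per sstable. It is not Cassandra code: {{CollectorReuseSketch}}, {{KeyCollector}} and {{reportedKeyCounts}} are hypothetical names, an exact {{HashSet}} stands in for the probabilistic cardinality estimator, and the sketch assumes every output reads its key count only after all keys have been written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model, not Cassandra's classes.
final class CollectorReuseSketch {

    // Stand-in for the cardinality part of a metadata collector.
    // An exact set replaces the real estimator so the result is deterministic.
    static final class KeyCollector {
        private final Set<Integer> keys = new HashSet<>();

        void addKey(int key) {
            keys.add(key);
        }

        int estimatedKeys() {
            return keys.size();
        }
    }

    // Spreads totalKeys keys evenly over `sstables` outputs and returns the
    // key count that each output would record in its metadata.
    static int[] reportedKeyCounts(int totalKeys, int sstables, boolean reuseCollector) {
        KeyCollector shared = new KeyCollector();
        List<KeyCollector> collectors = new ArrayList<>();
        for (int i = 0; i < sstables; i++) {
            // Reuse hands every output the same instance; otherwise each gets its own.
            collectors.add(reuseCollector ? shared : new KeyCollector());
        }
        for (int key = 0; key < totalKeys; key++) {
            collectors.get(key % sstables).addKey(key);
        }
        int[] reported = new int[sstables];
        for (int i = 0; i < sstables; i++) {
            reported[i] = collectors.get(i).estimatedKeys();
        }
        return reported;
    }

    public static void main(String[] args) {
        // Shared collector: every output reports the keys of all outputs.
        System.out.println(Arrays.toString(reportedKeyCounts(1000, 4, true)));
        // Fresh collector per output: each reports only its own keys.
        System.out.println(Arrays.toString(reportedKeyCounts(1000, 4, false)));
    }
}
```

With 1000 keys over 4 outputs, the shared variant reports 1000 keys for each output while the per-output variant reports 250 each. A writer that always creates a new collector per sstable, or that only ever writes one sstable, stays on the second path.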

> Reuse of metadata collector can break key count calculation
> -----------------------------------------------------------
>
>                 Key: CASSANDRA-18123
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18123
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Compaction
>            Reporter: Branimir Lambov
>            Priority: Normal
>             Fix For: 3.0.x, 3.11.x, 4.0.x, 4.1.x, 4.x
>
>
> When flushing a memtable we currently pass a constructed {{MetadataCollector}} to the {{SSTableMultiWriter}} that is used for writing sstables. The latter may decide to split the data into multiple sstables (e.g. for separate disks or driven by compaction strategy). If it does so, the cardinality estimation component in the reused {{MetadataCollector}} for each individual sstable contains the data for all of them.
> As a result, when such sstables are compacted, the estimated number of keys in the resulting sstables, which is used to determine the size of the bloom filter for the compaction result, is far too high.
> This results in much bigger L1 bloom filters than necessary. One example (which came about during testing of the upcoming CEP-26, after insertion of 100GB data with 10% reads):
> (current)
> {code}
>  		Bloom filter false positives: 22627369
>  		Bloom filter false ratio: 0.02257
>  		Bloom filter space used: 1848247864
>  		Bloom filter off heap memory used: 2338964088
> {code}
> (fixed)
> {code}
>  		Bloom filter false positives: 24426545
>  		Bloom filter false ratio: 0.02429
>  		Bloom filter space used: 1118910096
>  		Bloom filter off heap memory used: 1532357432
> {code}
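
In the figures quoted above, bloom filter space drops from 1848247864 to 1118910096 bytes (about 39% less) while the false positive ratio moves from 0.02257 to 0.02429. To show why an inflated key count translates directly into filter size, here is a rough sketch using the textbook bloom filter sizing formula, bits = -n * ln(p) / (ln 2)^2. This is an assumption for illustration only: Cassandra's own sizing code may differ, and {{BloomSizingSketch}} and {{requiredBits}} are hypothetical names.

```java
// Hypothetical illustration using the textbook formula, not Cassandra's sizing code.
final class BloomSizingSketch {

    // Bits needed for `expectedKeys` entries at false positive probability `fpChance`,
    // assuming an optimal number of hash functions.
    static long requiredBits(long expectedKeys, double fpChance) {
        if (expectedKeys <= 0 || fpChance <= 0.0 || fpChance >= 1.0) {
            throw new IllegalArgumentException("expectedKeys must be > 0 and 0 < fpChance < 1");
        }
        double ln2 = Math.log(2.0);
        return (long) Math.ceil(-expectedKeys * Math.log(fpChance) / (ln2 * ln2));
    }

    public static void main(String[] args) {
        long actualKeys = 1_000_000L;
        long overestimatedKeys = 4 * actualKeys; // e.g. four sstables sharing one collector
        double fpChance = 0.01;

        long needed = requiredBits(actualKeys, fpChance);
        long allocated = requiredBits(overestimatedKeys, fpChance);

        System.out.println("bits for actual key count:        " + needed);
        System.out.println("bits for overestimated key count: " + allocated);
        System.out.println("ratio: " + (double) allocated / needed);
    }
}
```

Under this formula the size is linear in the expected key count, so an estimate that is several times too high costs the same factor in filter memory. The extra bits lower the false positive rate only slightly relative to the memory spent, which is consistent with the small change in false ratio in the numbers above.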


