Posted to user@cassandra.apache.org by Chris Hart <ch...@remilon.com> on 2014/12/17 19:32:33 UTC

High Bloom Filter FP Ratio

Hi,

I have created the following table with bloom_filter_fp_chance=0.01:

CREATE TABLE logged_event (
  time_key bigint,
  partition_key_randomizer int,
  resource_uuid timeuuid,
  event_json text,
  event_type text,
  field_error_list map<text, text>,
  javascript_timestamp timestamp,
  javascript_uuid uuid,
  page_impression_guid uuid,
  page_request_guid uuid,
  server_received_timestamp timestamp,
  session_id bigint,
  PRIMARY KEY ((time_key, partition_key_randomizer), resource_uuid)
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.000000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};


When I run cfstats, I see a much higher false positive ratio than that:

                Table: logged_event
                SSTable count: 15
                Space used (live), bytes: 104128214227
                Space used (total), bytes: 104129482871
                SSTable Compression Ratio: 0.3295840184239226
                Number of keys (estimate): 199293952
                Memtable cell count: 56364
                Memtable data size, bytes: 20903960
                Memtable switch count: 148
                Local read count: 1396402
                Local read latency: 0.362 ms
                Local write count: 2345306
                Local write latency: 0.062 ms
                Pending tasks: 0
                Bloom filter false positives: 147705
                Bloom filter false ratio: 0.49020
                Bloom filter space used, bytes: 249129040
                Compacted partition minimum bytes: 447
                Compacted partition maximum bytes: 315852
                Compacted partition mean bytes: 1636
                Average live cells per slice (last five minutes): 0.0
                Average tombstones per slice (last five minutes): 0.0

Any idea what could be causing this?  This is time-series data.  Every time we read from this table, we read a single time_key with 1000 partition_key_randomizer values.  I'm running Cassandra 2.0.11.  I tried running upgradesstables to rewrite the sstables, which didn't change this behavior at all.  I'm using size-tiered compaction and haven't done any major compactions.
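
For reference, a read looks roughly like this (the literal values are
made up, and the real IN list contains all 1000 randomizer values):

    SELECT * FROM logged_event
     WHERE time_key = 1418860800
       AND partition_key_randomizer IN (0, 1, 2, 999);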

Thanks,
Chris

Re: High Bloom Filter FP Ratio

Posted by Chris Hart <ch...@remilon.com>.
Hi Tyler,

I tried what you said and false positives look much more reasonable there.  Thanks for looking into this.

-Chris

----- Original Message -----
From: "Tyler Hobbs" <ty...@datastax.com>
To: user@cassandra.apache.org
Sent: Friday, December 19, 2014 1:25:29 PM
Subject: Re: High Bloom Filter FP Ratio

I took a look at the code where the bloom filter true/false positive
counters are updated and noticed that the true-positive count isn't being
updated on key cache hits:
https://issues.apache.org/jira/browse/CASSANDRA-8525.  That may explain
your ratios.

Can you try querying for a few non-existent partition keys in cqlsh with
tracing enabled (just run "TRACING ON") and see if you really do get that
high of a false-positive ratio?

On Fri, Dec 19, 2014 at 9:59 AM, Mark Greene <gr...@gmail.com> wrote:
> [quoted messages snipped; see Mark Greene's post below]

-- 
Tyler Hobbs
DataStax <http://datastax.com/>

Re: High Bloom Filter FP Ratio

Posted by Tyler Hobbs <ty...@datastax.com>.
I took a look at the code where the bloom filter true/false positive
counters are updated and noticed that the true-positive count isn't being
updated on key cache hits:
https://issues.apache.org/jira/browse/CASSANDRA-8525.  That may explain
your ratios.
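
To put numbers on that: assuming the ratio is computed as
falsePositives / (falsePositives + truePositives) (treat the exact
formula as my assumption), your cfstats output implies

    0.49020 ~= 147705 / (147705 + 153611)

so only ~154k true positives were ever counted against ~1.4M local
reads.  If the reads answered from the key cache had also been counted
as true positives, the denominator would be much larger and the
reported ratio correspondingly lower.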

Can you try querying for a few non-existent partition keys in cqlsh with
tracing enabled (just run "TRACING ON") and see if you really do get that
high of a false-positive ratio?
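
For example (the key values here are made up; any partition you know
was never written will do):

    TRACING ON;
    SELECT * FROM logged_event
     WHERE time_key = -1 AND partition_key_randomizer = -1;

If the filters are healthy, the trace should show sstables being
skipped by the bloom filter rather than data files being read for a
key that doesn't exist.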

On Fri, Dec 19, 2014 at 9:59 AM, Mark Greene <gr...@gmail.com> wrote:
> [quoted messages snipped; see Mark Greene's post below]

-- 
Tyler Hobbs
DataStax <http://datastax.com/>

Re: High Bloom Filter FP Ratio

Posted by Mark Greene <gr...@gmail.com>.
We're seeing similar behavior except our FP ratio is closer to 1.0 (100%).

We're using Cassandra 2.1.2.


Schema
-----------------------------------------------------------------------
CREATE TABLE contacts.contact (
    id bigint,
    property_id int,
    created_at bigint,
    updated_at bigint,
    value blob,
    PRIMARY KEY (id, property_id)
) WITH CLUSTERING ORDER BY (property_id ASC)
    AND bloom_filter_fp_chance = 0.001
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'min_threshold': '4', 'class':
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy',
'max_threshold': '32'}
    AND compression = {'sstable_compression':
'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

CF Stats Output:
-------------------------------------------------------------------------
Keyspace: contacts
    Read Count: 2458375
    Read Latency: 0.8528440766766665 ms.
    Write Count: 10357
    Write Latency: 0.1816912233272183 ms.
    Pending Flushes: 0
        Table: contact
        SSTable count: 61
        SSTables in each level: [1, 10, 50, 0, 0, 0, 0, 0, 0]
        Space used (live): 9047112471
        Space used (total): 9047112471
        Space used by snapshots (total): 0
        SSTable Compression Ratio: 0.34119240020241487
        Memtable cell count: 24570
        Memtable data size: 1299614
        Memtable switch count: 2
        Local read count: 2458290
        Local read latency: 0.853 ms
        Local write count: 10044
        Local write latency: 0.186 ms
        Pending flushes: 0
        Bloom filter false positives: 11096
        Bloom filter false ratio: 0.99197
        Bloom filter space used: 3923784
        Compacted partition minimum bytes: 373
        Compacted partition maximum bytes: 152321
        Compacted partition mean bytes: 9938
        Average live cells per slice (last five minutes): 37.57851240677983
        Maximum live cells per slice (last five minutes): 63.0
        Average tombstones per slice (last five minutes): 0.0
        Maximum tombstones per slice (last five minutes): 0.0

--
about.me <http://about.me/markgreene>

On Wed, Dec 17, 2014 at 1:32 PM, Chris Hart <ch...@remilon.com> wrote:
> [quoted message snipped; see the original post at the top of this thread]