Posted to user@cassandra.apache.org by Anishek Agarwal <an...@gmail.com> on 2016/02/18 08:35:10 UTC

High Bloom filter false ratio

Hello,

We have a table with a composite partition key of very high cardinality;
it is a combination of (long, long). On the table we have
bloom_filter_fp_chance=0.010000.

Running "nodetool cfstats" on the 5 nodes we have in the cluster, we are
seeing "Bloom filter false ratio:" in the range of 0.7-0.9.

I thought that over time the bloom filter would adjust to the keyspace
cardinality. We have been running the cluster for a long time now, but have
added significant traffic since January this year; this traffic does not lead
to writes in the DB, but does lead to a high volume of reads that check
whether any values exist.

Are there any settings that can be changed to get a better ratio?
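
(For reference, the per-table setting in question can be lowered from cqlsh; this is just a sketch with placeholder keyspace/table names. The new value only applies to sstables written after the change, so existing sstables would need to be rewritten:)

cqlsh -e "ALTER TABLE my_keyspace.user_stay_points WITH bloom_filter_fp_chance = 0.001;"
nodetool upgradesstables -a my_keyspace user_stay_points  # rewrite existing sstables so the new filter applies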

Thanks
Anishek

Re: High Bloom filter false ratio

Posted by Chris Lohfink <cl...@gmail.com>.
>
> SSTable count: 1289


That's seriously wrong, and pretty horrific if this table is using
size-tiered compaction. Is compaction not keeping up, or hung? That may be
what's affecting your BF FP ratio as well.
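
A quick way to check, assuming nodetool access to the affected node:

nodetool compactionstats  # pending compaction count and any currently running tasks
nodetool tpstats          # look for pending/blocked CompactionExecutor threads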

On Thu, Feb 18, 2016 at 9:52 PM, Anishek Agarwal <an...@gmail.com> wrote:

> Hey all,
>
> @Jaydeep here is the cfstats output from one node.
>
> Read Count: 1721134722
>
> Read Latency: 0.04268825050756254 ms.
>
> Write Count: 56743880
>
> Write Latency: 0.014650376727851532 ms.
>
> Pending Tasks: 0
>
> Table: user_stay_points
>
> SSTable count: 1289
>
> Space used (live), bytes: 122141272262
>
> Space used (total), bytes: 224227850870
>
> Off heap memory used (total), bytes: 653827528
>
> SSTable Compression Ratio: 0.4959736121441446
>
> Number of keys (estimate): 345137664
>
> Memtable cell count: 339034
>
> Memtable data size, bytes: 106558314
>
> Memtable switch count: 3266
>
> Local read count: 1721134803
>
> Local read latency: 0.048 ms
>
> Local write count: 56743898
>
> Local write latency: 0.018 ms
>
> Pending tasks: 0
>
> Bloom filter false positives: 40664437
>
> Bloom filter false ratio: 0.69058
>
> Bloom filter space used, bytes: 493777336
>
> Bloom filter off heap memory used, bytes: 493767024
>
> Index summary off heap memory used, bytes: 91677192
>
> Compression metadata off heap memory used, bytes: 68383312
>
> Compacted partition minimum bytes: 104
>
> Compacted partition maximum bytes: 1629722
>
> Compacted partition mean bytes: 1773
>
> Average live cells per slice (last five minutes): 0.0
>
> Average tombstones per slice (last five minutes): 0.0
>
>
> @Tyler Hobbs
>
> we are using Cassandra 2.0.15, so
> https://issues.apache.org/jira/browse/CASSANDRA-8525  shouldn't occur.
> The other problems look like they will be fixed in 3.0; we will mostly try
> to slot in an upgrade to a 3.x version towards the second quarter of this year.
>
>
> @Daemon
>
> Latencies seem to have higher ratios, attached is the graph.
>
>
> I am mostly trying to look at Bloom filters because of the way we do reads:
> we read data with non-existent partition keys, and it seems to take a long
> time to respond; for 720 queries it takes 2 seconds, with all 720 queries
> returning nothing. The 720 queries are done in 4 sequential batches of 180
> queries each, with the 180 queries in each batch running in parallel.
>
>
> thanks
>
> anishek
>
>
>
> On Fri, Feb 19, 2016 at 3:09 AM, Jaydeep Chovatia <
> chovatia.jaydeep@gmail.com> wrote:
>
>> How many partition keys exist for the table which shows this problem (or
>> provide nodetool cfstats for that table)?
>>
>> On Thu, Feb 18, 2016 at 11:38 AM, daemeon reiydelle <da...@gmail.com>
>> wrote:
>>
>>> The bloom filter buckets the values in a small number of buckets. I have
>>> been surprised by how many cases I see with large cardinality where a few
>>> values populate a given bloom leaf, resulting in high false positives, and
>>> a surprising impact on latencies!
>>>
>>> Are you seeing 2:1 ranges between mean and worst-case latencies
>>> (allowing for GC times)?
>>>
>>> Daemeon Reiydelle
>>> On Feb 18, 2016 8:57 AM, "Tyler Hobbs" <ty...@datastax.com> wrote:
>>>
>>>> You can try slightly lowering the bloom_filter_fp_chance on your table.
>>>>
>>>> Otherwise, it's possible that you're repeatedly querying one or two
>>>> partitions that always trigger a bloom filter false positive.  You could
>>>> try manually tracing a few queries on this table (for non-existent
>>>> partitions) to see if the bloom filter rejects them.
>>>>
>>>> Depending on your Cassandra version, your false positive ratio could be
>>>> inaccurate: https://issues.apache.org/jira/browse/CASSANDRA-8525
>>>>
>>>> There are also a couple of recent improvements to bloom filters:
>>>> * https://issues.apache.org/jira/browse/CASSANDRA-8413
>>>> * https://issues.apache.org/jira/browse/CASSANDRA-9167
>>>>
>>>>
>>>> On Thu, Feb 18, 2016 at 1:35 AM, Anishek Agarwal <an...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We have a table with a composite partition key of very high
>>>>> cardinality; it is a combination of (long, long). On the table we have
>>>>> bloom_filter_fp_chance=0.010000.
>>>>>
>>>>> Running "nodetool cfstats" on the 5 nodes we have in the cluster, we
>>>>> are seeing "Bloom filter false ratio:" in the range of 0.7-0.9.
>>>>>
>>>>> I thought that over time the bloom filter would adjust to the keyspace
>>>>> cardinality. We have been running the cluster for a long time now, but
>>>>> have added significant traffic since January this year; this does not
>>>>> lead to writes in the DB but does lead to a high volume of reads that
>>>>> check whether any values exist.
>>>>>
>>>>> Are there any settings that can be changed to get a better ratio?
>>>>>
>>>>> Thanks
>>>>> Anishek
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Tyler Hobbs
>>>> DataStax <http://datastax.com/>
>>>>
>>>
>>
>

RE: High Bloom filter false ratio

Posted by SE...@homedepot.com.
I see the sstablemetadata tool as far back as 1.2.19 (in tools/bin).
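
For example, from a binary install (the data file path and sstable name are placeholders):

tools/bin/sstablemetadata /var/lib/cassandra/data/my_keyspace/user_stay_points/my_keyspace-user_stay_points-jb-1234-Data.db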


Sean Durity
From: Anishek Agarwal [mailto:anishek@gmail.com]
Sent: Tuesday, February 23, 2016 3:37 AM
To: user@cassandra.apache.org
Subject: Re: High Bloom filter false ratio

Looks like sstablemetadata is available in 2.2; we are on 2.0.x. Do you know of anything that will work on 2.0.x?

On Tue, Feb 23, 2016 at 1:48 PM, Anishek Agarwal <an...@gmail.com>> wrote:
Thanks Jeff. Awesome, I will look at the tools and the JMX endpoint.

Our settings are below; they originated from the JIRA you posted above as the base. We are running on 48-core machines with 2 SSD disks of 800 GB each.


MAX_HEAP_SIZE="6G"
HEAP_NEWSIZE="4G"
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=6"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=4"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"
JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"
JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"
JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
# earlier value 131072
JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32678"
JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"
JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32678"
JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32678"


On Tue, Feb 23, 2016 at 1:06 PM, Jeff Jirsa <je...@crowdstrike.com>> wrote:
There exists a JMX endpoint called forceUserDefinedCompaction that takes a comma separated list of sstables to compact together.

There also exists a tool called sstablemetadata (may be in a ‘cassandra-tools’ package separate from whatever package you used to install cassandra, or in the tools/ directory of your binary package). Using sstablemetadata, you can look at the maxTimestamp for each table, and the ‘Estimated droppable tombstones’. Using those two fields, you could, very easily, write a script that gives you a list of sstables that you could feed to forceUserDefinedCompaction to join together to eliminate leftover waste.
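
A rough sketch of such a script (paths are placeholders, and the exact sstablemetadata output labels may vary by version):

for f in /var/lib/cassandra/data/my_keyspace/user_stay_points/*-Data.db; do
  ts=$(sstablemetadata "$f" | awk '/Maximum timestamp/ {print $3}')
  dt=$(sstablemetadata "$f" | awk '/Estimated droppable tombstones/ {print $4}')
  echo "$ts $dt $f"
done | sort -n  # oldest first; old sstables with high droppable ratios are candidates for forceUserDefinedCompaction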

Your long ParNew times may be fixable by increasing the new gen size of your heap – the general guidance in cassandra-env.sh is out of date, you may want to reference CASSANDRA-8150 for “newer” advice ( http://issues.apache.org/jira/browse/CASSANDRA-8150 )
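
For example, in cassandra-env.sh (illustrative values only, not a recommendation; see the ticket for the actual tuning discussion):

MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="3G"   # larger new gen: generally fewer, but longer, ParNew pauses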

- Jeff

From: Anishek Agarwal
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>"
Date: Monday, February 22, 2016 at 8:33 PM

To: "user@cassandra.apache.org<ma...@cassandra.apache.org>"
Subject: Re: High Bloom filter false ratio

Hey Jeff,

Thanks for the clarification; I did not explain myself clearly. max_sstable_age_days is set to 30 days, and the TTL on every insert is also set to 30 days by default. gc_grace_seconds is 0, so I would think the sstable as a whole would be deleted.
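
(The table-level equivalents of those settings would look like the following sketch, with placeholder names; 2592000 seconds = 30 days:)

cqlsh -e "ALTER TABLE my_keyspace.user_stay_points WITH default_time_to_live = 2592000 AND gc_grace_seconds = 0;"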

Because of the problems mentioned in 1) above, it looks like there might be cases where an sstable just lies around: no compaction is happening on it, and even though everything in it is expired, it would still not be deleted?

For 3): the average read is pretty good, though the throughput doesn't seem to be that great. When no repair is running we get GCInspector pauses > 200ms once every couple of hours; otherwise it's every 10-20 mins:

INFO [ScheduledTasks:1] 2016-02-23 05:15:03,070 GCInspector.java (line 116) GC for ParNew: 205 ms for 1 collections, 1712439128 used; max is 7784628224
INFO [ScheduledTasks:1] 2016-02-23 08:30:47,709 GCInspector.java (line 116) GC for ParNew: 242 ms for 1 collections, 1819126928 used; max is 7784628224
INFO [ScheduledTasks:1] 2016-02-23 09:09:55,085 GCInspector.java (line 116) GC for ParNew: 374 ms for 1 collections, 1829660304 used; max is 7784628224
INFO [ScheduledTasks:1] 2016-02-23 09:11:21,245 GCInspector.java (line 116) GC for ParNew: 419 ms for 1 collections, 2309875224 used; max is 7784628224
INFO [ScheduledTasks:1] 2016-02-23 09:35:50,717 GCInspector.java (line 116) GC for ParNew: 231 ms for 1 collections, 2515325328 used; max is 7784628224
INFO [ScheduledTasks:1] 2016-02-23 09:38:47,194 GCInspector.java (line 116) GC for ParNew: 252 ms for 1 collections, 1724241952 used; max is 7784628224


Our read patterns depend on the BF working efficiently, as we do a lot of reads for keys that may not exist: the data is a time series, and we segregate it on hourly boundaries from the epoch.


Hey Christopher,

Yes, every row in the sstable that should have been deleted has "d" in that column. Also, the key for one of the rows is:

"key": "0008000000000cdd5edd000008000000000006251000"



How do I convert that back to a normal readable format to get the (long, long) composite partition key back?
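
(A rough sketch of the decoding, assuming the standard CompositeType layout of a 2-byte big-endian length, the component bytes, then one end-of-component byte per component:)

key=0008000000000cdd5edd000008000000000006251000
first=${key:4:16}    # skip the 2-byte length prefix, take the 8-byte long
second=${key:26:16}  # skip the end-of-component byte and the next length prefix
printf 'first=%d second=%d\n' "$((16#$first))" "$((16#$second))"
# prints: first=215834333 second=402704
# Read as hours since the epoch, 402704 * 3600 = 1449734400, i.e. 2015-12-10 UTC,
# which would fit the hourly-bucket scheme and the old December data described above.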

Looks like I have to force a major compaction to delete a lot of data? Are there any other solutions?

thanks
anishek



On Mon, Feb 22, 2016 at 11:21 PM, Jeff Jirsa <je...@crowdstrike.com>> wrote:
1) getFullyExpiredSSTables in 2.0 isn’t as thorough as many expect, so it’s very likely that some sstables stick around longer than you expect.

2) max_sstable_age_days tells cassandra when to stop compacting that file, not when to delete it.

3) You can change the window size using both the base_time_seconds parameter and max_sstable_age_days parameter (use the former to set the size of the first window, and the latter to determine how long before you stop compacting that window). It’s somewhat non-intuitive.
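
For example (placeholder names; the option values shown are the ones being discussed, not recommendations):

cqlsh -e "ALTER TABLE my_keyspace.user_stay_points WITH compaction = {'class': 'DateTieredCompactionStrategy', 'base_time_seconds': 3600, 'max_sstable_age_days': 30};"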

Your read latencies actually look pretty reasonable, are you sure you’re not simply hitting GC pauses that cause your queries to run longer than you expect? Do you have graphs of GC time (first derivative of total gc time is common for tools like graphite), or do you see ‘gcinspector’ in your logs indicating pauses > 200ms?

From: Anishek Agarwal
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>"
Date: Sunday, February 21, 2016 at 11:13 PM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>"
Subject: Re: High Bloom filter false ratio

Hey guys,

Just did some more digging ... it looks like DTCS is not removing old data completely. I used sstable2json for one such table and saw old data there. We have a value of 30 for max_sstable_age_days for the table.

One of the columns showed data as ["2015-12-10 11\\:03+0530:", "56690ea2", 1449725602552000, "d"]. What is the meaning of "d" in the last IS_MARKED_FOR_DELETE column?

I see data from 10 Dec 2015 still there, so it looks like there are a few issues with DTCS. Operationally, what choices do I have to rectify this? We are on version 2.0.15.

thanks
anishek




On Mon, Feb 22, 2016 at 10:23 AM, Anishek Agarwal <an...@gmail.com>> wrote:
We are using DTCS with a 30-day window before sstables are cleaned up. I don't think with DTCS we can do anything about sstable sizing. Please do let me know if there are other ideas.

On Sat, Feb 20, 2016 at 12:51 AM, Jaydeep Chovatia <ch...@gmail.com>> wrote:
To me, the following three look on the higher side:
SSTable count: 1289
In order to reduce the SSTable count, check whether compaction is keeping up or not (if using STCS). Is it possible to change this to LCS?

Number of keys (estimate): 345137664 (345M partition keys)
I don't have any suggestion about reducing this unless you partition your data.

Bloom filter space used, bytes: 493777336 (~470 MB is huge)
If the number of keys is reduced, then this will automatically reduce the bloom filter size, I believe.
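
As a rough sanity check on that number: at bloom_filter_fp_chance = 0.01 an optimal bloom filter needs about -ln(0.01)/ln(2)^2 ≈ 9.6 bits per key, so for ~345M keys:

echo '345137664 * 9.6 / 8 / 1024^2' | bc -l   # ≈ 395 (MB)

which is in the same ballpark as the ~471 MB reported; the gap is plausibly overhead from keeping 1289 separate per-sstable filters.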


Jaydeep


Re: High Bloom filter false ratio

Posted by Jeff Jirsa <je...@crowdstrike.com>.
sstablemetadata definitely exists for 2.0 – it may be in a different location, but it exists.

If all else fails, it's a 50-line bash script; grab it from here:

https://github.com/apache/cassandra/blob/cassandra-2.0/tools/bin/sstablemetadata
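
A sketch of fetching and running it (the raw URL is derived from the link above; note that the script expects to run inside a Cassandra install so it can locate the server jars via cassandra.in.sh):

curl -LO https://raw.githubusercontent.com/apache/cassandra/cassandra-2.0/tools/bin/sstablemetadata
chmod +x sstablemetadata
./sstablemetadata /path/to/my_keyspace-user_stay_points-jb-1234-Data.db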




Re: High Bloom filter false ratio

Posted by Anishek Agarwal <an...@gmail.com>.
Looks like sstablemetadata is available in 2.2; we are on 2.0.x. Do you know of anything that will work on 2.0.x?


Re: High Bloom filter false ratio

Posted by Anishek Agarwal <an...@gmail.com>.
Thanks Jeff. Awesome, I will look at the tools and the JMX endpoint.

Our settings are below; they originated from the JIRA you posted above as the base. We are running on 48-core machines with 2 SSD disks of 800 GB each.

MAX_HEAP_SIZE="6G"
HEAP_NEWSIZE="4G"
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=6"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=4"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"
JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"
JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"
JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
# earlier value 131072
JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32678"
JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"
JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32678"
JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32678"



Re: High Bloom filter false ratio

Posted by Jeff Jirsa <je...@crowdstrike.com>.
There exists a JMX endpoint called forceUserDefinedCompaction that takes a comma-separated list of sstables to compact together.

There also exists a tool called sstablemetadata (it may be in a ‘cassandra-tools’ package separate from whatever package you used to install cassandra, or in the tools/ directory of your binary package). Using sstablemetadata, you can look at the maxTimestamp and the ‘Estimated droppable tombstones’ for each sstable. Using those two fields, you could very easily write a script that gives you a list of sstables to feed to forceUserDefinedCompaction to join together and eliminate leftover waste.
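
A minimal sketch of such a script, assuming sstablemetadata is on the PATH and prints lines like ‘Estimated droppable tombstones: 0.85’; the data path and threshold are illustrative:

    #!/usr/bin/env python
    # Sketch: list sstables that are mostly droppable tombstones, as
    # candidates to feed to the forceUserDefinedCompaction JMX endpoint.
    import glob
    import re
    import subprocess

    THRESHOLD = 0.8  # illustrative cutoff

    candidates = []
    for path in sorted(glob.glob('/var/lib/cassandra/data/ks/cf/*-Data.db')):
        out = subprocess.check_output(['sstablemetadata', path]).decode()
        # maxTimestamp could be parsed the same way to confirm the whole
        # file is past its TTL before compacting it.
        m = re.search(r'Estimated droppable tombstones:\s*([\d.]+)', out)
        if m and float(m.group(1)) > THRESHOLD:
            candidates.append(path)

    # Hand this comma-separated list to forceUserDefinedCompaction via any
    # JMX client (e.g. jmxterm).
    print(','.join(candidates))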

Your long ParNew times may be fixable by increasing the new gen size of your heap – the general guidance in cassandra-env.sh is out of date, so you may want to reference CASSANDRA-8150 for “newer” advice ( http://issues.apache.org/jira/browse/CASSANDRA-8150 )
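
For reference, the new gen is sized in cassandra-env.sh; a sketch with purely illustrative values (tune against your own GC logs and the CASSANDRA-8150 discussion):

    # cassandra-env.sh -- illustrative values only, not a recommendation
    MAX_HEAP_SIZE="8G"
    HEAP_NEWSIZE="2G"   # larger new gen than the stock per-core default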

- Jeff


Re: High Bloom filter false ratio

Posted by Anishek Agarwal <an...@gmail.com>.
Hey Jeff,

Thanks for the clarification. I did not explain myself clearly: max_sstable_age_days is set to 30 days, and the ttl on every insert is also set to 30 days by default. gc_grace_seconds is 0, so I would think the sstable as a whole would be deleted.

Because of the problems mentioned in 1) above, it looks like there might be cases where an sstable just lies around since no compaction is happening on it, and even though everything in it is expired it would still not be deleted?

For 3), the average read is pretty good, though the throughput doesn't seem to be that great. When no repair is running we get GCInspector pauses > 200ms once every couple of hours; otherwise it's every 10-20 mins:

INFO [ScheduledTasks:1] 2016-02-23 05:15:03,070 GCInspector.java (line 116) GC for ParNew: 205 ms for 1 collections, 1712439128 used; max is 7784628224
INFO [ScheduledTasks:1] 2016-02-23 08:30:47,709 GCInspector.java (line 116) GC for ParNew: 242 ms for 1 collections, 1819126928 used; max is 7784628224
INFO [ScheduledTasks:1] 2016-02-23 09:09:55,085 GCInspector.java (line 116) GC for ParNew: 374 ms for 1 collections, 1829660304 used; max is 7784628224
INFO [ScheduledTasks:1] 2016-02-23 09:11:21,245 GCInspector.java (line 116) GC for ParNew: 419 ms for 1 collections, 2309875224 used; max is 7784628224
INFO [ScheduledTasks:1] 2016-02-23 09:35:50,717 GCInspector.java (line 116) GC for ParNew: 231 ms for 1 collections, 2515325328 used; max is 7784628224
INFO [ScheduledTasks:1] 2016-02-23 09:38:47,194 GCInspector.java (line 116) GC for ParNew: 252 ms for 1 collections, 1724241952 used; max is 7784628224


Our reading patterns are dependent on the BF working efficiently, as we do a lot of reads for keys that may not exist: it's time series data and we segregate it on hourly boundaries from epoch.


Hey Christopher,

Yes, every row in the sstable that should have been deleted has "d" in that column. Also, the key for one of the rows is:

"key": "0008000000000cdd5edd000008000000000006251000"



How do I get it back to a normal readable format to recover the (long,long) composite partition key?
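
A minimal sketch that decodes such a key, assuming the standard CompositeType layout (2-byte big-endian component length, the component bytes, then a 1-byte end-of-component marker):

    import struct

    def decode_composite_key(hex_key):
        # each component: 2-byte big-endian length, value bytes,
        # then a 1-byte end-of-component marker
        raw = bytes(bytearray.fromhex(hex_key))
        parts, i = [], 0
        while i < len(raw):
            (length,) = struct.unpack_from('>H', raw, i)
            parts.append(raw[i + 2:i + 2 + length])
            i += 2 + length + 1
        return parts

    key = '0008000000000cdd5edd000008000000000006251000'
    print([struct.unpack('>q', p)[0] for p in decode_composite_key(key)])
    # -> [215834333, 402704]

The second long decoded this way is consistent with an hour counter from epoch (402704 hours falls in December 2015), which matches the hourly segregation described above.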

Looks like I have to force a major compaction to delete a lot of data? Are there any other solutions?

thanks
anishek




Re: High Bloom filter false ratio

Posted by Jeff Jirsa <je...@crowdstrike.com>.
1) getFullyExpiredSSTables in 2.0 isn’t as thorough as many expect, so it’s very likely that some sstables stick around longer than you expect.

2) max_sstable_age_days tells cassandra when to stop compacting that file, not when to delete it.

3) You can change the window size using both the base_time_seconds parameter and max_sstable_age_days parameter (use the former to set the size of the first window, and the latter to determine how long before you stop compacting that window). It’s somewhat non-intuitive. 
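
As a sketch of where those knobs live (the keyspace name and values here are illustrative, not a recommendation):

    ALTER TABLE ks.user_stay_points
      WITH compaction = {
        'class': 'DateTieredCompactionStrategy',
        'base_time_seconds': '3600',
        'max_sstable_age_days': '30'
      };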

Your read latencies actually look pretty reasonable, are you sure you’re not simply hitting GC pauses that cause your queries to run longer than you expect? Do you have graphs of GC time (first derivative of total gc time is common for tools like graphite), or do you see ‘gcinspector’ in your logs indicating pauses > 200ms? 


Re: High Bloom filter false ratio

Posted by Christopher Bradford <br...@gmail.com>.
Does every record in the SSTable have a "d" column?


Re: High Bloom filter false ratio

Posted by Anishek Agarwal <an...@gmail.com>.
Hey guys,

Just did some more digging ... looks like DTCS is not removing old data completely. I used sstable2json for one such table and saw old data there. We have a value of 30 for max_sstable_age_days on the table.

One of the columns showed data as ["2015-12-10 11\\:03+0530:", "56690ea2", 1449725602552000, "d"]. What is the meaning of "d" in the last IS_MARKED_FOR_DELETE column?

I see data from 10 Dec 2015 still there, so it looks like there are a few issues with DTCS. Operationally, what choices do I have to rectify this? We are on version 2.0.15.

thanks
anishek





Re: High Bloom filter false ratio

Posted by Anishek Agarwal <an...@gmail.com>.
We are using DTCS with a 30 day window for them before they are cleaned up. I don't think with DTCS we can do anything about table sizing. Please do let me know if there are other ideas.


Re: High Bloom filter false ratio

Posted by Jaydeep Chovatia <ch...@gmail.com>.
To me the following three look on the higher side:
SSTable count: 1289

In order to reduce the SSTable count, see if you are compacting or not (if using STCS). Is it possible to change this to LCS?


Number of keys (estimate): 345137664 (345M partition keys)

I don't have any suggestion about reducing this unless you partition your
data.


Bloom filter space used, bytes: 493777336 (~470MB is huge)

If the number of keys is reduced then this will automatically reduce the bloom filter size, I believe.
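
For a rough sanity check, the textbook sizing formula m = -n * ln(p) / (ln 2)^2 puts this table in the same ballpark; a minimal sketch:

    import math

    n = 345137664   # estimated partition keys
    p = 0.01        # bloom_filter_fp_chance
    bits = -n * math.log(p) / math.log(2) ** 2
    print(bits / 8)  # ~4.1e8 bytes, the same order as the observed 493777336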



Jaydeep


Re: High Bloom filter false ratio

Posted by Anishek Agarwal <an...@gmail.com>.
Hey all,

@Jaydeep here is the cfstats output from one node.

Read Count: 1721134722
Read Latency: 0.04268825050756254 ms.
Write Count: 56743880
Write Latency: 0.014650376727851532 ms.
Pending Tasks: 0
Table: user_stay_points
SSTable count: 1289
Space used (live), bytes: 122141272262
Space used (total), bytes: 224227850870
Off heap memory used (total), bytes: 653827528
SSTable Compression Ratio: 0.4959736121441446
Number of keys (estimate): 345137664
Memtable cell count: 339034
Memtable data size, bytes: 106558314
Memtable switch count: 3266
Local read count: 1721134803
Local read latency: 0.048 ms
Local write count: 56743898
Local write latency: 0.018 ms
Pending tasks: 0
Bloom filter false positives: 40664437
Bloom filter false ratio: 0.69058
Bloom filter space used, bytes: 493777336
Bloom filter off heap memory used, bytes: 493767024
Index summary off heap memory used, bytes: 91677192
Compression metadata off heap memory used, bytes: 68383312
Compacted partition minimum bytes: 104
Compacted partition maximum bytes: 1629722
Compacted partition mean bytes: 1773
Average live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0


@Tyler Hobbs

We are using cassandra 2.0.15, so https://issues.apache.org/jira/browse/CASSANDRA-8525 shouldn't occur. The other problems look like they will be fixed in 3.0; we will mostly try to slot in an upgrade to a 3.x version towards the second quarter of this year.


@Daemon

Latencies seem to have higher ratios, attached is the graph.


I am mostly trying to look at Bloom filters because, the way we do reads, we read data with non-existent partition keys and it seems to be taking long to respond: 720 queries take 2 seconds, with all 720 queries returning nothing. The 720 queries are done in sequential batches of 180, with the 180 in each batch running in parallel.
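
A minimal sketch of that access pattern, where read_key is a placeholder for the actual driver call and the keys are hypothetical:

    from concurrent.futures import ThreadPoolExecutor

    def read_key(key):
        # placeholder for the real (long, long) partition-key lookup
        return None

    # 720 hypothetical (long, long) keys on hourly boundaries
    keys = [(device, hour) for device in range(4) for hour in range(180)]

    # sequential batches of 180 queries, each batch issued in parallel
    with ThreadPoolExecutor(max_workers=180) as pool:
        for start in range(0, len(keys), 180):
            results = list(pool.map(read_key, keys[start:start + 180]))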


thanks

anishek




Re: High Bloom filter false ratio

Posted by Jaydeep Chovatia <ch...@gmail.com>.
How many partition keys exist for the table which shows this problem (or provide nodetool cfstats for that table)?

On Thu, Feb 18, 2016 at 11:38 AM, daemeon reiydelle <da...@gmail.com>
wrote:

> The bloom filter buckets the values in a small number of buckets. I have
> been surprised by how many cases I see with large cardinality where a few
> values populate a given bloom leaf, resulting in high false positives, and
> a surprising impact on latencies!
>
> Are you seeing 2:1 ranges between mean and worse case latencies (allowing
> for gc times)?
>
> Daemeon Reiydelle
> On Feb 18, 2016 8:57 AM, "Tyler Hobbs" <ty...@datastax.com> wrote:
>
>> You can try slightly lowering the bloom_filter_fp_chance on your table.
>>
>> Otherwise, it's possible that you're repeatedly querying one or two
>> partitions that always trigger a bloom filter false positive.  You could
>> try manually tracing a few queries on this table (for non-existent
>> partitions) to see if the bloom filter rejects them.
>>
>> Depending on your Cassandra version, your false positive ratio could be
>> inaccurate: https://issues.apache.org/jira/browse/CASSANDRA-8525
>>
>> There are also a couple of recent improvements to bloom filters:
>> * https://issues.apache.org/jira/browse/CASSANDRA-8413
>> * https://issues.apache.org/jira/browse/CASSANDRA-9167
>>
>>
>> On Thu, Feb 18, 2016 at 1:35 AM, Anishek Agarwal <an...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> We have a table with a composite partition key with humongous
>>> cardinality; it's a combination of (long,long). On the table we have
>>> bloom_filter_fp_chance=0.010000.
>>>
>>> Running "nodetool cfstats" on the 5 nodes we have in the cluster, we are
>>> seeing "Bloom filter false ratio:" in the range of 0.7-0.9.
>>>
>>> I thought the bloom filter would adjust to the keyspace cardinality over
>>> time. We have been running the cluster for a long time now, but have
>>> added significant traffic since Jan this year, which does not lead to
>>> writes in the db but does lead to high reads to check if there are any
>>> values.
>>>
>>> Are there any settings that can be changed to get a better ratio?
>>>
>>> Thanks
>>> Anishek
>>>
>>
>>
>>
>> --
>> Tyler Hobbs
>> DataStax <http://datastax.com/>
>>
>

Re: High Bloom filter false ratio

Posted by daemeon reiydelle <da...@gmail.com>.
The bloom filter hashes the values into a small number of buckets. I have
been surprised by how many cases I see with large cardinality where a few
values populate a given bloom bucket, resulting in high false positives and
a surprising impact on latencies!
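
For intuition, the standard back-of-envelope estimate for a bloom filter's
false-positive rate with k hash functions and m/n bits per key is
(1 - e^(-k*n/m))^k. A rough sketch of that arithmetic -- not Cassandra's
exact sizing logic:

public class BloomFpEstimate {
    // Textbook estimate: p ~ (1 - e^(-k*n/m))^k for k hashes, m bits, n keys.
    static double fpChance(double bitsPerKey, int hashes) {
        return Math.pow(1.0 - Math.exp(-hashes / bitsPerKey), hashes);
    }

    public static void main(String[] args) {
        // Roughly 9.6 bits/key with 7 hashes is what a 1% target implies.
        System.out.printf("%.4f%n", fpChance(9.6, 7)); // ~0.0100
        // If the filter is effectively undersized for the keys being probed,
        // the rate climbs quickly:
        System.out.printf("%.4f%n", fpChance(4.0, 7)); // ~0.26
    }
}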

Are you seeing 2:1 ratios between mean and worst-case latencies (allowing
for GC pauses)?

Daemeon Reiydelle
On Feb 18, 2016 8:57 AM, "Tyler Hobbs" <ty...@datastax.com> wrote:

> You can try slightly lowering the bloom_filter_fp_chance on your table.
>
> Otherwise, it's possible that you're repeatedly querying one or two
> partitions that always trigger a bloom filter false positive.  You could
> try manually tracing a few queries on this table (for non-existent
> partitions) to see if the bloom filter rejects them.
>
> Depending on your Cassandra version, your false positive ratio could be
> inaccurate: https://issues.apache.org/jira/browse/CASSANDRA-8525
>
> There are also a couple of recent improvements to bloom filters:
> * https://issues.apache.org/jira/browse/CASSANDRA-8413
> * https://issues.apache.org/jira/browse/CASSANDRA-9167
>
>
> On Thu, Feb 18, 2016 at 1:35 AM, Anishek Agarwal <an...@gmail.com>
> wrote:
>
>> Hello,
>>
>> We have a table with a composite partition key with humongous
>> cardinality; it's a combination of (long,long). On the table we have
>> bloom_filter_fp_chance=0.010000.
>>
>> Running "nodetool cfstats" on the 5 nodes we have in the cluster, we are
>> seeing "Bloom filter false ratio:" in the range of 0.7-0.9.
>>
>> I thought the bloom filter would adjust to the keyspace cardinality over
>> time. We have been running the cluster for a long time now, but have
>> added significant traffic since Jan this year, which does not lead to
>> writes in the db but does lead to high reads to check if there are any
>> values.
>>
>> Are there any settings that can be changed to get a better ratio?
>>
>> Thanks
>> Anishek
>>
>
>
>
> --
> Tyler Hobbs
> DataStax <http://datastax.com/>
>

Re: High Bloom filter false ratio

Posted by Tyler Hobbs <ty...@datastax.com>.
You can try slightly lowering the bloom_filter_fp_chance on your table.
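
For example (a sketch with the DataStax Java driver -- 0.001 is just an
illustrative value, and as far as I know existing sstables keep their old
filters until they're rewritten, e.g. by compaction or upgradesstables):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class LowerFpChance {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace"); // placeholder keyspace
        // Smaller fp_chance => larger but more accurate bloom filter.
        session.execute(
            "ALTER TABLE user_stay_points WITH bloom_filter_fp_chance = 0.001");
        cluster.close();
    }
}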

Otherwise, it's possible that you're repeatedly querying one or two
partitions that always trigger a bloom filter false positive.  You could
try manually tracing a few queries on this table (for non-existent
partitions) to see if the bloom filter rejects them.
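
Something like this would show it (a sketch with the DataStax Java driver --
you could equally just run TRACING ON in cqlsh; the keyspace and key column
names are placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.QueryTrace;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class TraceProbe {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace"); // placeholder keyspace
        Statement stmt = new SimpleStatement(
            "SELECT * FROM user_stay_points WHERE key1 = 123 AND key2 = 456")
            .enableTracing(); // key1/key2 are hypothetical column names
        ResultSet rs = session.execute(stmt);
        QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
        for (QueryTrace.Event e : trace.getEvents()) {
            // Look for events about the bloom filter allowing sstables
            // to be skipped vs. actual sstable reads.
            System.out.println(e.getDescription());
        }
        cluster.close();
    }
}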

Depending on your Cassandra version, your false positive ratio could be
inaccurate: https://issues.apache.org/jira/browse/CASSANDRA-8525

There are also a couple of recent improvements to bloom filters:
* https://issues.apache.org/jira/browse/CASSANDRA-8413
* https://issues.apache.org/jira/browse/CASSANDRA-9167


On Thu, Feb 18, 2016 at 1:35 AM, Anishek Agarwal <an...@gmail.com> wrote:

> Hello,
>
> We have a table with a composite partition key with humongous cardinality;
> it's a combination of (long,long). On the table we have
> bloom_filter_fp_chance=0.010000.
>
> Running "nodetool cfstats" on the 5 nodes we have in the cluster, we are
> seeing "Bloom filter false ratio:" in the range of 0.7-0.9.
>
> I thought the bloom filter would adjust to the keyspace cardinality over
> time. We have been running the cluster for a long time now, but have
> added significant traffic since Jan this year, which does not lead to
> writes in the db but does lead to high reads to check if there are any
> values.
>
> Are there any settings that can be changed to get a better ratio?
>
> Thanks
> Anishek
>



-- 
Tyler Hobbs
DataStax <http://datastax.com/>