Posted to user@cassandra.apache.org by Rahul Reddy <ra...@gmail.com> on 2019/02/21 15:06:23 UTC

Tombstones in memtable

We have a small table; records are about 5k.
All the inserts come with a 4 hr TTL, the table-level TTL is 1 day, and
gc_grace_seconds is 3 hours.  We do 5k reads a second during peak load.
During the peak load we are seeing alerts for the tombstone scanned histogram
reaching a million.
Cassandra version 3.11.1. Please let me know how this tombstone scanning
can be avoided in the memtable.
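
A quick way to see what the per-read tombstone picture looks like on a single
node is nodetool (both subcommands ship with 3.11); the keyspace and table
names below are placeholders matching the anonymized metric labels used later
in the thread:

```
# Sketch: inspect tombstones-per-slice and the read-path histograms on one node.
nodetool tablestats mykeyspace.tablename | grep -i tombstone
nodetool tablehistograms mykeyspace tablename
```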

RE: Tombstones in memtable

Posted by Kenneth Brotman <ke...@yahoo.com.INVALID>.
Rahul,

 

You wrote that during peak hours you only have a couple hundred inserts per node so now I’m not sure why the default settings wouldn’t have worked just fine.  I sense there is more to the story.  What else could explain those tombstones?

 

From: Rahul Reddy [mailto:rahulreddy1234@gmail.com] 
Sent: Saturday, February 23, 2019 5:56 PM
To: user@cassandra.apache.org
Subject: Re: Tombstones in memtable

 

Changing gcgs didn't help

 

CREATE KEYSPACE ksname WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3', 'dc2': '3'}  AND durable_writes = true;

 

 

```CREATE TABLE keyspace."table" (

    "column1" text PRIMARY KEY,

    "column2" text

) WITH bloom_filter_fp_chance = 0.01

    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}

    AND comment = ''

    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}

    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}

    AND crc_check_chance = 1.0

    AND dclocal_read_repair_chance = 0.1

    AND default_time_to_live = 18000

    AND gc_grace_seconds = 60

    AND max_index_interval = 2048

    AND memtable_flush_period_in_ms = 0

    AND min_index_interval = 128

    AND read_repair_chance = 0.0

    AND speculative_retry = '99PERCENTILE';

 

flushed table and took sstabledump

grep -i '"expired" : true' SSTables.txt|wc -l

16439

grep -i '"expired" : false'  SSTables.txt |wc -l

2657

 

ttl is 4 hours.

 

INSERT INTO keyspace."TABLE_NAME" ("column1", "column2") VALUES (?, ?) USING TTL ?;  -- TTL bound to 14400 seconds (4 hours)

SELECT * FROM keyspace."TABLE_NAME" WHERE "column1" = ?;

 

metric to scan tombstones 

increase(cassandra_Table_TombstoneScannedHistogram{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])

 

During peak hours we only have a couple hundred inserts and 5-8k reads/s per node.

```
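
For alerting, a ratio of tombstones scanned to local reads is usually more
telling than the raw increase() above, since every point read that hits an
expired row counts one tombstone. A hedged sketch is below; the ReadLatency
count series name is an assumption and depends on how the JMX exporter maps
table metrics:

```
# Sketch: tombstones scanned per local read over 5 minutes. The ReadLatency
# count series is assumed to follow the same naming convention as the
# TombstoneScannedHistogram series above.
increase(cassandra_Table_TombstoneScannedHistogram{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])
  /
increase(cassandra_Table_ReadLatency{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])
```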

 

```tablestats

            Read Count: 605231874

            Read Latency: 0.021268529760215503 ms.

            Write Count: 2763352

            Write Latency: 0.027924007871599422 ms.

            Pending Flushes: 0

                        Table: name

                        SSTable count: 1

                        Space used (live): 1413203

                        Space used (total): 1413203

                        Space used by snapshots (total): 0

                        Off heap memory used (total): 28813

                        SSTable Compression Ratio: 0.5015090954531143

                        Number of partitions (estimate): 19568

                        Memtable cell count: 573

                        Memtable data size: 22971

                        Memtable off heap memory used: 0

                        Memtable switch count: 6

                        Local read count: 529868919

                        Local read latency: 0.020 ms

                        Local write count: 2707371

                        Local write latency: 0.024 ms

                        Pending flushes: 0

                        Percent repaired: 0.0

                        Bloom filter false positives: 1

                        Bloom filter false ratio: 0.00000

                        Bloom filter space used: 23888

                        Bloom filter off heap memory used: 23880

                        Index summary off heap memory used: 4717

                        Compression metadata off heap memory used: 216

                        Compacted partition minimum bytes: 73

                        Compacted partition maximum bytes: 124

                        Compacted partition mean bytes: 99

                        Average live cells per slice (last five minutes): 1.0

                        Maximum live cells per slice (last five minutes): 1

                        Average tombstones per slice (last five minutes): 1.0

                        Maximum tombstones per slice (last five minutes): 1

                        Dropped Mutations: 0

                        

                        histograms

Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count

                              (micros)          (micros)           (bytes)                  

50%             0.00             20.50             17.08                86                 1

75%             0.00             24.60             20.50               124                 1

95%             0.00             35.43             29.52               124                 1

98%             0.00             35.43             42.51               124                 1

99%             0.00             42.51             51.01               124                 1

Min             0.00              8.24              5.72                73                 0

Max             1.00             42.51            152.32               124                 1

```

 

3 nodes in dc1 and 3 nodes in dc2. Instance type: AWS EC2 m4.xlarge.

 

On Sat, Feb 23, 2019, 7:47 PM Jeff Jirsa <jj...@gmail.com> wrote:

Would also be good to see your schema (anonymized if needed) and the select queries you’re running

 

-- 

Jeff Jirsa

 


On Feb 23, 2019, at 4:37 PM, Rahul Reddy <ra...@gmail.com> wrote:

Thanks Jeff,

 

I'm having gcgs set to 10 mins and changed the table TTL to 5 hours versus the insert TTL of 4 hours.  Tracing doesn't show any tombstone scans for the reads, and the log doesn't show tombstone scan warnings either. As the reads are happening at 5-8k per node during the peak hours, the metric shows a tombstone scan count of about 1M.

 

On Fri, Feb 22, 2019, 11:46 AM Jeff Jirsa <jj...@gmail.com> wrote:

If all of your data is TTL’d and you never explicitly delete a cell without using a TTL, you can probably drop your GCGS to 1 hour (or less).

 

Which compaction strategy are you using? You need a way to clear out those tombstones. There exist tombstone compaction sub properties that can help encourage compaction to grab sstables just because they’re full of tombstones which will probably help you.
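
A minimal sketch of both suggestions in CQL (table and keyspace names are
placeholders; the threshold values are illustrative, and the schema quoted
above shows gc_grace_seconds already dropped further, to 60):

```
-- Sketch only: lower GCGS and nudge STCS into single-SSTable tombstone compactions.
ALTER TABLE mykeyspace.tablename
    WITH gc_grace_seconds = 3600
    AND compaction = {
        'class': 'SizeTieredCompactionStrategy',
        'tombstone_threshold': '0.2',             -- consider an SSTable once ~20% of it is droppable tombstones
        'tombstone_compaction_interval': '3600',  -- but only once it is at least an hour old
        'unchecked_tombstone_compaction': 'true'  -- and skip the overlap pre-check that often blocks this
    };
```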

 

-- 

Jeff Jirsa

 


On Feb 22, 2019, at 8:37 AM, Kenneth Brotman <ke...@yahoo.com.invalid> wrote:

Can we see the histogram?  Why wouldn’t you at times have that many tombstones?  It makes sense that you would.

 

Kenneth Brotman

 



Re: Tombstones in memtable

Posted by Jeff Jirsa <jj...@gmail.com>.
Given your data model, there are two ways you may read a tombstone:

You select an expired row, or you scan the whole table.

If you select an expired row, you’re going to scan one tombstone. With sufficiently high read rate, that’ll look like you’re scanning a lot - each read will add one to the histogram and it may add up to millions in 5 minutes if you’re reading fast enough,  but in this read pattern it’s not a problem.

If you’re doing a table scan, and you ask for 5000 rows at a time, you may have to scan past tens of thousands of expired rows to eventually find the 5000 live rows. IF you’re doing this, it may be a bit concerning, because it’s having to skip past a ton of tombstones on the read path - which is expensive; this is why the metric exists,  but you’ve said you’re not doing this.

You’re not going to be able to stop reading tombstones unless you can stop the app from reading expired rows. But on the plus side, this type of tombstone read is not expensive and not concerning at all.
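
One way to confirm which of the two patterns applies is to trace a single
point read from cqlsh and look at the tombstone count it reports (the names
and the key below are placeholders for the anonymized schema):

```
-- Sketch: trace one point read and see how many tombstone cells it touched.
TRACING ON;
SELECT * FROM mykeyspace.tablename WHERE column1 = 'some-key';
-- The trace output includes a line of the form "Read N live rows and M tombstone
-- cells"; for a single-partition point read on this schema, M should be 0 or 1.
TRACING OFF;
```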

-- 
Jeff Jirsa


> On Feb 24, 2019, at 5:36 AM, Rahul Reddy <ra...@gmail.com> wrote:
> 
> Thanks Jeff. I'm trying to figure out why the tombstone scans are happening and, if possible, eliminate them.
> 
>> On Sat, Feb 23, 2019, 10:50 PM Jeff Jirsa <jj...@gmail.com> wrote:
>> G1GC with an 8g heap may be slower than CMS. Also you don’t typically set new gen size on G1.
>> 
>> Again though - what problem are you solving here? If you’re serving reads and sitting under 50% cpu, it’s not clear to me what you’re trying to fix. Tombstones scanned won’t matter for your table, so if that’s your only concern, I’d ignore it. 
>> 
>> 
>> 
>> -- 
>> Jeff Jirsa
>> 
>> 
>>> On Feb 23, 2019, at 7:26 PM, Rahul Reddy <ra...@gmail.com> wrote:
>>> 
>>> ```jvm setting
>>> 
>>> -XX:+UseThreadPriorities
>>> -XX:ThreadPriorityPolicy=42
>>> -XX:+HeapDumpOnOutOfMemoryError
>>> -Xss256k
>>> -XX:StringTableSize=1000003
>>> -XX:+AlwaysPreTouch
>>> -XX:-UseBiasedLocking
>>> -XX:+UseTLAB
>>> -XX:+ResizeTLAB
>>> -XX:+UseNUMA
>>> -XX:+PerfDisableSharedMem
>>> -Djava.net.preferIPv4Stack=true
>>> -XX:+UseG1GC
>>> -XX:G1RSetUpdatingPauseTimePercent=5
>>> -XX:MaxGCPauseMillis=500
>>> -XX:+PrintGCDetails
>>> -XX:+PrintGCDateStamps
>>> -XX:+PrintHeapAtGC
>>> -XX:+PrintTenuringDistribution
>>> -XX:+PrintGCApplicationStoppedTime
>>> -XX:+PrintPromotionFailure
>>> -XX:+UseGCLogFileRotation
>>> -XX:NumberOfGCLogFiles=10
>>> -XX:GCLogFileSize=10M
>>> 
>>> Total memory
>>> free
>>>              total       used       free     shared    buffers     cached
>>> Mem:      16434004   16125340     308664         60     172872    5565184
>>> -/+ buffers/cache:   10387284    6046720
>>> Swap:            0          0          0
>>> 
>>> Heap settings in cassandra-env.sh
>>> MAX_HEAP_SIZE="8192M"
>>> HEAP_NEWSIZE="800M"
>>> ```
>>> 
>>>> On Sat, Feb 23, 2019, 10:15 PM Rahul Reddy <ra...@gmail.com> wrote:
>>>> Thanks Jeff,
>>>> 
>>>> Since writes are low and reads are high, most of the time the data is in memtables only.  When I initially noticed the issue there were no sstables on disk; everything was in the memtable only. 
>>>> 
>>>>> On Sat, Feb 23, 2019, 10:01 PM Jeff Jirsa <jj...@gmail.com> wrote:
>>>>> Also given your short ttl and low write rate, you may want to think about how you can keep more in memory - this may mean larger memtable and high flush thresholds (reading from the memtable), or perhaps the partition cache (if you are likely to read the same key multiple times). You’ll also probably win some with basic perf and GC tuning, but can’t really do that via email. Cassandra-8150 has some pointers. 
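
For the partition-cache idea, a minimal sketch assuming the anonymized names
above: enable the row cache on the table, and give the cache space via
row_cache_size_in_mb in cassandra.yaml (it defaults to 0, i.e. disabled).
Whether it pays off depends on how often the same key is re-read before its
4-hour TTL expires.

```
-- Sketch: cache whole (single-row) partitions for this table.
-- Also requires row_cache_size_in_mb > 0 in cassandra.yaml.
ALTER TABLE mykeyspace.tablename
    WITH caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'};
```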
>>>>> 
>>>>> -- 
>>>>> Jeff Jirsa
>>>>> 
>>>>> 
>>>>>> On Feb 23, 2019, at 6:52 PM, Jeff Jirsa <jj...@gmail.com> wrote:
>>>>>> 
>>>>>> You’ll only ever have one tombstone per read, so your load is based on normal read rate not tombstones. The metric isn’t wrong, but it’s not indicative of a problem here given your data model. 
>>>>>> 
>>>>>> You’re using STCS so you may be reading from more than one sstable if you update column2 for a given column1, otherwise you’re probably just seeing normal read load. Consider dropping your compression chunk size a bit (given the sizes in your cfstats I’d probably go to 4K instead of 64k), and maybe consider LCS or TWCS instead of STCS (which one is appropriate depends on a lot of factors, but STCS is probably causing a fair bit of unnecessary compactions and probably is very slow to expire data).
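
A sketch of those two changes (placeholder names again; the TWCS window size
is illustrative for a 4-5 hour TTL, not a value from the thread):

```
-- Sketch: smaller compression chunks for ~100-byte partitions, and TWCS sized to
-- the TTL so whole windows of SSTables can be dropped once fully expired.
ALTER TABLE mykeyspace.tablename
    WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': '4'}
    AND compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'HOURS',
        'compaction_window_size': '1'
    };
```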
>>>>>> 
>>>>>> -- 
>>>>>> Jeff Jirsa
>>>>>> 
>>>>>> 
>>>>>>> On Feb 23, 2019, at 6:31 PM, Rahul Reddy <ra...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Do you see anything wrong with this metric.
>>>>>>> 
>>>>>>> metric to scan tombstones
>>>>>>> increase(cassandra_Table_TombstoneScannedHistogram{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])
>>>>>>> 
>>>>>>> And at the same time the CPU spikes to 50% whenever I see a high tombstone alert.
>>>>>>> 
>>>>>>>> On Sat, Feb 23, 2019, 9:25 PM Jeff Jirsa <jj...@gmail.com> wrote:
>>>>>>>> Your schema is such that you’ll never read more than one tombstone per select (unless you’re also doing range reads / table scans that you didn’t mention) - I’m not quite sure what you’re alerting on, but you’re not going to have tombstone problems with that table / that select. 
>>>>>>>> 
>>>>>>>> -- 
>>>>>>>> Jeff Jirsa
>>>>>>>> 
>>>>>>>> 

Re: Tombstones in memtable

Posted by Rahul Reddy <ra...@gmail.com>.
Thanks Jeff. I'm trying to figure out why the tombstone scans are
happening and, if possible, eliminate them.


Re: Tombstones in memtable

Posted by Jeff Jirsa <jj...@gmail.com>.
G1GC with an 8g heap may be slower than CMS. Also you don’t typically set new gen size on G1.

Again though - what problem are you solving here? If you’re serving reads and sitting under 50% cpu, it’s not clear to me what you’re trying to fix. Tombstones scanned won’t matter for your table, so if that’s your only concern, I’d ignore it. 



-- 
Jeff Jirsa



Re: Tombstones in memtable

Posted by Rahul Reddy <ra...@gmail.com>.
Reads increase on almost all nodes, and the same is the case with CPU; it
goes high on all nodes.

On Sat, Feb 23, 2019, 11:04 PM Kenneth Brotman <ke...@yahoo.com.invalid>
wrote:

> When the CPU utilization spikes from 5-10% to 50%, how many nodes does it
> happen to at the same time?

RE: Tombstones in memtable

Posted by Kenneth Brotman <ke...@yahoo.com.INVALID>.
When the CPU utilization spikes from 5-10% to 50%, how many nodes does it happen to at the same time?

 

From: Rahul Reddy [mailto:rahulreddy1234@gmail.com] 
Sent: Saturday, February 23, 2019 7:26 PM
To: user@cassandra.apache.org
Subject: Re: Tombstones in memtable

 

```jvm setting

 

-XX:+UseThreadPriorities

-XX:ThreadPriorityPolicy=42

-XX:+HeapDumpOnOutOfMemoryError

-Xss256k

-XX:StringTableSize=1000003

-XX:+AlwaysPreTouch

-XX:-UseBiasedLocking

-XX:+UseTLAB

-XX:+ResizeTLAB

-XX:+UseNUMA

-XX:+PerfDisableSharedMem

-Djava.net.preferIPv4Stack=true

-XX:+UseG1GC

-XX:G1RSetUpdatingPauseTimePercent=5

-XX:MaxGCPauseMillis=500

-XX:+PrintGCDetails

-XX:+PrintGCDateStamps

-XX:+PrintHeapAtGC

-XX:+PrintTenuringDistribution

-XX:+PrintGCApplicationStoppedTime

-XX:+PrintPromotionFailure

-XX:+UseGCLogFileRotation

-XX:NumberOfGCLogFiles=10

-XX:GCLogFileSize=10M

 

Total memory

free

             total       used       free     shared    buffers     cached

Mem:      16434004   16125340     308664         60     172872    5565184

-/+ buffers/cache:   10387284    6046720

Swap:            0          0          0

 

Heap settings in cassandra-env.sh

MAX_HEAP_SIZE="8192M"

HEAP_NEWSIZE="800M"

```

 

On Sat, Feb 23, 2019, 10:15 PM Rahul Reddy <ra...@gmail.com> wrote:

Thanks Jeff,

 

Since low writes and high reads most of the time data in memtables only.  When I noticed intially issue no stables on disk everything in memtable only. 

 

On Sat, Feb 23, 2019, 10:01 PM Jeff Jirsa <jj...@gmail.com> wrote:

Also given your short ttl and low write rate, you may want to think about how you can keep more in memory - this may mean larger memtable and high flush thresholds (reading from the memtable), or perhaps the partition cache (if you are likely to read the same key multiple times). You’ll also probably win some with basic perf and GC tuning, but can’t really do that via email. Cassandra-8150 has some pointers. 

-- 

Jeff Jirsa

 


On Feb 23, 2019, at 6:52 PM, Jeff Jirsa <jj...@gmail.com> wrote:

You’ll only ever have one tombstone per read, so your load is based on normal read rate not tombstones. The metric isn’t wrong, but it’s not indicative of a problem here given your data model 

 

You’re using STCS do you may be reading from more than one sstable if you update column2 for a given column1, otherwise you’re probably just seeing normal read load. Consider dropping your compression chunk size a bit (given the sizes in your cfstats I’d probably go to 4K instead of 64k), and maybe consider LCS or TWCS instead of STCS (Which is appropriate depends on a lot of factors, but STCS is probably causing a fair bit of unnecessary compactions and probably is very slow to expire data).

-- 

Jeff Jirsa

 


On Feb 23, 2019, at 6:31 PM, Rahul Reddy <ra...@gmail.com> wrote:

Do you see anything wrong with this metric.

 

metric to scan tombstones

increase(cassandra_Table_TombstoneScannedHistogram{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])

 

And sametime CPU Spike to 50% whenever I see high tombstone alert.

 

On Sat, Feb 23, 2019, 9:25 PM Jeff Jirsa <jj...@gmail.com> wrote:

Your schema is such that you’ll never read more than one tombstone per select (unless you’re also doing range reads / table scans that you didn’t mention) - I’m not quite sure what you’re alerting on, but you’re not going to have tombstone problems with that table / that select. 

-- 

Jeff Jirsa

 


On Feb 23, 2019, at 5:55 PM, Rahul Reddy <ra...@gmail.com> wrote:

Changing gcgs didn't help

 

CREATE KEYSPACE ksname WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3', 'dc2': '3'}  AND durable_writes = true;

 

 

```CREATE TABLE keyspace."table" (

    "column1" text PRIMARY KEY,

    "column2" text

) WITH bloom_filter_fp_chance = 0.01

    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}

    AND comment = ''

    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}

    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}

    AND crc_check_chance = 1.0

    AND dclocal_read_repair_chance = 0.1

    AND default_time_to_live = 18000

    AND gc_grace_seconds = 60

    AND max_index_interval = 2048

    AND memtable_flush_period_in_ms = 0

    AND min_index_interval = 128

    AND read_repair_chance = 0.0

    AND speculative_retry = '99PERCENTILE';

 

flushed table and took tsstabledump     

grep -i '"expired" : true' SSTables.txt|wc -l

16439

grep -i '"expired" : false'  SSTables.txt |wc -l

2657

 

ttl is 4 hours.

 

INSERT INTO keyspace."TABLE_NAME" ("column1", "column2") VALUES (?, ?) USING TTL(4hours) ?';

SELECT * FROM keyspace."TABLE_NAME" WHERE "column1" = ?';

 

metric to scan tombstones 

increase(cassandra_Table_TombstoneScannedHistogram{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])

 

during peak hours. we only have couple of hundred inserts and 5-8k reads/s per node.

```

 

```tablestats

Read Count: 605231874

Read Latency: 0.021268529760215503 ms.

Write Count: 2763352

Write Latency: 0.027924007871599422 ms.

Pending Flushes: 0

Table: name

SSTable count: 1

Space used (live): 1413203

Space used (total): 1413203

Space used by snapshots (total): 0

Off heap memory used (total): 28813

SSTable Compression Ratio: 0.5015090954531143

Number of partitions (estimate): 19568

Memtable cell count: 573

Memtable data size: 22971

Memtable off heap memory used: 0

Memtable switch count: 6

Local read count: 529868919

Local read latency: 0.020 ms

Local write count: 2707371

Local write latency: 0.024 ms

Pending flushes: 0

Percent repaired: 0.0

Bloom filter false positives: 1

Bloom filter false ratio: 0.00000

Bloom filter space used: 23888

Bloom filter off heap memory used: 23880

Index summary off heap memory used: 4717

Compression metadata off heap memory used: 216

Compacted partition minimum bytes: 73

Compacted partition maximum bytes: 124

Compacted partition mean bytes: 99

Average live cells per slice (last five minutes): 1.0

Maximum live cells per slice (last five minutes): 1

Average tombstones per slice (last five minutes): 1.0

Maximum tombstones per slice (last five minutes): 1

Dropped Mutations: 0

histograms

Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count

                              (micros)          (micros)           (bytes)                  

50%             0.00             20.50             17.08                86                 1

75%             0.00             24.60             20.50               124                 1

95%             0.00             35.43             29.52               124                 1

98%             0.00             35.43             42.51               124                 1

99%             0.00             42.51             51.01               124                 1

Min             0.00              8.24              5.72                73                 0

Max             1.00             42.51            152.32               124                 1

```

 

3 nodes in dc1 and 3 nodes in dc2. Instance type: AWS EC2 m4.xlarge.

 

On Sat, Feb 23, 2019, 7:47 PM Jeff Jirsa <jj...@gmail.com> wrote:

Would also be good to see your schema (anonymized if needed) and the select queries you’re running

 

-- 

Jeff Jirsa

 


On Feb 23, 2019, at 4:37 PM, Rahul Reddy <ra...@gmail.com> wrote:

Thanks Jeff,

 

I have gcgs set to 10 minutes and changed the table TTL to 5 hours (versus the insert TTL of 4 hours). Tracing doesn't show any tombstone scans for the reads, and the logs don't show tombstone warnings either. But since reads run at 5-8k per node during peak hours, the metric shows a tombstone scan count of about 1M.

 

On Fri, Feb 22, 2019, 11:46 AM Jeff Jirsa <jj...@gmail.com> wrote:

If all of your data is TTL’d and you never explicitly delete a cell without using a TTL, you can probably drop your GCGS to 1 hour (or less).

 

Which compaction strategy are you using? You need a way to clear out those tombstones. There exist tombstone compaction sub properties that can help encourage compaction to grab sstables just because they’re full of tombstones which will probably help you.
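
As a rough sketch (values are illustrative and the table name is the anonymized one used elsewhere in this thread), the sub-properties mentioned above can be set like this for STCS:

```
-- illustrative values only: tombstone_threshold is the droppable-tombstone ratio
-- that makes a single sstable eligible for a tombstone compaction
ALTER TABLE "keyspace"."table" WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'tombstone_threshold': '0.1',
    'tombstone_compaction_interval': '3600',
    'unchecked_tombstone_compaction': 'true'
};
```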

 

-- 

Jeff Jirsa

 


On Feb 22, 2019, at 8:37 AM, Kenneth Brotman <ke...@yahoo.com.invalid> wrote:

Can we see the histogram?  Why wouldn’t you at times have that many tombstones?  Makes sense.

 

Kenneth Brotman

 

From: Rahul Reddy [mailto:rahulreddy1234@gmail.com] 
Sent: Thursday, February 21, 2019 7:06 AM
To: user@cassandra.apache.org
Subject: Tombstones in memtable

 

We have small table records are about 5k .

All the inserts comes as 4hr ttl and we have table level ttl 1 day and gc grace seconds has 3 hours.  We do 5k reads a second during peak load During the peak load seeing Alerts for tomstone scanned histogram reaching million.

Cassandra version 3.11.1. Please let me know how can this tombstone scan can be avoided in memtable


Re: Tombstones in memtable

Posted by Rahul Reddy <ra...@gmail.com>.
```jvm setting

-XX:+UseThreadPriorities
-XX:ThreadPriorityPolicy=42
-XX:+HeapDumpOnOutOfMemoryError
-Xss256k
-XX:StringTableSize=1000003
-XX:+AlwaysPreTouch
-XX:-UseBiasedLocking
-XX:+UseTLAB
-XX:+ResizeTLAB
-XX:+UseNUMA
-XX:+PerfDisableSharedMem
-Djava.net.preferIPv4Stack=true
-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=500
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=10M

Total memory
free
             total       used       free     shared    buffers     cached
Mem:      16434004   16125340     308664         60     172872    5565184
-/+ buffers/cache:   10387284    6046720
Swap:            0          0          0

Heap settings in cassandra-env.sh
MAX_HEAP_SIZE="8192M"
HEAP_NEWSIZE="800M"
```

On Sat, Feb 23, 2019, 10:15 PM Rahul Reddy <ra...@gmail.com> wrote:

> Thanks Jeff,
>
> Since low writes and high reads most of the time data in memtables only.
> When I noticed intially issue no stables on disk everything in memtable
> only.
>
> On Sat, Feb 23, 2019, 10:01 PM Jeff Jirsa <jj...@gmail.com> wrote:
>
>> Also given your short ttl and low write rate, you may want to think about
>> how you can keep more in memory - this may mean larger memtable and high
>> flush thresholds (reading from the memtable), or perhaps the partition
>> cache (if you are likely to read the same key multiple times). You’ll also
>> probably win some with basic perf and GC tuning, but can’t really do that
>> via email. Cassandra-8150 has some pointers.
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Feb 23, 2019, at 6:52 PM, Jeff Jirsa <jj...@gmail.com> wrote:
>>
>> You’ll only ever have one tombstone per read, so your load is based on
>> normal read rate not tombstones. The metric isn’t wrong, but it’s not
>> indicative of a problem here given your data model.
>>
>> You’re using STCS do you may be reading from more than one sstable if you
>> update column2 for a given column1, otherwise you’re probably just seeing
>> normal read load. Consider dropping your compression chunk size a bit
>> (given the sizes in your cfstats I’d probably go to 4K instead of 64k), and
>> maybe consider LCS or TWCS instead of STCS (Which is appropriate depends on
>> a lot of factors, but STCS is probably causing a fair bit of unnecessary
>> compactions and probably is very slow to expire data).
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Feb 23, 2019, at 6:31 PM, Rahul Reddy <ra...@gmail.com>
>> wrote:
>>
>> Do you see anything wrong with this metric.
>>
>> metric to scan tombstones
>>
>> increase(cassandra_Table_TombstoneScannedHistogram{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])
>>
>> And sametime CPU Spike to 50% whenever I see high tombstone alert.
>>
>> On Sat, Feb 23, 2019, 9:25 PM Jeff Jirsa <jj...@gmail.com> wrote:
>>
>>> Your schema is such that you’ll never read more than one tombstone per
>>> select (unless you’re also doing range reads / table scans that you didn’t
>>> mention) - I’m not quite sure what you’re alerting on, but you’re not going
>>> to have tombstone problems with that table / that select.
>>>
>>> --
>>> Jeff Jirsa
>>>
>>>
>>> On Feb 23, 2019, at 5:55 PM, Rahul Reddy <ra...@gmail.com>
>>> wrote:
>>>
>>> Changing gcgs didn't help
>>>
>>> CREATE KEYSPACE ksname WITH replication = {'class':
>>> 'NetworkTopologyStrategy', 'dc1': '3', 'dc2': '3'}  AND durable_writes =
>>> true;
>>>
>>>
>>> ```CREATE TABLE keyspace."table" (
>>>     "column1" text PRIMARY KEY,
>>>     "column2" text
>>> ) WITH bloom_filter_fp_chance = 0.01
>>>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>>>     AND comment = ''
>>>     AND compaction = {'class':
>>> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
>>> 'max_threshold': '32', 'min_threshold': '4'}
>>>     AND compression = {'chunk_length_in_kb': '64', 'class':
>>> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>>>     AND crc_check_chance = 1.0
>>>     AND dclocal_read_repair_chance = 0.1
>>>     AND default_time_to_live = 18000
>>>     AND gc_grace_seconds = 60
>>>     AND max_index_interval = 2048
>>>     AND memtable_flush_period_in_ms = 0
>>>     AND min_index_interval = 128
>>>     AND read_repair_chance = 0.0
>>>     AND speculative_retry = '99PERCENTILE';
>>>
>>> flushed table and took tsstabledump
>>> grep -i '"expired" : true' SSTables.txt|wc -l
>>> 16439
>>> grep -i '"expired" : false'  SSTables.txt |wc -l
>>> 2657
>>>
>>> ttl is 4 hours.
>>>
>>> INSERT INTO keyspace."TABLE_NAME" ("column1", "column2") VALUES (?, ?)
>>> USING TTL(4hours) ?';
>>> SELECT * FROM keyspace."TABLE_NAME" WHERE "column1" = ?';
>>>
>>> metric to scan tombstones
>>>
>>> increase(cassandra_Table_TombstoneScannedHistogram{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])
>>>
>>> during peak hours. we only have couple of hundred inserts and 5-8k
>>> reads/s per node.
>>> ```
>>>
>>> ```tablestats
>>> Read Count: 605231874
>>> Read Latency: 0.021268529760215503 ms.
>>> Write Count: 2763352
>>> Write Latency: 0.027924007871599422 ms.
>>> Pending Flushes: 0
>>> Table: name
>>> SSTable count: 1
>>> Space used (live): 1413203
>>> Space used (total): 1413203
>>> Space used by snapshots (total): 0
>>> Off heap memory used (total): 28813
>>> SSTable Compression Ratio: 0.5015090954531143
>>> Number of partitions (estimate): 19568
>>> Memtable cell count: 573
>>> Memtable data size: 22971
>>> Memtable off heap memory used: 0
>>> Memtable switch count: 6
>>> Local read count: 529868919
>>> Local read latency: 0.020 ms
>>> Local write count: 2707371
>>> Local write latency: 0.024 ms
>>> Pending flushes: 0
>>> Percent repaired: 0.0
>>> Bloom filter false positives: 1
>>> Bloom filter false ratio: 0.00000
>>> Bloom filter space used: 23888
>>> Bloom filter off heap memory used: 23880
>>> Index summary off heap memory used: 4717
>>> Compression metadata off heap memory used: 216
>>> Compacted partition minimum bytes: 73
>>> Compacted partition maximum bytes: 124
>>> Compacted partition mean bytes: 99
>>> Average live cells per slice (last five minutes): 1.0
>>> Maximum live cells per slice (last five minutes): 1
>>> Average tombstones per slice (last five minutes): 1.0
>>> Maximum tombstones per slice (last five minutes): 1
>>> Dropped Mutations: 0
>>> histograms
>>> Percentile  SSTables     Write Latency      Read Latency    Partition
>>> Size        Cell Count
>>>                               (micros)          (micros)
>>>  (bytes)
>>> 50%             0.00             20.50             17.08
>>> 86                 1
>>> 75%             0.00             24.60             20.50
>>>  124                 1
>>> 95%             0.00             35.43             29.52
>>>  124                 1
>>> 98%             0.00             35.43             42.51
>>>  124                 1
>>> 99%             0.00             42.51             51.01
>>>  124                 1
>>> Min             0.00              8.24              5.72
>>> 73                 0
>>> Max             1.00             42.51            152.32
>>>  124                 1
>>> ```
>>>
>>> 3 node in dc1 and 3 node in dc2 cluster. With instanc type aws  ec2
>>> m4.xlarge
>>>
>>> On Sat, Feb 23, 2019, 7:47 PM Jeff Jirsa <jj...@gmail.com> wrote:
>>>
>>>> Would also be good to see your schema (anonymized if needed) and the
>>>> select queries you’re running
>>>>
>>>>
>>>> --
>>>> Jeff Jirsa
>>>>
>>>>
>>>> On Feb 23, 2019, at 4:37 PM, Rahul Reddy <ra...@gmail.com>
>>>> wrote:
>>>>
>>>> Thanks Jeff,
>>>>
>>>> I'm having gcgs set to 10 mins and changed the table ttl also to 5
>>>> hours compared to insert ttl to 4 hours .  Tracing on doesn't show any
>>>> tombstone scans for the reads.  And also log doesn't show tombstone scan
>>>> alerts. Has the reads are happening 5-8k reads per node during the peak
>>>> hours it shows 1M tombstone scans count per read.
>>>>
>>>> On Fri, Feb 22, 2019, 11:46 AM Jeff Jirsa <jj...@gmail.com> wrote:
>>>>
>>>>> If all of your data is TTL’d and you never explicitly delete a cell
>>>>> without using s TTL, you can probably drop your GCGS to 1 hour (or less).
>>>>>
>>>>> Which compaction strategy are you using? You need a way to clear out
>>>>> those tombstones. There exist tombstone compaction sub properties that can
>>>>> help encourage compaction to grab sstables just because they’re full of
>>>>> tombstones which will probably help you.
>>>>>
>>>>>
>>>>> --
>>>>> Jeff Jirsa
>>>>>
>>>>>
>>>>> On Feb 22, 2019, at 8:37 AM, Kenneth Brotman <
>>>>> kenbrotman@yahoo.com.invalid> wrote:
>>>>>
>>>>> Can we see the histogram?  Why wouldn’t you at times have that many
>>>>> tombstones?  Makes sense.
>>>>>
>>>>>
>>>>>
>>>>> Kenneth Brotman
>>>>>
>>>>>
>>>>>
>>>>> *From:* Rahul Reddy [mailto:rahulreddy1234@gmail.com
>>>>> <ra...@gmail.com>]
>>>>> *Sent:* Thursday, February 21, 2019 7:06 AM
>>>>> *To:* user@cassandra.apache.org
>>>>> *Subject:* Tombstones in memtable
>>>>>
>>>>>
>>>>>
>>>>> We have small table records are about 5k .
>>>>>
>>>>> All the inserts comes as 4hr ttl and we have table level ttl 1 day and
>>>>> gc grace seconds has 3 hours.  We do 5k reads a second during peak load
>>>>> During the peak load seeing Alerts for tomstone scanned histogram reaching
>>>>> million.
>>>>>
>>>>> Cassandra version 3.11.1. Please let me know how can this tombstone
>>>>> scan can be avoided in memtable
>>>>>
>>>>>

Re: Tombstones in memtable

Posted by Rahul Reddy <ra...@gmail.com>.
Thanks Jeff,

Since writes are low and reads are high, most of the data sits in memtables only.
When I initially noticed the issue there were no sstables on disk; everything was
in the memtable only.

On Sat, Feb 23, 2019, 10:01 PM Jeff Jirsa <jj...@gmail.com> wrote:

> Also given your short ttl and low write rate, you may want to think about
> how you can keep more in memory - this may mean larger memtable and high
> flush thresholds (reading from the memtable), or perhaps the partition
> cache (if you are likely to read the same key multiple times). You’ll also
> probably win some with basic perf and GC tuning, but can’t really do that
> via email. Cassandra-8150 has some pointers.
>
> --
> Jeff Jirsa
>
>
> On Feb 23, 2019, at 6:52 PM, Jeff Jirsa <jj...@gmail.com> wrote:
>
> You’ll only ever have one tombstone per read, so your load is based on
> normal read rate not tombstones. The metric isn’t wrong, but it’s not
> indicative of a problem here given your data model.
>
> You’re using STCS do you may be reading from more than one sstable if you
> update column2 for a given column1, otherwise you’re probably just seeing
> normal read load. Consider dropping your compression chunk size a bit
> (given the sizes in your cfstats I’d probably go to 4K instead of 64k), and
> maybe consider LCS or TWCS instead of STCS (Which is appropriate depends on
> a lot of factors, but STCS is probably causing a fair bit of unnecessary
> compactions and probably is very slow to expire data).
>
> --
> Jeff Jirsa
>
>
> On Feb 23, 2019, at 6:31 PM, Rahul Reddy <ra...@gmail.com> wrote:
>
> Do you see anything wrong with this metric.
>
> metric to scan tombstones
>
> increase(cassandra_Table_TombstoneScannedHistogram{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])
>
> And sametime CPU Spike to 50% whenever I see high tombstone alert.
>
> On Sat, Feb 23, 2019, 9:25 PM Jeff Jirsa <jj...@gmail.com> wrote:
>
>> Your schema is such that you’ll never read more than one tombstone per
>> select (unless you’re also doing range reads / table scans that you didn’t
>> mention) - I’m not quite sure what you’re alerting on, but you’re not going
>> to have tombstone problems with that table / that select.
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Feb 23, 2019, at 5:55 PM, Rahul Reddy <ra...@gmail.com>
>> wrote:
>>
>> Changing gcgs didn't help
>>
>> CREATE KEYSPACE ksname WITH replication = {'class':
>> 'NetworkTopologyStrategy', 'dc1': '3', 'dc2': '3'}  AND durable_writes =
>> true;
>>
>>
>> ```CREATE TABLE keyspace."table" (
>>     "column1" text PRIMARY KEY,
>>     "column2" text
>> ) WITH bloom_filter_fp_chance = 0.01
>>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>>     AND comment = ''
>>     AND compaction = {'class':
>> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
>> 'max_threshold': '32', 'min_threshold': '4'}
>>     AND compression = {'chunk_length_in_kb': '64', 'class':
>> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>>     AND crc_check_chance = 1.0
>>     AND dclocal_read_repair_chance = 0.1
>>     AND default_time_to_live = 18000
>>     AND gc_grace_seconds = 60
>>     AND max_index_interval = 2048
>>     AND memtable_flush_period_in_ms = 0
>>     AND min_index_interval = 128
>>     AND read_repair_chance = 0.0
>>     AND speculative_retry = '99PERCENTILE';
>>
>> flushed table and took tsstabledump
>> grep -i '"expired" : true' SSTables.txt|wc -l
>> 16439
>> grep -i '"expired" : false'  SSTables.txt |wc -l
>> 2657
>>
>> ttl is 4 hours.
>>
>> INSERT INTO keyspace."TABLE_NAME" ("column1", "column2") VALUES (?, ?)
>> USING TTL(4hours) ?';
>> SELECT * FROM keyspace."TABLE_NAME" WHERE "column1" = ?';
>>
>> metric to scan tombstones
>>
>> increase(cassandra_Table_TombstoneScannedHistogram{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])
>>
>> during peak hours. we only have couple of hundred inserts and 5-8k
>> reads/s per node.
>> ```
>>
>> ```tablestats
>> Read Count: 605231874
>> Read Latency: 0.021268529760215503 ms.
>> Write Count: 2763352
>> Write Latency: 0.027924007871599422 ms.
>> Pending Flushes: 0
>> Table: name
>> SSTable count: 1
>> Space used (live): 1413203
>> Space used (total): 1413203
>> Space used by snapshots (total): 0
>> Off heap memory used (total): 28813
>> SSTable Compression Ratio: 0.5015090954531143
>> Number of partitions (estimate): 19568
>> Memtable cell count: 573
>> Memtable data size: 22971
>> Memtable off heap memory used: 0
>> Memtable switch count: 6
>> Local read count: 529868919
>> Local read latency: 0.020 ms
>> Local write count: 2707371
>> Local write latency: 0.024 ms
>> Pending flushes: 0
>> Percent repaired: 0.0
>> Bloom filter false positives: 1
>> Bloom filter false ratio: 0.00000
>> Bloom filter space used: 23888
>> Bloom filter off heap memory used: 23880
>> Index summary off heap memory used: 4717
>> Compression metadata off heap memory used: 216
>> Compacted partition minimum bytes: 73
>> Compacted partition maximum bytes: 124
>> Compacted partition mean bytes: 99
>> Average live cells per slice (last five minutes): 1.0
>> Maximum live cells per slice (last five minutes): 1
>> Average tombstones per slice (last five minutes): 1.0
>> Maximum tombstones per slice (last five minutes): 1
>> Dropped Mutations: 0
>> histograms
>> Percentile  SSTables     Write Latency      Read Latency    Partition
>> Size        Cell Count
>>                               (micros)          (micros)
>>  (bytes)
>> 50%             0.00             20.50             17.08
>> 86                 1
>> 75%             0.00             24.60             20.50
>>  124                 1
>> 95%             0.00             35.43             29.52
>>  124                 1
>> 98%             0.00             35.43             42.51
>>  124                 1
>> 99%             0.00             42.51             51.01
>>  124                 1
>> Min             0.00              8.24              5.72
>> 73                 0
>> Max             1.00             42.51            152.32
>>  124                 1
>> ```
>>
>> 3 node in dc1 and 3 node in dc2 cluster. With instanc type aws  ec2
>> m4.xlarge
>>
>> On Sat, Feb 23, 2019, 7:47 PM Jeff Jirsa <jj...@gmail.com> wrote:
>>
>>> Would also be good to see your schema (anonymized if needed) and the
>>> select queries you’re running
>>>
>>>
>>> --
>>> Jeff Jirsa
>>>
>>>
>>> On Feb 23, 2019, at 4:37 PM, Rahul Reddy <ra...@gmail.com>
>>> wrote:
>>>
>>> Thanks Jeff,
>>>
>>> I'm having gcgs set to 10 mins and changed the table ttl also to 5
>>> hours compared to insert ttl to 4 hours .  Tracing on doesn't show any
>>> tombstone scans for the reads.  And also log doesn't show tombstone scan
>>> alerts. Has the reads are happening 5-8k reads per node during the peak
>>> hours it shows 1M tombstone scans count per read.
>>>
>>> On Fri, Feb 22, 2019, 11:46 AM Jeff Jirsa <jj...@gmail.com> wrote:
>>>
>>>> If all of your data is TTL’d and you never explicitly delete a cell
>>>> without using s TTL, you can probably drop your GCGS to 1 hour (or less).
>>>>
>>>> Which compaction strategy are you using? You need a way to clear out
>>>> those tombstones. There exist tombstone compaction sub properties that can
>>>> help encourage compaction to grab sstables just because they’re full of
>>>> tombstones which will probably help you.
>>>>
>>>>
>>>> --
>>>> Jeff Jirsa
>>>>
>>>>
>>>> On Feb 22, 2019, at 8:37 AM, Kenneth Brotman <
>>>> kenbrotman@yahoo.com.invalid> wrote:
>>>>
>>>> Can we see the histogram?  Why wouldn’t you at times have that many
>>>> tombstones?  Makes sense.
>>>>
>>>>
>>>>
>>>> Kenneth Brotman
>>>>
>>>>
>>>>
>>>> *From:* Rahul Reddy [mailto:rahulreddy1234@gmail.com
>>>> <ra...@gmail.com>]
>>>> *Sent:* Thursday, February 21, 2019 7:06 AM
>>>> *To:* user@cassandra.apache.org
>>>> *Subject:* Tombstones in memtable
>>>>
>>>>
>>>>
>>>> We have small table records are about 5k .
>>>>
>>>> All the inserts comes as 4hr ttl and we have table level ttl 1 day and
>>>> gc grace seconds has 3 hours.  We do 5k reads a second during peak load
>>>> During the peak load seeing Alerts for tomstone scanned histogram reaching
>>>> million.
>>>>
>>>> Cassandra version 3.11.1. Please let me know how can this tombstone
>>>> scan can be avoided in memtable
>>>>
>>>>

Re: Tombstones in memtable

Posted by Jeff Jirsa <jj...@gmail.com>.
Also given your short ttl and low write rate, you may want to think about how you can keep more in memory - this may mean larger memtable and high flush thresholds (reading from the memtable), or perhaps the partition cache (if you are likely to read the same key multiple times). You’ll also probably win some with basic perf and GC tuning, but can’t really do that via email. Cassandra-8150 has some pointers. 
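
One concrete knob for the partition/row cache, as a sketch (same anonymized table; the cache only holds rows if row_cache_size_in_mb is raised above its default of 0 in cassandra.yaml):

```
-- cache whole partitions for this small, hot table; requires a non-zero
-- row_cache_size_in_mb in cassandra.yaml to take effect
ALTER TABLE "keyspace"."table"
    WITH caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'};
```

Memtable sizing is driven from cassandra.yaml (memtable_heap_space_in_mb / memtable_offheap_space_in_mb and memtable_cleanup_threshold), so raising those is what keeps more of the 4-hour window in memory before a flush.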

-- 
Jeff Jirsa


> On Feb 23, 2019, at 6:52 PM, Jeff Jirsa <jj...@gmail.com> wrote:
> 
> You’ll only ever have one tombstone per read, so your load is based on normal read rate not tombstones. The metric isn’t wrong, but it’s not indicative of a problem here given your data model. 
> 
> You’re using STCS do you may be reading from more than one sstable if you update column2 for a given column1, otherwise you’re probably just seeing normal read load. Consider dropping your compression chunk size a bit (given the sizes in your cfstats I’d probably go to 4K instead of 64k), and maybe consider LCS or TWCS instead of STCS (Which is appropriate depends on a lot of factors, but STCS is probably causing a fair bit of unnecessary compactions and probably is very slow to expire data).
> 
> -- 
> Jeff Jirsa
> 
> 
>> On Feb 23, 2019, at 6:31 PM, Rahul Reddy <ra...@gmail.com> wrote:
>> 
>> Do you see anything wrong with this metric.
>> 
>> metric to scan tombstones
>> increase(cassandra_Table_TombstoneScannedHistogram{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])
>> 
>> And sametime CPU Spike to 50% whenever I see high tombstone alert.
>> 
>>> On Sat, Feb 23, 2019, 9:25 PM Jeff Jirsa <jj...@gmail.com> wrote:
>>> Your schema is such that you’ll never read more than one tombstone per select (unless you’re also doing range reads / table scans that you didn’t mention) - I’m not quite sure what you’re alerting on, but you’re not going to have tombstone problems with that table / that select. 
>>> 
>>> -- 
>>> Jeff Jirsa
>>> 
>>> 
>>>> On Feb 23, 2019, at 5:55 PM, Rahul Reddy <ra...@gmail.com> wrote:
>>>> 
>>>> Changing gcgs didn't help
>>>> 
>>>> CREATE KEYSPACE ksname WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3', 'dc2': '3'}  AND durable_writes = true;
>>>> 
>>>> 
>>>> ```CREATE TABLE keyspace."table" (
>>>>     "column1" text PRIMARY KEY,
>>>>     "column2" text
>>>> ) WITH bloom_filter_fp_chance = 0.01
>>>>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>>>>     AND comment = ''
>>>>     AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
>>>>     AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>>>>     AND crc_check_chance = 1.0
>>>>     AND dclocal_read_repair_chance = 0.1
>>>>     AND default_time_to_live = 18000
>>>>     AND gc_grace_seconds = 60
>>>>     AND max_index_interval = 2048
>>>>     AND memtable_flush_period_in_ms = 0
>>>>     AND min_index_interval = 128
>>>>     AND read_repair_chance = 0.0
>>>>     AND speculative_retry = '99PERCENTILE';
>>>> 
>>>> flushed table and took tsstabledump     
>>>> grep -i '"expired" : true' SSTables.txt|wc -l
>>>> 16439
>>>> grep -i '"expired" : false'  SSTables.txt |wc -l
>>>> 2657
>>>> 
>>>> ttl is 4 hours.
>>>> 
>>>> INSERT INTO keyspace."TABLE_NAME" ("column1", "column2") VALUES (?, ?) USING TTL(4hours) ?';
>>>> SELECT * FROM keyspace."TABLE_NAME" WHERE "column1" = ?';
>>>> 
>>>> metric to scan tombstones 
>>>> increase(cassandra_Table_TombstoneScannedHistogram{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])
>>>> 
>>>> during peak hours. we only have couple of hundred inserts and 5-8k reads/s per node.
>>>> ```
>>>> 
>>>> ```tablestats
>>>> 	Read Count: 605231874
>>>> 	Read Latency: 0.021268529760215503 ms.
>>>> 	Write Count: 2763352
>>>> 	Write Latency: 0.027924007871599422 ms.
>>>> 	Pending Flushes: 0
>>>> 		Table: name
>>>> 		SSTable count: 1
>>>> 		Space used (live): 1413203
>>>> 		Space used (total): 1413203
>>>> 		Space used by snapshots (total): 0
>>>> 		Off heap memory used (total): 28813
>>>> 		SSTable Compression Ratio: 0.5015090954531143
>>>> 		Number of partitions (estimate): 19568
>>>> 		Memtable cell count: 573
>>>> 		Memtable data size: 22971
>>>> 		Memtable off heap memory used: 0
>>>> 		Memtable switch count: 6
>>>> 		Local read count: 529868919
>>>> 		Local read latency: 0.020 ms
>>>> 		Local write count: 2707371
>>>> 		Local write latency: 0.024 ms
>>>> 		Pending flushes: 0
>>>> 		Percent repaired: 0.0
>>>> 		Bloom filter false positives: 1
>>>> 		Bloom filter false ratio: 0.00000
>>>> 		Bloom filter space used: 23888
>>>> 		Bloom filter off heap memory used: 23880
>>>> 		Index summary off heap memory used: 4717
>>>> 		Compression metadata off heap memory used: 216
>>>> 		Compacted partition minimum bytes: 73
>>>> 		Compacted partition maximum bytes: 124
>>>> 		Compacted partition mean bytes: 99
>>>> 		Average live cells per slice (last five minutes): 1.0
>>>> 		Maximum live cells per slice (last five minutes): 1
>>>> 		Average tombstones per slice (last five minutes): 1.0
>>>> 		Maximum tombstones per slice (last five minutes): 1
>>>> 		Dropped Mutations: 0
>>>> 		
>>>> 		histograms
>>>> Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
>>>>                               (micros)          (micros)           (bytes)                  
>>>> 50%             0.00             20.50             17.08                86                 1
>>>> 75%             0.00             24.60             20.50               124                 1
>>>> 95%             0.00             35.43             29.52               124                 1
>>>> 98%             0.00             35.43             42.51               124                 1
>>>> 99%             0.00             42.51             51.01               124                 1
>>>> Min             0.00              8.24              5.72                73                 0
>>>> Max             1.00             42.51            152.32               124                 1
>>>> ```
>>>> 
>>>> 3 node in dc1 and 3 node in dc2 cluster. With instanc type aws  ec2 m4.xlarge
>>>> 
>>>>> On Sat, Feb 23, 2019, 7:47 PM Jeff Jirsa <jj...@gmail.com> wrote:
>>>>> Would also be good to see your schema (anonymized if needed) and the select queries you’re running
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Jeff Jirsa
>>>>> 
>>>>> 
>>>>>> On Feb 23, 2019, at 4:37 PM, Rahul Reddy <ra...@gmail.com> wrote:
>>>>>> 
>>>>>> Thanks Jeff,
>>>>>> 
>>>>>> I'm having gcgs set to 10 mins and changed the table ttl also to 5  hours compared to insert ttl to 4 hours .  Tracing on doesn't show any tombstone scans for the reads.  And also log doesn't show tombstone scan alerts. Has the reads are happening 5-8k reads per node during the peak hours it shows 1M tombstone scans count per read. 
>>>>>> 
>>>>>>> On Fri, Feb 22, 2019, 11:46 AM Jeff Jirsa <jj...@gmail.com> wrote:
>>>>>>> If all of your data is TTL’d and you never explicitly delete a cell without using s TTL, you can probably drop your GCGS to 1 hour (or less).
>>>>>>> 
>>>>>>> Which compaction strategy are you using? You need a way to clear out those tombstones. There exist tombstone compaction sub properties that can help encourage compaction to grab sstables just because they’re full of tombstones which will probably help you.
>>>>>>> 
>>>>>>> 
>>>>>>> -- 
>>>>>>> Jeff Jirsa
>>>>>>> 
>>>>>>> 
>>>>>>>> On Feb 22, 2019, at 8:37 AM, Kenneth Brotman <ke...@yahoo.com.invalid> wrote:
>>>>>>>> 
>>>>>>>> Can we see the histogram?  Why wouldn’t you at times have that many tombstones?  Makes sense.
>>>>>>>> 
>>>>>>>>  
>>>>>>>> 
>>>>>>>> Kenneth Brotman
>>>>>>>> 
>>>>>>>>  
>>>>>>>> 
>>>>>>>> From: Rahul Reddy [mailto:rahulreddy1234@gmail.com] 
>>>>>>>> Sent: Thursday, February 21, 2019 7:06 AM
>>>>>>>> To: user@cassandra.apache.org
>>>>>>>> Subject: Tombstones in memtable
>>>>>>>> 
>>>>>>>>  
>>>>>>>> 
>>>>>>>> We have small table records are about 5k .
>>>>>>>> 
>>>>>>>> All the inserts comes as 4hr ttl and we have table level ttl 1 day and gc grace seconds has 3 hours.  We do 5k reads a second during peak load During the peak load seeing Alerts for tomstone scanned histogram reaching million.
>>>>>>>> 
>>>>>>>> Cassandra version 3.11.1. Please let me know how can this tombstone scan can be avoided in memtable

Re: Tombstones in memtable

Posted by Jeff Jirsa <jj...@gmail.com>.
You’ll only ever have one tombstone per read, so your load is based on normal read rate not tombstones. The metric isn’t wrong, but it’s not indicative of a problem here given your data model. 

You’re using STCS so you may be reading from more than one sstable if you update column2 for a given column1; otherwise you’re probably just seeing normal read load. Consider dropping your compression chunk size a bit (given the sizes in your cfstats I’d probably go to 4K instead of 64k), and maybe consider LCS or TWCS instead of STCS (which one is appropriate depends on a lot of factors, but STCS is probably causing a fair bit of unnecessary compactions and is probably very slow to expire data).
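
As a sketch of what that could look like (anonymized table name as in the schema; the 1-hour TWCS window is only an assumption based on the 4-5 hour TTLs, and any compaction change should be tested first):

```
-- smaller chunks mean less decompression per point read; TWCS can drop
-- whole sstables once every cell in them has expired
ALTER TABLE "keyspace"."table"
    WITH compression = {'class': 'org.apache.cassandra.io.compress.LZ4Compressor',
                        'chunk_length_in_kb': '4'}
    AND compaction = {'class': 'TimeWindowCompactionStrategy',
                      'compaction_window_unit': 'HOURS',
                      'compaction_window_size': '1'};
```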

-- 
Jeff Jirsa


> On Feb 23, 2019, at 6:31 PM, Rahul Reddy <ra...@gmail.com> wrote:
> 
> Do you see anything wrong with this metric.
> 
> metric to scan tombstones
> increase(cassandra_Table_TombstoneScannedHistogram{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])
> 
> And sametime CPU Spike to 50% whenever I see high tombstone alert.
> 
>> On Sat, Feb 23, 2019, 9:25 PM Jeff Jirsa <jj...@gmail.com> wrote:
>> Your schema is such that you’ll never read more than one tombstone per select (unless you’re also doing range reads / table scans that you didn’t mention) - I’m not quite sure what you’re alerting on, but you’re not going to have tombstone problems with that table / that select. 
>> 
>> -- 
>> Jeff Jirsa
>> 
>> 
>>> On Feb 23, 2019, at 5:55 PM, Rahul Reddy <ra...@gmail.com> wrote:
>>> 
>>> Changing gcgs didn't help
>>> 
>>> CREATE KEYSPACE ksname WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3', 'dc2': '3'}  AND durable_writes = true;
>>> 
>>> 
>>> ```CREATE TABLE keyspace."table" (
>>>     "column1" text PRIMARY KEY,
>>>     "column2" text
>>> ) WITH bloom_filter_fp_chance = 0.01
>>>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>>>     AND comment = ''
>>>     AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
>>>     AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>>>     AND crc_check_chance = 1.0
>>>     AND dclocal_read_repair_chance = 0.1
>>>     AND default_time_to_live = 18000
>>>     AND gc_grace_seconds = 60
>>>     AND max_index_interval = 2048
>>>     AND memtable_flush_period_in_ms = 0
>>>     AND min_index_interval = 128
>>>     AND read_repair_chance = 0.0
>>>     AND speculative_retry = '99PERCENTILE';
>>> 
>>> flushed table and took tsstabledump     
>>> grep -i '"expired" : true' SSTables.txt|wc -l
>>> 16439
>>> grep -i '"expired" : false'  SSTables.txt |wc -l
>>> 2657
>>> 
>>> ttl is 4 hours.
>>> 
>>> INSERT INTO keyspace."TABLE_NAME" ("column1", "column2") VALUES (?, ?) USING TTL(4hours) ?';
>>> SELECT * FROM keyspace."TABLE_NAME" WHERE "column1" = ?';
>>> 
>>> metric to scan tombstones 
>>> increase(cassandra_Table_TombstoneScannedHistogram{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])
>>> 
>>> during peak hours. we only have couple of hundred inserts and 5-8k reads/s per node.
>>> ```
>>> 
>>> ```tablestats
>>> 	Read Count: 605231874
>>> 	Read Latency: 0.021268529760215503 ms.
>>> 	Write Count: 2763352
>>> 	Write Latency: 0.027924007871599422 ms.
>>> 	Pending Flushes: 0
>>> 		Table: name
>>> 		SSTable count: 1
>>> 		Space used (live): 1413203
>>> 		Space used (total): 1413203
>>> 		Space used by snapshots (total): 0
>>> 		Off heap memory used (total): 28813
>>> 		SSTable Compression Ratio: 0.5015090954531143
>>> 		Number of partitions (estimate): 19568
>>> 		Memtable cell count: 573
>>> 		Memtable data size: 22971
>>> 		Memtable off heap memory used: 0
>>> 		Memtable switch count: 6
>>> 		Local read count: 529868919
>>> 		Local read latency: 0.020 ms
>>> 		Local write count: 2707371
>>> 		Local write latency: 0.024 ms
>>> 		Pending flushes: 0
>>> 		Percent repaired: 0.0
>>> 		Bloom filter false positives: 1
>>> 		Bloom filter false ratio: 0.00000
>>> 		Bloom filter space used: 23888
>>> 		Bloom filter off heap memory used: 23880
>>> 		Index summary off heap memory used: 4717
>>> 		Compression metadata off heap memory used: 216
>>> 		Compacted partition minimum bytes: 73
>>> 		Compacted partition maximum bytes: 124
>>> 		Compacted partition mean bytes: 99
>>> 		Average live cells per slice (last five minutes): 1.0
>>> 		Maximum live cells per slice (last five minutes): 1
>>> 		Average tombstones per slice (last five minutes): 1.0
>>> 		Maximum tombstones per slice (last five minutes): 1
>>> 		Dropped Mutations: 0
>>> 		
>>> 		histograms
>>> Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
>>>                               (micros)          (micros)           (bytes)                  
>>> 50%             0.00             20.50             17.08                86                 1
>>> 75%             0.00             24.60             20.50               124                 1
>>> 95%             0.00             35.43             29.52               124                 1
>>> 98%             0.00             35.43             42.51               124                 1
>>> 99%             0.00             42.51             51.01               124                 1
>>> Min             0.00              8.24              5.72                73                 0
>>> Max             1.00             42.51            152.32               124                 1
>>> ```
>>> 
>>> 3 node in dc1 and 3 node in dc2 cluster. With instanc type aws  ec2 m4.xlarge
>>> 
>>>> On Sat, Feb 23, 2019, 7:47 PM Jeff Jirsa <jj...@gmail.com> wrote:
>>>> Would also be good to see your schema (anonymized if needed) and the select queries you’re running
>>>> 
>>>> 
>>>> -- 
>>>> Jeff Jirsa
>>>> 
>>>> 
>>>>> On Feb 23, 2019, at 4:37 PM, Rahul Reddy <ra...@gmail.com> wrote:
>>>>> 
>>>>> Thanks Jeff,
>>>>> 
>>>>> I'm having gcgs set to 10 mins and changed the table ttl also to 5  hours compared to insert ttl to 4 hours .  Tracing on doesn't show any tombstone scans for the reads.  And also log doesn't show tombstone scan alerts. Has the reads are happening 5-8k reads per node during the peak hours it shows 1M tombstone scans count per read. 
>>>>> 
>>>>>> On Fri, Feb 22, 2019, 11:46 AM Jeff Jirsa <jj...@gmail.com> wrote:
>>>>>> If all of your data is TTL’d and you never explicitly delete a cell without using s TTL, you can probably drop your GCGS to 1 hour (or less).
>>>>>> 
>>>>>> Which compaction strategy are you using? You need a way to clear out those tombstones. There exist tombstone compaction sub properties that can help encourage compaction to grab sstables just because they’re full of tombstones which will probably help you.
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> Jeff Jirsa
>>>>>> 
>>>>>> 
>>>>>>> On Feb 22, 2019, at 8:37 AM, Kenneth Brotman <ke...@yahoo.com.invalid> wrote:
>>>>>>> 
>>>>>>> Can we see the histogram?  Why wouldn’t you at times have that many tombstones?  Makes sense.
>>>>>>> 
>>>>>>>  
>>>>>>> 
>>>>>>> Kenneth Brotman
>>>>>>> 
>>>>>>>  
>>>>>>> 
>>>>>>> From: Rahul Reddy [mailto:rahulreddy1234@gmail.com] 
>>>>>>> Sent: Thursday, February 21, 2019 7:06 AM
>>>>>>> To: user@cassandra.apache.org
>>>>>>> Subject: Tombstones in memtable
>>>>>>> 
>>>>>>>  
>>>>>>> 
>>>>>>> We have small table records are about 5k .
>>>>>>> 
>>>>>>> All the inserts comes as 4hr ttl and we have table level ttl 1 day and gc grace seconds has 3 hours.  We do 5k reads a second during peak load During the peak load seeing Alerts for tomstone scanned histogram reaching million.
>>>>>>> 
>>>>>>> Cassandra version 3.11.1. Please let me know how can this tombstone scan can be avoided in memtable

Re: Tombstones in memtable

Posted by Rahul Reddy <ra...@gmail.com>.
Do you see anything wrong with this metric?

metric to scan tombstones
increase(cassandra_Table_TombstoneScannedHistogram{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])

And at the same time, CPU spikes to 50% whenever I see a high tombstone alert.
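
If the goal is alerting, a ratio of tombstones scanned to reads served is usually more telling than the raw count. A sketch, assuming the exporter also exposes the read latency timer count under an analogous name (the cassandra_Table_ReadLatency series below is an assumption about your exporter config, not a known metric name):

```
# hypothetical read-count series name -- adjust to whatever your exporter actually exports
  increase(cassandra_Table_TombstoneScannedHistogram{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])
/
  increase(cassandra_Table_ReadLatency{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])
```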

On Sat, Feb 23, 2019, 9:25 PM Jeff Jirsa <jj...@gmail.com> wrote:

> Your schema is such that you’ll never read more than one tombstone per
> select (unless you’re also doing range reads / table scans that you didn’t
> mention) - I’m not quite sure what you’re alerting on, but you’re not going
> to have tombstone problems with that table / that select.
>
> --
> Jeff Jirsa
>
>
> On Feb 23, 2019, at 5:55 PM, Rahul Reddy <ra...@gmail.com> wrote:
>
> Changing gcgs didn't help
>
> CREATE KEYSPACE ksname WITH replication = {'class':
> 'NetworkTopologyStrategy', 'dc1': '3', 'dc2': '3'}  AND durable_writes =
> true;
>
>
> ```CREATE TABLE keyspace."table" (
>     "column1" text PRIMARY KEY,
>     "column2" text
> ) WITH bloom_filter_fp_chance = 0.01
>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>     AND comment = ''
>     AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
> 'max_threshold': '32', 'min_threshold': '4'}
>     AND compression = {'chunk_length_in_kb': '64', 'class':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND crc_check_chance = 1.0
>     AND dclocal_read_repair_chance = 0.1
>     AND default_time_to_live = 18000
>     AND gc_grace_seconds = 60
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99PERCENTILE';
>
> flushed table and took tsstabledump
> grep -i '"expired" : true' SSTables.txt|wc -l
> 16439
> grep -i '"expired" : false'  SSTables.txt |wc -l
> 2657
>
> ttl is 4 hours.
>
> INSERT INTO keyspace."TABLE_NAME" ("column1", "column2") VALUES (?, ?)
> USING TTL(4hours) ?';
> SELECT * FROM keyspace."TABLE_NAME" WHERE "column1" = ?';
>
> metric to scan tombstones
>
> increase(cassandra_Table_TombstoneScannedHistogram{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])
>
> during peak hours. we only have couple of hundred inserts and 5-8k reads/s
> per node.
> ```
>
> ```tablestats
> Read Count: 605231874
> Read Latency: 0.021268529760215503 ms.
> Write Count: 2763352
> Write Latency: 0.027924007871599422 ms.
> Pending Flushes: 0
> Table: name
> SSTable count: 1
> Space used (live): 1413203
> Space used (total): 1413203
> Space used by snapshots (total): 0
> Off heap memory used (total): 28813
> SSTable Compression Ratio: 0.5015090954531143
> Number of partitions (estimate): 19568
> Memtable cell count: 573
> Memtable data size: 22971
> Memtable off heap memory used: 0
> Memtable switch count: 6
> Local read count: 529868919
> Local read latency: 0.020 ms
> Local write count: 2707371
> Local write latency: 0.024 ms
> Pending flushes: 0
> Percent repaired: 0.0
> Bloom filter false positives: 1
> Bloom filter false ratio: 0.00000
> Bloom filter space used: 23888
> Bloom filter off heap memory used: 23880
> Index summary off heap memory used: 4717
> Compression metadata off heap memory used: 216
> Compacted partition minimum bytes: 73
> Compacted partition maximum bytes: 124
> Compacted partition mean bytes: 99
> Average live cells per slice (last five minutes): 1.0
> Maximum live cells per slice (last five minutes): 1
> Average tombstones per slice (last five minutes): 1.0
> Maximum tombstones per slice (last five minutes): 1
> Dropped Mutations: 0
> histograms
> Percentile  SSTables     Write Latency      Read Latency    Partition
> Size        Cell Count
>                               (micros)          (micros)
>  (bytes)
> 50%             0.00             20.50             17.08
> 86                 1
> 75%             0.00             24.60             20.50
>  124                 1
> 95%             0.00             35.43             29.52
>  124                 1
> 98%             0.00             35.43             42.51
>  124                 1
> 99%             0.00             42.51             51.01
>  124                 1
> Min             0.00              8.24              5.72
> 73                 0
> Max             1.00             42.51            152.32
>  124                 1
> ```
>
> 3 node in dc1 and 3 node in dc2 cluster. With instanc type aws  ec2
> m4.xlarge
>
> On Sat, Feb 23, 2019, 7:47 PM Jeff Jirsa <jj...@gmail.com> wrote:
>
>> Would also be good to see your schema (anonymized if needed) and the
>> select queries you’re running
>>
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Feb 23, 2019, at 4:37 PM, Rahul Reddy <ra...@gmail.com>
>> wrote:
>>
>> Thanks Jeff,
>>
>> I'm having gcgs set to 10 mins and changed the table ttl also to 5  hours
>> compared to insert ttl to 4 hours .  Tracing on doesn't show any tombstone
>> scans for the reads.  And also log doesn't show tombstone scan alerts. Has
>> the reads are happening 5-8k reads per node during the peak hours it shows
>> 1M tombstone scans count per read.
>>
>> On Fri, Feb 22, 2019, 11:46 AM Jeff Jirsa <jj...@gmail.com> wrote:
>>
>>> If all of your data is TTL’d and you never explicitly delete a cell
>>> without using s TTL, you can probably drop your GCGS to 1 hour (or less).
>>>
>>> Which compaction strategy are you using? You need a way to clear out
>>> those tombstones. There exist tombstone compaction sub properties that can
>>> help encourage compaction to grab sstables just because they’re full of
>>> tombstones which will probably help you.
>>>
>>>
>>> --
>>> Jeff Jirsa
>>>
>>>
>>> On Feb 22, 2019, at 8:37 AM, Kenneth Brotman <
>>> kenbrotman@yahoo.com.invalid> wrote:
>>>
>>> Can we see the histogram?  Why wouldn’t you at times have that many
>>> tombstones?  Makes sense.
>>>
>>>
>>>
>>> Kenneth Brotman
>>>
>>>
>>>
>>> *From:* Rahul Reddy [mailto:rahulreddy1234@gmail.com
>>> <ra...@gmail.com>]
>>> *Sent:* Thursday, February 21, 2019 7:06 AM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Tombstones in memtable
>>>
>>>
>>>
>>> We have small table records are about 5k .
>>>
>>> All the inserts comes as 4hr ttl and we have table level ttl 1 day and
>>> gc grace seconds has 3 hours.  We do 5k reads a second during peak load
>>> During the peak load seeing Alerts for tomstone scanned histogram reaching
>>> million.
>>>
>>> Cassandra version 3.11.1. Please let me know how can this tombstone scan
>>> can be avoided in memtable
>>>
>>>

Re: Tombstones in memtable

Posted by Jeff Jirsa <jj...@gmail.com>.
Your schema is such that you’ll never read more than one tombstone per select (unless you’re also doing range reads / table scans that you didn’t mention) - I’m not quite sure what you’re alerting on, but you’re not going to have tombstone problems with that table / that select. 
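
A quick way to confirm that per query is cqlsh tracing (the key value below is just a placeholder):

```
cqlsh> TRACING ON;
cqlsh> SELECT * FROM "keyspace"."table" WHERE "column1" = 'some-key';
-- the trace/debug output reports lines like "Read 1 live rows and 0 tombstone cells"
```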

-- 
Jeff Jirsa


> On Feb 23, 2019, at 5:55 PM, Rahul Reddy <ra...@gmail.com> wrote:
> 
> Changing gcgs didn't help
> 
> CREATE KEYSPACE ksname WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3', 'dc2': '3'}  AND durable_writes = true;
> 
> 
> ```CREATE TABLE keyspace."table" (
>     "column1" text PRIMARY KEY,
>     "column2" text
> ) WITH bloom_filter_fp_chance = 0.01
>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>     AND comment = ''
>     AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
>     AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND crc_check_chance = 1.0
>     AND dclocal_read_repair_chance = 0.1
>     AND default_time_to_live = 18000
>     AND gc_grace_seconds = 60
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99PERCENTILE';
> 
> flushed table and took tsstabledump     
> grep -i '"expired" : true' SSTables.txt|wc -l
> 16439
> grep -i '"expired" : false'  SSTables.txt |wc -l
> 2657
> 
> ttl is 4 hours.
> 
> INSERT INTO keyspace."TABLE_NAME" ("column1", "column2") VALUES (?, ?) USING TTL(4hours) ?';
> SELECT * FROM keyspace."TABLE_NAME" WHERE "column1" = ?';
> 
> metric to scan tombstones 
> increase(cassandra_Table_TombstoneScannedHistogram{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])
> 
> during peak hours. we only have couple of hundred inserts and 5-8k reads/s per node.
> ```
> 
> ```tablestats
> 	Read Count: 605231874
> 	Read Latency: 0.021268529760215503 ms.
> 	Write Count: 2763352
> 	Write Latency: 0.027924007871599422 ms.
> 	Pending Flushes: 0
> 		Table: name
> 		SSTable count: 1
> 		Space used (live): 1413203
> 		Space used (total): 1413203
> 		Space used by snapshots (total): 0
> 		Off heap memory used (total): 28813
> 		SSTable Compression Ratio: 0.5015090954531143
> 		Number of partitions (estimate): 19568
> 		Memtable cell count: 573
> 		Memtable data size: 22971
> 		Memtable off heap memory used: 0
> 		Memtable switch count: 6
> 		Local read count: 529868919
> 		Local read latency: 0.020 ms
> 		Local write count: 2707371
> 		Local write latency: 0.024 ms
> 		Pending flushes: 0
> 		Percent repaired: 0.0
> 		Bloom filter false positives: 1
> 		Bloom filter false ratio: 0.00000
> 		Bloom filter space used: 23888
> 		Bloom filter off heap memory used: 23880
> 		Index summary off heap memory used: 4717
> 		Compression metadata off heap memory used: 216
> 		Compacted partition minimum bytes: 73
> 		Compacted partition maximum bytes: 124
> 		Compacted partition mean bytes: 99
> 		Average live cells per slice (last five minutes): 1.0
> 		Maximum live cells per slice (last five minutes): 1
> 		Average tombstones per slice (last five minutes): 1.0
> 		Maximum tombstones per slice (last five minutes): 1
> 		Dropped Mutations: 0
> 		
> 		histograms
> Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
>                               (micros)          (micros)           (bytes)                  
> 50%             0.00             20.50             17.08                86                 1
> 75%             0.00             24.60             20.50               124                 1
> 95%             0.00             35.43             29.52               124                 1
> 98%             0.00             35.43             42.51               124                 1
> 99%             0.00             42.51             51.01               124                 1
> Min             0.00              8.24              5.72                73                 0
> Max             1.00             42.51            152.32               124                 1
> ```
> 
> 3 node in dc1 and 3 node in dc2 cluster. With instanc type aws  ec2 m4.xlarge
> 
>> On Sat, Feb 23, 2019, 7:47 PM Jeff Jirsa <jj...@gmail.com> wrote:
>> Would also be good to see your schema (anonymized if needed) and the select queries you’re running
>> 
>> 
>> -- 
>> Jeff Jirsa
>> 
>> 
>>> On Feb 23, 2019, at 4:37 PM, Rahul Reddy <ra...@gmail.com> wrote:
>>> 
>>> Thanks Jeff,
>>> 
>>> I'm having gcgs set to 10 mins and changed the table ttl also to 5  hours compared to insert ttl to 4 hours .  Tracing on doesn't show any tombstone scans for the reads.  And also log doesn't show tombstone scan alerts. Has the reads are happening 5-8k reads per node during the peak hours it shows 1M tombstone scans count per read. 
>>> 
>>>> On Fri, Feb 22, 2019, 11:46 AM Jeff Jirsa <jj...@gmail.com> wrote:
>>>> If all of your data is TTL’d and you never explicitly delete a cell without using s TTL, you can probably drop your GCGS to 1 hour (or less).
>>>> 
>>>> Which compaction strategy are you using? You need a way to clear out those tombstones. There exist tombstone compaction sub properties that can help encourage compaction to grab sstables just because they’re full of tombstones which will probably help you.
>>>> 
>>>> 
>>>> -- 
>>>> Jeff Jirsa
>>>> 
>>>> 
>>>>> On Feb 22, 2019, at 8:37 AM, Kenneth Brotman <ke...@yahoo.com.invalid> wrote:
>>>>> 
>>>>> Can we see the histogram?  Why wouldn’t you at times have that many tombstones?  Makes sense.
>>>>> 
>>>>>  
>>>>> 
>>>>> Kenneth Brotman
>>>>> 
>>>>>  
>>>>> 
>>>>> From: Rahul Reddy [mailto:rahulreddy1234@gmail.com] 
>>>>> Sent: Thursday, February 21, 2019 7:06 AM
>>>>> To: user@cassandra.apache.org
>>>>> Subject: Tombstones in memtable
>>>>> 
>>>>>  
>>>>> 
>>>>> We have small table records are about 5k .
>>>>> 
>>>>> All the inserts comes as 4hr ttl and we have table level ttl 1 day and gc grace seconds has 3 hours.  We do 5k reads a second during peak load During the peak load seeing Alerts for tomstone scanned histogram reaching million.
>>>>> 
>>>>> Cassandra version 3.11.1. Please let me know how can this tombstone scan can be avoided in memtable

Re: Tombstones in memtable

Posted by Rahul Reddy <ra...@gmail.com>.
Changing gcgs didn't help

CREATE KEYSPACE ksname WITH replication = {'class':
'NetworkTopologyStrategy', 'dc1': '3', 'dc2': '3'}  AND durable_writes =
true;


```CREATE TABLE keyspace."table" (
    "column1" text PRIMARY KEY,
    "column2" text
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class':
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class':
'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 18000
    AND gc_grace_seconds = 60
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';

flushed the table and took an sstabledump
grep -i '"expired" : true' SSTables.txt|wc -l
16439
grep -i '"expired" : false'  SSTables.txt |wc -l
2657
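
For reference, a sketch of how that dump can be produced (the Data.db path is a placeholder; sstabledump ships with 3.11):

```
nodetool flush mykeyspace tablename
# placeholder path -- point sstabledump at the table's actual -Data.db file
sstabledump /var/lib/cassandra/data/mykeyspace/tablename-*/mc-*-big-Data.db > SSTables.txt
grep -c '"expired" : true'  SSTables.txt
grep -c '"expired" : false' SSTables.txt
```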

ttl is 4 hours.

INSERT INTO keyspace."TABLE_NAME" ("column1", "column2") VALUES (?, ?)
USING TTL(4hours) ?';
SELECT * FROM keyspace."TABLE_NAME" WHERE "column1" = ?';
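
For completeness, the usual way to express a 4-hour TTL in a prepared statement is either a literal in seconds or a third bind marker (a sketch using the anonymized names above):

```
-- literal form: 4 hours = 14400 seconds
INSERT INTO keyspace."TABLE_NAME" ("column1", "column2") VALUES (?, ?) USING TTL 14400;

-- or bind the TTL as well and pass 14400 from the driver
INSERT INTO keyspace."TABLE_NAME" ("column1", "column2") VALUES (?, ?) USING TTL ?;
```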

metric to scan tombstones
increase(cassandra_Table_TombstoneScannedHistogram{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])

During peak hours we only have a couple hundred inserts and 5-8k reads/s
per node.
```

```tablestats
Read Count: 605231874
Read Latency: 0.021268529760215503 ms.
Write Count: 2763352
Write Latency: 0.027924007871599422 ms.
Pending Flushes: 0
Table: name
SSTable count: 1
Space used (live): 1413203
Space used (total): 1413203
Space used by snapshots (total): 0
Off heap memory used (total): 28813
SSTable Compression Ratio: 0.5015090954531143
Number of partitions (estimate): 19568
Memtable cell count: 573
Memtable data size: 22971
Memtable off heap memory used: 0
Memtable switch count: 6
Local read count: 529868919
Local read latency: 0.020 ms
Local write count: 2707371
Local write latency: 0.024 ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 1
Bloom filter false ratio: 0.00000
Bloom filter space used: 23888
Bloom filter off heap memory used: 23880
Index summary off heap memory used: 4717
Compression metadata off heap memory used: 216
Compacted partition minimum bytes: 73
Compacted partition maximum bytes: 124
Compacted partition mean bytes: 99
Average live cells per slice (last five minutes): 1.0
Maximum live cells per slice (last five minutes): 1
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 0
histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)
50%             0.00             20.50             17.08                86                 1
75%             0.00             24.60             20.50               124                 1
95%             0.00             35.43             29.52               124                 1
98%             0.00             35.43             42.51               124                 1
99%             0.00             42.51             51.01               124                 1
Min             0.00              8.24              5.72                73                 0
Max             1.00             42.51            152.32               124                 1
```

3 nodes in dc1 and 3 nodes in dc2. Instance type: AWS EC2 m4.xlarge.

On Sat, Feb 23, 2019, 7:47 PM Jeff Jirsa <jj...@gmail.com> wrote:

> Would also be good to see your schema (anonymized if needed) and the
> select queries you’re running
>
>
> --
> Jeff Jirsa
>
>
> On Feb 23, 2019, at 4:37 PM, Rahul Reddy <ra...@gmail.com> wrote:
>
> Thanks Jeff,
>
> I'm having gcgs set to 10 mins and changed the table ttl also to 5  hours
> compared to insert ttl to 4 hours .  Tracing on doesn't show any tombstone
> scans for the reads.  And also log doesn't show tombstone scan alerts. Has
> the reads are happening 5-8k reads per node during the peak hours it shows
> 1M tombstone scans count per read.
>
> On Fri, Feb 22, 2019, 11:46 AM Jeff Jirsa <jj...@gmail.com> wrote:
>
>> If all of your data is TTL’d and you never explicitly delete a cell
>> without using s TTL, you can probably drop your GCGS to 1 hour (or less).
>>
>> Which compaction strategy are you using? You need a way to clear out
>> those tombstones. There exist tombstone compaction sub properties that can
>> help encourage compaction to grab sstables just because they’re full of
>> tombstones which will probably help you.
>>
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Feb 22, 2019, at 8:37 AM, Kenneth Brotman <
>> kenbrotman@yahoo.com.invalid> wrote:
>>
>> Can we see the histogram?  Why wouldn’t you at times have that many
>> tombstones?  Makes sense.
>>
>>
>>
>> Kenneth Brotman
>>
>>
>>
>> *From:* Rahul Reddy [mailto:rahulreddy1234@gmail.com
>> <ra...@gmail.com>]
>> *Sent:* Thursday, February 21, 2019 7:06 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Tombstones in memtable
>>
>>
>>
>> We have small table records are about 5k .
>>
>> All the inserts comes as 4hr ttl and we have table level ttl 1 day and gc
>> grace seconds has 3 hours.  We do 5k reads a second during peak load During
>> the peak load seeing Alerts for tomstone scanned histogram reaching million.
>>
>> Cassandra version 3.11.1. Please let me know how can this tombstone scan
>> can be avoided in memtable
>>
>>

RE: Tombstones in memtable

Posted by Kenneth Brotman <ke...@yahoo.com.INVALID>.
Rahul,

 

Please see this DataStax article which suggests you might be using Cassandra as a queue-like dataset – and that’s an anti-pattern for Cassandra.  It could be you need to use a different database.  It could be your data model is wrong:

https://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets

 

Kenneth Brotman

 

From: Jeff Jirsa [mailto:jjirsa@gmail.com] 
Sent: Saturday, February 23, 2019 4:47 PM
To: user@cassandra.apache.org
Subject: Re: Tombstones in memtable

 

Would also be good to see your schema (anonymized if needed) and the select queries you’re running

 

-- 

Jeff Jirsa

 


On Feb 23, 2019, at 4:37 PM, Rahul Reddy <ra...@gmail.com> wrote:

Thanks Jeff,

 

I have gc_grace_seconds set to 10 minutes and changed the table-level TTL to 5 hours, compared to the insert TTL of 4 hours. Tracing doesn't show any tombstone scans for the reads, and the logs don't show tombstone scan alerts. Yet while reads run at 5-8k per node during peak hours, the metric still shows the tombstone scan count per read reaching 1M.

 

On Fri, Feb 22, 2019, 11:46 AM Jeff Jirsa <jj...@gmail.com> wrote:

If all of your data is TTL’d and you never explicitly delete a cell without using a TTL, you can probably drop your GCGS to 1 hour (or less).

 

Which compaction strategy are you using? You need a way to clear out those tombstones. There exist tombstone compaction sub properties that can help encourage compaction to grab sstables just because they’re full of tombstones which will probably help you.

 

-- 

Jeff Jirsa

 


On Feb 22, 2019, at 8:37 AM, Kenneth Brotman <ke...@yahoo.com.invalid> wrote:

Can we see the histogram?  Why wouldn’t you at times have that many tombstones?  Makes sense.

 

Kenneth Brotman

 

From: Rahul Reddy [mailto:rahulreddy1234@gmail.com] 
Sent: Thursday, February 21, 2019 7:06 AM
To: user@cassandra.apache.org
Subject: Tombstones in memtable

 

We have small table records are about 5k .

All the inserts comes as 4hr ttl and we have table level ttl 1 day and gc grace seconds has 3 hours.  We do 5k reads a second during peak load During the peak load seeing Alerts for tomstone scanned histogram reaching million.

Cassandra version 3.11.1. Please let me know how can this tombstone scan can be avoided in memtable


Re: Tombstones in memtable

Posted by Jeff Jirsa <jj...@gmail.com>.
Would also be good to see your schema (anonymized if needed) and the select queries you’re running


-- 
Jeff Jirsa


> On Feb 23, 2019, at 4:37 PM, Rahul Reddy <ra...@gmail.com> wrote:
> 
> Thanks Jeff,
> 
> I have gc_grace_seconds set to 10 minutes and changed the table-level TTL to 5 hours, compared to the insert TTL of 4 hours. Tracing doesn't show any tombstone scans for the reads, and the logs don't show tombstone scan alerts. Yet while reads run at 5-8k per node during peak hours, the metric still shows the tombstone scan count per read reaching 1M.
> 
>> On Fri, Feb 22, 2019, 11:46 AM Jeff Jirsa <jj...@gmail.com> wrote:
>> If all of your data is TTL’d and you never explicitly delete a cell without using a TTL, you can probably drop your GCGS to 1 hour (or less).
>> 
>> Which compaction strategy are you using? You need a way to clear out those tombstones. There exist tombstone compaction sub properties that can help encourage compaction to grab sstables just because they’re full of tombstones which will probably help you.
>> 
>> 
>> -- 
>> Jeff Jirsa
>> 
>> 
>>> On Feb 22, 2019, at 8:37 AM, Kenneth Brotman <ke...@yahoo.com.invalid> wrote:
>>> 
>>> Can we see the histogram?  Why wouldn’t you at times have that many tombstones?  Makes sense.
>>> 
>>>  
>>> 
>>> Kenneth Brotman
>>> 
>>>  
>>> 
>>> From: Rahul Reddy [mailto:rahulreddy1234@gmail.com] 
>>> Sent: Thursday, February 21, 2019 7:06 AM
>>> To: user@cassandra.apache.org
>>> Subject: Tombstones in memtable
>>> 
>>>  
>>> 
>>> We have small table records are about 5k .
>>> 
>>> All the inserts comes as 4hr ttl and we have table level ttl 1 day and gc grace seconds has 3 hours.  We do 5k reads a second during peak load During the peak load seeing Alerts for tomstone scanned histogram reaching million.
>>> 
>>> Cassandra version 3.11.1. Please let me know how can this tombstone scan can be avoided in memtable

Re: Tombstones in memtable

Posted by Jeff Jirsa <jj...@gmail.com>.
I’m not parsing this - did the lower gcgs help or not? Seeing the table histograms is the next step if this is still a problem.

The table level TTL doesn’t matter if you set a TTL on each insert 
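
To illustrate with a minimal CQL sketch (not from the original thread; the keyspace, table, and values are placeholders modelled on the anonymized schema): a per-write USING TTL always wins, and default_time_to_live only applies to writes that carry no TTL of their own.

```
-- Hypothetical table, mirroring the anonymized one in this thread.
CREATE TABLE ks.tbl (
    column1 text PRIMARY KEY,
    column2 text
) WITH default_time_to_live = 18000;   -- table-level TTL: 5 hours

-- Expires after 4 hours: the per-insert TTL overrides the table default.
INSERT INTO ks.tbl (column1, column2) VALUES ('k1', 'v1') USING TTL 14400;

-- Expires after 5 hours: no TTL given, so the table default applies.
INSERT INTO ks.tbl (column1, column2) VALUES ('k2', 'v2');
```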



-- 
Jeff Jirsa


> On Feb 23, 2019, at 4:37 PM, Rahul Reddy <ra...@gmail.com> wrote:
> 
> Thanks Jeff,
> 
> I have gc_grace_seconds set to 10 minutes and changed the table-level TTL to 5 hours, compared to the insert TTL of 4 hours. Tracing doesn't show any tombstone scans for the reads, and the logs don't show tombstone scan alerts. Yet while reads run at 5-8k per node during peak hours, the metric still shows the tombstone scan count per read reaching 1M.
> 
>> On Fri, Feb 22, 2019, 11:46 AM Jeff Jirsa <jj...@gmail.com> wrote:
>> If all of your data is TTL’d and you never explicitly delete a cell without using a TTL, you can probably drop your GCGS to 1 hour (or less).
>> 
>> Which compaction strategy are you using? You need a way to clear out those tombstones. There exist tombstone compaction sub properties that can help encourage compaction to grab sstables just because they’re full of tombstones which will probably help you.
>> 
>> 
>> -- 
>> Jeff Jirsa
>> 
>> 
>>> On Feb 22, 2019, at 8:37 AM, Kenneth Brotman <ke...@yahoo.com.invalid> wrote:
>>> 
>>> Can we see the histogram?  Why wouldn’t you at times have that many tombstones?  Makes sense.
>>> 
>>>  
>>> 
>>> Kenneth Brotman
>>> 
>>>  
>>> 
>>> From: Rahul Reddy [mailto:rahulreddy1234@gmail.com] 
>>> Sent: Thursday, February 21, 2019 7:06 AM
>>> To: user@cassandra.apache.org
>>> Subject: Tombstones in memtable
>>> 
>>>  
>>> 
>>> We have small table records are about 5k .
>>> 
>>> All the inserts comes as 4hr ttl and we have table level ttl 1 day and gc grace seconds has 3 hours.  We do 5k reads a second during peak load During the peak load seeing Alerts for tomstone scanned histogram reaching million.
>>> 
>>> Cassandra version 3.11.1. Please let me know how can this tombstone scan can be avoided in memtable

Re: Tombstones in memtable

Posted by Rahul Reddy <ra...@gmail.com>.
Thanks Jeff,

I have gc_grace_seconds set to 10 minutes and changed the table-level TTL
to 5 hours, compared to the insert TTL of 4 hours. Tracing doesn't show any
tombstone scans for the reads, and the logs don't show tombstone scan
alerts. Yet while reads run at 5-8k per node during peak hours, the metric
still shows the tombstone scan count per read reaching 1M.
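
For reference, a quick cqlsh spot-check looks like this (a sketch only; the key and table names are placeholders, not from this thread):

```
TRACING ON;
SELECT column2 FROM ks.tbl WHERE column1 = 'some-key';
-- The trace output reports, per replica, how many live rows and tombstone cells were read.
TRACING OFF;
```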

On Fri, Feb 22, 2019, 11:46 AM Jeff Jirsa <jj...@gmail.com> wrote:

> If all of your data is TTL’d and you never explicitly delete a cell
> without using a TTL, you can probably drop your GCGS to 1 hour (or less).
>
> Which compaction strategy are you using? You need a way to clear out those
> tombstones. There exist tombstone compaction sub properties that can help
> encourage compaction to grab sstables just because they’re full of
> tombstones which will probably help you.
>
>
> --
> Jeff Jirsa
>
>
> On Feb 22, 2019, at 8:37 AM, Kenneth Brotman <ke...@yahoo.com.invalid>
> wrote:
>
> Can we see the histogram?  Why wouldn’t you at times have that many
> tombstones?  Makes sense.
>
>
>
> Kenneth Brotman
>
>
>
> *From:* Rahul Reddy [mailto:rahulreddy1234@gmail.com
> <ra...@gmail.com>]
> *Sent:* Thursday, February 21, 2019 7:06 AM
> *To:* user@cassandra.apache.org
> *Subject:* Tombstones in memtable
>
>
>
> We have small table records are about 5k .
>
> All the inserts comes as 4hr ttl and we have table level ttl 1 day and gc
> grace seconds has 3 hours.  We do 5k reads a second during peak load During
> the peak load seeing Alerts for tomstone scanned histogram reaching million.
>
> Cassandra version 3.11.1. Please let me know how can this tombstone scan
> can be avoided in memtable
>
>

Re: Tombstones in memtable

Posted by Jeff Jirsa <jj...@gmail.com>.
If all of your data is TTL’d and you never explicitly delete a cell without using a TTL, you can probably drop your GCGS to 1 hour (or less).

Which compaction strategy are you using? You need a way to clear out those tombstones. There exist tombstone compaction sub properties that can help encourage compaction to grab sstables just because they’re full of tombstones which will probably help you.
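
As a rough CQL sketch of that (illustrative values only and a placeholder table name; a 1-hour gc_grace_seconds is only safe because every write carries a TTL and nothing is deleted without one):

```
ALTER TABLE ks.tbl
WITH gc_grace_seconds = 3600   -- 1 hour
AND compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'tombstone_threshold': '0.2',              -- compact an sstable alone once ~20% of it is droppable tombstones
    'tombstone_compaction_interval': '3600',   -- wait at least 1 hour after an sstable is written before considering it
    'unchecked_tombstone_compaction': 'true'   -- run single-sstable tombstone compactions without the pre-check
};
```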


-- 
Jeff Jirsa


> On Feb 22, 2019, at 8:37 AM, Kenneth Brotman <ke...@yahoo.com.invalid> wrote:
> 
> Can we see the histogram?  Why wouldn’t you at times have that many tombstones?  Makes sense.
>  
> Kenneth Brotman
>  
> From: Rahul Reddy [mailto:rahulreddy1234@gmail.com] 
> Sent: Thursday, February 21, 2019 7:06 AM
> To: user@cassandra.apache.org
> Subject: Tombstones in memtable
>  
> We have small table records are about 5k .
> All the inserts comes as 4hr ttl and we have table level ttl 1 day and gc grace seconds has 3 hours.  We do 5k reads a second during peak load During the peak load seeing Alerts for tomstone scanned histogram reaching million.
> Cassandra version 3.11.1. Please let me know how can this tombstone scan can be avoided in memtable

RE: Tombstones in memtable

Posted by Kenneth Brotman <ke...@yahoo.com.INVALID>.
Can we see the histogram?  Why wouldn’t you at times have that many tombstones?  Makes sense.

 

Kenneth Brotman

 

From: Rahul Reddy [mailto:rahulreddy1234@gmail.com] 
Sent: Thursday, February 21, 2019 7:06 AM
To: user@cassandra.apache.org
Subject: Tombstones in memtable

 

We have small table records are about 5k .

All the inserts comes as 4hr ttl and we have table level ttl 1 day and gc grace seconds has 3 hours.  We do 5k reads a second during peak load During the peak load seeing Alerts for tomstone scanned histogram reaching million.

Cassandra version 3.11.1. Please let me know how can this tombstone scan can be avoided in memtable