Posted to user@cassandra.apache.org by eugene miretsky <eu...@gmail.com> on 2017/10/04 21:05:39 UTC

How do TTLs generate tombstones

Hello,

The following link says that TTLs generate tombstones -
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useExpire.html.

What exactly is the process that converts the TTL into a tombstone?

   1. Is an actual new tombstone cell created when the TTL expires?
   2. Or, is the TTLed cell treated as a tombstone?


Also, does gc_grace_seconds have an effect on TTLed cells? gc_grace_seconds
is meant to protect against deleted data re-appearing if the tombstone is
compacted away before all nodes have reached a consistent state. However,
since the TTL is stored in the cell itself (in liveness_info), there is no
way for the cell to re-appear (the TTL will still be there).
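
(For context, the TTL being carried by the cell itself can be seen from
cqlsh - the keyspace, table and column names below are just an example,
assuming the keyspace already exists:)

CREATE TABLE ks.events (k int PRIMARY KEY, v int);
INSERT INTO ks.events (k, v) VALUES (1, 1) USING TTL 600;

-- ttl() and writetime() expose the liveness info stored with the cell
SELECT v, ttl(v), writetime(v) FROM ks.events WHERE k = 1;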

Cheers,
Eugene

Re: How do TTLs generate tombstones

Posted by eugene miretsky <eu...@gmail.com>.
Thanks,

We have turned off read repair, and read with consistency = one. This
leaves repairs and old timestamps (generated by the client) as possible
causes for the overlap. We are writing from Spark, and didn't have NTP set
up on the cluster - I think that was causing some of the issues, but we
have since fixed it, and the problem remains.

It is hard for me to believe that C* repair has a bug, so before creating a
JIRA, I would appreciate it if you could take a look at the attached sstable
metadata (produced using sstablemetadata) from two different time points
over the last 2 weeks (we ran compaction in between).

In both cases, there are sstables generated around 8 pm that span very long
time periods (sometimes over a day). We run repair daily at 8 pm.
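
(For reference, this is roughly how the metadata was pulled - the data
paths below are just examples, and the exact field names printed by
sstablemetadata may differ slightly between versions:)

# dump the min/max write timestamps of every sstable in the table's data dir
for f in /var/lib/cassandra/data/ks/events-*/mc-*-big-Data.db; do
  echo "== $f"
  sstablemetadata "$f" | grep -E 'Minimum timestamp|Maximum timestamp'
done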

Cheers,
Eugene

On Wed, Oct 11, 2017 at 12:53 PM, Jeff Jirsa <jj...@gmail.com> wrote:

> Anti-entropy repairs ("nodetool repair") and bootstrap/decom/removenode
> should stream sections of (and/or possibly entire) sstables from one
> replica to another. Assuming the original sstable was entirely contained in
> a single time window, the resulting sstable fragment streamed to the
> neighbor node will similarly be entirely contained within a single time
> window, and will be joined with the sstables in that window. If you find
> this isn't the case, open a JIRA, that's a bug (it was explicitly a design
> goal of TWCS, as it was one of my biggest gripes with early versions of
> DTCS).
>
> Read repairs, however, will pollute the memtable and cause overlaps. There
> are two types of read repairs:
> - Blocking read repair due to consistency level (read at quorum, and one
> of the replicas is missing data, the coordinator will issue mutations to
> the missing replica, which will go into the memtable and flush into the
> newest time window). This can not be disabled (period), and is probably the
> reason most people have overlaps (because people tend to read their writes
> pretty quickly after writes in time series use cases, often before hints or
> normal repair can be successful, especially in environments where nodes are
> bounced often).
> - Background read repair (tunable with the read_repair_chance and
> dclocal_read_repair_chance table options), which is like blocking read
> repair, but happens probabilistically (ie: there's a 1% chance on any read
> that the coordinator will scan the partition and copy any missing data to
> the replicas missing that data. Again, this goes to the memtable, and will
> flush into the newest time window).
>
> There's a pretty good argument to be made against manual repairs if (and
> only if) you only use TTLs, never explicitly delete data, and can tolerate
> the business risk of losing two machines at a time (that is: in the very
> very rare case that you somehow lose 2 machines before you can rebuild,
> you'll lose some subset of data that never made it to the sole remaining
> replica; is your business going to lose millions of dollars, or will you
> just have a gap in an analytics dashboard somewhere that nobody's going to
> worry about).
>
> - Jeff
>
>
> On Wed, Oct 11, 2017 at 9:24 AM, Sumanth Pasupuleti <
> spasupuleti@netflix.com.invalid> wrote:
>
>> Hi Eugene,
>>
>> Common contributors to overlapping SSTables are
>> 1. Hints
>> 2. Repairs
>> 3. New writes with old timestamps (should be rare but technically
>> possible)
>>
>> I would not run repairs with TWCS - as you indicated, it is going to
>> result in overlapping SSTables which impacts disk space and read latency
>> since reads now have to encompass multiple SSTables.
>>
>> As for https://issues.apache.org/jira/browse/CASSANDRA-13418, I would
>> not worry about data resurrection as long as all the writes carry TTL with
>> them.
>>
>> We faced similar overlapping issues with TWCS (it wss due to
>> dclocal_read_repair_chance) - we developed an SSTable tool that would give
>> topN or bottomN keys in an SSTable based on writetime/deletion time - we
>> used this to identify the specific keys responsible for overlap between
>> SSTables.
>>
>> Thanks,
>> Sumanth
>>
>>
>> On Mon, Oct 9, 2017 at 6:36 PM, eugene miretsky <
>> eugene.miretsky@gmail.com> wrote:
>>
>>> Thanks Alain!
>>>
>>> We are using TWCS compaction, and I read your blog multiple times - it
>>> was very useful, thanks!
>>>
>>> We are seeing a lot of overlapping SSTables, leading to a lot of
>>> problems: (a) large number of tombstones read in queries, (b) high CPU
>>> usage, (c) fairly long Young Gen GC collection (300ms)
>>>
>>> We have read_repair_change = 0, and unchecked_tombstone_compaction =
>>> true, gc_grace_seconds = 3h,  but we read and write with consistency =
>>> 1.
>>>
>>> I'm suspecting the overlap is coming from either hinted handoff or a
>>> repair job we run nightly.
>>>
>>> 1) Is running repair with TWCS recommended? It seems like it will
>>> always create a neverending overlap (the repair SSTable will have data from
>>> all 24 hours), an effect that seems to get amplified with anti-compaction.
>>> 2) TWCS seems to introduce a tradeoff between eventual consistency and
>>> write/read availability. If all repairs are turned off, then the choice is
>>> either (a) user strong consistency level, and pay the price of lower
>>> availability and slowers reads or writes, or (b) use lower consistency
>>> level, and risk inconsistent data (data is never repaired)
>>>
>>> I will try your last link but reappearing data sound a bit scary :)
>>>
>>> Any advice on how to debug this further would be greatly apprecaited.
>>>
>>> Cheers,
>>> Eugene
>>>
>>> On Fri, Oct 6, 2017 at 11:02 AM, Alain RODRIGUEZ <ar...@gmail.com>
>>> wrote:
>>>
>>>> Hi Eugene,
>>>>
>>>> If we never use updates (time series data), is it safe to set
>>>>> gc_grace_seconds=0.
>>>>
>>>>
>>>> As Kurt pointed, you never want 'gc_grace_seconds' to be lower than
>>>> 'max_hint_window_in_ms' as the min off these 2 values is used for hints
>>>> storage window size in Apache Cassandra.
>>>>
>>>> Yet time series data with fixed TTLs allows a very efficient use of
>>>> Cassandra, specially when using Time Window Compaction Strategy (TWCS).
>>>> Funny fact is that Jeff brought it to Apache Cassandra :-). I would
>>>> definitely give it a try.
>>>>
>>>> Here is a post from my colleague Alex that I believe could be useful in
>>>> your case: http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
>>>>
>>>> Using TWCS and setting and lowering 'gc_grace_seconds' to the value of
>>>> 'max_hint_window_in_ms' should be really effective. Make sure to use a
>>>> strong consistency level (generally RF = 3, CL.Read = CL.Write =
>>>> LOCAL_QUORUM) to prevent inconsistencies I would say (and depending on your
>>>> interest in consistency).
>>>>
>>>> This way you could expire entires SSTables, without compaction. If
>>>> overlaps in SSTables become a problem, you could even consider to give a
>>>> try to a more aggressive SSTable expiration
>>>> https://issues.apache.org/jira/browse/CASSANDRA-13418.
>>>>
>>>> C*heers,
>>>> -----------------------
>>>> Alain Rodriguez - @arodream - alain@thelastpickle.com
>>>> France / Spain
>>>>
>>>> The Last Pickle - Apache Cassandra Consulting
>>>> http://www.thelastpickle.com
>>>>
>>>>
>>>>
>>>> 2017-10-05 23:44 GMT+01:00 kurt greaves <ku...@instaclustr.com>:
>>>>
>>>>> No it's never safe to set it to 0 as you'll disable hinted handoff for
>>>>> the table. If you are never doing updates and manual deletes and you always
>>>>> insert with a ttl you can get away with setting it to the hinted handoff
>>>>> period.
>>>>>
>>>>> On 6 Oct. 2017 1:28 am, "eugene miretsky" <eu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Jeff,
>>>>>>
>>>>>> Make sense.
>>>>>> If we never use updates (time series data), is it safe to set
>>>>>> gc_grace_seconds=0.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 4, 2017 at 5:59 PM, Jeff Jirsa <jj...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> The TTL'd cell is treated as a tombstone. gc_grace_seconds applies
>>>>>>> to TTL'd cells, because even though the data is TTL'd, it may have been
>>>>>>> written on top of another live cell that wasn't ttl'd:
>>>>>>>
>>>>>>> Imagine a test table, simple key->value (k, v).
>>>>>>>
>>>>>>> INSERT INTO table(k,v) values(1,1);
>>>>>>> Kill 1 of the 3 nodes
>>>>>>> UPDATE table USING TTL 60 SET v=1 WHERE k=1 ;
>>>>>>> 60 seconds later, the live nodes will see that data as deleted, but
>>>>>>> when that dead node comes back to life, it needs to learn of the deletion.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Oct 4, 2017 at 2:05 PM, eugene miretsky <
>>>>>>> eugene.miretsky@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> The following link says that TTLs generate tombstones -
>>>>>>>> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useExpire.html.
>>>>>>>>
>>>>>>>> What exactly is the process that converts the TTL into a tombstone?
>>>>>>>>
>>>>>>>>    1. Is an actual new tombstone cell created when the TTL expires?
>>>>>>>>    2. Or, is the TTLed cell treated as a tombstone?
>>>>>>>>
>>>>>>>>
>>>>>>>> Also, does gc_grace_period have an effect on TTLed cells?
>>>>>>>> gc_grace_period is meant to protect from deleted data re-appearing if the
>>>>>>>> tombstone is compacted away before all nodes have reached a consistent
>>>>>>>> state. However, since the ttl is stored in the cell (in liveness_info),
>>>>>>>> there is no way for the cell to re-appear (the ttl will still be there)
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Eugene
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>
>

Re: How do TTLs generate tombstones

Posted by Jeff Jirsa <jj...@gmail.com>.
Anti-entropy repairs ("nodetool repair") and bootstrap/decom/removenode
should stream sections of (and/or possibly entire) sstables from one
replica to another. Assuming the original sstable was entirely contained in
a single time window, the resulting sstable fragment streamed to the
neighbor node will similarly be entirely contained within a single time
window, and will be joined with the sstables in that window. If you find
this isn't the case, open a JIRA, that's a bug (it was explicitly a design
goal of TWCS, as it was one of my biggest gripes with early versions of
DTCS).

Read repairs, however, will pollute the memtable and cause overlaps. There
are two types of read repairs:
- Blocking read repair due to consistency level (read at quorum, and one of
the replicas is missing data, the coordinator will issue mutations to the
missing replica, which will go into the memtable and flush into the newest
time window). This cannot be disabled (period), and is probably the reason
most people have overlaps (because people tend to read their writes pretty
quickly after writes in time series use cases, often before hints or normal
repair can be successful, especially in environments where nodes are
bounced often).
- Background read repair (tunable with the read_repair_chance and
dclocal_read_repair_chance table options), which is like blocking read
repair, but happens probabilistically (ie: there's a 1% chance on any read
that the coordinator will scan the partition and copy any missing data to
the replicas missing that data. Again, this goes to the memtable, and will
flush into the newest time window).
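
(If it helps, this is the kind of statement that turns the probabilistic
read repairs off for a table - keyspace/table names are placeholders; the
blocking, consistency-level-driven read repair described above is
unaffected by it:)

ALTER TABLE ks.events
  WITH read_repair_chance = 0.0
   AND dclocal_read_repair_chance = 0.0;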

There's a pretty good argument to be made against manual repairs if (and
only if) you only use TTLs, never explicitly delete data, and can tolerate
the business risk of losing two machines at a time (that is: in the very
very rare case that you somehow lose 2 machines before you can rebuild,
you'll lose some subset of data that never made it to the sole remaining
replica; is your business going to lose millions of dollars, or will you
just have a gap in an analytics dashboard somewhere that nobody's going to
worry about?).

- Jeff


On Wed, Oct 11, 2017 at 9:24 AM, Sumanth Pasupuleti <
spasupuleti@netflix.com.invalid> wrote:

> Hi Eugene,
>
> Common contributors to overlapping SSTables are
> 1. Hints
> 2. Repairs
> 3. New writes with old timestamps (should be rare but technically possible)
>
> I would not run repairs with TWCS - as you indicated, it is going to
> result in overlapping SSTables which impacts disk space and read latency
> since reads now have to encompass multiple SSTables.
>
> As for https://issues.apache.org/jira/browse/CASSANDRA-13418, I would not
> worry about data resurrection as long as all the writes carry TTL with them.
>
> We faced similar overlapping issues with TWCS (it wss due to
> dclocal_read_repair_chance) - we developed an SSTable tool that would give
> topN or bottomN keys in an SSTable based on writetime/deletion time - we
> used this to identify the specific keys responsible for overlap between
> SSTables.
>
> Thanks,
> Sumanth
>
>
> On Mon, Oct 9, 2017 at 6:36 PM, eugene miretsky <eugene.miretsky@gmail.com
> > wrote:
>
>> Thanks Alain!
>>
>> We are using TWCS compaction, and I read your blog multiple times - it
>> was very useful, thanks!
>>
>> We are seeing a lot of overlapping SSTables, leading to a lot of
>> problems: (a) large number of tombstones read in queries, (b) high CPU
>> usage, (c) fairly long Young Gen GC collection (300ms)
>>
>> We have read_repair_change = 0, and unchecked_tombstone_compaction =
>> true, gc_grace_seconds = 3h,  but we read and write with consistency =
>> 1.
>>
>> I'm suspecting the overlap is coming from either hinted handoff or a
>> repair job we run nightly.
>>
>> 1) Is running repair with TWCS recommended? It seems like it will always
>> create a neverending overlap (the repair SSTable will have data from all 24
>> hours), an effect that seems to get amplified with anti-compaction.
>> 2) TWCS seems to introduce a tradeoff between eventual consistency and
>> write/read availability. If all repairs are turned off, then the choice is
>> either (a) user strong consistency level, and pay the price of lower
>> availability and slowers reads or writes, or (b) use lower consistency
>> level, and risk inconsistent data (data is never repaired)
>>
>> I will try your last link but reappearing data sound a bit scary :)
>>
>> Any advice on how to debug this further would be greatly apprecaited.
>>
>> Cheers,
>> Eugene
>>
>> On Fri, Oct 6, 2017 at 11:02 AM, Alain RODRIGUEZ <ar...@gmail.com>
>> wrote:
>>
>>> Hi Eugene,
>>>
>>> If we never use updates (time series data), is it safe to set
>>>> gc_grace_seconds=0.
>>>
>>>
>>> As Kurt pointed, you never want 'gc_grace_seconds' to be lower than
>>> 'max_hint_window_in_ms' as the min off these 2 values is used for hints
>>> storage window size in Apache Cassandra.
>>>
>>> Yet time series data with fixed TTLs allows a very efficient use of
>>> Cassandra, specially when using Time Window Compaction Strategy (TWCS).
>>> Funny fact is that Jeff brought it to Apache Cassandra :-). I would
>>> definitely give it a try.
>>>
>>> Here is a post from my colleague Alex that I believe could be useful in
>>> your case: http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
>>>
>>> Using TWCS and setting and lowering 'gc_grace_seconds' to the value of
>>> 'max_hint_window_in_ms' should be really effective. Make sure to use a
>>> strong consistency level (generally RF = 3, CL.Read = CL.Write =
>>> LOCAL_QUORUM) to prevent inconsistencies I would say (and depending on your
>>> interest in consistency).
>>>
>>> This way you could expire entires SSTables, without compaction. If
>>> overlaps in SSTables become a problem, you could even consider to give a
>>> try to a more aggressive SSTable expiration
>>> https://issues.apache.org/jira/browse/CASSANDRA-13418.
>>>
>>> C*heers,
>>> -----------------------
>>> Alain Rodriguez - @arodream - alain@thelastpickle.com
>>> France / Spain
>>>
>>> The Last Pickle - Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>>
>>>
>>> 2017-10-05 23:44 GMT+01:00 kurt greaves <ku...@instaclustr.com>:
>>>
>>>> No it's never safe to set it to 0 as you'll disable hinted handoff for
>>>> the table. If you are never doing updates and manual deletes and you always
>>>> insert with a ttl you can get away with setting it to the hinted handoff
>>>> period.
>>>>
>>>> On 6 Oct. 2017 1:28 am, "eugene miretsky" <eu...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks Jeff,
>>>>>
>>>>> Make sense.
>>>>> If we never use updates (time series data), is it safe to set
>>>>> gc_grace_seconds=0.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Oct 4, 2017 at 5:59 PM, Jeff Jirsa <jj...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> The TTL'd cell is treated as a tombstone. gc_grace_seconds applies to
>>>>>> TTL'd cells, because even though the data is TTL'd, it may have been
>>>>>> written on top of another live cell that wasn't ttl'd:
>>>>>>
>>>>>> Imagine a test table, simple key->value (k, v).
>>>>>>
>>>>>> INSERT INTO table(k,v) values(1,1);
>>>>>> Kill 1 of the 3 nodes
>>>>>> UPDATE table USING TTL 60 SET v=1 WHERE k=1 ;
>>>>>> 60 seconds later, the live nodes will see that data as deleted, but
>>>>>> when that dead node comes back to life, it needs to learn of the deletion.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 4, 2017 at 2:05 PM, eugene miretsky <
>>>>>> eugene.miretsky@gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> The following link says that TTLs generate tombstones -
>>>>>>> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useExpire.html.
>>>>>>>
>>>>>>> What exactly is the process that converts the TTL into a tombstone?
>>>>>>>
>>>>>>>    1. Is an actual new tombstone cell created when the TTL expires?
>>>>>>>    2. Or, is the TTLed cell treated as a tombstone?
>>>>>>>
>>>>>>>
>>>>>>> Also, does gc_grace_period have an effect on TTLed cells?
>>>>>>> gc_grace_period is meant to protect from deleted data re-appearing if the
>>>>>>> tombstone is compacted away before all nodes have reached a consistent
>>>>>>> state. However, since the ttl is stored in the cell (in liveness_info),
>>>>>>> there is no way for the cell to re-appear (the ttl will still be there)
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Eugene
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>
>

Re: How do TTLs generate tombstones

Posted by Sumanth Pasupuleti <sp...@netflix.com.INVALID>.
Hi Eugene,

Common contributors to overlapping SSTables are
1. Hints
2. Repairs
3. New writes with old timestamps (should be rare but technically possible)

I would not run repairs with TWCS - as you indicated, it is going to result
in overlapping SSTables, which impacts disk space and read latency since
reads now have to hit multiple SSTables.

As for https://issues.apache.org/jira/browse/CASSANDRA-13418, I would not
worry about data resurrection as long as all the writes carry TTL with them.

We faced similar overlapping issues with TWCS (it was due to
dclocal_read_repair_chance) - we developed an SSTable tool that gives the
topN or bottomN keys in an SSTable based on writetime/deletion time, and we
used it to identify the specific keys responsible for the overlap between
SSTables.
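
(As a rough, built-in alternative for checking which data files a specific
suspect partition ended up in - keyspace, table and key below are
placeholders:)

# lists every SSTable on this node that contains the given partition key
nodetool getsstables ks events 1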

Thanks,
Sumanth


On Mon, Oct 9, 2017 at 6:36 PM, eugene miretsky <eu...@gmail.com>
wrote:

> Thanks Alain!
>
> We are using TWCS compaction, and I read your blog multiple times - it was
> very useful, thanks!
>
> We are seeing a lot of overlapping SSTables, leading to a lot of problems:
> (a) large number of tombstones read in queries, (b) high CPU usage, (c)
> fairly long Young Gen GC collection (300ms)
>
> We have read_repair_change = 0, and unchecked_tombstone_compaction =
> true, gc_grace_seconds = 3h,  but we read and write with consistency = 1.
>
> I'm suspecting the overlap is coming from either hinted handoff or a
> repair job we run nightly.
>
> 1) Is running repair with TWCS recommended? It seems like it will always
> create a neverending overlap (the repair SSTable will have data from all 24
> hours), an effect that seems to get amplified with anti-compaction.
> 2) TWCS seems to introduce a tradeoff between eventual consistency and
> write/read availability. If all repairs are turned off, then the choice is
> either (a) user strong consistency level, and pay the price of lower
> availability and slowers reads or writes, or (b) use lower consistency
> level, and risk inconsistent data (data is never repaired)
>
> I will try your last link but reappearing data sound a bit scary :)
>
> Any advice on how to debug this further would be greatly apprecaited.
>
> Cheers,
> Eugene
>
> On Fri, Oct 6, 2017 at 11:02 AM, Alain RODRIGUEZ <ar...@gmail.com>
> wrote:
>
>> Hi Eugene,
>>
>> If we never use updates (time series data), is it safe to set
>>> gc_grace_seconds=0.
>>
>>
>> As Kurt pointed, you never want 'gc_grace_seconds' to be lower than
>> 'max_hint_window_in_ms' as the min off these 2 values is used for hints
>> storage window size in Apache Cassandra.
>>
>> Yet time series data with fixed TTLs allows a very efficient use of
>> Cassandra, specially when using Time Window Compaction Strategy (TWCS).
>> Funny fact is that Jeff brought it to Apache Cassandra :-). I would
>> definitely give it a try.
>>
>> Here is a post from my colleague Alex that I believe could be useful in
>> your case: http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
>>
>> Using TWCS and setting and lowering 'gc_grace_seconds' to the value of
>> 'max_hint_window_in_ms' should be really effective. Make sure to use a
>> strong consistency level (generally RF = 3, CL.Read = CL.Write =
>> LOCAL_QUORUM) to prevent inconsistencies I would say (and depending on your
>> interest in consistency).
>>
>> This way you could expire entires SSTables, without compaction. If
>> overlaps in SSTables become a problem, you could even consider to give a
>> try to a more aggressive SSTable expiration
>> https://issues.apache.org/jira/browse/CASSANDRA-13418.
>>
>> C*heers,
>> -----------------------
>> Alain Rodriguez - @arodream - alain@thelastpickle.com
>> France / Spain
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>>
>>
>> 2017-10-05 23:44 GMT+01:00 kurt greaves <ku...@instaclustr.com>:
>>
>>> No it's never safe to set it to 0 as you'll disable hinted handoff for
>>> the table. If you are never doing updates and manual deletes and you always
>>> insert with a ttl you can get away with setting it to the hinted handoff
>>> period.
>>>
>>> On 6 Oct. 2017 1:28 am, "eugene miretsky" <eu...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Jeff,
>>>>
>>>> Make sense.
>>>> If we never use updates (time series data), is it safe to set
>>>> gc_grace_seconds=0.
>>>>
>>>>
>>>>
>>>> On Wed, Oct 4, 2017 at 5:59 PM, Jeff Jirsa <jj...@gmail.com> wrote:
>>>>
>>>>>
>>>>> The TTL'd cell is treated as a tombstone. gc_grace_seconds applies to
>>>>> TTL'd cells, because even though the data is TTL'd, it may have been
>>>>> written on top of another live cell that wasn't ttl'd:
>>>>>
>>>>> Imagine a test table, simple key->value (k, v).
>>>>>
>>>>> INSERT INTO table(k,v) values(1,1);
>>>>> Kill 1 of the 3 nodes
>>>>> UPDATE table USING TTL 60 SET v=1 WHERE k=1 ;
>>>>> 60 seconds later, the live nodes will see that data as deleted, but
>>>>> when that dead node comes back to life, it needs to learn of the deletion.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Oct 4, 2017 at 2:05 PM, eugene miretsky <
>>>>> eugene.miretsky@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> The following link says that TTLs generate tombstones -
>>>>>> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useExpire.html.
>>>>>>
>>>>>> What exactly is the process that converts the TTL into a tombstone?
>>>>>>
>>>>>>    1. Is an actual new tombstone cell created when the TTL expires?
>>>>>>    2. Or, is the TTLed cell treated as a tombstone?
>>>>>>
>>>>>>
>>>>>> Also, does gc_grace_period have an effect on TTLed cells?
>>>>>> gc_grace_period is meant to protect from deleted data re-appearing if the
>>>>>> tombstone is compacted away before all nodes have reached a consistent
>>>>>> state. However, since the ttl is stored in the cell (in liveness_info),
>>>>>> there is no way for the cell to re-appear (the ttl will still be there)
>>>>>>
>>>>>> Cheers,
>>>>>> Eugene
>>>>>>
>>>>>>
>>>>>
>>>>
>>
>

Re: How do TTLs generate tombstones

Posted by eugene miretsky <eu...@gmail.com>.
Thanks Alain!

We are using TWCS compaction, and I read your blog multiple times - it was
very useful, thanks!

We are seeing a lot of overlapping SSTables, leading to a lot of problems:
(a) a large number of tombstones read in queries, (b) high CPU usage, (c)
fairly long Young Gen GC pauses (300ms).

We have read_repair_chance = 0, unchecked_tombstone_compaction = true, and
gc_grace_seconds = 3h, but we read and write with consistency = 1.
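
(In CQL terms that is roughly the following - keyspace/table names are
placeholders, and the TWCS window options are left out:)

ALTER TABLE ks.events
  WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'unchecked_tombstone_compaction': 'true'}
   AND read_repair_chance = 0.0
   AND gc_grace_seconds = 10800;   -- 3h
-- dclocal_read_repair_chance is a separate table option (defaults to 0.1)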

I suspect the overlap is coming from either hinted handoff or the repair
job we run nightly.

1) Is running repair with TWCS recommended? It seems like it will always
create a never-ending overlap (the repair SSTable will have data from all 24
hours), an effect that seems to get amplified by anti-compaction.
2) TWCS seems to introduce a tradeoff between eventual consistency and
write/read availability. If all repairs are turned off, then the choice is
either (a) use a strong consistency level, and pay the price of lower
availability and slower reads or writes, or (b) use a lower consistency
level, and risk inconsistent data (the data is never repaired).

I will try your last link, but reappearing data sounds a bit scary :)

Any advice on how to debug this further would be greatly appreciated.

Cheers,
Eugene

On Fri, Oct 6, 2017 at 11:02 AM, Alain RODRIGUEZ <ar...@gmail.com> wrote:

> Hi Eugene,
>
> If we never use updates (time series data), is it safe to set
>> gc_grace_seconds=0.
>
>
> As Kurt pointed, you never want 'gc_grace_seconds' to be lower than
> 'max_hint_window_in_ms' as the min off these 2 values is used for hints
> storage window size in Apache Cassandra.
>
> Yet time series data with fixed TTLs allows a very efficient use of
> Cassandra, specially when using Time Window Compaction Strategy (TWCS).
> Funny fact is that Jeff brought it to Apache Cassandra :-). I would
> definitely give it a try.
>
> Here is a post from my colleague Alex that I believe could be useful in
> your case: http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
>
> Using TWCS and setting and lowering 'gc_grace_seconds' to the value of
> 'max_hint_window_in_ms' should be really effective. Make sure to use a
> strong consistency level (generally RF = 3, CL.Read = CL.Write =
> LOCAL_QUORUM) to prevent inconsistencies I would say (and depending on your
> interest in consistency).
>
> This way you could expire entires SSTables, without compaction. If
> overlaps in SSTables become a problem, you could even consider to give a
> try to a more aggressive SSTable expiration https://issues.apache.org/
> jira/browse/CASSANDRA-13418.
>
> C*heers,
> -----------------------
> Alain Rodriguez - @arodream - alain@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
>
> 2017-10-05 23:44 GMT+01:00 kurt greaves <ku...@instaclustr.com>:
>
>> No it's never safe to set it to 0 as you'll disable hinted handoff for
>> the table. If you are never doing updates and manual deletes and you always
>> insert with a ttl you can get away with setting it to the hinted handoff
>> period.
>>
>> On 6 Oct. 2017 1:28 am, "eugene miretsky" <eu...@gmail.com>
>> wrote:
>>
>>> Thanks Jeff,
>>>
>>> Make sense.
>>> If we never use updates (time series data), is it safe to set
>>> gc_grace_seconds=0.
>>>
>>>
>>>
>>> On Wed, Oct 4, 2017 at 5:59 PM, Jeff Jirsa <jj...@gmail.com> wrote:
>>>
>>>>
>>>> The TTL'd cell is treated as a tombstone. gc_grace_seconds applies to
>>>> TTL'd cells, because even though the data is TTL'd, it may have been
>>>> written on top of another live cell that wasn't ttl'd:
>>>>
>>>> Imagine a test table, simple key->value (k, v).
>>>>
>>>> INSERT INTO table(k,v) values(1,1);
>>>> Kill 1 of the 3 nodes
>>>> UPDATE table USING TTL 60 SET v=1 WHERE k=1 ;
>>>> 60 seconds later, the live nodes will see that data as deleted, but
>>>> when that dead node comes back to life, it needs to learn of the deletion.
>>>>
>>>>
>>>>
>>>> On Wed, Oct 4, 2017 at 2:05 PM, eugene miretsky <
>>>> eugene.miretsky@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> The following link says that TTLs generate tombstones -
>>>>> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useExpire.html.
>>>>>
>>>>> What exactly is the process that converts the TTL into a tombstone?
>>>>>
>>>>>    1. Is an actual new tombstone cell created when the TTL expires?
>>>>>    2. Or, is the TTLed cell treated as a tombstone?
>>>>>
>>>>>
>>>>> Also, does gc_grace_period have an effect on TTLed cells?
>>>>> gc_grace_period is meant to protect from deleted data re-appearing if the
>>>>> tombstone is compacted away before all nodes have reached a consistent
>>>>> state. However, since the ttl is stored in the cell (in liveness_info),
>>>>> there is no way for the cell to re-appear (the ttl will still be there)
>>>>>
>>>>> Cheers,
>>>>> Eugene
>>>>>
>>>>>
>>>>
>>>
>

Re: How do TTLs generate tombstones

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Hi Eugene,

If we never use updates (time series data), is it safe to set
> gc_grace_seconds=0.


As Kurt pointed out, you never want 'gc_grace_seconds' to be lower than
'max_hint_window_in_ms', as the min of these 2 values is used as the hint
storage window size in Apache Cassandra.

Yet time series data with fixed TTLs allows a very efficient use of
Cassandra, especially when using the Time Window Compaction Strategy (TWCS).
Fun fact: Jeff brought it to Apache Cassandra :-). I would definitely give
it a try.

Here is a post from my colleague Alex that I believe could be useful in
your case: http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html

Using TWCS and lowering 'gc_grace_seconds' to the value of
'max_hint_window_in_ms' should be really effective. Make sure to use a
strong consistency level (generally RF = 3, CL.Read = CL.Write =
LOCAL_QUORUM) to prevent inconsistencies, I would say (depending on your
interest in consistency).
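
(A minimal sketch of that setup, assuming the default max_hint_window_in_ms
of 3 hours, daily buckets and an illustrative 30-day TTL - names and values
are placeholders:)

ALTER TABLE ks.events
  WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'DAYS',
                     'compaction_window_size': '1'}
   AND default_time_to_live = 2592000   -- 30 days; every write then carries a TTL
   AND gc_grace_seconds = 10800;        -- 3 hours = max_hint_window_in_ms default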

This way you could expire entire SSTables without compaction. If overlaps
in SSTables become a problem, you could even consider giving a try to the
more aggressive SSTable expiration in
https://issues.apache.org/jira/browse/CASSANDRA-13418.

C*heers,
-----------------------
Alain Rodriguez - @arodream - alain@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



2017-10-05 23:44 GMT+01:00 kurt greaves <ku...@instaclustr.com>:

> No it's never safe to set it to 0 as you'll disable hinted handoff for the
> table. If you are never doing updates and manual deletes and you always
> insert with a ttl you can get away with setting it to the hinted handoff
> period.
>
> On 6 Oct. 2017 1:28 am, "eugene miretsky" <eu...@gmail.com>
> wrote:
>
>> Thanks Jeff,
>>
>> Make sense.
>> If we never use updates (time series data), is it safe to set
>> gc_grace_seconds=0.
>>
>>
>>
>> On Wed, Oct 4, 2017 at 5:59 PM, Jeff Jirsa <jj...@gmail.com> wrote:
>>
>>>
>>> The TTL'd cell is treated as a tombstone. gc_grace_seconds applies to
>>> TTL'd cells, because even though the data is TTL'd, it may have been
>>> written on top of another live cell that wasn't ttl'd:
>>>
>>> Imagine a test table, simple key->value (k, v).
>>>
>>> INSERT INTO table(k,v) values(1,1);
>>> Kill 1 of the 3 nodes
>>> UPDATE table USING TTL 60 SET v=1 WHERE k=1 ;
>>> 60 seconds later, the live nodes will see that data as deleted, but when
>>> that dead node comes back to life, it needs to learn of the deletion.
>>>
>>>
>>>
>>> On Wed, Oct 4, 2017 at 2:05 PM, eugene miretsky <
>>> eugene.miretsky@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> The following link says that TTLs generate tombstones -
>>>> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useExpire.html.
>>>>
>>>> What exactly is the process that converts the TTL into a tombstone?
>>>>
>>>>    1. Is an actual new tombstone cell created when the TTL expires?
>>>>    2. Or, is the TTLed cell treated as a tombstone?
>>>>
>>>>
>>>> Also, does gc_grace_period have an effect on TTLed cells?
>>>> gc_grace_period is meant to protect from deleted data re-appearing if the
>>>> tombstone is compacted away before all nodes have reached a consistent
>>>> state. However, since the ttl is stored in the cell (in liveness_info),
>>>> there is no way for the cell to re-appear (the ttl will still be there)
>>>>
>>>> Cheers,
>>>> Eugene
>>>>
>>>>
>>>
>>

Re: How do TTLs generate tombstones

Posted by kurt greaves <ku...@instaclustr.com>.
No, it's never safe to set it to 0, as you'll disable hinted handoff for the
table. If you are never doing updates or manual deletes, and you always
insert with a TTL, you can get away with setting it to the hinted handoff
period.
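
(Concretely, max_hint_window_in_ms defaults to 3 hours in cassandra.yaml, so
under that assumption it would look something like this - the table name is
a placeholder:)

ALTER TABLE ks.events WITH gc_grace_seconds = 10800;   -- 3h hint window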

On 6 Oct. 2017 1:28 am, "eugene miretsky" <eu...@gmail.com> wrote:

> Thanks Jeff,
>
> Make sense.
> If we never use updates (time series data), is it safe to set
> gc_grace_seconds=0.
>
>
>
> On Wed, Oct 4, 2017 at 5:59 PM, Jeff Jirsa <jj...@gmail.com> wrote:
>
>>
>> The TTL'd cell is treated as a tombstone. gc_grace_seconds applies to
>> TTL'd cells, because even though the data is TTL'd, it may have been
>> written on top of another live cell that wasn't ttl'd:
>>
>> Imagine a test table, simple key->value (k, v).
>>
>> INSERT INTO table(k,v) values(1,1);
>> Kill 1 of the 3 nodes
>> UPDATE table USING TTL 60 SET v=1 WHERE k=1 ;
>> 60 seconds later, the live nodes will see that data as deleted, but when
>> that dead node comes back to life, it needs to learn of the deletion.
>>
>>
>>
>> On Wed, Oct 4, 2017 at 2:05 PM, eugene miretsky <
>> eugene.miretsky@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> The following link says that TTLs generate tombstones -
>>> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useExpire.html.
>>>
>>> What exactly is the process that converts the TTL into a tombstone?
>>>
>>>    1. Is an actual new tombstone cell created when the TTL expires?
>>>    2. Or, is the TTLed cell treated as a tombstone?
>>>
>>>
>>> Also, does gc_grace_period have an effect on TTLed cells?
>>> gc_grace_period is meant to protect from deleted data re-appearing if the
>>> tombstone is compacted away before all nodes have reached a consistent
>>> state. However, since the ttl is stored in the cell (in liveness_info),
>>> there is no way for the cell to re-appear (the ttl will still be there)
>>>
>>> Cheers,
>>> Eugene
>>>
>>>
>>
>

Re: How do TTLs generate tombstones

Posted by eugene miretsky <eu...@gmail.com>.
Thanks Jeff,

Makes sense.
If we never use updates (time series data), is it safe to set
gc_grace_seconds=0?



On Wed, Oct 4, 2017 at 5:59 PM, Jeff Jirsa <jj...@gmail.com> wrote:

>
> The TTL'd cell is treated as a tombstone. gc_grace_seconds applies to
> TTL'd cells, because even though the data is TTL'd, it may have been
> written on top of another live cell that wasn't ttl'd:
>
> Imagine a test table, simple key->value (k, v).
>
> INSERT INTO table(k,v) values(1,1);
> Kill 1 of the 3 nodes
> UPDATE table USING TTL 60 SET v=1 WHERE k=1 ;
> 60 seconds later, the live nodes will see that data as deleted, but when
> that dead node comes back to life, it needs to learn of the deletion.
>
>
>
> On Wed, Oct 4, 2017 at 2:05 PM, eugene miretsky <eugene.miretsky@gmail.com
> > wrote:
>
>> Hello,
>>
>> The following link says that TTLs generate tombstones -
>> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useExpire.html.
>>
>> What exactly is the process that converts the TTL into a tombstone?
>>
>>    1. Is an actual new tombstone cell created when the TTL expires?
>>    2. Or, is the TTLed cell treated as a tombstone?
>>
>>
>> Also, does gc_grace_period have an effect on TTLed cells? gc_grace_period
>> is meant to protect from deleted data re-appearing if the tombstone is
>> compacted away before all nodes have reached a consistent state. However,
>> since the ttl is stored in the cell (in liveness_info), there is no way for
>> the cell to re-appear (the ttl will still be there)
>>
>> Cheers,
>> Eugene
>>
>>
>

Re: How do TTLs generate tombstones

Posted by Jeff Jirsa <jj...@gmail.com>.
The TTL'd cell is treated as a tombstone. gc_grace_seconds applies to TTL'd
cells, because even though the data is TTL'd, it may have been written on
top of another live cell that wasn't ttl'd:

Imagine a test table, simple key->value (k, v).

INSERT INTO table(k,v) values(1,1);
Kill 1 of the 3 nodes
UPDATE table USING TTL 60 SET v=1 WHERE k=1 ;
60 seconds later, the live nodes will see that data as deleted, but when
that dead node comes back to life, it needs to learn of the deletion.
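
(The same scenario as a runnable cqlsh sketch - the keyspace, table and
replication settings are illustrative:)

CREATE KEYSPACE IF NOT EXISTS ks
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
CREATE TABLE IF NOT EXISTS ks.kv (k int PRIMARY KEY, v int);

INSERT INTO ks.kv (k, v) VALUES (1, 1);             -- live cell, no TTL
-- one of the three replicas goes down here
UPDATE ks.kv USING TTL 60 SET v = 1 WHERE k = 1;    -- overwrites it with an expiring cell

-- 60+ seconds later the row is gone on the live replicas, but the replica
-- that was down still holds the original non-TTL'd cell until the expired
-- cell (acting as a tombstone) reaches it via hints or repair.
SELECT k, v, ttl(v) FROM ks.kv WHERE k = 1;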



On Wed, Oct 4, 2017 at 2:05 PM, eugene miretsky <eu...@gmail.com>
wrote:

> Hello,
>
> The following link says that TTLs generate tombstones -
> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useExpire.html.
>
> What exactly is the process that converts the TTL into a tombstone?
>
>    1. Is an actual new tombstone cell created when the TTL expires?
>    2. Or, is the TTLed cell treated as a tombstone?
>
>
> Also, does gc_grace_period have an effect on TTLed cells? gc_grace_period
> is meant to protect from deleted data re-appearing if the tombstone is
> compacted away before all nodes have reached a consistent state. However,
> since the ttl is stored in the cell (in liveness_info), there is no way for
> the cell to re-appear (the ttl will still be there)
>
> Cheers,
> Eugene
>
>