Posted to user@cassandra.apache.org by Stefano Ortolani <os...@gmail.com> on 2017/05/11 14:15:54 UTC

LCS, range tombstones, and eviction

Hi all,

I am trying to wrap my head around how C* evicts tombstones when using LCS.
Based on what I understand from the docs, if the ratio of garbage-collectable
tombstones exceeds the "tombstone_threshold", C* should start
compacting and evicting.

I am quite puzzled however by what might happen when dealing with range
tombstones. In that case a single tombstone might actually stand for an
arbitrary number of normal tombstones. In other words, do range tombstones
contribute to the "tombstone_threshold"? If so, how?
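
To make it concrete, this is the kind of delete I have in mind (a
hypothetical table via the Python driver; a slice delete like this writes
one range tombstone no matter how many rows it covers):

    from cassandra.cluster import Cluster

    # Hypothetical schema: pk is the partition key, ck a clustering column.
    session = Cluster(['127.0.0.1']).connect('my_keyspace')

    # A slice delete like this writes a single range tombstone, however
    # many rows fall inside the clustering range.
    session.execute(
        "DELETE FROM my_table WHERE pk = %s AND ck >= %s AND ck < %s",
        ('some-partition', 0, 1000))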

I am also a bit confused by the "tombstone_compaction_interval". If I am
dealing with a big partition in LCS which is receiving new records every day,
and a weekly incremental repair job is continuously anticompacting the data
and thus creating SSTables, what is the likelihood of the default interval
(10 days) actually being hit?
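
For reference, I would be tuning those knobs roughly as follows (again
hypothetical; 0.2 is the documented default threshold, and 864000 seconds
is the 10-day interval mentioned above):

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('my_keyspace')

    # Hypothetical tuning; values are only illustrative.
    session.execute("""
        ALTER TABLE my_table WITH compaction = {
            'class': 'LeveledCompactionStrategy',
            'tombstone_threshold': '0.2',
            'tombstone_compaction_interval': '864000'
        }
    """)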

Hopefully somebody will be able to shed some light here!

Thanks in advance!
Stefano

Re: LCS, range tombstones, and eviction

Posted by Stefano Ortolani <os...@gmail.com>.
Thanks a lot Blake, that definitely helps!

I actually found a ticket re range tombstones and how they are accounted
for: https://issues.apache.org/jira/browse/CASSANDRA-8527

I am wondering now what happens when a node receives a read request. Are
the range tombstones read before scanning the SSTables? More interestingly,
given that a single partition might be split across different levels, and
that some range tombstones might be in L0 while all the rest of the data is
in L1, are all the tombstones prefetched from _all_ the involved SSTables
before doing any table scan?

Regards,
Stefano

On Thu, May 11, 2017 at 7:58 PM, Blake Eggleston <be...@apple.com>
wrote:

> Hi Stefano,
>
> Based on what I understand from the docs, if the ratio of garbage-collectable
> tombstones exceeds the "tombstone_threshold", C* should start
> compacting and evicting.
>
>
> If there are no other normal compaction tasks to be run, LCS will attempt
> to compact the sstables it estimates it will be able to drop the most
> tombstones from. It does this by estimating how many of an sstable's
> tombstones have passed the gc grace period. Whether or not a tombstone
> will actually be evicted is more complicated. Even if a tombstone has
> passed gc grace, it can't be dropped if the data it's deleting still
> exists in another sstable; otherwise the data would appear to return. So,
> a tombstone won't be dropped if there is data for the same partition in
> other sstables that is older than the tombstone being evaluated for
> eviction.
>
> I am quite puzzled however by what might happen when dealing with range
> tombstones. In that case a single tombstone might actually stand for an
> arbitrary number of normal tombstones. In other words, do range tombstones
> contribute to the "tombstone_threshold"? If so, how?
>
>
> From what I can tell, each end of a range tombstone is counted as a
> single tombstone. So a range tombstone effectively contributes '2' to the
> count of tombstones for an sstable. I'm not 100% sure, but I haven't seen
> any sstable writing logic that tracks open tombstones and counts covered
> cells as tombstones. So it's likely that the effect of range tombstones
> covering many rows is underrepresented in the droppable tombstone
> estimate.
>
> I am also a bit confused by the "tombstone_compaction_interval". If I am
> dealing with a big partition in LCS which is receiving new records every
> day, and a weekly incremental repair job is continuously anticompacting
> the data and thus creating SSTables, what is the likelihood of the default
> interval (10 days) actually being hit?
>
>
> It will be hit, but probably only in the repaired data. Once the data is
> marked repaired, it shouldn't be anticompacted again, and should get old
> enough to pass the compaction interval. That shouldn't be an issue though,
> because you should be running repair often enough that data is repaired
> before it can ever get past the gc grace period. Otherwise you'll have
> other problems. Also, keep in mind that tombstone eviction is a part of
> all compactions; it's just that occasionally a compaction is run
> specifically for that purpose. Finally, you probably shouldn't run
> incremental repair on data that is deleted. There is a design flaw in the
> incremental repair implementation used in Cassandra versions before 4.0
> that can cause consistency issues. It can also cause a *lot* of
> overstreaming, so you might want to take a look at how much streaming your
> cluster is doing with full repairs versus incremental repairs. It might
> actually be more efficient to run full repairs.
>
> Hope that helps,
>
> Blake
>
> On May 11, 2017 at 7:16:26 AM, Stefano Ortolani (ostefano@gmail.com)
> wrote:
>
> Hi all,
>
> I am trying to wrap my head around how C* evicts tombstones when using LCS.
> Based on what I understand from the docs, if the ratio of garbage-collectable
> tombstones exceeds the "tombstone_threshold", C* should start
> compacting and evicting.
>
> I am quite puzzled however by what might happen when dealing with range
> tombstones. In that case a single tombstone might actually stand for an
> arbitrary number of normal tombstones. In other words, do range tombstones
> contribute to the "tombstone_threshold"? If so, how?
>
> I am also a bit confused by the "tombstone_compaction_interval". If I am
> dealing with a big partition in LCS which is receiving new records every
> day, and a weekly incremental repair job is continuously anticompacting
> the data and thus creating SSTables, what is the likelihood of the default
> interval (10 days) actually being hit?
>
> Hopefully somebody will be able to shed some light here!
>
> Thanks in advance!
> Stefano
>
>

Re: LCS, range tombstones, and eviction

Posted by Stefano Ortolani <os...@gmail.com>.
That makes sense.
I see however some unexpected performance data on my test, but I will start
another thread for that.

Thanks again!

On Fri, May 12, 2017 at 6:56 PM, Blake Eggleston <be...@apple.com>
wrote:

> The start and end points of a range tombstone are basically stored as
> special-purpose rows alongside the normal data in an sstable. As part of a
> read, they're reconciled with the data from the other sstables into a
> single partition, just like the other rows. The only difference is that
> they don't contain any 'real' data, and, of course, they prevent 'deleted'
> data from being returned in the read. It's a bit more complicated than
> that, but that's the general idea.
>
>
> On May 12, 2017 at 6:23:01 AM, Stefano Ortolani (ostefano@gmail.com)
> wrote:
>
> Thanks a lot Blake, that definitely helps!
>
> I actually found a ticket re range tombstones and how they are accounted
> for: https://issues.apache.org/jira/browse/CASSANDRA-8527
>
> I am wondering now what happens when a node receives a read request. Are
> the range tombstones read before scanning the SSTables? More interestingly,
> given that a single partition might be split across different levels, and
> that some range tombstones might be in L0 while all the rest of the data is
> in L1, are all the tombstones prefetched from _all_ the involved SSTables
> before doing any table scan?
>
> Regards,
> Stefano
>
> On Thu, May 11, 2017 at 7:58 PM, Blake Eggleston <be...@apple.com>
> wrote:
>
>> Hi Stefano,
>>
>> Based on what I understand from the docs, if the ratio of
>> garbage-collectable tombstones exceeds the "tombstone_threshold", C*
>> should start compacting and evicting.
>>
>>
>> If there are no other normal compaction tasks to be run, LCS will attempt
>> to compact the sstables it estimates it will be able to drop the most
>> tombstones from. It does this by estimating how many of an sstable's
>> tombstones have passed the gc grace period. Whether or not a tombstone
>> will actually be evicted is more complicated. Even if a tombstone has
>> passed gc grace, it can't be dropped if the data it's deleting still
>> exists in another sstable; otherwise the data would appear to return. So,
>> a tombstone won't be dropped if there is data for the same partition in
>> other sstables that is older than the tombstone being evaluated for
>> eviction.
>>
>> I am quite puzzled however by what might happen when dealing with range
>> tombstones. In that case a single tombstone might actually stand for an
>> arbitrary number of normal tombstones. In other words, do range
>> tombstones
>> contribute to the "tombstone_threshold"? If so, how?
>>
>>
>> From what I can tell, each end of a range tombstone is counted as a
>> single tombstone. So a range tombstone effectively contributes '2' to the
>> count of tombstones for an sstable. I'm not 100% sure, but I haven't seen
>> any sstable writing logic that tracks open tombstones and counts covered
>> cells as tombstones. So it's likely that the effect of range tombstones
>> covering many rows is underrepresented in the droppable tombstone
>> estimate.
>>
>> I am also a bit confused by the "tombstone_compaction_interval". If I am
>> dealing with a big partition in LCS which is receiving new records every
>> day, and a weekly incremental repair job is continuously anticompacting
>> the data and thus creating SSTables, what is the likelihood of the
>> default interval (10 days) actually being hit?
>>
>>
>> It will be hit, but probably only in the repaired data. Once the data is
>> marked repaired, it shouldn't be anticompacted again, and should get old
>> enough to pass the compaction interval. That shouldn't be an issue though,
>> because you should be running repair often enough that data is repaired
>> before it can ever get past the gc grace period. Otherwise you'll have
>> other problems. Also, keep in mind that tombstone eviction is a part of
>> all compactions; it's just that occasionally a compaction is run
>> specifically for that purpose. Finally, you probably shouldn't run
>> incremental repair on data that is deleted. There is a design flaw in the
>> incremental repair implementation used in Cassandra versions before 4.0
>> that can cause consistency issues. It can also cause a *lot* of
>> overstreaming, so you might want to take a look at how much streaming
>> your cluster is doing with full repairs versus incremental repairs. It
>> might actually be more efficient to run full repairs.
>>
>> Hope that helps,
>>
>> Blake
>>
>> On May 11, 2017 at 7:16:26 AM, Stefano Ortolani (ostefano@gmail.com)
>> wrote:
>>
>> Hi all,
>>
>> I am trying to wrap my head around how C* evicts tombstones when using
>> LCS.
>> Based on what I understand from the docs, if the ratio of
>> garbage-collectable tombstones exceeds the "tombstone_threshold", C*
>> should start compacting and evicting.
>>
>> I am quite puzzled however by what might happen when dealing with range
>> tombstones. In that case a single tombstone might actually stand for an
>> arbitrary number of normal tombstones. In other words, do range
>> tombstones
>> contribute to the "tombstone_threshold"? If so, how?
>>
>> I am also a bit confused by the "tombstone_compaction_interval". If I am
>> dealing with a big partition in LCS which is receiving new records every
>> day, and a weekly incremental repair job is continuously anticompacting
>> the data and thus creating SSTables, what is the likelihood of the
>> default interval (10 days) actually being hit?
>>
>> Hopefully somebody will be able to shed some light here!
>>
>> Thanks in advance!
>> Stefano
>>
>>
>

Re: LCS, range tombstones, and eviction

Posted by Blake Eggleston <be...@apple.com>.
The start and end points of a range tombstone are basically stored as special-purpose rows alongside the normal data in an sstable. As part of a read, they're reconciled with the data from the other sstables into a single partition, just like the other rows. The only difference is that they don't contain any 'real' data, and, of course, they prevent 'deleted' data from being returned in the read. It's a bit more complicated than that, but that's the general idea.
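
If it helps to picture it, here is a toy version of that reconciliation in
Python. The row and tombstone objects are invented for illustration; the
real merge iterators in Cassandra are far more elaborate.

    import heapq

    # Toy model only: merge the rows from all sstables in clustering order
    # and suppress any row shadowed by a newer range tombstone.
    def reconcile(rows_per_sstable, range_tombstones):
        merged = heapq.merge(*rows_per_sstable, key=lambda r: r.clustering)
        for row in merged:
            # A row is shadowed if some range tombstone covers its
            # clustering position and was written after the row.
            shadowed = any(rt.start <= row.clustering <= rt.end
                           and rt.timestamp > row.timestamp
                           for rt in range_tombstones)
            if not shadowed:
                yield row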


On May 12, 2017 at 6:23:01 AM, Stefano Ortolani (ostefano@gmail.com) wrote:

Thanks a lot Blake, that definitely helps!

I actually found a ticket re range tombstones and how they are accounted for: https://issues.apache.org/jira/browse/CASSANDRA-8527

I am wondering now what happens when a node receives a read request. Are the range tombstones read before scanning the SSTables? More interestingly, given that a single partition might be split across different levels, and that some range tombstones might be in L0 while all the rest of the data is in L1, are all the tombstones prefetched from _all_ the involved SSTables before doing any table scan?

Regards,
Stefano

On Thu, May 11, 2017 at 7:58 PM, Blake Eggleston <be...@apple.com> wrote:
Hi Stefano,

Based on what I understand from the docs, if the ratio of garbage-collectable
tombstones exceeds the "tombstone_threshold", C* should start
compacting and evicting.

If there are no other normal compaction tasks to be run, LCS will attempt to compact the sstables it estimates it will be able to drop the most tombstones from. It does this by estimating how many of an sstable's tombstones have passed the gc grace period. Whether or not a tombstone will actually be evicted is more complicated. Even if a tombstone has passed gc grace, it can't be dropped if the data it's deleting still exists in another sstable; otherwise the data would appear to return. So, a tombstone won't be dropped if there is data for the same partition in other sstables that is older than the tombstone being evaluated for eviction.

I am quite puzzled however by what might happen when dealing with range 
tombstones. In that case a single tombstone might actually stand for an 
arbitrary number of normal tombstones. In other words, do range tombstones 
contribute to the "tombstone_threshold"? If so, how?

From what I can tell, each end of a range tombstone is counted as a single tombstone. So a range tombstone effectively contributes '2' to the count of tombstones for an sstable. I'm not 100% sure, but I haven't seen any sstable writing logic that tracks open tombstones and counts covered cells as tombstones. So it's likely that the effect of range tombstones covering many rows is underrepresented in the droppable tombstone estimate.

I am also a bit confused by the "tombstone_compaction_interval". If I am
dealing with a big partition in LCS which is receiving new records every day,
and a weekly incremental repair job is continuously anticompacting the data
and thus creating SSTables, what is the likelihood of the default interval
(10 days) actually being hit?

It will be hit, but probably only in the repaired data. Once the data is marked repaired, it shouldn't be anticompacted again, and should get old enough to pass the compaction interval. That shouldn't be an issue though, because you should be running repair often enough that data is repaired before it can ever get past the gc grace period. Otherwise you'll have other problems. Also, keep in mind that tombstone eviction is a part of all compactions; it's just that occasionally a compaction is run specifically for that purpose. Finally, you probably shouldn't run incremental repair on data that is deleted. There is a design flaw in the incremental repair implementation used in Cassandra versions before 4.0 that can cause consistency issues. It can also cause a *lot* of overstreaming, so you might want to take a look at how much streaming your cluster is doing with full repairs versus incremental repairs. It might actually be more efficient to run full repairs.

Hope that helps,

Blake

On May 11, 2017 at 7:16:26 AM, Stefano Ortolani (ostefano@gmail.com) wrote:

Hi all,

I am trying to wrap my head around how C* evicts tombstones when using LCS.
Based on what I understand from the docs, if the ratio of garbage-collectable
tombstones exceeds the "tombstone_threshold", C* should start
compacting and evicting.

I am quite puzzled however by what might happen when dealing with range 
tombstones. In that case a single tombstone might actually stand for an 
arbitrary number of normal tombstones. In other words, do range tombstones 
contribute to the "tombstone_threshold"? If so, how?

I am also a bit confused by the "tombstone_compaction_interval". If I am
dealing with a big partition in LCS which is receiving new records every day,
and a weekly incremental repair job is continuously anticompacting the data
and thus creating SSTables, what is the likelihood of the default interval
(10 days) actually being hit?

Hopefully somebody will be able to shed some light here!

Thanks in advance! 
Stefano 



Re: LCS, range tombstones, and eviction

Posted by Blake Eggleston <be...@apple.com>.
Hi Stefano,

Based on what I understand from the docs, if the ratio of garbage-collectable
tombstones exceeds the "tombstone_threshold", C* should start
compacting and evicting.

If there are no other normal compaction tasks to be run, LCS will attempt to compact the sstables it estimates it will be able to drop the most tombstones from. It does this by estimating how many of an sstable's tombstones have passed the gc grace period. Whether or not a tombstone will actually be evicted is more complicated. Even if a tombstone has passed gc grace, it can't be dropped if the data it's deleting still exists in another sstable; otherwise the data would appear to return. So, a tombstone won't be dropped if there is data for the same partition in other sstables that is older than the tombstone being evaluated for eviction.
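
In pseudocode, the rule is roughly this (a made-up sketch with invented
names; the actual check in Cassandra's compaction code is more involved):

    # Made-up sketch of the eviction rule described above, not the real
    # implementation.
    def can_drop_tombstone(tombstone, other_sstables, gc_grace_seconds, now):
        # The tombstone must be past the gc grace period...
        if now - tombstone.local_deletion_time < gc_grace_seconds:
            return False
        # ...and no other sstable may hold older data for the same
        # partition, or dropping it would make that data reappear.
        for sstable in other_sstables:
            if (sstable.may_contain(tombstone.partition_key)
                    and sstable.min_timestamp < tombstone.timestamp):
                return False
        return True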

I am quite puzzled however by what might happen when dealing with range 
tombstones. In that case a single tombstone might actually stand for an 
arbitrary number of normal tombstones. In other words, do range tombstones 
contribute to the "tombstone_threshold"? If so, how?

From what I can tell, each end of a range tombstone is counted as a single tombstone. So a range tombstone effectively contributes '2' to the count of tombstones for an sstable. I'm not 100% sure, but I haven't seen any sstable writing logic that tracks open tombstones and counts covered cells as tombstones. So it's likely that the effect of range tombstones covering many rows is underrepresented in the droppable tombstone estimate.
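
So the estimate would behave roughly like this (again a made-up sketch):

    # Each bound of a range tombstone is metered as one tombstone, and the
    # rows the range covers add nothing to the count.
    def estimated_tombstone_count(cells, range_tombstones):
        count = sum(1 for c in cells if c.is_tombstone)
        count += 2 * len(range_tombstones)
        return count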

I am also a bit confused by the "tombstone_compaction_interval". If I am
dealing with a big partition in LCS which is receiving new records every day,
and a weekly incremental repair job is continuously anticompacting the data
and thus creating SSTables, what is the likelihood of the default interval
(10 days) actually being hit?

It will be hit, but probably only in the repaired data. Once the data is marked repaired, it shouldn't be anticompacted again, and should get old enough to pass the compaction interval. That shouldn't be an issue though, because you should be running repair often enough that data is repaired before it can ever get past the gc grace period. Otherwise you'll have other problems. Also, keep in mind that tombstone eviction is a part of all compactions; it's just that occasionally a compaction is run specifically for that purpose. Finally, you probably shouldn't run incremental repair on data that is deleted. There is a design flaw in the incremental repair implementation used in Cassandra versions before 4.0 that can cause consistency issues. It can also cause a *lot* of overstreaming, so you might want to take a look at how much streaming your cluster is doing with full repairs versus incremental repairs. It might actually be more efficient to run full repairs.
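
For what it's worth, kicking off a full repair is just something like the
following (a hypothetical wrapper; assumes nodetool is on the path and the
keyspace name is made up):

    import subprocess

    # Shell out to nodetool; --full requests a full (non-incremental)
    # repair of the given keyspace.
    subprocess.run(["nodetool", "repair", "--full", "my_keyspace"],
                   check=True)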

Hope that helps,

Blake

On May 11, 2017 at 7:16:26 AM, Stefano Ortolani (ostefano@gmail.com) wrote:

Hi all,

I am trying to wrap my head around how C* evicts tombstones when using LCS.
Based on what I understand from the docs, if the ratio of garbage-collectable
tombstones exceeds the "tombstone_threshold", C* should start
compacting and evicting.

I am quite puzzled however by what might happen when dealing with range 
tombstones. In that case a single tombstone might actually stand for an 
arbitrary number of normal tombstones. In other words, do range tombstones 
contribute to the "tombstone_threshold"? If so, how?

I am also a bit confused by the "tombstone_compaction_interval". If I am
dealing with a big partition in LCS which is receiving new records every day,
and a weekly incremental repair job is continuously anticompacting the data
and thus creating SSTables, what is the likelihood of the default interval
(10 days) actually being hit?

Hopefully somebody will be able to shed some light here!

Thanks in advance! 
Stefano