Posted to user@cassandra.apache.org by Mike Smith <mi...@mailchannels.com> on 2012/12/13 15:01:44 UTC

Does a scrub remove deleted/expired columns?

I'm using 1.0.12 and I find that large sstables tend to get compacted
infrequently. I've got data that gets deleted or expired frequently. Is it
possible to use scrub to accelerate the clean up of expired/deleted data?

-- 
Mike Smith
Director Development, MailChannels

Re: Does a scrub remove deleted/expired columns?

Posted by "B. Todd Burruss" <bt...@gmail.com>.
I will add that we have had a good experience with leveled compaction
cleaning out tombstoned data faster than size tiered, thereby keeping
our total disk usage much more reasonable. It comes at the cost of
I/O (maybe 2x the I/O?), but that is not bothering us.

What is bothering us is a bug that exists in 1.1.6, and we can't
upgrade just yet:

https://issues.apache.org/jira/browse/CASSANDRA-3306



On Sun, Dec 16, 2012 at 12:17 PM, aaron morton <aa...@thelastpickle.com> wrote:
> periodically trimming the row by deleting the oldest columns, the deleted
> columns won't get cleaned up until all fragments of the row exist in a
> single sstable and that sstable undergoes a compaction?
>
> Nope.
> They are purged when all of the fragments of the row exist in the same
> SSTables (plural) being compacted.
>
> Say you create a row and write to it for a while; it may be spread into 2 or
> 3 new sstables. When there are 4 they are compacted into one, which will be
> bigger than the original 4. When there are 4 at the next size bucket they
> are compacted and so on.
>
> If your row exists in only one size bucket, it will be GC purged when that
> bucket is compacted.
>
> If you have a row you have been writing to for a long time it may be spread
> out in many buckets. That's not normally a big problem, but if you also do
> lots of deletes the tombstones will not get purged.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 14/12/2012, at 4:45 PM, Mike Smith <mi...@mailchannels.com> wrote:
>
> Thanks for the great explanation.
>
> I'd just like some clarification on the last point. Is it the case that if I
> constantly add new columns to a row, while periodically trimming the row by
> deleting the oldest columns, the deleted columns won't get cleaned up
> until all fragments of the row exist in a single sstable and that sstable
> undergoes a compaction?
>
> If my understanding is correct, do you know if 1.2 will enable cleanup of
> columns in rows that have scattered fragments? Or, should I take a different
> approach?
>
>
>
> On Thu, Dec 13, 2012 at 5:52 PM, aaron morton <aa...@thelastpickle.com>
> wrote:
>>
>>  Is it possible to use scrub to accelerate the clean up of expired/deleted
>> data?
>>
>> No.
>> Scrub, and upgradesstables, are used to re-write each file on disk. Scrub
>> may remove some rows from a file because of corruption, however
>> upgradesstables will not.
>>
>> If you have long-lived rows and a mixed workload of writes and deletes
>> there are a couple of options.
>>
>> You can try levelled compaction
>> http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
>>
>> You can tune the default size tiered compaction by increasing the
>> min_compaction_threshold. This will increase the number of files that must
>> exist in each size tier before it will be compacted. As a result the speed
>> at which rows move into the higher tiers will slow down.
>>
>> Note that having lots of files may have a negative impact on read
>> performance. You can measure this by looking at the SSTables per read metric
>> in the cfhistograms.
>>
>> Lastly you can run a user defined or major compaction. User defined
>> compaction is available via JMX and allows you to compact any file you want.
>> Manual / major compaction is available via nodetool. We usually discourage
>> its use as it will create one big file that will not get compacted for a
>> while.
>>
>>
>> For background, the tombstones / expired columns for a row are only purged
>> from the database when all fragments of the row are in the files being
>> compacted. So if you have an old row that is spread out over many files it
>> may not get purged.
>>
>> Hope that helps.
>>
>>
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> New Zealand
>>
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 14/12/2012, at 3:01 AM, Mike Smith <mi...@mailchannels.com> wrote:
>>
>> I'm using 1.0.12 and I find that large sstables tend to get compacted
>> infrequently. I've got data that gets deleted or expired frequently. Is it
>> possible to use scrub to accelerate the clean up of expired/deleted data?
>>
>> --
>> Mike Smith
>> Director Development, MailChannels
>>
>>
>
>
>
> --
> Mike Smith
> Director Development, MailChannels
>
>

Re: Does a scrub remove deleted/expired columns?

Posted by aaron morton <aa...@thelastpickle.com>.
> periodically trimming the row by deleting the oldest columns, the deleted columns won't get cleaned up until all fragments of the row exist in a single sstable and that sstable undergoes a compaction?
Nope. 
They are purged when all of the fragments of the row exist in the same SSTables (plural) being compacted. 

Say you create a row and write to it for a while; it may be spread into 2 or 3 new sstables. When there are 4 they are compacted into one, which will be bigger than the original 4. When there are 4 at the next size bucket they are compacted, and so on. 

If your row exists in only one size bucket, it will be GC purged when that bucket is compacted. 

If you have a row you have been writing to for a long time it may be spread out in many buckets. That's not normally a big problem, but if you also do lots of deletes the tombstones will not get purged. 

Hope that helps. 

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 14/12/2012, at 4:45 PM, Mike Smith <mi...@mailchannels.com> wrote:

> Thanks for the great explanation.
> 
> I'd just like some clarification on the last point. Is it the case that if I constantly add new columns to a row, while periodically trimming the row by deleting the oldest columns, the deleted columns won't get cleaned up until all fragments of the row exist in a single sstable and that sstable undergoes a compaction?
> 
> If my understanding is correct, do you know if 1.2 will enable cleanup of columns in rows that have scattered fragments? Or, should I take a different approach?
> 
> 
> 
> On Thu, Dec 13, 2012 at 5:52 PM, aaron morton <aa...@thelastpickle.com> wrote:
>>  Is it possible to use scrub to accelerate the clean up of expired/deleted data?
> No.
> Scrub, and upgradesstables, are used to re-write each file on disk. Scrub may remove some rows from a file because of corruption, however upgradesstables will not. 
> 
> If you have long-lived rows and a mixed workload of writes and deletes there are a couple of options. 
> 
> You can try levelled compaction http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
> 
> You can tune the default size tiered compaction by increasing the min_compaction_threshold. This will increase the number of files that must exist in each size tier before it will be compacted. As a result the speed at which rows move into the higher tiers will slow down. 
> 
> Note that having lots of files may have a negative impact on read performance. You can measure this by looking at the SSTables per read metric in the cfhistograms. 
> 
> Lastly you can run a user defined or major compaction. User defined compaction is available via JMX and allows you to compact any file you want. Manual / major compaction is available via nodetool. We usually discourage its use as it will create one big file that will not get compacted for a while. 
> 
> 
> For background, the tombstones / expired columns for a row are only purged from the database when all fragments of the row are in the files being compacted. So if you have an old row that is spread out over many files it may not get purged. 
> 
> Hope that helps. 
> 
> 
> 
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
> 
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 14/12/2012, at 3:01 AM, Mike Smith <mi...@mailchannels.com> wrote:
> 
>> I'm using 1.0.12 and I find that large sstables tend to get compacted infrequently. I've got data that gets deleted or expired frequently. Is it possible to use scrub to accelerate the clean up of expired/deleted data?
>> 
>> -- 
>> Mike Smith
>> Director Development, MailChannels
>> 
> 
> 
> 
> 
> -- 
> Mike Smith
> Director Development, MailChannels
> 


Re: Does a scrub remove deleted/expired columns?

Posted by Mike Smith <mi...@mailchannels.com>.
Thanks for the great explanation.

I'd just like some clarification on the last point. Is it the case that if
I constantly add new columns to a row, while periodically trimming the row
by deleting the oldest columns, the deleted columns won't get cleaned up
until all fragments of the row exist in a single sstable and that sstable
undergoes a compaction?

If my understanding is correct, do you know if 1.2 will enable cleanup of
columns in rows that have scattered fragments? Or, should I take a
different approach?



On Thu, Dec 13, 2012 at 5:52 PM, aaron morton <aa...@thelastpickle.com> wrote:

>  Is it possible to use scrub to accelerate the clean up of expired/deleted
> data?
>
> No.
> Scrub, and upgradesstables, are used to re-write each file on disk. Scrub
> may remove some rows from a file because of corruption, however
> upgradesstables will not.
>
> If you have long-lived rows and a mixed workload of writes and deletes
> there are a couple of options.
>
> You can try levelled compaction
> http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
>
> You can tune the default size tiered compaction by increasing the
> min_compaction_threshold. This will increase the number of files that must
> exist in each size tier before it will be compacted. As a result the speed
> at which rows move into the higher tiers will slow down.
>
> Note that having lots of files may have a negative impact on read
> performance. You can measure this by looking at the SSTables per read
> metric in the cfhistograms.
>
> Lastly you can run a user defined or major compaction. User defined
> compaction is available via JMX and allows you to compact any file you
> want. Manual / major compaction is available via nodetool. We usually
> discourage its use as it will create one big file that will not get
> compacted for a while.
>
>
> For background, the tombstones / expired columns for a row are only purged
> from the database when all fragments of the row are in the files being
> compacted. So if you have an old row that is spread out over many files it
> may not get purged.
>
> Hope that helps.
>
>
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 14/12/2012, at 3:01 AM, Mike Smith <mi...@mailchannels.com> wrote:
>
> I'm using 1.0.12 and I find that large sstables tend to get compacted
> infrequently. I've got data that gets deleted or expired frequently. Is it
> possible to use scrub to accelerate the clean up of expired/deleted data?
>
> --
> Mike Smith
> Director Development, MailChannels
>
>
>


-- 
Mike Smith
Director Development, MailChannels

Re: Does a scrub remove deleted/expired columns?

Posted by aaron morton <aa...@thelastpickle.com>.
>  Is it possible to use scrub to accelerate the clean up of expired/deleted data?
No.
Scrub, and upgradesstables, are used to re-write each file on disk. Scrub may remove some rows from a file because of corruption, however upgradesstables will not. 
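
If you do want to rewrite the files for some other reason, both are plain nodetool commands (the keyspace and column family names here are just placeholders):

    nodetool -h localhost scrub MyKeyspace MyCF
    nodetool -h localhost upgradesstables MyKeyspace MyCF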

If you have long-lived rows and a mixed workload of writes and deletes there are a couple of options. 

You can try levelled compaction http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
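
For example, switching an existing column family over from cassandra-cli looks roughly like this; the CF name and the sstable_size_in_mb value are placeholders, and the exact option syntax may differ slightly between versions, so check the cli help first:

    [default@MyKeyspace] update column family MyCF
        with compaction_strategy = 'LeveledCompactionStrategy'
        and compaction_strategy_options = {sstable_size_in_mb: 10};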

You can tune the default size tiered compaction by increasing the min_compaction_threshold. This will increase the number of files that must exist in each size tier before it will be compacted. As a result the speed at which rows move into the higher tiers will slow down. 
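
Either of these should do it; again, the names and thresholds are placeholders and the syntax is from memory:

    # persistent schema change via cassandra-cli
    [default@MyKeyspace] update column family MyCF with min_compaction_threshold = 8;

    # or a runtime change on a single node (not persisted across restarts)
    nodetool -h localhost setcompactionthreshold MyKeyspace MyCF 8 32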

Note that having lots of files may have a negative impact on read performance. You can measure this by looking at the SSTables per read metric in the cfhistograms. 
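
Something like the following; the SSTables column in the output is the one to watch, it tells you how many sstables were touched per read:

    nodetool -h localhost cfhistograms MyKeyspace MyCF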

Lastly you can run a user defined or major compaction. User defined compaction is available via JMX and allows you to compact any file you want. Manual / major compaction is available via nodetool. We usually discourage its use as it will create one big file that will not get compacted for a while. 
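
For the user defined route the operation is forceUserDefinedCompaction on the org.apache.cassandra.db:type=CompactionManager MBean, which you can call from jconsole or a command line JMX client such as jmxterm. The jar name, the data file name and the argument list below are only illustrative, so check the operation's signature in jconsole for your version first:

    java -jar jmxterm-uber.jar -l localhost:7199
    $> run -b org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction MyKeyspace MyCF-hc-1234-Data.db

And a major compaction is just:

    nodetool -h localhost compact MyKeyspace MyCF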


For background, the tombstones / expired columns for a row are only purged from the database when all fragments of the row are in the files being compacted. So if you have an old row that is spread out over many files it may not get purged. 

Hope that helps. 



-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 14/12/2012, at 3:01 AM, Mike Smith <mi...@mailchannels.com> wrote:

> I'm using 1.0.12 and I find that large sstables tend to get compacted infrequently. I've got data that gets deleted or expired frequently. Is it possible to use scrub to accelerate the clean up of expired/deleted data?
> 
> -- 
> Mike Smith
> Director Development, MailChannels
>