Posted to user@cassandra.apache.org by Oleksandr Shulgin <ol...@zalando.de> on 2018/09/10 13:53:14 UTC

Drop TTLd rows: upgradesstables -a or scrub?

Hello,

We have some tables with a significant amount of TTLd rows that have expired
by now (and more than gc_grace_seconds has passed since the TTL).  We have
stopped writing more data to these tables quite a while ago, so background
compaction isn't running.  The compaction strategy is the default
SizeTiered one.

Now we would like to get rid of all the droppable tombstones in these
tables.  What would be the approach that puts the least stress on the
cluster?

We've considered a few, but the most promising ones seem to be these two:
`nodetool scrub` or `nodetool upgradesstables -a`.  We are using Cassandra
version 3.0.
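
For concreteness, we would run one of these per table (keyspace and table
names below are placeholders):

    nodetool scrub my_keyspace my_table
    nodetool upgradesstables -a my_keyspace my_table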

Now, this docs page recommends using upgradesstables wherever possible:
https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsScrub.html
What is the reason behind it?

From the source code I can see that Scrubber is the class which is going to
drop the tombstones (and report the total number dropped in the logs):
https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/compaction/Scrubber.java#L308

I couldn't find similar handling in the upgradesstables code path.  Is the
assumption correct that this one will not drop the tombstones as a side
effect of rewriting the files?

Any drawbacks of using scrub for this task?

Thanks,
-- 
Oleksandr "Alex" Shulgin | Senior Software Engineer | Team Flux | Data
Services | Zalando SE | Tel: +49 176 127-59-707

Re: Drop TTLd rows: upgradesstables -a or scrub?

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Tue, Sep 11, 2018 at 11:07 AM Steinmaurer, Thomas <
thomas.steinmaurer@dynatrace.com> wrote:

>
> a single (largish) SSTable or any other SSTable for a table, which does
> not get any writes (with e.g. deletes) anymore, will most likely not be
> part of an automatic minor compaction anymore, thus may stay forever on
> disk, if I don’t miss anything crucial here.
>

I would also expect that, but that's totally fine for us.


> Might be different though, if you are entirely writing TTL-based, cause
> single SSTable based automatic tombstone compaction may kick in here, but
> I’m not really experienced with that.
>

Yes, we were writing with a TTL of 2 years to these tables, and in about a
year from now 100% of the data in them will expire.  We would be able to
simply truncate them at that point.

Now that you mention single-SSTable tombstone compaction again, I don't
think this is happening in our case.  For example, on one of the nodes I
see estimated droppable tombstone ratios ranging from 0.24 to slightly over 1
(1.09).  Yet, apparently no single-SSTable compaction has been triggered,
because the data files are all 6 months old now.  We are using all the
default settings for tombstone_threshold, tombstone_compaction_interval
and unchecked_tombstone_compaction.

Does this mean that all these SSTable files do indeed overlap and, because
we don't allow unchecked_tombstone_compaction, no actual compaction is
triggered?
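
For reference, the estimated droppable tombstone ratio can be read per data
file with the sstablemetadata tool, and enabling the unchecked behavior would
be a compaction subproperty change, roughly as follows (keyspace, table and
file names are placeholders; sstablemetadata ships under tools/bin in some
packages):

    sstablemetadata /var/lib/cassandra/data/my_keyspace/my_table-1234/mc-1-big-Data.db | grep -i droppable

    cqlsh -e "ALTER TABLE my_keyspace.my_table WITH compaction = {
        'class': 'SizeTieredCompactionStrategy',
        'unchecked_tombstone_compaction': 'true'}"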

We had been suffering a lot with storing timeseries data with STCS and disk
> capacity to have the cluster working smoothly and automatic minor
> compactions kicking out aged timeseries data according to our retention
> policies in the business logic. TWCS is unfortunately not an option for us.
> So, we did run major compactions every X weeks to reclaim disk space, thus
> from an operational perspective, by far not nice. Thus, finally decided to
> change STCS min_threshold from default 4 to 2, to let minor compactions
> kick in more frequently. We can live with the additional IO/CPU this is
> causing, thus is our current approach to disk space and sizing issues we
> had in the past.
>

For our new generation of tables we have switched to use TWCS, that's the
reason we don't write anymore to those old tables which are still using
STCS.

Cheers,
--
Alex

RE: Drop TTLd rows: upgradesstables -a or scrub?

Posted by "Steinmaurer, Thomas" <th...@dynatrace.com>.
Alex,

a single (largish) SSTable or any other SSTable for a table, which does not get any writes (with e.g. deletes) anymore, will most likely not be part of an automatic minor compaction anymore, thus may stay forever on disk, if I don’t miss anything crucial here. Might be different though, if you are entirely writing TTL-based, cause single SSTable based automatic tombstone compaction may kick in here, but I’m not really experienced with that.

We had been suffering a lot with storing timeseries data with STCS and disk capacity to have the cluster working smoothly and automatic minor compactions kicking out aged timeseries data according to our retention policies in the business logic. TWCS is unfortunately not an option for us. So, we did run major compactions every X weeks to reclaim disk space, thus from an operational perspective, by far not nice. Thus, finally decided to change STCS min_threshold from default 4 to 2, to let minor compactions kick in more frequently. We can live with the additional IO/CPU this is causing, thus is our current approach to disk space and sizing issues we had in the past.
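
For reference, the min_threshold change mentioned above is just a compaction
subproperty, roughly (keyspace and table names are placeholders):

    cqlsh -e "ALTER TABLE my_keyspace.my_table WITH compaction = {
        'class': 'SizeTieredCompactionStrategy', 'min_threshold': '2'}"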

Thomas

From: Oleksandr Shulgin <ol...@zalando.de>
Sent: Dienstag, 11. September 2018 09:47
To: User <us...@cassandra.apache.org>
Subject: Re: Drop TTLd rows: upgradesstables -a or scrub?

On Tue, Sep 11, 2018 at 9:31 AM Steinmaurer, Thomas <th...@dynatrace.com>> wrote:
As far as I remember, in newer Cassandra versions, with STCS, nodetool compact offers a ‘-s’ command-line option to split the output into files with 50%, 25% … in size, thus in this case, not a single largish SSTable anymore. By default, without -s, it is a single SSTable though.

Thanks Thomas, I've also spotted the option while testing this approach.  I understand that doing major compactions is generally not recommended, but do you see any real drawback of having a single SSTable file in case we stopped writing new data to the table?

--
Alex


Re: Drop TTLd rows: upgradesstables -a or scrub?

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Tue, Sep 11, 2018 at 10:04 AM Oleksandr Shulgin <
oleksandr.shulgin@zalando.de> wrote:

>
> Yet another surprising aspect of using `nodetool compact` is that it
> triggers major compaction on *all* nodes in the cluster at the same time.
> I don't see where this is documented and this was contrary to my
> expectation.  Does this behavior make sense to anyone?  Is this a bug?  The
> version is 3.0.
>

Whoops, taking back this one.  It was me who triggered the compaction on
all nodes at the same time.  Trying to do too many things at the same time.
:(

--
Alex

Re: Drop TTLd rows: upgradesstables -a or scrub?

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Tue, Sep 11, 2018 at 9:47 AM Oleksandr Shulgin <
oleksandr.shulgin@zalando.de> wrote:

> On Tue, Sep 11, 2018 at 9:31 AM Steinmaurer, Thomas <
> thomas.steinmaurer@dynatrace.com> wrote:
>
>> As far as I remember, in newer Cassandra versions, with STCS, nodetool
>> compact offers a ‘-s’ command-line option to split the output into files
>> with 50%, 25% … in size, thus in this case, not a single largish SSTable
>> anymore. By default, without -s, it is a single SSTable though.
>>
>
> Thanks Thomas, I've also spotted the option while testing this approach.
>

Yet another surprising aspect of using `nodetool compact` is that it
triggers major compaction on *all* nodes in the cluster at the same time.
I don't see where this is documented and this was contrary to my
expectation.  Does this behavior make sense to anyone?  Is this a bug?  The
version is 3.0.

--
Alex

Re: Major compaction ignoring one SSTable? (was Re: Fresh SSTable files (due to repair?) in a static table (was Re: Drop TTLd rows: upgradesstables -a or scrub?))

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Tue, Sep 18, 2018 at 10:38 AM Steinmaurer, Thomas <
thomas.steinmaurer@dynatrace.com> wrote:

>
> any indications in Cassandra log about insufficient disk space during
> compactions?
>

Bingo!  The following was logged around the time compaction was started
(and I only looked around when it was finishing):

Not enough space for compaction, 284674.12MB estimated.  Reducing scope.

That still leaves the question of why the estimate doesn't take into account
the tombstones which will be dropped in the process: the result actually takes
only slightly more than 100GB in the end, as seen on the other nodes.
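
For anyone checking their own nodes, that message is easy to find with
something like this (log location depends on the install):

    grep -i 'not enough space for compaction' /var/log/cassandra/system.log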

Thanks, Thomas!
--
Alex

RE: Major compaction ignoring one SSTable? (was Re: Fresh SSTable files (due to repair?) in a static table (was Re: Drop TTLd rows: upgradesstables -a or scrub?))

Posted by "Steinmaurer, Thomas" <th...@dynatrace.com>.
Alex,

any indications in Cassandra log about insufficient disk space during compactions?

Thomas

From: Oleksandr Shulgin <ol...@zalando.de>
Sent: Dienstag, 18. September 2018 10:01
To: User <us...@cassandra.apache.org>
Subject: Major compaction ignoring one SSTable? (was Re: Fresh SSTable files (due to repair?) in a static table (was Re: Drop TTLd rows: upgradesstables -a or scrub?))

On Mon, Sep 17, 2018 at 4:29 PM Oleksandr Shulgin <ol...@zalando.de>> wrote:

Thanks for your reply!  Indeed it could be coming from single-SSTable compaction, this I didn't think about.  By any chance looking into compaction_history table could be useful to trace it down?

Hello,

Yet another unexpected thing we are seeing is that after a major compaction completed on one of the nodes there are two SSTables instead of only one (time is UTC):

-rw-r--r-- 1 999 root 99G Sep 18 00:13 mc-583-big-Data.db
-rw-r--r-- 1 999 root 70G Mar  8  2018 mc-74-big-Data.db

The more recent one must be the result of major compaction on this table, but why the other one from March was not included?

The logs don't help to understand the reason, and from compaction history on this node the following record seems to be the only trace:

@ Row 1
-------------------+------------------------------------------------------------------
 id                | b6feb180-bad7-11e8-9f42-f1a67c22839a
 bytes_in          | 223804299627
 bytes_out         | 105322622473
 columnfamily_name | XXX
 compacted_at      | 2018-09-18 00:13:48+0000
 keyspace_name     | YYY
 rows_merged       | {1: 31321943, 2: 11722759, 3: 382232, 4: 23405, 5: 2250, 6: 134}

This also doesn't tell us a lot.

This has happened only on one node out of 10 where the same command was used to start major compaction on this table.

Any ideas what could be the reason?

For now we have just started major compaction again to ensure these two last tables are compacted together, but we would really like to understand the reason for this behavior.

Regards,
--
Alex


Major compaction ignoring one SSTable? (was Re: Fresh SSTable files (due to repair?) in a static table (was Re: Drop TTLd rows: upgradesstables -a or scrub?))

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Mon, Sep 17, 2018 at 4:29 PM Oleksandr Shulgin <
oleksandr.shulgin@zalando.de> wrote:

>
> Thanks for your reply!  Indeed it could be coming from single-SSTable
> compaction, this I didn't think about.  By any chance looking into
> compaction_history table could be useful to trace it down?
>

Hello,

Yet another unexpected thing we are seeing is that after a major compaction
completed on one of the nodes there are two SSTables instead of only one
(time is UTC):

-rw-r--r-- 1 999 root 99G Sep 18 00:13 mc-583-big-Data.db
-rw-r--r-- 1 999 root 70G Mar  8  2018 mc-74-big-Data.db

The more recent one must be the result of major compaction on this table,
but why was the other one from March not included?

The logs don't help to understand the reason, and from compaction history
on this node the following record seems to be the only trace:

@ Row 1
-------------------+------------------------------------------------------------------
 id                | b6feb180-bad7-11e8-9f42-f1a67c22839a
 bytes_in          | 223804299627
 bytes_out         | 105322622473
 columnfamily_name | XXX
 compacted_at      | 2018-09-18 00:13:48+0000
 keyspace_name     | YYY
 rows_merged       | {1: 31321943, 2: 11722759, 3: 382232, 4: 23405, 5:
2250, 6: 134}

This also doesn't tell us a lot.
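
The record above is from the system.compaction_history table; the vertical
"@ Row 1" layout comes from cqlsh's EXPAND ON mode, and filtering by table
name needs ALLOW FILTERING.  Roughly:

    cqlsh -e "SELECT id, compacted_at, bytes_in, bytes_out, rows_merged
        FROM system.compaction_history
        WHERE keyspace_name = 'YYY' AND columnfamily_name = 'XXX'
        ALLOW FILTERING"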

This has happened only on one node out of 10 where the same command was
used to start major compaction on this table.

Any ideas what could be the reason?

For now we have just started major compaction again to ensure these two
last tables are compacted together, but we would really like to understand
the reason for this behavior.

Regards,
--
Alex

Re: Fresh SSTable files (due to repair?) in a static table (was Re: Drop TTLd rows: upgradesstables -a or scrub?)

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Mon, Sep 17, 2018 at 4:41 PM Jeff Jirsa <jj...@gmail.com> wrote:

> Marcus’ idea of row lifting seems more likely, since you’re using STCS -
> it’s an optimization to “lift” expensive reads into a single sstable for
> future reads (if a read touches more than - I think - 4? sstables, we copy
> it back into the memtable so it’s flushed into a single sstable), so if you
> have STCS and you’re still doing reads, it could definitely be that.
>

A-ha, that's eye-opening: it could definitely be that.  Thanks for
explanation!

--
Alex

Re: Fresh SSTable files (due to repair?) in a static table (was Re: Drop TTLd rows: upgradesstables -a or scrub?)

Posted by Jeff Jirsa <jj...@gmail.com>.


> On Sep 17, 2018, at 7:29 AM, Oleksandr Shulgin <ol...@zalando.de> wrote:
> 
> On Mon, Sep 17, 2018 at 4:04 PM Jeff Jirsa <jj...@gmail.com> wrote:
>>> Again, given that the tables are not updated anymore from the application and we have repaired them successfully multiple times already, how can it be that any inconsistency would be found by read-repair or normal repair?
>>> 
>>> We have seen this on a number of nodes, including SSTables written at the time there was guaranteed no repair running.
>> Not obvious to me where the sstable is coming from - you’d have to look in the logs. If it’s read repair, it’ll be created during a memtable flush. If it’s nodetool repair, it’ll be streamed in. It could also be compaction (especially tombstone compaction), in which case it’ll be in the compaction logs and it’ll have an sstable ancestor in the metadata.
> 
> Jeff,
> 
> Thanks for your reply!  Indeed it could be coming from single-SSTable compaction, this I didn't think about.  By any chance looking into compaction_history table could be useful to trace it down?
> 

Maybe. Also check your normal system / debug logs (depending on your version), which will usually tell you inputs and outputs

Marcus’ idea of row lifting seems more likely, since you’re using STCS - it’s an optimization to “lift” expensive reads into a single sstable for future reads (if a read touches more than - I think - 4? sstables, we copy it back into the memtable so it’s flushed into a single sstable), so if you have STCS and you’re still doing reads, it could definitely be that.

- Jeff

Re: Fresh SSTable files (due to repair?) in a static table (was Re: Drop TTLd rows: upgradesstables -a or scrub?)

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Mon, Sep 17, 2018 at 4:04 PM Jeff Jirsa <jj...@gmail.com> wrote:

> Again, given that the tables are not updated anymore from the application
> and we have repaired them successfully multiple times already, how can it
> be that any inconsistency would be found by read-repair or normal repair?
>
> We have seen this on a number of nodes, including SSTables written at the
> time there was guaranteed no repair running.
>
> Not obvious to me where the sstable is coming from - you’d have to look in
> the logs. If it’s read repair, it’ll be created during a memtable flush. If
> it’s nodetool repair, it’ll be streamed in. It could also be compaction
> (especially tombstone compaction), in which case it’ll be in the compaction
> logs and it’ll have an sstable ancestor in the metadata.
>

Jeff,

Thanks for your reply!  Indeed it could be coming from single-SSTable
compaction, this I didn't think about.  By any chance looking into
compaction_history table could be useful to trace it down?

--
Alex

Re: Fresh SSTable files (due to repair?) in a static table (was Re: Drop TTLd rows: upgradesstables -a or scrub?)

Posted by Marcus Eriksson <kr...@gmail.com>.
It could also be https://issues.apache.org/jira/browse/CASSANDRA-2503

On Mon, Sep 17, 2018 at 4:04 PM Jeff Jirsa <jj...@gmail.com> wrote:

>
>
> On Sep 17, 2018, at 2:34 AM, Oleksandr Shulgin <
> oleksandr.shulgin@zalando.de> wrote:
>
> On Tue, Sep 11, 2018 at 8:10 PM Oleksandr Shulgin <
> oleksandr.shulgin@zalando.de> wrote:
>
>> On Tue, 11 Sep 2018, 19:26 Jeff Jirsa, <jj...@gmail.com> wrote:
>>
>>> Repair or read-repair
>>>
>>
>> Could you be more specific please?
>>
>> Why any data would be streamed in if there is no (as far as I can see)
>> possibilities for the nodes to have inconsistency?
>>
>
> Again, given that the tables are not updated anymore from the application
> and we have repaired them successfully multiple times already, how can it
> be that any inconsistency would be found by read-repair or normal repair?
>
> We have seen this on a number of nodes, including SSTables written at the
> time there was guaranteed no repair running.
>
>
> Not obvious to me where the sstable is coming from - you’d have to look in
> the logs. If it’s read repair, it’ll be created during a memtable flush. If
> it’s nodetool repair, it’ll be streamed in. It could also be compaction
> (especially tombstone compaction), in which case it’ll be in the compaction
> logs and it’ll have an sstable ancestor in the metadata.
>
>
>

Re: Fresh SSTable files (due to repair?) in a static table (was Re: Drop TTLd rows: upgradesstables -a or scrub?)

Posted by Jeff Jirsa <jj...@gmail.com>.

> On Sep 17, 2018, at 2:34 AM, Oleksandr Shulgin <ol...@zalando.de> wrote:
> 
>> On Tue, Sep 11, 2018 at 8:10 PM Oleksandr Shulgin <ol...@zalando.de> wrote:
>>> On Tue, 11 Sep 2018, 19:26 Jeff Jirsa, <jj...@gmail.com> wrote:
>>> Repair or read-repair
>> 
>> 
>> Could you be more specific please?
>> 
>> Why any data would be streamed in if there is no (as far as I can see) possibilities for the nodes to have inconsistency?
> 
> Again, given that the tables are not updated anymore from the application and we have repaired them successfully multiple times already, how can it be that any inconsistency would be found by read-repair or normal repair?
> 
> We have seen this on a number of nodes, including SSTables written at the time there was guaranteed no repair running.
> 

Not obvious to me where the sstable is coming from - you’d have to look in the logs. If it’s read repair, it’ll be created during a memtable flush. If it’s nodetool repair, it’ll be streamed in. It could also be compaction (especially tombstone compaction), in which case it’ll be in the compaction logs and it’ll have an sstable ancestor in the metadata.



Re: Fresh SSTable files (due to repair?) in a static table (was Re: Drop TTLd rows: upgradesstables -a or scrub?)

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Tue, Sep 11, 2018 at 8:10 PM Oleksandr Shulgin <
oleksandr.shulgin@zalando.de> wrote:

> On Tue, 11 Sep 2018, 19:26 Jeff Jirsa, <jj...@gmail.com> wrote:
>
>> Repair or read-repair
>>
>
> Could you be more specific please?
>
> Why any data would be streamed in if there is no (as far as I can see)
> possibilities for the nodes to have inconsistency?
>

Again, given that the tables are not updated anymore from the application
and we have repaired them successfully multiple times already, how can it
be that any inconsistency would be found by read-repair or normal repair?

We have seen this on a number of nodes, including SSTables written at the
time there was guaranteed no repair running.

Regards,
--
Alex

Re: Fresh SSTable files (due to repair?) in a static table (was Re: Drop TTLd rows: upgradesstables -a or scrub?)

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Tue, 11 Sep 2018, 19:26 Jeff Jirsa, <jj...@gmail.com> wrote:

> Repair or read-repair
>

Jeff,

Could you be more specific please?

Why any data would be streamed in if there is no (as far as I can see)
possibilities for the nodes to have inconsistency?

--
Alex

On Tue, Sep 11, 2018 at 12:58 AM Oleksandr Shulgin <
> oleksandr.shulgin@zalando.de> wrote:
>
>> On Tue, Sep 11, 2018 at 9:47 AM Oleksandr Shulgin <
>> oleksandr.shulgin@zalando.de> wrote:
>>
>>> On Tue, Sep 11, 2018 at 9:31 AM Steinmaurer, Thomas <
>>> thomas.steinmaurer@dynatrace.com> wrote:
>>>
>>>> As far as I remember, in newer Cassandra versions, with STCS, nodetool
>>>> compact offers a ‘-s’ command-line option to split the output into files
>>>> with 50%, 25% … in size, thus in this case, not a single largish SSTable
>>>> anymore. By default, without -s, it is a single SSTable though.
>>>>
>>>
>>> Thanks Thomas, I've also spotted the option while testing this
>>> approach.  I understand that doing major compactions is generally not
>>> recommended, but do you see any real drawback of having a single SSTable
>>> file in case we stopped writing new data to the table?
>>>
>>
>> A related question is: given that we are not writing new data to these
>> tables, it would make sense to exclude them from the routine repair
>> regardless of the option we use in the end to remove the tombstones.
>>
>> However, I've just checked the timestamps of the SSTable files on one of
>> the nodes and to my surprise I can find some files written only a few weeks
>> ago (most of the files are half a year ago, which is expected because it
>> was the time we were adding this DC).  But we've stopped writing to the
>> tables about a year ago and we repair the cluster very week.
>>
>> What could explain that we suddenly see these new SSTable files?  They
>> shouldn't be there even due to overstreaming, because one would need to
>> find some differences in the Merkle tree in the first place, but I don't
>> see how that could actually happen in our case.
>>
>> Any ideas?
>>
>> Thanks,
>> --
>> Alex
>>
>>

Re: Fresh SSTable files (due to repair?) in a static table (was Re: Drop TTLd rows: upgradesstables -a or scrub?)

Posted by Jeff Jirsa <jj...@gmail.com>.
Repair or read-repair


On Tue, Sep 11, 2018 at 12:58 AM Oleksandr Shulgin <
oleksandr.shulgin@zalando.de> wrote:

> On Tue, Sep 11, 2018 at 9:47 AM Oleksandr Shulgin <
> oleksandr.shulgin@zalando.de> wrote:
>
>> On Tue, Sep 11, 2018 at 9:31 AM Steinmaurer, Thomas <
>> thomas.steinmaurer@dynatrace.com> wrote:
>>
>>> As far as I remember, in newer Cassandra versions, with STCS, nodetool
>>> compact offers a ‘-s’ command-line option to split the output into files
>>> with 50%, 25% … in size, thus in this case, not a single largish SSTable
>>> anymore. By default, without -s, it is a single SSTable though.
>>>
>>
>> Thanks Thomas, I've also spotted the option while testing this approach.
>> I understand that doing major compactions is generally not recommended, but
>> do you see any real drawback of having a single SSTable file in case we
>> stopped writing new data to the table?
>>
>
> A related question is: given that we are not writing new data to these
> tables, it would make sense to exclude them from the routine repair
> regardless of the option we use in the end to remove the tombstones.
>
> However, I've just checked the timestamps of the SSTable files on one of
> the nodes and to my surprise I can find some files written only a few weeks
> ago (most of the files are half a year ago, which is expected because it
> was the time we were adding this DC).  But we've stopped writing to the
> tables about a year ago and we repair the cluster very week.
>
> What could explain that we suddenly see these new SSTable files?  They
> shouldn't be there even due to overstreaming, because one would need to
> find some differences in the Merkle tree in the first place, but I don't
> see how that could actually happen in our case.
>
> Any ideas?
>
> Thanks,
> --
> Alex
>
>

Fresh SSTable files (due to repair?) in a static table (was Re: Drop TTLd rows: upgradesstables -a or scrub?)

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Tue, Sep 11, 2018 at 9:47 AM Oleksandr Shulgin <
oleksandr.shulgin@zalando.de> wrote:

> On Tue, Sep 11, 2018 at 9:31 AM Steinmaurer, Thomas <
> thomas.steinmaurer@dynatrace.com> wrote:
>
>> As far as I remember, in newer Cassandra versions, with STCS, nodetool
>> compact offers a ‘-s’ command-line option to split the output into files
>> with 50%, 25% … in size, thus in this case, not a single largish SSTable
>> anymore. By default, without -s, it is a single SSTable though.
>>
>
> Thanks Thomas, I've also spotted the option while testing this approach.
> I understand that doing major compactions is generally not recommended, but
> do you see any real drawback of having a single SSTable file in case we
> stopped writing new data to the table?
>

A related question is: given that we are not writing new data to these
tables, it would make sense to exclude them from the routine repair
regardless of the option we use in the end to remove the tombstones.

However, I've just checked the timestamps of the SSTable files on one of
the nodes and to my surprise I can find some files written only a few weeks
ago (most of the files are from half a year ago, which is expected because
that was when we were adding this DC).  But we've stopped writing to the
tables about a year ago and we repair the cluster every week.
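
By checking timestamps I just mean looking at the data file modification
times, e.g. (data directory path and names are placeholders):

    ls -lt /var/lib/cassandra/data/my_keyspace/my_table-1234/*-Data.db | head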

What could explain that we suddenly see these new SSTable files?  They
shouldn't be there even due to overstreaming, because one would need to
find some differences in the Merkle tree in the first place, but I don't
see how that could actually happen in our case.

Any ideas?

Thanks,
--
Alex

Re: Drop TTLd rows: upgradesstables -a or scrub?

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Tue, Sep 11, 2018 at 9:31 AM Steinmaurer, Thomas <
thomas.steinmaurer@dynatrace.com> wrote:

> As far as I remember, in newer Cassandra versions, with STCS, nodetool
> compact offers a ‘-s’ command-line option to split the output into files
> with 50%, 25% … in size, thus in this case, not a single largish SSTable
> anymore. By default, without -s, it is a single SSTable though.
>

Thanks Thomas, I've also spotted the option while testing this approach.  I
understand that doing major compactions is generally not recommended, but
do you see any real drawback of having a single SSTable file in case we
stopped writing new data to the table?

--
Alex

RE: Drop TTLd rows: upgradesstables -a or scrub?

Posted by "Steinmaurer, Thomas" <th...@dynatrace.com>.
As far as I remember, in newer Cassandra versions, with STCS, nodetool compact offers a ‘-s’ command-line option to split the output into files with 50%, 25% … in size, thus in this case, not a single largish SSTable anymore. By default, without -s, it is a single SSTable though.
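
For example (keyspace and table names are placeholders):

    nodetool compact -s my_keyspace my_table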

Thomas

From: Jeff Jirsa <jj...@gmail.com>
Sent: Montag, 10. September 2018 19:40
To: cassandra <us...@cassandra.apache.org>
Subject: Re: Drop TTLd rows: upgradesstables -a or scrub?

I think it's important to describe exactly what's going on for people who just read the list but who don't have context. This blog does a really good job: http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html , but briefly:

- When a TTL expires, we treat it as a tombstone, because it may have been written ON TOP of another piece of live data, so we need to get that deletion marker to all hosts, just like a manual explicit delete
- Tombstones in sstable A may shadow data in sstable B, so doing anything on just one sstable MAY NOT remove the tombstone - we can't get rid of the tombstone if sstable A overlaps another sstable with the same partition (which we identify via bloom filter) that has any data with a lower timestamp (we don't check the sstable for a shadowed value, we just look at the minimum live timestamp of the table)

"nodetool garbagecollect" looks for sstables that overlap (partition keys) and combine them together, which makes tombstones past GCGS purgable and should remove them (and data shadowed by them).

If you're on a version without nodetool garbagecollection, you can approximate it using user defined compaction ( http://thelastpickle.com/blog/2016/10/18/user-defined-compaction.html ) - it's a JMX endpoint that lets you tell cassandra to compact one or more sstables together based on parameters you choose. This is somewhat like upgradesstables or scrub, but you can combine sstables as well. If you choose candidates intelligently (notably, oldest sstables first, or sstables you know overlap), you can likely manually clean things up pretty quickly. At one point, I had a jar that would do single sstable at a time, oldest sstable first, and it pretty much worked for this purpose most of the time.

If you have room, a "nodetool compact" on stcs will also work, but it'll give you one huge sstable, which will be unfortunate long term (probably less of a problem if you're no longer writing to this table).


On Mon, Sep 10, 2018 at 10:29 AM Charulata Sharma (charshar) <ch...@cisco.com.invalid>> wrote:
Scrub takes a very long time and does not remove the tombstones. You should do garbage cleaning. It immediately removes the tombstones.

Thaks,
Charu

From: Oleksandr Shulgin <ol...@zalando.de>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Monday, September 10, 2018 at 6:53 AM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Drop TTLd rows: upgradesstables -a or scrub?

Hello,

We have some tables with significant amount of TTLd rows that have expired by now (and more gc_grace_seconds have passed since the TTL).  We have stopped writing more data to these tables quite a while ago, so background compaction isn't running.  The compaction strategy is the default SizeTiered one.

Now we would like to get rid of all the droppable tombstones in these tables.  What would be the approach that puts the least stress on the cluster?

We've considered a few, but the most promising ones seem to be these two: `nodetool scrub` or `nodetool upgradesstables -a`.  We are using Cassandra version 3.0.

Now, this docs page recommends to use upgradesstables wherever possible: https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsScrub.html
What is the reason behind it?

From source code I can see that Scrubber the class which is going to drop the tombstones (and report the total number in the logs): https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/compaction/Scrubber.java#L308

I couldn't find similar handling in the upgradesstables code path.  Is the assumption correct that this one will not drop the tombstone as a side effect of rewriting the files?

Any drawbacks of using scrub for this task?

Thanks,
--
Oleksandr "Alex" Shulgin | Senior Software Engineer | Team Flux | Data Services | Zalando SE | Tel: +49 176 127-59-707


RE: Drop TTLd rows: upgradesstables -a or scrub?

Posted by "Steinmaurer, Thomas" <th...@dynatrace.com>.
From: Jeff Jirsa <jj...@gmail.com>
Sent: Montag, 10. September 2018 19:40
To: cassandra <us...@cassandra.apache.org>
Subject: Re: Drop TTLd rows: upgradesstables -a or scrub?

I think it's important to describe exactly what's going on for people who just read the list but who don't have context. This blog does a really good job: http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html , but briefly:

- When a TTL expires, we treat it as a tombstone, because it may have been written ON TOP of another piece of live data, so we need to get that deletion marker to all hosts, just like a manual explicit delete
- Tombstones in sstable A may shadow data in sstable B, so doing anything on just one sstable MAY NOT remove the tombstone - we can't get rid of the tombstone if sstable A overlaps another sstable with the same partition (which we identify via bloom filter) that has any data with a lower timestamp (we don't check the sstable for a shadowed value, we just look at the minimum live timestamp of the table)

"nodetool garbagecollect" looks for sstables that overlap (partition keys) and combine them together, which makes tombstones past GCGS purgable and should remove them (and data shadowed by them).

If you're on a version without nodetool garbagecollection, you can approximate it using user defined compaction ( http://thelastpickle.com/blog/2016/10/18/user-defined-compaction.html ) - it's a JMX endpoint that lets you tell cassandra to compact one or more sstables together based on parameters you choose. This is somewhat like upgradesstables or scrub, but you can combine sstables as well. If you choose candidates intelligently (notably, oldest sstables first, or sstables you know overlap), you can likely manually clean things up pretty quickly. At one point, I had a jar that would do single sstable at a time, oldest sstable first, and it pretty much worked for this purpose most of the time.

If you have room, a "nodetool compact" on stcs will also work, but it'll give you one huge sstable, which will be unfortunate long term (probably less of a problem if you're no longer writing to this table).





On Mon, Sep 10, 2018 at 10:29 AM Charulata Sharma (charshar) <ch...@cisco.com.invalid>> wrote:
Scrub takes a very long time and does not remove the tombstones. You should do garbage cleaning. It immediately removes the tombstones.

Thaks,
Charu

From: Oleksandr Shulgin <ol...@zalando.de>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Monday, September 10, 2018 at 6:53 AM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Drop TTLd rows: upgradesstables -a or scrub?

Hello,

We have some tables with significant amount of TTLd rows that have expired by now (and more gc_grace_seconds have passed since the TTL).  We have stopped writing more data to these tables quite a while ago, so background compaction isn't running.  The compaction strategy is the default SizeTiered one.

Now we would like to get rid of all the droppable tombstones in these tables.  What would be the approach that puts the least stress on the cluster?

We've considered a few, but the most promising ones seem to be these two: `nodetool scrub` or `nodetool upgradesstables -a`.  We are using Cassandra version 3.0.

Now, this docs page recommends to use upgradesstables wherever possible: https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsScrub.html
What is the reason behind it?

From source code I can see that Scrubber the class which is going to drop the tombstones (and report the total number in the logs): https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/compaction/Scrubber.java#L308

I couldn't find similar handling in the upgradesstables code path.  Is the assumption correct that this one will not drop the tombstone as a side effect of rewriting the files?

Any drawbacks of using scrub for this task?

Thanks,
--
Oleksandr "Alex" Shulgin | Senior Software Engineer | Team Flux | Data Services | Zalando SE | Tel: +49 176 127-59-707


Re: Drop TTLd rows: upgradesstables -a or scrub?

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Mon, Sep 10, 2018 at 10:03 PM Jeff Jirsa <jj...@gmail.com> wrote:

> How much free space do you have, and how big is the table?
>

So there are 2 tables, one is around 120GB and the other is around 250GB on
every node.  On the node with the most free disk space we still have around
500GB available and on the node with the least free space: 300GB.

So if I understand it correctly, we could still do major compaction while
keeping STCS and we should not hit 100% disk space, if we first compact one
of the tables, and then the other (we expect quite some free space to
become available due to to all those TTL tombstones being removed in the
process).

Is there any real drawback of having a single big SSTable in our case, where
we are never going to append more data to the table?

Switching to LCS is another option.
>

Hm, this is interesting idea.  The expectation should be that even if we
don't remove 100% of the tombstones, we should be able to get rid of 90% of
them on the highest level, right?  And if we had less space
available, using LCS could make progress by re-organizing the partitions in
smaller increments, so we could still do it if we had less free space than
the smallest table?
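
If we went that way, the switch itself would just be a compaction strategy
change, roughly (names are placeholders):

    cqlsh -e "ALTER TABLE my_keyspace.my_table WITH compaction = {
        'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': '160'}"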

Cheers,
--
Alex

Re: Drop TTLd rows: upgradesstables -a or scrub?

Posted by Jeff Jirsa <jj...@gmail.com>.
How much free space do you have, and how big is the table?

Switching to LCS is another option. 

-- 
Jeff Jirsa


> On Sep 10, 2018, at 12:09 PM, Oleksandr Shulgin <ol...@zalando.de> wrote:
> 
>> On Mon, 10 Sep 2018, 19:40 Jeff Jirsa, <jj...@gmail.com> wrote:
>> I think it's important to describe exactly what's going on for people who just read the list but who don't have context. This blog does a really good job: http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html , but briefly:
>> 
>> - When a TTL expires, we treat it as a tombstone, because it may have been written ON TOP of another piece of live data, so we need to get that deletion marker to all hosts, just like a manual explicit delete
>> - Tombstones in sstable A may shadow data in sstable B, so doing anything on just one sstable MAY NOT remove the tombstone - we can't get rid of the tombstone if sstable A overlaps another sstable with the same partition (which we identify via bloom filter) that has any data with a lower timestamp (we don't check the sstable for a shadowed value, we just look at the minimum live timestamp of the table)
>> 
>> "nodetool garbagecollect" looks for sstables that overlap (partition keys) and combine them together, which makes tombstones past GCGS purgable and should remove them (and data shadowed by them).
>> 
>> If you're on a version without nodetool garbagecollection, you can approximate it using user defined compaction ( http://thelastpickle.com/blog/2016/10/18/user-defined-compaction.html ) - it's a JMX endpoint that let's you tell cassandra to compact one or more sstables together based on parameters you choose. This is somewhat like upgradesstables or scrub, but you can combine sstables as well. If you choose candidates intelligently (notably, oldest sstables first, or sstables you know overlap), you can likely manually clean things up pretty quickly. At one point, I had a jar that would do single sstable at a time, oldest sstable first, and it pretty much worked for this purpose most of the time. 
>> 
>> If you have room, a "nodetool compact" on stcs will also work, but it'll give you one huge sstable, which will be unfortunate long term (probably less of a problem if you're no longer writing to this table).
> 
> 
> That's a really nice refresher, thanks Jeff!
> 
> From the nature of the data at hand and because of the SizeTiered compaction, I would expect that more or less all tables do overlap with each other.
> 
> Even if we would be able to identify the overlapping ones (how?), I expect that we would have to do an equivalent of the major compaction, but (maybe) in multiple stages. Not sure that's really worth the trouble for us.
> 
> Thanks,
> --
> Alex
> 
>>> On Mon, Sep 10, 2018 at 10:29 AM Charulata Sharma (charshar) <ch...@cisco.com.invalid> wrote:
>>> Scrub takes a very long time and does not remove the tombstones. You should do garbage cleaning. It immediately removes the tombstones.
>>> 
>>>  
>>> 
>>> Thaks,
>>> 
>>> Charu
>>> 
>>>  
>>> 
>>> From: Oleksandr Shulgin <ol...@zalando.de>
>>> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
>>> Date: Monday, September 10, 2018 at 6:53 AM
>>> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
>>> Subject: Drop TTLd rows: upgradesstables -a or scrub?
>>> 
>>>  
>>> 
>>> Hello,
>>> 
>>>  
>>> 
>>> We have some tables with significant amount of TTLd rows that have expired by now (and more gc_grace_seconds have passed since the TTL).  We have stopped writing more data to these tables quite a while ago, so background compaction isn't running.  The compaction strategy is the default SizeTiered one.
>>> 
>>>  
>>> 
>>> Now we would like to get rid of all the droppable tombstones in these tables.  What would be the approach that puts the least stress on the cluster?
>>> 
>>>  
>>> 
>>> We've considered a few, but the most promising ones seem to be these two: `nodetool scrub` or `nodetool upgradesstables -a`.  We are using Cassandra version 3.0.
>>> 
>>>  
>>> 
>>> Now, this docs page recommends to use upgradesstables wherever possible: https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsScrub.html
>>> 
>>> What is the reason behind it?
>>> 
>>>  
>>> 
>>> From source code I can see that Scrubber the class which is going to drop the tombstones (and report the total number in the logs): https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/compaction/Scrubber.java#L308
>>> 
>>>  
>>> 
>>> I couldn't find similar handling in the upgradesstables code path.  Is the assumption correct that this one will not drop the tombstone as a side effect of rewriting the files?
>>> 
>>>  
>>> 
>>> Any drawbacks of using scrub for this task?
>>> 
>>>  
>>> 
>>> Thanks,
>>> --
>>> 
>>> Oleksandr "Alex" Shulgin | Senior Software Engineer | Team Flux | Data Services | Zalando SE | Tel: +49 176 127-59-707
>>> 
>>>  

Re: Drop TTLd rows: upgradesstables -a or scrub?

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Mon, 10 Sep 2018, 19:40 Jeff Jirsa, <jj...@gmail.com> wrote:

> I think it's important to describe exactly what's going on for people who
> just read the list but who don't have context. This blog does a really good
> job:
> http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
> , but briefly:
>
> - When a TTL expires, we treat it as a tombstone, because it may have been
> written ON TOP of another piece of live data, so we need to get that
> deletion marker to all hosts, just like a manual explicit delete
> - Tombstones in sstable A may shadow data in sstable B, so doing anything
> on just one sstable MAY NOT remove the tombstone - we can't get rid of the
> tombstone if sstable A overlaps another sstable with the same partition
> (which we identify via bloom filter) that has any data with a lower
> timestamp (we don't check the sstable for a shadowed value, we just look at
> the minimum live timestamp of the table)
>
> "nodetool garbagecollect" looks for sstables that overlap (partition keys)
> and combine them together, which makes tombstones past GCGS purgable and
> should remove them (and data shadowed by them).
>
> If you're on a version without nodetool garbagecollection, you can
> approximate it using user defined compaction (
> http://thelastpickle.com/blog/2016/10/18/user-defined-compaction.html ) -
> it's a JMX endpoint that let's you tell cassandra to compact one or more
> sstables together based on parameters you choose. This is somewhat like
> upgradesstables or scrub, but you can combine sstables as well. If you
> choose candidates intelligently (notably, oldest sstables first, or
> sstables you know overlap), you can likely manually clean things up pretty
> quickly. At one point, I had a jar that would do single sstable at a time,
> oldest sstable first, and it pretty much worked for this purpose most of
> the time.
>
> If you have room, a "nodetool compact" on stcs will also work, but it'll
> give you one huge sstable, which will be unfortunate long term (probably
> less of a problem if you're no longer writing to this table).
>

That's a really nice refresher, thanks Jeff!

From the nature of the data at hand and because of the SizeTiered
compaction, I would expect that more or less all tables do overlap with
each other.

Even if we were able to identify the overlapping ones (how?), I expect
that we would have to do an equivalent of the major compaction, but (maybe)
in multiple stages. Not sure that's really worth the trouble for us.

Thanks,
--
Alex

On Mon, Sep 10, 2018 at 10:29 AM Charulata Sharma (charshar)
> <ch...@cisco.com.invalid> wrote:
>
>> Scrub takes a very long time and does not remove the tombstones. You
>> should do garbage cleaning. It immediately removes the tombstones.
>>
>>
>>
>> Thaks,
>>
>> Charu
>>
>>
>>
>> *From: *Oleksandr Shulgin <ol...@zalando.de>
>> *Reply-To: *"user@cassandra.apache.org" <us...@cassandra.apache.org>
>> *Date: *Monday, September 10, 2018 at 6:53 AM
>> *To: *"user@cassandra.apache.org" <us...@cassandra.apache.org>
>> *Subject: *Drop TTLd rows: upgradesstables -a or scrub?
>>
>>
>>
>> Hello,
>>
>>
>>
>> We have some tables with significant amount of TTLd rows that have
>> expired by now (and more gc_grace_seconds have passed since the TTL).  We
>> have stopped writing more data to these tables quite a while ago, so
>> background compaction isn't running.  The compaction strategy is the
>> default SizeTiered one.
>>
>>
>>
>> Now we would like to get rid of all the droppable tombstones in these
>> tables.  What would be the approach that puts the least stress on the
>> cluster?
>>
>>
>>
>> We've considered a few, but the most promising ones seem to be these two:
>> `nodetool scrub` or `nodetool upgradesstables -a`.  We are using Cassandra
>> version 3.0.
>>
>>
>>
>> Now, this docs page recommends to use upgradesstables wherever possible:
>> https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsScrub.html
>>
>> What is the reason behind it?
>>
>>
>>
>> From source code I can see that Scrubber the class which is going to drop
>> the tombstones (and report the total number in the logs):
>> https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/compaction/Scrubber.java#L308
>>
>>
>>
>> I couldn't find similar handling in the upgradesstables code path.  Is
>> the assumption correct that this one will not drop the tombstone as a side
>> effect of rewriting the files?
>>
>>
>>
>> Any drawbacks of using scrub for this task?
>>
>>
>>
>> Thanks,
>> --
>>
>> Oleksandr "Alex" Shulgin | Senior Software Engineer | Team Flux | Data
>> Services | Zalando SE | Tel: +49 176 127-59-707
>>
>>
>>
>

Re: Drop TTLd rows: upgradesstables -a or scrub?

Posted by Jeff Jirsa <jj...@gmail.com>.
I think it's important to describe exactly what's going on for people who
just read the list but who don't have context. This blog does a really good
job:
http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
, but briefly:

- When a TTL expires, we treat it as a tombstone, because it may have been
written ON TOP of another piece of live data, so we need to get that
deletion marker to all hosts, just like a manual explicit delete
- Tombstones in sstable A may shadow data in sstable B, so doing anything
on just one sstable MAY NOT remove the tombstone - we can't get rid of the
tombstone if sstable A overlaps another sstable with the same partition
(which we identify via bloom filter) that has any data with a lower
timestamp (we don't check the sstable for a shadowed value, we just look at
the minimum live timestamp of the table)

"nodetool garbagecollect" looks for sstables that overlap (partition keys)
and combine them together, which makes tombstones past GCGS purgable and
should remove them (and data shadowed by them).
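
On versions that ship it, that is simply something like (keyspace and table
names are placeholders):

    nodetool garbagecollect my_keyspace my_table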

If you're on a version without nodetool garbagecollection, you can
approximate it using user defined compaction (
http://thelastpickle.com/blog/2016/10/18/user-defined-compaction.html ) -
it's a JMX endpoint that lets you tell cassandra to compact one or more
sstables together based on parameters you choose. This is somewhat like
upgradesstables or scrub, but you can combine sstables as well. If you
choose candidates intelligently (notably, oldest sstables first, or
sstables you know overlap), you can likely manually clean things up pretty
quickly. At one point, I had a jar that would do single sstable at a time,
oldest sstable first, and it pretty much worked for this purpose most of
the time.
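
As a sketch of what that looks like with the jmxterm jar (jar file name
shortened here, default JMX port 7199 assumed, and the data file path is a
placeholder - here given as a full path to the -Data.db file):

    echo "run -b org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction /var/lib/cassandra/data/my_keyspace/my_table-1234/mc-1-big-Data.db" |
      java -jar jmxterm.jar -l localhost:7199 -n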

If you have room, a "nodetool compact" on stcs will also work, but it'll
give you one huge sstable, which will be unfortunate long term (probably
less of a problem if you're no longer writing to this table).


On Mon, Sep 10, 2018 at 10:29 AM Charulata Sharma (charshar)
<ch...@cisco.com.invalid> wrote:

> Scrub takes a very long time and does not remove the tombstones. You
> should do garbage cleaning. It immediately removes the tombstones.
>
>
>
> Thaks,
>
> Charu
>
>
>
> *From: *Oleksandr Shulgin <ol...@zalando.de>
> *Reply-To: *"user@cassandra.apache.org" <us...@cassandra.apache.org>
> *Date: *Monday, September 10, 2018 at 6:53 AM
> *To: *"user@cassandra.apache.org" <us...@cassandra.apache.org>
> *Subject: *Drop TTLd rows: upgradesstables -a or scrub?
>
>
>
> Hello,
>
>
>
> We have some tables with significant amount of TTLd rows that have expired
> by now (and more gc_grace_seconds have passed since the TTL).  We have
> stopped writing more data to these tables quite a while ago, so background
> compaction isn't running.  The compaction strategy is the default
> SizeTiered one.
>
>
>
> Now we would like to get rid of all the droppable tombstones in these
> tables.  What would be the approach that puts the least stress on the
> cluster?
>
>
>
> We've considered a few, but the most promising ones seem to be these two:
> `nodetool scrub` or `nodetool upgradesstables -a`.  We are using Cassandra
> version 3.0.
>
>
>
> Now, this docs page recommends to use upgradesstables wherever possible:
> https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsScrub.html
>
> What is the reason behind it?
>
>
>
> From source code I can see that Scrubber the class which is going to drop
> the tombstones (and report the total number in the logs):
> https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/compaction/Scrubber.java#L308
>
>
>
> I couldn't find similar handling in the upgradesstables code path.  Is the
> assumption correct that this one will not drop the tombstone as a side
> effect of rewriting the files?
>
>
>
> Any drawbacks of using scrub for this task?
>
>
>
> Thanks,
> --
>
> Oleksandr "Alex" Shulgin | Senior Software Engineer | Team Flux | Data
> Services | Zalando SE | Tel: +49 176 127-59-707
>
>
>

Re: Drop TTLd rows: upgradesstables -a or scrub?

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Mon, 10 Sep 2018, 19:29 Charulata Sharma (charshar),
<ch...@cisco.com.invalid> wrote:

> Scrub takes a very long time and does not remove the tombstones.
>
Charu,

Why is that if the documentation clearly says it does?

> should do garbage cleaning. It immediately removes the tombstones.
>
If you mean 'nodetool garbagecollect' - that command is not available in
the version we are using. It only became available in 3.10.

--
Alex

>
> *From: *Oleksandr Shulgin <ol...@zalando.de>
> *Reply-To: *"user@cassandra.apache.org" <us...@cassandra.apache.org>
> *Date: *Monday, September 10, 2018 at 6:53 AM
> *To: *"user@cassandra.apache.org" <us...@cassandra.apache.org>
> *Subject: *Drop TTLd rows: upgradesstables -a or scrub?
>
>
>
> Hello,
>
>
>
> We have some tables with significant amount of TTLd rows that have expired
> by now (and more gc_grace_seconds have passed since the TTL).  We have
> stopped writing more data to these tables quite a while ago, so background
> compaction isn't running.  The compaction strategy is the default
> SizeTiered one.
>
>
>
> Now we would like to get rid of all the droppable tombstones in these
> tables.  What would be the approach that puts the least stress on the
> cluster?
>
>
>
> We've considered a few, but the most promising ones seem to be these two:
> `nodetool scrub` or `nodetool upgradesstables -a`.  We are using Cassandra
> version 3.0.
>
>
>
> Now, this docs page recommends to use upgradesstables wherever possible:
> https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsScrub.html
>
> What is the reason behind it?
>
>
>
> From source code I can see that Scrubber the class which is going to drop
> the tombstones (and report the total number in the logs):
> https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/compaction/Scrubber.java#L308
>
>
>
> I couldn't find similar handling in the upgradesstables code path.  Is the
> assumption correct that this one will not drop the tombstone as a side
> effect of rewriting the files?
>
>
>
> Any drawbacks of using scrub for this task?
>
>
>
> Thanks,
> --
>
> Oleksandr "Alex" Shulgin | Senior Software Engineer | Team Flux | Data
> Services | Zalando SE | Tel: +49 176 127-59-707
>
>
>

Re: Drop TTLd rows: upgradesstables -a or scrub?

Posted by "Charulata Sharma (charshar)" <ch...@cisco.com.INVALID>.
Scrub takes a very long time and does not remove the tombstones. You should do garbage cleaning. It immediately removes the tombstones.

Thanks,
Charu

From: Oleksandr Shulgin <ol...@zalando.de>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Monday, September 10, 2018 at 6:53 AM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Drop TTLd rows: upgradesstables -a or scrub?

Hello,

We have some tables with significant amount of TTLd rows that have expired by now (and more gc_grace_seconds have passed since the TTL).  We have stopped writing more data to these tables quite a while ago, so background compaction isn't running.  The compaction strategy is the default SizeTiered one.

Now we would like to get rid of all the droppable tombstones in these tables.  What would be the approach that puts the least stress on the cluster?

We've considered a few, but the most promising ones seem to be these two: `nodetool scrub` or `nodetool upgradesstables -a`.  We are using Cassandra version 3.0.

Now, this docs page recommends to use upgradesstables wherever possible: https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsScrub.html
What is the reason behind it?

From source code I can see that Scrubber the class which is going to drop the tombstones (and report the total number in the logs): https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/compaction/Scrubber.java#L308

I couldn't find similar handling in the upgradesstables code path.  Is the assumption correct that this one will not drop the tombstone as a side effect of rewriting the files?

Any drawbacks of using scrub for this task?

Thanks,
--
Oleksandr "Alex" Shulgin | Senior Software Engineer | Team Flux | Data Services | Zalando SE | Tel: +49 176 127-59-707