Posted to dev@cassandra.apache.org by Marcus Olsson <ma...@ericsson.com> on 2016/11/23 15:52:56 UTC
STCS in L0 behaviour
Hi everyone,
TL;DR
Should LCS be changed to always prefer an STCS compaction in L0 if it's falling behind, assuming that STCS in L0 is enabled?
Currently LCS seems to check whether an L0->L1 compaction is possible before checking if it's falling behind, which in our case consumed 15-30% of the compaction thread CPU.
TL;DR
So first some background:
We have an Apache Cassandra 2.2 cluster running under high load. In that
cluster there is a table with a moderate number of writes per second
that uses LeveledCompactionStrategy. The test was to run repair on
that table while we monitored the cluster through JMC with Flight
Recordings enabled. This resulted in a large number of sstables for that
table, which I assume others have experienced as well. In this case I
think it was between 15-20k.
From the Flight Recording, one thing we saw was that 15-30% of the CPU
time in each of the compaction threads was spent in
"getNextBackgroundTask()", which retrieves the next compaction job. On
further investigation, most of this seems to happen while checking
for overlap among L0 sstables before performing an L0->L1 compaction.
There is a JIRA that seems related,
https://issues.apache.org/jira/browse/CASSANDRA-11571, which we
backported to 2.2 and tested. In our testing it improved the
situation, but noticeable CPU was still being used.
My interpretation of the current logic of LCS is (if STCS in L0 is enabled):
1. Check each level (L1+).
- If an L1+ compaction is needed, check if L0 is behind and do STCS in
that case; otherwise do the L1+ compaction.
2. Check for L0->L1 compactions, and if none is needed/possible, check
for STCS in L0.
My proposal is to change this behavior to always check first whether L0
is far behind, and do an STCS compaction in that case. This would avoid
the overlap check for L0->L1 compactions when L0 is behind, and I think
it makes sense since we already prefer STCS over L1+ compactions. It
would not solve the repair situation, but it would lower some of the
impact that repair has on LCS.
As for which version this could go into, I think trunk would be enough
since compaction is pluggable.
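Sketched in code, the proposed ordering would look something like the following. This is an illustrative stand-in only: chooseNextCompaction, its parameters, and the threshold constant are made-up names, not Cassandra's actual getNextBackgroundTask/LeveledManifest API.

```java
public class CompactionOrderSketch {
    // Illustrative "L0 is far behind" threshold; the real criterion lives
    // in LeveledManifest and is not this simple constant.
    static final int L0_STCS_THRESHOLD = 32;

    /**
     * Proposed order: check whether L0 is far behind *first*, so the
     * (potentially expensive) L0 -> L1 overlap scan is skipped entirely
     * whenever an STCS compaction in L0 would be chosen anyway.
     */
    static String chooseNextCompaction(int l0SstableCount,
                                       boolean l1PlusCompactionNeeded,
                                       boolean l0ToL1Possible) {
        if (l0SstableCount > L0_STCS_THRESHOLD)
            return "STCS_IN_L0";           // no overlap check performed
        if (l1PlusCompactionNeeded)
            return "L1_PLUS";
        if (l0ToL1Possible)
            return "L0_TO_L1";             // overlap check happens here
        return "NONE";
    }

    public static void main(String[] args) {
        // With 15-20k sstables in L0, STCS is picked without scanning overlaps.
        System.out.println(chooseNextCompaction(15000, true, true));
        // With a small L0, behavior is unchanged from today.
        System.out.println(chooseNextCompaction(10, true, true));
    }
}
```

The point of the reordering is only that the first branch short-circuits before any per-sstable overlap work is done.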
--
*MARCUS OLSSON *
Software Developer
*Ericsson*
Sweden
marcus.olsson@ericsson.com <ma...@ericsson.com>
www.ericsson.com <http://www.ericsson.com>
Re: STCS in L0 behaviour
Posted by Marcus Olsson <ma...@ericsson.com>.
Hi,
In reply to Dikang Gu:
For the run where we incorporated the change from CASSANDRA-11571, the
stack trace was as follows (from JMC):
Stack Trace                                                                                      Samples  %
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(int)          229      11.983
-org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates()                    228      11.931
--org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(int)                       221      11.565
---org.apache.cassandra.db.compaction.LeveledManifest.overlappingWithBounds(SSTableReader, Map)  201      10.518
----org.apache.cassandra.db.compaction.LeveledManifest.overlappingWithBounds(Token, Token, Map)  201      10.518
-----org.apache.cassandra.dht.Bounds.intersects(Bounds)                                          141      7.378
-----java.util.HashSet.add(Object)                                                               56       2.93
This is for one of the compaction executors during an interval of 1
minute and 24 seconds, but we saw similar behavior in the other
compaction threads as well. The full flight recording was 10 minutes
long and was started at the same time as the repair. The interval was
taken from the end of the recording, where the number of sstables had
increased. During this interval this compaction thread used ~10% of the
total CPU.
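For context on why that scan is hot, here is a simplified model of the overlappingWithBounds/Bounds.intersects work in the trace above. The Bounds record and the overlapping helper are illustrative stand-ins, not the actual Cassandra classes: the idea is just that each L0 candidate's token range is compared against every other L0 sstable's range, so with 15-20k sstables a single scheduling pass can issue an enormous number of intersects calls.

```java
import java.util.*;

public class OverlapScanSketch {
    // Simplified stand-in for org.apache.cassandra.dht.Bounds over long tokens.
    record Bounds(long left, long right) {
        // Inclusive-interval overlap test, analogous to Bounds.intersects.
        boolean intersects(Bounds other) {
            return left <= other.right && other.left <= right;
        }
    }

    // Analogue of overlappingWithBounds: collect every sstable whose token
    // range intersects the candidate's. Called once per L0 candidate, this
    // makes candidate selection roughly quadratic in the size of L0.
    static Set<Bounds> overlapping(Bounds candidate, List<Bounds> level) {
        Set<Bounds> out = new HashSet<>();
        for (Bounds sstable : level)
            if (candidate.intersects(sstable))
                out.add(sstable);
        return out;
    }
}
```

This matches the sample counts above, where nearly all the getNextBackgroundTask time bottoms out in Bounds.intersects and the HashSet bookkeeping around it.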
I agree that optimally there shouldn't be many sstables in L0, and
except for when repair is running we don't have that many.
---
In reply to Jeff Jirsa/Nate McCall:
I might have been unclear about the compaction order in my first email;
I meant to say that there is a check for STCS right before L1+, but only
if an L1+ compaction is possible. We used version 2.2.7 for the test run,
so https://issues.apache.org/jira/browse/CASSANDRA-10979 should be
included and should have reduced some of the L0 backlog.
Correct me if I'm wrong, but my interpretation of the scenario that
Sylvain describes in
https://issues.apache.org/jira/browse/CASSANDRA-5371 is that you either
almost constantly have 32+ sstables in L0 or are close to it. My guess
is that this could also apply to a constant load over a certain
timespan. So when you get more than 32 sstables you start doing STCS,
which in turn creates larger sstables that might span the whole of L1.
Then, when these sstables are promoted to L1, the whole of L1 is
rewritten, which creates a larger backlog in L0. The number of sstables
then keeps rising and triggers STCS again, completing the circle.
Based on this interpretation, it seems to me that if the write pattern
into L0 is "random", this might happen regardless of whether an STCS
compaction has occurred.
If my interpretation is correct, it might be better to require a higher
number of sstables before STCS starts in L0, and to make that number
configurable. With reduced complexity it could be something like this:
1. Perform STCS in L0 if we have more than X (1000?) sstables in L0.
2. Check L1+.
3. Check for L0->L1.
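The three steps above could be sketched as follows. The high-watermark name and its default are assumptions made up for illustration; the real "X" would be a new configurable compaction subproperty:

```java
public class L0BacklogSketch {
    // Hypothetical configurable high watermark ("X" in step 1 above).
    static final int L0_STCS_HIGH_WATERMARK = 1000;

    static String nextTask(int l0Count, boolean l1PlusNeeded, boolean l0ToL1Possible) {
        // 1. With a very large L0 backlog, go straight to STCS in L0 and
        //    skip the expensive overlap check entirely.
        if (l0Count > L0_STCS_HIGH_WATERMARK)
            return "STCS_IN_L0";
        // 2. Otherwise check the L1+ levels as today.
        if (l1PlusNeeded)
            return "L1_PLUS";
        // 3. Finally, consider promoting L0 -> L1.
        if (l0ToL1Possible)
            return "L0_TO_L1";
        return "NONE";
    }
}
```

The difference from the earlier proposal is the much higher threshold: the normal STCS-in-L0 logic stays where it is, and the new check only short-circuits pathological backlogs such as the 15-20k sstables seen during repair.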
It should be possible to keep the current logic as well and only add a
configurable check before it (step 1) to avoid the overlap check with
larger backlogs. Another alternative might be
https://issues.apache.org/jira/browse/CASSANDRA-7409, which would allow
overlapping sstables in more levels than just L0. If it can quickly push
sorted data to L1, it might remove the need for STCS in LCS. The
previously mentioned cost of the overlap check would still be there with
a large backlog, but the approach might reduce the risk of getting into
that situation. I'll try to get some time to run a test with
CASSANDRA-7409 in our test cluster.
BR
Marcus O
Re: STCS in L0 behaviour
Posted by Eric Evans <jo...@gmail.com>.
On Sat, Nov 26, 2016 at 6:30 PM, Dikang Gu <di...@gmail.com> wrote:
> Hi Marcus,
>
> Do you have some stack trace to show that which function in the `
> getNextBackgroundTask` is most expensive?
>
> Yeah, I think having 15-20K sstables in L0 is very bad, in our heavy-write
> cluster, I try my best to reduce the impact of repair, and keep number of
> sstables in L0 < 100.
>
> Thanks
> Dikang.
--
Eric Evans
john.eric.evans@gmail.com
Re: STCS in L0 behaviour
Posted by Dikang Gu <di...@gmail.com>.
Hi Marcus,
Do you have a stack trace showing which function in
`getNextBackgroundTask` is most expensive?
Yeah, I think having 15-20k sstables in L0 is very bad. In our
heavy-write cluster I try my best to reduce the impact of repair and
keep the number of sstables in L0 under 100.
Thanks
Dikang.
Re: STCS in L0 behaviour
Posted by Nate McCall <zz...@gmail.com>.
> The reason is described here:
https://issues.apache.org/jira/browse/CASSANDRA-5371?focusedCommentId=13621679&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13621679
>
> /Marcus
"...a lot of the work you've done you will redo when you compact your now
bigger L0 sstable against L1."
^ Sylvain's hypothesis (next comment down) is actually something we see
occasionally in practice: having to re-write the contents of L1 too often
when large L0 SSTables are pulled in. Here is an example we took on a
system with pending compaction spikes that was seeing this specific issue
with four LCS-based tables:
https://gist.github.com/zznate/d22812551fa7a527d4c0d931f107c950
The significant part of this particular workload is a burst of heavy writes
from long-duration scheduled jobs.
Re: STCS in L0 behaviour
Posted by Marcus Eriksson <kr...@gmail.com>.
The reason is described here:
https://issues.apache.org/jira/browse/CASSANDRA-5371?focusedCommentId=13621679&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13621679
/Marcus
Re: STCS in L0 behaviour
Posted by Jeff Jirsa <je...@crowdstrike.com>.
What you’re describing seems very close to what’s discussed in https://issues.apache.org/jira/browse/CASSANDRA-10979 - worth reading that ticket a bit.
There does seem to be a check for STCS in L0 before it tries higher levels:
https://github.com/apache/cassandra/blob/cassandra-2.2/src/java/org/apache/cassandra/db/compaction/LeveledManifest.java#L324-L326
Why it’s doing that within the for loop (https://github.com/apache/cassandra/blob/cassandra-2.2/src/java/org/apache/cassandra/db/compaction/LeveledManifest.java#L310 ) is unexpected to me, though – Carl / Marcus, any insight into why it’s within the loop instead of before it?
Re: STCS in L0 behaviour
Posted by Jeff Jirsa <jj...@gmail.com>.
Without yet reading the code, what you describe sounds like a reasonable optimization/fix, suitable for 3.0+ (probably not 2.2, definitely not 2.1).
--
Jeff Jirsa