Posted to dev@cassandra.apache.org by Marcus Olsson <ma...@ericsson.com> on 2016/11/23 15:52:56 UTC

STCS in L0 behaviour

Hi everyone,

TL;DR
Should LCS be changed to always prefer an STCS compaction in L0 if it's 
falling behind (assuming that STCS in L0 is enabled)?
Currently LCS seems to check if there is a possible L0->L1 compaction 
before checking if it's falling behind, which in our case used between 
15-30% of the compaction thread CPU.
TL;DR

So first some background:
We have an Apache Cassandra 2.2 cluster running with a high load. In that 
cluster there is a table with a moderate amount of writes per second 
that is using LeveledCompactionStrategy. The test was to run repair on 
that table while we monitored the cluster through JMC with Flight 
Recordings enabled. This resulted in a large number of sstables for that 
table, which I assume others have experienced as well. In this case I 
think it was between 15-20k.

From the Flight Recording one thing we saw was that 15-30% of the CPU 
time in each of the compaction threads was spent in 
"getNextBackgroundTask()", which retrieves the next compaction job. With 
some further investigation this seems to mostly be when it's checking 
for overlap in L0 sstables before performing an L0->L1 compaction. There 
is a JIRA that seems related to this, 
https://issues.apache.org/jira/browse/CASSANDRA-11571, which we 
backported to 2.2 and tested. In our testing it seemed to improve the 
situation, but it was still using noticeable CPU.
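
As a rough illustration of why the overlap check is costly with a large L0 
backlog (this is not the actual LeveledManifest code; TokenRange below is 
just a stand-in for an sstable's token bounds, and the quadratic behaviour 
follows from a check like this running once per L0 candidate):

    import java.util.*;

    // Rough model only -- not the actual LeveledManifest code. TokenRange is
    // a stand-in for an sstable's token bounds (Bounds<Token> in Cassandra).
    final class OverlapCostSketch
    {
        static final class TokenRange
        {
            final long left, right;
            TokenRange(long left, long right) { this.left = left; this.right = right; }

            boolean intersects(TokenRange other)
            {
                return left <= other.right && other.left <= right;
            }
        }

        // Called once per L0 candidate, so selecting candidates over an L0
        // backlog of n sstables does on the order of n^2 intersection checks.
        static Set<TokenRange> overlapping(TokenRange candidate, Collection<TokenRange> l0)
        {
            Set<TokenRange> result = new HashSet<>();
            for (TokenRange bounds : l0)
                if (candidate.intersects(bounds))
                    result.add(bounds);
            return result;
        }
    }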

My interpretation of the current logic of LCS (if STCS in L0 is enabled) 
is the following (sketched in code after the list):
1. Check each level (L1+)
  - If an L1+ compaction is needed, check if L0 is behind and do STCS if 
that's the case; otherwise do the L1+ compaction.
2. Check for L0 -> L1 compactions and, if none is needed/possible, check 
for STCS in L0.
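
A minimal sketch of that ordering, assuming STCS in L0 is enabled; the class, 
method and boolean inputs below are illustrative, not the actual 
LeveledManifest API:

    // Illustrative model of the current selection order -- not the real
    // LeveledManifest API; the boolean inputs stand in for the real checks.
    final class CurrentLcsOrder
    {
        enum Choice { STCS_IN_L0, L1_PLUS, L0_TO_L1, NOTHING }

        static Choice next(boolean l1PlusNeedsCompaction,
                           boolean l0FarBehind,
                           boolean l0ToL1Possible)
        {
            // 1. The "is L0 far behind" check only runs once some L1+ level
            //    already needs compaction.
            if (l1PlusNeedsCompaction)
                return l0FarBehind ? Choice.STCS_IN_L0 : Choice.L1_PLUS;
            // 2. Otherwise the costly L0 overlap scan runs to look for an
            //    L0 -> L1 candidate; only if that yields nothing is STCS in
            //    L0 considered.
            if (l0ToL1Possible)
                return Choice.L0_TO_L1;
            return l0FarBehind ? Choice.STCS_IN_L0 : Choice.NOTHING;
        }
    }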

My proposal is to change this behavior to always check first whether L0 
is far behind and do an STCS compaction in that case. This would avoid 
the overlap check for L0 -> L1 compactions when L0 is behind, and I think 
it makes sense since we already prefer STCS to L1+ compactions. This would 
not solve the repair situation, but it would lower some of the impact 
that repair has on LCS.
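
Under the same illustrative model, the proposed ordering would look like 
this (again just a sketch, not actual code):

    // The proposed ordering under the same illustrative model: the L0 check
    // comes first, so the overlap scan never runs while L0 is backed up.
    final class ProposedLcsOrder
    {
        static CurrentLcsOrder.Choice next(boolean l1PlusNeedsCompaction,
                                           boolean l0FarBehind,
                                           boolean l0ToL1Possible)
        {
            if (l0FarBehind)
                return CurrentLcsOrder.Choice.STCS_IN_L0;
            if (l1PlusNeedsCompaction)
                return CurrentLcsOrder.Choice.L1_PLUS;
            if (l0ToL1Possible)   // the overlap scan only happens here
                return CurrentLcsOrder.Choice.L0_TO_L1;
            return CurrentLcsOrder.Choice.NOTHING;
        }
    }

With this ordering, getNextBackgroundTask would pick an STCS bucket 
immediately while L0 is flooded, and the overlap scan would only run once 
L0 has drained below the threshold.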

As for which version this could go into, I think trunk would be enough 
since compaction is pluggable.


-- 

Ericsson <http://www.ericsson.com/>

*MARCUS OLSSON *
Software Developer

*Ericsson*
Sweden
marcus.olsson@ericsson.com <ma...@ericsson.com>
www.ericsson.com <http://www.ericsson.com>


Re: STCS in L0 behaviour

Posted by Marcus Olsson <ma...@ericsson.com>.
Hi,

In reply to Dikang Gu:
For the run where we incorporated the change from CASSANDRA-11571 the 
stack trace was like this (from JMC):
Stack Trace (sample count / percentage):
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(int)  229 / 11.983%
- org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates()  228 / 11.931%
-- org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(int)  221 / 11.565%
--- org.apache.cassandra.db.compaction.LeveledManifest.overlappingWithBounds(SSTableReader, Map)  201 / 10.518%
---- org.apache.cassandra.db.compaction.LeveledManifest.overlappingWithBounds(Token, Token, Map)  201 / 10.518%
----- org.apache.cassandra.dht.Bounds.intersects(Bounds)  141 / 7.378%
----- java.util.HashSet.add(Object)  56 / 2.93%


This is for one of the compaction executors during an interval of 1 
minute and 24 seconds, but we saw similar behavior for other compaction 
threads as well. The full flight recording was 10 minutes and was 
started at the same time as the repair. The interval was taken from the 
end of the recording where the number of sstables had increased. During 
this interval this compaction thread used ~10% of the total CPU.

I agree that optimally there shouldn't be many sstables in L0, and except 
for when repair is running we don't have that many.

---

In reply to Jeff Jirsa/Nate McCall:
I might have been unclear about the compaction order in my first email; 
I meant to say that there is a check for STCS right before L1+, but only 
if an L1+ compaction is possible. We used version 2.2.7 for the test run, 
so https://issues.apache.org/jira/browse/CASSANDRA-10979 should be 
included and should have reduced some of the backlog in L0.

Correct me if I'm wrong, but my interpretation of the scenario that 
Sylvain describes in 
https://issues.apache.org/jira/browse/CASSANDRA-5371 is when you either 
almost constantly have 32+ sstables in L0 or are close to it. My guess 
is that this could apply to having a constant load during a certain 
timespan as well. So when you get more than 32 sstables you start to do 
STCS, which in turn creates larger sstables that might span the whole of 
L1. Then, when these sstables are promoted to L1, the whole of L1 is 
re-written, which creates a larger backlog in L0. The number of sstables 
then keeps rising and triggers an STCS again, completing the circle. 
Based on this interpretation it seems to me that if the write pattern 
into L0 is "random" this might happen regardless of whether an STCS 
compaction has occurred or not.

If my interpretation is correct it might be better to choose a higher 
number of sstables before STCS starts in L0, and to make it configurable. 
With reduced complexity the logic could be something like this (sketched 
in code after the list):
1. Perform STCS in L0 if we have more than X (1000?) sstables in L0.
2. Check L1+.
3. Check for L0->L1.
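
A sketch of that reduced-complexity variant; the threshold name and example 
value are hypothetical, not an existing LCS option:

    // Hypothetical threshold-first ordering; 'stcsL0Threshold' is not an
    // existing option, it is only illustrative.
    final class ThresholdFirstOrder
    {
        enum Choice { STCS_IN_L0, L1_PLUS, L0_TO_L1, NOTHING }

        static Choice next(int l0SstableCount,
                           int stcsL0Threshold,          // e.g. 1000
                           boolean l1PlusNeedsCompaction,
                           boolean l0ToL1Possible)
        {
            if (l0SstableCount > stcsL0Threshold)        // step 1
                return Choice.STCS_IN_L0;
            if (l1PlusNeedsCompaction)                   // step 2
                return Choice.L1_PLUS;
            if (l0ToL1Possible)                          // step 3
                return Choice.L0_TO_L1;
            return Choice.NOTHING;
        }
    }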

It should be possible to keep the current logic as well and only add a 
configurable check in front of it (step 1 above) to avoid the overlapping 
check with larger backlogs. Another alternative might be 
https://issues.apache.org/jira/browse/CASSANDRA-7409, which allows 
overlapping sstables in more levels than L0. If it can quickly push 
sorted data to L1 it might remove the need for STCS in LCS. The 
previously mentioned potential cost of the overlapping check would still 
be there if we have a large backlog, but the approach might reduce the 
risk of getting into that situation. I'll try to get some time to run a 
test with CASSANDRA-7409 in our test cluster.

BR
Marcus O

On 11/28/2016 06:48 PM, Eric Evans wrote:
> On Sat, Nov 26, 2016 at 6:30 PM, Dikang Gu<di...@gmail.com>  wrote:
>> Hi Marcus,
>>
>> Do you have some stack trace to show that which function in the `
>> getNextBackgroundTask` is most expensive?
>>
>> Yeah, I think having 15-20K sstables in L0 is very bad, in our heavy-write
>> cluster, I try my best to reduce the impact of repair, and keep number of
>> sstables in L0 < 100.
>>
>> Thanks
>> Dikang.
>>
>> On Thu, Nov 24, 2016 at 12:53 PM, Nate McCall<zz...@gmail.com>  wrote:
>>
>>>> The reason is described here:
>>> https://issues.apache.org/jira/browse/CASSANDRA-5371?
>>> focusedCommentId=13621679&page=com.atlassian.jira.
>>> plugin.system.issuetabpanels:comment-tabpanel#comment-13621679
>>>> /Marcus
>>> "...a lot of the work you've done you will redo when you compact your now
>>> bigger L0 sstable against L1."
>>>
>>> ^ Sylvain's hypothesis (next comment down) is actually something we see
>>> occasionally in practice: having to re-write the contents of L1 too often
>>> when large L0 SSTables are pulled in. Here is an example we took on a
>>> system with pending compaction spikes that was seeing this specific issue
>>> with four LCS-based tables:
>>>
>>> https://gist.github.com/zznate/d22812551fa7a527d4c0d931f107c950
>>>
>>> The significant part of this particular workload is a burst of heavy writes
>>> from long-duration scheduled jobs.
>>>
>>
>> --
>> Dikang
>


Re: STCS in L0 behaviour

Posted by Eric Evans <jo...@gmail.com>.
On Sat, Nov 26, 2016 at 6:30 PM, Dikang Gu <di...@gmail.com> wrote:
> Hi Marcus,
>
> Do you have some stack trace to show that which function in the `
> getNextBackgroundTask` is most expensive?
>
> Yeah, I think having 15-20K sstables in L0 is very bad, in our heavy-write
> cluster, I try my best to reduce the impact of repair, and keep number of
> sstables in L0 < 100.
>
> Thanks
> Dikang.
>
> On Thu, Nov 24, 2016 at 12:53 PM, Nate McCall <zz...@gmail.com> wrote:
>
>> > The reason is described here:
>> https://issues.apache.org/jira/browse/CASSANDRA-5371?
>> focusedCommentId=13621679&page=com.atlassian.jira.
>> plugin.system.issuetabpanels:comment-tabpanel#comment-13621679
>> >
>> > /Marcus
>>
>> "...a lot of the work you've done you will redo when you compact your now
>> bigger L0 sstable against L1."
>>
>> ^ Sylvain's hypothesis (next comment down) is actually something we see
>> occasionally in practice: having to re-write the contents of L1 too often
>> when large L0 SSTables are pulled in. Here is an example we took on a
>> system with pending compaction spikes that was seeing this specific issue
>> with four LCS-based tables:
>>
>> https://gist.github.com/zznate/d22812551fa7a527d4c0d931f107c950
>>
>> The significant part of this particular workload is a burst of heavy writes
>> from long-duration scheduled jobs.
>>
>
>
>
> --
> Dikang



-- 
Eric Evans
john.eric.evans@gmail.com

Re: STCS in L0 behaviour

Posted by Dikang Gu <di...@gmail.com>.
Hi Marcus,

Do you have some stack trace to show that which function in the `
getNextBackgroundTask` is most expensive?

Yeah, I think having 15-20K sstables in L0 is very bad, in our heavy-write
cluster, I try my best to reduce the impact of repair, and keep number of
sstables in L0 < 100.

Thanks
Dikang.

On Thu, Nov 24, 2016 at 12:53 PM, Nate McCall <zz...@gmail.com> wrote:

> > The reason is described here:
> https://issues.apache.org/jira/browse/CASSANDRA-5371?
> focusedCommentId=13621679&page=com.atlassian.jira.
> plugin.system.issuetabpanels:comment-tabpanel#comment-13621679
> >
> > /Marcus
>
> "...a lot of the work you've done you will redo when you compact your now
> bigger L0 sstable against L1."
>
> ^ Sylvain's hypothesis (next comment down) is actually something we see
> occasionally in practice: having to re-write the contents of L1 too often
> when large L0 SSTables are pulled in. Here is an example we took on a
> system with pending compaction spikes that was seeing this specific issue
> with four LCS-based tables:
>
> https://gist.github.com/zznate/d22812551fa7a527d4c0d931f107c950
>
> The significant part of this particular workload is a burst of heavy writes
> from long-duration scheduled jobs.
>



-- 
Dikang

Re: STCS in L0 behaviour

Posted by Nate McCall <zz...@gmail.com>.
> The reason is described here:
https://issues.apache.org/jira/browse/CASSANDRA-5371?focusedCommentId=13621679&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13621679
>
> /Marcus

"...a lot of the work you've done you will redo when you compact your now
bigger L0 sstable against L1."

^ Sylvain's hypothesis (next comment down) is actually something we see
occasionally in practice: having to re-write the contents of L1 too often
when large L0 SSTables are pulled in. Here is an example we took on a
system with pending compaction spikes that was seeing this specific issue
with four LCS-based tables:

https://gist.github.com/zznate/d22812551fa7a527d4c0d931f107c950

The significant part of this particular workload is a burst of heavy writes
from long-duration scheduled jobs.

Re: STCS in L0 behaviour

Posted by Marcus Eriksson <kr...@gmail.com>.
The reason is described here:
https://issues.apache.org/jira/browse/CASSANDRA-5371?focusedCommentId=13621679&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13621679

/Marcus

On Wed, Nov 23, 2016 at 6:31 PM, Jeff Jirsa <je...@crowdstrike.com>
wrote:

> What you’re describing seems very close to what’s discussed in
> https://issues.apache.org/jira/browse/CASSANDRA-10979 - worth reading
> that ticket a bit.
>
>
>
> There does seem to be a check for STCS in L0 before it tries higher
> levels:
>
> https://github.com/apache/cassandra/blob/cassandra-2.2/
> src/java/org/apache/cassandra/db/compaction/LeveledManifest.java#L324-L326
>
>
>
> Why it’s doing that within the for loop (https://github.com/apache/
> cassandra/blob/cassandra-2.2/src/java/org/apache/cassandra/
> db/compaction/LeveledManifest.java#L310 ) is unexpected to me, though –
> Carl / Marcus, any insight into why it’s within the loop instead of before
> it?
>
>
>
>
>
> *From: *Marcus Olsson <ma...@ericsson.com>
> *Organization: *Ericsson AB
> *Reply-To: *"dev@cassandra.apache.org" <de...@cassandra.apache.org>
> *Date: *Wednesday, November 23, 2016 at 7:52 AM
> *To: *"dev@cassandra.apache.org" <de...@cassandra.apache.org>
> *Subject: *STCS in L0 behaviour
>
>
>
> Hi everyone,
>
>
> TL;DR
> Should LCS be changed to always prefer an STCS compaction in L0 if it's
> falling behind? Assuming that STCS in L0 is enabled.
> Currently LCS seems to check if there is a possible L0->L1 compaction
> before checking if it's falling behind, which in our case used between
> 15-30% of the compaction thread CPU.
> TL;DR
>
> So first some background:
> We have a Apache Cassandra 2.2 cluster running with a high load. In that
> cluster there is a table with a moderate amount of writes per second that
> is using LeveledCompactionStrategy. The test was to run repair on that
> table while we monitored the cluster through JMC and with Flight Recordings
> enabled. This resulted in a large amount of sstables for that table, which
> I assume others have experienced as well. In this case I think it was
> between 15-20k.
>
> From the Flight Recording one thing we saw was that 15-30% of the CPU time
> in each of the compaction threads was spent on "getNextBackgroundTask()"
> which retrieves the next compaction job. With some further investigation
> this seems to mostly be when it's checking for overlap in L0 sstables
> before performing an L0->L1 compaction. There is a JIRA which seems to be
> related to this https://issues.apache.org/jira/browse/CASSANDRA-11571
> which we backported to 2.2 and tested. In our testing it seemed to improve
> the situation but it was still using noticeable CPU.
>
> My interpretation of the current logic of LCS is (if STCS in L0 is
> enabled):
> 1. Check each level (L1+)
>  - If a L1+ compaction is needed check if L0 is behind and do STCS if
> that's the case, otherwise do the L1+ compaction.
> 2. Check L0 -> L1 compactions and if none is needed/possible check for
> STCS in L0.
>
> My proposal is to change this behavior to always check if L0 is far behind
> first and do a STCS compaction in that case. This would avoid the overlap
> check for L0 -> L1 compactions when L0 is behind and I think it makes sense
> since we already prefer STCS to L1+ compactions. This would not solve the
> repair situation, but it would lower some of the impact that repair has on
> LCS.
>
> For what version this could get in I think trunk would be enough since
> compaction is pluggable.
>
> --
>
>
>
>
> *MARCUS OLSSON *
> Software Developer
>
> *Ericsson*
> Sweden
> marcus.olsson@ericsson.com
> www.ericsson.com
>

Re: STCS in L0 behaviour

Posted by Jeff Jirsa <je...@crowdstrike.com>.
What you’re describing seems very close to what’s discussed in  https://issues.apache.org/jira/browse/CASSANDRA-10979 - worth reading that ticket a bit. 

 

There does seem to be a check for STCS in L0 before it tries higher levels: 

https://github.com/apache/cassandra/blob/cassandra-2.2/src/java/org/apache/cassandra/db/compaction/LeveledManifest.java#L324-L326

 

Why it’s doing that within the for loop (https://github.com/apache/cassandra/blob/cassandra-2.2/src/java/org/apache/cassandra/db/compaction/LeveledManifest.java#L310 ) is unexpected to me, though – Carl / Marcus, any insight into why it’s within the loop instead of before it? 

 

 

From: Marcus Olsson <ma...@ericsson.com>
Organization: Ericsson AB
Reply-To: "dev@cassandra.apache.org" <de...@cassandra.apache.org>
Date: Wednesday, November 23, 2016 at 7:52 AM
To: "dev@cassandra.apache.org" <de...@cassandra.apache.org>
Subject: STCS in L0 behaviour

 

Hi everyone,


TL;DR
Should LCS be changed to always prefer an STCS compaction in L0 if it's falling behind? Assuming that STCS in L0 is enabled.
Currently LCS seems to check if there is a possible L0->L1 compaction before checking if it's falling behind, which in our case used between 15-30% of the compaction thread CPU.
TL;DR

So first some background:
We have a Apache Cassandra 2.2 cluster running with a high load. In that cluster there is a table with a moderate amount of writes per second that is using LeveledCompactionStrategy. The test was to run repair on that table while we monitored the cluster through JMC and with Flight Recordings enabled. This resulted in a large amount of sstables for that table, which I assume others have experienced as well. In this case I think it was between 15-20k.

From the Flight Recording one thing we saw was that 15-30% of the CPU time in each of the compaction threads was spent on "getNextBackgroundTask()" which retrieves the next compaction job. With some further investigation this seems to mostly be when it's checking for overlap in L0 sstables before performing an L0->L1 compaction. There is a JIRA which seems to be related to this https://issues.apache.org/jira/browse/CASSANDRA-11571 which we backported to 2.2 and tested. In our testing it seemed to improve the situation but it was still using noticeable CPU.

My interpretation of the current logic of LCS is (if STCS in L0 is enabled):
1. Check each level (L1+)
 - If a L1+ compaction is needed check if L0 is behind and do STCS if that's the case, otherwise do the L1+ compaction.
2. Check L0 -> L1 compactions and if none is needed/possible check for STCS in L0.

My proposal is to change this behavior to always check if L0 is far behind first and do a STCS compaction in that case. This would avoid the overlap check for L0 -> L1 compactions when L0 is behind and I think it makes sense since we already prefer STCS to L1+ compactions. This would not solve the repair situation, but it would lower some of the impact that repair has on LCS.

For what version this could get in I think trunk would be enough since compaction is pluggable.

-- 

  

MARCUS OLSSON 
Software Developer

Ericsson
Sweden
marcus.olsson@ericsson.com
www.ericsson.com 


Re: STCS in L0 behaviour

Posted by Jeff Jirsa <jj...@gmail.com>.
Without yet reading the code, what you describe sounds like a reasonable optimization / fix, suitable for 3.0+ (probably not 2.2, definitely not 2.1)

-- 
Jeff Jirsa


> On Nov 23, 2016, at 7:52 AM, Marcus Olsson <ma...@ericsson.com> wrote:
> 
> Hi everyone,
> 
> TL;DR
> Should LCS be changed to always prefer an STCS compaction in L0 if it's falling behind? Assuming that STCS in L0 is enabled.
> Currently LCS seems to check if there is a possible L0->L1 compaction before checking if it's falling behind, which in our case used between 15-30% of the compaction thread CPU.
> TL;DR
> 
> So first some background:
> We have a Apache Cassandra 2.2 cluster running with a high load. In that cluster there is a table with a moderate amount of writes per second that is using LeveledCompactionStrategy. The test was to run repair on that table while we monitored the cluster through JMC and with Flight Recordings enabled. This resulted in a large amount of sstables for that table, which I assume others have experienced as well. In this case I think it was between 15-20k.
> 
> From the Flight Recording one thing we saw was that 15-30% of the CPU time in each of the compaction threads was spent on "getNextBackgroundTask()" which retrieves the next compaction job. With some further investigation this seems to mostly be when it's checking for overlap in L0 sstables before performing an L0->L1 compaction. There is a JIRA which seems to be related to this https://issues.apache.org/jira/browse/CASSANDRA-11571 which we backported to 2.2 and tested. In our testing it seemed to improve the situation but it was still using noticeable CPU.
> 
> My interpretation of the current logic of LCS is (if STCS in L0 is enabled):
> 1. Check each level (L1+)
>  - If a L1+ compaction is needed check if L0 is behind and do STCS if that's the case, otherwise do the L1+ compaction.
> 2. Check L0 -> L1 compactions and if none is needed/possible check for STCS in L0.
> 
> My proposal is to change this behavior to always check if L0 is far behind first and do a STCS compaction in that case. This would avoid the overlap check for L0 -> L1 compactions when L0 is behind and I think it makes sense since we already prefer STCS to L1+ compactions. This would not solve the repair situation, but it would lower some of the impact that repair has on LCS.
> 
> For what version this could get in I think trunk would be enough since compaction is pluggable.
> 
> 
> -- 
>  
> 
> MARCUS OLSSON 
> Software Developer
> 
> Ericsson
> Sweden
> marcus.olsson@ericsson.com
> www.ericsson.com