Posted to dev@cassandra.apache.org by Marcus Olsson <ma...@ericsson.com> on 2016/12/02 18:33:33 UTC

Re: STCS in L0 behaviour

Hi,

In reply to Dikang Gu:
For the run where we incorporated the change from CASSANDRA-11571 the 
stack trace was like this (from JMC):
Stack Trace                                                                                      Sample Count  Percentage(%)
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(int)                  229         11.983
-org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates()                            228         11.931
--org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(int)                               221         11.565
---org.apache.cassandra.db.compaction.LeveledManifest.overlappingWithBounds(SSTableReader, Map)          201         10.518
----org.apache.cassandra.db.compaction.LeveledManifest.overlappingWithBounds(Token, Token, Map)          201         10.518
-----org.apache.cassandra.dht.Bounds.intersects(Bounds)                                                  141          7.378
-----java.util.HashSet.add(Object)                                                                        56          2.930


This is for one of the compaction executors during an interval of 1 
minute and 24 seconds, but we saw similar behavior for the other 
compaction threads as well. The full flight recording was 10 minutes 
long and was started at the same time as the repair. The interval was 
taken from the end of the recording, where the number of sstables had 
increased. During this interval this compaction thread used ~10% of 
the total CPU.
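
As a back-of-the-envelope illustration of why this shows up as CPU 
time, here is a minimal, self-contained sketch (my own simplification, 
not the actual LeveledManifest code; the Range class, the overlapping() 
helper and the 20k figure are just stand-ins) of what the hot path 
above amounts to: if the candidate search computes the overlap of each 
L0 sstable against the rest of L0, the number of Bounds.intersects-style 
checks grows roughly with the square of the L0 backlog.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Rough model of the hot path in the trace above: for each candidate
// sstable, its token bounds are checked against (roughly) every other
// sstable in L0, so the intersects()/HashSet.add() work scales with the
// size of the backlog for every call, and the calls are repeated for
// every getNextBackgroundTask(). Not LeveledManifest code.
public class OverlapCostSketch
{
    static final class Range
    {
        final long left, right;
        Range(long left, long right) { this.left = left; this.right = right; }
        boolean intersects(Range other) { return left <= other.right && other.left <= right; }
    }

    // Analogous in spirit to overlappingWithBounds(): one full pass over L0.
    static Set<Range> overlapping(Range candidate, List<Range> l0)
    {
        Set<Range> overlapped = new HashSet<>();
        for (Range other : l0)
            if (candidate.intersects(other))
                overlapped.add(other);
        return overlapped;
    }

    public static void main(String[] args)
    {
        List<Range> l0 = new ArrayList<>();
        for (int i = 0; i < 20_000; i++)               // ~20k sstables in L0, as during repair
            l0.add(new Range(i * 10L, i * 10L + 15L)); // some neighbours overlap

        long intersectsCalls = 0;
        for (Range candidate : l0)                     // candidate search touches each L0 sstable
        {
            overlapping(candidate, l0);                // one pass over L0 per candidate ...
            intersectsCalls += l0.size();              // ... hence ~n^2 checks in total
        }
        System.out.println("intersects() calls for one pass over L0: " + intersectsCalls);
    }
}

With ~20k sstables in L0 that is on the order of 4*10^8 bounds checks 
per pass, repeated every time a new background task is requested, which 
would fit with the ~10% CPU we saw per compaction thread.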

I agree that ideally there shouldn't be many sstables in L0, and except 
for when repair is running we don't have that many.

---

In reply to Jeff Jirsa/Nate McCall:
I might have been unclear about the compaction order in my first email. 
I meant to say that there is a check for STCS right before L1+, but 
only if an L1+ compaction is possible. We used version 2.2.7 for the 
test run, so https://issues.apache.org/jira/browse/CASSANDRA-10979 
should be included and should have reduced some of the backlog in L0.

Correct me if I'm wrong, but my interpretation of the scenario that 
Sylvain describes in 
https://issues.apache.org/jira/browse/CASSANDRA-5371 is that you either 
almost constantly have 32+ sstables in L0 or are close to it. My guess 
is that this could also apply to having a constant load during a 
certain time span. So when you get more than 32 sstables you start to 
do STCS, which in turn creates larger sstables that might span the 
whole of L1. Then, when these sstables are to be promoted to L1, the 
whole of L1 gets re-written, which creates a larger backlog in L0. The 
number of sstables then keeps rising, triggers STCS again, and 
completes the circle. Based on this interpretation it seems to me that 
if the write pattern into L0 is "random", this might happen regardless 
of whether an STCS compaction has occurred or not.

If my interpretation is correct it might be better to use a higher, 
configurable threshold for the number of sstables in L0 before STCS 
kicks in. With reduced complexity it could be something like this (see 
the sketch after the list):
1. Perform STCS in L0 if we have more than X (1000?) sstables in L0.
2. Check L1+.
3. Check for L0->L1.
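
A minimal sketch of the ordering I have in mind, assuming a simplified 
and hypothetical CandidateSelectorSketch (none of the names below exist 
in Cassandra, and the per-level selection is reduced to placeholders):

import java.util.List;

// Hypothetical sketch of the proposed ordering; SSTable, candidatesForLevel()
// and maxL0SstablesBeforeStcs are made-up names, not actual Cassandra code.
public class CandidateSelectorSketch
{
    static final class SSTable { /* placeholder */ }

    private final List<List<SSTable>> levels;     // levels.get(0) == L0, etc.
    private final int maxL0SstablesBeforeStcs;    // configurable, e.g. 1000

    CandidateSelectorSketch(List<List<SSTable>> levels, int maxL0SstablesBeforeStcs)
    {
        this.levels = levels;
        this.maxL0SstablesBeforeStcs = maxL0SstablesBeforeStcs;
    }

    List<SSTable> getCompactionCandidates()
    {
        // 1. With a very large L0 backlog, fall back to STCS in L0 right away
        //    and skip the expensive overlapping checks entirely.
        if (levels.get(0).size() > maxL0SstablesBeforeStcs)
            return stcsInL0Candidates();

        // 2. Otherwise prefer compactions in L1 and above (the existing logic,
        //    including the current STCS-in-L0 check, could stay unchanged here).
        for (int level = 1; level < levels.size(); level++)
            if (!candidatesForLevel(level).isEmpty())
                return candidatesForLevel(level);

        // 3. Finally, try to promote sstables from L0 into L1.
        return candidatesForLevel(0);
    }

    // Placeholders for the real selection logic.
    private List<SSTable> stcsInL0Candidates()          { return levels.get(0); }
    private List<SSTable> candidatesForLevel(int level) { return levels.get(level); }
}

The point is just the ordering: the new check in step 1 only looks at 
the L0 sstable count, so it stays cheap even with tens of thousands of 
sstables, while the existing per-level logic could stay unchanged for 
the normal case.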

It should be possible to keep the current logic as well and only add a 
configurable check before it (step 1) to avoid the overlapping check 
with larger backlogs. Another alternative might be 
https://issues.apache.org/jira/browse/CASSANDRA-7409, which allows 
overlapping sstables in more levels than L0. If it can quickly push 
sorted data to L1 it might remove the need for STCS in LCS. The 
previously mentioned potential cost of the overlapping check would 
still be there with a large backlog, but the approach might reduce the 
risk of getting into that situation in the first place. I'll try to 
get some time to run a test with CASSANDRA-7409 in our test cluster.

BR
Marcus O

On 11/28/2016 06:48 PM, Eric Evans wrote:
> On Sat, Nov 26, 2016 at 6:30 PM, Dikang Gu<di...@gmail.com>  wrote:
>> Hi Marcus,
>>
>> Do you have some stack trace to show that which function in the `
>> getNextBackgroundTask` is most expensive?
>>
>> Yeah, I think having 15-20K sstables in L0 is very bad, in our heavy-write
>> cluster, I try my best to reduce the impact of repair, and keep number of
>> sstables in L0 < 100.
>>
>> Thanks
>> Dikang.
>>
>> On Thu, Nov 24, 2016 at 12:53 PM, Nate McCall<zz...@gmail.com>  wrote:
>>
>>>> The reason is described here:
>>> https://issues.apache.org/jira/browse/CASSANDRA-5371?focusedCommentId=13621679&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13621679
>>>> /Marcus
>>> "...a lot of the work you've done you will redo when you compact your now
>>> bigger L0 sstable against L1."
>>>
>>> ^ Sylvain's hypothesis (next comment down) is actually something we see
>>> occasionally in practice: having to re-write the contents of L1 too often
>>> when large L0 SSTables are pulled in. Here is an example we took on a
>>> system with pending compaction spikes that was seeing this specific issue
>>> with four LCS-based tables:
>>>
>>> https://gist.github.com/zznate/d22812551fa7a527d4c0d931f107c950
>>>
>>> The significant part of this particular workload is a burst of heavy writes
>>> from long-duration scheduled jobs.
>>>
>>
>> --
>> Dikang
>