Posted to commits@cassandra.apache.org by "Paulo Motta (JIRA)" <ji...@apache.org> on 2016/12/20 20:50:58 UTC

[jira] [Commented] (CASSANDRA-12965) StreamReceiveTask causing high CPU utilization during repair

    [ https://issues.apache.org/jira/browse/CASSANDRA-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765216#comment-15765216 ] 

Paulo Motta commented on CASSANDRA-12965:
-----------------------------------------

I'm afraid this will no longer be fixed on 2.1, given it's in critical-fixes-only mode, but I'd like to understand the problem better to see whether it's still present on later versions, since this is pretty similar to CASSANDRA-13055. A few questions:
- Was this a one-off problem or did it happen more than once?
- What repair command/options do you use?
- How many tables, RF and vnodes?
- Is repair triggered simultaneously on more than one node?

> StreamReceiveTask causing high CPU utilization during repair
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-12965
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12965
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Randy Fradin
>
> During a full repair run, I observed one node in my cluster using 100% cpu (100% of all cores on a 48-core machine). When I took a stack trace I found exactly 48 running StreamReceiveTask threads. Each was in the same block of code in StreamReceiveTask.OnCompletionRunnable:
> {noformat}
> "StreamReceiveTask:8077" #1511134 daemon prio=5 os_prio=0 tid=0x00007f01520a8800 nid=0x6e77 runnable [0x00007f020dfae000]
>    java.lang.Thread.State: RUNNABLE
>         at java.util.ComparableTimSort.binarySort(ComparableTimSort.java:258)
>         at java.util.ComparableTimSort.sort(ComparableTimSort.java:203)
>         at java.util.Arrays.sort(Arrays.java:1312)
>         at java.util.Arrays.sort(Arrays.java:1506)
>         at java.util.ArrayList.sort(ArrayList.java:1454)
>         at java.util.Collections.sort(Collections.java:141)
>         at org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:257)
>         at org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
>         at org.apache.cassandra.utils.IntervalTree.<init>(IntervalTree.java:72)
>         at org.apache.cassandra.db.DataTracker$SSTableIntervalTree.<init>(DataTracker.java:590)
>         at org.apache.cassandra.db.DataTracker$SSTableIntervalTree.<init>(DataTracker.java:584)
>         at org.apache.cassandra.db.DataTracker.buildIntervalTree(DataTracker.java:565)
>         at org.apache.cassandra.db.DataTracker$View.replace(DataTracker.java:761)
>         at org.apache.cassandra.db.DataTracker.addSSTablesToTracker(DataTracker.java:428)
>         at org.apache.cassandra.db.DataTracker.addSSTables(DataTracker.java:283)
>         at org.apache.cassandra.db.ColumnFamilyStore.addSSTables(ColumnFamilyStore.java:1422)
>         at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:148)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> All 48 threads were in ColumnFamilyStore.addSSTables(), and specifically in the IntervalNode constructor called from the IntervalTree constructor.
> It stayed this way for maybe an hour before we restarted the node. The repair was also generating thousands (20,000+) of tiny SSTables in a table that previously had just 20.
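> To illustrate why thousands of tiny SSTables would make this so expensive, here is a toy sketch (plain Java, not Cassandra code; the list of integers just stands in for the intervals in the tracker's view). From the stack trace it looks like every batch of added SSTables rebuilds the whole interval tree, which re-sorts every interval, so the cumulative work grows roughly quadratically with the number of SSTables:
> {noformat}
> import java.util.ArrayList;
> import java.util.Collections;
> import java.util.List;
>
> public class RebuildCostSketch
> {
>     public static void main(String[] args)
>     {
>         List<Integer> intervals = new ArrayList<>();
>         long estimatedComparisons = 0;
>         for (int batch = 1; batch <= 20_000; batch++)
>         {
>             intervals.add(batch);            // one new tiny sstable arrives
>             Collections.sort(intervals);     // full rebuild: re-sort every interval in the view
>             // tally a rough n*log2(n) estimate of the work done by this rebuild
>             estimatedComparisons += (long) (intervals.size() * (Math.log(intervals.size()) / Math.log(2)));
>         }
>         System.out.println("approx comparisons across all rebuilds: " + estimatedComparisons);
>     }
> }
> {noformat}
> With 48 completion threads all doing this concurrently, that seems like more than enough to pin every core.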
> I don't know enough about SSTables and ColumnFamilyStore to know whether all this CPU work is necessary or a bug, but I did notice that these tasks run on a thread pool constructed in StreamReceiveTask.java. Perhaps that pool's maximum thread count should be lower than the number of processors on the machine, at least on machines with many processors. Any reason not to do that? Any ideas for a reasonable number, or a formula, to cap the thread count?
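> A minimal sketch of the kind of cap I have in mind (the class name and the formula here are hypothetical, not the actual StreamReceiveTask pool):
> {noformat}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> public class StreamReceivePoolSketch
> {
>     public static ExecutorService createPool()
>     {
>         // Instead of one completion thread per core, cap the pool at a small number,
>         // e.g. the larger of 4 and 1/8 of the available cores (the formula is just a guess).
>         int cores = Runtime.getRuntime().availableProcessors();
>         int maxThreads = Math.max(4, cores / 8);
>         return Executors.newFixedThreadPool(maxThreads);
>     }
> }
> {noformat}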
> Some additional info: We have never run incremental repair on this cluster, so that is not a factor. All our tables use LCS. Unfortunately I don't have the log files from the period saved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)