Posted to commits@cassandra.apache.org by "Benedict Elliott Smith (Jira)" <ji...@apache.org> on 2020/01/15 13:51:00 UTC

[jira] [Commented] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18

    [ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015996#comment-17015996 ] 

Benedict Elliott Smith commented on CASSANDRA-15430:
----------------------------------------------------

[~tsteinmaurer] it would help if you could post the schema and example queries you are submitting to the cluster. There may be a mitigation for this specific workload in a later version of Cassandra, or in the forthcoming 4.0, that you could backport. I would also be happy to take a look at the JFR logs if we can find a shared location to put them.

> Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15430
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15430
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Thomas Steinmaurer
>            Priority: Normal
>         Attachments: dashboard.png, jfr_allocations.png, mutation_stage.png
>
>
> In a 6-node load-test cluster, we have been running a production-like workload continuously and satisfactorily on 2.1.18. After upgrading one node to 3.0.18 (the remaining 5 are still on 2.1.18, since we saw the regression described below), the 3.0.18 node shows increased CPU usage, increased GC activity, a high number of pending tasks in the mutation stage, dropped mutation messages ...
> Some specs. All 6 nodes are equally sized:
>  * Bare metal, 32 physical cores, 512G RAM
>  * Xmx31G, G1, max pause millis = 2000ms
>  * cassandra.yaml basically unchanged, thus same settings in regard to number of threads, compaction throttling etc.
> The following dashboard highlights areas (CPU, suspension) with metrics for all 6 nodes; the one outlier is the node upgraded to Cassandra 3.0.18.
>  !dashboard.png|width=1280!
> Additionally, we see a large increase in pending tasks in the mutation stage after the upgrade:
>  !mutation_stage.png!
> And dropped mutation messages, also confirmed in the Cassandra log:
> {noformat}
> INFO  [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout and 0 for cross node timeout
> INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool Name                    Active   Pending      Completed   Blocked  All Time Blocked
> INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - MutationStage                   256     81824     3360532756         0                 0
> INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - ViewMutationStage                 0         0              0         0                 0
> INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - ReadStage                         0         0       62862266         0                 0
> INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - RequestResponseStage              0         0     2176659856         0                 0
> INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - ReadRepairStage                   0         0              0         0                 0
> INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - CounterMutationStage              0         0              0         0                 0
> ...
> {noformat}
> Judging from 15-minute JFR sessions, one for 3.0.18 and one for 2.1.18 on a different node, at a high level it looks like the code path underneath {{BatchMessage.execute}} produces ~10x more on-heap allocations in 3.0.18 than in 2.1.18.
>  !jfr_allocations.png!
> Left => 3.0.18
>  Right => 2.1.18
> The zipped JFRs exceed the 60MB limit for attaching directly to the ticket. I can upload them if another destination is available.
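
For anyone comparing nodes from their logs rather than from a dashboard, the StatusLogger and MessagingService lines quoted in the description can be pulled apart with a small script. The following is only a sketch: the names PoolStats, parse_pool_line and parse_dropped_line are hypothetical, and the regular expressions assume the exact line layout shown in the excerpt above, not any documented or stable log format.

```python
import re
from typing import NamedTuple, Optional, Tuple


class PoolStats(NamedTuple):
    """One row of the StatusLogger thread-pool table (hypothetical helper type)."""
    name: str
    active: int
    pending: int
    completed: int
    blocked: int
    all_time_blocked: int


# Assumed layout: "... StatusLogger.java:NN - <pool> <5 integer columns>"
POOL_RE = re.compile(
    r"StatusLogger\.java:\d+ - (\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)

# Assumed layout of the dropped-message summary line.
DROPPED_RE = re.compile(
    r"- (\w+) messages were dropped in last (\d+) ms: "
    r"(\d+) for internal timeout and (\d+) for cross node timeout"
)


def parse_pool_line(line: str) -> Optional[PoolStats]:
    """Return the pool statistics in a line, or None for headers and other lines."""
    m = POOL_RE.search(line)
    if m is None:
        return None
    name, *counts = m.groups()
    return PoolStats(name, *(int(c) for c in counts))


def parse_dropped_line(line: str) -> Optional[Tuple[str, int, int, int]]:
    """Return (verb, window_ms, internal_timeouts, cross_node_timeouts) or None."""
    m = DROPPED_RE.search(line)
    if m is None:
        return None
    verb, window_ms, internal, cross_node = m.groups()
    return verb, int(window_ms), int(internal), int(cross_node)


# Lines copied from the log excerpt in the issue description.
SAMPLE = """\
INFO  [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout and 0 for cross node timeout
INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool Name                    Active   Pending      Completed   Blocked  All Time Blocked
INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - MutationStage                   256     81824     3360532756         0                 0
INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - ReadStage                         0         0       62862266         0                 0
"""

lines = SAMPLE.splitlines()
pools = [p for p in map(parse_pool_line, lines) if p is not None]
dropped = [d for d in map(parse_dropped_line, lines) if d is not None]

# Pools with queued work are the ones worth comparing across nodes.
backlogged = [p.name for p in pools if p.pending > 0]
print(backlogged)
print(dropped)
```

Run against the excerpt, this singles out MutationStage as the only pool with pending work, which matches the mutation_stage.png observation; running the same script over logs from a 2.1.18 node would give a like-for-like comparison.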



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
