Posted to commits@cassandra.apache.org by "Sumanth Pasupuleti (JIRA)" <ji...@apache.org> on 2019/07/10 06:31:00 UTC
[jira] [Comment Edited] (CASSANDRA-15013) Message Flusher queue can grow unbounded, potentially running JVM out of memory

    [ https://issues.apache.org/jira/browse/CASSANDRA-15013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881762#comment-16881762 ] 

Sumanth Pasupuleti edited comment on CASSANDRA-15013 at 7/10/19 6:30 AM:
-------------------------------------------------------------------------

Performance tests were run against two C* clusters, one running the latest trunk and one running the latest trunk plus the 15013 patch. Two NDBench clusters, configured alike to emit similar traffic, were set up to throw load at each of the C* clusters. Each C* cluster is a single region with six i3.8xl nodes, and each NDBench cluster has 450 nodes.

Following is the analysis of the perf run:
# No blocked threadpool in patch, vs blocked threadpool in trunk
 !perftest_blockedthreads.png! 
# Similar writeops
!perftest_writeops.png|thumbnail!
# Patch does more readops vs trunk
!perftest_readops.png|thumbnail!
# Comparable read and write latencies (99th and avg)
!perftest_readlatency_99th.png|thumbnail!
!perftest_readlatency_avg.png|thumbnail!
!perftest_writelatency_99th.png|thumbnail!
!perftest_writelatency_avg.png|thumbnail!
# Comparable CPU usage
!perftest_cpu_usage.png|thumbnail!
# Comparable heap usage
!perftest_heap_usage.png|thumbnail!
# Connections count (~1000 connections per C* node)
!perftest_connections_count.png|thumbnail!





> Message Flusher queue can grow unbounded, potentially running JVM out of memory
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15013
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15013
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Client
>            Reporter: Sumanth Pasupuleti
>            Assignee: Sumanth Pasupuleti
>            Priority: Normal
>              Labels: pull-request-available
>             Fix For: 4.0, 3.0.x, 3.11.x
>
>         Attachments: BlockedEpollEventLoopFromHeapDump.png, BlockedEpollEventLoopFromThreadDump.png, RequestExecutorQueueFull.png, heap dump showing each ImmediateFlusher taking upto 600MB.png, perftest_blockedthreads.png, perftest_connections_count.png, perftest_cpu_usage.png, perftest_heap_usage.png, perftest_readlatency_99th.png, perftest_readlatency_avg.png, perftest_readops.png, perftest_writelatency_99th.png, perftest_writelatency_avg.png, perftest_writeops.png
>
>
> This is a follow-up ticket to CASSANDRA-14855, to make the Flusher queue bounded. In the current state, items are added to the queue without any check on the queue size and without checking the isWritable state of the netty outbound buffer.
> We are seeing this issue hit our production 3.0 clusters quite often.
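
To illustrate the two missing checks described above, here is a minimal sketch of a queue that refuses new items when it is full or when the channel reports it is not writable. This is not the CASSANDRA-15013 patch: the class name `BoundedFlushQueue`, the method `tryEnqueue`, and the capacity parameter are hypothetical, and the `BooleanSupplier` only stands in for a call such as Netty's `Channel.isWritable()`.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.BooleanSupplier;

/** Hypothetical illustration only; names and design are assumptions, not Cassandra code. */
final class BoundedFlushQueue<T> {
    private final Queue<T> queued = new ArrayDeque<>();
    private final int capacity;
    // Stand-in for a writability check such as Netty's Channel.isWritable().
    private final BooleanSupplier channelWritable;

    BoundedFlushQueue(int capacity, BooleanSupplier channelWritable) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.channelWritable = channelWritable;
    }

    /**
     * Adds the item only if the queue has room and the channel is writable.
     * Returns false otherwise, so the caller can apply backpressure
     * instead of letting the queue grow without bound.
     */
    synchronized boolean tryEnqueue(T item) {
        if (queued.size() >= capacity || !channelWritable.getAsBoolean()) {
            return false;
        }
        queued.add(item);
        return true;
    }

    /** Removes and returns the oldest queued item, or null if the queue is empty. */
    synchronized T poll() {
        return queued.poll();
    }

    synchronized int size() {
        return queued.size();
    }
}
```

A caller that gets `false` back would have to decide what to do with the rejected item (retry later, stop reading further requests, or fail the request); that policy is outside the scope of this sketch.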


