Posted to commits@cassandra.apache.org by "Benedict Elliott Smith (Jira)" <ji...@apache.org> on 2020/04/07 12:19:00 UTC

[jira] [Comment Edited] (CASSANDRA-15700) Performance regression on internode messaging

    [ https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077168#comment-17077168 ] 

Benedict Elliott Smith edited comment on CASSANDRA-15700 at 4/7/20, 12:18 PM:
------------------------------------------------------------------------------

Thanks for the bug report, this is very valuable.

bq. what's the reason to prune so often, rather than just when polling the message during delivery?

This is (presently) the only way to know what deadline the eventLoop needs to wake up on, so that we do not let expiring messages accumulate indefinitely, without maintaining an auxiliary structure of expirations or introducing a more sophisticated queue for maintaining QoS.  This report is not an astonishing surprise, though we had been reassured by this path's absence so far in other testing, as we consciously implemented a suboptimal approach to keep the patch tractable.  We expected this work to be self-limiting: a cluster _shouldn't_ have a backlog that dominates unless there are network problems or the receiving node is getting backed up, in which case you would expect expiry itself to dominate, since the cost of pruning _should_ be quite low.  Clearly that isn't the case here; fortunately there are numerous solutions to this problem, none of which are super onerous.
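
To make the trade-off concrete, here is a rough, hypothetical sketch (not the actual {{OutboundMessageQueue}} / {{PrunableArrayQueue}} code): an eager prune both drops expired messages and reports the earliest deadline among the survivors, which is exactly the value the eventLoop needs in order to schedule its next wake-up.
{noformat}
// Illustrative sketch only -- not the actual Cassandra classes.
// A pruning sweep drops expired messages and yields the earliest deadline
// among the survivors, which the event loop uses to arm its next wake-up.
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

interface Expirable { long expiresAtNanos(); }

final class DeadlineQueueSketch<M extends Expirable>
{
    private final Queue<M> queue = new ArrayDeque<>();
    private final Consumer<M> onExpired;

    DeadlineQueueSketch(Consumer<M> onExpired) { this.onExpired = onExpired; }

    void add(M message) { queue.add(message); }

    // Drop everything already past its deadline and return the earliest deadline of what
    // remains (Long.MAX_VALUE if nothing remains); the caller uses the returned value to
    // schedule the next expiry check on the event loop.
    long pruneAndNextDeadline(long nowNanos)
    {
        long nextDeadline = Long.MAX_VALUE;
        for (int i = 0, size = queue.size(); i < size; i++)
        {
            M message = queue.poll();
            if (message.expiresAtNanos() <= nowNanos)
                onExpired.accept(message);              // expired: hand to the expiry callback
            else
            {
                queue.add(message);                     // still live: keep it, remember its deadline
                nextDeadline = Math.min(nextDeadline, message.expiresAtNanos());
            }
        }
        return nextDeadline;
    }
}
{noformat}
The real code is more involved (locking, the Pruner callback, and so on), but this is the shape of the argument: without some sweep over the queue, there is no cheap way to learn the next deadline.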


was (Author: benedict):
Thanks for the bug report, this is very valuable.

bq. what's the reason to prune so often, rather than just when polling the message during delivery?

This is (presently) the only way to know what deadline the eventLoop needs to wake up on, so that we do not let expiring messages accumulate indefinitely, without maintaining an auxiliary structure of expirations or introducing a more sophisticated queue for maintaining QoS.  This report is not an astonishing surprise, though we had been reassured by this path's absence so far in other testing, as we consciously implemented a suboptimal approach to keep the patch tractable.  We expected this work to be self-limiting: a cluster _shouldn't_ have a backlog that dominates unless there are network problems or the receiving node is getting backed up, in which case you would expect expiry itself to dominate, since the cost of performing expiration _should_ be quite low.  Clearly that isn't the case here; fortunately there are numerous solutions to this problem, none of which are super onerous.

> Performance regression on internode messaging
> ---------------------------------------------
>
>                 Key: CASSANDRA-15700
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Internode
>            Reporter: Sergio Bossa
>            Assignee: Sergio Bossa
>            Priority: Normal
>             Fix For: 4.0
>
>         Attachments: Oss40vsOss311.png, oss40.gc, oss40_nogc.tar.xz, oss40_system.log
>
>
> [~jasonstack] and I have been investigating a performance regression affecting 4.0 during a 3-node, RF 3 write throughput test with a timeseries-like workload, as shown in this plot, where blue is 3.11 and orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
>  It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached system log), while 3.11 has almost none.
> 2) Abnormal GC activity (as shown in the attached gc log).
> Turns out the two are related, as the [on expired callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462] creates a huge amount of strings in the {{id()}} call. The next question is what causes all those message expirations; we thoroughly reviewed the internode messaging code and the only issue we could find so far is related to the "batch pruning" calls [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81] and [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]: it _seems_ too much time is spent on those, causing the event loop to fall behind in processing the rest of the messages, which will end up being expired. This is supported by the analysis of the collapsed stacks (after fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}}, which spends half of its time pruning. But only a tiny portion of that pruning time is spent actually expiring:
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
> org/apache/cassandra/net/PrunableArrayQueue:prune 1894
> org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
> org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
> org/apache/cassandra/net/OutboundConnection:onExpired 147
> {noformat}
> And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
> {noformat}
> (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
> org/apache/cassandra/net/PrunableArrayQueue:prune 1718
> org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
> java/util/concurrent/ConcurrentHashMap:replaceNode 19
> java/util/concurrent/ConcurrentLinkedQueue:offer 16
> java/util/concurrent/LinkedBlockingQueue:offer 15
> {noformat}
> That said, before proceeding with a PR to fix those issues, I'd like to understand: what's the reason to prune so often, rather than just when polling the message during delivery? If there's a reason I'm missing, let's talk about how to optimize pruning; otherwise, let's get rid of it.
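> For illustration, a rough sketch of what I mean (hypothetical code, not a patch, and not the actual Cassandra classes): expired messages are simply discarded as the delivery loop polls them, with no separate pruning pass; the trade-off is that nothing tells the event loop when the next expiry is due while the connection is idle or backed up.
> {noformat}
> // Hypothetical sketch of poll-time-only expiry -- not actual Cassandra code.
> import java.util.Queue;
> import java.util.function.Consumer;
>
> final class PollTimeExpirySketch
> {
>     interface Expirable { long expiresAtNanos(); }
>
>     // Poll until a live message is found, invoking onExpired for anything already past its deadline.
>     static <M extends Expirable> M pollNextLive(Queue<M> queue, long nowNanos, Consumer<M> onExpired)
>     {
>         M message;
>         while ((message = queue.poll()) != null)
>         {
>             if (message.expiresAtNanos() > nowNanos)
>                 return message;                // still live: deliver it
>             onExpired.accept(message);         // already expired: report it and keep polling
>         }
>         return null;                           // queue drained
>     }
> }
> {noformat}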



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org