Posted to commits@cassandra.apache.org by "Benedict (JIRA)" <ji...@apache.org> on 2015/02/06 19:07:36 UTC

[jira] [Commented] (CASSANDRA-8692) Coalesce intra-cluster network messages

    [ https://issues.apache.org/jira/browse/CASSANDRA-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309535#comment-14309535 ] 

Benedict commented on CASSANDRA-8692:
-------------------------------------

On the whole I like the patch, but I still have a lot of suggestions, and think it should be changed quite a bit before commit.

First, some minor suggestions/nits:
# The bug you spotted some time ago with .size() being passed in to drainTo() is still present.
# For clarity, make DISABLE_COALESCING a strategy, instead of a parameter? Fewer conditional statements in the code, and it's likely the optimizer will eliminate these calls anyway, since the strategy is final and the body will be a no-op.
# VM options might be good to prefix with "cassandra." or similar - we should decide a uniform approach to this
# parkLoop probably shouldn't loop again if, on a second iteration, it's already within a few % of the target sleep time (this might explain your double-sleeping?)
# The haveAllSamples() check seems unnecessary - unless our max coalesce window is >= 134 millis, we should never use the dummy samples?

It also seems to me that it is never useful to not drain the queue; it costs us very little in proportion to anything else, and the current behaviour is that it is always drained anyway (since we always "block", due to the bug mentioned in nit #1). We can simplify the code in this method quite a bit with this assumption; a rough sketch follows.
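
To make nit #1 and this point concrete, here's a minimal sketch of the shape I have in mind (names and the cap are invented, not from the patch): passing backlog.size() to drainTo() returns 0 when the queue is momentarily empty, so we always fell through to the blocking path; instead, drain with a fixed cap and block only when the drain comes back empty.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

class Drainer<M>
{
    static final int MAX_DRAIN = 128; // illustrative cap, not from the patch

    List<M> drain(BlockingQueue<M> backlog) throws InterruptedException
    {
        List<M> drained = new ArrayList<>(MAX_DRAIN);
        // drain with a fixed cap, not backlog.size(), which may be 0 at call time
        if (backlog.drainTo(drained, MAX_DRAIN) == 0)
            drained.add(backlog.take()); // block only when there is truly nothing to send
        return drained;
    }
}
{code}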

More substantively:

By only providing our strategy with the first message we drained prior to blocking, it seems we're missing useful information that could help us make better decisions. IMO, we should:

# Immediately drain the queue, and notify the strategy of all sample data points
# Then ask the strategy if we want to sleep; we should also provide it the current outstanding message count to help make this decision (rough sketch below)
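
Something like the following, as a sketch of the two-step contract I'm proposing (all names here are hypothetical, not the patch's actual API):

{code:java}
interface CoalescingStrategy
{
    // step 1: after every drain, hand the strategy the arrival times of all drained messages
    void newArrivals(long[] arrivalNanos);

    // step 2: ask whether to sleep, given how many messages are already waiting;
    // a return of 0 means "send immediately"
    long sleepNanos(int pendingCount);
}
{code}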

The current moving average strategy admittedly would not work well with this setup, because it bases its calculation only on the most recent 16 samples, so providing it with extra samples upfront most likely won't help, and might even hurt. This seems counterintuitive behaviour-wise, though: sleeping for some period based on the guessed arrival time of the next message, when that message is potentially already in our buffer, strikes me as odd, as does the possibility that providing more data points might make the decision worse. I suspect that the first and last messages in a batch dominate the behaviour, and that it is probably similar to only tracking these and sleeping for some ratio of the delta.

I would like to propose a _slightly_ more involved strategy that calculates the moving average over time horizons instead of count horizons. For simplicity, let's say we start with a linear discretization of time (i.e. split the horizon into even-sized buckets). Let's also assume that the 1s average is a good barometer. So your 16 buckets would each represent the number of messages that arrived in the prior 16 intervals of 62.5ms (with the boundary bucket not utilised until we cross it). The interesting bit is then how we use this information. Let's also assume a uniform message arrival distribution (again, this can be improved later, but likely won't yield much that parameter tweaking can't).
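
A minimal sketch of the bookkeeping I mean, under the linear discretization above (1s horizon, 16 buckets of 62.5ms; class and method names are mine, not the patch's):

{code:java}
class TimeHorizonAverage
{
    static final int BUCKETS = 16;
    static final long BUCKET_NANOS = 1_000_000_000L / BUCKETS; // 62.5ms

    final int[] counts = new int[BUCKETS];
    final long epoch = System.nanoTime();

    void record(long nowNanos)
    {
        int bucket = (int) (((nowNanos - epoch) / BUCKET_NANOS) % BUCKETS);
        // NB: a real version must zero buckets as the window advances past them
        // (the "boundary bucket" above); elided here for brevity
        counts[bucket]++;
    }

    // average inter-message gap D, in nanos, over the horizon
    long averageGapNanos()
    {
        int total = 0;
        for (int c : counts)
            total += c;
        return total == 0 ? Long.MAX_VALUE : (BUCKETS * BUCKET_NANOS) / total;
    }
}
{code}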

This gives us an average delta between messages of D ns. Let's also say we have N messages in our buffer already.
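
To make the arithmetic concrete (with purely illustrative numbers): if the buckets sum to 2,000 messages over the 1s horizon, then D = 1,000,000,000 / 2,000 = 500,000ns between messages, and N = 10 buffered messages would be expected to span ~5ms under the uniformity assumption.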

I can think of two complementary logics to apply off the top of my head. There are plenty of other possible variants of this style of approach, especially if we integrate other data (such as network latencies), but these ones spring to mind as simple and likely yielding much of the potential.

Logic 1:
We have N messages waiting to send; there's no point delaying these unless we intend to appreciably increase N. So let's say we want to double the number of messages in N to make it worthwhile sleeping; otherwise we may as well send what we have without delay. So we calculate the expected wait X*D, and sleep for it ONLY if it is less than the max coalesce interval; if it's greater, we just send immediately. The multiplier X can be configurable, but 2x seems likely to work pretty well.
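
A sketch of how I'd express Logic 1, reading the doubling criterion as "another N messages take roughly N*D ns to arrive at one arrival per D ns" (names are illustrative, and this is my interpretation, not the patch):

{code:java}
// sleep only if growing the batch by (X - 1) * N messages is expected to fit
// within the max coalesce interval; otherwise send immediately
long logic1SleepNanos(int n, long gapNanos, long maxCoalesceNanos, double multiplier)
{
    long expectedWait = (long) ((multiplier - 1) * n * gapNanos);
    return expectedWait <= maxCoalesceNanos ? expectedWait : 0; // 0 => flush now
}
{code}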

Logic 2:
Our average arrival rate says that we should expect these N messages to have arrived over a timespan of N*D ns; if they actually arrived in less time (by checking the delta between the first/last), we assume this is a burst of activity and just flush immediately, since the non-uniformity of arrival is in our favour. If the opposite is true, we might be able to estimate the probability that such a cluster of messages will arrive in the next X ns, but this requires some further thought.
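
i.e. something like the following (again, hypothetical names):

{code:java}
// compare the actual span of the N buffered messages against the N * D ns a
// uniform arrival rate would predict; a shorter span means a burst
boolean logic2IsBurst(long firstArrivalNanos, long lastArrivalNanos, int n, long gapNanos)
{
    long actualSpan = lastArrivalNanos - firstArrivalNanos;
    long expectedSpan = (long) n * gapNanos;
    return actualSpan < expectedSpan; // burst: flush immediately, don't sleep
}
{code}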

These can both be applied, with Logic 2 overriding Logic 1 if it says to short-circuit the waiting, say. We can then follow up with some research into more advanced analysis of the data, and integration of other metrics into the decision.
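
Reusing the two sketches above, the combination would be roughly:

{code:java}
long decideSleepNanos(long first, long last, int n, long gapNanos,
                      long maxCoalesceNanos, double multiplier)
{
    if (logic2IsBurst(first, last, n, gapNanos))
        return 0; // burst already in hand: flush immediately
    return logic1SleepNanos(n, gapNanos, maxCoalesceNanos, multiplier);
}
{code}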

Do these descriptions make sense, and sound reasonable to you?

> Coalesce intra-cluster network messages
> ---------------------------------------
>
>                 Key: CASSANDRA-8692
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8692
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 2.1.4
>
>         Attachments: batching-benchmark.png
>
>
> While researching CASSANDRA-8457 we found that it is effective and can be done without introducing additional latency at low concurrency/throughput.
> The patch from that was used and found to be useful in a real-life scenario, so I propose we implement this in 2.1 in addition to 3.0.
> The change set is a single file and is small enough to be reviewable.


