Posted to commits@cassandra.apache.org by "Stefania (JIRA)" <ji...@apache.org> on 2016/07/04 03:21:11 UTC

[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

    [ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15360784#comment-15360784 ] 

Stefania commented on CASSANDRA-9318:
-------------------------------------

bq. For that specific test I've got no client timeouts at all, as I wrote at ONE.

Sorry, I should have been clearer: I meant, what were the {{write_request_timeout_in_ms}} and {{back_pressure_timeout_override}} yaml settings?
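
For reference, something like the following cassandra.yaml fragment is what I mean; the values are purely illustrative, and {{back_pressure_timeout_override}} is the name under discussion in this patch, not a released setting:

```yaml
# Hypothetical cassandra.yaml fragment -- values illustrative only.
write_request_timeout_in_ms: 2000
# Name under discussion in this patch, not a released option:
back_pressure_timeout_override: 2000
```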

bq. Agreed with all your points. I'll see what I can do, but any help/pointers will be very appreciated.

We can do the following:

bq. verify we can reduce the number of dropped mutations in a larger (5-10 nodes) cluster with multiple clients writing simultaneously

I will ask the TEs for help; more details to follow.

bq. some cstar perf tests to ensure ops per second are not degraded, both read and writes
    
We can launch a comparison test [here|http://cstar.datastax.com]; 30M rows should be enough. I can launch it for you if you don't have an account.

bq. the dtests should be run with and without backpressure enabled
    
This can be done by temporarily changing cassandra.yaml on your branch and then launching the dtests.

bq. we should do a bulk load test, for example for cqlsh COPY FROM

I can take care of this. I don't expect problems, because COPY FROM should contact the replicas directly; it's just a box I want to tick. Importing 5 to 10M rows with 3 nodes should be sufficient.

bq. Please send me a PR and I'll incorporate those in my branch

I couldn't create a PR; for some reason, sbtourist/cassandra wasn't in the base fork list. I've attached a patch to this ticket, [^9318-3.0-nits-trailing-spaces.patch].

bq. I find the current layout effective and simple enough, but I'll not object if you want to push those under a common "container" option.

The encryption options are what I was aiming at, but it's true that for everything else we have a flat layout, so let's leave it as it is.

bq. I don't like much that name either, as it doesn't convey very well the (double) meaning; making the back-pressure window the same as the write timeout is not strictly necessary, but it makes the algorithm behave better in terms of reducing dropped mutations, as it gives the replica more time to process its backlog after the rate is reduced. Let me think about that a bit more, but I'd like to avoid requiring the user to increase the write timeout manually, as again, it reduces the effectiveness of the algorithm.

I'll let you think about it. Maybe we could use a boolean property, true by default, that clearly indicates the timeout is overridden, although this complicates things somewhat.
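
To make the window/timeout coupling concrete, here is a minimal sketch of the idea: the allowed outgoing rate to a replica is re-evaluated once per back-pressure window (the window defaulting to the write timeout), backing off when too few responses come back. All names and the exact rate formula here are hypothetical, not the patch's API:

```java
// Hypothetical sketch of a rate-based back-pressure window tied to the write
// timeout. Class and parameter names (highRatio, factor) are illustrative.
class BackPressureWindow
{
    private final double highRatio;     // ack ratio above which we relax the rate
    private final double factor;        // multiplicative step for rate changes
    private double ratePerSecond;       // current allowed outgoing rate

    BackPressureWindow(double highRatio, double factor, double initialRate)
    {
        this.highRatio = highRatio;
        this.factor = factor;
        this.ratePerSecond = initialRate;
    }

    // Called once per window (e.g. every write_request_timeout_in_ms):
    // if too few responses came back, slow down; otherwise speed back up.
    double onWindowEnd(long sent, long acknowledged)
    {
        double ratio = sent == 0 ? 1.0 : (double) acknowledged / sent;
        if (ratio < highRatio)
            ratePerSecond = ratePerSecond * ratio / factor;   // back off
        else
            ratePerSecond = ratePerSecond * factor;           // recover
        return ratePerSecond;
    }
}
```

The point of using the write timeout as the window is visible here: a longer window means fewer, gentler rate adjustments, giving the replica more time to drain its backlog between re-evaluations.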

bq. Sure I can switch to that on trunk, if you think it's worth performance-wise (I can write a JMH test if there isn't one already).

The precision is only 10 milliseconds; if this is acceptable, it would be interesting to see what the difference in performance is.
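
For context, the 10 ms figure comes from the coarse-clock pattern: a single background thread refreshes a cached timestamp, so hot paths do a cheap volatile read instead of a system call. A minimal sketch (class and method names are illustrative, not Cassandra's actual API):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of a ~10 ms precision clock: a daemon thread refreshes a cached
// timestamp; readers pay only a volatile read and can be at most ~10 ms stale.
class CoarseClock
{
    private static volatile long nowMillis = System.currentTimeMillis();

    private static final ScheduledExecutorService TICKER =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "coarse-clock");
            t.setDaemon(true);
            return t;
        });

    static
    {
        // Refresh every 10 ms -- this is where the precision bound comes from.
        TICKER.scheduleAtFixedRate(() -> nowMillis = System.currentTimeMillis(),
                                   10, 10, TimeUnit.MILLISECONDS);
    }

    static long currentTimeMillis()
    {
        return nowMillis;   // cheap volatile read on the hot path
    }
}
```

A JMH comparison of this read against a direct {{System.currentTimeMillis()}} call would answer the performance question.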

bq. It is not used in any unit tests code, but it is used in my manual byteman tests, and unfortunately I need it on the C* classpath; is that a problem to keep it?

Sorry, I missed the byteman imports and helper. Let's just move it to the test source folder and add a comment.

--

The rest of the CR points are fine. 

One thing we did not confirm is whether you are happy committing this only to trunk, or whether you need it in 3.0. Strictly speaking, 3.0 accepts only bug fixes, not new features. However, this is an optional feature that solves a problem (dropped mutations) and is disabled by default, so we have a case for an exception.

> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local Write-Read Paths, Streaming and Messaging
>            Reporter: Ariel Weisberg
>            Assignee: Sergio Bossa
>         Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, limit.btm, no_backpressure.png
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding bytes and requests and if it reaches a high watermark disable read on client connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't introduce other issues.
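
The high/low watermark scheme described in the ticket could be sketched as below. Everything here is hypothetical (not the actual implementation); real code would toggle reads on the client channel (e.g. Netty's auto-read flag) rather than a boolean:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative high/low watermark gate for in-flight request bytes:
// stop accepting client reads above the high watermark, resume once the
// backlog drains below the low watermark.
class InflightGate
{
    private final long highWatermark;
    private final long lowWatermark;
    private final AtomicLong inflightBytes = new AtomicLong();
    private volatile boolean readsEnabled = true;

    InflightGate(long high, long low)
    {
        this.highWatermark = high;
        this.lowWatermark = low;
    }

    // Called when a request is accepted from a client connection.
    void onRequest(long bytes)
    {
        if (inflightBytes.addAndGet(bytes) >= highWatermark)
            readsEnabled = false;   // stop reading new client requests
    }

    // Called when a request completes (response sent or timed out).
    void onComplete(long bytes)
    {
        if (inflightBytes.addAndGet(-bytes) <= lowWatermark)
            readsEnabled = true;    // resume reading
    }

    boolean readsEnabled()
    {
        return readsEnabled;
    }
}
```

The two-watermark gap is what prevents the flapping the ticket warns about: reads are only re-enabled once the backlog has drained well below the point where they were disabled.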



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)