Posted to commits@cassandra.apache.org by "Ariel Weisberg (JIRA)" <ji...@apache.org> on 2015/12/14 19:20:46 UTC

[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

    [ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056411#comment-15056411 ] 

Ariel Weisberg edited comment on CASSANDRA-9318 at 12/14/15 6:20 PM:
---------------------------------------------------------------------

Quick note: 65k mutations pending in the mutation stage, 7 memtables pending flush. [I hooked memtables pending flush into the backpressure mechanism.|https://github.com/apache/cassandra/commit/494eabf48ab48f1e86c058c0b583166ab39dcc39] That absolutely wrecked performance, as throughput periodically dropped to zero, but intermittent zero throughput is still infinitely better than the permanent zero after the database has OOMed.
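
The hook itself amounts to a counter with high and low watermarks. Here is a minimal standalone sketch of that shape; the class and method names are invented for illustration, not what the linked commit uses:

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch, not the actual patch: count memtables pending flush
// and trip a backpressure flag at a high watermark, releasing it again
// at a low watermark.
public class PendingFlushBackpressure
{
    private final AtomicInteger pendingFlushes = new AtomicInteger();
    private final AtomicBoolean engaged = new AtomicBoolean();
    private final int highWatermark;
    private final int lowWatermark;

    public PendingFlushBackpressure(int highWatermark, int lowWatermark)
    {
        this.highWatermark = highWatermark;
        this.lowWatermark = lowWatermark;
    }

    public void onFlushScheduled()
    {
        if (pendingFlushes.incrementAndGet() >= highWatermark)
            engaged.set(true);
    }

    public void onFlushCompleted()
    {
        if (pendingFlushes.decrementAndGet() <= lowWatermark)
            engaged.set(false);
    }

    // The coordinator consults this before accepting more work.
    public boolean shouldThrottle()
    {
        return engaged.get();
    }
}
{code}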

Kicked off a few performance runs to demonstrate what happens when you do have backpressure and you try various large limits on in-flight memtables/requests.

[9318 w/backpressure 64m 8g heap memtables count|http://cstar.datastax.com/tests/id/fa769eec-a283-11e5-bbc9-0256e416528f]
[9318 w/backpressure 1g 8g heap memtables count|http://cstar.datastax.com/tests/id/4c52dd6e-a286-11e5-bbc9-0256e416528f]
[9318 w/backpressure 2g 8g heap memtables count|http://cstar.datastax.com/tests/id/b3d5b470-a286-11e5-bbc9-0256e416528f]

I am setting the point where backpressure turns off to almost the same limit as the point where it turns on. This smooths out performance just enough that stress doesn't constantly emit huge numbers of errors from writes timing out while the database stops serving requests for a long stretch waiting for a memtable to flush.
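
In terms of the sketch above, that means a hysteresis window of nearly zero; the values below are illustrative, not what these runs used:

{code:java}
// Engage at 7 memtables pending flush, release at 6, so the node
// resumes accepting work quickly instead of stalling for an entire
// flush cycle. Illustrative values only.
PendingFlushBackpressure backpressure = new PendingFlushBackpressure(7, 6);
{code}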

With pressure from memtables somewhat accounted for, the remaining source of pressure that can bring down a node is remotely delivered mutations. I can throw those into the calculation and add a listener that blocks reads from other cluster nodes. It's a nasty thing to do, but maybe not that different from OOMing.
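
If the peer connections were Netty-style channels, such a listener could be as simple as flipping auto-read on each registered channel; this is a sketch under that assumption (the names are mine, and the 2.x internode transport is not actually written this way):

{code:java}
import io.netty.channel.Channel;
import io.netty.channel.group.ChannelGroup;
import io.netty.channel.group.DefaultChannelGroup;
import io.netty.util.concurrent.GlobalEventExecutor;

// Sketch only: when backpressure engages, stop reading from peer
// connections so remotely delivered mutations queue in the socket
// buffers instead of on this node's heap.
public class InboundReadBlocker
{
    private final ChannelGroup peers = new DefaultChannelGroup(GlobalEventExecutor.INSTANCE);

    public void register(Channel peer)
    {
        peers.add(peer);
    }

    // Called when the overall backpressure state flips.
    public void onBackpressureChanged(boolean engaged)
    {
        for (Channel peer : peers)
            peer.config().setAutoRead(!engaged);
    }
}
{code}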

I am going to hack together something to force a node to be slow so I can demonstrate overwhelming it with remotely delivered mutations first.


> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local Write-Read Paths, Streaming and Messaging
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 2.1.x, 2.2.x
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes.
> An implementation might track the number of outstanding bytes and requests and, if a high watermark is reached, disable read on client connections until the total drops back below some low watermark.
> Need to make sure that disabling read on the client connection won't introduce other issues.
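
The watermark scheme sketched in the description might look roughly like the following at the coordinator; every name here is hypothetical (there is no InFlightLimiter or Transport in Cassandra), so read it as a shape for the idea rather than an implementation:

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Rough sketch of coordinator-side accounting: each request adds its
// size on arrival and subtracts it on completion; crossing the high
// watermark disables reads from client connections until the total
// falls back under the low watermark. All names are hypothetical.
public class InFlightLimiter
{
    // Stand-in for whatever owns the client connections.
    public interface Transport
    {
        void setClientReadsEnabled(boolean enabled);
    }

    private final AtomicLong inFlightBytes = new AtomicLong();
    private final AtomicLong inFlightRequests = new AtomicLong();
    private final long highWatermarkBytes;
    private final long lowWatermarkBytes;
    private final Transport transport;

    public InFlightLimiter(long highWatermarkBytes, long lowWatermarkBytes, Transport transport)
    {
        this.highWatermarkBytes = highWatermarkBytes;
        this.lowWatermarkBytes = lowWatermarkBytes;
        this.transport = transport;
    }

    public void onRequestStart(long bytes)
    {
        inFlightRequests.incrementAndGet();
        if (inFlightBytes.addAndGet(bytes) >= highWatermarkBytes)
            transport.setClientReadsEnabled(false);
    }

    public void onRequestComplete(long bytes)
    {
        inFlightRequests.decrementAndGet();
        if (inFlightBytes.addAndGet(-bytes) <= lowWatermarkBytes)
            transport.setClientReadsEnabled(true);
    }
}
{code}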



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)