Posted to commits@cassandra.apache.org by "Peter Schuller (Created) (JIRA)" <ji...@apache.org> on 2012/02/05 11:23:54 UTC

[jira] [Created] (CASSANDRA-3853) lower impact on old-gen promotion of slow nodes or connections

lower impact on old-gen promotion of slow nodes or connections
--------------------------------------------------------------

                 Key: CASSANDRA-3853
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3853
             Project: Cassandra
          Issue Type: Improvement
            Reporter: Peter Schuller
            Assignee: Peter Schuller


Cassandra has the unfortunate behavior that when things are "slow" (nodes overloaded, etc.) there is a tendency toward cascading failure if the system as a whole is under high load. This is true of most systems to some degree, but one way in which Cassandra is worse than it needs to be is the way we queue things up between stages and for outgoing requests.

First off, I use the following premises:

* The node is not running Azul ;)
* The total cost of ownership (in terms of allocation+collection) of an object that dies in old-gen is *much* higher than that of an object that dies in young gen.
* When CMS fails (concurrent mode failure or promotion failure), the resulting full GC is *serial* and does not use all cores, and is a stop-the-world pause.

Here is how this very effectively leads to cascading failure of the "fallen and can't get up" kind:

* Some node has a problem and is slow, even if just for a little while.
* Other nodes, especially neighbors in the replica set, start queueing up outgoing requests to the node for {{rpc_timeout}} milliseconds.
* You have a high (let's say write) throughput of 50 thousand or so requests per second per node.
* Because you want writes to be highly available and you are okay with high latency, you have an {{rpc_timeout}} of 60 seconds.
* The total amount of memory needed to hold 60 * 50 000 = 3 million in-flight requests is freaking high (see the back-of-the-envelope sketch right after this list).
* The young gen GC pauses happen *much* more frequently than every 60 seconds.
* The result is that when a node goes down, the other nodes in the replica set *massively* increase their promotion rate into old-gen. A cluster whose nodes are normally completely fine, with a slow, benign promotion rate into old-gen, will now behave very differently from normal: while the total allocation rate doesn't change (or not by much; perhaps a little if clients are doing re-tries), the promotion rate into old-gen increases massively.
* This increases the total cost of ownership, and thus demand for CPU resources.
* You will *very* easily see CMS's sweep phase unable to free memory fast enough to keep up with the incoming request rate, even with a hugely inflated heap (CMS sweeping is not parallel, even though marking is).
* This leads to promotion failure/concurrent mode failure, and you fall into a full GC.
* But now, your full GC is effectively stealing CPU resources, since you are forcing all cores but one to sit completely idle.
* Once you come out of GC, you have a huge backlog of work and get bombarded by other nodes that thought it was a good idea to retain 30 seconds' worth of messages in *their* heaps. So you are instantly shot down again by your neighbors, falling into the next full GC cycle even more easily than before.
* Meanwhile, the fact that you are in a full GC causes your neighbors to enter the same predicament.
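
To put a number on "freaking high", here is a back-of-the-envelope sketch; the per-message retained size is an assumed figure for illustration, not a measurement:

{code:java}
public class BacklogEstimate
{
    public static void main(String[] args)
    {
        long requestsPerSecond = 50000;   // per node, from the scenario above
        long rpcTimeoutSeconds = 60;      // rpc_timeout from the scenario above
        long bytesPerMessage   = 2048;    // ASSUMPTION: rough retained size of one queued request

        long inFlightMessages = requestsPerSecond * rpcTimeoutSeconds;  // 3,000,000 messages
        long retainedBytes    = inFlightMessages * bytesPerMessage;     // ~6 GB, almost all of it promoted

        System.out.printf("in-flight messages: %,d%n", inFlightMessages);
        System.out.printf("retained heap at %,d bytes each: ~%.1f GB%n",
                          bytesPerMessage, retainedBytes / 1e9);
    }
}
{code}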

The "solution" to this in production is to rapidly restart all nodes in the replica set. Doing a live-change of RPC timeouts to something very very low might also do the trick.

This is a specific instance of the broader problem that, IMO, we should not be queueing up huge amounts of data in memory in the first place. Just recently I saw a node with *10 million* requests pending.

We need to:

* Have support for more aggressively dropping requests instead of queueing them when sending to other nodes.
* More aggressively drop requests internally; there is very little point in queueing up hundreds of thousands of requests pending for MutationStage or ReadStage, etc. This goes especially for ReadStage, where any response is irrelevant once the timeout has been reached (a sketch of such an expiry check follows this list).
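
As an illustration of the second point, here is a minimal sketch of an expiry check wrapped around a stage task. This is hypothetical code under the assumptions above, not the existing MutationStage/ReadStage implementation, and the class name is made up:

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: drop a stage task, rather than run it, if it has already
// waited longer than its timeout, since the co-ordinator has given up on it anyway.
final class DroppableTask implements Runnable
{
    static final AtomicLong dropped = new AtomicLong();   // visibility into how much we shed

    private final long enqueuedAtNanos = System.nanoTime();
    private final long timeoutMillis;                      // e.g. the rpc_timeout for this verb
    private final Runnable work;

    DroppableTask(long timeoutMillis, Runnable work)
    {
        this.timeoutMillis = timeoutMillis;
        this.work = work;
    }

    public void run()
    {
        long waitedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - enqueuedAtNanos);
        if (waitedMillis > timeoutMillis)
        {
            // Doing the work now only adds allocation, promotion pressure and wasted CPU.
            dropped.incrementAndGet();
            return;
        }
        work.run();
    }
}
{code}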

A complication here is that we *cannot* just drop requests so quickly that we never promote into old-gen. If we dropped outgoing requests that quickly, we would be dropping requests every time another node goes into a young-gen GC. And if we retain requests long enough to ride out other nodes' young-gen GCs, we also retain them long enough for them to be promoted into old-gen on our side (not strictly true given survivor spaces, but we cannot assume we can target that distinction with any accuracy).

A possible alternative is to ask users to be better about using short timeouts, but that probably raises the priority of controlling timeouts on a per-request basis rather than as coarse-grained server-side settings (a sketch of what that could look like follows). Even with shorter timeouts, though, we still need to be careful to drop requests wherever it makes sense, to avoid accumulating more than a timeout's worth of data.
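
As a minimal sketch of the per-request idea, assuming the timeout travels with each request instead of being a single server-wide setting (all names here are hypothetical):

{code:java}
// Hypothetical sketch: a request that carries its own deadline, so both the
// co-ordinator and the replicas can drop it as soon as that deadline has passed.
final class TimedRequest
{
    final byte[] payload;
    final long deadlineMillis;   // absolute deadline derived from the client's own timeout

    TimedRequest(byte[] payload, long timeoutMillis)
    {
        this.payload = payload;
        this.deadlineMillis = System.currentTimeMillis() + timeoutMillis;
    }

    boolean expired()
    {
        return System.currentTimeMillis() > deadlineMillis;
    }
}
{code}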

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CASSANDRA-3853) lower impact on old-gen promotion of slow nodes or connections

Posted by "Peter Schuller (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200917#comment-13200917 ] 

Peter Schuller commented on CASSANDRA-3853:
-------------------------------------------

A possible improvement is to use much larger socket buffers, but that doesn't give a lot of control (you'd have to set the buffer sizes as a function of the total number of nodes in the cluster and the total amount of memory you're willing to let the kernel use for them).

A more difficult but related improvement might be to keep user-level, on-heap but slab-allocated I/O buffers for outgoing requests, where serialized messages can sit without incurring promotion costs.

That still doesn't address pending queues between stages, nor cases where co-ordinator requests have to wait for these requests to complete or time out.
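
A minimal sketch of the slab idea, under the assumption that serialized outgoing messages are copied into a few large, reusable buffers so the many small per-message objects can die young; the class and method names are hypothetical:

{code:java}
import java.nio.ByteBuffer;

// Hypothetical sketch: only the (few, long-lived) slabs ever reach old-gen;
// the per-message byte arrays and objects remain short-lived.
final class OutgoingSlab
{
    private final ByteBuffer slab;

    OutgoingSlab(int slabSizeBytes)
    {
        this.slab = ByteBuffer.allocate(slabSizeBytes);   // one long-lived allocation
    }

    // Returns the offset of the copied message, or -1 if the slab is full
    // (at which point the caller can flush, allocate another slab, or drop).
    synchronized int append(byte[] serializedMessage)
    {
        if (slab.remaining() < serializedMessage.length)
            return -1;
        int offset = slab.position();
        slab.put(serializedMessage);
        return offset;
    }

    // Reset once the connection has drained (or the backlog has been dropped).
    synchronized void clear()
    {
        slab.clear();
    }
}
{code}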
                

[jira] [Commented] (CASSANDRA-3853) lower impact on old-gen promotion of slow nodes or connections

Posted by "Peter Schuller (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204239#comment-13204239 ] 

Peter Schuller commented on CASSANDRA-3853:
-------------------------------------------

For the record, my comment about counters assumes that the client doesn't re-try non-idempotent requests (if it does, it might even be better to drop them than not, if they're past the *client* timeout, which Cassandra doesn't know anything about).
                

[jira] [Commented] (CASSANDRA-3853) lower impact on old-gen promotion of slow nodes or connections

Posted by "Peter Schuller (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204238#comment-13204238 ] 

Peter Schuller commented on CASSANDRA-3853:
-------------------------------------------

We're running with CASSANDRA-3005, but request rate * rpc_timeout is still very significant for high-throughput, high-timeout clusters.

And yes, whenever we can easily determine that a request will time out, I agree that's a very good point at which to drop it. I also agree that prematurely dropping requests is a sensitive thing to do, especially for counters (which are non-idempotent on writes).

In fact, one of the cases where we care about this is exactly that: a high-throughput, write-dominated counter cluster with high timeouts.

But while the request from the co-ordinator to the counter leader should preferably not time out, dropping requests to the other replicas is much more reasonable (in preference to falling over due to GC). It's kind of a tricky situation.

The problem is most severe within a replica set. A single saturated node whose pending requests are spread out over all co-ordinators is not very impactful (if the cluster is not small); but a single replica accumulating on-heap, promoted-to-old-gen state for most or all of its requests, because a neighbor in its replica set is saturated, is a significant problem.

The other case is when a node just internally racks up pending work in its stages, which is presumably an easier fix, if not fixed already on 1.1/trunk. I will try to look at this some more "soonish" when stress testing 1.1.

                

[jira] [Issue Comment Edited] (CASSANDRA-3853) lower impact on old-gen promotion of slow nodes or connections

Posted by "Vijay (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211501#comment-13211501 ] 

Vijay edited comment on CASSANDRA-3853 at 2/19/12 7:33 PM:
-----------------------------------------------------------

Another thing we can do is drop the references in the co-ordinator once we have executed/sent the query to the remote node; that way we avoid promotion of those objects. We don't use this command for retries etc. (that is done by the client), hence we don't need to hold these references. Once the references to the query are dropped, even with an rpc timeout of 10 seconds we hold references to only a few small objects (a sketch of this follows below).

Bonus: if we convert the table name and column family name to byte buffers and reuse references, we will save some memory there too.
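
A minimal sketch of that idea, assuming the co-ordinator keys a small callback record by message id and lets the full command become garbage once it has been serialized and sent; all names are hypothetical:

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: after serialization, retain only this small record per
// in-flight request instead of the full command object.
final class PendingCallbacks
{
    static final class Callback
    {
        final long startedAtMillis = System.currentTimeMillis();
        // intentionally no reference to the original command/mutation
    }

    private final ConcurrentMap<Integer, Callback> pending = new ConcurrentHashMap<Integer, Callback>();

    void register(int messageId)
    {
        pending.put(messageId, new Callback());
    }

    Callback complete(int messageId)
    {
        return pending.remove(messageId);   // response arrived or timed out; drop the record
    }
}
{code}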
                

[jira] [Commented] (CASSANDRA-3853) lower impact on old-gen promotion of slow nodes or connections

Posted by "Jonathan Ellis (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204234#comment-13204234 ] 

Jonathan Ellis commented on CASSANDRA-3853:
-------------------------------------------

bq. Have support for more aggressively dropping requests instead of queueing them when sending to other nodes.

CASSANDRA-3005 makes some progress here.

bq. More aggressively drop requests internally

The "right" thing to do is to drop requests if we know they *will* timeout, whereas right now we only drop them if they already *have* timed out.  The former would mean including "expected query latency" in our threshold, minus maybe 10% (most users are fairly allergic to dropping requests unnecessarily).
                

[jira] [Commented] (CASSANDRA-3853) lower impact on old-gen promotion of slow nodes or connections

Posted by "Vijay (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211501#comment-13211501 ] 

Vijay commented on CASSANDRA-3853:
----------------------------------

Another thing we can do is drop the references in the co-ordinator once we have executed/sent the query to the remote node; this way we can avoid promotion of those objects. At least, we don't use this command for retries etc. (that is done by the client), hence we don't need to hold the references. Once the references to the query are dropped, even if the rpc timeout is 10 seconds we hold references to very few objects.

Bonus: if we convert the table name and column family name to byte buffers and reuse references, we will save some memory there too.
                

[jira] [Commented] (CASSANDRA-3853) lower impact on old-gen promotion of slow nodes or connections

Posted by "Peter Schuller (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204240#comment-13204240 ] 

Peter Schuller commented on CASSANDRA-3853:
-------------------------------------------

(But the correct, IMO, behavior of a client is to not re-try, so that case is a non-issue.)
                