Posted to commits@cassandra.apache.org by "Ariel Weisberg (JIRA)" <ji...@apache.org> on 2015/01/06 18:06:35 UTC

[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService

    [ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266402#comment-14266402 ] 

Ariel Weisberg edited comment on CASSANDRA-8457 at 1/6/15 5:05 PM:
-------------------------------------------------------------------

I think I stumbled onto what is going on based on Benedict's suggestion to disable TCP no delay. It looks like there is a small-message performance issue.

This is something I have seen before in EC2, where you can only send a surprisingly small number of messages in/out of a VM. I don't have the numbers from when I micro-benchmarked it, but it is something like 450k messages with TCP no delay and a million or so without. Adding more sockets helps, but it doesn't even double the number of messages in/out. Throwing more cores at the problem doesn't help either; you just end up with underutilized cores, which matches the mysterious levels of starvation I was seeing in C* even though I was exposing sufficient concurrency.
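
For context, TCP no delay here is the standard socket option that disables Nagle's algorithm, so every small write tends to go out as its own packet. A minimal illustration (my own, not Cassandra's actual code path) of toggling it on a plain java.net.Socket; the peer address is a placeholder:
{code:java}
import java.io.IOException;
import java.net.Socket;

public class NoDelayExample
{
    public static void main(String[] args) throws IOException
    {
        // Placeholder peer address; 7000 is the default internode (storage) port
        try (Socket socket = new Socket("10.0.0.1", 7000))
        {
            // true  => TCP_NODELAY on: each small write is sent immediately as its own packet
            // false => TCP_NODELAY off: the kernel (Nagle) may batch small writes into fewer packets
            socket.setTcpNoDelay(false);
            socket.getOutputStream().write(new byte[64]); // one small "message"
        }
    }
}
{code}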

14 server nodes. 6 client nodes. 500 threads per client. Servers started with:
        "row_cache_size_in_mb" : "2000",
        "key_cache_size_in_mb":"500",
        "rpc_max_threads" : "1024",
        "rpc_min_threads" : "16",
        "native_transport_max_threads" : "1024"
8 GB old gen, 2 GB new gen.

Clients running at CL=ALL with the same schema I have been using throughout this ticket.

With TCP no delay off
First set of runs:
390264
387958
392322
After replacing 10 instances:
366579
365818
378221

With TCP no delay on
162987

I modified trunk to fix a bug batching messages and to add a configurable window for coalescing multiple messages into a single socket write; see https://github.com/aweisberg/cassandra/compare/f733996...49c6609. A sketch of the coalescing idea follows the table below.
||Coalesce window (microseconds)||Throughput||
|250| 502614|
|200| 496206|
|150| 487195|
|100| 423415|
|50| 326648|
|25| 308175|
|12| 292894|
|6| 268456|
|0| 153688|
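
To make the coalescing idea concrete, here is a minimal sketch (my own illustration, not the code in the branch above) of draining an outbound queue and waiting up to a configurable window for more messages before doing one socket write. The queue and message representation are placeholders:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

final class CoalescingWriter
{
    private final BlockingQueue<byte[]> outbound; // queued serialized messages (placeholder)
    private final long windowMicros;              // configurable coalescing window

    CoalescingWriter(BlockingQueue<byte[]> outbound, long windowMicros)
    {
        this.outbound = outbound;
        this.windowMicros = windowMicros;
    }

    /** Blocks for one message, then coalesces whatever else arrives within the window. */
    List<byte[]> nextBatch() throws InterruptedException
    {
        List<byte[]> batch = new ArrayList<>();
        batch.add(outbound.take()); // wait for the first message
        long deadline = System.nanoTime() + TimeUnit.MICROSECONDS.toNanos(windowMicros);
        long remaining;
        while ((remaining = deadline - System.nanoTime()) > 0)
        {
            byte[] next = outbound.poll(remaining, TimeUnit.NANOSECONDS);
            if (next == null)
                break;      // window expired with nothing more to send
            batch.add(next);
        }
        return batch;       // caller writes the whole batch in one socket write
    }
}
{code}
With a 0 microsecond window this degenerates to one socket write per message, which lines up with the bottom row of the table.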

I did not expect to get mileage out of coalescing at the application level, but it works extremely well. CPU utilization is still low at 1800%. There seems to be little correlation between CPU utilization and throughput: as I vary the coalescing window, throughput changes dramatically but CPU utilization doesn't move with it. I do see that core 0 looks pretty saturated and is only 10% idle. That might be the next bottleneck, or even the actual one.

What role this optimization plays at different cluster sizes is an important question. There has to be a tipping point where coalescing stops working because not enough packets go to each endpoint at the same time. With vnodes it wouldn't be unusual to be communicating with a large number of other hosts, right?
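
Some rough arithmetic, purely for illustration: if a node is sending, say, 500k internode messages/sec spread evenly over the other 13 nodes, a 200 microsecond window sees about 500,000 * 0.0002 / 13 ≈ 7-8 messages per connection per window, which is plenty to coalesce. Spread the same traffic over, say, 100 peers and that drops to roughly one message per window, at which point waiting mostly just adds latency.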

It also takes a significant amount of additional latency to get the mileage at high levels of throughput, but at lower concurrency there is no benefit and it will probably show up as decreased throughput. That makes it tough to crank the window up as a default: either it is adaptive, or most people don't get the benefit.

At high levels of throughput it is a clear latency win: latency is much lower for individual requests on average. Making this a config option is viable as a starting point, possibly with a separate option for local vs. remote DC coalescing. Ideally we could make it adapt to the workload.
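
To make the "adapt to the workload" idea a bit more concrete, here is one possible shape for it (purely a sketch, not what the branch above does): track a smoothed per-connection arrival rate and only apply the window when enough messages are expected to arrive within it. The class, thresholds, and rate estimator are made up for illustration:
{code:java}
final class AdaptiveCoalescePolicy
{
    private final long windowMicros;            // maximum window when coalescing is worthwhile
    private final double minMessagesPerWindow;  // e.g. require ~2+ expected messages before waiting
    private double messagesPerSecond;           // exponentially weighted estimate of arrival rate
    private long lastArrivalNanos = System.nanoTime();

    AdaptiveCoalescePolicy(long windowMicros, double minMessagesPerWindow)
    {
        this.windowMicros = windowMicros;
        this.minMessagesPerWindow = minMessagesPerWindow;
    }

    /** Call once per outbound message to update the rate estimate. */
    void recordArrival()
    {
        long now = System.nanoTime();
        double instantRate = 1e9 / Math.max(1, now - lastArrivalNanos);
        messagesPerSecond = 0.9 * messagesPerSecond + 0.1 * instantRate; // smooth the estimate
        lastArrivalNanos = now;
    }

    /** Returns the window to wait (in microseconds), or 0 to flush immediately. */
    long currentWindowMicros()
    {
        double expectedPerWindow = messagesPerSecond * windowMicros / 1_000_000d;
        return expectedPerWindow >= minMessagesPerWindow ? windowMicros : 0;
    }
}
{code}
Something along these lines would let busy links keep the throughput win while lightly loaded links pay no latency penalty.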

I am going to chase down what impact coalescing has at lower levels of concurrency so we can quantify the cost of turning it on. I'm also going to try to get to the bottom of why all interrupts are going to core 0. Maybe that is the real problem and coalescing is just a band-aid to get more throughput.


> nio MessagingService
> --------------------
>
>                 Key: CASSANDRA-8457
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8457
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Ariel Weisberg
>              Labels: performance
>             Fix For: 3.0
>
>
> Thread-per-peer (actually two per peer: one incoming and one outbound) is a big contributor to context switching, especially for larger clusters.  Let's look at switching to nio, possibly via Netty.


