Posted to commits@cassandra.apache.org by "Jason Brown (JIRA)" <ji...@apache.org> on 2014/04/14 20:09:20 UTC

[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

     [ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Brown updated CASSANDRA-4718:
-----------------------------------

    Attachment: 4718-v1.patch
                v1-stress.out

After a several-month hiatus, I'm digging into this again. This time around, though, I have real hardware to test on, and thus the results are more consistent across executions (no cloud provider intermittently throttling me).

Testing the thrift interface (both sync and hsha), the short story is that throughput is up ~20% vs. TPE, and the 95th/99th percentiles are down 40-60%. The 99.9th percentile, however, is a bit trickier. In some of my tests it is down almost 80%, and sometimes it is up 40-50%. I need to dig in further to understand what is going on (not sure if it's because of a shared env, reading across NUMA nodes, and so on). perf and likwid are my friends in this investigation.

As to testing the native protocol interface, I've only tested writes (the new 2.1 stress seems broken on reads), and I get double the throughput and 40-50% lower latencies across the board.

My test cluster consists of three machines, each with 32 cores across 2 sockets (2 NUMA nodes), 132G memory, and a 2.6.39 kernel, plus a similar box that generates the load.

A couple of notes about this patch:
* RequestThreadPoolExecutor now decorates a FJP. Previously we had a TPE, which of course contains a (bounded) queue; that bounded queue provided back pressure on incoming requests. With a FJP there is no queue to provide back pressure, as the FJP always enqueues a task without blocking. Not sure if we still want/need that back pressure here.
* As ForkJoinPool doesn't expose much in the way of usage metrics (like total completed tasks) compared to ThreadPoolExecutor, ForkJoinPoolMetrics is similarly barren. Not sure if we want to capture this on our own in DFJP or somewhere else.
* I have made similar FJP changes to the disruptor-thrift library, and once this patch is committed, I’ll work with Pavel to make the changes over there and pull in the updated jar.
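To make the back-pressure question above concrete, here is a minimal sketch (not from the attached patch; class and method names are hypothetical) of one way to recover bounded-queue-style back pressure when decorating a ForkJoinPool: gate submissions with a Semaphore sized like the old TPE queue bound, so callers block once too many tasks are in flight.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: restore back pressure in front of a ForkJoinPool
// by limiting the number of in-flight tasks with a semaphore, roughly
// mimicking the blocking behavior of a bounded TPE queue.
public class BoundedFjpSketch {
    private final ForkJoinPool pool = new ForkJoinPool();
    private final Semaphore permits;

    public BoundedFjpSketch(int maxPending) {
        this.permits = new Semaphore(maxPending);
    }

    // Blocks the caller when maxPending tasks are already in flight.
    public void submit(Runnable task) throws InterruptedException {
        permits.acquire();
        pool.execute(() -> {
            try {
                task.run();
            } finally {
                permits.release(); // free a slot once the task completes
            }
        });
    }

    public void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        BoundedFjpSketch exec = new BoundedFjpSketch(4);
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < 100; i++) {
            exec.submit(done::incrementAndGet);
        }
        exec.shutdown();
        System.out.println(done.get()); // 100
    }
}
```

The trade-off is that blocked submitter threads tie up the caller, exactly the behavior a bounded queue gave us; whether that is desirable at the request stage is the open question above.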

As a side note, the quasar project (http://docs.paralleluniverse.co/quasar/) indicates the jsr166e jar has some optimizations (http://blog.paralleluniverse.co/2013/05/02/quasar-pulsar/) over the jdk7 implementation (optimizations that are included in jdk8). I pulled in those changes and stress tested, but didn't see much of a difference for our use case. I can, however, pull them in again if anyone feels strongly.


> More-efficient ExecutorService for improved throughput
> ------------------------------------------------------
>
>                 Key: CASSANDRA-4718
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Jason Brown
>            Priority: Minor
>              Labels: performance
>             Fix For: 2.1
>
>         Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op costs of various queues.ods, stress op rate with various queues.ods, v1-stress.out
>
>
> Currently all our execution stages dequeue tasks one at a time.  This can result in contention between producers and consumers (although we do our best to minimize this by using LinkedBlockingQueue).
> One approach to mitigating this would be to make consumer threads do more work in "bulk" instead of just one task per dequeue.  (Producer threads tend to be single-task oriented by nature, so I don't see an equivalent opportunity there.)
> BlockingQueue has a drainTo(collection, int) method that would be perfect for this.  However, no ExecutorService in the jdk supports using drainTo, nor could I google one.
> What I would like to do here is create just such a beast and wire it into (at least) the write and read stages.  (Other possible candidates for such an optimization, such as the CommitLog and OutboundTCPConnection, are not ExecutorService-based and will need to be one-offs.)
> AbstractExecutorService may be useful.  The implementations of ICommitLogExecutorService may also be useful. (Despite the name these are not actual ExecutorServices, although they share the most important properties of one.)
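The bulk-dequeue idea in the description above can be sketched as follows. This is illustrative only (names are hypothetical, not from any attached patch): a consumer thread uses BlockingQueue.drainTo(collection, int) to claim a batch of tasks per queue operation instead of contending on the queue once per task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of a draining consumer: each pass claims up to 64
// tasks in one drainTo call, reducing producer/consumer contention
// compared with dequeuing one task at a time.
public class DrainingConsumerSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        AtomicInteger executed = new AtomicInteger();
        for (int i = 0; i < 1000; i++) {
            queue.add(executed::incrementAndGet);
        }

        Thread consumer = new Thread(() -> {
            List<Runnable> batch = new ArrayList<>(64);
            while (true) {
                batch.clear();
                // One queue operation claims a whole batch of tasks.
                int n = queue.drainTo(batch, 64);
                if (n == 0) break; // queue empty: stop for this demo
                for (Runnable r : batch) {
                    r.run();
                }
            }
        });
        consumer.start();
        consumer.join();
        System.out.println(executed.get()); // 1000
    }
}
```

A real executor built on this would park the consumer (e.g. via take()) when drainTo returns zero rather than exiting, but the batching step is the part drainTo enables.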



--
This message was sent by Atlassian JIRA
(v6.2#6252)