You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Tom Whitmore (Jira)" <ji...@apache.org> on 2021/03/08 22:27:00 UTC
[jira] [Comment Edited] (CASSANDRA-16499) single-threaded write workloads can spend ~70% time waiting on SEPExecutor

    [ https://issues.apache.org/jira/browse/CASSANDRA-16499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17297728#comment-17297728 ] 

Tom Whitmore edited comment on CASSANDRA-16499 at 3/8/21, 10:26 PM:
--------------------------------------------------------------------

Hi [~benedict], [~clohfink],

Using just 'Cassandra Stress' I can produce anomalous throughput & latency indicators comparing single-thread write workload with 10-thread write workload against Cassandra 4 beta 4.

See output from three runs (single-threaded, 10 threads, back to single-threaded again) of cassandra-stress.

[^Stress Write 2 sgl-thread vs 10 threads -- 01.txt]

Key measurements:

 
||Run||Op rate||Latency mean||Latency p95||Total time||
|Run 1 - Single threaded|320 op/s|3.1 ms|4.1 ms|0:31|
|Run 2 - 10 threads|5,427 op/s|1.8 ms|2.1 ms|0:18|
|Run 3 - Single threaded|333 op/s|2.9 ms|4.0 ms|0:29|

Key observations:
 * 10 threads give 16x more throughput than a single-thread.
 * latency mean is 38% lower (1.8 ms) for 10 threads than single-threaded (2.9 ms), and latency p95 is 47% lower
 * overall test time is 38% shorter for 10 thread test compared with single threaded test, even though 10x more data is written in 10-thread test.
 * warmup is not a factor, as single-threaded throughput & latency remain poor in run 3 after multi-threaded test in run 2.

Testing this requires running just two 'cassandra-stress' tests against a single-node local cluster.
 * cassandra-stress.bat write n=10000 -rate threads=1
 * cassandra-stress.bat write n=100000 -rate threads=10

Are you able to replicate any of these observations?


was (Author: tomw_nz):
Hi [~benedict], [~clohfink],

Using just 'Cassandra Stress' I can produce anomalous throughput & latency indicators comparing single-thread write workload with 10-thread write workload against Cassandra 4 beta 4.

See output from three runs (single-threaded, 10 threads, back to single-threaded again) of cassandra-stress.

[^Stress Write 2 sgl-thread vs 10 threads -- 01.txt]

Key measurements:

 
||Run||Op rate||Latency mean||Latency p95||Total time||
|Run 1 - Single threaded|320 op/s|3.1 ms|4.1 ms|0:31|
|Run 2 - 10 threads|5,427 op/s|1.8 ms|2.1 ms|0:18|
|Run 3 - Single threaded|333 op/s|2.9 ms|4.0 ms|0:29|

Key observations:

 
 * 10 threads give 16x more throughput than a single-thread.
 * latency mean is 38% lower (1.8 ms) for 10 threads than single-threaded (2.9 ms), and latency p95 is 47% lower
 * overall test time is 38% shorter for 10 thread test compared with single threaded test, even though 10x more data is written in 10-thread test.
 * warmup is not a factor, as single-threaded throughput & latency remain poor in run 3 after multi-threaded test in run 2.

Testing this requires running just two 'cassandra-stress' tests against a single-node local cluster.
 * cassandra-stress.bat write n=10000 -rate threads=1
 * cassandra-stress.bat write n=100000 -rate threads=10

Are you able to replicate any of these observations?

> single-threaded write workloads can spend ~70% time waiting on SEPExecutor
> --------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16499
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16499
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Tom Whitmore
>            Priority: Normal
>         Attachments: Cassandra Write trace 5;  warmed up -- 02.txt, Single-thread Latencies report -- 01.xlsx, Stress Write 2 sgl-thread vs 10 threads -- 01.txt, Stress Write sgl-thread 1 -- 01.txt, Stress Write trace 1.txt, tracing & experimental change patches.zip
>
>
> Hi all! While conducting benchmarking of Cassandra against other databases for a particular healthcare solution, I found some surprising anomalies in single-threaded write performance. 
> Analysis & tracing suggest there might be some inefficiencies in inter-thread execution in Cassandra;
>  * Tracing showed an average delay of 1.52 ms between StorageProxy.performLocally() being called, and the LocalMutationRunnable actually executing.
>  * Total operation time averaged 2.06 ms (measured at Message.Dispatcher processRequest()). This suggested ~72% of the +total operation time+ being lost waiting for thread scheduling in SEPExecutor.
>  * With I tested with multiple threads, performance with 10 threads was 27x higher. This supports a hypothesis that scheduling delays may be hindering single-threaded progress.
>  * Transaction throughput for Cassandra with a single-threaded workload, measured far lower than that of PostgreSQL on the same hardware. Postgres achieved ~200k committed transactions/ minute including fsync; Cassandra achieves ~37k per minute. Given they are both essentially writing to a commit log, it may be informative to understand why the differences are arising.
> Cassandra's architecture seems in theory like it might be aligned for our usecase, given the Commit Log and Log Structured Merge design. Some of our customers have configurations posing high single-threaded loads. Therefore I spent some time trying to understand why efficiency for such loads seemed less than expected.
> My investigation so far:
>  * benchmarked Cassandra 3.11.10
>  * stack-dumped it under load & identified a pattern of threads waiting in AbstractWriteResponseHandler while nothing else is busy
>  * checked out Cassandra 3.11.10 source, built it, debugged  & stepped through key areas to try and understand behavior.
>  * instrumented key areas with custom tracing code & timestamps to 0.01 millisecond.
>  ** _see patch attached._
>  * benchmarked Cassandra 4 beta 4 & verified delays also present
>  * shown & traced delays with my healthcare scenario benchmark
>  * shown & traced delays with the +Cassandra stress-test+ tool.
> The configuration was:
>  * single-node Cassandra running locally, on a recent Dell laptop with NVmE SSD.
>  * for the healthcare scenario:
>  ** Java client app running 1 or 10 threads;
>  ** trialled LOCAL_ONE and ANY consistency levels;
>  ** trialled unbatched, BatchType.UNLOGGED and BatchType.LOGGED.
>  * for 'cassandra-stress':
>  ** cassandra-stress.bat write n=10000 -rate threads=1
> Without deeply understanding the code, I have considered a couple of possible areas/ ideas as to improvement. I worked on the 3.11.10 codebase. I'd be interested to understand whether or not these might be sound or not; note that neither achieves as much improvement as might theoretically be hoped for.
> My investigations based on the key observation of large delays between StorageProxy.performLocally() being invoked and the LocalMutationRunnable actually executing, for single-threaded workloads.
> What I looked at:
>  * Without fully understanding SEPExecutor.takeWorkPermit() – it answers true to execute immediately, false if scheduled – for this workload scheduling seemed slow.
>  ** takeWorkPermit() answers false if no work permits are available.
>  ** I noticed takeWorkPermit() also answers false if no task permits are available, +even if no task permit need be taken.+
>  ** by changing this condition I was able to gain +45% performance.
>  * Without deeply understanding SEP Executor/ Worker or sleep algorithms, I noted that in a single-thread workload SEPWorkers would likely spin & be put to sleep for a fairly extended period after completing each task.
>  ** I did wonder if the relatively-long park times (average of 500,000 nanos = 0.5ms) could contribute to threads being more aggressively de-scheduled.
>  **  an experiment in keeping 1 SEPWorker awake (not sleeping at all) gained +7.9% performance.
>  * Without deeply understanding SEP Executor/ Worker, I feel there may be more questions around how SEP Workers are brought out of SPINNING/ sleep state and whether this logic functions promptly & correctly.
>  ** At a very initial stage of investigation: +SEPWorker.assign() unparks threads when transitioning out of STOPPED state, but code appears potentially not to unpark threads coming out of SPINNING state.+
>  ** _This is a very cursory "looking at the code" & initial debugging stage, but I'm not certain it's accurate._ _Attempted experiments to unpark for the SPINNING -> Work transition so far_ _have_ _caused lockups of 100% machine CPU use or dropped messages, rather than helping anything._
>  ** _If & when I can find out more, I'll post it here._
> I will post the tracing code & traces I captured, and welcome some feedback and thoughts on these performance questions from the Cassandra dev community. Thanks all!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org