Posted to user@cassandra.apache.org by Peter Fales <Pe...@alcatel-lucent.com> on 2011/12/19 20:54:43 UTC

Cassandra stress test and max vs. average read/write latency.

Has anyone looked much at the maximum latency of cassandra read/write
requests?  (rather than the average latency and average throughput)

We've been struggling for quite some time trying to figure out why we
see occasional read or write response times in the 100s of milliseconds,
even on fast machines that normally respond in just a few milliseconds.
We've spent a lot of time trying to tune our benchmark
and Cassandra configurations to lower these maximum times.   There are
a lot of things that make it a little better, or a little worse, but we've
found it nearly impossible to eliminate these "outliers" completely.

A lot of our initial testing was done with a home-grown benchmark test
written in C++ and using the thrift interface.   However, now that we've
recently upgraded from 0.6.8 to 1.0.3, I've been able to do some
testing using the official java "stress" tool.   The problem, at least
for this purpose, is that the stress tool only reports *average*
response times over the measurement intervals.    This effectively
hides large values if they are infrequent relative to the measurement
interval.   I've modified the stress test so that it also tracks the
maximum latency reported over each measurement interval.

Here is an excerpt from a typical result:

$ bin/stress -d XXX -p 7777 -e QUORUM  -t 4  -i 1  -l 3 -c 1 -n 400000
total interval_op_rate interval_key_rate avg_latency elapsed_time max(millisec)
5780 5780 5780 6.098615916955017E-4 1 13
13837 8057 8057 5.003102891895247E-4 2 4
22729 8892 8892 4.7199730094466935E-4 3 4
31840 9111 9111 4.6504225661288555E-4 4 1
40925 9085 9085 4.6846450192625206E-4 5 1
49076 8151 8151 5.200588884799411E-4 6 100
...
3186625 8886 8886 4.786180508665316E-4 411 10
3195626 9001 9001 4.705032774136207E-4 412 1
3204574 8948 8948 4.710549843540456E-4 414 1
3213524 8950 8950 4.7195530726256986E-4 415 1
3217534 4010 4010 0.0010763092269326683 416 607
3226560 9026 9026 4.695324617770884E-4 417 1
3235425 8865 8865 4.7805978567399887E-4 418 1
3244177 8752 8752 4.848034734917733E-4 419 10
...


My patch adds the final column, which logs the maximum response time
over the one-second interval.  In most cases the average response time
is under 1 msec, and though the maximum might be a bit larger, it's still
just a few milliseconds - usually under 10 msec.   But sometimes (like
interval 416) one of the responses took 607 milliseconds.

These numbers aren't too bad if you are supporting an interactive 
application and don't mind a slightly slower response now and then as
long as the average stays low and throughput stays high.  But for 
other types of applications, these slow responses might be a problem.

I'm trying to understand if this is expected or not, and if there is
anything we can do about it.   I'd also be interested in hearing from folks
who have run the stress test against their own Cassandra clusters.

If anyone wants to try this, I've included my patch to the stress test
below.   So far, I've only instrumented the default "insert" operation.
(It also adjusts the output to separate fields with spaces instead of
commas; I find that easier to read, and it's friendlier to gnuplot.)

diff -ur stress/src/org/apache/cassandra/stress/operations/Inserter.java /home/psfales/MME/apache-cassandra-1.0.3-src/tools/stress/src/org/apache/cassandra/stress/operations/Inserter.java
--- stress/src/org/apache/cassandra/stress/operations/Inserter.java	2011-11-15 02:57:23.000000000 -0600
+++ /home/psfales/MME/apache-cassandra-1.0.3-src/tools/stress/src/org/apache/cassandra/stress/operations/Inserter.java	2011-12-09 08:58:04.000000000 -0600
@@ -108,6 +108,13 @@
         session.operations.getAndIncrement();
         session.keys.getAndIncrement();
         session.latency.getAndAdd(System.currentTimeMillis() - start);
+
+        // Track the interval maximum; the CAS loop keeps concurrent
+        // consumer threads from overwriting a larger value.
+        long delta = System.currentTimeMillis() - start;
+        long current = session.maxlatency.get();
+        while (delta > current && !session.maxlatency.compareAndSet(current, delta))
+            current = session.maxlatency.get();
     }
 
     private Map<String, List<Mutation>> getSuperColumnsMutationMap(List<SuperColumn> superColumns)
diff -ur stress/src/org/apache/cassandra/stress/Session.java /home/psfales/MME/apache-cassandra-1.0.3-src/tools/stress/src/org/apache/cassandra/stress/Session.java
--- stress/src/org/apache/cassandra/stress/Session.java	2011-11-15 02:57:23.000000000 -0600
+++ /home/psfales/MME/apache-cassandra-1.0.3-src/tools/stress/src/org/apache/cassandra/stress/Session.java	2011-12-09 08:48:45.000000000 -0600
@@ -53,6 +53,7 @@
     public final AtomicInteger operations;
     public final AtomicInteger keys;
     public final AtomicLong    latency;
+    public final AtomicLong    maxlatency;
 
     static
     {
@@ -337,6 +338,7 @@
         operations = new AtomicInteger();
         keys = new AtomicInteger();
         latency = new AtomicLong();
+        maxlatency = new AtomicLong();
     }
 
     public int getCardinality()
diff -ur stress/src/org/apache/cassandra/stress/StressAction.java /home/psfales/MME/apache-cassandra-1.0.3-src/tools/stress/src/org/apache/cassandra/stress/StressAction.java
--- stress/src/org/apache/cassandra/stress/StressAction.java	2011-11-15 02:57:23.000000000 -0600
+++ /home/psfales/MME/apache-cassandra-1.0.3-src/tools/stress/src/org/apache/cassandra/stress/StressAction.java	2011-12-09 09:00:41.000000000 -0600
@@ -55,8 +55,7 @@
         int threadCount = client.getThreads();
         Consumer[] consumers = new Consumer[threadCount];
 
-        output.println("total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time");
-
+        output.println("total interval_op_rate interval_key_rate avg_latency elapsed_time max(millisec)");
         int itemsPerThread = client.getKeysPerThread();
         int modulo = client.getNumKeys() % threadCount;
 
@@ -129,11 +128,12 @@
                 int opDelta = total - oldTotal;
                 int keyDelta = keyCount - oldKeyCount;
                 double latencyDelta = latency - oldLatency;
-
+                // Read and reset the per-interval maximum in one atomic step
+                long maxdelta = client.maxlatency.getAndSet(0);
                 long currentTimeInSeconds = (System.currentTimeMillis() - testStartTime) / 1000;
                 String formattedDelta = (opDelta > 0) ? Double.toString(latencyDelta / (opDelta * 1000)) : "NaN";

-                output.println(String.format("%d,%d,%d,%s,%d", total, opDelta / interval, keyDelta / interval, formattedDelta, currentTimeInSeconds));
+                output.println(String.format("%d %d %d %s %d %d", total, opDelta / interval, keyDelta / interval, formattedDelta, currentTimeInSeconds, maxdelta));
             }
         }
 


-- 
Peter Fales
Alcatel-Lucent
Member of Technical Staff
1960 Lucent Lane
Room: 9H-505
Naperville, IL 60566-7033
Email: Peter.Fales@alcatel-lucent.com
Phone: 630 979 8031

Re: Cassandra stress test and max vs. average read/write latency.

Posted by Peter Schuller <pe...@infidyne.com>.
> Thanks for your input.  Can you tell me more about what we should be
> looking for in the gc log?   We've already got the gc logging turned
> on, and we've already done the plotting to show that in most
> cases the outliers are happening periodically (with a period of
> 10s of seconds to a few minutes, depending on load and tuning).

Are you measuring writes or reads? If writes,
https://issues.apache.org/jira/browse/CASSANDRA-1991 is still relevant
I think (sorry no progress from my end on that one). Also, I/O
scheduling issues can easily cause problems with the commit log
latency (on fsync()). Try switching to periodic commit log mode and
see if it helps, just to eliminate that (if you're not already in
periodic; if so, try upping the interval).
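
Concretely, in cassandra.yaml (a sketch only; the 10-second period shown
is illustrative, not a recommendation):

    # batch mode makes each write wait for the commit log fsync(), so
    # fsync latency shows up directly in client response times;
    # periodic mode fsyncs in the background every N milliseconds
    commitlog_sync: periodic
    commitlog_sync_period_in_ms: 10000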

For reads, I am generally unaware of much aside from GC and legitimate
"jitter" (scheduling/disk I/O, etc.) that would generate outliers. At
least none that I can think of offhand...

And w.r.t. the GC log - yeah, correlating in time is one thing.
Another thing is to confirm what kind of GC pauses you're seeing.
Generally you want to be seeing lots of short ParNew pauses,
and those are tweakable by changing the young generation size. The
other thing is to make sure CMS is not failing (promotion
failure/concurrent mode failure) and falling back to a stop-the-world
serial compacting GC of the entire heap.
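
For example, against whatever GC log you configured with -Xloggc:

    grep -c ParNew gc.log
    grep -E 'promotion failed|concurrent mode failure' gc.log

The first should count many short collections; any hits from the second
mean CMS is failing and falling back to full stop-the-world collections.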

You might also use -XX:+PrintGCApplicationStoppedTime to get a more
obvious and greppable report of the total time spent in each pause,
regardless of "type"/cause.

> I've tried to correlate the times of the outliers with messages either
> in the system log or the gc log.   There seems to be some (but not
> complete) correlation between the outliers and system log messages about
> memtable flushing.   I cannot find anything in the gc log that
> seems to be an obvious problem, or that matches up with the
> times of the outliers.

And these are still the very extreme (500+ ms and such) outliers that
you're seeing w/o GC correlation? Off the top of my head, that seems
very unexpected (assuming a non-saturated system) and would definitely
invite investigation IMO.

If you're willing to start iterating with the source code, I'd start
bisecting down the call stack to see where the time is going.
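
A crude pattern for that (a hypothetical helper, not something that
exists in the Cassandra source):

    // Wrap successively deeper calls with this until the slow step
    // is localized.
    static <T> T timed(String label, java.util.concurrent.Callable<T> step)
            throws Exception
    {
        long t0 = System.nanoTime();
        try { return step.call(); }
        finally
        {
            long ms = (System.nanoTime() - t0) / 1000000L;
            if (ms > 100) // only report the outliers
                System.err.println(label + " took " + ms + " ms");
        }
    }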

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)

Re: Cassandra stress test and max vs. average read/write latency.

Posted by Peter Fales <Pe...@alcatel-lucent.com>.
Peter,

Thanks for your input.  Can you tell me more about what we should be
looking for in the gc log?   We've already got the gc logging turned
on, and we've already done the plotting to show that in most
cases the outliers are happening periodically (with a period of
10s of seconds to a few minutes, depending on load and tuning).

I've tried to correlate the times of the outliers with messages either
in the system log or the gc log.   There seems to be some (but not
complete) correlation between the outliers and system log messages about
memtable flushing.   I cannot find anything in the gc log that
seems to be an obvious problem, or that matches up with the
times of the outliers.


On Mon, Dec 19, 2011 at 09:47:32PM -0800, Peter Schuller wrote:
> > I'm trying to understand if this is expected or not, and if there is
> 
> Without careful tuning, outliers around a couple of hundred ms are
> definitely expected in general (not *necessarily*, depending on
> workload) as a result of garbage collection pauses. The impact will be
> worsened a bit if you are running under high CPU load (or even maxing
> it out with stress) because post-pause, if you are close to max CPU
> usage you will take considerably longer to "catch up".
> 
> Personally, I would just log each response time and feed it to gnuplot
> or something. It should be pretty obvious whether or not the latencies
> are due to periodic pauses.
> 
> If you are concerned with eliminating or reducing outliers, I would:
> 
> (1) Make sure that when you're benchmarking, that you're putting
> Cassandra under a reasonable amount of load. Latency benchmarks are
> usually useless if you're benchmarking against a saturated system. At
> least, start by achieving your latency goals at 25% or less CPU usage,
> and then go from there if you want to up it.
> 
> (2) One can affect GC pauses, but it's non-trivial to eliminate the
> problem completely. For example, the length of frequent young-gen
> pauses can typically be decreased by decreasing the size of the young
> generation, leading to more frequent shorter GC pauses. But that
> instead causes more promotion into the old generation, which will
> result in more frequent very long pauses (relative to normal; they
> would still be infrequent relative to young gen pauses) - IF your
> workload is such that you are suffering from fragmentation and
> eventually seeing Cassandra fall back to full compacting GCs
> (stop-the-world) for the old generation.
> 
> I would start by adjusting young gen so that your frequent pauses are
> at an acceptable level, and then see whether or not you can sustain
> that in terms of old-gen.
> 
> Start with this in any case: Run Cassandra with -XX:+PrintGC
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
> 
> -- 
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)

-- 
Peter Fales
Alcatel-Lucent
Member of Technical Staff
1960 Lucent Lane
Room: 9H-505
Naperville, IL 60566-7033
Email: Peter.Fales@alcatel-lucent.com
Phone: 630 979 8031

Re: Cassandra stress test and max vs. average read/write latency.

Posted by Peter Schuller <pe...@infidyne.com>.
> Is there any way to control the CPU load when using the "stress" benchmark?
> I have some control over that with our home-grown benchmark, but I
> thought it made sense to use the official benchmark tool as people might
> more readily believe those results and/or be able to reproduce them.  But
> offhand, I don't see any to throttle back the load created by the
> stress test.

I'm not aware of one built-in. It would be a useful patch IMO, to
allow setting a target rate.
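
A sketch of what that could look like in the consumer loop
(targetOpsPerSec and the two helper calls are hypothetical; the real
tool has no such option yet):

    // Pace operations to a fixed target rate instead of running flat out.
    void runThrottled(int targetOpsPerSec) throws InterruptedException
    {
        long opIntervalNanos = 1000000000L / targetOpsPerSec;
        long next = System.nanoTime();
        while (hasMoreOperations())
        {
            long now = System.nanoTime();
            if (now < next) // ahead of schedule: sleep until the next slot
                Thread.sleep((next - now) / 1000000L);
            doOperation();
            next += opIntervalNanos; // fixed-rate schedule, no drift
        }
    }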

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)

Re: Cassandra stress test and max vs. average read/write latency.

Posted by Peter Fales <Pe...@alcatel-lucent.com>.
Peter,

Thanks for your response. I'm looking into some of the ideas in your
other recent mail, but I had another follow-up question on this one...

Is there any way to control the CPU load when using the "stress" benchmark?
I have some control over that with our home-grown benchmark, but I
thought it made sense to use the official benchmark tool as people might
more readily believe those results and/or be able to reproduce them.  But
offhand, I don't see any way to throttle back the load created by the
stress test.

On Mon, Dec 19, 2011 at 09:47:32PM -0800, Peter Schuller wrote:
> > I'm trying to understand if this is expected or not, and if there is
> 
> Without careful tuning, outliers around a couple of hundred ms are
> definitely expected in general (not *necessarily*, depending on
> workload) as a result of garbage collection pauses. The impact will be
> worsened a bit if you are running under high CPU load (or even maxing
> it out with stress) because post-pause, if you are close to max CPU
> usage you will take considerably longer to "catch up".
> 
> Personally, I would just log each response time and feed it to gnuplot
> or something. It should be pretty obvious whether or not the latencies
> are due to periodic pauses.
> 
> If you are concerned with eliminating or reducing outliers, I would:
> 
> (1) Make sure that when you're benchmarking, that you're putting
> Cassandra under a reasonable amount of load. Latency benchmarks are
> usually useless if you're benchmarking against a saturated system. At
> least, start by achieving your latency goals at 25% or less CPU usage,
> and then go from there if you want to up it.
> 
> (2) One can affect GC pauses, but it's non-trivial to eliminate the
> problem completely. For example, the length of frequent young-gen
> pauses can typically be decreased by decreasing the size of the young
> generation, leading to more frequent shorter GC pauses. But that
> instead causes more promotion into the old generation, which will
> result in more frequent very long pauses (relative to normal; they
> would still be infrequent relative to young gen pauses) - IF your
> workload is such that you are suffering from fragmentation and
> eventually seeing Cassandra fall back to full compacting GCs
> (stop-the-world) for the old generation.
> 
> I would start by adjusting young gen so that your frequent pauses are
> at an acceptable level, and then see whether or not you can sustain
> that in terms of old-gen.
> 
> Start with this in any case: Run Cassandra with -XX:+PrintGC
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
> 
> -- 
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)

-- 
Peter Fales
Alcatel-Lucent
Member of Technical Staff
1960 Lucent Lane
Room: 9H-505
Naperville, IL 60566-7033
Email: Peter.Fales@alcatel-lucent.com
Phone: 630 979 8031

Re: Cassandra stress test and max vs. average read/write latency.

Posted by Peter Schuller <pe...@infidyne.com>.
> I'm trying to understand if this is expected or not, and if there is

Without careful tuning, outliers around a couple of hundred ms are
definitely expected in general (not *necessarily*, depending on
workload) as a result of garbage collection pauses. The impact will be
worsened a bit if you are running under high CPU load (or even maxing
it out with stress) because post-pause, if you are close to max CPU
usage you will take considerably longer to "catch up".

Personally, I would just log each response time and feed it to gnuplot
or something. It should be pretty obvious whether or not the latencies
are due to periodic pauses.
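
A minimal sketch of that (the file name and the doOperation() stand-in
are arbitrary):

    // Append "wall-clock-ms latency-ms" per request; autoflush so a
    // crash doesn't lose samples.
    java.io.PrintWriter log = new java.io.PrintWriter(
            new java.io.FileWriter("latencies.dat", true), true);
    long t0 = System.currentTimeMillis();
    doOperation(); // stand-in for the actual read/write
    long t1 = System.currentTimeMillis();
    log.println(t0 + " " + (t1 - t0));

Then in gnuplot: plot "latencies.dat" using 1:2 with points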

If you are concerned with eliminating or reducing outliers, I would:

(1) Make sure that when you're benchmarking, that you're putting
Cassandra under a reasonable amount of load. Latency benchmarks are
usually useless if you're benchmarking against a saturated system. At
least, start by achieving your latency goals at 25% or less CPU usage,
and then go from there if you want to up it.

(2) One can affect GC pauses, but it's non-trivial to eliminate the
problem completely. For example, the length of frequent young-gen
pauses can typically be decreased by decreasing the size of the young
generation, leading to more frequent shorter GC pauses. But that
instead causes more promotion into the old generation, which will
result in more frequent very long pauses (relative to normal; they
would still be infrequent relative to young gen pauses) - IF your
workload is such that you are suffering from fragmentation and
eventually seeing Cassandra fall back to full compacting GCs
(stop-the-world) for the old generation.

I would start by adjusting young gen so that your frequent pauses are
at an acceptable level, and then see whether or not you can sustain
that in terms of old-gen.
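
For example, via the heap variables in conf/cassandra-env.sh (sizes
purely illustrative; tune against your own GC log):

    MAX_HEAP_SIZE="4G"
    HEAP_NEWSIZE="400M"  # smaller young gen: shorter but more frequent ParNew pauses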

Start with this in any case: Run Cassandra with -XX:+PrintGC
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)