Posted to dev@cassandra.apache.org by Peter Fales <Pe...@alcatel-lucent.com> on 2010/07/06 17:26:46 UTC

Cassandra performance and read/write latency

Greetings Cassandra Developers!

We've been trying to benchmark Cassandra performance and have 
developed a test client written in C++ that uses multiple threads to 
send out a large number of write and read requests (as fast as the
server can handle them).   
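
For concreteness, the measurement side of the client is conceptually
similar to the sketch below.  (This is an illustration only: the
issue_request() stub stands in for the actual Thrift calls, and the
thread and request counts are placeholders, not our real parameters.)

    // build (sketch): g++ -std=c++11 -pthread bench_sketch.cpp
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Placeholder for one read or write against the cluster.  A real
    // client would call the generated Cassandra Thrift stubs here; this
    // stub just simulates a little work so the sketch runs standalone.
    static void issue_request(int /*thread_id*/, long /*seq*/)
    {
        std::this_thread::sleep_for(std::chrono::microseconds(100));
    }

    // Each worker issues requests in a tight loop, times every call, and
    // flags any request over a threshold, since averages alone would
    // hide the multi-second outliers described above.
    static void worker(int thread_id, long requests)
    {
        using clock = std::chrono::steady_clock;
        for (long i = 0; i < requests; ++i) {
            auto start = clock::now();
            issue_request(thread_id, i);
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                          clock::now() - start).count();
            if (ms > 1000)
                std::printf("thread %d: request %ld took %lld ms\n",
                            thread_id, i, static_cast<long long>(ms));
        }
    }

    int main()
    {
        const int  nthreads   = 16;      // illustrative thread count
        const long per_thread = 100000;  // illustrative request count
        std::vector<std::thread> pool;
        for (int t = 0; t < nthreads; ++t)
            pool.emplace_back(worker, t, per_thread);
        for (auto &th : pool)
            th.join();
        return 0;
    }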

One of the results we're seeing is a bit surprising, and I'm hoping
someone here can help shed some light on it; as far as I can tell, it
hasn't been discussed on the mailing list.

Most of the requests return in a reasonable amount of time (10s or
100s of milliseconds), but every once in a while, the server seems to
just "stop" for up to several seconds.   During this time, all the 
reads and writes will take several seconds to complete, and network traffic
in and out of the system drops off to nearly zero.  When plotted on a
graph, these appear as very large spikes every few minutes, though without
any particular pattern to when they occur.  Even though the average
response time is very good (and therefore we get a reasonable
number of requests/sec), these occasional outliers are a showstopper for
our potential applications.

We've experimented with a number of machines of differing
capabilities, including a range of physical machines and clusters of
machines on Amazon's EC2.  We've also used different numbers of nodes
in the cluster and different values for ReplicationFactor.   All are 
qualitatively similar, though the numbers vary as expected (i.e.,
faster machines improve both the average and maximum latencies, but the
maximum values are still on the order of seconds).

I know Cassandra has lots of configuration parameters that can be
tweaked, but we have left most parameters at the default values of
Cassandra 0.6.2 or 0.6.3.

Has anyone else seen nodes "hang" for several seconds like this?  I'm
not sure if this is a Java VM issue (e.g. garbage collection) or something
specific to the Cassandra application.   I'll be happy to share more 
details of our experiments either on the mailing list, or with interested
parties offline.  But I thought I'd start with a brief description and 
see how consistent it is with other experiences.   I'm sort of expecting
to see "Well, of course you'll see that kind of behavior because you
didn't change..."

I'm also interested in comparing notes with anyone else who has been doing
read/write throughput benchmarks with Cassandra.

Thanks in advance for any information or suggestions you may have!

-- 
Peter Fales
Alcatel-Lucent
Member of Technical Staff
1960 Lucent Lane
Room: 9H-505
Naperville, IL 60566-7033
Email: Peter.Fales@alcatel-lucent.com
Phone: 630 979 8031

Re: Cassandra performance and read/write latency

Posted by Peter Schuller <pe...@infidyne.com>.
> Has anyone else seen nodes "hang" for several seconds like this?  I'm
> not sure if this is a Java VM issue (e.g. garbage collection) or something

Since garbage collection is logged (if you're running with default
settings, etc.), any multi-second GCs should be discoverable in that
log, so for testing that hypothesis I'd check there first. Cassandra
itself logs GCs, but you can also turn on the JVM's own GC logging
with e.g. "-XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps".

> I'm also interested in comparing notes with anyone  else that has been doing
> read/write throughput benchmarks with Cassandara.

I did some batch write testing to see how it scaled up to about 200
million rows and 200 GB; I had occasional spikes in latency that were
due to disk writes being flushed by the OS. However, it was probably
exacerbated in this case by the fact that this was ZFS on FreeBSD, and
ZFS (in my humble opinion, and at least on FreeBSD) consistently
exhibits the behavior for me that it flushes writes too late and ends
up blocking applications even when there is left-over bandwidth.

In my case I "eliminated" the issue for the purpose of my test by
having a stupid while loop simply doing "sync" every handful of
seconds, to avoid accumulating too much data in the cache.
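
Roughly something like this, running alongside the test:

    # crude workaround: force the OS to flush dirty pages regularly
    while true; do sync; sleep 5; done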

While I expect this to be less of a problem for other setups, it's
possible this is what you're seeing, for example if the operating
system is blocking writes to the commit log (are you running with
periodic fsync or batch-wise fsync?).
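
If I remember correctly, in 0.6 that choice lives in storage-conf.xml,
along these lines (element names from memory, so double-check against
your own config):

    <!-- periodic: fsync the commit log every CommitLogSyncPeriodInMS ms -->
    <CommitLogSync>periodic</CommitLogSync>
    <CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>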

-- 
/ Peter Schuller