You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by "Schuilenga, Jan Taeke" <ja...@ocwduo.nl> on 2011/06/16 08:28:42 UTC

Important Variables for Scaling

Which variables (for instance: throughput, CPU, I/O, connections) are
leading in deciding to add a node to a Cassandra setup which is put
under strain. We are trying to proove scalibility, but when is the time
there to add a node and have the optimum scalibilty result.

Re: Important Variables for Scaling

Posted by aaron morton <aa...@thelastpickle.com>.
It's a difficult questions to answer in the abstract. Some thoughts...

Scaling by adding one node at  time is not optimal. The best case scenario is to double the number of nodes, as this means existing nodes only have to stream their data to a new node. Obviously this is not always possible. When adding less nodes existing nodes keeping a balanced ring may mean streams data to other nodes and accepting data from nodes.  

In general try to keep the data volume < 50% full, so there is lots of free space to do moves. 

In general nodes with a few 100GB's of data are easiest to manage. 

The pending column in nodetool tpstats will let you know how many read or write requests are waiting to be serviced. If this is consistently above concurrent_reads or concurrent_writes it means there is a queue > 1 for each thread. This will add to request latency, once maximum throughput is reached additional requests will queue. See the SEDA paper.  

Sometime in the 0.7 dev client connection pooling was added to better manage those resources. See cassandra.yaml for info.  

The o.a.c.db.StorageProxy JMX MBean provides a latency trackers for total request time including wait times. And o.a.c.db.ColumnFamiles... provides latency trackers for the local node operations to do the read or write.

if your data set grows quickly watch the disk space etc. If you do a lot of requests but you data grows slowly watch the throughout and latency numbers. 

 
Hope that helps.   
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 16 Jun 2011, at 18:28, Schuilenga, Jan Taeke wrote:

> Which variables (for instance: throughput, CPU, I/O, connections) are leading in deciding to add a node to a Cassandra setup which is put under strain. We are trying to proove scalibility, but when is the time there to add a node and have the optimum scalibilty result.
>