Posted to user@cassandra.apache.org by Daniel Doubleday <da...@gmx.net> on 2010/12/22 15:50:14 UTC

Problematic usage pattern

Hi all

Wanted to share a Cassandra usage pattern you might want to avoid (if you can).

The combination of

- heavy rows,
- large volume and
- many updates (overwriting columns)

will lead to a higher count of live SSTables (at least if you're not triggering major compactions a lot), with many SSTables actually containing the same hot rows.
This leads to lots of reads that have to touch multiple SSTables, which increases latency and I/O pressure by itself and makes the page cache less effective because it ends up holding loads of 'invalid' (overwritten) data.

In our case we could reduce reads by ~40%. Our rows contained one large column (1-4k) and some 50 - 100 small columns.
We split the data into 2 CFs and moved the large column into the second CF, keyed by a UUID which we store as a column in CF1. We cache the now light-weight CF1 rows in the row cache, which eliminates the update problem, and instead of overwriting the large column we create a new row in CF2 and delete the old one. That way the bloom filter prevents unnecessary reads.

The downside is that to read the large column from CF2 we have to read CF1 first, but since that one is in the row cache it's still way better.
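
Roughly, the access pattern looks like this. This is not our real code - the CassandraStore interface and the column names ("blob_ref", "blob") are made up stand-ins for whatever client you use, just to illustrate the two-CF read/write flow:

    import java.util.UUID;

    // hypothetical minimal client: column read/write and row delete by (CF, row key, column)
    interface CassandraStore {
        byte[] get(String cf, String rowKey, String column);
        void put(String cf, String rowKey, String column, byte[] value);
        void deleteRow(String cf, String rowKey);
    }

    class BlobStore {
        private final CassandraStore store;

        BlobStore(CassandraStore store) { this.store = store; }

        // "Update" the large column: write a brand new row in CF2, repoint CF1, drop the old row.
        void updateLargeColumn(String key, byte[] newBlob) {
            String oldBlobKey = asString(store.get("CF1", key, "blob_ref")); // cheap, CF1 is row-cached
            String newBlobKey = UUID.randomUUID().toString();

            store.put("CF2", newBlobKey, "blob", newBlob);        // new row exists in exactly one SSTable,
                                                                  // so bloom filters keep reads to one file
            store.put("CF1", key, "blob_ref", bytes(newBlobKey)); // repoint the small, row-cached row
            if (oldBlobKey != null)
                store.deleteRow("CF2", oldBlobKey);               // old row becomes garbage for compaction
        }

        // Read path: CF1 lookup (served from the row cache) -> CF2 lookup by UUID.
        byte[] readLargeColumn(String key) {
            String blobKey = asString(store.get("CF1", key, "blob_ref"));
            return blobKey == null ? null : store.get("CF2", blobKey, "blob");
        }

        private static String asString(byte[] b) { return b == null ? null : new String(b); }
        private static byte[] bytes(String s) { return s.getBytes(); }
    }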

To monitor this we made a very small patch which records the number of SSTable files scanned per read for a CF in a histogram, in a similar way to the latency stats.

If someone's interested - here is the patch against 0.6.8:

https://gist.github.com/751601
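
For the idea only (this is not the patch above, just a simplified standalone sketch): keep a per-CF histogram of how many SSTable files each read had to touch, and expose it next to the read latency stats.

    import java.util.concurrent.atomic.AtomicLongArray;

    class SSTablesPerReadHistogram {
        // bucket i counts reads that touched i SSTables; the last bucket collects "more"
        private final AtomicLongArray buckets;

        SSTablesPerReadHistogram(int maxTracked) {
            buckets = new AtomicLongArray(maxTracked + 1);
        }

        // called once per read, after the row has been collated from all SSTables
        void recordScannedFiles(int sstablesTouched) {
            int bucket = Math.min(sstablesTouched, buckets.length() - 1);
            buckets.incrementAndGet(bucket);
        }

        // e.g. exposed via JMX, so you can watch the distribution drift
        // as SSTables pile up between compactions
        long[] snapshot() {
            long[] copy = new long[buckets.length()];
            for (int i = 0; i < copy.length; i++)
                copy[i] = buckets.get(i);
            return copy;
        }
    }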

Cheers,
Daniel
smeet.com, Berlin