Posted to user@cassandra.apache.org by David <da...@silveregg.co.jp> on 2010/05/10 13:28:23 UTC

Cassandra for live statistics aggregation?

Hi,

I am investigating the use of Cassandra to gather and aggregate simple 
statistics in real time from multiple sources, something quite similar 
to what is described here: 
https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra. I 
have a few questions about how to design a data model in Cassandra, 
especially w.r.t. the lack of atomic increment operations.

Simply put, I would have a server which would receive requests as follows:

..../request?timestamp=1234567890&property1=value1&....

And I would like to keep track of the number of such requests in time 
ranges ("how many requests / hour for the last few days"), for arbitrary 
combinations of {property1 : value1, ..., propertyN: valueN}. The goal 
is to cope with a few thousand requests / sec on the write side, and to 
get acceptable latency for queries (ideally ~ 1 sec/query for a few 
queries / sec).

Given those constraints, I am considering making time-based buckets, 
where I would count the number of requests for each property combination 
on an hourly basis, a daily basis, etc.
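
To make the bucketing concrete, here is a minimal sketch of what I mean 
by a time bucket. It assumes timestamps are Unix epoch seconds (as in the 
request example above) and that buckets are aligned on UTC; the names 
bucket_start and buckets_for are only illustrative.

```python
HOUR = 3600          # bucket width in seconds
DAY = 24 * HOUR


def bucket_start(timestamp, width):
    """Return the epoch second at which the bucket containing timestamp starts."""
    return timestamp - timestamp % width


def buckets_for(timestamp):
    """Map each bucket granularity (hypothetical names) to its bucket start."""
    return {
        "hourly": bucket_start(timestamp, HOUR),
        "daily": bucket_start(timestamp, DAY),
    }
```

Every request whose timestamp falls within the same hour (or day) maps to 
the same bucket start, so that value can serve as the bucket identifier.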

I see several possible layouts:

	- the most obvious one: prefixing the key with the timestamp so that I 
can use key-range queries. Unfortunately, this seems to require an ordered 
partitioner, which sounds like a bad idea here, as all writes would go to 
one node at any given time.
	- another solution I can think of is to keep a column family per bucket 
(one for daily counts, etc.). The key would be 
bucket_id:hash({property1: value1, ...}), and the columns would be the 
corresponding timestamps for this bucket and set of properties. This is 
easy to write, and easy to read for the queries I care about. Problem: I 
understand that Cassandra scales to a few million columns per row, and 
this solution may require many more for buckets which are coarser than a 
day.
	- more involved: using a timestamp-based key but implementing my own 
partitioner to spread the writes across nodes.
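
To illustrate the second layout, here is a rough sketch that uses a plain 
in-memory dict as a stand-in for Cassandra (no client library is involved, 
and all names are hypothetical). It assumes MD5 as the hash, epoch-second 
timestamps, and one counter value per column. Note that the increment is a 
read-modify-write, which is exactly the non-atomic step I am worried about.

```python
import hashlib
from collections import defaultdict

# Bucket granularities in seconds (hypothetical identifiers).
BUCKET_WIDTHS = {"hourly": 3600, "daily": 86400}

# Stand-in for the store: row key -> {column name -> count}.
store = defaultdict(dict)


def row_key(bucket_id, properties):
    """Build bucket_id:hash({property1: value1, ...}) as the row key."""
    # Sort the properties so the same combination always hashes identically.
    canonical = "&".join(f"{k}={v}" for k, v in sorted(properties.items()))
    digest = hashlib.md5(canonical.encode("utf-8")).hexdigest()
    return f"{bucket_id}:{digest}"


def column_name(bucket_id, timestamp):
    """The column is the start of the time bucket containing timestamp."""
    return timestamp - timestamp % BUCKET_WIDTHS[bucket_id]


def record(timestamp, properties):
    """Count one request in every bucket granularity."""
    for bucket_id in BUCKET_WIDTHS:
        row = store[row_key(bucket_id, properties)]
        col = column_name(bucket_id, timestamp)
        # Read-modify-write: not atomic if done against a real cluster.
        row[col] = row.get(col, 0) + 1


def query(bucket_id, properties, start, end):
    """Return {bucket start: count} for bucket starts in [start, end)."""
    row = store.get(row_key(bucket_id, properties), {})
    return {c: n for c, n in sorted(row.items()) if start <= c < end}
```

With this layout, "requests per hour for the last few days" for one 
property combination is a single-row read over a slice of columns. The 
sketch only counts the exact property set passed to record; counting 
arbitrary sub-combinations would need one record call per combination.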

I am quite new to key-value stores, so are there other simple solutions 
to this problem that I am missing? Most examples I found on the internet 
were too incomplete, or assumed range queries over keys containing 
timestamps (thus requiring the ordered partitioner).

Thanks,

David