Posted to user@cassandra.apache.org by David <da...@silveregg.co.jp> on 2010/05/10 13:28:23 UTC
Cassandra for live statistics aggregation?
Hi,
I am investigating the use of Cassandra to gather and aggregate simple
statistics in real time from multiple sources, something quite similar
to what is described here:
https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra
I have a few questions about how to design a data model in Cassandra,
especially given the lack of atomic increment operations.
Simply put, I would have a server that receives requests of the form:
..../request?timestamp=1234567890&property1=value1&....
And I would like to keep track of the number of such requests in time
ranges ("how many requests / hour for the last few days"), for arbitrary
combinations of {property1 : value1, ..., propertyN: valueN}. The goal
is to cope with a few thousand requests / sec on the write side, and to
get acceptable latency for queries (ideally ~ 1 sec/query for a few
queries / sec).
Given those constraints, I am considering time-based buckets, where I
would count the number of requests for each property combination on an
hourly basis, a daily basis, etc.
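To make the bucketing concrete, here is a minimal sketch in Python. It only shows how a request timestamp maps to its hourly and daily bucket; the in-memory Counter is a stand-in for the real store and does not address the missing atomic increment. The names BUCKET_SECONDS, bucket_start and count_request are made up for illustration, and buckets are assumed to be aligned on UTC.

```python
from collections import Counter

# Bucket widths in seconds; the names "hourly" and "daily" are illustrative.
BUCKET_SECONDS = {"hourly": 3600, "daily": 86400}


def bucket_start(timestamp, bucket):
    """Round a Unix timestamp down to the start of its bucket (UTC-aligned)."""
    width = BUCKET_SECONDS[bucket]
    return timestamp - (timestamp % width)


def count_request(counts, timestamp, properties):
    """Increment the counter of every bucket the request falls into.

    `counts` is a plain in-memory Counter standing in for the data store.
    """
    # Sort the properties so the same combination always yields the same key.
    combo = tuple(sorted(properties.items()))
    for bucket in BUCKET_SECONDS:
        counts[(bucket, bucket_start(timestamp, bucket), combo)] += 1


counts = Counter()
count_request(counts, 1234567890, {"property1": "value1"})
count_request(counts, 1234567899, {"property1": "value1"})
```

Both requests above land in the same hourly bucket and the same daily bucket, so each of the two counters ends up at 2.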
I see a few ways to model this:
- the most obvious one is to prefix the key with the timestamp and use
key-range queries. Unfortunately, this seems to require an ordered
partitioner, which sounds like a bad idea here, as all writes would hit
a single node at any given time.
- another solution I can think of is to keep one column family per bucket
(one for daily counts, etc.). The key would be
bucket_id:hash({property1: value1, ...}), and the columns would be the
corresponding timestamps for this bucket and set of properties. This is
easy to write and to read for the queries I care about. Problem: I
understand that Cassandra scales to a few million columns per row, and
this solution may require many more for buckets coarser than a day.
- more involved: using a timestamp-based key but implementing my own
partitioner to partition the writes.
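As an illustration of the second option, the row keys and column names could be built roughly as below. This is only a sketch under my own assumptions: MD5 over a sorted "name=value" string stands in for the unspecified hash(), all helper names are hypothetical, and no Cassandra client call is shown. It also enumerates every non-empty subset of the properties, since I want counts for arbitrary combinations.

```python
import hashlib
from itertools import combinations


def combination_hash(properties):
    """Hash a property combination independently of dictionary order.

    Assumption: names and values contain neither '&' nor '=', otherwise two
    different combinations could produce the same canonical string.
    """
    canonical = "&".join(f"{k}={v}" for k, v in sorted(properties.items()))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def row_key(bucket_id, properties):
    """Build a row key of the form bucket_id:hash({property1: value1, ...})."""
    return f"{bucket_id}:{combination_hash(properties)}"


def property_subsets(properties):
    """Yield every non-empty subset of the properties (2**N - 1 of them)."""
    items = sorted(properties.items())
    for size in range(1, len(items) + 1):
        for subset in combinations(items, size):
            yield dict(subset)


def writes_for_request(timestamp, properties, bucket_id="hourly", width=3600):
    """Return the (row key, column name) pairs one request would touch."""
    column = timestamp - timestamp % width  # start of the time bucket
    return [(row_key(bucket_id, subset), column)
            for subset in property_subsets(properties)]


writes = writes_for_request(1234567890, {"property1": "value1",
                                         "property2": "value2"})
```

With two properties, a single request touches three rows (one per non-empty subset), so the write volume grows quickly with the number of properties.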
I am quite new to key-value stores, so there may be other simple
solutions to this problem. Most examples I found on the internet were
either incomplete or assumed range queries over keys containing
timestamps (thus requiring the ordered partitioner).
thanks,
David