Posted to user@cassandra.apache.org by Erik Bunn <eb...@basen.net> on 2010/11/02 18:27:50 UTC

Cassandra suitability for model?


...or perhaps vice versa: how would I tweak a model to suit Cassandra?

I have in mind data that could be _almost_ shoehorned into the (S)CF
structure, and I'd love to hammer this nail with something hadoopy, but I
have a niggling suspicion I'm setting myself up for frustration.

I have
* a relatively small set of primary tags (dozens or hundreds per cluster)
* under each primary tag a large number (on the scale of 1E6) of
   arbitrary-length hierarchical paths ("foo/bar/xyzzy", typically consisting
   of descriptive labels, usually totaling 20-40 chars)
* under each path an arbitrary number (usually a few or a few dozen, but in
   some systematic cases ~1000) of leaf tags (typically descriptive labels, say
   4-16 chars in length)
* under each leaf tag a value (arbitrary; string, number, perhaps binary)
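
To make the shape of the data concrete, here is a minimal in-memory sketch of
the hierarchy described above. It is only an illustration: the class names
(PrimaryTag, PathNode), the put() helper and the sample tags are all
hypothetical, and nothing here touches Cassandra itself.

```python
from dataclasses import dataclass, field
from typing import Dict, Union

# A leaf value is arbitrary: string, number, or binary.
Value = Union[str, int, float, bytes]


@dataclass
class PathNode:
    # leaf tag (4-16 chars or so) -> current value
    leaves: Dict[str, Value] = field(default_factory=dict)


@dataclass
class PrimaryTag:
    # hierarchical path such as "foo/bar/xyzzy" -> its leaf tags
    paths: Dict[str, PathNode] = field(default_factory=dict)


def put(store: Dict[str, PrimaryTag], primary: str, path: str,
        leaf: str, value: Value) -> None:
    """Set one leaf value, creating the intermediate levels on demand."""
    node = store.setdefault(primary, PrimaryTag()).paths.setdefault(path, PathNode())
    node.leaves[leaf] = value


# Hypothetical sample data: one primary tag, one path, two leaf tags.
store: Dict[str, PrimaryTag] = {}
put(store, "site-a", "foo/bar/xyzzy", "temp", 21.5)
put(store, "site-a", "foo/bar/xyzzy", "state", "ok")
```

This holds only the latest value per leaf; the history requirement is what
the rest of the question is about.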

On the surface, it would seem that the primary tag would correspond well with
Supercolumn keys, the intermediate path with ColumnFamily names, and the final
key-value pairs with Columns. Any warning bells here? (Seems like I could also
use the primary tag as a Keyspace name, but I seem to recall some warnings
about using excessive keyspaces.)

The gist is that each and every leaf tag, across the whole data set, receives
a value every few seconds, indefinitely, and history must be preserved. In
practice, all Columns in the ColumnFamily receive a value at the same time.
100k ColumnFamily updates a second would be routine. Nodes would be added
whenever storage or per-node I/O became an issue.
Queries are much rarer, and would nearly always involve
retrieving the full contents of a ColumnFamily over some time range
(usually thousands of snapshots, and quite often millions).
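
As a rough back-of-envelope for what that write rate adds up to, here is a
small calculation. Only the 100k updates per second comes from the figures
above; the columns per snapshot and bytes per column are assumptions picked
purely for illustration, and replication and storage overhead are ignored.

```python
SNAPSHOTS_PER_SECOND = 100_000   # "100k ColumnFamily updates a second"
SECONDS_PER_DAY = 24 * 60 * 60

# Assumptions for illustration only, not measured figures:
COLUMNS_PER_SNAPSHOT = 10        # "a few or a few dozen" leaf tags per path
BYTES_PER_COLUMN = 32            # leaf tag name + value + timestamp, roughly


def daily_load(rate=SNAPSHOTS_PER_SECOND,
               columns=COLUMNS_PER_SNAPSHOT,
               column_bytes=BYTES_PER_COLUMN):
    """Return (snapshots, column writes, raw bytes) accumulated per day."""
    snapshots = rate * SECONDS_PER_DAY
    column_writes = snapshots * columns
    raw_bytes = column_writes * column_bytes
    return snapshots, column_writes, raw_bytes


snapshots, column_writes, raw_bytes = daily_load()
print(f"{snapshots:.3g} snapshots/day, "
      f"{column_writes:.3g} column writes/day, "
      f"{raw_bytes / 1e12:.2f} TB/day raw")
```

Under these assumptions that is 8.64e9 snapshots per day before replication,
which is why history retention and node growth matter so much here.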

Just by browsing online documentation I can't resolve whether this
timestamping could work in conjunction with Cassandra's internal native
timestamping as is. Is it possible - out of the box, or with minor coding -
to retain history, and to incorporate time ranges into queries? If so, does
the distributed storage cope with the accumulating data of a single
ColumnFamily flowing over to new nodes?

Or: should I twist the whole thing around and incorporate my timestamp
into the ColumnFamily identifier to enjoy automatic scaling? Would the
sheer number of resulting identifiers become a performance issue?
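
To make this "twisted around" variant concrete, here is a minimal sketch of
one way the timestamp could be folded into the identifier by bucketing time.
Everything in it is an assumption for illustration: the one-hour bucket size,
the "|" separator, the function names, and the idea that the identifier would
be a plain string key. It does not use any Cassandra API.

```python
from typing import Iterator

BUCKET_SECONDS = 3600  # assumption: one identifier per path per hour


def bucket_start(ts: int, bucket_seconds: int = BUCKET_SECONDS) -> int:
    """Round a Unix timestamp down to the start of its time bucket."""
    return ts - ts % bucket_seconds


def make_key(primary: str, path: str, ts: int,
             bucket_seconds: int = BUCKET_SECONDS) -> str:
    """Build the identifier that a snapshot taken at ts would be written under."""
    return f"{primary}|{path}|{bucket_start(ts, bucket_seconds)}"


def keys_for_range(primary: str, path: str, start_ts: int, end_ts: int,
                   bucket_seconds: int = BUCKET_SECONDS) -> Iterator[str]:
    """Yield every identifier needed to cover [start_ts, end_ts] inclusive."""
    bucket = bucket_start(start_ts, bucket_seconds)
    while bucket <= end_ts:
        yield f"{primary}|{path}|{bucket}"
        bucket += bucket_seconds


# A time-range query becomes a lookup of a computable list of identifiers.
keys = list(keys_for_range("site-a", "foo/bar/xyzzy", 3500, 7300))
```

With this layout the number of identifiers grows with time rather than with
the number of paths alone, so the bucket size is the knob that trades
identifier count against the amount of data stored under each one.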

Thanks for your comments;
//e