Posted to commits@cassandra.apache.org by Apache Wiki <wi...@apache.org> on 2009/08/31 22:58:57 UTC

[Cassandra Wiki] Update of "CassandraLimitations" by JonathanEllis

The following page has been changed by JonathanEllis:
http://wiki.apache.org/cassandra/CassandraLimitations

------------------------------------------------------------------------------
= Limitations =

- From easiest to fix to hardest:
+ == Inherent in the design ==
+
+ The main limitation on column and supercolumn size is that all data for a single key and column must fit (on disk) on a single machine in the cluster. Because keys alone are used to determine the nodes responsible for replicating their data, the amount of data associated with a single key has this upper bound. This is an inherent limitation of the distribution model.
+
+ == Artifacts of the current code base ==

* Cassandra's compaction code currently deserializes an entire row (per columnfamily) at a time, so all the data from a given columnfamily/key pair must fit in memory. Fixing this is relatively easy: columns are stored in order on disk, so nothing requires deserializing a whole row at once, except that doing so is easier with the current encapsulation of functionality.
* Cassandra has two levels of indexes: key and column. But in super columnfamilies there is a third level of subcolumns; these are not indexed, and any request for a subcolumn deserializes _all_ the subcolumns in that supercolumn. So you want to avoid a data model that requires large numbers of subcolumns. This can be fixed; the core classes involved are SuperColumn and SequenceFile.
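
The first item above observes that, because columns are stored in order on disk, compaction does not have to hold a whole row in memory. The following is a minimal sketch of that idea in Python, not Cassandra's actual compaction code: the function name merge_row_streams and the (name, timestamp, value) tuple layout are assumptions made for illustration, and deletions are ignored. Each input is an iterator over the columns of one key as read from one data file, already sorted by column name; the merge keeps only one pending column per input in memory.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_row_streams(*streams):
    """Lazily merge sorted column streams that all belong to one key.

    Each stream yields (name, timestamp, value) tuples sorted by name.
    This layout is hypothetical, chosen only for the sketch.
    When several streams contain the same column name, the version
    with the highest timestamp is kept.
    """
    # heapq.merge reads one item at a time from each stream, so the
    # full row is never materialized.
    merged = heapq.merge(*streams, key=itemgetter(0))
    for _name, versions in groupby(merged, key=itemgetter(0)):
        yield max(versions, key=itemgetter(1))


# Two hypothetical data files holding columns for the same key.
older = iter([("a", 1, "x"), ("c", 1, "y")])
newer = iter([("a", 2, "x2"), ("b", 2, "z")])
result = list(merge_row_streams(older, newer))
```

Here result is collected into a list only to inspect it; a compaction pass built this way would write each merged column out as it is produced, so memory use depends on the number of input files rather than on the size of the row.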