
Posted to user@cassandra.apache.org by Paul Pak <pp...@yellowseo.com> on 2011/01/09 02:55:27 UTC

Re: Cassandra gotchas ...

Hi all,

After using Cassandra for some time, I have some comments on Cassandra and
hope they spark productive conversation on the list.  They are meant
only as constructive feedback from a user of Cassandra.  While there are
many great things about Cassandra, I still feel that the current
implementation has two major issues that are limiting its ability to be
used in production.  There are many little gotchas that come up which
most people don't find out about until they get through most of the
implementation.  Most of the gotchas I can live with, but the following
items seem like too heavy a cost to me.

1) If you have a result set with thousands of results, like an inbox,
there is no way to efficiently handle the pages <- 1 2 3 4 5 6 7 8 9 10
-> except by creating additional data structures on a materialized
view.  But that means you can only get paged views on materialized
views.  If you were to add constraints, all the paging functionality no
longer works.  This is basic functionality that many, many
applications need.  Essentially it means that we can only perform the
most basic queries in Cassandra and secondary indexes and super columns
are near useless.  Super Columns are useless for doing complex queries
because of a lack of secondary indexes and the fact that it needs to
deserialize the entire row to work with it.  Regular CFs are no good
either for queries with constraints because the paging no longer works
since there is no materialized view.  There is no way to get the 800th
record in a result set without getting ALL the data up to the 800th
record.  That is crazy!  Cassandra desperately needs an efficient
capability to return a result set by specifying a start_column by record
number, not key.
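
The difference between paging by record number and paging by start key
can be sketched in a few lines of Python.  This is only an illustration
under stated assumptions, not Cassandra client code: `columns`,
`page_by_offset` and `page_by_start` are hypothetical names, and a sorted
in-memory list stands in for the sorted columns of one row.

```python
from bisect import bisect_right

# Hypothetical stand-in for one wide row: column names kept in sorted order.
columns = [f"msg{n:04d}" for n in range(1, 2001)]


def page_by_offset(cols, offset, count):
    """Paging by record number: every column before `offset` is still read."""
    scanned = 0
    page = []
    for name in cols:
        scanned += 1
        if scanned > offset:
            page.append(name)
            if len(page) == count:
                break
    # `scanned` shows how many columns were touched to build the page.
    return page, scanned


def page_by_start(cols, start_after, count):
    """Paging by key: seek straight to the column after `start_after`."""
    i = bisect_right(cols, start_after) if start_after is not None else 0
    return cols[i:i + count]
```

Fetching the 800th record with `page_by_offset(columns, 799, 1)` touches
800 columns, while `page_by_start(columns, "msg0799", 1)` seeks directly,
but only because the caller already knows the previous page's last key.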

2) Lack of operational support features.  For instance, no capability to
manage Cassandra's usage of disk space on nodes.  The fact that an admin
cannot specify where data goes or how to handle hot data, or gracefully
stop handling writes to nodes is a fundamental problem with the
partitioning strategy in my opinion.  I believe the entire partitioning
strategy needs to be revisited and probably rewritten to include
capabilities to accept administrator input on how to handle the data
(e.g. directories, machines), easily support moving data and
specifying where it should go, how many replicas, etc.  As it is, it is
just not flexible enough.  What if you have particularly hot data and
want to replicate it a dozen times to service read requests faster?  If
a node runs out of space for sstables, I still want it to be operational
for read requests, but not write.  When nodes are moved, we need to
manually run cleanup.  Why is that?  If there is a safety reason, then
how is an administrator going to know better than Cassandra that the
operation was successful?
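
To make the cleanup question concrete, here is a simplified model of
what a token move leaves behind.  It is a sketch under loud assumptions,
not Cassandra's actual code: a replication factor of 1, a toy token
space of 0..99, and hypothetical names (`owner`, `owned`, `before`,
`after`).  Each node is taken to own the range from the previous token
(exclusive) up to its own token (inclusive).

```python
from bisect import bisect_left


def owner(tokens, key_token):
    """Return the node token that owns key_token: the first token >= it, wrapping."""
    ring = sorted(tokens)
    i = bisect_left(ring, key_token)
    return ring[i % len(ring)]


def owned(ring, node, key_tokens):
    """Set of key tokens that `node` owns in the given ring layout."""
    return {k for k in key_tokens if owner(ring.values(), k) == ring[node]}


# Hypothetical three-node ring, replication factor 1 (an assumption).
before = {"A": 30, "B": 60, "C": 90}
after = {"A": 30, "B": 45, "C": 90}  # node B moved from token 60 to token 45

keys = range(100)
# Keys B held before the move but is no longer responsible for afterwards.
stale_on_b = owned(before, "B", keys) - owned(after, "B", keys)
```

In this model `stale_on_b` is the set of tokens 46 through 60: data that
still sits in B's sstables after the move even though C now serves it,
which is the data a cleanup run would have to remove.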

I know that Cassandra is a work in progress and there are many
limitations I can live with, but it would be nice to know what the
roadmap is for the next 12-24 months so we can get an idea of what major
directions Cassandra is going in so we can plan accordingly.  It would
be nice if the community could vote on features considered so that the
devs would have an idea of where the major pain points are for the users
of Cassandra.  The questions that are especially important are...  what
feature additions are being considered?  And, what is being done to
improve Cassandra's operations management?  As clusters get larger,
having it run smoothly is critical for success with Cassandra.  I can
live with fewer features, but if I get going and the system falls flat in
production, that's a terrible situation.  Thanks and Happy New Year all!

Paul

Re: Cassandra gotchas ...

Posted by Jeremy Hanna <je...@gmail.com>.

> I know that Cassandra is a work in progress and there are many
> limitations I can live with, but it would be nice to know what the
> roadmap is for the next 12-24 months so we can get an idea of what major
> directions Cassandra is going in so we can plan accordingly. 

Take a look at Jira - https://issues.apache.org/jira/browse/CASSANDRA - there are many, many
tickets slated for 0.7.1 and 0.8.  Also, you can get involved by taking part in discussions on the dev list.
If you feel that a feature is lacking, you can also create tickets in Jira.

> It would be nice if the community could vote on features considered so that the
> devs would have an idea of where the major pain points are for the users
> of Cassandra.


Jira tickets can be voted on.  For example, CASSANDRA-1072 - distributed counters - was recently
committed to trunk (https://issues.apache.org/jira/browse/CASSANDRA-1072).  You can tell by the votes
and watches, as well as the discussion, that it was a popular ticket.  There were many,
many, many discussions about the feature, including alternate implementations.  Several companies
were involved.  Discussion took place mostly on the ticket, IRC, and the dev mailing list.

Speaking of which, you can check out the IRC channels to discuss tickets, features, and plans as well.
See http://wiki.apache.org/cassandra/IRC

I can't speak to all of the things you brought up, but Jira, the dev mailing list, and IRC are the primary ways to
propose features, see what's coming, discuss pain points, etc.

The community is very active and welcomes feedback.  Thanks for taking the time.
