Posted to user@cassandra.apache.org by asil klin <as...@gmail.com> on 2011/01/09 13:04:35 UTC

A few quick questions to help me design a better schema..

1. ) If certain columns in a row are mutated very frequently, or if new
columns are frequently added to the row, are reads of the old columns
that rarely change also affected? In other words, is the read performance
of infrequently changing columns affected in any way when other columns
in the same row are frequently updated or inserted?

2. ) Are all columns inside a super column family supercolumns, or can they
be a mix of simple columns and supercolumns?

3. ) When the row cache is enabled and certain columns of a row are read,
is the entire row put into the cache, or just the columns that were read?

4. ) Does a larger number of column families have any impact on
performance (I read about it somewhere)? Should information for a particular
row key be split into multiple column families according to the specific query
demands, or should all data related to a particular row key be kept together
in a single column family?

5. ) Are there any limitations of valueless columns to consider? I read in a
ppt "Only works with <= 2B columns in 0.7 valueless colum". I could not
understand the meaning of this statement.

Thanks
Asil

Re: A few quick questions to help me design a better schema..

Posted by Tyler Hobbs <ty...@riptano.com>.
>
> Though in general I would say that it is worth considering. In
> particular, if you have certain data that is accessed a lot more
> frequently than other data (especially if the "other data" is large),
> the cache-locality benefit of keeping the frequently accessed data
> separate can be significant (assuming greater-than-RAM data sets). Another
> concern might be if you have some parts that are constantly updated or
> deleted, while some other part is mostly append-only. The
> compaction needs of the frequently overwritten/removed data may be
> higher, which may also be a reason to separate it out.
>

Excellent point, Peter.  Thanks for adding that.  Effective caching and
keeping down the number of SSTables a row is split across are both fairly
advanced performance topics, but certainly worth considering once you have
a solid data model (and a lot of data :).

- Tyler

Re: A few quick questions to help me design a better schema..

Posted by Peter Schuller <pe...@infidyne.com>.
>> 4. ) Does a larger number of column families have any impact on
>> performance (I read about it somewhere)? Should information for a particular
>> row key be split into multiple column families according to the specific query
>> demands, or should all data related to a particular row key be kept together
>> in a single column family?
>
> A higher number of column families requires more memory to be used and more
> compactions to occur.  I can't answer the rest of the question accurately
> without more detail on the particular use case.

Though in general I would say that it is worth considering. In
particular, if you have certain data that is accessed a lot more
frequently than other data (especially if the "other data" is large),
the cache-locality benefit of keeping the frequently accessed data
separate can be significant (assuming greater-than-RAM data sets). Another
concern might be if you have some parts that are constantly updated or
deleted, while some other part is mostly append-only. The
compaction needs of the frequently overwritten/removed data may be
higher, which may also be a reason to separate it out.

Whether rows should be split will of course depend on the specific
use case, as always.

-- 
/ Peter Schuller

Re: A few quick questions to help me design a better schema..

Posted by Tyler Hobbs <ty...@riptano.com>.
>
> 1. ) If certain columns in a row are mutated very frequently, or if new
> columns are frequently added to the row, are reads of the old columns
> that rarely change also affected? In other words, is the read performance
> of infrequently changing columns affected in any way when other columns
> in the same row are frequently updated or inserted?
>

Yes, the performance of reading columns that you haven't changed will still
be affected by changing other columns in the row.  Constantly updating a row
causes it to be split across multiple SSTables.  If you are asking for the
columns by name, you may not need to actually read any extra data from most
of the SSTables, but you will need to at least read the per-row Bloom Filter
on each (or read the index and scan a portion of the row for slices); this
costs one seek for each SSTable.
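
As a rough illustration of the point above, here is a toy Python model of how
repeated updates spread one row over several SSTables. It is only a sketch:
the names (flush, sstables_holding_row, simulate) are made up for this
example, it flushes after every single write to exaggerate the effect, and it
ignores compaction, which in a real cluster merges the fragments again.

```python
from typing import Dict, List

# Toy model: an SSTable is an immutable map of row key -> {column: value}.
SSTable = Dict[str, Dict[str, str]]


def flush(memtable: SSTable, sstables: List[SSTable]) -> None:
    """Freeze the current memtable into a new SSTable and empty the memtable."""
    if memtable:
        sstables.append({row: dict(cols) for row, cols in memtable.items()})
        memtable.clear()


def sstables_holding_row(sstables: List[SSTable], row_key: str) -> int:
    """Count SSTables holding a fragment of the row; a read must check each."""
    return sum(1 for table in sstables if row_key in table)


def simulate(updates_to_hot_column: int) -> int:
    """Write one stable column once, then rewrite a hot column repeatedly."""
    memtable: SSTable = {}
    sstables: List[SSTable] = []

    # The rarely changing column is written exactly once.
    memtable.setdefault("user1", {})["name"] = "alice"
    flush(memtable, sstables)

    # The hot column is rewritten many times; each write is flushed on its own
    # (hypothetical worst case, no compaction in between).
    for i in range(updates_to_hot_column):
        memtable.setdefault("user1", {})["last_seen"] = str(i)
        flush(memtable, sstables)

    return sstables_holding_row(sstables, "user1")


if __name__ == "__main__":
    for updates in (0, 5, 50):
        print(updates, "updates ->", simulate(updates), "SSTables hold the row")
```

Even though "name" never changes in this model, a read of it has to consult
every SSTable that contains a fragment of the row, and that count grows with
the number of flushed updates to "last_seen".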


> 2. ) Are all columns inside a super column family supercolumns, or can they
> be a mix of simple columns and supercolumns?
>

They are all super columns.  There is no mixing of column types.


> 3. ) When the row cache is enabled and certain columns of a row are read,
> is the entire row put into the cache, or just the columns that were read?
>

The entire row will be put into the cache.  This is good motivation for
splitting timelines into multiple rows, each covering a relatively short
timespan, if you mainly read the very end of the timeline.  Note that there
has been discussion somewhere of allowing you to only cache the last N
columns of a row in the row cache.
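
One way to split a timeline by timespan is to put a time bucket into the row
key. The sketch below shows the key arithmetic only; the key layout
"<timeline_id>:<bucket start>", the function names and the default bucket
size are assumptions made for this example, not anything Cassandra requires.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def bucketed_row_key(timeline_id: str, when: datetime, bucket_hours: int = 24) -> str:
    """Return '<timeline_id>:<bucket start in epoch seconds>'.

    `when` should be timezone-aware so that bucket boundaries do not depend
    on the local timezone of the machine running this code.
    """
    bucket_seconds = bucket_hours * 3600
    epoch = int(when.timestamp())
    bucket_start = epoch - (epoch % bucket_seconds)
    return f"{timeline_id}:{bucket_start}"


def recent_row_keys(timeline_id: str, now: datetime, count: int,
                    bucket_hours: int = 24) -> List[str]:
    """Row keys of the newest `count` buckets, newest first."""
    step = timedelta(hours=bucket_hours)
    return [bucketed_row_key(timeline_id, now - i * step, bucket_hours)
            for i in range(count)]


if __name__ == "__main__":
    now = datetime(2011, 1, 9, 13, 4, 35, tzinfo=timezone.utc)
    print(bucketed_row_key("timeline42", now))
    print(recent_row_keys("timeline42", now, 3))
```

With daily buckets, a reader that only wants the end of the timeline fetches
the one or two newest rows, so the row cache holds a day of columns per row
rather than the whole history.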


> 4. ) Does a larger number of column families have any impact on
> performance (I read about it somewhere)? Should information for a particular
> row key be split into multiple column families according to the specific query
> demands, or should all data related to a particular row key be kept together
> in a single column family?
>

A higher number of column families requires more memory to be used and more
compactions to occur.  I can't answer the rest of the question accurately
without more detail on the particular use case.


> 5. ) Are there any limitations of valueless columns to consider? I read in a
> ppt "Only works with <= 2B columns in 0.7 valueless colum". I could not
> understand the meaning of this statement.
>

I believe this is referring to the 2 billion column limit per row.  In the
real world, you generally don't want to get anywhere near that many columns
in a single row.
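
For a sense of scale, a back-of-the-envelope calculation can show how long a
single row would take to reach that figure at a steady insert rate. The
constant below is just the round "2B" number from this thread, and the
function name and rate are hypothetical; overwrites of an existing column
name do not count, only new distinct columns.

```python
APPROX_COLUMN_LIMIT = 2_000_000_000  # round "2B" figure from the thread
SECONDS_PER_DAY = 86_400


def days_until_limit(new_columns_per_second: float) -> float:
    """Days for one row to reach the approximate limit at a steady rate."""
    if new_columns_per_second <= 0:
        raise ValueError("rate must be positive")
    return APPROX_COLUMN_LIMIT / new_columns_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    for rate in (1, 100, 10_000):
        print(f"{rate} new columns/s -> {days_until_limit(rate):.1f} days")
```

This only estimates when the hard limit would be hit; rows usually become
awkward to read, cache and compact long before that.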

- Tyler