Posted to user@cassandra.apache.org by Aditya Narayan <ad...@gmail.com> on 2011/02/07 07:07:44 UTC

Does variation in no. of columns in rows over the column family have any performance impact?

Does huge variation in the no. of columns in rows, across the column family,
have *any* impact on performance?

Can I have, say, just 100 columns in some rows and hundreds of
thousands of columns in another set of rows, without any downsides?

Re: Does variation in no. of columns in rows over the column family have any performance impact?

Posted by Edward Capriolo <ed...@gmail.com>.
On Mon, Feb 7, 2011 at 5:40 AM, Aditya Narayan <ad...@gmail.com> wrote:
> Thanks for the detailed explanation, Peter! Definitely cleared my doubts!
>
>
>
> On Mon, Feb 7, 2011 at 1:52 PM, Peter Schuller
> <pe...@infidyne.com> wrote:
>>> Does huge variation in the no. of columns in rows, across the column family,
>>> have *any* impact on performance?
>>>
>>> Can I have, say, just 100 columns in some rows and hundreds of
>>> thousands of columns in another set of rows, without any downsides?
>>
>> If I interpret your question the way I think you mean it, then no,
>> Cassandra doesn't "do" anything with the data such that the smaller
>> rows are somehow directly less efficient because there are other rows
>> that are bigger. It doesn't affect the on-disk format or the on-disk
>> efficiency of accessing the rows.
>>
>> However, there are almost always indirect effects when it comes to
>> performance, and in particular in storage systems. In the case of
>> Cassandra, the *variation* itself should not impose a direct
>> performance penalty, but there are potential other effects. For
>> example, the row cache is only useful for small rows, so if you are
>> looking to use the row cache, the huge rows would perhaps prevent that.
>> This could be interpreted as a performance impact on the smaller rows
>> by the larger rows. Compaction may become more expensive due to
>> e.g. additional GC pressure resulting from
>> large-but-still-within-in-memory-limits rows being compacted (or not,
>> depending on JVM/GC settings). There is also the effect of cache
>> locality as the data set grows, and the cache locality for the smaller
>> rows will likely be worse than had they been in e.g. a separate CF.
>>
>> Those are just three random examples; I'm just trying to make the point
>> that "without any downsides" is a very strong and blanket requirement
>> for making the decision to mix small rows with larger ones.
>>
>> --
>> / Peter Schuller
>>
>

Performance could also vary if you are using operations such as
get_slice with a large SlicePredicate: large rows take longer to be
deserialized and transferred than smaller rows. I have never
benchmarked this, but it would probably take a significant difference
in row size before the size of a row had a noticeable impact.
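
To make that concrete, here is a rough sketch of the kind of call being described, using the 0.7-era Thrift API directly (the keyspace, column family and row key names are invented for illustration):

    import java.nio.ByteBuffer;
    import java.util.List;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.ColumnOrSuperColumn;
    import org.apache.cassandra.thrift.ColumnParent;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    public class SliceExample {
        public static void main(String[] args) throws Exception {
            TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
            transport.open();
            client.set_keyspace("MyKeyspace");           // hypothetical keyspace

            // Ask for up to 100 columns from one row, in comparator order.
            SlicePredicate predicate = new SlicePredicate();
            predicate.setSlice_range(new SliceRange(
                    ByteBuffer.wrap(new byte[0]),        // start: beginning of the row
                    ByteBuffer.wrap(new byte[0]),        // finish: end of the row
                    false,                               // not reversed
                    100));                               // column count limit

            ByteBuffer rowKey = ByteBuffer.wrap("some-row-key".getBytes("UTF-8"));
            List<ColumnOrSuperColumn> columns = client.get_slice(
                    rowKey, new ColumnParent("MyColumnFamily"), predicate, ConsistencyLevel.ONE);

            System.out.println("got " + columns.size() + " columns");
            transport.close();
        }
    }

The cost of that call grows with the number of columns the SliceRange actually matches, so the same predicate against a 100-column row and a 100,000-column row can behave quite differently on the server and on the wire.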

Re: Does variation in no. of columns in rows over the column family have any performance impact?

Posted by Aditya Narayan <ad...@gmail.com>.
Thanks for the detailed explanation, Peter! Definitely cleared my doubts!



On Mon, Feb 7, 2011 at 1:52 PM, Peter Schuller
<pe...@infidyne.com> wrote:
>> Does huge variation in the no. of columns in rows, across the column family,
>> have *any* impact on performance?
>>
>> Can I have, say, just 100 columns in some rows and hundreds of
>> thousands of columns in another set of rows, without any downsides?
>
> If I interpret your question the way I think you mean it, then no,
> Cassandra doesn't "do" anything with the data such that the smaller
> rows are somehow directly less efficient because there are other rows
> that are bigger. It doesn't affect the on-disk format or the on-disk
> efficiency of accessing the rows.
>
> However, there are almost always indirect effects when it comes to
> performance, and in particular in storage systems. In the case of
> Cassandra, the *variation* itself should not impose a direct
> performance penalty, but there are potential other effects. For
> example, the row cache is only useful for small rows, so if you are
> looking to use the row cache, the huge rows would perhaps prevent that.
> This could be interpreted as a performance impact on the smaller rows
> by the larger rows. Compaction may become more expensive due to
> e.g. additional GC pressure resulting from
> large-but-still-within-in-memory-limits rows being compacted (or not,
> depending on JVM/GC settings). There is also the effect of cache
> locality as the data set grows, and the cache locality for the smaller
> rows will likely be worse than had they been in e.g. a separate CF.
>
> Those are just three random examples; I'm just trying to make the point
> that "without any downsides" is a very strong and blanket requirement
> for making the decision to mix small rows with larger ones.
>
> --
> / Peter Schuller
>

Re: Does variation in no. of columns in rows over the column family have any performance impact?

Posted by Peter Schuller <pe...@infidyne.com>.
> Does huge variation in the no. of columns in rows, across the column family,
> have *any* impact on performance?
>
> Can I have, say, just 100 columns in some rows and hundreds of
> thousands of columns in another set of rows, without any downsides?

If I interpret your question the way I think you mean it, then no,
Cassandra doesn't "do" anything with the data such that the smaller
rows are somehow directly less efficient because there are other rows
that are bigger. It doesn't affect the on-disk format or the on-disk
efficiency of accessing the rows.

However, there are almost always indirect effects when it comes to
performance, and in particular in storage systems. In the case of
Cassandra, the *variation* itself should not impose a direct
performance penalty, but there are potential other effects. For
example, the row cache is only useful for small rows, so if you are
looking to use the row cache, the huge rows would perhaps prevent that.
This could be interpreted as a performance impact on the smaller rows
by the larger rows. Compaction may become more expensive due to
e.g. additional GC pressure resulting from
large-but-still-within-in-memory-limits rows being compacted (or not,
depending on JVM/GC settings). There is also the effect of cache
locality as the data set grows, and the cache locality for the smaller
rows will likely be worse than had they been in e.g. a separate CF.
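
To make the separate-CF idea concrete, here is a rough sketch against the 0.7-era Thrift schema API (keyspace and CF names invented; the CfDef cache fields are as I recall them, so verify against your version): give the CF holding the small rows a row cache, and leave the wide-row CF to the key cache.

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.CfDef;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    public class SplitCfExample {
        public static void main(String[] args) throws Exception {
            TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
            transport.open();
            client.set_keyspace("MyKeyspace");               // hypothetical keyspace

            // Small rows: cache whole rows, they fit comfortably in memory.
            CfDef smallRows = new CfDef("MyKeyspace", "SmallRows");
            smallRows.setComparator_type("UTF8Type");
            smallRows.setRow_cache_size(10000);              // number of rows to cache
            client.system_add_column_family(smallRows);

            // Wide rows: no row cache (a single row could be huge), rely on the key cache.
            CfDef wideRows = new CfDef("MyKeyspace", "WideRows");
            wideRows.setComparator_type("UTF8Type");
            wideRows.setRow_cache_size(0);
            wideRows.setKey_cache_size(200000);              // number of keys to cache
            client.system_add_column_family(wideRows);

            transport.close();
        }
    }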

Those are just three random examples; I'm just trying to make the point
that "without any downsides" is a very strong and blanket requirement
for making the decision to mix small rows with larger ones.

-- 
/ Peter Schuller

Re: Does variation in no. of columns in rows over the column family have any performance impact?

Posted by Aaron Morton <aa...@thelastpickle.com>.
For completeness, there are a couple of settings in the config file that may be interesting if you run into issues (a rough excerpt of them follows after the list).

- column_index_size_in_kb defines how big a row has to get before an index is written for the row. Without an index the entire row must be read to find a column. 

- in_memory_compaction_limit_in_mb defines the maximum size of row that can be compacted in memory; larger rows go through a slower compaction process.

- sliced_buffer_size_in_kb controls the size of the buffer when slicing columns. 
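
For reference, a rough excerpt of how those settings look in a 0.7-era cassandra.yaml; the values shown are the shipped defaults as far as I recall, so check your own file:

    # Add a column-level index to a row once it grows past this size.
    column_index_size_in_kb: 64

    # Rows larger than this go through the slower two-pass compaction path.
    in_memory_compaction_limit_in_mb: 64

    # Buffer size used when performing contiguous column slices.
    sliced_buffer_size_in_kb: 64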

 Aaron
 
On 08 Feb, 2011, at 08:03 AM, Daniel Doubleday <da...@gmx.net> wrote:

It depends a little on your write pattern:

- Wide rows tend to get distributed over more sstables, so more disk reads are necessary. This will become noticeable when you have high IO load and reads actually hit the disks.
- If you delete a lot, slice query performance might suffer. Extreme example: create 2M cols, delete the first 1M, and then ask for the first 10.


On Feb 7, 2011, at 7:07 AM, Aditya Narayan wrote:

> Does huge variation in the no. of columns in rows, across the column family,
> have *any* impact on performance?
> 
> Can I have, say, just 100 columns in some rows and hundreds of
> thousands of columns in another set of rows, without any downsides?


Re: Does variation in no. of columns in rows over the column family have any performance impact?

Posted by Daniel Doubleday <da...@gmx.net>.
It depends a little on your write pattern:

- Wide rows tend to get distributed over more sstables, so more disk reads are necessary. This will become noticeable when you have high IO load and reads actually hit the disks.
- If you delete a lot, slice query performance might suffer. Extreme example: create 2M cols, delete the first 1M, and then ask for the first 10 (a rough sketch of that pattern follows below).
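
A scaled-down sketch of that pattern with the 0.7-era Thrift API (names invented, 20 columns instead of 2M): the final slice call looks cheap on the client side, but the server has to read past every tombstone left by the deletes before it finds the first live column, which is exactly what hurts at the 1M-tombstone scale.

    import java.nio.ByteBuffer;
    import java.util.List;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.Column;
    import org.apache.cassandra.thrift.ColumnOrSuperColumn;
    import org.apache.cassandra.thrift.ColumnParent;
    import org.apache.cassandra.thrift.ColumnPath;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    public class TombstoneSliceExample {
        public static void main(String[] args) throws Exception {
            TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
            transport.open();
            client.set_keyspace("MyKeyspace");                   // hypothetical keyspace

            ByteBuffer rowKey = ByteBuffer.wrap("wide-row".getBytes("UTF-8"));
            ColumnParent parent = new ColumnParent("WideRows");  // hypothetical CF

            // Write N columns ("create 2M cols"; 20 keeps the sketch readable).
            for (int i = 0; i < 20; i++) {
                Column col = new Column();
                col.setName(ByteBuffer.wrap(String.format("col%08d", i).getBytes("UTF-8")));
                col.setValue(ByteBuffer.wrap("v".getBytes("UTF-8")));
                col.setTimestamp(System.currentTimeMillis() * 1000);
                client.insert(rowKey, parent, col, ConsistencyLevel.ONE);
            }

            // Delete the first half, leaving a run of tombstones at the front of the row.
            for (int i = 0; i < 10; i++) {
                ColumnPath path = new ColumnPath("WideRows");
                path.setColumn(ByteBuffer.wrap(String.format("col%08d", i).getBytes("UTF-8")));
                client.remove(rowKey, path, System.currentTimeMillis() * 1000 + 1, ConsistencyLevel.ONE);
            }

            // "Ask for the first 10": the server walks the tombstones before it
            // can return the first live columns.
            SlicePredicate first10 = new SlicePredicate();
            first10.setSlice_range(new SliceRange(
                    ByteBuffer.wrap(new byte[0]), ByteBuffer.wrap(new byte[0]), false, 10));
            List<ColumnOrSuperColumn> live = client.get_slice(rowKey, parent, first10, ConsistencyLevel.ONE);
            System.out.println("live columns returned: " + live.size());

            transport.close();
        }
    }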


On Feb 7, 2011, at 7:07 AM, Aditya Narayan wrote:

> Does huge variation in the no. of columns in rows, across the column family,
> have *any* impact on performance?
> 
> Can I have, say, just 100 columns in some rows and hundreds of
> thousands of columns in another set of rows, without any downsides?