You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Mat Brown <ma...@brewster.com> on 2012/08/17 01:09:10 UTC

Omitting empty columns from CQL SELECT

Hello all,

I've noticed that when performing a SELECT statement with a list of
columns specified, Cassandra returns all columns in the resulting
row(s) even if they have no value. This creates an apparently
considerable amount of transport and deserialization overhead,
particularly in one use case I'm looking at, in which we select a
large collection of columns but expect only a small fraction of them
to contain values. Is there any way to get around this and only
receive columns that have values in the results?

Thanks,
Mat

Re: Omitting empty columns from CQL SELECT

Posted by aaron morton <aa...@thelastpickle.com>.
> there a situation in which that behavior would be useful?
guessing, makes life easier to client implementations and is consistent in the sense that when doing a slice by name the server is the entity that decides which columns are in the result set. 
 
I took a look at the performance of various query techniques here http://www.slideshare.net/aaronmorton/cassandra-sf-2012-technical-deep-dive-query-performance and http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/ . 

If you are selecting by name on wide rows you will eventually see latency increase. This simply has to do with the amount of data pages that must be read to satisfy the query. Note though that this is better in 1.X . See slide 61 in the first link and "In Motion - Name Locality" section in the second. 

Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 17/08/2012, at 10:07 PM, Mat Brown <ma...@brewster.com> wrote:

> Hi Aaron,
> 
> Thanks for the answer. That makes sense and I can see it as a formal
> reason for returning empty columns, but as a practical matter, is
> there a situation in which that behavior would be useful?
> 
> Unfortunately a column slice won't do the trick -- the columns we're
> looking for at any given time wouldn't correspond to a particular
> range; it's essentially "random access".
> 
> For what it's worth, I've managed to make this operation about 30x
> faster in a quick benchmark by just not selecting for specific columns
> at all, and throwing away columns I don't care about in the
> application layer instead. It's unclear whether the performance
> improvements will continue to accrue as the column family becomes more
> densely populated, though.
> 
> Anyway, thanks again!
> Mat
> 
> On Fri, Aug 17, 2012 at 5:06 AM, aaron morton <aa...@thelastpickle.com> wrote:
>> If you specify the columns by name in the select clause the query returns
>> them because they should be projected in the result set.
>> 
>> Can you use a column slice instead ?
>> 
>> Cheers
>> 
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 17/08/2012, at 11:09 AM, Mat Brown <ma...@brewster.com> wrote:
>> 
>> Hello all,
>> 
>> I've noticed that when performing a SELECT statement with a list of
>> columns specified, Cassandra returns all columns in the resulting
>> row(s) even if they have no value. This creates an apparently
>> considerable amount of transport and deserialization overhead,
>> particularly in one use case I'm looking at, in which we select a
>> large collection of columns but expect only a small fraction of them
>> to contain values. Is there any way to get around this and only
>> receive columns that have values in the results?
>> 
>> Thanks,
>> Mat
>> 
>> 


Re: Omitting empty columns from CQL SELECT

Posted by Mat Brown <ma...@brewster.com>.
Hi Aaron,

Thanks for the answer. That makes sense and I can see it as a formal
reason for returning empty columns, but as a practical matter, is
there a situation in which that behavior would be useful?

Unfortunately a column slice won't do the trick -- the columns we're
looking for at any given time wouldn't correspond to a particular
range; it's essentially "random access".

For what it's worth, I've managed to make this operation about 30x
faster in a quick benchmark by just not selecting for specific columns
at all, and throwing away columns I don't care about in the
application layer instead. It's unclear whether the performance
improvements will continue to accrue as the column family becomes more
densely populated, though.

Anyway, thanks again!
Mat

On Fri, Aug 17, 2012 at 5:06 AM, aaron morton <aa...@thelastpickle.com> wrote:
> If you specify the columns by name in the select clause the query returns
> them because they should be projected in the result set.
>
> Can you use a column slice instead ?
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/08/2012, at 11:09 AM, Mat Brown <ma...@brewster.com> wrote:
>
> Hello all,
>
> I've noticed that when performing a SELECT statement with a list of
> columns specified, Cassandra returns all columns in the resulting
> row(s) even if they have no value. This creates an apparently
> considerable amount of transport and deserialization overhead,
> particularly in one use case I'm looking at, in which we select a
> large collection of columns but expect only a small fraction of them
> to contain values. Is there any way to get around this and only
> receive columns that have values in the results?
>
> Thanks,
> Mat
>
>

Re: Omitting empty columns from CQL SELECT

Posted by aaron morton <aa...@thelastpickle.com>.
If you specify the columns by name in the select clause the query returns them because they should be projected in the result set. 

Can you use a column slice instead ?

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 17/08/2012, at 11:09 AM, Mat Brown <ma...@brewster.com> wrote:

> Hello all,
> 
> I've noticed that when performing a SELECT statement with a list of
> columns specified, Cassandra returns all columns in the resulting
> row(s) even if they have no value. This creates an apparently
> considerable amount of transport and deserialization overhead,
> particularly in one use case I'm looking at, in which we select a
> large collection of columns but expect only a small fraction of them
> to contain values. Is there any way to get around this and only
> receive columns that have values in the results?
> 
> Thanks,
> Mat