Posted to user@cassandra.apache.org by Dan Gould <da...@chill.com> on 2013/11/06 23:08:58 UTC

CQL 'IN' predicate

I was wondering if anyone had a sense of performance/best practices
around the 'IN' predicate.

I have a list of up to potentially ~30k keys that I want to look up in a
table (typically queries will have <500, but I worry about the long tail).  Most
of them will not exist in the table, but, say, about 10-20% will.

Would it be best to do:

1) SELECT fields FROM table WHERE id in (uuid1, uuid2, ...... uuid30000);

2) Split into smaller batches--
for group_of_100 in all_30000:
    // ** Issue in parallel or block after each one??
    SELECT fields FROM table WHERE id in (group_of_100 uuids);

3) Something else?

My guess is that (1) is fine and that the only worry is too much data returned (which won't be a problem in this case), but I wanted to check that it isn't a C* anti-pattern first.

[Conversely, is a batch insert with up to 30k items ok?]
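A rough sketch of option (2) in plain Python (the actual driver call is omitted; `fields`, `table`, and `id` are just the placeholder names from the query above):

```python
def chunked(keys, size=100):
    """Yield successive groups of at most `size` keys (option 2 above)."""
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

def in_query(table, column, group):
    """Build a SELECT ... IN (...) statement with one bind marker per key."""
    markers = ", ".join("?" for _ in group)
    return "SELECT fields FROM %s WHERE %s IN (%s)" % (table, column, markers)

# 30k keys in groups of 100 -> 300 statements to issue, serially or in parallel
keys = ["uuid%d" % n for n in range(30000)]
groups = list(chunked(keys))
```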

Thanks,
Dan


Re: CQL 'IN' predicate

Posted by Nate McCall <na...@thelastpickle.com>.
Regardless of the size of 'k', each key gets turned into a ReadCommand
internally and is (eventually) delegated to StorageProxy#fetchRows:
https://github.com/apache/cassandra/blob/cassandra-1.2/src/java/org/apache/cassandra/service/StorageProxy.java#L852

The same number of ReadCommands goes out either way; it's more a matter
of throttling resource usage.

Honestly, I would say play around with it for a little bit and monitor the
results - there is most likely a good sweet spot in there somewhere - bonus
if you can make it a stored procedure to mitigate the parsing overhead.
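One way to do that throttling on the client side (a sketch, not something from the thread): cap the number of in-flight queries with a small worker pool. `run_query` here stands in for a real driver call such as `session.execute`.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(groups, run_query, max_in_flight=4):
    """Issue one IN query per key group, at most `max_in_flight` at a time."""
    results = []
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for rows in pool.map(run_query, groups):  # map preserves group order
            results.extend(rows)
    return results

# Stand-in query function: pretend only keys ending in "0" exist in the table.
groups = [["k%d" % n for n in range(i, i + 100)] for i in range(0, 1000, 100)]
hits = fetch_all(groups, lambda g: [k for k in g if k.endswith("0")])
```

Tuning `max_in_flight` (and the group size) is exactly the sweet spot to look for while monitoring.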





-- 
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: CQL 'IN' predicate

Posted by Dan Gould <da...@chill.com>.
Thanks--that's what I was wondering.  So, if I understand you correctly, 
it sounds like a single
     SELECT ... WHERE foo in (k items);

can tie up k threads rather than 1 thread per node which can starve 
other tasks on a cluster.  AFAICT, there's no way to say "this query 
should be limited to only __% of the resources on each node".  Alas, for 
every other table in our system I've figured out nice ways to 
denormalize and turn complex things into single queries.  But this one 
I can't.

So--alas--it sounds like my best answer will be to issue lots of smaller 
queries with pauses in-between.  (Or to look at patching C* to be 
smarter about resource management, but I'm not at all familiar with C* 
internals so this may be impractical at the moment.)
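That fallback (smaller queries with pauses in between) could look roughly like this; `run_query` stands in for the actual driver call:

```python
import time

def fetch_paced(groups, run_query, pause_s=0.05):
    """Issue the smaller IN queries one at a time, pausing between batches
    so reads from other clients can be scheduled in the meantime."""
    results = []
    for group in groups:
        results.extend(run_query(group))
        time.sleep(pause_s)  # yield cluster capacity between batches
    return results
```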


Re: CQL 'IN' predicate

Posted by Aaron Morton <aa...@thelastpickle.com>.
> If one big query doesn't cause problems

Every row you read becomes (roughly) RF tasks in the cluster. With RF 3, asking for 100 rows in one query generates 300 tasks, which are processed by the read thread pool, which has a default of 32 threads. If you ask for a lot of rows and the number of nodes is low, there is a chance the client starves others as they wait for all the tasks to be completed. So I tend to like asking for fewer rows. 
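Put as back-of-the-envelope arithmetic (RF 3, a 3-node cluster, and an even token spread are assumptions here, not facts from the thread):

```python
READ_POOL_THREADS = 32  # default read thread pool size per node

def tasks_per_node(rows, rf=3, nodes=3):
    """Rough per-node read tasks for a multi-key query, assuming even spread."""
    return rows * rf // nodes

# 100 rows -> 300 tasks cluster-wide, ~100 per node: ~3x the 32-thread pool.
```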

Cheers

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com



Re: CQL 'IN' predicate

Posted by Nate McCall <na...@thelastpickle.com>.
Sorry - I meant more that it would be worth experimenting with (much)
smaller page sizes as opposed to getting back one giant page. That would
cut down the read command overhead, but you would still have the parsing
to deal with.

Aaron has a good point though. A big query is still a bad idea if it can
be avoided with more denormalization.

IME, when you find yourself on new ground like this while building an
application on Cassandra, it's best to take a step back, look at the data
model, and make some adjustments to play to Cassandra's strengths.
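The page-size tradeoff can be put in numbers (a sketch; 10k is the default cap mentioned earlier in the thread, and ~6k rows is the 20%-of-30k case from the original question):

```python
import math

def round_trips(expected_rows, page_size):
    """Pages (and thus coordinator round trips) needed to drain a result set."""
    return max(1, math.ceil(expected_rows / page_size))

# ~6000 matching rows: one giant 10k page vs. twelve 500-row pages.
```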




-- 
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: CQL 'IN' predicate

Posted by Dan Gould <da...@chill.com>.
Thanks Nate,

I assume 10k is the return limit.  I don't think I'll ever get close to 
10k matches to the IN query.  That said, you're right: to be safe I'll 
increase the limit to match the number of items on the IN.

I didn't know CQL supported stored procedures, but I'll take a look.  My 
question was mostly about parsing overhead, though.  If one 
big query doesn't cause problems--which I assume it wouldn't since there 
can be multiple threads parsing and I assume C* is smart about memory 
when accumulating results--I'd much rather do that.

Dan



Re: CQL 'IN' predicate

Posted by Nate McCall <na...@thelastpickle.com>.
Unless you explicitly set a page size (I'm pretty sure the query is
converted to a paging query automatically under the hood) you will get
capped at the default of 10k, which might get a little weird semantically.
That said, you should experiment with explicit page sizes and see where it
gets you (I've not tried this yet with an IN clause - would be really
curious to hear how it works).

Another thing to consider is that it's a pretty big statement to parse
every time. You might want to go the (much) smaller batch route so these
can be stored procedures? (Another thing I haven't tried with an IN clause
- don't see why it wouldn't work though.)
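To blunt that per-request parsing cost, one statement per batch size can be parsed once and then reused. A sketch: `prepare` stands in for a driver call such as `session.prepare` (in current drivers a prepared statement, rather than a stored procedure, is the mechanism that actually exists).

```python
_statement_cache = {}

def statement_for(prepare, table, column, n):
    """Parse the n-key IN statement once and reuse it for every later batch."""
    key = (table, column, n)
    if key not in _statement_cache:
        markers = ", ".join("?" for _ in range(n))
        _statement_cache[key] = prepare(
            "SELECT fields FROM %s WHERE %s IN (%s)" % (table, column, markers))
    return _statement_cache[key]
```

Fixing the batch size (padding the last group if needed) keeps the number of distinct statement shapes, and therefore parses, small.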






-- 
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com