Posted to user@cassandra.apache.org by Jack Krupansky <ja...@gmail.com> on 2016/04/18 18:16:06 UTC

Proper use of COUNT

Based on a recent inquiry, a recent thread of my own, and the coming
support for wide rows, I'll focus on a question that I feel needs better
documentation of recommended best practice:

When can the COUNT(*) aggregate row-counting function be used?

Except for relatively small or narrow queries, it seems to have a
propensity for timing out.

From what I gather (and without specific testing myself), SELECT COUNT(*)
seems to be warranted for use cases where:

1. The table is relatively small, or
2. The number of rows selected by the WHERE clause is relatively small, due
to a combination of a relatively small number of partitions and a
relatively narrow slice of rows in a partition.

One can always approximate a larger SELECT by performing multiple selects,
each specifying a limited range of tokens in the WHERE clause.
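For illustration, that token-range approach can be sketched as follows — a minimal Python sketch that assumes the default Murmur3Partitioner token space, with hypothetical table and key names ("my_table", "id"):

```python
# Sketch: split the Murmur3 token space into contiguous subranges and emit
# one COUNT(*) statement per range; summing the per-range results
# approximates a full-table count without one giant request.
MIN_TOKEN = -2**63       # Murmur3Partitioner minimum token
MAX_TOKEN = 2**63 - 1    # Murmur3Partitioner maximum token

def token_subranges(n):
    """Yield n contiguous (start, end) pairs covering the full token space."""
    span = (MAX_TOKEN - MIN_TOKEN) // n
    start = MIN_TOKEN
    for i in range(n):
        end = MAX_TOKEN if i == n - 1 else start + span
        yield start, end
        start = end + 1

def count_queries(table, pk, n):
    """Build one range-restricted COUNT(*) statement per subrange."""
    return [
        f"SELECT COUNT(*) FROM {table} "
        f"WHERE token({pk}) >= {start} AND token({pk}) <= {end};"
        for start, end in token_subranges(n)
    ]

for q in count_queries("my_table", "id", 4):
    print(q)
```

Each subrange query touches only a bounded slice of the ring, so each stays well under the timeout even when the total count is large.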

But, even if the number of partitions selected is relatively small, it is
possible that the partitions might be wide so that the number of rows is
not so small.

And with CASSANDRA-11206
<https://issues.apache.org/jira/browse/CASSANDRA-11206>, very large
partitions (2 GB) could have a non-trivial number of rows.

So, an open question is how small is reasonable for COUNT. Thousands? Tens
of thousands? Only hundreds?

Obviously it depends on your particular data and hardware, but I'm asking
on a general basis.

In any case, does anyone have any specific experiences to share as to what
recommendations they would have for the use of COUNT?

A companion question is whether COUNT(column_name) has the same limitations
and recommendations. It does have to actually fetch the column values as
opposed to simply determining the existence of the row, but how
consequential that additional processing is, I couldn't say.

-- Jack Krupansky

Re: Proper use of COUNT

Posted by DuyHai Doan <do...@gmail.com>.
Jack, you should have a look at my blog post; I did some testing with
various page-size values for aggregate functions:
http://www.doanduyhai.com/blog/?p=2015

On Tue, Apr 19, 2016 at 10:23 PM, Jack Krupansky <ja...@gmail.com>
wrote:

> BTW, I did notice this Jira for setting a client timeout for cqlsh, so
> maybe this is the culprit for that user:
>
> CASSANDRA-7516 - Configurable client timeout for cqlsh
> https://issues.apache.org/jira/browse/CASSANDRA-7516
>
> Or, should they actually be using the --request-timeout command line
> option for cqlsh?
>
> -- Jack Krupansky
>
> On Tue, Apr 19, 2016 at 4:56 PM, Jack Krupansky <ja...@gmail.com>
> wrote:
>
>> Sylvain & Tyler, this Jira is for a user reporting a timeout for SELECT
>> COUNT(*) using 3.3:
>> https://issues.apache.org/jira/browse/CASSANDRA-11566
>>
>> I'll let one of you guys follow up on that. I mean, I thought it was
>> timing out due to the amount of data, but you guys are saying that paging
>> should make that not a problem. Or is there a timeout in cqlsh simply
>> because the operation is slow - as opposed to the server reporting an
>> internal timeout?
>>
>> Thanks.
>>
>>
>>
>> -- Jack Krupansky
>>
>> On Tue, Apr 19, 2016 at 12:45 PM, Tyler Hobbs <ty...@datastax.com> wrote:
>>
>>>
>>> On Tue, Apr 19, 2016 at 11:32 AM, Jack Krupansky <
>>> jack.krupansky@gmail.com> wrote:
>>>
>>>>
>>>> Are the queries sent from the coordinator to other nodes sequencing
>>>> through partitions in token order and that's what allows the coordinator to
>>>> dedupe with just a single page at a time? IOW, if a target node responds
>>>> with a row from token t, then by definition there will be no further rows
>>>> returned from that node with a token less than t?
>>>>
>>>
>>> That's correct.  The internal paging for aggregation queries is exactly
>>> the same as the normal "client facing" paging.
>>>
>>>
>>>>
>>>> And if I understand all of this so far, this means that for 3.x COUNT
>>>> (and other aggregate functions) are "safe but may be slow" (paraphrasing
>>>> Sylvain.) Is this for 3.0 and later or some other 3.x (or even some 2.x)?
>>>>
>>>
>>> I think count(*) started using paging internally in 2.1, but I'm having
>>> trouble finding the jira ticket.  It could have been 2.0.
>>>
>>> The new aggregation functions in 2.2 utilize the same code path.
>>>
>>>
>>>>
>>>> There remains the question of recommended usage for COUNT. I think my
>>>> two proposed guidelines remain valid (ignoring the old timeout issue), with
>>>> the only remaining question about how large a row count is advisable for
>>>> "decent" request latency. 1,000? 10,000? Granted, it depends on the
>>>> specific data and hardware, but I'm thinking that the guidance should be
>>>> that you should only use COUNT(*) for no more than "low thousands" of rows
>>>> unless you are willing to accept it both being very slow and very
>>>> disruptive to normal cluster health. IOW, it's more like a batch analytics
>>>> operation than a real-time operation. An occasional administrative query to
>>>> measure table size should be okay, but common use for OLTP should be
>>>> restricted to relatively narrow slices or row counts... I think. Feedback
>>>> welcome.
>>>>
>>>> The upcoming support for 2GB partitions will be interesting, but the
>>>> same guidance should cover, I think. Maybe the numeric upper bound might be
>>>> a bit higher since only a single partition is involved, but if processing
>>>> many thousands of rows will remain time consuming, it sounds like that
>>>> should be treated more as a batch-style OLAP operation rather than a
>>>> real-time OLTP operation... I think.
>>>>
>>>
>>> I think this is decent guidance.  I'll also clarify that aggregation
>>> functions should only be used on single partitions if you expect to get a
>>> response back with reasonable latency.  Full table scans are still
>>> expensive, even when they're wrapped in an aggregation function.
>>>
>>> If count(*) is too slow, the standard alternatives are:
>>>  - counters
>>>  - a static count that's periodically refreshed by a batch/background
>>> process
>>>  - LWT increments on an int column
>>>  - an external datastore like redis
>>>
>>> Obviously, each of these has a different set of tradeoffs.
>>>
>>> --
>>> Tyler Hobbs
>>> DataStax <http://datastax.com/>
>>>
>>
>>
>

Re: Proper use of COUNT

Posted by Jack Krupansky <ja...@gmail.com>.
BTW, I did notice this Jira for setting a client timeout for cqlsh, so
maybe this is the culprit for that user:

CASSANDRA-7516 - Configurable client timeout for cqlsh
https://issues.apache.org/jira/browse/CASSANDRA-7516

Or, should they actually be using the --request-timeout command line option
for cqlsh?

-- Jack Krupansky

On Tue, Apr 19, 2016 at 4:56 PM, Jack Krupansky <ja...@gmail.com>
wrote:

> Sylvain & Tyler, this Jira is for a user reporting a timeout for SELECT
> COUNT(*) using 3.3:
> https://issues.apache.org/jira/browse/CASSANDRA-11566
>
> I'll let one of you guys follow up on that. I mean, I thought it was
> timing out due to the amount of data, but you guys are saying that paging
> should make that not a problem. Or is there a timeout in cqlsh simply
> because the operation is slow - as opposed to the server reporting an
> internal timeout?
>
> Thanks.
>
>
>
> -- Jack Krupansky
>
> On Tue, Apr 19, 2016 at 12:45 PM, Tyler Hobbs <ty...@datastax.com> wrote:
>
>>
>> On Tue, Apr 19, 2016 at 11:32 AM, Jack Krupansky <
>> jack.krupansky@gmail.com> wrote:
>>
>>>
>>> Are the queries sent from the coordinator to other nodes sequencing
>>> through partitions in token order and that's what allows the coordinator to
>>> dedupe with just a single page at a time? IOW, if a target node responds
>>> with a row from token t, then by definition there will be no further rows
>>> returned from that node with a token less than t?
>>>
>>
>> That's correct.  The internal paging for aggregation queries is exactly
>> the same as the normal "client facing" paging.
>>
>>
>>>
>>> And if I understand all of this so far, this means that for 3.x COUNT
>>> (and other aggregate functions) are "safe but may be slow" (paraphrasing
>>> Sylvain.) Is this for 3.0 and later or some other 3.x (or even some 2.x)?
>>>
>>
>> I think count(*) started using paging internally in 2.1, but I'm having
>> trouble finding the jira ticket.  It could have been 2.0.
>>
>> The new aggregation functions in 2.2 utilize the same code path.
>>
>>
>>>
>>> There remains the question of recommended usage for COUNT. I think my
>>> two proposed guidelines remain valid (ignoring the old timeout issue), with
>>> the only remaining question about how large a row count is advisable for
>>> "decent" request latency. 1,000? 10,000? Granted, it depends on the
>>> specific data and hardware, but I'm thinking that the guidance should be
>>> that you should only use COUNT(*) for no more than "low thousands" of rows
>>> unless you are willing to accept it both being very slow and very
>>> disruptive to normal cluster health. IOW, it's more like a batch analytics
>>> operation than a real-time operation. An occasional administrative query to
>>> measure table size should be okay, but common use for OLTP should be
>>> restricted to relatively narrow slices or row counts... I think. Feedback
>>> welcome.
>>>
>>> The upcoming support for 2GB partitions will be interesting, but the
>>> same guidance should cover, I think. Maybe the numeric upper bound might be
>>> a bit higher since only a single partition is involved, but if processing
>>> many thousands of rows will remain time consuming, it sounds like that
>>> should be treated more as a batch-style OLAP operation rather than a
>>> real-time OLTP operation... I think.
>>>
>>
>> I think this is decent guidance.  I'll also clarify that aggregation
>> functions should only be used on single partitions if you expect to get a
>> response back with reasonable latency.  Full table scans are still
>> expensive, even when they're wrapped in an aggregation function.
>>
>> If count(*) is too slow, the standard alternatives are:
>>  - counters
>>  - a static count that's periodically refreshed by a batch/background
>> process
>>  - LWT increments on an int column
>>  - an external datastore like redis
>>
>> Obviously, each of these has a different set of tradeoffs.
>>
>> --
>> Tyler Hobbs
>> DataStax <http://datastax.com/>
>>
>
>

Re: Proper use of COUNT

Posted by Jack Krupansky <ja...@gmail.com>.
Sylvain & Tyler, this Jira is for a user reporting a timeout for SELECT
COUNT(*) using 3.3:
https://issues.apache.org/jira/browse/CASSANDRA-11566

I'll let one of you guys follow up on that. I mean, I thought it was timing
out due to the amount of data, but you guys are saying that paging should
make that not a problem. Or is there a timeout in cqlsh simply because the
operation is slow - as opposed to the server reporting an internal timeout?

Thanks.



-- Jack Krupansky

On Tue, Apr 19, 2016 at 12:45 PM, Tyler Hobbs <ty...@datastax.com> wrote:

>
> On Tue, Apr 19, 2016 at 11:32 AM, Jack Krupansky <jack.krupansky@gmail.com
> > wrote:
>
>>
>> Are the queries sent from the coordinator to other nodes sequencing
>> through partitions in token order and that's what allows the coordinator to
>> dedupe with just a single page at a time? IOW, if a target node responds
>> with a row from token t, then by definition there will be no further rows
>> returned from that node with a token less than t?
>>
>
> That's correct.  The internal paging for aggregation queries is exactly
> the same as the normal "client facing" paging.
>
>
>>
>> And if I understand all of this so far, this means that for 3.x COUNT
>> (and other aggregate functions) are "safe but may be slow" (paraphrasing
>> Sylvain.) Is this for 3.0 and later or some other 3.x (or even some 2.x)?
>>
>
> I think count(*) started using paging internally in 2.1, but I'm having
> trouble finding the jira ticket.  It could have been 2.0.
>
> The new aggregation functions in 2.2 utilize the same code path.
>
>
>>
>> There remains the question of recommended usage for COUNT. I think my two
>> proposed guidelines remain valid (ignoring the old timeout issue), with the
>> only remaining question about how large a row count is advisable for
>> "decent" request latency. 1,000? 10,000? Granted, it depends on the
>> specific data and hardware, but I'm thinking that the guidance should be
>> that you should only use COUNT(*) for no more than "low thousands" of rows
>> unless you are willing to accept it both being very slow and very
>> disruptive to normal cluster health. IOW, it's more like a batch analytics
>> operation than a real-time operation. An occasional administrative query to
>> measure table size should be okay, but common use for OLTP should be
>> restricted to relatively narrow slices or row counts... I think. Feedback
>> welcome.
>>
>> The upcoming support for 2GB partitions will be interesting, but the same
>> guidance should cover, I think. Maybe the numeric upper bound might be a
>> bit higher since only a single partition is involved, but if processing
>> many thousands of rows will remain time consuming, it sounds like that
>> should be treated more as a batch-style OLAP operation rather than a
>> real-time OLTP operation... I think.
>>
>
> I think this is decent guidance.  I'll also clarify that aggregation
> functions should only be used on single partitions if you expect to get a
> response back with reasonable latency.  Full table scans are still
> expensive, even when they're wrapped in an aggregation function.
>
> If count(*) is too slow, the standard alternatives are:
>  - counters
>  - a static count that's periodically refreshed by a batch/background
> process
>  - LWT increments on an int column
>  - an external datastore like redis
>
> Obviously, each of these has a different set of tradeoffs.
>
> --
> Tyler Hobbs
> DataStax <http://datastax.com/>
>

Re: Proper use of COUNT

Posted by Tyler Hobbs <ty...@datastax.com>.
On Tue, Apr 19, 2016 at 11:32 AM, Jack Krupansky <ja...@gmail.com>
wrote:

>
> Are the queries sent from the coordinator to other nodes sequencing
> through partitions in token order and that's what allows the coordinator to
> dedupe with just a single page at a time? IOW, if a target node responds
> with a row from token t, then by definition there will be no further rows
> returned from that node with a token less than t?
>

That's correct.  The internal paging for aggregation queries is exactly the
same as the normal "client facing" paging.
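As a toy illustration of why token ordering makes this possible (the replica data below is invented, and this is a simplification of the real read path, which also does timestamp-based reconciliation): merging already token-sorted replica streams lets the coordinator discard duplicates by remembering only the last token emitted, so it never needs more than a page in memory.

```python
import heapq

def merged_distinct(replica_streams):
    """Merge token-sorted replica row streams, dropping duplicate tokens.

    Because each replica yields rows in token order, any token equal to the
    last one emitted must be a duplicate of a row already seen, so a single
    "last token" value is enough to dedupe - no global buffering needed.
    """
    last = object()  # sentinel that never equals a real token
    for token, row in heapq.merge(*replica_streams):
        if token != last:
            last = token
            yield token, row

# Invented data: three replicas holding overlapping token ranges (RF > 1).
r1 = [(-50, "a"), (10, "b"), (30, "c")]
r2 = [(10, "b"), (30, "c"), (90, "d")]
r3 = [(-50, "a"), (90, "d")]

rows = list(merged_distinct([r1, r2, r3]))
print(rows)  # four distinct rows, in token order
```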


>
> And if I understand all of this so far, this means that for 3.x COUNT (and
> other aggregate functions) are "safe but may be slow" (paraphrasing
> Sylvain.) Is this for 3.0 and later or some other 3.x (or even some 2.x)?
>

I think count(*) started using paging internally in 2.1, but I'm having
trouble finding the jira ticket.  It could have been 2.0.

The new aggregation functions in 2.2 utilize the same code path.


>
> There remains the question of recommended usage for COUNT. I think my two
> proposed guidelines remain valid (ignoring the old timeout issue), with the
> only remaining question about how large a row count is advisable for
> "decent" request latency. 1,000? 10,000? Granted, it depends on the
> specific data and hardware, but I'm thinking that the guidance should be
> that you should only use COUNT(*) for no more than "low thousands" of rows
> unless you are willing to accept it both being very slow and very
> disruptive to normal cluster health. IOW, it's more like a batch analytics
> operation than a real-time operation. An occasional administrative query to
> measure table size should be okay, but common use for OLTP should be
> restricted to relatively narrow slices or row counts... I think. Feedback
> welcome.
>
> The upcoming support for 2GB partitions will be interesting, but the same
> guidance should cover, I think. Maybe the numeric upper bound might be a
> bit higher since only a single partition is involved, but if processing
> many thousands of rows will remain time consuming, it sounds like that
> should be treated more as a batch-style OLAP operation rather than a
> real-time OLTP operation... I think.
>

I think this is decent guidance.  I'll also clarify that aggregation
functions should only be used on single partitions if you expect to get a
response back with reasonable latency.  Full table scans are still
expensive, even when they're wrapped in an aggregation function.

If count(*) is too slow, the standard alternatives are:
 - counters
 - a static count that's periodically refreshed by a batch/background
process
 - LWT increments on an int column
 - an external datastore like redis

Obviously, each of these has a different set of tradeoffs.
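The second alternative (a static count that's periodically refreshed) can be sketched as a small cache that re-runs the expensive count only when stale. The class name, interval, and `count_fn` are all invented for illustration; `count_fn` stands in for whatever batch job computes the real count.

```python
import time

class RefreshedCount:
    """Serve a possibly stale row count, recomputing it at most once per
    `interval` seconds. The clock is injectable so the refresh logic is
    easy to test; in production it defaults to a monotonic clock."""

    def __init__(self, count_fn, interval=300.0, clock=time.monotonic):
        self._count_fn = count_fn
        self._interval = interval
        self._clock = clock
        self._value = None
        self._stamp = None

    def get(self):
        now = self._clock()
        if self._stamp is None or now - self._stamp >= self._interval:
            self._value = self._count_fn()  # the expensive part, run rarely
            self._stamp = now
        return self._value
```

Readers then pay no per-request counting cost; the tradeoff, as noted above, is staleness of up to one refresh interval.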

-- 
Tyler Hobbs
DataStax <http://datastax.com/>

Re: Proper use of COUNT

Posted by Jack Krupansky <ja...@gmail.com>.
Thanks, Tyler.

"Deduping (i.e. normal conflict resolution) happens per-page"

Are the queries sent from the coordinator to other nodes sequencing through
partitions in token order, and is that what allows the coordinator to dedupe
with just a single page at a time? IOW, if a target node responds with a
row from token t, then by definition there will be no further rows returned
from that node with a token less than t?

And if I understand all of this so far, this means that for 3.x, COUNT (and
other aggregate functions) are "safe but may be slow" (paraphrasing
Sylvain). Is this for 3.0 and later, or some other 3.x (or even some 2.x)?

There remains the question of recommended usage for COUNT. I think my two
proposed guidelines remain valid (ignoring the old timeout issue), with the
only remaining question being how large a row count is advisable for
"decent" request latency. 1,000? 10,000? Granted, it depends on the
specific data and hardware, but I'm thinking the guidance should be that
you should only use COUNT(*) for no more than "low thousands" of rows
unless you are willing to accept it being both very slow and very
disruptive to normal cluster health. IOW, it's more like a batch analytics
operation than a real-time operation. An occasional administrative query to
measure table size should be okay, but common use for OLTP should be
restricted to relatively narrow slices or row counts... I think. Feedback
welcome.

The upcoming support for 2GB partitions will be interesting, but the same
guidance should cover it, I think. Maybe the numeric upper bound might be a
bit higher since only a single partition is involved, but if processing
many thousands of rows remains time consuming, it sounds like that should
be treated more as a batch-style OLAP operation than a real-time OLTP
operation... I think.

Thanks.


-- Jack Krupansky

On Tue, Apr 19, 2016 at 12:04 PM, Tyler Hobbs <ty...@datastax.com> wrote:

>
> On Tue, Apr 19, 2016 at 9:51 AM, Jack Krupansky <ja...@gmail.com>
> wrote:
>
>>
>> 1. Another clarification: All of the aggregate functions, AVG, SUM, MIN,
>> MAX are in exactly the same boat as COUNT, right?
>>
>
> Yes.
>
>
>>
>> 2. Is the paging for COUNT, et al, done within the coordinator node?
>>
>
> Yes.
>
>
>>
>> 3. Does dedupe on the coordinator node consume memory proportional to the
>> number of rows on all nodes? I mean, you can't dedupe using only partition
>> keys of the coordinator node, right? What I'm wondering is if the usability
>> of COUNT (et al) is memory limited as well as time.
>>
>
> Deduping (i.e. normal conflict resolution) happens per-page, so in the
> worst case the memory requirements for the coordinator are RF * page size.
>
>
>
>
> --
> Tyler Hobbs
> DataStax <http://datastax.com/>
>

Re: Proper use of COUNT

Posted by Tyler Hobbs <ty...@datastax.com>.
On Tue, Apr 19, 2016 at 9:51 AM, Jack Krupansky <ja...@gmail.com>
wrote:

>
> 1. Another clarification: All of the aggregate functions, AVG, SUM, MIN,
> MAX are in exactly the same boat as COUNT, right?
>

Yes.


>
> 2. Is the paging for COUNT, et al, done within the coordinator node?
>

Yes.


>
> 3. Does dedupe on the coordinator node consume memory proportional to the
> number of rows on all nodes? I mean, you can't dedupe using only partition
> keys of the coordinator node, right? What I'm wondering is if the usability
> of COUNT (et al) is memory limited as well as time.
>

Deduping (i.e. normal conflict resolution) happens per-page, so in the
worst case the memory requirements for the coordinator are RF * page size.
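A quick back-of-envelope sketch of that bound, with illustrative numbers (the RF and average row size are assumptions, not measurements):

```python
# Worst case per the description above: the coordinator buffers one page
# from each of RF replicas. All numbers below are illustrative assumptions.
rf = 3             # replication factor
page_size = 5000   # rows per page (the Python driver's default fetch size)
row_bytes = 200    # assumed average serialized row size

rows_buffered = rf * page_size            # 15,000 rows
approx_bytes = rows_buffered * row_bytes  # ~3 MB, independent of table size
print(rows_buffered, approx_bytes)
```

So the coordinator's memory footprint stays flat no matter how big the table is; it's elapsed time, not memory, that grows with the row count.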




-- 
Tyler Hobbs
DataStax <http://datastax.com/>

Re: Proper use of COUNT

Posted by Jack Krupansky <ja...@gmail.com>.
Thanks for that clarification, Sylvain.

1. Another clarification: All of the aggregate functions, AVG, SUM, MIN,
MAX are in exactly the same boat as COUNT, right?

2. Is the paging for COUNT, et al, done within the coordinator node?

3. Does dedupe on the coordinator node consume memory proportional to the
number of rows on all nodes? I mean, you can't dedupe using only partition
keys on the coordinator node, right? What I'm wondering is whether the
usability of COUNT (et al) is memory-limited as well as time-limited.

Thanks.


-- Jack Krupansky

On Tue, Apr 19, 2016 at 5:36 AM, Sylvain Lebresne <sy...@datastax.com>
wrote:

>
>> Except for relatively small or narrow queries, it seems to have a
>> propensity for timing out.
>>
>
> For recent enough versions of C*, it shouldn't, since it pages internally
> (it will be as slow as always, but it shouldn't time out if a decent page
> size is used, which should be the default). I suspect reports of it timing
> out either involve old versions (or unreasonable paging size values, but
> that seems less likely, since I'd assume users would easily find and fix
> their error in that case).
>
> But if the query is timing out unreasonably for large partitions in recent
> versions, then it's a bug and a JIRA can be opened with reproduction steps.
>
> --
> Sylvain
>
>>
>

Re: Proper use of COUNT

Posted by Sylvain Lebresne <sy...@datastax.com>.
>
>
> Except for relatively small or narrow queries, it seems to have a
> propensity for timing out.
>

For recent enough versions of C*, it shouldn't, since it pages internally
(it will be as slow as always, but it shouldn't time out if a decent page
size is used, which should be the default). I suspect reports of it timing
out either involve old versions (or unreasonable paging size values, but
that seems less likely, since I'd assume users would easily find and fix
their error in that case).

But if the query is timing out unreasonably for large partitions in recent
versions, then it's a bug and a JIRA can be opened with reproduction steps.
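The internal paging being described amounts to a loop like the following simplified sketch (the page source is a stand-in for the real storage read, not actual Cassandra code):

```python
def paged_count(fetch_page, page_size=5000):
    """Accumulate a COUNT(*) one page at a time.

    fetch_page(state, page_size) stands in for the server's storage read:
    it returns (rows_in_page, next_state), with next_state=None when the
    data is exhausted. Each call does a bounded amount of work, so a huge
    table makes the count slow but never makes any single step time out.
    """
    total, state = 0, None
    while True:
        rows, state = fetch_page(state, page_size)
        total += rows
        if state is None:
            return total

# Stand-in data source: 12,345 rows served page by page.
def fake_fetch(state, page_size):
    start = state or 0
    rows = min(page_size, 12345 - start)
    nxt = start + rows
    return rows, (None if nxt >= 12345 else nxt)

print(paged_count(fake_fetch))  # 12345
```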

--
Sylvain

>