Posted to user@cassandra.apache.org by Jack Krupansky <ja...@gmail.com> on 2016/04/09 00:48:26 UTC

1, 2, 3...

I'm afraid I don't have a solid answer to this obvious question: How do I
get a fairly accurate count of (CQL) rows in a Cassandra table?

Does SELECT COUNT(*) FROM <table-name> actually do it?

Does it really count (CQL) rows across all nodes and exclude replicated
rows?

Is there a better/preferred technique? For example, is it more efficient to
query the row count one node at a time?

And for bonus points: How do you count (CQL) rows for each node? Again,
excluding replication.

-- Jack Krupansky

Re: 1, 2, 3...

Posted by Spencer Brown <li...@gmail.com>.
CQL queries don't count replicas - that would make any SELECT meaningless,
since they would all return duplicates.

On Fri, Apr 8, 2016 at 6:48 PM, Jack Krupansky <ja...@gmail.com>
wrote:

> I'm afraid I don't have a solid answer to this obvious question: How do
> I get a fairly accurate count of (CQL) rows in a Cassandra table?
>
> Does SELECT COUNT(*) FROM <table-name> actually do it?
>
> Does it really count (CQL) rows across all nodes and exclude replicated
> rows?
>
> Is there a better/preferred technique? For example, is it more efficient
> to query the row count one node at a time?
>
> And for bonus points: How do you count (CQL) rows for each node? Again,
> excluding replication.
>
> -- Jack Krupansky
>

Re: 1, 2, 3...

Posted by Emīls Šolmanis <em...@gmail.com>.
You're not mistaken, I just thought you were after partition keys and didn't
read the question that carefully. AFAIK, you're SOOL if you need to count
clustering keys as distinct rows. Well, other than doing a full table scan,
of course, which I'm assuming is not practical.

On Mon, 11 Apr 2016 at 16:52 Jack Krupansky <ja...@gmail.com>
wrote:

> Unless I'm mistaken, nodetool tablestats gives you the number of
> partitions (partition keys), not the number of primary keys. IOW, the term
> "keys" is ambiguous. That's why I phrased the original question as count of
> (CQL) rows, to distinguish from the pre-CQL3 concept of a partition being
> treated as a single row.
>
> -- Jack Krupansky
>
> On Mon, Apr 11, 2016 at 11:46 AM, Emīls Šolmanis <emils.solmanis@gmail.com
> > wrote:
>
>> Wouldn't the "number of keys" figure from *nodetool cfstats*, run on every
>> node, summed, and divided by the replication factor, give you a decent
>> approximation? Or are you really after a completely precise number?
>>
>> On Mon, 11 Apr 2016 at 16:18 Jack Krupansky <ja...@gmail.com>
>> wrote:
>>
>>> Agreed that anything requiring a full table scan, short of batch
>>> analytics, is an antipattern, although the goal is not to do a full scan
>>> per se, but just to get the row count. It still surprises people that
>>> Cassandra cannot quickly get COUNT(*). The easy answer: use DSE Search and
>>> do a Solr query for q=*:*, which will very quickly return the total row
>>> count. I presume that Stratio will handle this fine as well.
>>>
>>>
>>> -- Jack Krupansky
>>>
>>> On Mon, Apr 11, 2016 at 11:10 AM, <SE...@homedepot.com> wrote:
>>>
>>>> Cassandra is not good for table scan type queries (which count(*)
>>>> typically is). While there are some attempts to do that (as noted below),
>>>> this is a path I avoid.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Sean Durity
>>>>
>>>>
>>>>
>>>> *From:* Max C [mailto:mc_cassandra@core43.com]
>>>> *Sent:* Saturday, April 09, 2016 6:19 PM
>>>> *To:* user@cassandra.apache.org
>>>> *Subject:* Re: 1, 2, 3...
>>>>
>>>>
>>>>
>>>> Looks like this guy (Brian Hess) wrote a script to split the token
>>>> range and run count(*) on each subrange:
>>>>
>>>>
>>>>
>>>> https://github.com/brianmhess/cassandra-count
>>>>
>>>>
>>>>
>>>> - Max
>>>>
>>>>
>>>>
>>>> On Apr 8, 2016, at 10:56 pm, Jeff Jirsa <je...@crowdstrike.com>
>>>> wrote:
>>>>
>>>>
>>>>
>>>> SELECT COUNT(*) probably works (with internal paging) on many datasets
>>>> with enough time and assuming you don’t have any partitions that will kill
>>>> you.
>>>>
>>>>
>>>>
>>>> No, it doesn’t count extra replicas / duplicates.
>>>>
>>>>
>>>>
>>>> The old way to do this (before paging / fetch size) was to use manual
>>>> paging based on tokens/clustering keys:
>>>>
>>>>
>>>>
>>>> https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html –
>>>> SELECT’s WHERE clause can use token(), which is what you’d want to use to
>>>> page through the whole token space.
>>>>
>>>>
>>>>
>>>> You could, in theory, issue thousands of queries in parallel, all for
>>>> different token ranges, and then sum the results. That’s what something
>>>> like spark would be doing. If you want to determine rows per node, limit
>>>> the token range to that owned by the node (easier with 1 token than vnodes,
>>>> with vnodes repeat num_tokens times).
>>>>
>>>>
>>>>
>>>
>>>
>

Re: 1, 2, 3...

Posted by Jack Krupansky <ja...@gmail.com>.
Unless I'm mistaken, nodetool tablestats gives you the number of partitions
(partition keys), not the number of primary keys. IOW, the term "keys" is
ambiguous. That's why I phrased the original question as count of (CQL)
rows, to distinguish from the pre-CQL3 concept of a partition being treated
as a single row.

-- Jack Krupansky

On Mon, Apr 11, 2016 at 11:46 AM, Emīls Šolmanis <em...@gmail.com>
wrote:

> Wouldn't the "number of keys" figure from *nodetool cfstats*, run on every
> node, summed, and divided by the replication factor, give you a decent
> approximation? Or are you really after a completely precise number?
>
> On Mon, 11 Apr 2016 at 16:18 Jack Krupansky <ja...@gmail.com>
> wrote:
>
>> Agreed that anything requiring a full table scan, short of batch
>> analytics, is an antipattern, although the goal is not to do a full scan
>> per se, but just to get the row count. It still surprises people that
>> Cassandra cannot quickly get COUNT(*). The easy answer: use DSE Search and
>> do a Solr query for q=*:*, which will very quickly return the total row
>> count. I presume that Stratio will handle this fine as well.
>>
>>
>> -- Jack Krupansky
>>
>> On Mon, Apr 11, 2016 at 11:10 AM, <SE...@homedepot.com> wrote:
>>
>>> Cassandra is not good for table scan type queries (which count(*)
>>> typically is). While there are some attempts to do that (as noted below),
>>> this is a path I avoid.
>>>
>>>
>>>
>>>
>>>
>>> Sean Durity
>>>
>>>
>>>
>>> *From:* Max C [mailto:mc_cassandra@core43.com]
>>> *Sent:* Saturday, April 09, 2016 6:19 PM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: 1, 2, 3...
>>>
>>>
>>>
>>> Looks like this guy (Brian Hess) wrote a script to split the token range
>>> and run count(*) on each subrange:
>>>
>>>
>>>
>>> https://github.com/brianmhess/cassandra-count
>>>
>>>
>>>
>>> - Max
>>>
>>>
>>>
>>> On Apr 8, 2016, at 10:56 pm, Jeff Jirsa <je...@crowdstrike.com>
>>> wrote:
>>>
>>>
>>>
>>> SELECT COUNT(*) probably works (with internal paging) on many datasets
>>> with enough time and assuming you don’t have any partitions that will kill
>>> you.
>>>
>>>
>>>
>>> No, it doesn’t count extra replicas / duplicates.
>>>
>>>
>>>
>>> The old way to do this (before paging / fetch size) was to use manual
>>> paging based on tokens/clustering keys:
>>>
>>>
>>>
>>> https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html –
>>> SELECT’s WHERE clause can use token(), which is what you’d want to use to
>>> page through the whole token space.
>>>
>>>
>>>
>>> You could, in theory, issue thousands of queries in parallel, all for
>>> different token ranges, and then sum the results. That’s what something
>>> like spark would be doing. If you want to determine rows per node, limit
>>> the token range to that owned by the node (easier with 1 token than vnodes,
>>> with vnodes repeat num_tokens times).
>>>
>>>
>>>
>>
>>

Re: 1, 2, 3...

Posted by Emīls Šolmanis <em...@gmail.com>.
Wouldn't the "number of keys" figure from *nodetool cfstats*, run on every
node, summed, and divided by the replication factor, give you a decent
approximation? Or are you really after a completely precise number?
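
For illustration, a minimal sketch of that arithmetic (the node addresses,
table name, replication factor, and the exact "Number of keys" label are
assumptions; whether nodetool can reach remote nodes with -h depends on your
JMX setup, and the figure it reports is itself only an estimate of partitions):

import re
import subprocess

nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical node addresses
table = "my_keyspace.my_table"                # hypothetical keyspace.table
replication_factor = 3                        # hypothetical RF

total = 0
for node in nodes:
    # Run `nodetool cfstats <keyspace>.<table>` against each node and pull
    # out the "Number of keys (estimate)" value.
    out = subprocess.check_output(
        ["nodetool", "-h", node, "cfstats", table], text=True)
    match = re.search(r"Number of keys.*?:\s*(\d+)", out)
    if match:
        total += int(match.group(1))

print("Approximate partition count:", total // replication_factor)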

On Mon, 11 Apr 2016 at 16:18 Jack Krupansky <ja...@gmail.com>
wrote:

> Agreed that anything requiring a full table scan, short of batch
> analytics, is an antipattern, although the goal is not to do a full scan
> per se, but just to get the row count. It still surprises people that
> Cassandra cannot quickly get COUNT(*). The easy answer: use DSE Search and
> do a Solr query for q=*:*, which will very quickly return the total row
> count. I presume that Stratio will handle this fine as well.
>
>
> -- Jack Krupansky
>
> On Mon, Apr 11, 2016 at 11:10 AM, <SE...@homedepot.com> wrote:
>
>> Cassandra is not good for table scan type queries (which count(*)
>> typically is). While there are some attempts to do that (as noted below),
>> this is a path I avoid.
>>
>>
>>
>>
>>
>> Sean Durity
>>
>>
>>
>> *From:* Max C [mailto:mc_cassandra@core43.com]
>> *Sent:* Saturday, April 09, 2016 6:19 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: 1, 2, 3...
>>
>>
>>
>> Looks like this guy (Brian Hess) wrote a script to split the token range
>> and run count(*) on each subrange:
>>
>>
>>
>> https://github.com/brianmhess/cassandra-count
>>
>>
>>
>> - Max
>>
>>
>>
>> On Apr 8, 2016, at 10:56 pm, Jeff Jirsa <je...@crowdstrike.com>
>> wrote:
>>
>>
>>
>> SELECT COUNT(*) probably works (with internal paging) on many datasets
>> with enough time and assuming you don’t have any partitions that will kill
>> you.
>>
>>
>>
>> No, it doesn’t count extra replicas / duplicates.
>>
>>
>>
>> The old way to do this (before paging / fetch size) was to use manual
>> paging based on tokens/clustering keys:
>>
>>
>>
>> https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html –
>> SELECT’s WHERE clause can use token(), which is what you’d want to use to
>> page through the whole token space.
>>
>>
>>
>> You could, in theory, issue thousands of queries in parallel, all for
>> different token ranges, and then sum the results. That’s what something
>> like spark would be doing. If you want to determine rows per node, limit
>> the token range to that owned by the node (easier with 1 token than vnodes,
>> with vnodes repeat num_tokens times).
>>
>>
>>
>
>

Re: 1, 2, 3...

Posted by Jack Krupansky <ja...@gmail.com>.
Agreed that anything requiring a full table scan, short of batch
analytics, is an antipattern, although the goal is not to do a full scan
per se, but just to get the row count. It still surprises people that
Cassandra cannot quickly get COUNT(*). The easy answer: use DSE Search and
do a Solr query for q=*:*, which will very quickly return the total row
count. I presume that Stratio will handle this fine as well.
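
As a rough sketch of that Solr route (assuming DSE Search exposes the usual
Solr HTTP API on port 8983 with a core named <keyspace>.<table>; the host,
keyspace, and table names below are placeholders), the count is just the
numFound field from a query that asks for zero rows:

import json
import urllib.request

# Ask Solr for q=*:* with rows=0: we only want the numFound total, not data.
url = ("http://dse-node.example.com:8983/solr/my_keyspace.my_table/select"
       "?q=*:*&rows=0&wt=json")
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

print("Indexed row count:", data["response"]["numFound"])

Note that this counts indexed rows, so it can lag slightly behind very recent
writes.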


-- Jack Krupansky

On Mon, Apr 11, 2016 at 11:10 AM, <SE...@homedepot.com> wrote:

> Cassandra is not good for table scan type queries (which count(*)
> typically is). While there are some attempts to do that (as noted below),
> this is a path I avoid.
>
>
>
>
>
> Sean Durity
>
>
>
> *From:* Max C [mailto:mc_cassandra@core43.com]
> *Sent:* Saturday, April 09, 2016 6:19 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: 1, 2, 3...
>
>
>
> Looks like this guy (Brian Hess) wrote a script to split the token range
> and run count(*) on each subrange:
>
>
>
> https://github.com/brianmhess/cassandra-count
>
>
>
> - Max
>
>
>
> On Apr 8, 2016, at 10:56 pm, Jeff Jirsa <je...@crowdstrike.com>
> wrote:
>
>
>
> SELECT COUNT(*) probably works (with internal paging) on many datasets
> with enough time and assuming you don’t have any partitions that will kill
> you.
>
>
>
> No, it doesn’t count extra replicas / duplicates.
>
>
>
> The old way to do this (before paging / fetch size) was to use manual
> paging based on tokens/clustering keys:
>
>
>
> https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html –
> SELECT’s WHERE clause can use token(), which is what you’d want to use to
> page through the whole token space.
>
>
>
> You could, in theory, issue thousands of queries in parallel, all for
> different token ranges, and then sum the results. That’s what something
> like spark would be doing. If you want to determine rows per node, limit
> the token range to that owned by the node (easier with 1 token than vnodes,
> with vnodes repeat num_tokens times).
>
>
>

RE: 1, 2, 3...

Posted by SE...@homedepot.com.
Cassandra is not good for table scan type queries (which count(*) typically is). While there are some attempts to do that (as noted below), this is a path I avoid.


Sean Durity

From: Max C [mailto:mc_cassandra@core43.com]
Sent: Saturday, April 09, 2016 6:19 PM
To: user@cassandra.apache.org
Subject: Re: 1, 2, 3...

Looks like this guy (Brian Hess) wrote a script to split the token range and run count(*) on each subrange:

https://github.com/brianmhess/cassandra-count

- Max

On Apr 8, 2016, at 10:56 pm, Jeff Jirsa <je...@crowdstrike.com> wrote:

SELECT COUNT(*) probably works (with internal paging) on many datasets with enough time and assuming you don’t have any partitions that will kill you.

No, it doesn’t count extra replicas / duplicates.

The old way to do this (before paging / fetch size) was to use manual paging based on tokens/clustering keys:

https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html – SELECT’s WHERE clause can use token(), which is what you’d want to use to page through the whole token space.

You could, in theory, issue thousands of queries in parallel, all for different token ranges, and then sum the results. That’s what something like spark would be doing. If you want to determine rows per node, limit the token range to that owned by the node (easier with 1 token than vnodes, with vnodes repeat num_tokens times).



Re: 1, 2, 3...

Posted by Max C <mc...@core43.com>.
Looks like this guy (Brian Hess) wrote a script to split the token range and run count(*) on each subrange:

https://github.com/brianmhess/cassandra-count

- Max

> On Apr 8, 2016, at 10:56 pm, Jeff Jirsa <je...@crowdstrike.com> wrote:
> 
> SELECT COUNT(*) probably works (with internal paging) on many datasets with enough time and assuming you don’t have any partitions that will kill you.
> 
> No, it doesn’t count extra replicas / duplicates.
> 
> The old way to do this (before paging / fetch size) was to use manual paging based on tokens/clustering keys:
> 
> https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html – SELECT’s WHERE clause can use token(), which is what you’d want to use to page through the whole token space.
> 
> You could, in theory, issue thousands of queries in parallel, all for different token ranges, and then sum the results. That’s what something like spark would be doing. If you want to determine rows per node, limit the token range to that owned by the node (easier with 1 token than vnodes, with vnodes repeat num_tokens times).


Re: 1, 2, 3...

Posted by Jeff Jirsa <je...@crowdstrike.com>.
SELECT COUNT(*) probably works (with internal paging) on many datasets with enough time and assuming you don’t have any partitions that will kill you.

No, it doesn’t count extra replicas / duplicates.

The old way to do this (before paging / fetch size) was to use manual paging based on tokens/clustering keys:

https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html – SELECT’s WHERE clause can use token(), which is what you’d want to use to page through the whole token space. 

You could, in theory, issue thousands of queries in parallel, all for different token ranges, and then sum the results. That’s what something like spark would be doing. If you want to determine rows per node, limit the token range to that owned by the node (easier with 1 token than vnodes, with vnodes repeat num_tokens times).
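
A minimal sketch of that approach with the DataStax Python driver (the
keyspace, table, partition key column "id", contact point, and number of
splits are placeholders; a real tool would also handle timeouts and retries,
and could restrict the splits to the token ranges a given node owns to get
per-node counts):

from cassandra.cluster import Cluster

# Murmur3Partitioner token range; -2**63 itself is never assigned to data.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1
NUM_SPLITS = 100

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

step = (MAX_TOKEN - MIN_TOKEN) // NUM_SPLITS
total = 0
start = MIN_TOKEN
for i in range(NUM_SPLITS):
    end = MAX_TOKEN if i == NUM_SPLITS - 1 else start + step
    # Use >= on the first split so the whole ring is covered exactly once.
    op = ">=" if i == 0 else ">"
    rs = session.execute(
        "SELECT COUNT(*) FROM my_table "
        "WHERE token(id) " + op + " %s AND token(id) <= %s",
        (start, end))
    total += rs.one()[0]
    start = end

print("Approximate total rows:", total)
cluster.shutdown()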



From:  Jack Krupansky
Reply-To:  "user@cassandra.apache.org"
Date:  Friday, April 8, 2016 at 3:48 PM
To:  "user@cassandra.apache.org"
Subject:  1, 2, 3...

I'm afraid I don't have a solid answer to this obvious question: How do I get a fairly accurate count of (CQL) rows in a Cassandra table?

Does SELECT COUNT(*) FROM <table-name> actually do it?

Does it really count (CQL) rows across all nodes and exclude replicated rows?

Is there a better/preferred technique? For example, is it more efficient to query the row count one node at a time?

And for bonus points: How do you count (CQL) rows for each node? Again, excluding replication.

-- Jack Krupansky