Posted to user@cassandra.apache.org by Karthik K <ma...@gmail.com> on 2022/09/28 11:44:00 UTC

Questions on the count and multiple index behaviour in cassandra

Hi,

We have two questions about Cassandra 3.11:

1) We need to get the row count from a Cassandra table.
We have a 3-node cluster running Apache Cassandra 3.11.

We loaded a table with 9 lakh (900,000) records. The table has around 91
columns, most of them with text as the datatype.
All 9 lakh records belong to a single partition key.

When we ran a select count(*) query restricted to that partition key, the
query timed out.

However, we were able to get the count through multiple calls, fetching only
1 lakh (100,000) records per call. The only disadvantage is the total time
taken, which is around 1 minute 3 seconds.

Is there a faster way to get the row count in Cassandra, or do we need to
change our data modelling approach? Suggestions are welcome. (A sketch of the
chunked approach is included below, after question 2.)


2) How should we model data in Cassandra to support queries with multiple
filters? We may also need the row count for such multi-filter queries.
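
For reference, a minimal sketch of the chunked counting described in question 1,
expressed in CQL. It assumes the table has a clustering column that each call can
be bounded on; the table and column names (my_table, my_partition_key, record_id)
are placeholders, not the actual schema:

SELECT count(*) FROM my_table
WHERE my_partition_key = '1' AND record_id >= 0 AND record_id < 100000;

SELECT count(*) FROM my_table
WHERE my_partition_key = '1' AND record_id >= 100000 AND record_id < 200000;

-- ...repeat for each range and sum the per-range counts client-side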

Thanks & Regards,
Karthikeyan

RE: Questions on the count and multiple index behaviour in cassandra

Posted by "Durity, Sean R via user" <us...@cassandra.apache.org>.
Aggregate queries (like count(*) ) are fine *within* a reasonably sized partition (under 100 MB). However, Cassandra is not the right tool if you want to do aggregate queries *across* partitions (unless you break up the work with something like Spark). Choosing the right partition key and values IS the goal of Cassandra data modeling. (Clustering keys are used for ordering data within a partition.)

Good:
Select count(*) from my_table where my_partition_key = '1'; -- and the partition is 100 MB or less

Not good:
Select count(*) from my_table;

Are the counts the actual workload or just a measure of the completion of the load? You want to model the data to satisfy the queries for the workload. Queries should be very simple; getting the data model right is the hard work. Make sure that Cassandra fits the use case you have.
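
To make the partition-sizing point concrete, here is a rough sketch (not from this thread; table and column names are hypothetical) of splitting an oversized partition by adding a bucket component to the partition key, so that each count only scans one small partition:

CREATE TABLE my_table_by_bucket (
    my_partition_key text,
    bucket int,        -- e.g. a hash of the row id modulo N, chosen so each partition stays well under 100 MB
    record_id bigint,
    -- ...remaining data columns...
    PRIMARY KEY ((my_partition_key, bucket), record_id)
);

SELECT count(*) FROM my_table_by_bucket WHERE my_partition_key = '1' AND bucket = 0;
-- repeat for each bucket value and sum the results client-side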


Sean R. Durity
DB Solutions
Staff Systems Engineer – Cassandra


Re: Questions on the count and multiple index behaviour in cassandra

Posted by Karthik K <ma...@gmail.com>.
Hi Stéphane Alleaume,
Thanks for your quick response. I have attached the table stats from running
the nodetool cfstats command to get the size.
If I am correct, the partition size must be about 464 MB. However, when I
exported the data as CSV, the size was 1,510 MB.

1) If we segment this 464 MB of data into more partitions, say with each
partition under 100 MB, will the count(*) query work effectively?
2) Roughly how long, in seconds, would a select count(*) over 1 million
records take?
3) Although Elasticsearch is a future option, we want to dig deeper with
Cassandra first. Is there any workaround using data modelling?
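
On question 3, one data-modelling workaround sometimes used is to maintain the
count at write time instead of scanning rows. A minimal sketch with hypothetical
names (note that counter updates are not idempotent, so retried writes can skew
the value):

CREATE TABLE my_table_counts (
    my_partition_key text PRIMARY KEY,
    row_count counter
);

-- incremented alongside every insert into the main table:
UPDATE my_table_counts SET row_count = row_count + 1 WHERE my_partition_key = '1';

-- reading the count is then a single-row lookup:
SELECT row_count FROM my_table_counts WHERE my_partition_key = '1';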

Thanks & Regards,
Karthikeyan K


Re: Questions on the count and multiple index behaviour in cassandra

Posted by Stéphane Alleaume <cr...@gmail.com>.
Hi

1) How big is your partition in MB? It should be less than 100 MB (ideally
well under that).

2) Could you put an Elasticsearch or Solr search engine in front?

Kind regards
Stephane






Re: Questions on the count and multiple index behaviour in cassandra

Posted by Bowen Song via user <us...@cassandra.apache.org>.
It sounds like you are misusing/abusing Cassandra.

I've noticed the following Cassandra anti-patterns in your post:

 1. Large or uneven partitions
    Putting all rows of a table into a single partition is definitely an
    anti-pattern unless you only have a very small number of rows.
 2. "SELECT COUNT(*) FROM ..." without providing a partition key
    In your case, since all rows are in a single partition, it's
    equivalent to not providing a partition key.
 3. Wide table (too many columns)
    91 columns sounds excessive and may lead to reduced performance and
    heightened JVM GC pressure.

Cassandra is not a SQL database. You should design your table schema 
around the queries, not design your queries around the table schema. You 
may also need to store multiple copies of the same data with different 
keys to satisfy different queries.
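
As a hedged illustration of that query-first approach (the filter columns here
are hypothetical, not from this thread), a multi-filter query and its count
could be served by a dedicated copy of the data keyed on those filters:

CREATE TABLE my_table_by_status_region (
    status text,
    region text,
    record_id bigint,
    -- ...only the columns this query needs...
    PRIMARY KEY ((status, region), record_id)
);

SELECT count(*) FROM my_table_by_status_region
WHERE status = 'OPEN' AND region = 'EU';
-- the same caveat applies: each (status, region) partition should stay well under 100 MB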
