Posted to user@cassandra.apache.org by Check Peck <co...@gmail.com> on 2014/09/17 22:01:08 UTC

Cassandra Data Model design

I have recently started working with Cassandra. We have a Cassandra cluster
running DSE 4.0 with vnodes enabled. We have tables like these.

Below is my first table:

    CREATE TABLE customers (
      customer_id int PRIMARY KEY,
      last_modified_date timeuuid,
      customer_value text
    );

The read pattern on this table is currently a full scan, since we need to
get everything from it and load it into our application memory every x
minutes:

    select customer_id, customer_value from datakeyspace.customers;

We have a second table like this:

    CREATE TABLE client_data (
      client_name text PRIMARY KEY,
      client_id text,
      creation_date timestamp,
      is_valid int,
      last_modified_date timestamp
    );

Right now this table has 500 records, and all of them have the "is_valid"
column set to 1. The read pattern is the same as above: we need to get
everything from the table and load it into our application memory every x
minutes. The query below returns all 500 records, since every row has
is_valid set to 1:

    select client_name, client_id from datakeyspace.client_data where is_valid=1;

Since our cluster has vnodes enabled, the query patterns above are not
efficient at all, and it takes a long time to get the data from
Cassandra. We read from these tables at consistency level QUORUM.

Is there any possibility of improving our data model?

Any suggestions will be greatly appreciated.

Re: Cassandra Data Model design

Posted by James Briggs <ja...@yahoo.com>.

Cassandra partitions data across the cluster based on the partition key
(for these tables, the whole PK), so it is optimized for WHERE PK=...
queries.

You are doing table scans, the opposite of what a distributed
system is designed for.
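
To illustrate the difference, here is a sketch using the customers table
from the original post (the customer_id value 42 is made up for the
example):

    -- Restricted by partition key: served by the replicas that own
    -- this one partition.
    SELECT customer_id, customer_value
    FROM datakeyspace.customers
    WHERE customer_id = 42;

    -- No partition key restriction: the coordinator has to cover
    -- every token range in the cluster to build the result.
    SELECT customer_id, customer_value
    FROM datakeyspace.customers;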


However, some users find Solr helps with queries like yours.


To learn what C* is good at, read this:
http://planetcassandra.org/blog/getting-started-with-time-series-data-modeling/


Thanks, James Briggs. 
-- 
Cassandra/MySQL DBA. Available in San Jose area or remote.
cass_top: https://github.com/jamesbriggs/cassandra-top




Re: Cassandra Data Model design

Posted by Check Peck <co...@gmail.com>.

It takes more than 50 seconds to return 500 records from cqlsh, not from
application code, so that is why I am saying it is pretty slow.

On Wed, Sep 17, 2014 at 3:17 PM, Hao Cheng <br...@critica.io> wrote:

> How slow is slow? Regardless of the data model question, in my experience
> 500 rows of relatively light content should be lightning fast. Looking at
> my performance results on a test cluster of 3x r3.large AWS instances, we
> reach an op rate on Cassandra's stress test of at least 1000 operations per
> second, and on average 7500 operations per second, over the stress test
> data set.
>
> More broadly, it seems like you would benefit from either deltas (only
> retrieve new data) or something like paging (only retrieve currently
> relevant data), although it's really difficult to say without more
> information.

Re: Cassandra Data Model design

Posted by Hao Cheng <br...@critica.io>.

How slow is slow? Regardless of the data model question, in my experience
500 rows of relatively light content should be lightning fast. Looking at
my performance results on a test cluster of 3x r3.large AWS instances, we
reach an op rate on Cassandra's stress test of at least 1000 operations per
second, and on average 7500 operations per second, over the stress test
data set.

More broadly, it seems like you would benefit from either deltas (only
retrieve new data) or something like paging (only retrieve currently
relevant data), although it's really difficult to say without more
information.
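
One way to read the paging suggestion is to walk the table in token order
one page at a time instead of pulling it in a single scan. A sketch,
assuming the customers table from the original post; the page size of 1000
and the bind marker (?) standing for the last customer_id seen are
illustrative choices, not something from this thread:

    -- First page.
    SELECT customer_id, customer_value
    FROM datakeyspace.customers
    LIMIT 1000;

    -- Next page: bind ? to the last customer_id returned by the
    -- previous page and resume after its token.
    SELECT customer_id, customer_value
    FROM datakeyspace.customers
    WHERE token(customer_id) > token(?)
    LIMIT 1000;

The application repeats the second statement until a page comes back with
fewer rows than the limit.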


RE: Cassandra Data Model design

Posted by Rahul Gupta <rg...@dekaresearch.com>.

You need to rethink your data model for the client_data table.
Unlike an RDBMS, Cassandra relies heavily on the primary key for filtering data.

In fact, filtering on any column other than the primary key is not recommended in Cassandra.
This means that how you design your primary key is critical.

There are two options in this case:

1. Use both client_name and is_valid as the row key

2. Use client_name as the row key and is_valid as a clustering column, or in other words, make a composite key using client_name and is_valid

Cassandra data model rule: you need to know your query patterns before you create a table.
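
A minimal sketch of a composite key shaped around the query in the original
post. It is an assumption rather than either option exactly as listed:
because the read restricts only on is_valid, the sketch puts is_valid first
(as the partition key) and client_name second (as a clustering column). The
table name client_data_by_validity is hypothetical.

    CREATE TABLE client_data_by_validity (
      is_valid int,
      client_name text,
      client_id text,
      creation_date timestamp,
      last_modified_date timestamp,
      PRIMARY KEY (is_valid, client_name)
    );

    -- The read now restricts on the partition key, so it touches a
    -- single partition instead of filtering across the whole table.
    SELECT client_name, client_id
    FROM datakeyspace.client_data_by_validity
    WHERE is_valid = 1;

With this layout all rows that share an is_valid value live in one
partition, which seems reasonable for roughly 500 records but would need
revisiting if the table grew large.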

Rahul Gupta


