Posted to user@cassandra.apache.org by Bobbie Haynes <ha...@gmail.com> on 2019/03/06 00:19:26 UTC

data modelling

Hi,
   Could you help me model this use case?

   I have the table below. I update the tagids column (a set<bigint>) based on
the PK. I created a secondary index on tagids so I can query like this:

SELECT * FROM keyspace.customer_sensor_tagids WHERE tagids CONTAINS
11358097;

This query does a range scan because of the secondary index, and that is
causing performance issues.

If I create an MV on tagids, will I be able to query like the above? Please
suggest a data model for this scenario. I appreciate your help.
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
Example of tagids for each row:
   4608831, 608886, 608890, 609164, 615024, 679579, 814791, 830404, 71756,
8538307, 9936868, 10883336, 10954034, 10958062, 10976553, 10976554,
10980255, 11009971, 11043805, 11075379, 11078819, 11167844, 11358097,
11479340, 11481769, 11481770, 11481771, 11481772, 11693597, 11709012,
12193230, 12421500, 12421516, 12421781, 12422011, 12422368, 12422501,
12422512, 12422553, 12422555, 12423381, 12423382


 -----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------

   CREATE TABLE keyspace.customer_sensor_tagids (
    customerid bigint,
    sensorid bigint,
    XXX frozen<activity>,
    XXX frozen<asset>,
    XXX text,
    XXX text,
    XXX frozen<hardware>,
    XXX bigint,
    XXX bigint,
    XXX list<frozen<interface>>,
    XXX frozen<inventory>,
    XXX boolean,
    XXX bigint,
    XXX list<frozen<openport>>,
    XXX frozen<operatingsystem>,
    XXX bigint,
    XXX bigint,
    XXX list<frozen<processor>>,
    XXX list<frozen<service>>,
    XXX set<frozen<software>>,
    XXX set<bigint>,
    XXX set<bigint>,
    tagids set<bigint>,
    XXX bigint,
    XXX list<frozen<volume>>,
    PRIMARY KEY ((customerid, sensorid))
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class':
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class':
'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';
CREATE INDEX XXX ON keyspace.customer_sensor_tagids (values(tagids));
CREATE INDEX XXX ON keyspace.customer_sensor_tagids (values(XXX));
CREATE INDEX XXX ON keyspace.customer_sensor_tagids (XXX);
CREATE INDEX XXX ON keyspace.customer_sensor_tagids (XXX);
CREATE INDEX XXX ON keyspace.customer_sensor_tagids (XXX);
CREATE INDEX XXX ON keyspace.customer_sensor_tagids (XXX);
CREATE INDEX XXX ON keyspace.customer_sensor_tagids (values(XXX));
CREATE INDEX XXX ON keyspace.customer_sensor_tagids (XXX);

Re: data modelling

Posted by Stefan Miklosovic <st...@instaclustr.com>.
Hi Bobbie,

as Kenneth already mentioned, you should model your schema based on the
queries you expect to run, and read the related literature. From what I
see, your table is named "customer_sensor_tagids", so it's quite possible
you would have the tag id as part of the primary key? Something like:

select * from keyspace.customer_sensor_tagids where tag_id = 11358097;

This implies that you would have as many records per customerid and
sensorid as there are tag_ids. If you want to query such a table and you
know customerid and sensorid in advance, you could query like

select * from keyspace.customer_sensor_tagids where customerid = X and
sensorid = Y and tag_id = 11358097

so your primary key would look like (customerid, sensorid, tagid) or
((customerid, sensorid), tagid).
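
For illustration, a minimal sketch of the second variant (the table name and
the exact column set are only assumptions, not taken from your schema):

CREATE TABLE keyspace.customer_sensor_by_tag (
    customerid bigint,
    sensorid bigint,
    tagid bigint,
    PRIMARY KEY ((customerid, sensorid), tagid)
);

-- one row per (customer, sensor, tag); with the partition known this is a
-- cheap single-partition read:
SELECT * FROM keyspace.customer_sensor_by_tag
WHERE customerid = ? AND sensorid = ? AND tagid = 11358097;

Adding or removing a tag then becomes an INSERT or DELETE of a single row
instead of mutating a set column.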

If you know neither customerid nor sensorid when querying, you would have
to make tag_id the partition key and customerid and sensorid clustering
columns, optionally ordered; that's up to you. Now you may object that this
means data duplication, since you end up with "as many tables as queries".
That may be true, but in general it is not a problem. That is the cost you
"pay" for having queries that are super fast and tailored to your use case.
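
A rough sketch of that variant (the table name again is just an assumption):

CREATE TABLE keyspace.customer_sensors_by_tagid (
    tagid bigint,
    customerid bigint,
    sensorid bigint,
    PRIMARY KEY (tagid, customerid, sensorid)
);

-- answers your original question without any secondary index:
SELECT customerid, sensorid
FROM keyspace.customer_sensors_by_tagid
WHERE tagid = 11358097;

One thing to watch: a tag attached to a very large number of sensors becomes
a single wide partition in this layout.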

I suggest reading more about data modelling in general.

On Wed, 6 Mar 2019 at 11:19, Bobbie Haynes <ha...@gmail.com> wrote:

> Hi
>    Could you help  modelling this usecase
>
>    I have below table ..I will update tagid's columns set(bigit) based on
> PK. I have created the secondary index column on tagid to query like below..
>
> Select * from keyspace.customer_sensor_tagids where tagids CONTAINS
> 11358097;
>
> this query is doing the range scan because of the secondary index.. and
> causing performance issues
>
> If i create a MV on Tagid's can i be able to query like above.. please
> suggest a Datamodel for this scenario.Apprecite your help on this.
>
> -----------------------------------------------------------------------------------------------
>
> -----------------------------------------------------------------------------------------------
> example of Tagids for each row:-
>    4608831, 608886, 608890, 609164, 615024, 679579, 814791, 830404, 71756,
> 8538307, 9936868, 10883336, 10954034, 10958062, 10976553, 10976554,
> 10980255, 11009971, 11043805, 11075379, 11078819, 11167844, 11358097,
> 11479340, 11481769, 11481770, 11481771, 11481772, 11693597, 11709012,
> 12193230, 12421500, 12421516, 12421781, 12422011, 12422368, 12422501,
> 12422512, 12422553, 12422555, 12423381, 12423382
>
>
>  -----------------------------------------------------------------------------------------------
>
> -----------------------------------------------------------------------------------------------
>
>    CREATE TABLE keyspace.customer_sensor_tagids (
>     customerid bigint,
>     sensorid bigint,
>     XXX frozen<activity>,
>     XXX frozen<asset>,
>     XXX text,
>     XXX text,
>     XXX frozen<hardware>,
>     XXX bigint,
>     XXX bigint,
>     XXX list<frozen<interface>>,
>     XXX frozen<inventory>,
>     XXX boolean,
>     XXX bigint,
>     XXX list<frozen<openport>>,
>     XXX frozen<operatingsystem>,
>     XXX bigint,
>     XXX bigint,
>     XXX list<frozen<processor>>,
>     XXX list<frozen<service>>,
>     XXX set<frozen<software>>,
>     XXX set<bigint>,
>     XXX set<bigint>,
>     tagids set<bigint>,
>     XXX bigint,
>     XXX list<frozen<volume>>,
>     PRIMARY KEY ((customerid, sensorid))
> ) WITH bloom_filter_fp_chance = 0.01
>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>     AND comment = ''
>     AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
> 'max_threshold': '32', 'min_threshold': '4'}
>     AND compression = {'chunk_length_in_kb': '64', 'class':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND crc_check_chance = 1.0
>     AND dclocal_read_repair_chance = 0.1
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99PERCENTILE';
> CREATE INDEX XXX ON keyspace.customer_sensor_tagids (values(tagids));
> CREATE INDEX XXX ON keyspace.customer_sensor_tagids (values(XXX));
> CREATE INDEX XXX ON keyspace.customer_sensor_tagids (XXX);
> CREATE INDEX XXX ON keyspace.customer_sensor_tagids (XXX);
> CREATE INDEX XXX ON keyspace.customer_sensor_tagids (XXX);
> CREATE INDEX XXX ON keyspace.customer_sensor_tagids (XXX);
> CREATE INDEX XXX ON keyspace.customer_sensor_tagids (values(XXX));
> CREATE INDEX XXX ON keyspace.customer_sensor_tagids (XXX);
>


--
Stefan Miklosovic
Senior Software Engineer, Instaclustr <https://www.instaclustr.com>
M: +61459911436


RE: data modelling

Posted by Kenneth Brotman <ke...@yahoo.com.INVALID>.
You definitely don’t need a secondary index. An MV might be the answer.

 

How many tagids does a sensor have?

Do you have to use a collection for tagids?

How many sensors would you expect to have a particular tagid?

Would you know the customerid and sensorid and be able to specify that in the query?

 

If you could have tagid not be a collection, and make it part of the primary key, that would help a lot.
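
For example, with a plain lookup table keyed by tagid (table and column names
below are assumptions, not from the posted schema), the application can keep
both tables in step on writes, and the tag query becomes a partition read:

CREATE TABLE keyspace.sensors_by_tagid (
    tagid bigint,
    customerid bigint,
    sensorid bigint,
    PRIMARY KEY (tagid, customerid, sensorid)
);

-- when a tag is attached to a sensor, write both tables together
-- (the ? markers are bind variables in a prepared statement):
BEGIN BATCH
    UPDATE keyspace.customer_sensor_tagids
        SET tagids = tagids + {11358097}
        WHERE customerid = ? AND sensorid = ?;
    INSERT INTO keyspace.sensors_by_tagid (tagid, customerid, sensorid)
        VALUES (11358097, ?, ?);
APPLY BATCH;

-- "which sensors have tag 11358097?" is then a single-partition read:
SELECT customerid, sensorid FROM keyspace.sensors_by_tagid WHERE tagid = 11358097;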

  

 

From: Kenneth Brotman [mailto:kenbrotman@yahoo.com.INVALID]
Sent: Tuesday, March 05, 2019 4:33 PM
To: user@cassandra.apache.org
Subject: RE: data modelling

[earlier reply and original message quoted in full; snipped]

RE: data modelling

Posted by Kenneth Brotman <ke...@yahoo.com.INVALID>.
Hi Bobbie,

 

You’re not giving enough information to model the data.  With Cassandra it’s based on the queries you are going to need.  This link to Jeffrey Carpenter’s book, Cassandra: The Definitive Guide, Chapter 5, which is on how to do data modeling for Cassandra, should be of help to you: https://books.google.com/books?id=uW-PDAAAQBAJ&pg=PA79&lpg=PA79&dq=jeff+carpenter+chapter+5&source=bl&ots=58cM-BII2M&sig=ACfU3U0-188Fw-jcj1tbMItdPlNH8Lk9yQ&hl=en&sa=X&ved=2ahUKEwinrY3OoezgAhWoHDQIHRfmA7IQ6AEwA3oECAcQAQ#v=onepage&q=jeff%20carpenter%20chapter%205&f=false

 

 

 

From: Bobbie Haynes [mailto:haynes30349@gmail.com]
Sent: Tuesday, March 05, 2019 4:19 PM
To: user@cassandra.apache.org
Subject: data modelling

[original message quoted in full; snipped]