You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by Leleu Eric <Er...@worldline.com> on 2014/08/22 16:47:07 UTC

Secondary index or dedicated CF?

Hi,

I'm new with Cassandra and I wondering what is the best design for my case.

I have a set of buckets that contain one or thousands of contents.

Here is my Content CF :

CREATE TABLE IF NOT EXISTS contents (tenantID varchar,
key varchar,
type varchar,
bucket varchar,
owner varchar,
workspace varchar,
public_read boolean PRIMARY KEY ((key, tenantID), type, workspace));

To retrieve all contents that belong to a bucket, I have created an index on the bucket column.

CREATE INDEX IF NOT EXISTS bucket_to_contents ON contents (bucket);

The column value "bucket" is concatenated with the tenantId (bucket = bucketname+tenantID) in order to avoid filtering on the tenantID on my application.

Is it the rights way to do or should I create another column family to link each content to the bucket ?

CREATE TABLE IF NOT EXISTS bucket_to_contents (tenantID varchar,
key varchar,
type varchar,
bucket varchar,
owner varchar,
workspace varchar,
public_read boolean PRIMARY KEY ((bucket, tenantID), key));

Under the hood what is the difference of the both solutions?

According to my understanding, the result will be the same. Both will have the rowkey equals to the "bucketname" and the "tenantID".
Excepted that the secondary index can have a replication delay...

Can you help me on this point?

Regards,
Eric

________________________________

Ce message et les pi?ces jointes sont confidentiels et r?serv?s ? l'usage exclusif de ses destinataires. Il peut ?galement ?tre prot?g? par le secret professionnel. Si vous recevez ce message par erreur, merci d'en avertir imm?diatement l'exp?diteur et de le d?truire. L'int?grit? du message ne pouvant ?tre assur?e sur Internet, la responsabilit? de Worldline ne pourra ?tre recherch?e quant au contenu de ce message. Bien que les meilleurs efforts soient faits pour maintenir cette transmission exempte de tout virus, l'exp?diteur ne donne aucune garantie ? cet ?gard et sa responsabilit? ne saurait ?tre recherch?e pour tout dommage r?sultant d'un virus transmis.

This e-mail and the documents attached are confidential and intended solely for the addressee; it may also be privileged. If you receive this e-mail in error, please notify the sender immediately and destroy it. As its integrity cannot be secured on the Internet, the Worldline liability cannot be triggered for the message content. Although the sender endeavours to maintain a computer virus-free network, the sender does not warrant that this transmission is virus-free and will not be liable for any damages resulting from any virus transmitted.

RE: Secondary index or dedicated CF?

Posted by Leleu Eric <Er...@worldline.com>.

Thanks you for your feedbacks.

De : Mark Reddy [mailto:mark.l.reddy@gmail.com]
Envoyé : vendredi 22 août 2014 17:08
À : user@cassandra.apache.org
Objet : Re: Secondary index or dedicated CF?

Hi,

As a general rule of thumb I would steer clear of secondary indexes, this is also the official stand that DataStax take (see p5 of their best practices doc: http://www.datastax.com/wp-content/uploads/2014/04/WP-DataStax-Enterprise-Best-Practices.pdf).

"It is best to avoid using Cassandra's built-in secondary indexes where possible. Instead, it is recommended to denormalize data and manually maintain a dynamic table as a form of an index instead of using a secondary index. If and when secondary indexes are to be used, they should be created only on columns containing low-cardinality data (for example: fields with less than 1000 states)."

Mark

On 22 Aug 2014, at 15:58, DuyHai Doan <do...@gmail.com>> wrote:

Hello Eric

"Under the hood what is the difference of the both solutions?"

1. Cassandra secondary index: distributed index, supports better high volume of data, the index itself is distributed so there is no bottleneck. The tradeoff is that depending on the cardinality of data having the same "bucketname+tenantID" the performance may drop sharply. Please read this: http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_when_use_index_c.html?scroll=concept_ds_sgh_yzz_zj__when-no-index. There are several restrictions to secondary index

2. Manual index: easy to design, but potentially wide row and not well balance if data having the same "bucketname+tenantID" is very large. Furthermore you need to manage index consistency manually so that it is synced with source data updates.

The best thing to do is to benchmark both solutions and takes the approach giving you the best results. Be careful with benchmarks, it should be representative of the data pattern you likely have in production.

On Fri, Aug 22, 2014 at 7:47 AM, Leleu Eric <Er...@worldline.com>> wrote:
Hi,

I'm new with Cassandra and I wondering what is the best design for my case.

I have a set of buckets that contain one or thousands of contents.

Here is my Content CF :

To retrieve all contents that belong to a bucket, I have created an index on the bucket column.

CREATE INDEX IF NOT EXISTS bucket_to_contents ON contents (bucket);

The column value "bucket" is concatenated with the tenantId (bucket = bucketname+tenantID) in order to avoid filtering on the tenantID on my application.

Is it the rights way to do or should I create another column family to link each content to the bucket ?

Under the hood what is the difference of the both solutions?

According to my understanding, the result will be the same. Both will have the rowkey equals to the "bucketname" and the "tenantID".
Excepted that the secondary index can have a replication delay...

Can you help me on this point?

Regards,
Eric

________________________________

Ce message et les pièces jointes sont confidentiels et réservés à l'usage exclusif de ses destinataires. Il peut également être protégé par le secret professionnel. Si vous recevez ce message par erreur, merci d'en avertir immédiatement l'expéditeur et de le détruire. L'intégrité du message ne pouvant être assurée sur Internet, la responsabilité de Worldline ne pourra être recherchée quant au contenu de ce message. Bien que les meilleurs efforts soient faits pour maintenir cette transmission exempte de tout virus, l'expéditeur ne donne aucune garantie à cet égard et sa responsabilité ne saurait être recherchée pour tout dommage résultant d'un virus transmis.

________________________________

Re: Secondary index or dedicated CF?

Posted by Mark Reddy <ma...@gmail.com>.

Hi,

As a general rule of thumb I would steer clear of secondary indexes, this is also the official stand that DataStax take (see p5 of their best practices doc: http://www.datastax.com/wp-content/uploads/2014/04/WP-DataStax-Enterprise-Best-Practices.pdf).

“It is best to avoid using Cassandra's built-in secondary indexes where possible. Instead, it is recommended to denormalize data and manually maintain a dynamic table as a form of an index instead of using a secondary index. If and when secondary indexes are to be used, they should be created only on columns containing low-cardinality data (for example: fields with less than 1000 states).“

Mark

On 22 Aug 2014, at 15:58, DuyHai Doan <do...@gmail.com> wrote:

> Hello Eric
> 
> "Under the hood what is the difference of the both solutions?" 
> 
>  1. Cassandra secondary index: distributed index, supports better high volume of data, the index itself is distributed so there is no bottleneck. The tradeoff is that depending on the cardinality of data having the same "bucketname+tenantID" the performance may drop sharply. Please read this: http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_when_use_index_c.html?scroll=concept_ds_sgh_yzz_zj__when-no-index. There are several restrictions to secondary index
> 
> 2. Manual index: easy to design, but potentially wide row and not well balance if  data having the same "bucketname+tenantID" is very large. Furthermore you need to manage index consistency manually so that it is synced with source data updates. 
> 
>  The best thing to do is to benchmark both solutions and takes the approach giving you the best results. Be careful with benchmarks, it should be representative of the data pattern you likely have in production.
> 
> 
> On Fri, Aug 22, 2014 at 7:47 AM, Leleu Eric <Er...@worldline.com> wrote:
> Hi,
> 
>  
> 
>  
> 
> I’m new with Cassandra and I wondering what is the best design for my case.
> 
>  
> 
> I have a set of buckets that contain one or thousands of contents.
> 
>  
> 
> Here is my Content CF :
> 
>  
> 
>     CREATE TABLE IF NOT EXISTS contents (tenantID varchar,
> 
>      key varchar,
> 
>     type varchar,
> 
>     bucket varchar,
> 
>     owner varchar,
> 
>     workspace varchar,
> 
>      public_read boolean  PRIMARY KEY ((key, tenantID), type, workspace));
> 
>  
> 
>  
> 
> To retrieve all contents that belong to a bucket, I have created an index on the bucket column.
> 
>  
> 
>     CREATE INDEX IF NOT EXISTS bucket_to_contents ON contents (bucket);
> 
>  
> 
> The column value “bucket” is concatenated with the tenantId (bucket = bucketname+tenantID) in order to avoid filtering on the tenantID on my application.
> 
>  
> 
> Is it the rights way to do or should I create another column family to link each content to the bucket ?
> 
>  
> 
>     CREATE TABLE IF NOT EXISTS bucket_to_contents (tenantID varchar,
> 
>      key varchar,
> 
>     type varchar,
> 
>     bucket varchar,
> 
>     owner varchar,
> 
>     workspace varchar,
> 
>      public_read boolean  PRIMARY KEY ((bucket, tenantID), key));
> 
>  
> 
> Under the hood what is the difference of the both solutions?
> 
>  
> 
> According to my understanding, the result will be the same. Both will have the rowkey equals to the “bucketname”  and the “tenantID”.
> 
> Excepted that the secondary index can have a replication delay…
> 
>  
> 
> Can you help me on this point?
> 
>  
> 
> Regards,
> 
> Eric
> 
>  
> 
> 
> 
> Ce message et les pièces jointes sont confidentiels et réservés à l'usage exclusif de ses destinataires. Il peut également être protégé par le secret professionnel. Si vous recevez ce message par erreur, merci d'en avertir immédiatement l'expéditeur et de le détruire. L'intégrité du message ne pouvant être assurée sur Internet, la responsabilité de Worldline ne pourra être recherchée quant au contenu de ce message. Bien que les meilleurs efforts soient faits pour maintenir cette transmission exempte de tout virus, l'expéditeur ne donne aucune garantie à cet égard et sa responsabilité ne saurait être recherchée pour tout dommage résultant d'un virus transmis.
> 
> This e-mail and the documents attached are confidential and intended solely for the addressee; it may also be privileged. If you receive this e-mail in error, please notify the sender immediately and destroy it. As its integrity cannot be secured on the Internet, the Worldline liability cannot be triggered for the message content. Although the sender endeavours to maintain a computer virus-free network, the sender does not warrant that this transmission is virus-free and will not be liable for any damages resulting from any virus transmitted.
>

Re: Secondary index or dedicated CF?

Posted by DuyHai Doan <do...@gmail.com>.

Hello Eric

"Under the hood what is the difference of the both solutions?"

 1. Cassandra secondary index: distributed index, supports better high
volume of data, the index itself is distributed so there is no bottleneck.
The tradeoff is that depending on the cardinality of data having the same "
bucketname+tenantID" the performance may drop sharply. Please read this:
http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_when_use_index_c.html?scroll=concept_ds_sgh_yzz_zj__when-no-index.
There are several restrictions to secondary index

2. Manual index: easy to design, but potentially wide row and not well
balance if  data having the same "bucketname+tenantID" is very large.
Furthermore you need to manage index consistency manually so that it is
synced with source data updates.

 The best thing to do is to benchmark both solutions and takes the approach
giving you the best results. Be careful with benchmarks, it should be
representative of the data pattern you likely have in production.


On Fri, Aug 22, 2014 at 7:47 AM, Leleu Eric <Er...@worldline.com>
wrote:

>  Hi,
>
>
>
>
>
> I’m new with Cassandra and I wondering what is the best design for my case.
>
>
>
> I have a set of buckets that contain one or thousands of contents.
>
>
>
> Here is my Content CF :
>
>
>
>     CREATE TABLE IF NOT EXISTS contents (tenantID varchar,
>
>      key varchar,
>
>     type varchar,
>
>     bucket varchar,
>
>     owner varchar,
>
>     workspace varchar,
>
>      public_read boolean  PRIMARY KEY ((key, tenantID), type, workspace));
>
>
>
>
>
> To retrieve all contents that belong to a bucket, I have created an index
> on the bucket column.
>
>
>
>     CREATE INDEX IF NOT EXISTS bucket_to_contents ON contents (bucket);
>
>
>
> The column value “bucket” is concatenated with the tenantId (bucket =
> bucketname+tenantID) in order to avoid filtering on the tenantID on my
> application.
>
>
>
> Is it the rights way to do or should I create another column family to
> link each content to the bucket ?
>
>
>
>     CREATE TABLE IF NOT EXISTS bucket_to_contents (tenantID varchar,
>
>      key varchar,
>
>     type varchar,
>
>     bucket varchar,
>
>     owner varchar,
>
>     workspace varchar,
>
>      public_read boolean  PRIMARY KEY ((bucket, tenantID), key));
>
>
>
> Under the hood what is the difference of the both solutions?
>
>
>
> According to my understanding, the result will be the same. Both will have
> the rowkey equals to the “bucketname”  and the “tenantID”.
>
> Excepted that the secondary index can have a replication delay…
>
>
>
> Can you help me on this point?
>
>
>
> Regards,
>
> Eric
>
>
>
> ------------------------------
>
> Ce message et les pièces jointes sont confidentiels et réservés à l'usage
> exclusif de ses destinataires. Il peut également être protégé par le secret
> professionnel. Si vous recevez ce message par erreur, merci d'en avertir
> immédiatement l'expéditeur et de le détruire. L'intégrité du message ne
> pouvant être assurée sur Internet, la responsabilité de Worldline ne pourra
> être recherchée quant au contenu de ce message. Bien que les meilleurs
> efforts soient faits pour maintenir cette transmission exempte de tout
> virus, l'expéditeur ne donne aucune garantie à cet égard et sa
> responsabilité ne saurait être recherchée pour tout dommage résultant d'un
> virus transmis.
>
> This e-mail and the documents attached are confidential and intended
> solely for the addressee; it may also be privileged. If you receive this
> e-mail in error, please notify the sender immediately and destroy it. As
> its integrity cannot be secured on the Internet, the Worldline liability
> cannot be triggered for the message content. Although the sender endeavours
> to maintain a computer virus-free network, the sender does not warrant that
> this transmission is virus-free and will not be liable for any damages
> resulting from any virus transmitted.
>