You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by pc...@cegetel.net on 2011/11/29 11:02:08 UTC

Cassandra DataModeling recommendations

Hi all,
In order to evaluate NoSQL solutions and to gain knowledge, I am currently working on a kind of prototype. 
Here is a brief overview of the scope:

I would like to manage user carts. Lets keep things simple:
A user can have up to n (lets say 3 for example) carts. Each cart will contain metadata and among them an expiration date and a blob containing stuff (xml in fact but I really don't care of the content).

A user can save, retrieve or delete his carts. Additionally, a dedicated batch process would remove carts who are expired.

Basically I was thinking of two ways to model the data:
1- A ColumnFamily with the userid as a key and having several SuperColumns each one describing a Cart and its content.
This has the advantage that I can get all the Carts in a single get or can do some slice queries to get only some Carts. The problem is that I cannot if I am right create a secondary index on the expired date column inside each Cart.
2- A ColumnFamily with a composite key like userid::cartId containing the expiration date column and the blob. I can in that case create an index to perform a query on the expiration timestamp. The drawback is that if I want to get all the Carts I need to create either a secondary ColumnFamily listing the carts associated to a userid or use a kind of OrderPreservingPartitionner if I want to perform a Key-Range Query.

I made some tests and I had some problems
First I was unable to perform queries in the case 2 like:
get Carts where timestamp < xxxxxxx; The (ugly, really!) workaround was to create a fake column always set to true and the query that worked was:
get Carts where dummy=true and timestamp < xxxxxxx; But I really dislike this solution and I am almost sure this is not the right way to go.

I tried something different like creating a dedicated timestamp columnfamily associating a key based on a timestamp and columns related to user and carts. In that case if I want outdated entries I could perform a range query on keys of this columnfamily. But again in that case I need an OrderPreservingPartionner and I fear that using a timestamp as a key would lead to a bad repartition scheme among the nodes. If I fit to the second proposal (with Standard Columns), columns could be directly the key like userId::cartId and there is no logic in the removal process. If I fit to first solution solution, I need to have some logic to analyze the column key or value to get userid + cartid.
Another point, if I use this column family I have to manage "updates". If for example I replace Cart2 of user1, I need to remove the corresponding entry and add a new one. This is honestly probably not the hardest part.

I have the feeling that having a userId based ColumnFamily with SuperColumns inside and a dedicated timestamp table is the best choice. In fact I think that basically my requests will be:
- Give me all the carts of a userId
- Remove all the expired carts: which is probably in fact 2 requests: Find all carts whose expiry date is before a given date. Delete the found stuff.

I am fairly new to NoSQL and especially to Cassandra so I would like to get any advice on:
1- Is Cassandra suited to this kind of storage ? I would say yes
2- What is the right way to model the data and the related constraints.

If my description is unclear or anyone does need more details, do not hesitate to ask
Thanks in advance for any help or advice

Regards

Pascal

Re: Re: Cassandra DataModeling recommendations

Posted by Boris Yen <yu...@gmail.com>.
I think most of the book for cassandra are outdated, try to get information
from http://www.datastax.com/docs/1.0/index

As for ttl, you could read
http://www.datastax.com/dev/blog/whats-new-cassandra-07-expiring-columns for
more information.

for composite type, you could read
http://www.slideshare.net/edanuff/indexing-in-cassandra

On Mon, Dec 5, 2011 at 5:36 PM, <pc...@cegetel.net> wrote:

> Hi
> Thanks for the answer, as I read the book on Cassandra, I was not aware at
> that time on Composite Key which I recently discovered.
>
> You mentioned a TTL and let the database remove the date for me. I never
> read about that. Is it possible without an external batch ?
>
> I will try to rephrase in any case my goal:
>
> Storage:
> - I would like to store for a user (identified by its id) several carts
> (BLOB).
> - Associated to these carts, I would like to attach metadata like
> expiration date and possibly others.
>
> Queries/tasks:
> - I would like to be able to retrieve all the carts of a given userId.
> - I would like to have a mean to remove expired carts.
>
> I think that this use case is not very complicated but I was wondering if
> there was any kind of modelling recommendation.
>
> Thanks for your answer.
>
> Best Regards
>
> ========================================
>
>
>
> Not sure I understand your use case, but I think you could use a composite
> column instead of composite key.
>
> For example,
>
> UserID:{
>     TimeUUID1:CartID1,
>     TimeUUID2:CartID2,
>     TimeUUID3:CartID3,
> }
>
> This way, you could do a slice query on the time if you do not need all
> the carts, and you could also get all the carts in one query.
>
> For expired carts, maybe you could attach TTL to each column that has time
> constraint, let the database remove the data for you.
>
> Hi all,
>> In order to evaluate NoSQL solutions and to gain knowledge, I am
>> currently working on a kind of prototype.
>> Here is a brief overview of the scope:
>>
>> I would like to manage user carts. Lets keep things simple:
>> A user can have up to n (lets say 3 for example) carts. Each cart will
>> contain metadata and among them an expiration date and a blob containing
>> stuff (xml in fact but I really don't care of the content).
>>
>> A user can save, retrieve or delete his carts. Additionally, a dedicated
>> batch process would remove carts who are expired.
>>
>> Basically I was thinking of two ways to model the data:
>> 1- A ColumnFamily with the userid as a key and having several
>> SuperColumns each one describing a Cart and its content.
>> This has the advantage that I can get all the Carts in a single get or
>> can do some slice queries to get only some Carts. The problem is that I
>> cannot if I am right create a secondary index on the expired date column
>> inside each Cart.
>> 2- A ColumnFamily with a composite key like userid::cartId containing the
>> expiration date column and the blob. I can in that case create an index to
>> perform a query on the expiration timestamp. The drawback is that if I want
>> to get all the Carts I need to create either a secondary ColumnFamily
>> listing the carts associated to a userid or use a kind of
>> OrderPreservingPartitionner if I want to perform a Key-Range Query.
>>
>> I made some tests and I had some problems
>> First I was unable to perform queries in the case 2 like:
>> get Carts where timestamp < xxxxxxx; The (ugly, really!) workaround was
>> to create a fake column always set to true and the query that worked was:
>> get Carts where dummy=true and timestamp < xxxxxxx; But I really dislike
>> this solution and I am almost sure this is not the right way to go.
>>
>> I tried something different like creating a dedicated timestamp
>> columnfamily associating a key based on a timestamp and columns related to
>> user and carts. In that case if I want outdated entries I could perform a
>> range query on keys of this columnfamily. But again in that case I need an
>> OrderPreservingPartionner and I fear that using a timestamp as a key would
>> lead to a bad repartition scheme among the nodes. If I fit to the second
>> proposal (with Standard Columns), columns could be directly the key like
>> userId::cartId and there is no logic in the removal process. If I fit to
>> first solution solution, I need to have some logic to analyze the column
>> key or value to get userid + cartid.
>> Another point, if I use this column family I have to manage "updates". If
>> for example I replace Cart2 of user1, I need to remove the corresponding
>> entry and add a new one. This is honestly probably not the hardest part.
>>
>> I have the feeling that having a userId based ColumnFamily with
>> SuperColumns inside and a dedicated timestamp table is the best choice. In
>> fact I think that basically my requests will be:
>> - Give me all the carts of a userId
>> - Remove all the expired carts: which is probably in fact 2 requests:
>> Find all carts whose expiry date is before a given date. Delete the found
>> stuff.
>>
>> I am fairly new to NoSQL and especially to Cassandra so I would like to
>> get any advice on:
>> 1- Is Cassandra suited to this kind of storage ? I would say yes
>> 2- What is the right way to model the data and the related constraints.
>>
>> If my description is unclear or anyone does need more details, do not
>> hesitate to ask
>> Thanks in advance for any help or advice
>>
>> Regards
>>
>> Pascal
>>
>
>
>

Re: Re: Cassandra DataModeling recommendations

Posted by samal <sa...@gmail.com>.
On Mon, Dec 5, 2011 at 3:06 PM, <pc...@cegetel.net> wrote:

> Hi
> Thanks for the answer, as I read the book on Cassandra, I was not aware at
> that time on Composite Key which I recently discovered.
>

*Composite Type's are useful for handling data-versions.
*
* *

> * *You mentioned a TTL and let the database remove the date for me. I
> never read about that. Is it possible without an external batch ?
>

*Yes, TTL if set on column, auto delete column for you.*


> I will try to rephrase in any case my goal:
>
> Storage:
> - I would like to store for a user (identified by its id) several carts
> (BLOB).
>


> - Associated to these carts, I would like to attach metadata like
> expiration date and possibly others.
>
> Queries/tasks:
> - I would like to be able to retrieve all the carts of a given userId.
>

*I would use timeline with TTL for carts as separate CF. And cart_Id to
reverse index in userId CF with TTL set on columns.  *

- I would like to have a mean to remove expired carts.
>
 *set TTL on each column. *

1.
cartCF{
 *cart1_uuidkey:{
   metadata_column:ttl
 }
 cart2_uuidkey:{
  metadata_column:ttl
}
.
.
.cartN_uuidkey:{
  metadata_column:ttl
}*
}

2.
userIdCF:{
 *user1:{
    id:user1 //*hack : to prevent unwanted behavior one column with no ttl.*
    cart1:cart1_uuidkey:ttl
    cart2:........ttl
    cart3:........ttl
}
user2:{
    id:user2
    cart1:cartX_uuidkey:ttl
    cart2:cart4........:ttl
    cart3:cartM........ttl

}*

}

/Samal

Re: Re: Re: Cassandra DataModeling recommendations

Posted by pc...@cegetel.net.
Thanks to both of you for these very useful and interesting links/proposals.

Best Regards

========================================


 Hi 
 Thanks for the answer, as I read the book on Cassandra, I was not aware  at that time on Composite Key which I recently discovered.
 
Composite Type's are useful for handling data-versions.   
 
 
 You mentioned a TTL and let the database remove the date for me. I  never read about that. Is it possible without an external batch ?
  
Yes, TTL if set on column, auto delete column for you. 
 
  I will try to rephrase in any case my goal:
 
 Storage:
 - I would like to store for a user (identified by its id) several carts (BLOB).
 
  - Associated to these carts, I would like to attach metadata like expiration date and possibly others.
 
 Queries/tasks:
 - I would like to be able to retrieve all the carts of a given userId.
  
I would use timeline with TTL for carts as separate CF. And cart_Id to reverse index in userId CF with TTL set on columns.  


  - I would like to have a mean to remove expired carts.
 set TTL on each column. 

 
1.
cartCF{
 cart1_uuidkey:{
   metadata_column:ttl
 }
 cart2_uuidkey:{
   metadata_column:ttl
}
.
.
.cartN_uuidkey:{
  metadata_column:ttl
}
}

2.
userIdCF:{
 user1:{
    id:user1 //hack : to prevent unwanted behavior one column with no ttl.
     cart1:cart1_uuidkey:ttl
    cart2:........ttl
    cart3:........ttl
}
user2:{
    id:user2  
    cart1:cartX_uuidkey:ttl
     cart2:cart4........:ttl
    cart3:cartM........ttl

}

}

/Samal


 

Re: Re: Cassandra DataModeling recommendations

Posted by pc...@cegetel.net.
Hi 
 Thanks for the answer, as I read the book on Cassandra, I was not aware  at that time on Composite Key which I recently discovered.
 
 You mentioned a TTL and let the database remove the date for me. I  never read about that. Is it possible without an external batch ?
 
 I will try to rephrase in any case my goal:
 
 Storage:
 - I would like to store for a user (identified by its id) several carts (BLOB).
 - Associated to these carts, I would like to attach metadata like expiration date and possibly others.
 
 Queries/tasks:
 - I would like to be able to retrieve all the carts of a given userId.
 - I would like to have a mean to remove expired carts.
 
 I think that this use case is not very complicated but I was wondering if there was any kind of modelling recommendation.
 
 Thanks for your answer.
 
 Best Regards

========================================


 Not sure I understand your use case, but I think you could use a composite column instead of composite key.

For example, 


UserID:{
    TimeUUID1:CartID1,
    TimeUUID2:CartID2,
     TimeUUID3:CartID3,
}


This way, you could do a slice query on the time if you do not need all the carts, and you could also get all the carts in one query.


 For expired carts, maybe you could attach TTL to each column that has time constraint, let the database remove the data for you.

 Hi all,
In order to evaluate NoSQL solutions and to gain knowledge, I am currently working on a kind of prototype. 
 Here is a brief overview of the scope:

I would like to manage user carts. Lets keep things simple:
A user can have up to n (lets say 3 for example) carts. Each cart will contain metadata and among them an expiration date and a blob containing stuff (xml in fact but I really don't care of the content).
 
A user can save, retrieve or delete his carts. Additionally, a dedicated batch process would remove carts who are expired.

Basically I was thinking of two ways to model the data:
1- A ColumnFamily with the userid as a key and having several SuperColumns each one describing a Cart and its content.
 This has the advantage that I can get all the Carts in a single get or can do some slice queries to get only some Carts. The problem is that I cannot if I am right create a secondary index on the expired date column inside each Cart.
 2- A ColumnFamily with a composite key like userid::cartId containing the expiration date column and the blob. I can in that case create an index to perform a query on the expiration timestamp. The drawback is that if I want to get all the Carts I need to create either a secondary ColumnFamily listing the carts associated to a userid or use a kind of OrderPreservingPartitionner if I want to perform a Key-Range Query.
 
I made some tests and I had some problems
First I was unable to perform queries in the case 2 like:
get Carts where timestamp < xxxxxxx; The (ugly, really!) workaround was to create a fake column always set to true and the query that worked was:
 get Carts where dummy=true and timestamp < xxxxxxx; But I really dislike this solution and I am almost sure this is not the right way to go.

I tried something different like creating a dedicated timestamp columnfamily associating a key based on a timestamp and columns related to user and carts. In that case if I want outdated entries I could perform a range query on keys of this columnfamily. But again in that case I need an OrderPreservingPartionner and I fear that using a timestamp as a key would lead to a bad repartition scheme among the nodes. If I fit to the second proposal (with Standard Columns), columns could be directly the key like userId::cartId and there is no logic in the removal process. If I fit to first solution solution, I need to have some logic to analyze the column key or value to get userid + cartid.
 Another point, if I use this column family I have to manage "updates". If for example I replace Cart2 of user1, I need to remove the corresponding entry and add a new one. This is honestly probably not the hardest part.
 
I have the feeling that having a userId based ColumnFamily with SuperColumns inside and a dedicated timestamp table is the best choice. In fact I think that basically my requests will be:
- Give me all the carts of a userId
 - Remove all the expired carts: which is probably in fact 2 requests: Find all carts whose expiry date is before a given date. Delete the found stuff.

I am fairly new to NoSQL and especially to Cassandra so I would like to get any advice on:
 1- Is Cassandra suited to this kind of storage ? I would say yes
2- What is the right way to model the data and the related constraints.

If my description is unclear or anyone does need more details, do not hesitate to ask
 Thanks in advance for any help or advice

Regards

Pascal




 

Re: Cassandra DataModeling recommendations

Posted by Boris Yen <yu...@gmail.com>.
Not sure I understand your use case, but I think you could use a composite
column instead of composite key.

For example,

UserID:{
    TimeUUID1:CartID1,
    TimeUUID2:CartID2,
    TimeUUID3:CartID3,
}

This way, you could do a slice query on the time if you do not need all the
carts, and you could also get all the carts in one query.

For expired carts, maybe you could attach TTL to each column that has time
constraint, let the database remove the data for you.

On Tue, Nov 29, 2011 at 6:02 PM, <pc...@cegetel.net> wrote:

> Hi all,
> In order to evaluate NoSQL solutions and to gain knowledge, I am currently
> working on a kind of prototype.
> Here is a brief overview of the scope:
>
> I would like to manage user carts. Lets keep things simple:
> A user can have up to n (lets say 3 for example) carts. Each cart will
> contain metadata and among them an expiration date and a blob containing
> stuff (xml in fact but I really don't care of the content).
>
> A user can save, retrieve or delete his carts. Additionally, a dedicated
> batch process would remove carts who are expired.
>
> Basically I was thinking of two ways to model the data:
> 1- A ColumnFamily with the userid as a key and having several SuperColumns
> each one describing a Cart and its content.
> This has the advantage that I can get all the Carts in a single get or can
> do some slice queries to get only some Carts. The problem is that I cannot
> if I am right create a secondary index on the expired date column inside
> each Cart.
> 2- A ColumnFamily with a composite key like userid::cartId containing the
> expiration date column and the blob. I can in that case create an index to
> perform a query on the expiration timestamp. The drawback is that if I want
> to get all the Carts I need to create either a secondary ColumnFamily
> listing the carts associated to a userid or use a kind of
> OrderPreservingPartitionner if I want to perform a Key-Range Query.
>
> I made some tests and I had some problems
> First I was unable to perform queries in the case 2 like:
> get Carts where timestamp < xxxxxxx; The (ugly, really!) workaround was to
> create a fake column always set to true and the query that worked was:
> get Carts where dummy=true and timestamp < xxxxxxx; But I really dislike
> this solution and I am almost sure this is not the right way to go.
>
> I tried something different like creating a dedicated timestamp
> columnfamily associating a key based on a timestamp and columns related to
> user and carts. In that case if I want outdated entries I could perform a
> range query on keys of this columnfamily. But again in that case I need an
> OrderPreservingPartionner and I fear that using a timestamp as a key would
> lead to a bad repartition scheme among the nodes. If I fit to the second
> proposal (with Standard Columns), columns could be directly the key like
> userId::cartId and there is no logic in the removal process. If I fit to
> first solution solution, I need to have some logic to analyze the column
> key or value to get userid + cartid.
> Another point, if I use this column family I have to manage "updates". If
> for example I replace Cart2 of user1, I need to remove the corresponding
> entry and add a new one. This is honestly probably not the hardest part.
>
> I have the feeling that having a userId based ColumnFamily with
> SuperColumns inside and a dedicated timestamp table is the best choice. In
> fact I think that basically my requests will be:
> - Give me all the carts of a userId
> - Remove all the expired carts: which is probably in fact 2 requests: Find
> all carts whose expiry date is before a given date. Delete the found stuff.
>
> I am fairly new to NoSQL and especially to Cassandra so I would like to
> get any advice on:
> 1- Is Cassandra suited to this kind of storage ? I would say yes
> 2- What is the right way to model the data and the related constraints.
>
> If my description is unclear or anyone does need more details, do not
> hesitate to ask
> Thanks in advance for any help or advice
>
> Regards
>
> Pascal
>