You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by srinivas s <ss...@gmail.com> on 2015/09/04 23:15:54 UTC

Convert joins in RDBMS to Cassandra

   1. Hi All,



   1. I am trying to model  RDBMS joins into cassandra. As I am new to
   cassandra, I need your help/suggestion on this.  Below is the information
   regarding the query:
   2.
   3. I have a query in RDBMS as follows:
   4.
   5. select t3.name from  Table1 t1, Table2 t2, Table3 t3, Table4 t4 where
   6. t2.cust_id = 3 and t4.sid = t1.sid and t1.colid = t2.colid  and
   t4.cid = t3.cid
   7.
   8.
   9. Now, trying to make a shimilar query in cassandra:
   10.
   11. As per my learning experience in Cassandra, I got the below 2
   solutions: (as cassandra does not support joins)
   12.
   13. ****Solution 1:*****
   14.
   15. 1) Fetch all the records with t2.cust_id = 3
   16. 2) Now again run another query that will do the condition t3.sid =
   t1.sid on the results returned from point 1.
   17. 3) continue the same for all the conditions.
   18.
   19. Drawbacks with this approach:
   20.
   21. For each join, I have to do a network call to fetch the details.
   Also, it will take more time..as I am running multiple conditions
   22.
   23.
   24. ****Solution 2: *****
   25.
   26. 1) Create a map table for every possible join.
   27.
   28. Drawbacks with this aproach:
   29.
   30. I think, this is not a right approach. So join to table (map table)
   mapping idea is not right.
   31.
   32. pastebin link for the same: http://pastebin.com/FRAyihPT
   33. Please suggest me on this.

Re: Convert joins in RDBMS to Cassandra

Posted by Ryan Svihla <rs...@foundev.pro>.

The normal approach is denormalization to a materialized view (the
traditional thinking of it not the new 3.0 feature coming out), which is
also true of using an RBMS at scale (joins across all data sets get
expensive once you start having to shard across different servers).

The simplistic idea is to build your tables to map your queries (or API
calls if you want to be more advanced in your thinking) 1 for 1 with the
following sets of constraints:

   1. All data needed to satisfy that query exists in this table even if it
   also exists somewhere else.
   2. The partition key (first part of the primary key) matching the where
   clause you'd like to use on this table for that query.
   3. The clustering column defines order inside a given partition key.
   4. Partitions should not be "too fat". This is a more advanced topic.

Imagine the case of a user profile, I may choose to store all changes of
that user profile in a "profile history" table. It would probably look like

CREATE TABLE user_profile_history ( user_id uuid, ts timestamp, change
text, PRIMARY KEY( user_id, ts))

In this case I'll get a partition key of user_id, a timestamp of when the
change occurred as the clustering column giving me an implied order of
ascending, and the change in a text field.

Practical considerations for updating and keeping these tables in sync are
myriad. Starting out it's probably easiest to have one table be the "source
of truth" and all other views derived off that single source of truth,
either at write time, or with batches running throughout the day, or if
latency is a concern streaming (you can do streaming and batching for a
particularly potent combination).  This pattern I see frequently and have
pushed it's use to good effect many times.

The following blog provides some great introductory ideas on data modeling
http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling

On Fri, Sep 4, 2015 at 4:15 PM, srinivas s <ss...@gmail.com> wrote:

>
>    1. Hi All,
>
>
>
>    1. I am trying to model  RDBMS joins into cassandra. As I am new to
>    cassandra, I need your help/suggestion on this.  Below is the information
>    regarding the query:
>    2.
>    3. I have a query in RDBMS as follows:
>    4.
>    5. select t3.name from  Table1 t1, Table2 t2, Table3 t3, Table4 t4
>    where
>    6. t2.cust_id = 3 and t4.sid = t1.sid and t1.colid = t2.colid  and
>    t4.cid = t3.cid
>    7.
>    8.
>    9. Now, trying to make a shimilar query in cassandra:
>    10.
>    11. As per my learning experience in Cassandra, I got the below 2
>    solutions: (as cassandra does not support joins)
>    12.
>    13. ****Solution 1:*****
>    14.
>    15. 1) Fetch all the records with t2.cust_id = 3
>    16. 2) Now again run another query that will do the condition t3.sid =
>    t1.sid on the results returned from point 1.
>    17. 3) continue the same for all the conditions.
>    18.
>    19. Drawbacks with this approach:
>    20.
>    21. For each join, I have to do a network call to fetch the details.
>    Also, it will take more time..as I am running multiple conditions
>    22.
>    23.
>    24. ****Solution 2: *****
>    25.
>    26. 1) Create a map table for every possible join.
>    27.
>    28. Drawbacks with this aproach:
>    29.
>    30. I think, this is not a right approach. So join to table (map
>    table) mapping idea is not right.
>    31.
>    32. pastebin link for the same: http://pastebin.com/FRAyihPT
>    33. Please suggest me on this.
>
>
>
>

-- 

Thanks,
Ryan Svihla