Posted to user@cassandra.apache.org by Matthew Johnson <ma...@algomi.com> on 2015/06/03 16:48:47 UTC

How to store denormalized data

Hi all,



I am trying to store some data (user actions in our application) for future
analysis (probably using Spark). I understand best practice is to store it
in denormalized form, and this will definitely make some of our future
queries much easier. But I have a problem with denormalizing the data.



For example, let’s say one of my queries is “the number of reports
generated by user type”. In the part of the application that the user
connects to in order to generate reports, we only have access to the user
id. In a traditional RDBMS this is fine, because at query time you join the
user id to the users table and get all the user data associated with that
user. But how do I populate extra fields like user type on the fly?
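
To make that concrete, this is roughly the shape of table I have in mind
(a sketch in Python with the DataStax driver; the keyspace, table and
column names are all invented for illustration). The idea is that the
user’s type gets copied onto every action row, so the later analysis never
has to join back to a users table:

    from cassandra.cluster import Cluster

    # Illustrative names only - not our real schema.
    session = Cluster(["127.0.0.1"]).connect("analytics")

    session.execute("""
        CREATE TABLE IF NOT EXISTS user_actions (
            user_id   text,
            action_ts timestamp,
            action    text,
            user_type text,          -- denormalized copy of the user's type
            PRIMARY KEY ((user_id), action_ts)
        )
    """)

“Reports generated by user type” then just needs to group on user_type,
which is where Spark would come in.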



My ideas so far:

1. I try to maintain an in-memory cache of data such as “user”, do a lookup
to this cache for every user action, and store the user data with it.
Pros: fast. Cons: not scalable; will run out of memory as data sets grow.

2. For each user action, I make a call to the RDBMS to look up the data for
the user in question, then store the user action plus the user data as a
single row (a rough sketch follows this list). Pros: easy to scale.
Cons: slow.

3. I write only the user id and the action straight away, and have a
separate batch process that periodically goes through my table looking for
rows without user data, looks up the user data in the RDBMS, and populates
it.
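
To show what I mean by option 2, here is a rough sketch (Python again; the
RDBMS side is assumed to be Postgres via psycopg2 purely for illustration,
and all names are made up):

    from cassandra.cluster import Cluster
    import psycopg2  # assuming a Postgres RDBMS, purely for illustration

    cassandra = Cluster(["127.0.0.1"]).connect("analytics")
    rdbms = psycopg2.connect("dbname=app user=app")

    insert_action = cassandra.prepare("""
        INSERT INTO user_actions (user_id, action_ts, action, user_type)
        VALUES (?, ?, ?, ?)
    """)

    def record_action(user_id, action_ts, action):
        # Option 2: look the user up in the RDBMS at write time...
        with rdbms.cursor() as cur:
            cur.execute("SELECT user_type FROM users WHERE id = %s", (user_id,))
            (user_type,) = cur.fetchone()
        # ...then store the action together with the denormalized user data.
        cassandra.execute(insert_action, (user_id, action_ts, action, user_type))

The extra round trip to the RDBMS on every single action is exactly the
“slow” part that worries me.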





None of these solutions seem ideal to me. Does Cassandra have something
like ‘triggers’, where I can set up a table to automatically populate some
rows based on a lookup from another table? Or perhaps Spark or some other
library has built-in functionality that solves exactly this problem?



Any suggestions much appreciated.



Thanks,

Matthew

RE: How to store denormalized data

Posted by Matthew Johnson <ma...@algomi.com>.
Thanks Shahab,



That was my initial thought. The downside I can see with that approach is
that if/when we decide to use this data to serve suggestions in real time
back to the users (a sort of “if you clicked on this you might also like
to click on this”), the algorithms for that would need to be driven off the
extra columns. Having said that, we are nowhere near that stage with our
application, so I could opt for the batch approach for now and cross that
bridge when we come to it! Just wondering if anyone else has already solved
this in a really elegant way :)



Cheers,

Matthew




Re: How to store denormalized data

Posted by Shahab Yunus <sh...@gmail.com>.
Suggestion or rather food for thought....

Do you expect to read/analyze the written data right away, or will it be a
batch process kicked off later in time? What I am trying to say is that if
the 'read/analysis' part is a) a batch process and b) kicked off later in
time, then #3 is a fine solution. What harm is there in it? You could also
change it slightly (if applicable) and, rather than populating the data in
a separate batch process, make it part of your analysis job, as a kind of
pre-process/prep step.
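
To sketch what I mean (PySpark, with the spark-cassandra-connector and a
JDBC driver assumed to be available, and made-up table names):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("enrich-actions").getOrCreate()

    # Raw actions, written with only the user id and the action.
    actions = (spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="analytics", table="user_actions")
        .load())

    # User reference data pulled from the RDBMS at analysis time.
    users = (spark.read
        .format("jdbc")
        .option("url", "jdbc:postgresql://dbhost/app")
        .option("dbtable", "users")
        .option("driver", "org.postgresql.Driver")
        .load())

    # The enrichment becomes a prep step of the analysis job itself.
    reports_by_type = (actions
        .join(users, actions.user_id == users.id)
        .filter(actions.action == "generate_report")
        .groupBy(users.user_type)
        .count())

    reports_by_type.show()

No separate backfill is needed; the join happens once per analysis run.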

Regards,
Shahab


Re: How to store denormalized data

Posted by Jack Krupansky <ja...@gmail.com>.
Your requirement is still not quite clear - are you counting users, or
reports, or reports of a given type for each user, or...?

You can have a separate table with the user type as the partition key and
the user id as a clustering column - provided that the number of users is
only in the thousands or, at most, the low millions. Then write a row
whenever a report is generated for a given type and user ID. Do you need to
count multiple instances of the same report for a given user? If so, you
can use a timestamp as an additional clustering column.
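
Roughly like this (a sketch in Python, with illustrative table and column
names):

    from cassandra.cluster import Cluster
    from datetime import datetime, timezone

    session = Cluster(["127.0.0.1"]).connect("analytics")

    # One row per generated report, partitioned by user type; user id and
    # timestamp as clustering columns keep repeat reports by the same user.
    session.execute("""
        CREATE TABLE IF NOT EXISTS reports_by_user_type (
            user_type    text,
            user_id      text,
            generated_at timestamp,
            PRIMARY KEY ((user_type), user_id, generated_at)
        )
    """)

    # Write a row whenever a report is generated for a given type and user:
    session.execute(
        "INSERT INTO reports_by_user_type (user_type, user_id, generated_at) "
        "VALUES (%s, %s, %s)",
        ("premium", "user-123", datetime.now(timezone.utc)),
    )

    # Counting reports per user type is then a single-partition query:
    count = session.execute(
        "SELECT COUNT(*) FROM reports_by_user_type WHERE user_type = %s",
        ("premium",),
    ).one()[0]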


-- Jack Krupansky
