Posted to user@cassandra.apache.org by Fredrik Stigbäck <fr...@sitevision.se> on 2013/01/27 22:03:39 UTC

Denormalization

Hi.
Since denormalized data is a first-class citizen in Cassandra, how should
updates to denormalized data be handled?
E.g. if we have a USER CF with name, email, etc., denormalize user data
into many other CFs, and then update the information about a user (name,
email, ...), what is the best way to update those user properties, which
might be spread out over many CFs and many rows?

Regards
/Fredrik

Re: Denormalization

Posted by chandra Varahala <ha...@gmail.com>.

In my experience, we can design main column families and lookup column
families.
The main column family holds all the denormalized data; the lookup column
families store the row key of the main column family, keyed by one of its
column values.

For example, the User column family holds all of a user's denormalized
data, and there is a lookup column family named something like userByemail.
I first make a request to userByemail, which returns a unique key that is
the row key in the User column family; a call to the User column family
then returns all the data. The other lookup column families work the same
way.
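
A minimal sketch of the two-step read described above, using plain Python
dicts as stand-ins for the column families. The names users, user_by_email,
add_user and find_user_by_email are hypothetical and chosen only for
illustration; no Cassandra client is involved.

```python
# In-memory stand-ins for two column families (hypothetical names).
users = {}          # main CF: row key -> all columns for that user
user_by_email = {}  # lookup CF: email -> row key in `users`


def add_user(user_id, name, email):
    """Write the main row and the matching lookup row."""
    users[user_id] = {"name": name, "email": email}
    user_by_email[email] = user_id


def find_user_by_email(email):
    """Two reads: the lookup CF first, then the main CF."""
    user_id = user_by_email.get(email)
    if user_id is None:
        return None
    return users[user_id]


add_user("u1", "Fredrik", "fredrik@example.com")
result = find_user_by_email("fredrik@example.com")
missing = find_user_by_email("nobody@example.com")
```

Each additional lookup column family would get its own dict and one more
write in add_user.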

-
Chandra




Re: Denormalization

Posted by "Hiller, Dean" <De...@nrel.gov>.

Agreed, was just making sure others knew ;).

Dean


Re: Denormalization

Posted by Edward Capriolo <ed...@gmail.com>.

When I said that writes were cheap, I meant that in the normal case people
are making 2-10 inserts for what in a relational database might be one.
30K inserts is certainly not cheap.

Your use case with 30,000 inserts is probably a special case. Most
directory services that I am aware of (OpenLDAP, Active Directory, Sun
Directory Server) do eventually consistent master/slave and multi-master
replication, so no worries about having to background something. You just
want the replication to be fast enough that when you call the employee
about to be fired into the office, by the time he leaves and gets home
he cannot VPN in to rm -rf / your main file server :)



Re: Denormalization

Posted by "Hiller, Dean" <De...@nrel.gov>.

Sometimes this is true, sometimes not... We have a use case with an admin tool where we chose to do this denorm for ACLs to make permission checks extremely fast. That said, we have an issue with one object that has too many children (30,000): when someone gives a user access to this one object, we end up with a bad 60-second wait, and users got frustrated and tried to cancel. Since admin activity hardly ever happens, our plan is to do it on a background thread, return immediately to the user, and tell him his changes will take effect in 1 minute. After all, admin changes are infrequent anyway. This example demonstrates how it can sometimes almost burn you.

I guess my real point is that it really depends on your use cases ;). In a lot of cases denorm can work, but in some cases it burns you, so you have to balance it all. In 90% of our cases our denorm is working great, and for this one case we need to background the permission change, as we still LOVE the performance of our ACL checks.

PS. 30,000 writes in Cassandra are not cheap when done from one server ;) but in general parallelized writes are very fast for something like 500.
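
As a rough sketch of the "background it and parallelize the writes" idea, the Python below fans a batch of permission writes out to a thread pool so the caller gets control back right after submitting them. It assumes a dict as a stand-in for the ACL column family; write_permission and grant_access are hypothetical helpers, not part of any Cassandra driver.

```python
from concurrent.futures import ThreadPoolExecutor


def write_permission(acl, user, child):
    """Stand-in for one denormalized ACL write."""
    acl[(user, child)] = True


def grant_access(acl, user, children, pool):
    """Submit one write per child and return the futures immediately."""
    return [pool.submit(write_permission, acl, user, child) for child in children]


acl = {}
children = [f"child-{i}" for i in range(500)]

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = grant_access(acl, "alice", children, pool)
    # An admin tool could return to the user at this point and let the
    # writes finish in the background.

# Leaving the `with` block waits for every submitted write to complete.
granted = len(acl)
```

In a real admin tool the pool would outlive the request, so the user would not wait for the `with` block to exit.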

Later,
Dean


Re: Denormalization

Posted by Edward Capriolo <ed...@gmail.com>.

One technique is to build a tool on the client side that takes the event
and produces N mutations. In C* writes are cheap, so essentially you
rewrite everything on every change.
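
A small sketch of such a tool, under the assumption that the application
keeps its own registry of where each user's data has been copied. The
registry (copies), the event shape, and both function names are
hypothetical; the store is a nested dict, not a Cassandra client.

```python
def mutations_for_event(event, copies):
    """Turn one logical change into one mutation per stored copy.

    `copies` maps a user id to the (column_family, row_key) pairs that
    hold a denormalized copy of that user's data (an assumed registry).
    """
    user_id = event["user_id"]
    changes = event["changes"]
    mutations = [("users", user_id, dict(changes))]
    for column_family, row_key in copies.get(user_id, []):
        mutations.append((column_family, row_key, dict(changes)))
    return mutations


def apply_mutations(store, mutations):
    """Apply each mutation to the nested-dict store."""
    for column_family, row_key, columns in mutations:
        store.setdefault(column_family, {}).setdefault(row_key, {}).update(columns)


store = {
    "users": {"u1": {"name": "Old Name", "email": "u1@example.com"}},
    "comments": {"c9": {"text": "hi", "name": "Old Name"}},
    "followers": {"f3": {"name": "Old Name"}},
}
copies = {"u1": [("comments", "c9"), ("followers", "f3")]}
event = {"user_id": "u1", "changes": {"name": "New Name"}}

mutations = mutations_for_event(event, copies)
apply_mutations(store, mutations)
```

Here one name change becomes three writes: the users row plus the two
copies listed in the registry.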


Re: Denormalization

Posted by "Hiller, Dean" <De...@nrel.gov>.

Things like PlayOrm exist to help you with a half-and-half mix of denormalized and normalized data. There are more and more patterns out there that combine denormalization and normalization while still allowing for scalability. Here is one patterns page:

https://github.com/deanhiller/playorm/wiki/Patterns-Page

Dean


Re: Denormalization

Posted by "Hiller, Dean" <De...@nrel.gov>.

Oh, and check out the last pattern, "Scalable equals only index", which lets you keep mostly normalized data; the pattern denormalizes just enough that you can

 1.  Update just two pieces of info (the user's email, for instance, and the email in the xref table as well).
 2.  Let everyone else hold foreign references to that piece. (Everyone references the GUID, not the email, while the xref table maps email to GUID for your use. This can be quite a common pattern when you are having issues denormalizing.)
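
The two points above can be sketched with dicts standing in for the tables. This is only an illustration of the idea as described in this message, not PlayOrm's API; users_by_guid, guid_by_email, comments and the helper functions are hypothetical names.

```python
users_by_guid = {}  # user rows keyed by GUID; the email lives here
guid_by_email = {}  # xref table: email -> GUID
comments = {}       # other rows reference the GUID, never the email


def create_user(guid, name, email):
    users_by_guid[guid] = {"name": name, "email": email}
    guid_by_email[email] = guid


def change_email(guid, new_email):
    """An email change touches exactly two places."""
    old_email = users_by_guid[guid]["email"]
    users_by_guid[guid]["email"] = new_email  # 1: the user row
    del guid_by_email[old_email]              # 2: the xref table
    guid_by_email[new_email] = guid


create_user("g-1", "Fredrik", "old@example.com")
comments["c-1"] = {"author_guid": "g-1", "text": "hello"}
change_email("g-1", "new@example.com")

author = users_by_guid[comments["c-1"]["author_guid"]]
```

The comment row is never rewritten, because it stores only the GUID and resolves the current email through the user row.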

Dean


Re: Denormalization

Posted by Adam Venturella <av...@gmail.com>.

In my experience, if you foresee needing to do a lot of updates where a
"master" record would need to propagate its changes to other
records, then in general a non-sql based data store may be the wrong fit
for your data.

If you have a lot of data that doesn't really change or is not linked in
some way to other rows (in Cassandra's case), then a non-SQL data store
could be a great fit.

Yes, you can do some fancy stuff to force things like Cassandra to behave
like an RDBMS, but it's at the cost of application complexity; more code,
more bugs.

I often end up mixing data stores (SQL and non-SQL) to play to their
respective strengths.

If I start seeing a lot of "related" data, relational databases are really
good at solving that problem.


Re: Denormalization

Posted by Fredrik Stigbäck <fr...@sitevision.se>.

I don't have a current use case. I was just curious how applications
handle this and how to think about it when modelling, since I guess
denormalization might increase the complexity of the application.

Fredrik




-- 
Fredrik Larsson Stigbäck
SiteVision AB Vasagatan 10, 107 10 Örebro
019-17 30 30

Re: Denormalization

Posted by "Hiller, Dean" <De...@nrel.gov>.

There is really a mix of denormalization and normalization.  It really
depends on specific use cases.  To get better help on the email list, a
more specific use case may be appropriate.

Dean
