You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by "Green, John M (HP Education)" <jo...@hp.com> on 2014/02/21 17:01:21 UTC

Naive question about orphan rows

I'm very much a newbie so this may be a silly question but ...

I have a situation similar to the music service example (http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_music_service_c.html) of songs and playlists.  However, in my case, the "songs" would be considered orphans that should be deleted when no "playlists" refer to them.  Relational databases have mechanisms to manage this relationship so that a "song" could be deleted as soon as the last "playlist" referencing it is deleted.    While I do NOT need to manage this as an atomic transaction, I'm wondering what is the best way to delete orphaned rows (i.e., "songs" not referenced by any "playlists")  using Cassandra.

I guess an alternative approach would be to store "songs" directly in the "playlists" but this could lead to many redundant copies of the same "song" which is something I'm hoping to avoid.  I'm my case the "playlists" could have thousands of entries and the "songs" might be blobs of 10s of Mbytes.    Maybe I'm just having a hard time abandoning my relational roots?

John

Re: Naive question about orphan rows

Posted by Edward Capriolo <ed...@gmail.com>.

One way to handle this is that both tables should be de-normalized. Take
this:

SongsAndPlaylists
PlaylistsAndSongs

In this way your client software is charged with keeping data in sync.

When you remove a song from a PlaylistsAndSongs you do a read for that song
in SongsAnyPlaylists. If the number of people with that song is now 0 the
song can be removed.




On Wed, Feb 26, 2014 at 11:17 AM, Edward Capriolo <ed...@gmail.com>wrote:

> Right the problem with building a list of counts in a batch is what
> happens if song added as you are building the counts.
>
>
> On Wed, Feb 26, 2014 at 10:32 AM, Green, John M (HP Education) <
> john.green@hp.com> wrote:
>
>>  Edward,
>>
>>
>> Thanks for your insight.
>>
>>
>>
>> One other thought I had was to store a reference count with the "song".
>> When the last "playlist" referencing the "song" is deleted the "song" will
>> also be deleted because the reference count decrements to zero.   However,
>> this would create some nastiness when it comes to reliably maintaining
>> reference counts.   I'm not sure if it would help to split the reference
>> count into two monotonically increasing counters (number of references
>> added, and number of references deleted).
>>
>>
>>
>> In my case, users cannot browse a repository of "songs" to build a
>> playlist from scratch.  They can only import "songs" themselves or create
>> references to "songs" other users have explicitly made available to them.
>> Once a "song" is not referred to by any "playlist" it will never be
>> re-discovered so it should be deleted.   This could be done in some sort of
>> background data maintenance job that runs periodically.   Even if it is a
>> low-priority background job it look like it will create a lot overhead
>> (scanning and producing counts).
>>
>>
>>
>> John
>>
>> *From:* Edward Capriolo [mailto:edlinuxguru@gmail.com]
>> *Sent:* Wednesday, February 26, 2014 5:56 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Naive question about orphan rows
>>
>>
>>
>> It is probably ok to have redundant songs in playlists, cassandra is
>> about denormalization.
>>
>> Dealing with this issue is going to be hard since the only way to dwal
>> with this would be scanning through the firsr cf and procing counts then
>> using that information to delete in the second table. However that
>> information can change rapidly and then will fall out of sink fast.
>>
>> The only ways yo handle this are
>>
>> 1) never delete songs
>> 2) store copies of songs ib playlist
>>
>> On Friday, February 21, 2014, Green, John M (HP Education) <
>> john.green@hp.com> wrote:
>> > I'm very much a newbie so this may be a silly question but ...
>> >
>> >
>> >
>> > I have a situation similar to the music service example (
>> http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_music_service_c.html)
>> of songs and playlists.  However, in my case, the "songs" would be
>> considered orphans that should be deleted when no "playlists" refer to
>> them.  Relational databases have mechanisms to manage this relationship so
>> that a "song" could be deleted as soon as the last "playlist" referencing
>> it is deleted.    While I do NOT need to manage this as an atomic
>> transaction, I'm wondering what is the best way to delete orphaned rows
>> (i.e., "songs" not referenced by any "playlists")  using Cassandra.
>> >
>> >
>> >
>> > I guess an alternative approach would be to store "songs" directly in
>> the "playlists" but this could lead to many redundant copies of the same
>> "song" which is something I'm hoping to avoid.  I'm my case the "playlists"
>> could have thousands of entries and the "songs" might be blobs of 10s of
>> Mbytes.    Maybe I'm just having a hard time abandoning my relational roots?
>> >
>> >
>> >
>> > John
>>
>> --
>> Sorry this was sent from mobile. Will do less grammar and spell check
>> than usual.
>>
>
>

Re: Naive question about orphan rows

Posted by Edward Capriolo <ed...@gmail.com>.

Right the problem with building a list of counts in a batch is what happens
if song added as you are building the counts.


On Wed, Feb 26, 2014 at 10:32 AM, Green, John M (HP Education) <
john.green@hp.com> wrote:

>  Edward,
>
>
> Thanks for your insight.
>
>
>
> One other thought I had was to store a reference count with the "song".
> When the last "playlist" referencing the "song" is deleted the "song" will
> also be deleted because the reference count decrements to zero.   However,
> this would create some nastiness when it comes to reliably maintaining
> reference counts.   I'm not sure if it would help to split the reference
> count into two monotonically increasing counters (number of references
> added, and number of references deleted).
>
>
>
> In my case, users cannot browse a repository of "songs" to build a
> playlist from scratch.  They can only import "songs" themselves or create
> references to "songs" other users have explicitly made available to them.
> Once a "song" is not referred to by any "playlist" it will never be
> re-discovered so it should be deleted.   This could be done in some sort of
> background data maintenance job that runs periodically.   Even if it is a
> low-priority background job it look like it will create a lot overhead
> (scanning and producing counts).
>
>
>
> John
>
> *From:* Edward Capriolo [mailto:edlinuxguru@gmail.com]
> *Sent:* Wednesday, February 26, 2014 5:56 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Naive question about orphan rows
>
>
>
> It is probably ok to have redundant songs in playlists, cassandra is about
> denormalization.
>
> Dealing with this issue is going to be hard since the only way to dwal
> with this would be scanning through the firsr cf and procing counts then
> using that information to delete in the second table. However that
> information can change rapidly and then will fall out of sink fast.
>
> The only ways yo handle this are
>
> 1) never delete songs
> 2) store copies of songs ib playlist
>
> On Friday, February 21, 2014, Green, John M (HP Education) <
> john.green@hp.com> wrote:
> > I'm very much a newbie so this may be a silly question but ...
> >
> >
> >
> > I have a situation similar to the music service example (
> http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_music_service_c.html)
> of songs and playlists.  However, in my case, the "songs" would be
> considered orphans that should be deleted when no "playlists" refer to
> them.  Relational databases have mechanisms to manage this relationship so
> that a "song" could be deleted as soon as the last "playlist" referencing
> it is deleted.    While I do NOT need to manage this as an atomic
> transaction, I'm wondering what is the best way to delete orphaned rows
> (i.e., "songs" not referenced by any "playlists")  using Cassandra.
> >
> >
> >
> > I guess an alternative approach would be to store "songs" directly in
> the "playlists" but this could lead to many redundant copies of the same
> "song" which is something I'm hoping to avoid.  I'm my case the "playlists"
> could have thousands of entries and the "songs" might be blobs of 10s of
> Mbytes.    Maybe I'm just having a hard time abandoning my relational roots?
> >
> >
> >
> > John
>
> --
> Sorry this was sent from mobile. Will do less grammar and spell check than
> usual.
>

RE: Naive question about orphan rows

Posted by "Green, John M (HP Education)" <jo...@hp.com>.

Edward,

Thanks for your insight.

One other thought I had was to store a reference count with the "song".  When the last "playlist" referencing the "song" is deleted the "song" will also be deleted because the reference count decrements to zero.   However, this would create some nastiness when it comes to reliably maintaining reference counts.   I'm not sure if it would help to split the reference count into two monotonically increasing counters (number of references added, and number of references deleted).

In my case, users cannot browse a repository of "songs" to build a playlist from scratch.  They can only import "songs" themselves or create references to "songs" other users have explicitly made available to them.  Once a "song" is not referred to by any "playlist" it will never be re-discovered so it should be deleted.   This could be done in some sort of background data maintenance job that runs periodically.   Even if it is a low-priority background job it look like it will create a lot overhead (scanning and producing counts).

John
From: Edward Capriolo [mailto:edlinuxguru@gmail.com]
Sent: Wednesday, February 26, 2014 5:56 AM
To: user@cassandra.apache.org
Subject: Re: Naive question about orphan rows

It is probably ok to have redundant songs in playlists, cassandra is about denormalization.

Dealing with this issue is going to be hard since the only way to dwal with this would be scanning through the firsr cf and procing counts then using that information to delete in the second table. However that information can change rapidly and then will fall out of sink fast.

The only ways yo handle this are

1) never delete songs
2) store copies of songs ib playlist

On Friday, February 21, 2014, Green, John M (HP Education) <jo...@hp.com>> wrote:
> I'm very much a newbie so this may be a silly question but ...
>
>
>
> I have a situation similar to the music service example (http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_music_service_c.html) of songs and playlists.  However, in my case, the "songs" would be considered orphans that should be deleted when no "playlists" refer to them.  Relational databases have mechanisms to manage this relationship so that a "song" could be deleted as soon as the last "playlist" referencing it is deleted.    While I do NOT need to manage this as an atomic transaction, I'm wondering what is the best way to delete orphaned rows (i.e., "songs" not referenced by any "playlists")  using Cassandra.
>
>
>
> I guess an alternative approach would be to store "songs" directly in the "playlists" but this could lead to many redundant copies of the same "song" which is something I'm hoping to avoid.  I'm my case the "playlists" could have thousands of entries and the "songs" might be blobs of 10s of Mbytes.    Maybe I'm just having a hard time abandoning my relational roots?
>
>
>
> John

--
Sorry this was sent from mobile. Will do less grammar and spell check than usual.

Re: Naive question about orphan rows

Posted by Edward Capriolo <ed...@gmail.com>.

It is probably ok to have redundant songs in playlists, cassandra is about
denormalization.

Dealing with this issue is going to be hard since the only way to dwal with
this would be scanning through the firsr cf and procing counts then using
that information to delete in the second table. However that information
can change rapidly and then will fall out of sink fast.

The only ways yo handle this are

1) never delete songs
2) store copies of songs ib playlist

On Friday, February 21, 2014, Green, John M (HP Education) <
john.green@hp.com> wrote:
> I'm very much a newbie so this may be a silly question but ...
>
>
>
> I have a situation similar to the music service example (
http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_music_service_c.html)
of songs and playlists.  However, in my case, the "songs" would be
considered orphans that should be deleted when no "playlists" refer to
them.  Relational databases have mechanisms to manage this relationship so
that a "song" could be deleted as soon as the last "playlist" referencing
it is deleted.    While I do NOT need to manage this as an atomic
transaction, I'm wondering what is the best way to delete orphaned rows
(i.e., "songs" not referenced by any "playlists")  using Cassandra.
>
>
>
> I guess an alternative approach would be to store "songs" directly in the
"playlists" but this could lead to many redundant copies of the same "song"
which is something I'm hoping to avoid.  I'm my case the "playlists" could
have thousands of entries and the "songs" might be blobs of 10s of
Mbytes.    Maybe I'm just having a hard time abandoning my relational roots?
>
>
>
> John

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.