Posted to user@cassandra.apache.org by Raj <ra...@gmail.com> on 2011/01/07 17:28:56 UTC

Is this a good schema design to implement a social application..

My question is in the context of a social network schema design.

I am thinking of the following schema for storing the user data that is
required when a user logs in and is taken to his homepage.
(I aimed for a schema in which a single row read retrieves all the data
needed to render that user's homepage.)

UserSuperColumnFamily: {    // Column Family

UserIDKey:
{columns:            MyName, MyEmail, MyCity,...etc
 supercolumns:    MyFollowersList, MyFollowiesList, MyPostsIdKeysList,
MyInterestsList, MyAlbumsIdKeysList, MyVideoIdKeysList,
RecentNotificationsForUserList,  MessagesReceivedList,
MessagesSentList, AccountSettingsList, RecentSelfActivityList,
UpdatesFromFollowiesList
}
}

Thus the user's news feed would be generated from the super column
UpdatesFromFollowiesList. UpdatesFromFollowiesList would contain only
the IDs of the posts, not the entire post data.

Questions:

1.) What problems could this design have, and how could it be improved?

2.) Would frequent, heavy overwrites/row mutations (for example, when
propagating post updates for the news feed from a user to all of his
followies), which leave a row spread across several SSTables, degrade
read performance? Is it suitable to use the row cache here (the row is
big, but all of its data is needed for as long as the user is logged
in)? If I do not use the cache, it may be very expensive to pull the
row each time some data is required for the user, since the row would
be in several SSTables. How can I improve read performance here?

The actual data of the posts from the network would be retrieved by
PostIdKey through subsequent read queries against the column family
PostsSuperColumnFamily, which would look as follows:

PostsSuperColumnFamily:{

PostIdKey:
{
columns:            PostOwnerId, PostBody
supercolumns:   TagsForPost {list of columns of all tags for the
post}, PeopleWhoLikedThisPost {list of columns of UserIdKey of all the
likers}
}
}
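
The two-step read described above (post IDs first, post bodies second)
could be sketched roughly as follows. This is only an in-memory
illustration: the dictionaries stand in for the two column families, and
multiget and load_news_feed are hypothetical helper names, not part of
any Cassandra client API.

```python
# Hypothetical in-memory stand-ins for the two column families.
user_rows = {
    "user42": {"UpdatesFromFollowiesList": ["post3", "post2", "post1"]},
}
post_rows = {
    "post1": {"PostOwnerId": "user7", "PostBody": "first"},
    "post2": {"PostOwnerId": "user9", "PostBody": "second"},
    "post3": {"PostOwnerId": "user7", "PostBody": "third"},
}


def multiget(column_family, keys):
    """Fetch several rows by key in one call; missing keys are skipped."""
    return {key: column_family[key] for key in keys if key in column_family}


def load_news_feed(user_id, limit=20):
    # Read 1: only the post IDs stored in the user's row.
    post_ids = user_rows[user_id]["UpdatesFromFollowiesList"][:limit]
    # Read 2: one batched lookup of the post rows by PostIdKey.
    posts = multiget(post_rows, post_ids)
    # Preserve the feed order given by the ID list.
    return [posts[post_id] for post_id in post_ids if post_id in posts]


feed = load_news_feed("user42", limit=2)
```

The point of the sketch is that the homepage costs two round trips, not
one: the single-row read only yields IDs, and the post bodies still
have to be fetched by key.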

Is this the best design to go with, or are there any issues to consider
here? Thanks in anticipation of your valuable comments!

Re: Is this a good schema design to implement a social application..

Posted by Edward Capriolo <ed...@gmail.com>.

On Fri, Jan 7, 2011 at 11:38 PM, Rajkumar Gupta <ra...@gmail.com> wrote:
> In the twissandra example,
> http://www.riptano.com/docs/0.6/data_model/twissandra#adding-friends ,
> I find that they have split the materialized view of a user's homepage
> (his followers list, tweets from friends, etc.) into several
> column families instead of putting it in super columns inside a single
> super column family, thereby making the rows skinnier. I was wondering
> which one will give better read performance.
> I think skinnier rows will definitely have the advantage of fewer
> mutations per row and thus good read performance when only those rows
> need to be retrieved. Plus, super columns such as the followers list are
> avoided (which sounds good, as the super column indexing limitations
> will not hurt). But I am still not sure whether it would be beneficial
> in terms of performance numbers to split the materialized view of a
> single user into several column families instead of a single row in a
> single super column family.

From your description, UserSuperColumnFamily seems to be both a
Standard Column Family and a Super Column Family. You cannot do that.
However, you can encode things such as MyName, MyCity and MyState into a
'UserInfo' super column: UserInfo:MyState...
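
A minimal sketch of that restriction, using nested Python dictionaries as
a stand-in for a Super Column Family row. The function name
validate_super_row and the sample values are made up for illustration;
nothing here is a Cassandra API.

```python
def validate_super_row(row):
    """Raise if any top-level entry is a plain value rather than a super column."""
    for name, subcolumns in row.items():
        if not isinstance(subcolumns, dict):
            raise TypeError(
                f"{name!r} is a plain column; a Super Column Family row "
                "holds only super columns"
            )


# Mixes a standard column (MyName) with a super column: not allowed.
mixed_row = {
    "MyName": "Raj",
    "MyFollowersList": {"user7": ""},
}

# Same data with the scalar fields folded into a 'UserInfo' super column.
fixed_row = {
    "UserInfo": {"MyName": "Raj", "MyCity": "ExampleCity", "MyState": "ExampleState"},
    "MyFollowersList": {"user7": ""},
}

validate_super_row(fixed_row)  # passes; mixed_row would raise TypeError
```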

(As you mentioned) subcolumns of a Super Column are not indexed, and the
whole Super Column has to be completely deserialized for each access.
Because of this they are not widely used for anything but small keys
with a few columns. This applies to mutations as well: the row can
exist in multiple SSTables until it finally gets compacted. That can
result in much more storage used for an object that changes often.
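
To make the multiple-SSTable point concrete, here is a deliberately
simplified model. It assumes each SSTable is just an immutable dict of
row fragments and ignores memtables, bloom filters, and tombstones;
read_row, compact, and stored_cells are hypothetical names.

```python
def read_row(sstables, key):
    """Merge the fragments of one row; the newest timestamp wins per column."""
    merged, touched = {}, 0
    for table in sstables:
        fragment = table.get(key)
        if fragment is None:
            continue
        touched += 1  # this SSTable had to be consulted for the row
        for column, (value, ts) in fragment.items():
            if column not in merged or ts > merged[column][1]:
                merged[column] = (value, ts)
    return merged, touched


def compact(sstables):
    """Rewrite all SSTables into one, keeping only the newest cell per column."""
    keys = set()
    for table in sstables:
        keys.update(table)
    return [{key: read_row(sstables, key)[0] for key in keys}]


def stored_cells(sstables):
    return sum(len(row) for table in sstables for row in table.values())


# The same row written three times; MyCity is overwritten on every flush.
sstables = [
    {"user42": {"MyCity": ("A", 1), "feed:post1": ("", 1)}},
    {"user42": {"MyCity": ("B", 2)}},
    {"user42": {"MyCity": ("C", 3), "feed:post2": ("", 3)}},
]

before, touched_before = read_row(sstables, "user42")
compacted = compact(sstables)
after, touched_after = read_row(compacted, "user42")
```

In this toy model the read result is identical before and after
compaction, but the uncompacted read consults three SSTables and the
overwritten MyCity values still occupy space until compaction runs.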

Most designs use composite keys, or something like JSON-encoded values,
with Standard Column Families to achieve something like a Super Column.
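
Both ideas can be sketched with a sorted in-memory row. This is an
illustration under assumptions, not a client API: the Row class, the
"group:member" naming scheme, and the sample values are all invented
here.

```python
import bisect
import json


class Row:
    """Hypothetical stand-in for one Standard Column Family row:
    columns are kept sorted by name, so range slices are cheap."""

    def __init__(self):
        self._names = []
        self._values = {}

    def insert(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, finish):
        """Return columns with start <= name < finish, in name order."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_left(self._names, finish)
        return [(name, self._values[name]) for name in self._names[lo:hi]]


def composite(group, member):
    # The group plays the role of the super column, the member the subcolumn.
    return f"{group}:{member}"


def group_slice(row, group):
    # ';' is the ASCII character right after ':', so this range covers
    # exactly the columns whose names start with "group:".
    return row.slice(f"{group}:", f"{group};")


row = Row()
row.insert(composite("Followers", "bob"), "")
row.insert(composite("Followers", "alice"), "")
row.insert(composite("Posts", "0001"), "")
# Alternative: one column holding a JSON blob of small, rarely changing fields.
row.insert("UserInfo", json.dumps({"MyName": "Raj", "MyCity": "ExampleCity"}))

followers = [name.split(":", 1)[1] for name, _ in group_slice(row, "Followers")]
user_info = json.loads(dict(row.slice("UserInfo", "UserInfo\x00"))["UserInfo"])
```

With composite names, a slice over one prefix returns just that "list"
without touching the other columns. The JSON variant trades that
granularity for a single column that is always read and written whole.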

(SuperColumns are not always as Super as they seem :)

Re: Is this a good schema design to implement a social application..

Posted by Rajkumar Gupta <ra...@gmail.com>.

In the twissandra example,
http://www.riptano.com/docs/0.6/data_model/twissandra#adding-friends ,
I find that they have split the materialized view of a user's homepage
(his followers list, tweets from friends, etc.) into several
column families instead of putting it in super columns inside a single
super column family, thereby making the rows skinnier. I was wondering
which one will give better read performance.
I think skinnier rows will definitely have the advantage of fewer
mutations per row and thus good read performance when only those rows
need to be retrieved. Plus, super columns such as the followers list are
avoided (which sounds good, as the super column indexing limitations
will not hurt). But I am still not sure whether it would be beneficial
in terms of performance numbers to split the materialized view of a
single user into several column families instead of a single row in a
single super column family.
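
A rough sketch of the split layout, assuming one column family per list,
each keyed by user ID. The names users, followers, timeline, follow,
post_update, and read_timeline are hypothetical and are not taken from
the twissandra source.

```python
from collections import defaultdict

# One dict per hypothetical column family, each keyed by user ID.
users = {"alice": {"MyName": "Alice"}, "bob": {"MyName": "Bob"}}
followers = defaultdict(dict)  # followee -> {follower: ""}
timeline = defaultdict(dict)   # user -> {timestamp: post_id}


def follow(follower, followee):
    followers[followee][follower] = ""


def post_update(author, post_id, ts):
    # Fan-out on write: only the followers' timeline rows are mutated;
    # their profile rows and follower lists are left untouched.
    for follower in followers[author]:
        timeline[follower][ts] = post_id


def read_timeline(user, limit=20):
    columns = timeline[user]
    # Newest first, as a homepage feed would show it.
    return [columns[ts] for ts in sorted(columns, reverse=True)[:limit]]


follow("bob", "alice")
follow("carol", "alice")
post_update("alice", "post1", ts=1)
post_update("alice", "post2", ts=2)
bob_feed = read_timeline("bob")
```

Here the frequently mutated data (the timeline) lives in its own skinny
rows, so heavy feed fan-out does not touch the rows that hold the
profile or the follower list.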





On Sat, Jan 8, 2011 at 2:05 AM, Rajkumar Gupta <ra...@gmail.com> wrote:
> The fact that subcolumns inside super columns aren't indexed
> currently may hurt here whenever a small number (10-20) of subcolumns
> needs to be retrieved from a large list of subcolumns of a super column
> like MyPostsIdKeysList.

Re: Is this a good schema design to implement a social application..

Posted by Rajkumar Gupta <ra...@gmail.com>.

The fact that subcolumns inside super columns aren't indexed
currently may hurt here whenever a small number (10-20) of subcolumns
needs to be retrieved from a large list of subcolumns of a super column
like MyPostsIdKeysList.
