Posted to user@cassandra.apache.org by Matteo Caprari <ma...@gmail.com> on 2010/03/08 13:18:16 UTC

schema design question

Hi.

We have a collection operation that generates documents like this:

item: {
  "id": "<unique item id>",
  "title": "...",
  "liked_by": ["user_2", "user_3", ...]
}

The liked_by list contains on average 100 unique users. Users may also
appear in other items.

Our database contains a few million entries and is growing at about 1M a day.
Around 10% of the incoming data is additional info about an existing item
(i.e. more likers), so a merge operation needs to be done.

We are not too happy with our current system and are considering Cassandra.

I'm new to this kind of database, and I'd like to hear a few informed
opinions on how to design a Cassandra schema.
Of course we'd like the system to keep up with the write/update rate and
answer our key queries 'as quickly as possible'.

The 'key' queries are:
- list all the items a user liked
- list all the users that liked an item
- list all users and count how many items each user liked
(we need this every few hours and in fact we are only interested in
the top N users that liked most stuff)

Thanks!
-- 
:Matteo Caprari
matteo.caprari@gmail.com

Re: schema design question

Posted by Matteo Caprari <ma...@gmail.com>.
Well, I don't like clunky and I'm Java-friendly. I'll go for the AbstractType subclass.

Thanks for the help.

On Tue, Mar 9, 2010 at 7:33 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> On Tue, Mar 9, 2010 at 7:30 AM, Matteo Caprari <ma...@gmail.com> wrote:
>> On Tue, Mar 9, 2010 at 1:23 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>> That's true.  So you'd want to use a custom comparator where first 64
>>> bits is the Long and the rest is the userid, for instance.
>>>
>>> (Long + something else is common enough that we might want to add it
>>> to the defaults...)
>>
>> What about using a SuperColumn for each like-count and then the list
>> of users that hit that level?
>
> That would also work, it's just a little clunky pulling things out of
> a nested structure when really you want a flat list.  But if you are
> allergic to Java that is the way to go so you don't have to write a
> custom AbstractType subclass. :)
>
> -Jonathan
>



-- 
:Matteo Caprari
matteo.caprari@gmail.com

Re: schema design question

Posted by Jonathan Ellis <jb...@gmail.com>.
if you want to select stuff out w/ one query, then single CF is the
only sane choice

if not then 2 CFs may be more performant
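
Roughly, on the read side it looks like this (a sketch with a 0.6-style Python
Thrift client; assumes a connected client, keyspace 'Keyspace1', and the CF
names from your mutation maps):

from cassandra.ttypes import ColumnParent, SlicePredicate, SliceRange, ConsistencyLevel

predicate = SlicePredicate(slice_range=SliceRange(start='', finish='',
                                                  reversed=False, count=1000))

# single super CF: one call returns the 'details' and 'likers' supercolumns together
everything = client.get_slice('Keyspace1', 'example_item',
                              ColumnParent(column_family='Items'),
                              predicate, ConsistencyLevel.ONE)

# two standard CFs: one call per CF
info = client.get_slice('Keyspace1', 'example_item',
                        ColumnParent(column_family='Item_Info'),
                        predicate, ConsistencyLevel.ONE)
likers = client.get_slice('Keyspace1', 'example_item',
                          ColumnParent(column_family='Item_likers'),
                          predicate, ConsistencyLevel.ONE)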

On Wed, Mar 10, 2010 at 4:42 AM, Matteo Caprari
<ma...@gmail.com> wrote:
> I can't quite decide whether to go with a flat schema, with keys repeated
> across different CFs, or one CF with nested supercolumns.
>
> I guess there is no straight answer here, but what's a good way to reason
> about the choice?
>
> These two mutation maps should clarify my dilemma:
>
> deep_mutation_map = {
>        'example_item': {
>                'Items': [
>                        Mutation(SuperColumn('details', [
>                                Column('title', 'an article'),
>                                Column('link', 'www.example.com')
>                        ])),
>                        Mutation(SuperColumn('likers', [
>                                Column('user_1', 'xx'),
>                                Column('user_2', 'xx')
>                        ]))
>                ]
>        }
> }
>
> flat_mutation_map = {
>        'example_item': {
>                'Item_Info': [
>                        Mutation(Column('title', 'an article')),
>                        Mutation(Column('link', 'www.example.com')),
>                ],
>                'Item_likers': [
>                        Mutation(Column('user_1', 'xx')),
>                        Mutation(Column('user_2', 'xx'))
>                ]
>        }
> }
>
>
> On Tue, Mar 9, 2010 at 7:33 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>> On Tue, Mar 9, 2010 at 7:30 AM, Matteo Caprari <ma...@gmail.com> wrote:
>>> On Tue, Mar 9, 2010 at 1:23 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>>> That's true.  So you'd want to use a custom comparator where first 64
>>>> bits is the Long and the rest is the userid, for instance.
>>>>
>>>> (Long + something else is common enough that we might want to add it
>>>> to the defaults...)
>>>
>>> What about using a SuperColumn for each like-count and then the list
>>> of users that hit that level?
>>
>> That would also work, it's just a little clunky pulling things out of
>> a nested structure when really you want a flat list.  But if you are
>> allergic to Java that is the way to go so you don't have to write a
>> custom AbstractType subclass. :)
>>
>> -Jonathan
>>
>
>
>
> --
> :Matteo Caprari
> matteo.caprari@gmail.com
>

Re: schema design question

Posted by Matteo Caprari <ma...@gmail.com>.
I can't quite decide whether to go with a flat schema, with keys repeated
across different CFs, or one CF with nested supercolumns.

I guess there is no straight answer here, but what's a good way to reason
about the choice?

These two mutation maps should clarify my dilemma:

deep_mutation_map = {
        'example_item': {
                'Items': [
                        Mutation(SuperColumn('details', [
                                Column('title', 'an article'),
                                Column('link', 'www.example.com')
                        ])),
                        Mutation(SuperColumn('likers', [
                                Column('user_1', 'xx'),
                                Column('user_2', 'xx')
                        ]))
                ]
        }
}

flat_mutation_map = {
        'example_item': {
                'Item_Info': [
                        Mutation(Column('title', 'an article')),
                        Mutation(Column('link', 'www.example.com')),
                ],
                'Item_likers': [
                        Mutation(Column('user_1', 'xx')),
                        Mutation(Column('user_2', 'xx'))
                ]
        }
}


On Tue, Mar 9, 2010 at 7:33 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> On Tue, Mar 9, 2010 at 7:30 AM, Matteo Caprari <ma...@gmail.com> wrote:
>> On Tue, Mar 9, 2010 at 1:23 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>> That's true.  So you'd want to use a custom comparator where first 64
>>> bits is the Long and the rest is the userid, for instance.
>>>
>>> (Long + something else is common enough that we might want to add it
>>> to the defaults...)
>>
>> What about using a SuperColumn for each like-count and then the list
>> of users that hit that level?
>
> That would also work, it's just a little clunky pulling things out of
> a nested structure when really you want a flat list.  But if you are
> allergic to Java that is the way to go so you don't have to write a
> custom AbstractType subclass. :)
>
> -Jonathan
>



-- 
:Matteo Caprari
matteo.caprari@gmail.com

Re: schema design question

Posted by Jonathan Ellis <jb...@gmail.com>.
On Tue, Mar 9, 2010 at 7:30 AM, Matteo Caprari <ma...@gmail.com> wrote:
> On Tue, Mar 9, 2010 at 1:23 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>> That's true.  So you'd want to use a custom comparator where first 64
>> bits is the Long and the rest is the userid, for instance.
>>
>> (Long + something else is common enough that we might want to add it
>> to the defaults...)
>
> What about using a SuperColumn for each like-count and then the list
> of users that hit that level?

That would also work, it's just a little clunky pulling things out of
a nested structure when really you want a flat list.  But if you are
allergic to Java that is the way to go so you don't have to write a
custom AbstractType subclass. :)

-Jonathan

Re: schema design question

Posted by Matteo Caprari <ma...@gmail.com>.
On Tue, Mar 9, 2010 at 1:23 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> One quad-core node can handle ~14000 inserts per second so you are in
> good shape.

Well, yeah!

>> instead of 'all users that liked N items'?
>
> That's true.  So you'd want to use a custom comparator where first 64
> bits is the Long and the rest is the userid, for instance.
>
> (Long + something else is common enough that we might want to add it
> to the defaults...)

What about using a SuperColumn for each like-count and then the list
of users that hit that level?

hits: {
  '100': [user_1, user_2],
  '33': [...]
}
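
One wrinkle, schematically: when a user's count changes, the user has to move
between supercolumns, so each update is a remove plus an insert (pseudocode,
not the exact Thrift calls):

remove('topusers', super_column='99', column='user_1')
insert('topusers', SuperColumn('100', [Column('user_1', 'xx')]))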


Very helpful, thanks
--
:Matteo Caprari
matteo.caprari@gmail.com

Re: schema design question

Posted by Jonathan Ellis <jb...@gmail.com>.
On Tue, Mar 9, 2010 at 3:53 AM, Matteo Caprari <ma...@gmail.com> wrote:
> Thanks Jonathan.
>
> Correct me if I'm wrong: you are suggesting that each time we receive a new
> row (item, [users]) we do 2 operations:
>
> 1) insert (or merge) this row 'as it is' (item, [users])
> 2) for each user in [users]: insert  (user, [item])
>
> Each incoming item is liked by 100 users, so it would be 100 db ops per item.
> User ids are 20 bytes, so it's about 2 KB per item sent to the database.

Right.

> At about 10 items/sec, we are looking at 1k db ops/sec, or about 20 KB/sec.
>
> Can you make a gross estimate of hardware requirements?

One quad-core node can handle ~14000 inserts per second so you are in
good shape.

> We don't know when the like-ing happened: is there something like
> incremental column names?

You can use insert time, or just use a LexicalUUID.

> Or can I use item_id as the column name and a null-ish placeholder as the value?

Or that too.
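
Schematically, either option is a one-liner with a Python client (a sketch,
assuming the column comparator accepts the chosen names; uuid.uuid1() is a
time-based UUID generated at insert time):

import uuid

# option 1: TimeUUID generated when the like arrives, item id as the value
col_name, col_value = uuid.uuid1().bytes, item_id

# option 2: item id as the column name, empty string as a placeholder value
col_name, col_value = item_id, ''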

> I share Keith's concern: if we use Longs as column names, won't we end up
> seeing just one user instead of 'all users that liked N items'?

That's true.  So you'd want to use a custom comparator where first 64
bits is the Long and the rest is the userid, for instance.

(Long + something else is common enough that we might want to add it
to the defaults...)
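
The comparator itself would be a small Java AbstractType subclass on the
server side; as a sketch of the byte layout it would compare (Python here
just for illustration, the helper name is made up):

import struct

def pack_column_name(like_count, user_id):
    # 8 big-endian bytes of the count, then the raw user id bytes;
    # byte-wise comparison then sorts by count first and user id second
    # (for non-negative counts), so two users with the same count still
    # get distinct column names
    return struct.pack('>q', like_count) + user_id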

-Jonathan

Re: schema design question

Posted by Matteo Caprari <ma...@gmail.com>.
Thanks Jonathan.

Correct me if I'm wrong: you are suggesting that each time we receive a new
row (item, [users]) we do 2 operations:

1) insert (or merge) this row 'as it is' (item, [users])
2) for each user in [users]: insert  (user, [item])

Each incoming item is liked by 100 users, so it would be 100 db ops per item.
User ids are 20 bytes, so it's about 2 KB per item sent to the database.
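
Schematically, both writes could go into a single batch_mutate call, written
in the same shorthand as the mutation maps elsewhere in the thread (CF names
are made up; the real Thrift structs wrap each Column in a ColumnOrSuperColumn):

mutation_map = {item_id: {'Item_likers': [Mutation(Column(u, 'xx')) for u in users]}}
for u in users:
    mutation_map[u] = {'User_likes': [Mutation(Column(item_id, 'xx'))]}
client.batch_mutate(keyspace, mutation_map, ConsistencyLevel.ONE)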

At about 10 items/sec, we are looking at 1k db ops/sec, or about 20 KB/sec.

Can you make a gross estimate of hardware requirements?

(more inline questions below. sorry)

On Tue, Mar 9, 2010 at 3:33 AM, Jonathan Ellis <jb...@gmail.com> wrote:

>> - list all the items a user liked
>
> row key is user id, columns names are timeuuid of when the like-ing
> occurred, column value is either item id, or a supercolumn containing
> the denormalized item data

We don't know when the like-ing happened: is there something like
incremental column names?
Or can I use item_id as the column name and a null-ish placeholder as the value?

>> - list all the users that liked an item
>
> row key is item id, column names are same timeuuids, values are either
> user id or again denormalized

Same problem with timeuuids as above.

>> - list all users and count how many items each user liked
>
> row key is something you hardcode ("topusers"), column names are Long
> values of how many liked, column value is user id or denormalized user
> data

I share Keith's concern: if we use Longs as column names, won't we end up
seeing just one user instead of 'all users that liked N items'?

-- 
:Matteo Caprari
matteo.caprari@gmail.com

Re: schema design question

Posted by Keith Thornhill <ke...@raptr.com>.
jonathan,

wouldn't using Long values as the column names for the 3rd CF cause
potential conflicts if 2 users liked the same # of items? (only saving one
user for any given value)

was thinking about this same problem (sorted lists of top N user activity)
and thought that was a roadblock for that design.

-keith

On Mon, Mar 8, 2010 at 7:33 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> On Mon, Mar 8, 2010 at 6:18 AM, Matteo Caprari <ma...@gmail.com>
> wrote:
> > The 'key' queries are:
>
> These map straightforwardly to one CF per query.
>
> > - list all the items a user liked
>
> row key is user id, columns names are timeuuid of when the like-ing
> occurred, column value is either item id, or a supercolumn containing
> the denormalized item data
>
> > - list all the users that liked an item
>
> row key is item id, column names are same timeuuids, values are either
> user id or again denormalized
>
> > - list all users and count how many items each user liked
> > (we need this every few hours and in fact we are only interested in
> > the top N users that liked most stuff)
>
> row key is something you hardcode ("topusers"), column names are Long
> values of how many liked, column value is user id or denormalized user
> data
>
> If you just need it every few hours, run a map/reduce job (Hadoop
> integration in 0.6) to compute this that often.  Otherwise you will
> have to update it on each insert for each user which is probably a bad
> idea if you have millions of users (all that activity will go to just
> the machines replicating that row).  And if you have tens of millions
> of users you will almost certainly run into the
> row-must-fit-in-memory-during-compaction limitation that we're
> removing in 0.7.
>
> -Jonathan
>

Re: schema design question

Posted by Jonathan Ellis <jb...@gmail.com>.
On Mon, Mar 8, 2010 at 6:18 AM, Matteo Caprari <ma...@gmail.com> wrote:
> The 'key' queries are:

These map straightforwardly to one CF per query.

> - list all the items a user liked

row key is user id, columns names are timeuuid of when the like-ing
occurred, column value is either item id, or a supercolumn containing
the denormalized item data

> - list all the users that liked an item

row key is item id, column names are same timeuuids, values are either
user id or again denormalized

> - list all users and count how many items each user liked
> (we need this every few hours and in fact we are only interested in
> the top N users that liked most stuff)

row key is something you hardcode ("topusers"), column names are Long
values of how many liked, column value is user id or denormalized user
data
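
As a sketch, the top N then comes back from a single reversed slice over that
row (0.6-style Python Thrift client; the CF name 'TopUsers' is made up):

predicate = SlicePredicate(slice_range=SliceRange(start='', finish='',
                                                  reversed=True, count=10))
top10 = client.get_slice(keyspace, 'topusers',
                         ColumnParent(column_family='TopUsers'),
                         predicate, ConsistencyLevel.ONE)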

If you just need it every few hours, run a map/reduce job (Hadoop
integration in 0.6) to compute this that often.  Otherwise you will
have to update it on each insert for each user which is probably a bad
idea if you have millions of users (all that activity will go to just
the machines replicating that row).  And if you have tens of millions
of users you will almost certainly run into the
row-must-fit-in-memory-during-compaction limitation that we're
removing in 0.7.

-Jonathan