Posted to user@cassandra.apache.org by Jonathan Ellis <jb...@gmail.com> on 2009/05/19 05:19:17 UTC

schema example

Does anyone have a simple app schema they can share?

I can't share the one for our main app.  But we do need an example
here.  A real one would be nice if we can find one.

I checked App Engine.  They don't have a whole lot of examples either.
 They do have a really simple one:
http://code.google.com/appengine/docs/python/gettingstarted/usingdatastore.html

The most important thing in Cassandra modeling is choosing a good key,
since that is what most of your lookups will be by.  Keys are also how
Cassandra scales -- Cassandra can handle effectively infinite keys
(given enough nodes obviously) but only thousands to millions of
columns per key/CF (depending on what API calls you use -- Jun is
adding one now that does not deserialize everything in the whole CF
into memory.  The rest will need to follow this model eventually too).

For this guestbook I think the choice is obvious: use the name as the
key, and have a single simple CF for the messages.  Each column will
be a message (you can even use the mandatory timestamp field as part
of your user-visible data.  win!).  You get the list (or page) of
users with get_key_range and then their messages with get_slice.

<ColumnFamily ColumnSort="Name" Name="Message"/>
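
A minimal sketch of how this guestbook layout behaves, using a plain
Python dict as a stand-in for the column family.  The names
get_key_range and get_slice come from the message above, but the
signatures, the insert helper, and the in-memory model are assumptions
made for illustration; this is not the Thrift API.

```python
import time

# Hypothetical in-memory stand-in for the "Message" column family:
# row key (guest name) -> {column name -> (value, timestamp)}
message_cf = {}

def insert(cf, key, column_name, value, timestamp=None):
    """Store one column under a row key, stamping it with a timestamp."""
    if timestamp is None:
        timestamp = int(time.time() * 1000)
    cf.setdefault(key, {})[column_name] = (value, timestamp)

def get_key_range(cf, start="", limit=100):
    """Return up to `limit` row keys >= start, in sorted order."""
    return [k for k in sorted(cf) if k >= start][:limit]

def get_slice(cf, key, limit=100):
    """Return (name, value, timestamp) tuples for a row, sorted by name."""
    row = cf.get(key, {})
    return [(name, value, ts) for name, (value, ts) in sorted(row.items())][:limit]

insert(message_cf, "alice", "msg-001", "Hello!", timestamp=1000)
insert(message_cf, "alice", "msg-002", "Nice site.", timestamp=2000)
insert(message_cf, "bob", "msg-001", "First visit.", timestamp=1500)

# Page through users, then read each user's messages.
for user in get_key_range(message_cf):
    for name, value, ts in get_slice(message_cf, user):
        print(user, name, value, ts)
```

The timestamp travels with every column, which is why it can double as
the "posted at" time shown to users.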

Anyone got another one for pedagogical purposes?

-Jonathan

Re: schema example

Posted by Evan Weaver <ew...@gmail.com>.

FYI, Yahoo does an interesting thing in this case. They usually use
token pagination, but if a page displays limit 20 records, they
actually request limit 100 behind the scenes. The extra records are
used to generate deep links. So instead of just being able to go to
the next page:

prev | cur | next

You can render:

prev | cur | next | cur + 2 | cur + 3 | cur + 4 | cur + 5

This lets you smoothly trade off navigability for performance.
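
A rough sketch of that over-fetch idea.  The fetch helper, the use of
integer record ids as tokens, and the constant names are assumptions
for illustration only; the page size of 20 and fetch size of 100 are
the numbers from the description above.

```python
PAGE_SIZE = 20      # records shown per page
FETCH_SIZE = 100    # records actually requested behind the scenes

def fetch(records, token, limit):
    """Hypothetical token query: up to `limit` records with id > token.
    `records` must be sorted ascending; token=None starts at the head."""
    if token is None:
        return records[:limit]
    return [r for r in records if r > token][:limit]

def page_with_deep_links(records, token=None):
    """Return the current page plus start tokens for the pages after it."""
    batch = fetch(records, token, FETCH_SIZE)
    current = batch[:PAGE_SIZE]
    # The start token for a later page is the last record of the page
    # before it.  Only pages known to hold at least one record get a link.
    links = [batch[i - 1] for i in range(PAGE_SIZE, len(batch), PAGE_SIZE)]
    return current, links

records = list(range(1, 251))
current, links = page_with_deep_links(records)
print(current[0], current[-1], links)
```

Raising FETCH_SIZE buys more deep links at the cost of a larger read
per page view, which is the trade-off described above.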

Evan

On Fri, Jul 3, 2009 at 6:53 PM, Evan Weaver<ew...@gmail.com> wrote:
> (From talking on IRC):
>
> I think this boils down to the offset/limit vs. token/limit debate.
>
> Token/limit is fine in all cases for me, but you still have to be able
> to query the head of the list (with a limit, but no token) to get
> started. Right now there is no facility for that on time-sorted column
> families:
>
>  list<column_t> get_columns_since(1:string tablename, 2:string key,
> 3:string columnParent, 4:i64 timeStamp)
>
> I don't think token ranges are supported on time columns, either.
>
> Also, to be optimally useable, you need to be able to begin a
> token-based pagination system from either the head or tail of the
> list, but that may not be possible with sstables.
>
> It may just be an oversight...the API is confusingly organized, and
> it's hard to be sure if some likely feature is there or not.
>
> Related:
>
> http://issues.apache.org/jira/browse/CASSANDRA-261
> http://issues.apache.org/jira/browse/CASSANDRA-217
> http://issues.apache.org/jira/browse/CASSANDRA-263
>
>
> Evan
>



-- 
Evan Weaver

Re: schema example

Posted by Jonathan Ellis <jb...@gmail.com>.

On Fri, Jul 3, 2009 at 8:53 PM, Evan Weaver<ew...@gmail.com> wrote:
> (From talking on IRC):
>
> I think this boils down to the offset/limit vs. token/limit debate.
>
> Token/limit is fine in all cases for me, but you still have to be able
> to query the head of the list (with a limit, but no token) to get
> started. Right now there is no facility for that on time-sorted column
> families:
>
>  list<column_t> get_columns_since(1:string tablename, 2:string key,
> 3:string columnParent, 4:i64 timeStamp)

Basically we need get_columns_since to add the kind of functionality we
have in Slice (or will, after CASSANDRA-261 is committed).

it's probably better to get 240 (and 185 + 189) done sooner than later
though instead of wasting effort on an API we know is broken.

(the old get_slice could do basically anything since it deserialized
the entire CF into memory.  we're moving away from that to support
larger-than-memory CFs.)

-Jonathan

Re: schema example

Posted by Evan Weaver <ew...@gmail.com>.

(From talking on IRC):

I think this boils down to the offset/limit vs. token/limit debate.

Token/limit is fine in all cases for me, but you still have to be able
to query the head of the list (with a limit, but no token) to get
started. Right now there is no facility for that on time-sorted column
families:

  list<column_t> get_columns_since(1:string tablename, 2:string key,
3:string columnParent, 4:i64 timeStamp)

I don't think token ranges are supported on time columns, either.

Also, to be optimally useable, you need to be able to begin a
token-based pagination system from either the head or tail of the
list, but that may not be possible with sstables.
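
To make the offset/limit vs. token/limit difference concrete, here is a
small sketch over an in-memory list sorted newest-first.  Both function
names and the (timestamp, value) layout are assumptions for
illustration, and timestamps are assumed unique; nothing here is an
existing Cassandra call.

```python
def slice_by_token(columns, token=None, limit=20):
    """Token/limit query over (timestamp, value) pairs sorted newest-first.
    token=None is the head query; otherwise resume strictly after the
    last timestamp the client has already seen."""
    if token is None:
        start = 0
    else:
        start = next(
            (i for i, (ts, _) in enumerate(columns) if ts < token),
            len(columns),
        )
    return columns[start:start + limit]

def slice_by_offset(columns, offset=0, limit=20):
    """Offset/limit query: positions shift whenever the head changes."""
    return columns[offset:offset + limit]

columns = [(ts, "post-%d" % ts) for ts in range(10, 0, -1)]
page1 = slice_by_token(columns, limit=3)      # timestamps 10, 9, 8

# A new column arrives at the head between two page requests.
columns.insert(0, (11, "post-11"))

by_token = slice_by_token(columns, token=page1[-1][0], limit=3)
by_offset = slice_by_offset(columns, offset=3, limit=3)
print([ts for ts, _ in by_token])
print([ts for ts, _ in by_offset])
```

The token version continues where page one stopped, while the offset
version repeats a column the client already has.  Either way, the first
request needs a head query that takes a limit but no token.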

It may just be an oversight...the API is confusingly organized, and
it's hard to be sure if some likely feature is there or not.

Related:

http://issues.apache.org/jira/browse/CASSANDRA-261
http://issues.apache.org/jira/browse/CASSANDRA-217
http://issues.apache.org/jira/browse/CASSANDRA-263


Evan

On Fri, Jul 3, 2009 at 6:06 PM, Evan Weaver<ew...@gmail.com> wrote:
> That requires you to know the timestamp, so you can't just ask for the
> most recent one.
>
> Evan
>



-- 
Evan Weaver

Re: schema example

Posted by Evan Weaver <ew...@gmail.com>.

That requires you to know the timestamp, so you can't just ask for the
most recent one.
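
A tiny model of the problem, assuming a "since" query simply filters
on a caller-supplied timestamp.  The function bodies, the inclusive
comparison, and the (name, value, timestamp) tuples are assumptions for
illustration and are not taken from the Cassandra source.

```python
def get_columns_since(columns, timestamp):
    """Model of a 'since' query: columns written at or after `timestamp`."""
    return [col for col in columns if col[2] >= timestamp]

def most_recent(columns, limit):
    """What a head query would return: the newest `limit` columns,
    with no timestamp supplied by the caller."""
    return sorted(columns, key=lambda col: col[2], reverse=True)[:limit]

posts = [("first", "...", 100), ("second", "...", 200), ("third", "...", 300)]

# The caller has to guess a cutoff, and a cutoff that is too late
# returns nothing even though posts exist.
print(get_columns_since(posts, 250))
print(get_columns_since(posts, 400))

# A head query needs only a limit.
print(most_recent(posts, 2))
```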

Evan

On Fri, Jul 3, 2009 at 6:02 PM, Jonathan Ellis<jb...@gmail.com> wrote:
> get_columns_since
>



-- 
Evan Weaver

Re: schema example

Posted by Jonathan Ellis <jb...@gmail.com>.

get_columns_since

On Fri, Jul 3, 2009 at 7:21 PM, Evan Weaver<ew...@gmail.com> wrote:
> This helps a lot.
>
> However, I can't find any API method that actually lets me do a
> slice query on a time-sorted column, as necessary for the second blog
> example. I get the following error on r789419:
>
> InvalidRequestException: get_slice_from requires CF indexed by name
>
> Evan
>

Re: schema example

Posted by Evan Weaver <ew...@gmail.com>.

This helps a lot.

However, I can't find any API method that actually lets me do a
slice query on a time-sorted column, as necessary for the second blog
example. I get the following error on r789419:

InvalidRequestException: get_slice_from requires CF indexed by name

Evan

On Tue, May 19, 2009 at 8:00 PM, Jonathan Ellis<jb...@gmail.com> wrote:
> Mail storage, man, I think pretty much anything I could come up with
> would look pretty simplistic compared to what "real" systems do in
> that domain. :)
>
> But blogs, I think I can handle those.  Let's make ours multiuser
> or there isn't enough scale to make it interesting. :)
>
> The interesting thing here is we want to be able to query two things
> efficiently:
>  - the most recent posts belonging to a given blog, in reverse
> chronological order
>  - a single post and its comments, in chronological order
>
> At first glance you might think we can again reasonably do this with a
> single CF, this time a super CF:
>
> <ColumnFamily ColumnType="Super" ColumnSort="Time" Name="Post"/>
>
> The key is the blog name, the supercolumns are posts and the
> subcolumns are comments.  This would be reasonable BUT supercolumns
> are just containers, they have no data or timestamp associated with
> them directly (only through their subcolumns).  So you cannot sort a
> super CF by time.
>
> So instead what I would do would be to use two CFs:
>
> <ColumnFamily ColumnSort="Time" Name="Post"/>
> <ColumnFamily ColumnSort="Time" Name="Comment"/>
>
> For the first, the keys used would be blog names, and the columns
> would be the post titles and body.  So to get a list of most recent
> posts you just do a slice query.  Even though Cassandra currently
> handles large groups of columns sub-optimally, even with a blog
> updated several times a day you'd be safe taking this approach (i.e.
> we'll have that problem fixed before you start seeing it :).
>
> For the second, the keys are blog name<delimiter><post title>.  The
> columns are the comment data.  You can serialize these a number of
> ways; I would probably use title as the column name and have the value
> be the author + body (e.g. as a json dict).  Again we use the slice
> call to get the comments in order.  (We will have to manually reverse
> what slice gives us since time sort is always reverse chronological
> atm, but the overhead of doing this in memory will be negligible.)
>
> Does this help?
>
> -Jonathan
>



-- 
Evan Weaver

Re: schema example

Posted by Jonathan Ellis <jb...@gmail.com>.

Mail storage, man, I think pretty much anything I could come up with
would look pretty simplistic compared to what "real" systems do in
that domain. :)

But blogs, I think I can handle those.  Let's make ours multiuser
or there isn't enough scale to make it interesting. :)

The interesting thing here is we want to be able to query two things
efficiently:
 - the most recent posts belonging to a given blog, in reverse
chronological order
 - a single post and its comments, in chronological order

At first glance you might think we can again reasonably do this with a
single CF, this time a super CF:

<ColumnFamily ColumnType="Super" ColumnSort="Time" Name="Post"/>

The key is the blog name, the supercolumns are posts and the
subcolumns are comments.  This would be reasonable BUT supercolumns
are just containers, they have no data or timestamp associated with
them directly (only through their subcolumns).  So you cannot sort a
super CF by time.

So instead what I would do would be to use two CFs:

<ColumnFamily ColumnSort="Time" Name="Post"/>
<ColumnFamily ColumnSort="Time" Name="Comment"/>

For the first, the keys used would be blog names, and the columns
would be the post titles and body.  So to get a list of most recent
posts you just do a slice query.  Even though Cassandra currently
handles large groups of columns sub-optimally, even with a blog
updated several times a day you'd be safe taking this approach (i.e.
we'll have that problem fixed before you start seeing it :).

For the second, the keys are blog name<delimiter><post title>.  The
columns are the comment data.  You can serialize these a number of
ways; I would probably use title as the column name and have the value
be the author + body (e.g. as a json dict).  Again we use the slice
call to get the comments in order.  (We will have to manually reverse
what slice gives us since time sort is always reverse chronological
atm, but the overhead of doing this in memory will be negligible.)
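
A minimal sketch of the two-CF layout, again with plain Python
structures standing in for the column families.  The "|" delimiter,
the helper names, and the in-memory slice are assumptions for
illustration; only the key layout, the JSON value, and the reversal
step follow the description above.

```python
import json

DELIM = "|"  # assumed delimiter; pick one that cannot occur in blog names

# Hypothetical in-memory stand-ins for the two column families:
# row key -> list of (column name, value, timestamp)
post_cf = {}
comment_cf = {}

def insert(cf, key, name, value, timestamp):
    cf.setdefault(key, []).append((name, value, timestamp))

def get_slice(cf, key, limit=20):
    """Model of a slice on a time-sorted CF: newest columns first."""
    row = sorted(cf.get(key, []), key=lambda col: col[2], reverse=True)
    return row[:limit]

def recent_posts(blog, limit=20):
    """Most recent posts for a blog, in reverse chronological order."""
    return get_slice(post_cf, blog, limit)

def comments_for(blog, post_title, limit=100):
    """Comments in chronological order: reverse what the slice returns."""
    key = blog + DELIM + post_title
    newest_first = get_slice(comment_cf, key, limit)
    return [(name, json.loads(value), ts)
            for name, value, ts in reversed(newest_first)]

insert(post_cf, "evans-blog", "Hello world", "First post body", 100)
insert(post_cf, "evans-blog", "Schemas", "Second post body", 200)

comment_key = "evans-blog" + DELIM + "Schemas"
insert(comment_cf, comment_key, "Nice",
       json.dumps({"author": "jun", "body": "Good example."}), 210)
insert(comment_cf, comment_key, "Question",
       json.dumps({"author": "jonathan", "body": "What about paging?"}), 220)

print([name for name, _, _ in recent_posts("evans-blog")])
print([c["author"] for _, c, _ in comments_for("evans-blog", "Schemas")])
```

Each query touches exactly one row: the blog's row in Post for the
front page, and the blog-plus-title row in Comment for a single post.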

Does this help?

-Jonathan

On Tue, May 19, 2009 at 11:49 AM, Evan Weaver <ev...@cloudbur.st> wrote:
> Even if it's not actually in real-life use, some examples for common
> domains would really help clarify things.
>
>  * blog
>  * email storage
>  * search index
>
> etc.
>
> Evan
>

Re: schema example

Posted by Evan Weaver <ev...@cloudbur.st>.

I have watched it. I am looking for some actual storage-conf.xml
configurations that I can run and play around with.

Evan

On Tue, May 19, 2009 at 10:18 AM, Jun Rao <ju...@almaden.ibm.com> wrote:
> Evan,
>
> You can watch the video presentation at Facebook from
> http://code.google.com/p/the-cassandra-project/ (follow the link on the
> right). The presentation talks about the schema used by FB for email search.
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA 95120-6099
>
> junrao@almaden.ibm.com
-- 
Evan Weaver

Re: schema example

Posted by Jun Rao <ju...@almaden.ibm.com>.

Evan,

You can watch the video presentation at Facebook from
http://code.google.com/p/the-cassandra-project/ (follow the link on the
right). The presentation talks about the schema used by FB for email
search.

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

junrao@almaden.ibm.com



                                                                           
From: Evan Weaver <evan@cloudbur.st>
Sent by: eweaver@gmail.com
Date: 05/19/2009 09:49 AM
Please respond to: cassandra-user@incubator.apache.org
To: cassandra-user@incubator.apache.org
Subject: Re: schema example





Even if it's not actually in real-life use, some examples for common
domains would really help clarify things.

 * blog
 * email storage
 * search index

etc.

Evan


Re: schema example

Posted by Evan Weaver <ev...@cloudbur.st>.

Even if it's not actually in real-life use, some examples for common
domains would really help clarify things.

 * blog
 * email storage
 * search index

etc.

Evan

On Mon, May 18, 2009 at 8:19 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> Does anyone have a simple app schema they can share?
>
> I can't share the one for our main app.  But we do need an example
> here.  A real one would be nice if we can find one.
>
> I checked App Engine.  They don't have a whole lot of examples either.
>  They do have a really simple one:
> http://code.google.com/appengine/docs/python/gettingstarted/usingdatastore.html
>
> The most important thing in Cassandra modeling is choosing a good key,
> since that is what most of your lookups will be by.  Keys are also how
> Cassandra scales -- Cassandra can handle effectively infinite keys
> (given enough nodes obviously) but only thousands to millions of
> columns per key/CF (depending on what API calls you use -- Jun is
> adding one now that does not deserialize everything in the whole CF
> into memory.  The rest will need to follow this model eventually too).
>
> For this guestbook I think the choice is obvious: use the name as the
> key, and have a single simple CF for the messages.  Each column will
> be a message (you can even use the mandatory timestamp field as part
> of your user-visible data.  win!).  You get the list (or page) of
> users with get_key_range and then their messages with get_slice.
>
> <ColumnFamily ColumnSort="Name" Name="Message"/>
>
> Anyone got another one for pedagogical purposes?
>
> -Jonathan
>



-- 
Evan Weaver