
Posted to user@cassandra.apache.org by Evan Weaver <ew...@gmail.com> on 2009/07/04 02:21:52 UTC

Re: schema example

This helps a lot.

However, I can't find any API method that actually lets me do a
slice query on a time-sorted column, as necessary for the second blog
example. I get the following error on r789419:

InvalidRequestException: get_slice_from requires CF indexed by name

Evan

On Tue, May 19, 2009 at 8:00 PM, Jonathan Ellis<jb...@gmail.com> wrote:
> Mail storage, man, I think pretty much anything I could come up with
> would look pretty simplistic compared to what "real" systems do in
> that domain. :)
>
> But blogs, I think I can handle those.  Let's make ours multiuser
> or there isn't enough scale to make it interesting. :)
>
> The interesting thing here is we want to be able to query two things
> efficiently:
>  - the most recent posts belonging to a given blog, in reverse
> chronological order
>  - a single post and its comments, in chronological order
>
> At first glance you might think we can again reasonably do this with a
> single CF, this time a super CF:
>
> <ColumnFamily ColumnType="Super" ColumnSort="Time" Name="Post"/>
>
> The key is the blog name, the supercolumns are posts and the
> subcolumns are comments.  This would be reasonable BUT supercolumns
> are just containers, they have no data or timestamp associated with
> them directly (only through their subcolumns).  So you cannot sort a
> super CF by time.
>
> So instead what I would do would be to use two CFs:
>
> <ColumnFamily ColumnSort="Time" Name="Post"/>
> <ColumnFamily ColumnSort="Time" Name="Comment"/>
>
> For the first, the keys used would be blog names, and the columns
> would be the post titles and body.  So to get a list of most recent
> posts you just do a slice query.  Even though Cassandra currently
> handles large groups of columns sub-optimally, even with a blog
> updated several times a day you'd be safe taking this approach (i.e.
> we'll have that problem fixed before you start seeing it :).
>
> For the second, the keys are blog name<delimiter><post title>.  The
> columns are the comment data.  You can serialize these a number of
> ways; I would probably use title as the column name and have the value
> be the author + body (e.g. as a json dict).  Again we use the slice
> call to get the comments in order.  (We will have to manually reverse
> what slice gives us since time sort is always reverse chronological
> atm, but the overhead of doing this in memory will be negligible.)
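
[Editor's sketch] The two-column-family layout described above can be modelled in memory to make the key and column choices concrete. This is only an illustration in plain Python, not the Cassandra API: the `DELIM` value, the helper names `insert` and `time_slice`, and the sample blog data are all assumptions made for the example.

```python
import json

DELIM = "|"  # hypothetical delimiter between blog name and post title

# Each "column family" maps a row key to a list of (name, value, timestamp) columns.
post_cf = {}
comment_cf = {}


def insert(cf, key, name, value, timestamp):
    """Add one column to the row identified by `key`."""
    cf.setdefault(key, []).append((name, value, timestamp))


def time_slice(cf, key, limit):
    """Return up to `limit` columns of a row, newest first.

    This mimics a time-sorted CF whose order is reverse chronological.
    """
    columns = sorted(cf.get(key, []), key=lambda col: col[2], reverse=True)
    return columns[:limit]


# Post CF: key is the blog name, one column per post.
insert(post_cf, "exampleblog", "First post", "Hello", 100)
insert(post_cf, "exampleblog", "Second post", "More", 200)

# Comment CF: key is blog name + delimiter + post title, one column per comment.
comment_key = "exampleblog" + DELIM + "First post"
insert(comment_cf, comment_key, "Nice",
       json.dumps({"author": "alice", "body": "Welcome"}), 110)
insert(comment_cf, comment_key, "Re: Nice",
       json.dumps({"author": "bob", "body": "Thanks"}), 120)

# Most recent posts, already in reverse chronological order.
recent_posts = [name for name, _, _ in time_slice(post_cf, "exampleblog", 10)]

# Comments come back newest first, so reverse them in memory for display.
comments_in_order = [
    name for name, _, _ in reversed(time_slice(comment_cf, comment_key, 10))
]
```

The reversal on the last line is the in-memory step the message mentions: the slice is newest first, while comments are displayed oldest first.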
>
> Does this help?
>
> -Jonathan
>
> On Tue, May 19, 2009 at 11:49 AM, Evan Weaver <ev...@cloudbur.st> wrote:
>> Even if it's not actually in real-life use, some examples for common
>> domains would really help clarify things.
>>
>>  * blog
>>  * email storage
>>  * search index
>>
>> etc.
>>
>> Evan
>>
>> On Mon, May 18, 2009 at 8:19 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>> Does anyone have a simple app schema they can share?
>>>
>>> I can't share the one for our main app.  But we do need an example
>>> here.  A real one would be nice if we can find one.
>>>
>>> I checked App Engine.  They don't have a whole lot of examples either.
>>>  They do have a really simple one:
>>> http://code.google.com/appengine/docs/python/gettingstarted/usingdatastore.html
>>>
>>> The most important thing in Cassandra modeling is choosing a good key,
>>> since that is what most of your lookups will be by.  Keys are also how
>>> Cassandra scales -- Cassandra can handle effectively infinite keys
>>> (given enough nodes obviously) but only thousands to millions of
>>> columns per key/CF (depending on what API calls you use -- Jun is
>>> adding one now that does not deserialize everything in the whole CF
>>> into memory.  The rest will need to follow this model eventually too).
>>>
>>> For this guestbook I think the choice is obvious: use the name as the
>>> key, and have a single simple CF for the messages.  Each column will
>>> be a message (you can even use the mandatory timestamp field as part
>>> of your user-visible data.  win!).  You get the list (or page) of
>>> users with get_key_range and then their messages with get_slice.
>>>
>>> <ColumnFamily ColumnSort="Name" Name="Message"/>
>>>
>>> Anyone got another one for pedagogical purposes?
>>>
>>> -Jonathan
>>>
>>
>>
>>
>> --
>> Evan Weaver
>>
>



-- 
Evan Weaver

Re: schema example

Posted by Evan Weaver <ew...@gmail.com>.

FYI, Yahoo does an interesting thing in this case. They usually use
token pagination, but if a page displays limit 20 records, they
actually request limit 100 behind the scenes. The extra records are
used to generate deep links. So instead of just being able to go to
the next page:

prev | cur | next

You can render:

prev | cur | next | cur + 2 | cur + 3 | cur + 4 | cur + 5

This lets you smoothly trade off navigability for performance.
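
[Editor's sketch] The over-fetching trick can be illustrated as follows. This is a minimal in-memory model, not Yahoo's implementation or any Cassandra call: `fetch_page`, its parameters, and the `(token, value)` record shape are assumptions made for the example.

```python
def fetch_page(records, token=None, page_size=20, fetch_limit=100):
    """Return (page, deep_links) for a token-paginated listing.

    records: list of (token, value) pairs in display order, tokens unique.
    token:   token of the first record of the wanted page, or None for the head.
    """
    start = 0
    if token is not None:
        # Linear scan for clarity; a real store would seek to the token directly.
        start = next(i for i, (t, _) in enumerate(records) if t == token)

    # Over-fetch: ask for fetch_limit records although only page_size are shown.
    fetched = records[start:start + fetch_limit]
    page = fetched[:page_size]

    # Every page_size-th extra record is the first record of a later page,
    # so its token can serve as a deep link to that page.
    deep_links = [fetched[i][0] for i in range(page_size, len(fetched), page_size)]
    return page, deep_links


records = [(i, "record %d" % i) for i in range(250)]
page, deep_links = fetch_page(records)
```

With these defaults, a full fetch of 100 records yields start tokens for four later pages. Raising `fetch_limit` buys more deep links at the cost of a larger read, which is the trade-off described above.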

Evan

On Fri, Jul 3, 2009 at 6:53 PM, Evan Weaver<ew...@gmail.com> wrote:
> (From talking on IRC):
>
> I think this boils down to the offset/limit vs. token/limit debate.
>
> Token/limit is fine in all cases for me, but you still have to be able
> to query the head of the list (with a limit, but no token) to get
> started. Right now there is no facility for that on time-sorted column
> families:
>
>  list<column_t> get_columns_since(1:string tablename, 2:string key,
> 3:string columnParent, 4:i64 timeStamp)
>
> I don't think token ranges are supported on time columns, either.
>
> Also, to be optimally useable, you need to be able to begin a
> token-based pagination system from either the head or tail of the
> list, but that may not be possible with sstables.
>
> It may just be an oversight...the API is confusingly organized, and
> it's hard to be sure if some likely feature is there or not.
>
> Related:
>
> http://issues.apache.org/jira/browse/CASSANDRA-261
> http://issues.apache.org/jira/browse/CASSANDRA-217
> http://issues.apache.org/jira/browse/CASSANDRA-263
>
>
> Evan
>



-- 
Evan Weaver

Re: schema example

Posted by Jonathan Ellis <jb...@gmail.com>.

On Fri, Jul 3, 2009 at 8:53 PM, Evan Weaver<ew...@gmail.com> wrote:
> (From talking on IRC):
>
> I think this boils down to the offset/limit vs. token/limit debate.
>
> Token/limit is fine in all cases for me, but you still have to be able
> to query the head of the list (with a limit, but no token) to get
> started. Right now there is no facility for that on time-sorted column
> families:
>
>  list<column_t> get_columns_since(1:string tablename, 2:string key,
> 3:string columnParent, 4:i64 timeStamp)

basically we need _since to add the kind of functionality we have in
Slice (or will, after 261 is committed).

it's probably better to get 240 (and 185 + 189) done sooner than later
though instead of wasting effort on an API we know is broken.

(the old get_slice could do basically anything since it deserialized
the entire CF into memory.  we're moving away from that to support
larger-than-memory CFs.)

-Jonathan

Re: schema example

Posted by Evan Weaver <ew...@gmail.com>.

(From talking on IRC):

I think this boils down to the offset/limit vs. token/limit debate.

Token/limit is fine in all cases for me, but you still have to be able
to query the head of the list (with a limit, but no token) to get
started. Right now there is no facility for that on time-sorted column
families:

  list<column_t> get_columns_since(1:string tablename, 2:string key,
3:string columnParent, 4:i64 timeStamp)

I don't think token ranges are supported on time columns, either.

Also, to be optimally useable, you need to be able to begin a
token-based pagination system from either the head or tail of the
list, but that may not be possible with sstables.
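
[Editor's sketch] To make the token/limit idea concrete, here is a small in-memory model over a sorted list of column names. It is not the Cassandra API; `slice_from` and its parameters are hypothetical. A token of `None` means "start at the head" (or at the tail when `reverse=True`), which is the starting case discussed above.

```python
import bisect


def slice_from(names, token=None, limit=20, reverse=False):
    """Return (page, next_token) from a sorted list of column names.

    token is inclusive: the page starts at the token itself.
    next_token is None once the end of the list has been reached.
    """
    if not reverse:
        start = 0 if token is None else bisect.bisect_left(names, token)
        page = names[start:start + limit]
        has_more = start + limit < len(names)
        next_token = names[start + limit] if has_more else None
    else:
        end = len(names) if token is None else bisect.bisect_right(names, token)
        low = max(0, end - limit)
        page = names[low:end][::-1]  # walk backwards from the tail
        next_token = names[low - 1] if low > 0 else None
    return page, next_token


names = ["a", "b", "c", "d", "e"]
head_page, head_next = slice_from(names, limit=2)
tail_page, tail_next = slice_from(names, limit=2, reverse=True)
```

Unlike offset/limit, each call seeks straight to the token and never counts or skips earlier columns, so the cost of a page does not grow with how deep into the list it is.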

It may just be an oversight...the API is confusingly organized, and
it's hard to be sure if some likely feature is there or not.

Related:

http://issues.apache.org/jira/browse/CASSANDRA-261
http://issues.apache.org/jira/browse/CASSANDRA-217
http://issues.apache.org/jira/browse/CASSANDRA-263


Evan

On Fri, Jul 3, 2009 at 6:06 PM, Evan Weaver<ew...@gmail.com> wrote:
> That requires you to know the timestamp, so you can't just ask for the
> most recent one.
>
> Evan
>



-- 
Evan Weaver

Re: schema example

Posted by Evan Weaver <ew...@gmail.com>.

That requires you to know the timestamp, so you can't just ask for the
most recent one.
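
[Editor's sketch] The difference between the two kinds of query can be shown with an in-memory stand-in. The function below only borrows the name `get_columns_since`; it does not reproduce the Thrift call, and whether the real call treats the timestamp as inclusive is an assumption here. `most_recent` is a hypothetical helper for the head-of-list query that is being asked for.

```python
def get_columns_since(columns, timestamp):
    """Columns written at or after `timestamp`, newest first.

    columns: list of (name, value, timestamp) tuples.
    The caller must already know a timestamp to use this.
    """
    recent = [col for col in columns if col[2] >= timestamp]
    return sorted(recent, key=lambda col: col[2], reverse=True)


def most_recent(columns, limit):
    """The newest `limit` columns; no timestamp knowledge is needed."""
    return sorted(columns, key=lambda col: col[2], reverse=True)[:limit]


columns = [("a", "x", 100), ("b", "y", 200), ("c", "z", 300)]
since_200 = get_columns_since(columns, 200)
newest = most_recent(columns, 1)
```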

Evan

On Fri, Jul 3, 2009 at 6:02 PM, Jonathan Ellis<jb...@gmail.com> wrote:
> get_columns_since
>



-- 
Evan Weaver

Re: schema example

Posted by Jonathan Ellis <jb...@gmail.com>.

get_columns_since

On Fri, Jul 3, 2009 at 7:21 PM, Evan Weaver<ew...@gmail.com> wrote:
> This helps a lot.
>
> However, I can't find any API method that actually lets me do a
> slice query on a time-sorted column, as necessary for the second blog
> example. I get the following error on r789419:
>
> InvalidRequestException: get_slice_from requires CF indexed by name
>
> Evan
>