Posted to user@cassandra.apache.org by Schubert Zhang <zs...@gmail.com> on 2010/04/26 18:05:43 UTC

Is SuperColumn necessary?

I don't think SuperColumn is really necessary.
I think this level of logic can be left to the application.

Do you agree?

If SuperColumn is needed, then as in
https://issues.apache.org/jira/browse/CASSANDRA-598, we would have to build
indexes at both the SuperColumn level and the SubColumn level.
That is too many levels of index.

Re: Is SuperColumn necessary?

Posted by Schubert Zhang <zs...@gmail.com>.

I think, at least for now, we should leave the logic of the current
SuperColumn and additional indexing features to the application layer,
outside of Cassandra core.

On Wed, Apr 28, 2010 at 6:44 PM, Schubert Zhang <zs...@gmail.com> wrote:

> I don't think secondary index is necessary for cassandra core, at least it
> is not urgent.
> I think currently, the first urgent improvements of cassandra are:
> 1. re-clarify the data-model.
> 2. re-implement the storage and index, especially the current SSTable
> implement is not good.
>
> In fact, the current storage/index implement is the most poor point.
>
>
>
> On Tue, Apr 27, 2010 at 12:11 AM, Jonathan Ellis <jb...@gmail.com> wrote:
>
>> I think that once we have built-in indexing (CASSANDRA-749) you can
>> make a good case for dropping supercolumns (at least, dropping them
>> from the public API and reserving them for internal use).
>>
>> On Mon, Apr 26, 2010 at 11:05 AM, Schubert Zhang <zs...@gmail.com>
>> wrote:
>> > I don't think the SuperColumn is so necessary.
>> > I think this level of logic can be leaved to application.
>> >
>> > Do you think so?
>> >
>> > If SuperColumn is needed,  as
>> > https://issues.apache.org/jira/browse/CASSANDRA-598, we should build
>> index
>> > in SuperColumns level and SubColumns level.
>> > Thus, the levels of index is too many.
>> >
>> >
>>
>
>

Re: Is SuperColumn necessary?

Posted by Schubert Zhang <zs...@gmail.com>.

I don't think a secondary index is necessary for Cassandra core; at least, it
is not urgent.
I think the most urgent improvements for Cassandra right now are:
1. Re-clarify the data model.
2. Re-implement the storage and index layers; the current SSTable
implementation in particular is not good.

In fact, the current storage/index implementation is the weakest point.

On Tue, Apr 27, 2010 at 12:11 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> I think that once we have built-in indexing (CASSANDRA-749) you can
> make a good case for dropping supercolumns (at least, dropping them
> from the public API and reserving them for internal use).
>
> On Mon, Apr 26, 2010 at 11:05 AM, Schubert Zhang <zs...@gmail.com>
> wrote:
> > I don't think the SuperColumn is so necessary.
> > I think this level of logic can be leaved to application.
> >
> > Do you think so?
> >
> > If SuperColumn is needed,  as
> > https://issues.apache.org/jira/browse/CASSANDRA-598, we should build
> index
> > in SuperColumns level and SubColumns level.
> > Thus, the levels of index is too many.
> >
> >
>

Re: Is SuperColumn necessary?

Posted by Stu Hood <st...@rackspace.com>.

Hey Ed,

I've been working on a similar approach for arbitrarily nested/compound column names in #998. See: http://github.com/stuhood/cassandra/blob/998/src/java/org/apache/cassandra/db/ColumnKey.java

The goal is to provide native support and, potentially (in the very long term), API support for nested/compound names. The difference between our approaches boils down to needing to define a comparator for every level in #998, versus having dynamic types per name in your approach.

Thanks,
Stu

-----Original Message-----
From: "Ed Anuff" <ed...@anuff.com>
Sent: Wednesday, May 5, 2010 1:31pm
To: user@cassandra.apache.org
Subject: Re: Is SuperColumn necessary?

Follow-up from last week's discussion, I've been playing around with a simple
column comparator for composite column names that I put up on github.  I'd
be interested to hear what people think of this approach.

http://github.com/edanuff/CassandraCompositeType

Ed

On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com> wrote:

> It might make sense to create a CompositeType subclass of AbstractType for
> the purpose of constructing and comparing these types of "composite" column
> names, so that you could more easily do that sort of thing rather than
> having to concatenate into one big string.
>
>
> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone <mi...@simplegeo.com> wrote:
>
>> The only thing SuperColumns appear to buy you (as someone pointed out to
>> me at the Cassandra meetup - I think it was Eric Florenzano) is that you can
>> use different comparator types for the Super/SubColumns, I guess..? But you
>> should be able to do the same thing by creating your own Column comparator.
>> I guess my point is that SuperColumns are mostly a convenience mechanism, as
>> far as I can tell.
>>
>> Mike
>>
>
>
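The difference between comparing composite names level by level and comparing one concatenated string can be sketched as follows. This is only an illustration in Python, not code from CassandraCompositeType or #998; the function names and the choice of a (long, UTF8) name layout are assumptions made for the example.

```python
from functools import cmp_to_key


def compare_long(a, b):
    """Compare two integer components numerically."""
    return (a > b) - (a < b)


def compare_utf8(a, b):
    """Compare two string components by their UTF-8 bytes."""
    ea, eb = a.encode("utf-8"), b.encode("utf-8")
    return (ea > eb) - (ea < eb)


def make_composite_comparator(level_comparators):
    """Build a comparator that applies one comparator per name level."""
    def compare(name_a, name_b):
        for cmp_level, a, b in zip(level_comparators, name_a, name_b):
            result = cmp_level(a, b)
            if result != 0:
                return result
        # All shared components are equal: the shorter name sorts first.
        return (len(name_a) > len(name_b)) - (len(name_a) < len(name_b))
    return compare


# Hypothetical layout: first level is a long, second level is a UTF8 string.
compare_names = make_composite_comparator([compare_long, compare_utf8])

names = [(10, "title"), (9, "title"), (9, "body"), (10,)]
sorted_names = sorted(names, key=cmp_to_key(compare_names))

# For contrast: the same names concatenated into one big string sort
# lexically, so "10:..." lands before "9:...".
joined_sorted = sorted(f"{n}:{c}" for n, c in [(10, "title"), (9, "title")])
```

With a comparator per level, 9 sorts before 10 as a number, which the concatenated-string form gets wrong unless the client pads the numbers itself.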

Re: Is SuperColumn necessary?

Posted by Eric Evans <ee...@rackspace.com>.

On Wed, 2010-05-05 at 11:31 -0700, Ed Anuff wrote:
> Follow-up from last weeks discussion, I've been playing around with a
> simple
> column comparator for composite column names that I put up on github.
> I'd
> be interested to hear what people think of this approach.
> 
> http://github.com/edanuff/CassandraCompositeType 

Clever. I wonder what a useful abstraction in Hector or one of the other
idiomatic clients would look like.

-- 
Eric Evans
eevans@rackspace.com

Re: Is SuperColumn necessary?

Posted by Schubert Zhang <zs...@gmail.com>.

Hi Stu,
Thanks for your hard work. It is not an easy job.

After days of reading the code with my partners, we are convinced that the
current implementation of the storage layer should be rewritten for clarity.


On Tue, May 11, 2010 at 12:44 AM, Stu Hood <st...@rackspace.com> wrote:

> I think that it is 100% ideal: it's what I've been working on implementing
> in #674, #847 and #998. I'm hoping to post a large patchset and docs this
> week, and I'm aiming to get it committed for 0.8.
>
> The work I've been doing doesn't touch the user interface: it only deals
> with the internal changes necessary to make this type of storage possible.
>
>
> -----Original Message-----
> From: "Mike Malone" <mi...@simplegeo.com>
> Sent: Monday, May 10, 2010 11:37am
> To: user@cassandra.apache.org
> Subject: Re: Is SuperColumn necessary?
>
> Maybe... but honestly, it doesn't affect the architecture or interface at
> all. I'm more interested in thinking about how the system should work than
> what things are called. Naming things are important, but that can happen
> later.
>
> Does anyone have any thoughts or comments on the architecture I suggested
> earlier?
>
> Mike
>
> On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang <zs...@gmail.com> wrote:
>
> > Yes, the "column" here is not appropriate.
> > Maybe we need not to create new terms, in Google's Bigtable, the term
> > "qualifier" is a good one.
> >
> >
> > On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn <da...@lookin2.com>
> wrote:
> >
> >> That would be a good time to get rid of the confusing "column" term,
> which
> >> incorrectly suggests a two-dimensional tabular structure.
> >>
> >> Suggestions:
> >>
> >> 1. A hypercube (or hypocube, if only two dimensions): replace "key" and
> >> "column" with "1st dimension", "2nd dimension", etc.
> >>
> >> 2. A file system: replace "key" and "column" with "directory" and
> >> "subdirectory"
> >>
> >> 3. A tuple tree: "Column family" replaced by top-level tuple, whose
> value
> >> is the set of keys, whose value is the set of supercolumns of the key,
> whose
> >> value is the set of columns for the supercolumn, etc.
> >>
> >> 4. Etc.
> >>
> >> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mi...@simplegeo.com> wrote:
> >>
> >>> Nice, Ed, we're doing something very similar but less generic.
> >>>
> >>> Now replace all of the various methods for querying with a simple query
> >>> interface that takes a Predicate, allow the user to specify (in
> >>> storage-conf) which levels of the nested Columns should be indexed, and
> >>> completely remove Comparators and have people subclass Column /
> implement
> >>> IColumn and we'd really be on to something ;).
> >>>
> >>> Mock storage-conf.xml:
> >>>   <Column Name="ThingThatsNowKey" Indexed="True"
> >>> ClusterPartitioned="True" Type="UTF8">
> >>>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True"
> >>> Type="UTF8">
> >>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
> >>>         <Column Name="ThingThatsNowColumnName" Indexed="True"
> >>> Type="ASCII">
> >>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
> >>>         </Column>
> >>>       </Column>
> >>>     </Column>
> >>>   </Column>
> >>>
> >>> Thrift:
> >>>   struct NamePredicate {
> >>>     1: required list<binary> column_names,
> >>>   }
> >>>   struct SlicePredicate {
> >>>     1: required binary start,
> >>>     2: required binary end,
> >>>   }
> >>>   struct CountPredicate {
> >>>     1: required struct predicate,
> >>>     2: required i32 count=100,
> >>>   }
> >>>   struct AndPredicate {
> >>>     1: required Predicate left,
> >>>     2: required Predicate right,
> >>>   }
> >>>   struct SubColumnsPredicate {
> >>>     1: required Predicate columns,
> >>>     2: required Predicate subcolumns,
> >>>   }
> >>>   ... OrPredicate, OtherUsefulPredicates ...
> >>>   query(predicate, count, consistency_level) # Count here would be
> total
> >>> count of leaf values returned, whereas CountPredicate specifies a
> column
> >>> count for a particular sub-slice.
> >>>
> >>> Not fully baked... but I think this could really simplify stuff and
> make
> >>> it more flexible. Downside is it may give people enough rope to hang
> >>> themselves, but at least the predicate stuff is easily distributable.
> >>>
> >>> I'm thinking I'll play around with implementing some of this stuff
> myself
> >>> if I have any free time in the near future.
> >>>
> >>> Mike
> >>>
> >>>
> >>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <jbellis@gmail.com
> >wrote:
> >>>
> >>>> Very interesting, thanks!
> >>>>
> >>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
> >>>> > Follow-up from last weeks discussion, I've been playing around with
> a
> >>>> simple
> >>>> > column comparator for composite column names that I put up on
> github.
> >>>> I'd
> >>>> > be interested to hear what people think of this approach.
> >>>> >
> >>>> > http://github.com/edanuff/CassandraCompositeType
> >>>> >
> >>>> > Ed
> >>>> >
> >>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com> wrote:
> >>>> >>
> >>>> >> It might make sense to create a CompositeType subclass of
> >>>> AbstractType for
> >>>> >> the purpose of constructing and comparing these types of
> "composite"
> >>>> column
> >>>> >> names so that if you could more easily do that sort of thing rather
> >>>> than
> >>>> >> having to concatenate into one big string.
> >>>> >>
> >>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone <mi...@simplegeo.com>
> >>>> wrote:
> >>>> >>>
> >>>> >>> The only thing SuperColumns appear to buy you (as someone pointed
> >>>> out to
> >>>> >>> me at the Cassandra meetup - I think it was Eric Florenzano) is
> that
> >>>> you can
> >>>> >>> use different comparator types for the Super/SubColumns, I
> guess..?
> >>>> But you
> >>>> >>> should be able to do the same thing by creating your own Column
> >>>> comparator.
> >>>> >>> I guess my point is that SuperColumns are mostly a convenience
> >>>> mechanism, as
> >>>> >>> far as I can tell.
> >>>> >>> Mike
> >>>> >
> >>>> >
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Jonathan Ellis
> >>>> Project Chair, Apache Cassandra
> >>>> co-founder of Riptano, the source for professional Cassandra support
> >>>> http://riptano.com
> >>>>
> >>>
> >>>
> >>
> >
>
>
>

Re: Is SuperColumn necessary?

Posted by Mike Malone <mi...@simplegeo.com>.

On Mon, May 10, 2010 at 9:44 AM, Stu Hood <st...@rackspace.com> wrote:

> I think that it is 100% ideal: it's what I've been working on implementing
> in #674, #847 and #998. I'm hoping to post a large patchset and docs this
> week, and I'm aiming to get it committed for 0.8.
>
> The work I've been doing doesn't touch the user interface: it only deals
> with the internal changes necessary to make this type of storage possible.
>

Yea, Stu, I've been looking at your github changes. I think we both have a
lot of the same ideas. I'd love to chat more about this stuff sometime.


>
>
> -----Original Message-----
> From: "Mike Malone" <mi...@simplegeo.com>
> Sent: Monday, May 10, 2010 11:37am
> To: user@cassandra.apache.org
> Subject: Re: Is SuperColumn necessary?
>
> Maybe... but honestly, it doesn't affect the architecture or interface at
> all. I'm more interested in thinking about how the system should work than
> what things are called. Naming things are important, but that can happen
> later.
>
> Does anyone have any thoughts or comments on the architecture I suggested
> earlier?
>
> Mike
>
> On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang <zs...@gmail.com> wrote:
>
> > Yes, the "column" here is not appropriate.
> > Maybe we need not to create new terms, in Google's Bigtable, the term
> > "qualifier" is a good one.
> >
> >
> > On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn <da...@lookin2.com>
> wrote:
> >
> >> That would be a good time to get rid of the confusing "column" term,
> which
> >> incorrectly suggests a two-dimensional tabular structure.
> >>
> >> Suggestions:
> >>
> >> 1. A hypercube (or hypocube, if only two dimensions): replace "key" and
> >> "column" with "1st dimension", "2nd dimension", etc.
> >>
> >> 2. A file system: replace "key" and "column" with "directory" and
> >> "subdirectory"
> >>
> >> 3. A tuple tree: "Column family" replaced by top-level tuple, whose
> value
> >> is the set of keys, whose value is the set of supercolumns of the key,
> whose
> >> value is the set of columns for the supercolumn, etc.
> >>
> >> 4. Etc.
> >>
> >> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mi...@simplegeo.com> wrote:
> >>
> >>> Nice, Ed, we're doing something very similar but less generic.
> >>>
> >>> Now replace all of the various methods for querying with a simple query
> >>> interface that takes a Predicate, allow the user to specify (in
> >>> storage-conf) which levels of the nested Columns should be indexed, and
> >>> completely remove Comparators and have people subclass Column /
> implement
> >>> IColumn and we'd really be on to something ;).
> >>>
> >>> Mock storage-conf.xml:
> >>>   <Column Name="ThingThatsNowKey" Indexed="True"
> >>> ClusterPartitioned="True" Type="UTF8">
> >>>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True"
> >>> Type="UTF8">
> >>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
> >>>         <Column Name="ThingThatsNowColumnName" Indexed="True"
> >>> Type="ASCII">
> >>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
> >>>         </Column>
> >>>       </Column>
> >>>     </Column>
> >>>   </Column>
> >>>
> >>> Thrift:
> >>>   struct NamePredicate {
> >>>     1: required list<binary> column_names,
> >>>   }
> >>>   struct SlicePredicate {
> >>>     1: required binary start,
> >>>     2: required binary end,
> >>>   }
> >>>   struct CountPredicate {
> >>>     1: required struct predicate,
> >>>     2: required i32 count=100,
> >>>   }
> >>>   struct AndPredicate {
> >>>     1: required Predicate left,
> >>>     2: required Predicate right,
> >>>   }
> >>>   struct SubColumnsPredicate {
> >>>     1: required Predicate columns,
> >>>     2: required Predicate subcolumns,
> >>>   }
> >>>   ... OrPredicate, OtherUsefulPredicates ...
> >>>   query(predicate, count, consistency_level) # Count here would be
> total
> >>> count of leaf values returned, whereas CountPredicate specifies a
> column
> >>> count for a particular sub-slice.
> >>>
> >>> Not fully baked... but I think this could really simplify stuff and
> make
> >>> it more flexible. Downside is it may give people enough rope to hang
> >>> themselves, but at least the predicate stuff is easily distributable.
> >>>
> >>> I'm thinking I'll play around with implementing some of this stuff
> myself
> >>> if I have any free time in the near future.
> >>>
> >>> Mike
> >>>
> >>>
> >>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <jbellis@gmail.com
> >wrote:
> >>>
> >>>> Very interesting, thanks!
> >>>>
> >>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
> >>>> > Follow-up from last weeks discussion, I've been playing around with
> a
> >>>> simple
> >>>> > column comparator for composite column names that I put up on
> github.
> >>>> I'd
> >>>> > be interested to hear what people think of this approach.
> >>>> >
> >>>> > http://github.com/edanuff/CassandraCompositeType
> >>>> >
> >>>> > Ed
> >>>> >
> >>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com> wrote:
> >>>> >>
> >>>> >> It might make sense to create a CompositeType subclass of
> >>>> AbstractType for
> >>>> >> the purpose of constructing and comparing these types of
> "composite"
> >>>> column
> >>>> >> names so that if you could more easily do that sort of thing rather
> >>>> than
> >>>> >> having to concatenate into one big string.
> >>>> >>
> >>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone <mi...@simplegeo.com>
> >>>> wrote:
> >>>> >>>
> >>>> >>> The only thing SuperColumns appear to buy you (as someone pointed
> >>>> out to
> >>>> >>> me at the Cassandra meetup - I think it was Eric Florenzano) is
> that
> >>>> you can
> >>>> >>> use different comparator types for the Super/SubColumns, I
> guess..?
> >>>> But you
> >>>> >>> should be able to do the same thing by creating your own Column
> >>>> comparator.
> >>>> >>> I guess my point is that SuperColumns are mostly a convenience
> >>>> mechanism, as
> >>>> >>> far as I can tell.
> >>>> >>> Mike
> >>>> >
> >>>> >
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Jonathan Ellis
> >>>> Project Chair, Apache Cassandra
> >>>> co-founder of Riptano, the source for professional Cassandra support
> >>>> http://riptano.com
> >>>>
> >>>
> >>>
> >>
> >
>
>
>

Re: Is SuperColumn necessary?

Posted by Stu Hood <st...@rackspace.com>.

I think that it is 100% ideal: it's what I've been working on implementing in #674, #847 and #998. I'm hoping to post a large patchset and docs this week, and I'm aiming to get it committed for 0.8.

The work I've been doing doesn't touch the user interface: it only deals with the internal changes necessary to make this type of storage possible.


-----Original Message-----
From: "Mike Malone" <mi...@simplegeo.com>
Sent: Monday, May 10, 2010 11:37am
To: user@cassandra.apache.org
Subject: Re: Is SuperColumn necessary?

Maybe... but honestly, it doesn't affect the architecture or interface at
all. I'm more interested in thinking about how the system should work than
what things are called. Naming things is important, but that can happen
later.

Does anyone have any thoughts or comments on the architecture I suggested
earlier?

Mike

On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang <zs...@gmail.com> wrote:

> Yes, the term "column" is not appropriate here.
> Maybe we don't need to create new terms; in Google's Bigtable, the term
> "qualifier" is a good one.
>
>
> On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn <da...@lookin2.com> wrote:
>
>> That would be a good time to get rid of the confusing "column" term, which
>> incorrectly suggests a two-dimensional tabular structure.
>>
>> Suggestions:
>>
>> 1. A hypercube (or hypocube, if only two dimensions): replace "key" and
>> "column" with "1st dimension", "2nd dimension", etc.
>>
>> 2. A file system: replace "key" and "column" with "directory" and
>> "subdirectory"
>>
>> 3. A tuple tree: "Column family" replaced by top-level tuple, whose value
>> is the set of keys, whose value is the set of supercolumns of the key, whose
>> value is the set of columns for the supercolumn, etc.
>>
>> 4. Etc.
>>
>> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mi...@simplegeo.com> wrote:
>>
>>> Nice, Ed, we're doing something very similar but less generic.
>>>
>>> Now replace all of the various methods for querying with a simple query
>>> interface that takes a Predicate, allow the user to specify (in
>>> storage-conf) which levels of the nested Columns should be indexed, and
>>> completely remove Comparators and have people subclass Column / implement
>>> IColumn and we'd really be on to something ;).
>>>
>>> Mock storage-conf.xml:
>>>   <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" Type="UTF8">
>>>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True" Type="UTF8">
>>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>>>         <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
>>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>>>         </Column>
>>>       </Column>
>>>     </Column>
>>>   </Column>
>>>
>>> Thrift:
>>>   struct NamePredicate {
>>>     1: required list<binary> column_names,
>>>   }
>>>   struct SlicePredicate {
>>>     1: required binary start,
>>>     2: required binary end,
>>>   }
>>>   struct CountPredicate {
>>>     1: required Predicate predicate,
>>>     2: required i32 count=100,
>>>   }
>>>   struct AndPredicate {
>>>     1: required Predicate left,
>>>     2: required Predicate right,
>>>   }
>>>   struct SubColumnsPredicate {
>>>     1: required Predicate columns,
>>>     2: required Predicate subcolumns,
>>>   }
>>>   ... OrPredicate, OtherUsefulPredicates ...
>>>   query(predicate, count, consistency_level)
>>>   # Count here would be the total count of leaf values returned, whereas
>>>   # CountPredicate specifies a column count for a particular sub-slice.
>>>
>>> Not fully baked... but I think this could really simplify stuff and make
>>> it more flexible. Downside is it may give people enough rope to hang
>>> themselves, but at least the predicate stuff is easily distributable.
>>>
>>> I'm thinking I'll play around with implementing some of this stuff myself
>>> if I have any free time in the near future.
>>>
>>> Mike
>>>
>>>
>>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>>
>>>> Very interesting, thanks!
>>>>
>>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
>>>> > Follow-up from last weeks discussion, I've been playing around with a
>>>> simple
>>>> > column comparator for composite column names that I put up on github.
>>>> I'd
>>>> > be interested to hear what people think of this approach.
>>>> >
>>>> > http://github.com/edanuff/CassandraCompositeType
>>>> >
>>>> > Ed
>>>> >
>>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com> wrote:
>>>> >>
>>>> >> It might make sense to create a CompositeType subclass of
>>>> AbstractType for
>>>> >> the purpose of constructing and comparing these types of "composite"
>>>> column
>>>> >> names so that if you could more easily do that sort of thing rather
>>>> than
>>>> >> having to concatenate into one big string.
>>>> >>
>>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone <mi...@simplegeo.com>
>>>> wrote:
>>>> >>>
>>>> >>> The only thing SuperColumns appear to buy you (as someone pointed
>>>> out to
>>>> >>> me at the Cassandra meetup - I think it was Eric Florenzano) is that
>>>> you can
>>>> >>> use different comparator types for the Super/SubColumns, I guess..?
>>>> But you
>>>> >>> should be able to do the same thing by creating your own Column
>>>> comparator.
>>>> >>> I guess my point is that SuperColumns are mostly a convenience
>>>> mechanism, as
>>>> >>> far as I can tell.
>>>> >>> Mike
>>>> >
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Jonathan Ellis
>>>> Project Chair, Apache Cassandra
>>>> co-founder of Riptano, the source for professional Cassandra support
>>>> http://riptano.com
>>>>
>>>
>>>
>>
>
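The predicate-based query interface in Mike's mock Thrift definition could be prototyped roughly as below. This is a sketch under several assumptions, not an implementation of anything in Cassandra: names are plain Python strings instead of binary, the data is a two-level nested dict, slices are inclusive at both ends, and the count and consistency_level arguments are left out. The matches method and the query function body are invented for the illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NamePredicate:
    column_names: List[str]

    def matches(self, name):
        return name in self.column_names


@dataclass
class SlicePredicate:
    start: str
    end: str

    def matches(self, name):
        # Assumed inclusive on both ends for this sketch.
        return self.start <= name <= self.end


@dataclass
class AndPredicate:
    left: object
    right: object

    def matches(self, name):
        return self.left.matches(name) and self.right.matches(name)


@dataclass
class SubColumnsPredicate:
    columns: object      # applied to the first level of names
    subcolumns: object   # applied to the second level of names


def query(data, predicate):
    """Return the part of a two-level nested dict selected by the predicate."""
    result = {}
    for name in sorted(data):
        if not predicate.columns.matches(name):
            continue
        selected = {
            sub: value
            for sub, value in sorted(data[name].items())
            if predicate.subcolumns.matches(sub)
        }
        if selected:
            result[name] = selected
    return result


# Made-up sample data: first level keyed by date, second level by field name.
data = {
    "2010-05-05": {"author": "ed", "text": "composite names", "votes": "3"},
    "2010-05-06": {"author": "mike", "text": "predicates", "votes": "5"},
    "2010-05-10": {"author": "stu", "text": "#998", "votes": "2"},
}

predicate = SubColumnsPredicate(
    columns=SlicePredicate("2010-05-05", "2010-05-06"),
    subcolumns=NamePredicate(["author", "votes"]),
)
selected = query(data, predicate)
```

Because each predicate is just data plus a small matching rule, a tree like this could be serialized and evaluated on whichever node holds the data, which is presumably what "easily distributable" refers to.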

Re: Is SuperColumn necessary?

Posted by Schubert Zhang <zs...@gmail.com>.

I would appreciate keeping the Cassandra core data model clear and pure.


On Tue, May 11, 2010 at 5:20 AM, Mike Malone <mi...@simplegeo.com> wrote:

> On Mon, May 10, 2010 at 1:38 PM, AJ Chen <aj...@web2express.org> wrote:
>
>> Could someone confirm this discussion is not about abandoning the supercolumn
>> family? I have found that modeling data with a supercolumn family is actually an
>> advantage of Cassandra compared to a relational database. I hope you are not going to
>> drop this important concept. How it's implemented internally is a different
>> matter.
>>
>
> SuperColumns are useful as a convenience mechanism. That's pretty much it.
> There's _nothing_ (as far as I can tell) that you can do with SuperColumns
> that you can't do by manually concatenating key names with a separator on
> the client side and implementing a custom comparator on the server (as ugly
> as that is).
>
> This discussion is about getting rid of SuperColumns and adding a more
> generic mechanism that will actually be useful and interesting and will
> continue to be convenient for the types of use cases for which people use
> SuperColumns.
>
> If there's a particular use case that you feel you can only implement with
> SuperColumns, please share! I honestly can't think of any.
>
> Mike
>
>
>> On Mon, May 10, 2010 at 10:08 AM, Jonathan Shook <js...@gmail.com> wrote:
>>
>>> Agreed
>>>
>>> On Mon, May 10, 2010 at 12:01 PM, Mike Malone <mi...@simplegeo.com>
>>> wrote:
>>> > On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook <js...@gmail.com>
>>> wrote:
>>> >>
>>> >> I have to disagree about the naming of things. The name of something
>>> >> isn't just a literal identifier. It affects the way people think about
>>> >> it. For new users, the whole naming thing has been a persistent
>>> >> barrier.
>>> >
>>> > I'm saying we shouldn't be worried too much about coming up with names
>>> and
>>> > analogies until we've decided what it is we're naming.
>>> >
>>> >>
>>> >> As for your suggestions, I'm all for simplifying or generalizing the
>>> >> "how it works" part down to a more generalized set of operations. I'm
>>> >> not sure it's a good idea to require users to think in terms building
>>> >> up a fluffy query structure just to thread it through a needle of an
>>> >> API, even for the simplest of queries. At some point, the level of
>>> >> generic boilerplate takes away from the semantic hand rails that
>>> >> developers like. So I guess I'm suggesting that "how it works" and
>>> >> "how we use it" are not always exactly the same. At least they should
>>> >> both hinge on a common conceptual model, which is where the naming
>>> >> becomes an important anchoring point.
>>> >
>>> > If things are done properly, client libraries could expose simplified
>>> query
>>> > interfaces without much effort. Most ORMs these days work by building a
>>> > propositional directed acyclic graph that's serialized to SQL. This
>>> would
>>> > work the same way, but it wouldn't be converted into a 4GL.
>>> > Mike
>>> >
>>> >>
>>> >> Jonathan
>>> >>
>>> >> On Mon, May 10, 2010 at 11:37 AM, Mike Malone <mi...@simplegeo.com>
>>> wrote:
>>> >> > Maybe... but honestly, it doesn't affect the architecture or
>>> interface
>>> >> > at
>>> >> > all. I'm more interested in thinking about how the system should
>>> work
>>> >> > than
>>> >> > what things are called. Naming things are important, but that can
>>> happen
>>> >> > later.
>>> >> > Does anyone have any thoughts or comments on the architecture I
>>> >> > suggested
>>> >> > earlier?
>>> >> >
>>> >> > Mike
>>> >> >
>>> >> > On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang <zs...@gmail.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> Yes, the "column" here is not appropriate.
>>> >> >> Maybe we need not to create new terms, in Google's Bigtable, the
>>> term
>>> >> >> "qualifier" is a good one.
>>> >> >>
>>> >> >> On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn <david@lookin2.com
>>> >
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> That would be a good time to get rid of the confusing "column"
>>> term,
>>> >> >>> which incorrectly suggests a two-dimensional tabular structure.
>>> >> >>>
>>> >> >>> Suggestions:
>>> >> >>>
>>> >> >>> 1. A hypercube (or hypocube, if only two dimensions): replace
>>> "key"
>>> >> >>> and
>>> >> >>> "column" with "1st dimension", "2nd dimension", etc.
>>> >> >>>
>>> >> >>> 2. A file system: replace "key" and "column" with "directory" and
>>> >> >>> "subdirectory"
>>> >> >>>
>>> >> >>> 3. A tuple tree: "Column family" replaced by top-level tuple,
>>> whose
>>> >> >>> value
>>> >> >>> is the set of keys, whose value is the set of supercolumns of the
>>> key,
>>> >> >>> whose
>>> >> >>> value is the set of columns for the supercolumn, etc.
>>> >> >>>
>>> >> >>> 4. Etc.
>>> >> >>>
>>> >> >>> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mi...@simplegeo.com>
>>> >> >>> wrote:
>>> >> >>>>
>>> >> >>>> Nice, Ed, we're doing something very similar but less generic.
>>> >> >>>> Now replace all of the various methods for querying with a simple
>>> >> >>>> query
>>> >> >>>> interface that takes a Predicate, allow the user to specify (in
>>> >> >>>> storage-conf) which levels of the nested Columns should be
>>> indexed,
>>> >> >>>> and
>>> >> >>>> completely remove Comparators and have people subclass Column /
>>> >> >>>> implement
>>> >> >>>> IColumn and we'd really be on to something ;).
>>> >> >>>> Mock storage-conf.xml:
>>> >> >>>>   <Column Name="ThingThatsNowKey" Indexed="True"
>>> >> >>>> ClusterPartitioned="True" Type="UTF8">
>>> >> >>>>     <Column Name="ThingThatsNowColumnFamily"
>>> DiskPartitioned="True"
>>> >> >>>> Type="UTF8">
>>> >> >>>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>>> >> >>>>         <Column Name="ThingThatsNowColumnName" Indexed="True"
>>> >> >>>> Type="ASCII">
>>> >> >>>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>>> >> >>>>         </Column>
>>> >> >>>>       </Column>
>>> >> >>>>     </Column>
>>> >> >>>>   </Column>
>>> >> >>>> Thrift:
>>> >> >>>>   struct NamePredicate {
>>> >> >>>>     1: required list<binary> column_names,
>>> >> >>>>   }
>>> >> >>>>   struct SlicePredicate {
>>> >> >>>>     1: required binary start,
>>> >> >>>>     2: required binary end,
>>> >> >>>>   }
>>> >> >>>>   struct CountPredicate {
>>> >> >>>>     1: required struct predicate,
>>> >> >>>>     2: required i32 count=100,
>>> >> >>>>   }
>>> >> >>>>   struct AndPredicate {
>>> >> >>>>     1: required Predicate left,
>>> >> >>>>     2: required Predicate right,
>>> >> >>>>   }
>>> >> >>>>   struct SubColumnsPredicate {
>>> >> >>>>     1: required Predicate columns,
>>> >> >>>>     2: required Predicate subcolumns,
>>> >> >>>>   }
>>> >> >>>>   ... OrPredicate, OtherUsefulPredicates ...
>>> >> >>>>   query(predicate, count, consistency_level) # Count here would
>>> be
>>> >> >>>> total
>>> >> >>>> count of leaf values returned, whereas CountPredicate specifies a
>>> >> >>>> column
>>> >> >>>> count for a particular sub-slice.
>>> >> >>>> Not fully baked... but I think this could really simplify stuff
>>> and
>>> >> >>>> make
>>> >> >>>> it more flexible. Downside is it may give people enough rope to
>>> hang
>>> >> >>>> themselves, but at least the predicate stuff is easily
>>> distributable.
>>> >> >>>> I'm thinking I'll play around with implementing some of this
>>> stuff
>>> >> >>>> myself if I have any free time in the near future.
>>> >> >>>> Mike
>>> >> >>>>
>>> >> >>>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <
>>> jbellis@gmail.com>
>>> >> >>>> wrote:
>>> >> >>>>>
>>> >> >>>>> Very interesting, thanks!
>>> >> >>>>>
>>> >> >>>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
>>> >> >>>>> > Follow-up from last weeks discussion, I've been playing around
>>> >> >>>>> > with a
>>> >> >>>>> > simple
>>> >> >>>>> > column comparator for composite column names that I put up on
>>> >> >>>>> > github.  I'd
>>> >> >>>>> > be interested to hear what people think of this approach.
>>> >> >>>>> >
>>> >> >>>>> > http://github.com/edanuff/CassandraCompositeType
>>> >> >>>>> >
>>> >> >>>>> > Ed
>>> >> >>>>> >
>>> >> >>>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com>
>>> wrote:
>>> >> >>>>> >>
>>> >> >>>>> >> It might make sense to create a CompositeType subclass of
>>> >> >>>>> >> AbstractType for the purpose of constructing and comparing
>>> >> >>>>> >> these types of "composite" column names, so that you could
>>> >> >>>>> >> more easily do that sort of thing rather than having to
>>> >> >>>>> >> concatenate into one big string.
>>> >> >>>>> >>
>>> >> >>>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone
>>> >> >>>>> >> <mi...@simplegeo.com>
>>> >> >>>>> >> wrote:
>>> >> >>>>> >>>
>>> >> >>>>> >>> The only thing SuperColumns appear to buy you (as someone
>>> >> >>>>> >>> pointed out to me at the Cassandra meetup - I think it was
>>> >> >>>>> >>> Eric Florenzano) is that you can use different comparator
>>> >> >>>>> >>> types for the Super/SubColumns, I guess..? But you should be
>>> >> >>>>> >>> able to do the same thing by creating your own Column
>>> >> >>>>> >>> comparator.
>>> >> >>>>> >>> I guess my point is that SuperColumns are mostly a convenience
>>> >> >>>>> >>> mechanism, as far as I can tell.
>>> >> >>>>> >>> Mike
>>> >> >>>>> >
>>> >> >>>>> >
>>> >> >>>>>
>>> >> >>>>>
>>> >> >>>>>
>>> >> >>>>> --
>>> >> >>>>> Jonathan Ellis
>>> >> >>>>> Project Chair, Apache Cassandra
>>> >> >>>>> co-founder of Riptano, the source for professional Cassandra support
>>> >> >>>>> http://riptano.com
>>> >> >>>>
>>> >> >>>
>>> >> >>
>>> >> >
>>> >> >
>>> >
>>> >
>>>
>>
>>
>>
>> --
>> AJ Chen, PhD
>> Chair, Semantic Web SIG, sdforum.org
>> http://web2express.org
>> twitter @web2express
>> Palo Alto, CA, USA
>>
>
>

Re: Is SuperColumn necessary?

Posted by Mike Malone <mi...@simplegeo.com>.

>
> Mike just suggested to concate comment id with each of the comment field
> names so that the above data can be stored in normal column family. It looks
> fine except that I'm not sure the time sorting on comments still works or
> not.
>

In the case of time you can just use lexicographically sortable strings that
represent your timestamp (e.g., RFC 3339 timestamps, all in UTC and with a
fixed number of digits). You're right, I don't think TimeUUID does that. For
more complicated things (e.g., TimeUUIDs or packed numerics that you don't
want to zero pad) you'd have to implement a custom comparator. So the
"convenience" mechanisms that would have to be implemented (and that Stu and
Ed have, in fact, pretty much already implemented) would take care of
concatenating the column names and doing the chained comparisons for you.
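
As a rough sketch of the flattening idea (Python; the
"comment|<timestamp>|<field>" naming scheme, the "|" separator and the
column_name helper are made up for illustration and are not something
Cassandra provides), fixed-width RFC 3339 UTC timestamps embedded in the
column names keep comments in time order under a plain byte-wise comparator:

```python
from datetime import datetime, timezone

SEP = "|"  # hypothetical separator; assumed never to occur in a component


def column_name(posted_at: datetime, field: str) -> str:
    """Build a flat column name such as 'comment|2010-05-10T17:25:00Z|text'."""
    # Fixed-width UTC timestamp, so string order equals chronological order.
    stamp = posted_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return SEP.join(("comment", stamp, field))


first = datetime(2010, 5, 10, 16, 31, tzinfo=timezone.utc)
second = datetime(2010, 5, 10, 17, 25, tzinfo=timezone.utc)

# Columns written in arbitrary order, as a client might insert them.
row = {
    column_name(second, "text"): "text 2",
    column_name(first, "text"): "text 1",
    column_name(second, "author"): "ca2",
    column_name(first, "author"): "ca1",
}

# A lexicographic comparator groups each comment's fields together
# and orders the comments by time.
sorted_names = sorted(row)
for name in sorted_names:
    print(name, "=", row[name])
```

Inserting or deleting one comment then touches only that comment's columns;
nothing has to be read back and re-concatenated.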

Mike



Re: Is SuperColumn necessary?

Posted by AJ Chen <aj...@web2express.org>.

{
"b1"  { blog-id: b1
          author: ba1
          title: bt1
          comment-timeuuid-1: {author: ca1
                               id: comment-timeuuid-1
                               text: text 1
                              }
          comment-timeuuid-2: {author: ca2
                               id: comment-timeuuid-2
                               text: text 2
                              }
        }
}

Mike just suggested concatenating the comment id with each of the comment
field names so that the above data can be stored in a normal column family.
It looks fine, except that I'm not sure whether the time sorting on comments
still works.
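
One way to look at the sorting concern is the sketch below (Python). It
assumes each flattened column name is a (TimeUUID, field) pair and that the
comparator checks the UUID timestamp first and the field name second. The
timeuuid and compare_names helpers are hypothetical and only illustrate a
chained comparison; they are not Cassandra's comparator API.

```python
import uuid
from functools import cmp_to_key


def timeuuid(ts: int) -> uuid.UUID:
    """Build a version 1 UUID with a chosen 60-bit timestamp (demo only)."""
    time_low = ts & 0xFFFFFFFF
    time_mid = (ts >> 32) & 0xFFFF
    time_hi_version = ((ts >> 48) & 0x0FFF) | 0x1000  # version 1
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0, 0))


def compare_names(a, b):
    """Chained comparison: UUID timestamp first, then field name."""
    (uuid_a, field_a), (uuid_b, field_b) = a, b
    if uuid_a.time != uuid_b.time:
        return -1 if uuid_a.time < uuid_b.time else 1
    return (field_a > field_b) - (field_a < field_b)


earlier = timeuuid(0x0FFFFFFFF)  # time_low = ffffffff, time_mid = 0
later = timeuuid(0x100000000)    # time_low = 00000000, time_mid = 1

names = [(later, "text"), (earlier, "text"),
         (later, "author"), (earlier, "author")]

# Plain string order is not chronological, because the low timestamp
# bits come first in the UUID layout.
by_string = sorted(names, key=lambda n: (str(n[0]), n[1]))
# The chained comparator restores time order, then orders the fields
# within each comment.
by_time = sorted(names, key=cmp_to_key(compare_names))

print([(u.time, f) for u, f in by_string])
print([(u.time, f) for u, f in by_time])
```

Under these assumptions the comments do stay sorted by time, but only with a
comparator that understands the TimeUUID part of the name.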

-aj

On Mon, May 10, 2010 at 5:36 PM, William Ashley <wa...@gmail.com> wrote:

> I'm having a difficult time understanding your syntax. Could you provide an
> example with actual data?
>
> On May 10, 2010, at 5:25 PM, AJ Chen wrote:
>
> your suggestion works for fixed supercolumn name. the blog example now
> becomes:
> { blog-id {name, title, ...}
>   blog-id-comments {time:commenter}
> }
>
> what about supercolumn names that are not fixed? for example, I want to
> store comment's details with the blog like this:
> { blog-id { blog { name, title, ...}
>               comments {comment-id:commenter}
>               comment-id {commenter, time, text, ...}
> }
>
> a comment-id is generated on-the-fly when the comment is made.  how do you
> flatten the comment-id supercolumn to normal column?  just for brain
> exercise, not meant to pick on you.
>
> thanks,
> -aj
>
>
>
> On Mon, May 10, 2010 at 4:39 PM, William Ashley <wa...@gmail.com> wrote:
>
>> If you're storing your super column under a fixed name, you could just
>> concatenate that name with the row key and use normal columns. Then you get
>> your paging and sorting the way you want it.
>>
>>
>> On May 10, 2010, at 4:31 PM, AJ Chen wrote:
>>
>> supercolumn is good for modeling profile type of data. simple example is
>> blog:
>> blog { blog {author,  title, ...}
>>          comments   {time: commenter}  //sort by TimeUUID
>> }
>> when retrieving a blog, you get all the comments sorted by time already.
>> without supercolumn, you would need to concatenate multiple comment times
>> together as you suggested.
>>
>> requiring the user to concatenate data fields together is not only an extra
>> burden on the user but also a less clean design.  there will be cases where the
>> list property of a profile data is a long list (say a million items). in
>> such cases, user wants to be able to directly insert/delete an item in that
>> list because it's more efficient.  Retrieving the whole list, updating it,
>> concatenating again, and then putting it back to datastore is awkward and
>> less efficient.
>>
>> -aj
>>
>>
>> On Mon, May 10, 2010 at 2:20 PM, Mike Malone <mi...@simplegeo.com> wrote:
>>
>>> On Mon, May 10, 2010 at 1:38 PM, AJ Chen <aj...@web2express.org> wrote:
>>>
>>>> Could someone confirm this discussion is not about abandoning
>>>> supercolumn family? I have found modeling data with supercolumn family is
>>>> actually an advantage of Cassandra compared to relational database. Hope you
>>>> are not going to drop this important concept.  How it's implemented internally
>>>> is a different matter.
>>>>
>>>
>>> SuperColumns are useful as a convenience mechanism. That's pretty much
>>> it. There's _nothing_ (as far as I can tell) that you can do with
>>> SuperColumns that you can't do by manually concatenating key names with a
>>> separator on the client side and implementing a custom comparator on the
>>> server (as ugly as that is).
>>>
>>> This discussion is about getting rid of SuperColumns and adding a more
>>> generic mechanism that will actually be useful and interesting and will
>>> continue to be convenient for the types of use cases for which people use
>>> SuperColumns.
>>>
>>> If there's a particular use case that you feel you can only implement
>>> with SuperColumns, please share! I honestly can't think of any.
>>>
>>> Mike
>>>
>>>
>>>> On Mon, May 10, 2010 at 10:08 AM, Jonathan Shook <js...@gmail.com>wrote:
>>>>
>>>>> Agreed
>>>>>
>>>>> On Mon, May 10, 2010 at 12:01 PM, Mike Malone <mi...@simplegeo.com>
>>>>> wrote:
>>>>> > On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook <js...@gmail.com>
>>>>> wrote:
>>>>> >>
>>>>> >> I have to disagree about the naming of things. The name of something
>>>>> >> isn't just a literal identifier. It affects the way people think
>>>>> about
>>>>> >> it. For new users, the whole naming thing has been a persistent
>>>>> >> barrier.
>>>>> >
>>>>> > I'm saying we shouldn't be worried too much about coming up with
>>>>> names and
>>>>> > analogies until we've decided what it is we're naming.
>>>>> >
>>>>> >>
>>>>> >> As for your suggestions, I'm all for simplifying or generalizing the
>>>>> >> "how it works" part down to a more generalized set of operations.
>>>>> I'm
>>>>> >> not sure it's a good idea to require users to think in terms
>>>>> building
>>>>> >> up a fluffy query structure just to thread it through a needle of an
>>>>> >> API, even for the simplest of queries. At some point, the level of
>>>>> >> generic boilerplate takes away from the semantic hand rails that
>>>>> >> developers like. So I guess I'm suggesting that "how it works" and
>>>>> >> "how we use it" are not always exactly the same. At least they
>>>>> should
>>>>> >> both hinge on a common conceptual model, which is where the naming
>>>>> >> becomes an important anchoring point.
>>>>> >
>>>>> > If things are done properly, client libraries could expose simplified
>>>>> query
>>>>> > interfaces without much effort. Most ORMs these days work by building
>>>>> a
>>>>> > propositional directed acyclic graph that's serialized to SQL. This
>>>>> would
>>>>> > work the same way, but it wouldn't be converted into a 4GL.
>>>>> > Mike
>>>>> >
>>>>> >>
>>>>> >> Jonathan
>>>>> >>
>>>>> >> On Mon, May 10, 2010 at 11:37 AM, Mike Malone <mi...@simplegeo.com>
>>>>> wrote:
>>>>> >> > Maybe... but honestly, it doesn't affect the architecture or
>>>>> >> > interface at all. I'm more interested in thinking about how the
>>>>> >> > system should work than what things are called. Naming things is
>>>>> >> > important, but that can happen later.
>>>>> >> > Does anyone have any thoughts or comments on the architecture I
>>>>> >> > suggested earlier?
>>>>> >> >
>>>>> >> > Mike
>>>>> >> >
>>>>> >> > On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang <
>>>>> zsongbo@gmail.com>
>>>>> >> > wrote:
>>>>> >> >>
>>>>> >> >> Yes, the "column" here is not appropriate.
>>>>> >> >> Maybe we need not to create new terms, in Google's Bigtable, the
>>>>> term
>>>>> >> >> "qualifier" is a good one.
>>>>> >> >>
>>>>> >> >> On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn <
>>>>> david@lookin2.com>
>>>>> >> >> wrote:
>>>>> >> >>>
>>>>> >> >>> That would be a good time to get rid of the confusing "column"
>>>>> term,
>>>>> >> >>> which incorrectly suggests a two-dimensional tabular structure.
>>>>> >> >>>
>>>>> >> >>> Suggestions:
>>>>> >> >>>
>>>>> >> >>> 1. A hypercube (or hypocube, if only two dimensions): replace
>>>>> "key"
>>>>> >> >>> and
>>>>> >> >>> "column" with "1st dimension", "2nd dimension", etc.
>>>>> >> >>>
>>>>> >> >>> 2. A file system: replace "key" and "column" with "directory"
>>>>> and
>>>>> >> >>> "subdirectory"
>>>>> >> >>>
>>>>> >> >>> 3. A tuple tree: "Column family" replaced by top-level tuple,
>>>>> whose
>>>>> >> >>> value
>>>>> >> >>> is the set of keys, whose value is the set of supercolumns of
>>>>> the key,
>>>>> >> >>> whose
>>>>> >> >>> value is the set of columns for the supercolumn, etc.
>>>>> >> >>>
>>>>> >> >>> 4. Etc.
>>>>> >> >>>
>>>>> >> >>> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mike@simplegeo.com
>>>>> >
>>>>> >> >>> wrote:
>>>>> >> >>>>
>>>>> >> >>>> Nice, Ed, we're doing something very similar but less generic.
>>>>> >> >>>> Now replace all of the various methods for querying with a
>>>>> simple
>>>>> >> >>>> query
>>>>> >> >>>> interface that takes a Predicate, allow the user to specify (in
>>>>> >> >>>> storage-conf) which levels of the nested Columns should be
>>>>> indexed,
>>>>> >> >>>> and
>>>>> >> >>>> completely remove Comparators and have people subclass Column /
>>>>> >> >>>> implement
>>>>> >> >>>> IColumn and we'd really be on to something ;).
>>>>> >> >>>> Mock storage-conf.xml:
>>>>> >> >>>>   <Column Name="ThingThatsNowKey" Indexed="True"
>>>>> >> >>>> ClusterPartitioned="True" Type="UTF8">
>>>>> >> >>>>     <Column Name="ThingThatsNowColumnFamily"
>>>>> DiskPartitioned="True"
>>>>> >> >>>> Type="UTF8">
>>>>> >> >>>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>>>>> >> >>>>         <Column Name="ThingThatsNowColumnName" Indexed="True"
>>>>> >> >>>> Type="ASCII">
>>>>> >> >>>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>>>>> >> >>>>         </Column>
>>>>> >> >>>>       </Column>
>>>>> >> >>>>     </Column>
>>>>> >> >>>>   </Column>
>>>>> >> >>>> Thrift:
>>>>> >> >>>>   struct NamePredicate {
>>>>> >> >>>>     1: required list<binary> column_names,
>>>>> >> >>>>   }
>>>>> >> >>>>   struct SlicePredicate {
>>>>> >> >>>>     1: required binary start,
>>>>> >> >>>>     2: required binary end,
>>>>> >> >>>>   }
>>>>> >> >>>>   struct CountPredicate {
>>>>> >> >>>>     1: required Predicate predicate,
>>>>> >> >>>>     2: required i32 count=100,
>>>>> >> >>>>   }
>>>>> >> >>>>   struct AndPredicate {
>>>>> >> >>>>     1: required Predicate left,
>>>>> >> >>>>     2: required Predicate right,
>>>>> >> >>>>   }
>>>>> >> >>>>   struct SubColumnsPredicate {
>>>>> >> >>>>     1: required Predicate columns,
>>>>> >> >>>>     2: required Predicate subcolumns,
>>>>> >> >>>>   }
>>>>> >> >>>>   ... OrPredicate, OtherUsefulPredicates ...
>>>>> >> >>>>   query(predicate, count, consistency_level)
>>>>> >> >>>>   # Count here would be the total count of leaf values returned,
>>>>> >> >>>>   # whereas CountPredicate specifies a column count for a
>>>>> >> >>>>   # particular sub-slice.
>>>>> >> >>>> Not fully baked... but I think this could really simplify stuff
>>>>> >> >>>> and make it more flexible. Downside is it may give people enough
>>>>> >> >>>> rope to hang themselves, but at least the predicate stuff is
>>>>> >> >>>> easily distributable.
>>>>> >> >>>> I'm thinking I'll play around with implementing some of this
>>>>> >> >>>> stuff myself if I have any free time in the near future.
>>>>> >> >>>> Mike
>>>>> >> >>>>
>>>>> >> >>>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <
>>>>> jbellis@gmail.com>
>>>>> >> >>>> wrote:
>>>>> >> >>>>>
>>>>> >> >>>>> Very interesting, thanks!
>>>>> >> >>>>>
>>>>> >> >>>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com>
>>>>> wrote:
>>>>> >> >>>>> > Follow-up from last week's discussion: I've been playing
>>>>> >> >>>>> > around with a simple column comparator for composite column
>>>>> >> >>>>> > names that I put up on GitHub. I'd be interested to hear what
>>>>> >> >>>>> > people think of this approach.
>>>>> >> >>>>> >
>>>>> >> >>>>> > http://github.com/edanuff/CassandraCompositeType
>>>>> >> >>>>> >
>>>>> >> >>>>> > Ed
>>>>> >> >>>>> >
>>>>> >> >>>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com>
>>>>> wrote:
>>>>> >> >>>>> >>
>>>>> >> >>>>> >> It might make sense to create a CompositeType subclass of
>>>>> >> >>>>> >> AbstractType for the purpose of constructing and comparing
>>>>> >> >>>>> >> these types of "composite" column names, so that you could
>>>>> >> >>>>> >> more easily do that sort of thing rather than having to
>>>>> >> >>>>> >> concatenate into one big string.
>>>>> >> >>>>> >>
>>>>> >> >>>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone
>>>>> >> >>>>> >> <mi...@simplegeo.com>
>>>>> >> >>>>> >> wrote:
>>>>> >> >>>>> >>>
>>>>> >> >>>>> >>> The only thing SuperColumns appear to buy you (as someone
>>>>> >> >>>>> >>> pointed out to me at the Cassandra meetup - I think it was
>>>>> >> >>>>> >>> Eric Florenzano) is that you can use different comparator
>>>>> >> >>>>> >>> types for the Super/SubColumns, I guess..? But you should
>>>>> >> >>>>> >>> be able to do the same thing by creating your own Column
>>>>> >> >>>>> >>> comparator.
>>>>> >> >>>>> >>> I guess my point is that SuperColumns are mostly a
>>>>> >> >>>>> >>> convenience mechanism, as far as I can tell.
>>>>> >> >>>>> >>> Mike
>>>>> >> >>>>> >
>>>>> >> >>>>> >
>>>>> >> >>>>>
>>>>> >> >>>>>
>>>>> >> >>>>>
>>>>> >> >>>>> --
>>>>> >> >>>>> Jonathan Ellis
>>>>> >> >>>>> Project Chair, Apache Cassandra
>>>>> >> >>>>> co-founder of Riptano, the source for professional Cassandra
>>>>> support
>>>>> >> >>>>> http://riptano.com
>>>>> >> >>>>
>>>>> >> >>>
>>>>> >> >>
>>>>> >> >
>>>>> >> >
>>>>> >
>>>>> >
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> AJ Chen, PhD
>>>> Chair, Semantic Web SIG, sdforum.org
>>>> http://web2express.org
>>>> twitter @web2express
>>>> Palo Alto, CA, USA
>>>>
>>>
>>>
>>
>>
>> --
>> AJ Chen, PhD
>> Chair, Semantic Web SIG, sdforum.org
>> http://web2express.org
>> twitter @web2express
>> Palo Alto, CA, USA
>>
>>
>>
>
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
>
>
>


-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

Re: Is SuperColumn necessary?

Posted by William Ashley <wa...@gmail.com>.

I'm having a difficult time understanding your syntax. Could you provide an example with actual data?

On May 10, 2010, at 5:25 PM, AJ Chen wrote:

> your suggestion works for fixed supercolumn name. the blog example now becomes:
> { blog-id {name, title, ...}
>   blog-id-comments {time:commenter}
> }
> 
> what about supercolumn names that are not fixed? for example, I want to store comment's details with the blog like this:
> { blog-id { blog { name, title, ...}
>               comments {comment-id:commenter}
>               comment-id {commenter, time, text, ...}
> }
> 
> a comment-id is generated on-the-fly when the comment is made.  how do you flatten the comment-id supercolumn to normal column?  just for brain exercise, not meant to pick on you.
> 
> thanks,
> -aj
>   
> 
> 
> On Mon, May 10, 2010 at 4:39 PM, William Ashley <wa...@gmail.com> wrote:
> If you're storing your super column under a fixed name, you could just concatenate that name with the row key and use normal columns. Then you get your paging and sorting the way you want it.
> 
> 
> On May 10, 2010, at 4:31 PM, AJ Chen wrote:
> 
>> supercolumn is good for modeling profile type of data. simple example is blog:
>> blog { blog {author,  title, ...}
>>          comments   {time: commenter}  //sort by TimeUUID
>> }
>> when retrieving a blog, you get all the comments sorted by time already.
>> without supercolumn, you would need to concatenate multiple comment times together as you suggested. 
>> 
>> requiring user to concatenating data fields together is not only an extra burden on user but also a less clean design.  there will be cases where the list property of a profile data is a long list (say a million items). in such cases, user wants to be able to directly insert/delete an item in that list because it's more efficient.  Retrieving the whole list, updating it, concatenating again, and then putting it back to datastore is awkward and less efficient.
>> 
>> -aj
>> 
>> 
>> On Mon, May 10, 2010 at 2:20 PM, Mike Malone <mi...@simplegeo.com> wrote:
>> On Mon, May 10, 2010 at 1:38 PM, AJ Chen <aj...@web2express.org> wrote:
>> Could someone confirm this discussion is not about abandoning the supercolumn family? I have found that modeling data with supercolumn families is actually an advantage of Cassandra compared to relational databases. I hope you are not going to drop this important concept. How it's implemented internally is a different matter.
>> 
>> SuperColumns are useful as a convenience mechanism. That's pretty much it. There's _nothing_ (as far as I can tell) that you can do with SuperColumns that you can't do by manually concatenating key names with a separator on the client side and implementing a custom comparator on the server (as ugly as that is).
>> 
>> This discussion is about getting rid of SuperColumns and adding a more generic mechanism that will actually be useful and interesting and will continue to be convenient for the types of use cases for which people use SuperColumns.
>> 
>> If there's a particular use case that you feel you can only implement with SuperColumns, please share! I honestly can't think of any.
>> 
>> Mike
>> 
>> 
>> On Mon, May 10, 2010 at 10:08 AM, Jonathan Shook <js...@gmail.com> wrote:
>> Agreed
>> 
>> On Mon, May 10, 2010 at 12:01 PM, Mike Malone <mi...@simplegeo.com> wrote:
>> > On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook <js...@gmail.com> wrote:
>> >>
>> >> I have to disagree about the naming of things. The name of something
>> >> isn't just a literal identifier. It affects the way people think about
>> >> it. For new users, the whole naming thing has been a persistent
>> >> barrier.
>> >
>> > I'm saying we shouldn't be worried too much about coming up with names and
>> > analogies until we've decided what it is we're naming.
>> >
>> >>
>> >> As for your suggestions, I'm all for simplifying or generalizing the
>> >> "how it works" part down to a more generalized set of operations. I'm
>> >> not sure it's a good idea to require users to think in terms building
>> >> up a fluffy query structure just to thread it through a needle of an
>> >> API, even for the simplest of queries. At some point, the level of
>> >> generic boilerplate takes away from the semantic hand rails that
>> >> developers like. So I guess I'm suggesting that "how it works" and
>> >> "how we use it" are not always exactly the same. At least they should
>> >> both hinge on a common conceptual model, which is where the naming
>> >> becomes an important anchoring point.
>> >
>> > If things are done properly, client libraries could expose simplified query
>> > interfaces without much effort. Most ORMs these days work by building a
>> > propositional directed acyclic graph that's serialized to SQL. This would
>> > work the same way, but it wouldn't be converted into a 4GL.
>> > Mike
>> >
>> >>
>> >> Jonathan
>> >>
>> >> On Mon, May 10, 2010 at 11:37 AM, Mike Malone <mi...@simplegeo.com> wrote:
>> >> > Maybe... but honestly, it doesn't affect the architecture or interface
>> >> > at
>> >> > all. I'm more interested in thinking about how the system should work
>> >> > than
>> >> > what things are called. Naming things are important, but that can happen
>> >> > later.
>> >> > Does anyone have any thoughts or comments on the architecture I
>> >> > suggested
>> >> > earlier?
>> >> >
>> >> > Mike
>> >> >
>> >> > On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang <zs...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Yes, the "column" here is not appropriate.
>> >> >> Maybe we need not to create new terms, in Google's Bigtable, the term
>> >> >> "qualifier" is a good one.
>> >> >>
>> >> >> On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn <da...@lookin2.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> That would be a good time to get rid of the confusing "column" term,
>> >> >>> which incorrectly suggests a two-dimensional tabular structure.
>> >> >>>
>> >> >>> Suggestions:
>> >> >>>
>> >> >>> 1. A hypercube (or hypocube, if only two dimensions): replace "key"
>> >> >>> and
>> >> >>> "column" with "1st dimension", "2nd dimension", etc.
>> >> >>>
>> >> >>> 2. A file system: replace "key" and "column" with "directory" and
>> >> >>> "subdirectory"
>> >> >>>
>> >> >>> 3. A tuple tree: "Column family" replaced by top-level tuple, whose
>> >> >>> value
>> >> >>> is the set of keys, whose value is the set of supercolumns of the key,
>> >> >>> whose
>> >> >>> value is the set of columns for the supercolumn, etc.
>> >> >>>
>> >> >>> 4. Etc.
>> >> >>>
>> >> >>> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mi...@simplegeo.com>
>> >> >>> wrote:
>> >> >>>>
>> >> >>>> Nice, Ed, we're doing something very similar but less generic.
>> >> >>>> Now replace all of the various methods for querying with a simple
>> >> >>>> query
>> >> >>>> interface that takes a Predicate, allow the user to specify (in
>> >> >>>> storage-conf) which levels of the nested Columns should be indexed,
>> >> >>>> and
>> >> >>>> completely remove Comparators and have people subclass Column /
>> >> >>>> implement
>> >> >>>> IColumn and we'd really be on to something ;).
>> >> >>>> Mock storage-conf.xml:
>> >> >>>>   <Column Name="ThingThatsNowKey" Indexed="True"
>> >> >>>> ClusterPartitioned="True" Type="UTF8">
>> >> >>>>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True"
>> >> >>>> Type="UTF8">
>> >> >>>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>> >> >>>>         <Column Name="ThingThatsNowColumnName" Indexed="True"
>> >> >>>> Type="ASCII">
>> >> >>>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>> >> >>>>         </Column>
>> >> >>>>       </Column>
>> >> >>>>     </Column>
>> >> >>>>   </Column>
>> >> >>>> Thrift:
>> >> >>>>   struct NamePredicate {
>> >> >>>>     1: required list<binary> column_names,
>> >> >>>>   }
>> >> >>>>   struct SlicePredicate {
>> >> >>>>     1: required binary start,
>> >> >>>>     2: required binary end,
>> >> >>>>   }
>> >> >>>>   struct CountPredicate {
>> >> >>>>     1: required struct predicate,
>> >> >>>>     2: required i32 count=100,
>> >> >>>>   }
>> >> >>>>   struct AndPredicate {
>> >> >>>>     1: required Predicate left,
>> >> >>>>     2: required Predicate right,
>> >> >>>>   }
>> >> >>>>   struct SubColumnsPredicate {
>> >> >>>>     1: required Predicate columns,
>> >> >>>>     2: required Predicate subcolumns,
>> >> >>>>   }
>> >> >>>>   ... OrPredicate, OtherUsefulPredicates ...
>> >> >>>>   query(predicate, count, consistency_level) # Count here would be
>> >> >>>> total
>> >> >>>> count of leaf values returned, whereas CountPredicate specifies a
>> >> >>>> column
>> >> >>>> count for a particular sub-slice.
>> >> >>>> Not fully baked... but I think this could really simplify stuff and
>> >> >>>> make
>> >> >>>> it more flexible. Downside is it may give people enough rope to hang
>> >> >>>> themselves, but at least the predicate stuff is easily distributable.
>> >> >>>> I'm thinking I'll play around with implementing some of this stuff
>> >> >>>> myself if I have any free time in the near future.
>> >> >>>> Mike
>> >> >>>>
>> >> >>>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <jb...@gmail.com>
>> >> >>>> wrote:
>> >> >>>>>
>> >> >>>>> Very interesting, thanks!
>> >> >>>>>
>> >> >>>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
>> >> >>>>> > Follow-up from last weeks discussion, I've been playing around
>> >> >>>>> > with a
>> >> >>>>> > simple
>> >> >>>>> > column comparator for composite column names that I put up on
>> >> >>>>> > github.  I'd
>> >> >>>>> > be interested to hear what people think of this approach.
>> >> >>>>> >
>> >> >>>>> > http://github.com/edanuff/CassandraCompositeType
>> >> >>>>> >
>> >> >>>>> > Ed
>> >> >>>>> >
>> >> >>>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com> wrote:
>> >> >>>>> >>
>> >> >>>>> >> It might make sense to create a CompositeType subclass of
>> >> >>>>> >> AbstractType for
>> >> >>>>> >> the purpose of constructing and comparing these types of
>> >> >>>>> >> "composite"
>> >> >>>>> >> column
>> >> >>>>> >> names so that if you could more easily do that sort of thing
>> >> >>>>> >> rather
>> >> >>>>> >> than
>> >> >>>>> >> having to concatenate into one big string.
>> >> >>>>> >>
>> >> >>>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone
>> >> >>>>> >> <mi...@simplegeo.com>
>> >> >>>>> >> wrote:
>> >> >>>>> >>>
>> >> >>>>> >>> The only thing SuperColumns appear to buy you (as someone
>> >> >>>>> >>> pointed
>> >> >>>>> >>> out to
>> >> >>>>> >>> me at the Cassandra meetup - I think it was Eric Florenzano) is
>> >> >>>>> >>> that you can
>> >> >>>>> >>> use different comparator types for the Super/SubColumns, I
>> >> >>>>> >>> guess..?
>> >> >>>>> >>> But you
>> >> >>>>> >>> should be able to do the same thing by creating your own Column
>> >> >>>>> >>> comparator.
>> >> >>>>> >>> I guess my point is that SuperColumns are mostly a convenience
>> >> >>>>> >>> mechanism, as
>> >> >>>>> >>> far as I can tell.
>> >> >>>>> >>> Mike
>> >> >>>>> >
>> >> >>>>> >
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> --
>> >> >>>>> Jonathan Ellis
>> >> >>>>> Project Chair, Apache Cassandra
>> >> >>>>> co-founder of Riptano, the source for professional Cassandra support
>> >> >>>>> http://riptano.com
>> >> >>>>
>> >> >>>
>> >> >>
>> >> >
>> >> >
>> >
>> >
>> 
>> 
>> 
>> -- 
>> AJ Chen, PhD
>> Chair, Semantic Web SIG, sdforum.org
>> http://web2express.org
>> twitter @web2express
>> Palo Alto, CA, USA
> 
> 
> 
> 
> -- 
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA

Re: Is SuperColumn necessary?

Posted by AJ Chen <aj...@web2express.org>.

Your suggestion works for a fixed supercolumn name. The blog example now
becomes:
{ blog-id {name, title, ...}
  blog-id-comments {time:commenter}
}

What about supercolumn names that are not fixed? For example, I want to
store each comment's details with the blog like this:
{ blog-id { blog { name, title, ...}
              comments {comment-id:commenter}
              comment-id {commenter, time, text, ...}
          }
}

A comment-id is generated on the fly when the comment is made. How do you
flatten the comment-id supercolumn into normal columns? This is just a brain
exercise, not meant to pick on you.

thanks,
-aj
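
For reference, here is one way such a flattening could look. It is only a
sketch, assuming column names sort as plain strings and that ':' never
appears in a comment-id. The dict `row` and the helpers `flatten_comment`
and `slice_prefix` are hypothetical stand-ins for illustration, not part of
the Cassandra API.

```python
# One row (keyed by blog-id) modelled as a dict: column name -> value.
row = {}

def flatten_comment(comment_id, fields):
    """Store each field of a comment as its own column named
    'comment:<comment-id>:<field>' instead of a per-comment supercolumn."""
    for field, value in fields.items():
        row["comment:" + comment_id + ":" + field] = value

def slice_prefix(prefix):
    """Return the columns whose names start with `prefix`, sorted by name.
    A real store would do a range slice rather than scan every column."""
    return sorted((n, v) for n, v in row.items() if n.startswith(prefix))

# The former "blog" supercolumn becomes columns with a fixed prefix.
row["blog:name"] = "my blog"
row["blog:title"] = "Is SuperColumn necessary?"

# Comment ids are generated at write time; no schema change is needed.
flatten_comment("c-0002", {"commenter": "william", "text": "use normal columns"})
flatten_comment("c-0001", {"commenter": "aj", "text": "what about dynamic names?"})

one_comment = slice_prefix("comment:c-0001:")
all_comment_columns = [name for name, _ in slice_prefix("comment:")]
print(one_comment)
print(all_comment_columns)
```

Each comment field can be inserted or deleted on its own, and a prefix
slice brings back one comment or all of them without reading the blog
columns.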



On Mon, May 10, 2010 at 4:39 PM, William Ashley <wa...@gmail.com> wrote:

> If you're storing your super column under a fixed name, you could just
> concatenate that name with the row key and use normal columns. Then you get
> your paging and sorting the way you want it.
>
>
> On May 10, 2010, at 4:31 PM, AJ Chen wrote:
>
> supercolumn is good for modeling profile type of data. simple example is
> blog:
> blog { blog {author,  title, ...}
>          comments   {time: commenter}  //sort by TimeUUID
> }
> when retrieving a blog, you get all the comments sorted by time already.
> without supercolumn, you would need to concatenate multiple comment times
> together as you suggested.
>
> requiring user to concatenating data fields together is not only an extra
> burden on user but also a less clean design.  there will be cases where the
> list property of a profile data is a long list (say a million items). in
> such cases, user wants to be able to directly insert/delete an item in that
> list because it's more efficient.  Retrieving the whole list, updating it,
> concatenating again, and then putting it back to datastore is awkward and
> less efficient.
>
> -aj
>
>
> On Mon, May 10, 2010 at 2:20 PM, Mike Malone <mi...@simplegeo.com> wrote:
>
>> On Mon, May 10, 2010 at 1:38 PM, AJ Chen <aj...@web2express.org> wrote:
>>
>>> Could someone confirm this discussion is not about abandoning the
>>> supercolumn family? I have found that modeling data with supercolumn
>>> families is actually an advantage of Cassandra compared to relational
>>> databases. I hope you are not going to drop this important concept. How
>>> it's implemented internally is a different matter.
>>>
>>
>> SuperColumns are useful as a convenience mechanism. That's pretty much it.
>> There's _nothing_ (as far as I can tell) that you can do with SuperColumns
>> that you can't do by manually concatenating key names with a separator on
>> the client side and implementing a custom comparator on the server (as ugly
>> as that is).
>>
>> This discussion is about getting rid of SuperColumns and adding a more
>> generic mechanism that will actually be useful and interesting and will
>> continue to be convenient for the types of use cases for which people use
>> SuperColumns.
>>
>> If there's a particular use case that you feel you can only implement with
>> SuperColumns, please share! I honestly can't think of any.
>>
>> Mike
>>
>>
>>> On Mon, May 10, 2010 at 10:08 AM, Jonathan Shook <js...@gmail.com>wrote:
>>>
>>>> Agreed
>>>>
>>>> On Mon, May 10, 2010 at 12:01 PM, Mike Malone <mi...@simplegeo.com>
>>>> wrote:
>>>> > On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook <js...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> I have to disagree about the naming of things. The name of something
>>>> >> isn't just a literal identifier. It affects the way people think
>>>> about
>>>> >> it. For new users, the whole naming thing has been a persistent
>>>> >> barrier.
>>>> >
>>>> > I'm saying we shouldn't be worried too much about coming up with names
>>>> and
>>>> > analogies until we've decided what it is we're naming.
>>>> >
>>>> >>
>>>> >> As for your suggestions, I'm all for simplifying or generalizing the
>>>> >> "how it works" part down to a more generalized set of operations. I'm
>>>> >> not sure it's a good idea to require users to think in terms building
>>>> >> up a fluffy query structure just to thread it through a needle of an
>>>> >> API, even for the simplest of queries. At some point, the level of
>>>> >> generic boilerplate takes away from the semantic hand rails that
>>>> >> developers like. So I guess I'm suggesting that "how it works" and
>>>> >> "how we use it" are not always exactly the same. At least they should
>>>> >> both hinge on a common conceptual model, which is where the naming
>>>> >> becomes an important anchoring point.
>>>> >
>>>> > If things are done properly, client libraries could expose simplified
>>>> query
>>>> > interfaces without much effort. Most ORMs these days work by building
>>>> a
>>>> > propositional directed acyclic graph that's serialized to SQL. This
>>>> would
>>>> > work the same way, but it wouldn't be converted into a 4GL.
>>>> > Mike
>>>> >
>>>> >>
>>>> >> Jonathan
>>>> >>
>>>> >> On Mon, May 10, 2010 at 11:37 AM, Mike Malone <mi...@simplegeo.com>
>>>> wrote:
>>>> >> > Maybe... but honestly, it doesn't affect the architecture or
>>>> interface
>>>> >> > at
>>>> >> > all. I'm more interested in thinking about how the system should
>>>> work
>>>> >> > than
>>>> >> > what things are called. Naming things are important, but that can
>>>> happen
>>>> >> > later.
>>>> >> > Does anyone have any thoughts or comments on the architecture I
>>>> >> > suggested
>>>> >> > earlier?
>>>> >> >
>>>> >> > Mike
>>>> >> >
>>>> >> > On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang <zsongbo@gmail.com
>>>> >
>>>> >> > wrote:
>>>> >> >>
>>>> >> >> Yes, the "column" here is not appropriate.
>>>> >> >> Maybe we need not to create new terms, in Google's Bigtable, the
>>>> term
>>>> >> >> "qualifier" is a good one.
>>>> >> >>
>>>> >> >> On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn <
>>>> david@lookin2.com>
>>>> >> >> wrote:
>>>> >> >>>
>>>> >> >>> That would be a good time to get rid of the confusing "column"
>>>> term,
>>>> >> >>> which incorrectly suggests a two-dimensional tabular structure.
>>>> >> >>>
>>>> >> >>> Suggestions:
>>>> >> >>>
>>>> >> >>> 1. A hypercube (or hypocube, if only two dimensions): replace
>>>> "key"
>>>> >> >>> and
>>>> >> >>> "column" with "1st dimension", "2nd dimension", etc.
>>>> >> >>>
>>>> >> >>> 2. A file system: replace "key" and "column" with "directory" and
>>>> >> >>> "subdirectory"
>>>> >> >>>
>>>> >> >>> 3. A tuple tree: "Column family" replaced by top-level tuple,
>>>> whose
>>>> >> >>> value
>>>> >> >>> is the set of keys, whose value is the set of supercolumns of the
>>>> key,
>>>> >> >>> whose
>>>> >> >>> value is the set of columns for the supercolumn, etc.
>>>> >> >>>
>>>> >> >>> 4. Etc.
>>>> >> >>>
>>>> >> >>> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mi...@simplegeo.com>
>>>> >> >>> wrote:
>>>> >> >>>>
>>>> >> >>>> Nice, Ed, we're doing something very similar but less generic.
>>>> >> >>>> Now replace all of the various methods for querying with a
>>>> simple
>>>> >> >>>> query
>>>> >> >>>> interface that takes a Predicate, allow the user to specify (in
>>>> >> >>>> storage-conf) which levels of the nested Columns should be
>>>> indexed,
>>>> >> >>>> and
>>>> >> >>>> completely remove Comparators and have people subclass Column /
>>>> >> >>>> implement
>>>> >> >>>> IColumn and we'd really be on to something ;).
>>>> >> >>>> Mock storage-conf.xml:
>>>> >> >>>>   <Column Name="ThingThatsNowKey" Indexed="True"
>>>> >> >>>> ClusterPartitioned="True" Type="UTF8">
>>>> >> >>>>     <Column Name="ThingThatsNowColumnFamily"
>>>> DiskPartitioned="True"
>>>> >> >>>> Type="UTF8">
>>>> >> >>>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>>>> >> >>>>         <Column Name="ThingThatsNowColumnName" Indexed="True"
>>>> >> >>>> Type="ASCII">
>>>> >> >>>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>>>> >> >>>>         </Column>
>>>> >> >>>>       </Column>
>>>> >> >>>>     </Column>
>>>> >> >>>>   </Column>
>>>> >> >>>> Thrift:
>>>> >> >>>>   struct NamePredicate {
>>>> >> >>>>     1: required list<binary> column_names,
>>>> >> >>>>   }
>>>> >> >>>>   struct SlicePredicate {
>>>> >> >>>>     1: required binary start,
>>>> >> >>>>     2: required binary end,
>>>> >> >>>>   }
>>>> >> >>>>   struct CountPredicate {
>>>> >> >>>>     1: required struct predicate,
>>>> >> >>>>     2: required i32 count=100,
>>>> >> >>>>   }
>>>> >> >>>>   struct AndPredicate {
>>>> >> >>>>     1: required Predicate left,
>>>> >> >>>>     2: required Predicate right,
>>>> >> >>>>   }
>>>> >> >>>>   struct SubColumnsPredicate {
>>>> >> >>>>     1: required Predicate columns,
>>>> >> >>>>     2: required Predicate subcolumns,
>>>> >> >>>>   }
>>>> >> >>>>   ... OrPredicate, OtherUsefulPredicates ...
>>>> >> >>>>   query(predicate, count, consistency_level) # Count here would
>>>> be
>>>> >> >>>> total
>>>> >> >>>> count of leaf values returned, whereas CountPredicate specifies
>>>> a
>>>> >> >>>> column
>>>> >> >>>> count for a particular sub-slice.
>>>> >> >>>> Not fully baked... but I think this could really simplify stuff
>>>> and
>>>> >> >>>> make
>>>> >> >>>> it more flexible. Downside is it may give people enough rope to
>>>> hang
>>>> >> >>>> themselves, but at least the predicate stuff is easily
>>>> distributable.
>>>> >> >>>> I'm thinking I'll play around with implementing some of this
>>>> stuff
>>>> >> >>>> myself if I have any free time in the near future.
>>>> >> >>>> Mike
>>>> >> >>>>
>>>> >> >>>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <
>>>> jbellis@gmail.com>
>>>> >> >>>> wrote:
>>>> >> >>>>>
>>>> >> >>>>> Very interesting, thanks!
>>>> >> >>>>>
>>>> >> >>>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
>>>> >> >>>>> > Follow-up from last weeks discussion, I've been playing
>>>> around
>>>> >> >>>>> > with a
>>>> >> >>>>> > simple
>>>> >> >>>>> > column comparator for composite column names that I put up on
>>>> >> >>>>> > github.  I'd
>>>> >> >>>>> > be interested to hear what people think of this approach.
>>>> >> >>>>> >
>>>> >> >>>>> > http://github.com/edanuff/CassandraCompositeType
>>>> >> >>>>> >
>>>> >> >>>>> > Ed
>>>> >> >>>>> >
>>>> >> >>>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com>
>>>> wrote:
>>>> >> >>>>> >>
>>>> >> >>>>> >> It might make sense to create a CompositeType subclass of
>>>> >> >>>>> >> AbstractType for
>>>> >> >>>>> >> the purpose of constructing and comparing these types of
>>>> >> >>>>> >> "composite"
>>>> >> >>>>> >> column
>>>> >> >>>>> >> names so that if you could more easily do that sort of thing
>>>> >> >>>>> >> rather
>>>> >> >>>>> >> than
>>>> >> >>>>> >> having to concatenate into one big string.
>>>> >> >>>>> >>
>>>> >> >>>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone
>>>> >> >>>>> >> <mi...@simplegeo.com>
>>>> >> >>>>> >> wrote:
>>>> >> >>>>> >>>
>>>> >> >>>>> >>> The only thing SuperColumns appear to buy you (as someone
>>>> >> >>>>> >>> pointed
>>>> >> >>>>> >>> out to
>>>> >> >>>>> >>> me at the Cassandra meetup - I think it was Eric
>>>> Florenzano) is
>>>> >> >>>>> >>> that you can
>>>> >> >>>>> >>> use different comparator types for the Super/SubColumns, I
>>>> >> >>>>> >>> guess..?
>>>> >> >>>>> >>> But you
>>>> >> >>>>> >>> should be able to do the same thing by creating your own
>>>> Column
>>>> >> >>>>> >>> comparator.
>>>> >> >>>>> >>> I guess my point is that SuperColumns are mostly a
>>>> convenience
>>>> >> >>>>> >>> mechanism, as
>>>> >> >>>>> >>> far as I can tell.
>>>> >> >>>>> >>> Mike
>>>> >> >>>>> >
>>>> >> >>>>> >
>>>> >> >>>>>
>>>> >> >>>>>
>>>> >> >>>>>
>>>> >> >>>>> --
>>>> >> >>>>> Jonathan Ellis
>>>> >> >>>>> Project Chair, Apache Cassandra
>>>> >> >>>>> co-founder of Riptano, the source for professional Cassandra
>>>> support
>>>> >> >>>>> http://riptano.com
>>>> >> >>>>
>>>> >> >>>
>>>> >> >>
>>>> >> >
>>>> >> >
>>>> >
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> AJ Chen, PhD
>>> Chair, Semantic Web SIG, sdforum.org
>>> http://web2express.org
>>> twitter @web2express
>>> Palo Alto, CA, USA
>>>
>>
>>
>
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
>
>
>


-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

Re: Is SuperColumn necessary?

Posted by William Ashley <wa...@gmail.com>.

If you're storing your super column under a fixed name, you could just concatenate that name with the row key and use normal columns. Then you get your paging and sorting the way you want it.
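
A minimal sketch of that idea, using plain Python dicts as a stand-in for a
standard column family. This is not the Cassandra client API: `rows`,
`insert`, and `get_slice` are hypothetical helpers, and the "-comments"
suffix and the timestamp column names are assumptions for illustration.

```python
from collections import defaultdict

# In-memory stand-in for a column family: row key -> {column name: value}.
rows = defaultdict(dict)

def insert(row_key, column, value):
    """Write a single column; the rest of the row is not read or rewritten."""
    rows[row_key][column] = value

def get_slice(row_key, count=100):
    """Return up to `count` columns of one row, sorted by column name."""
    return sorted(rows[row_key].items())[:count]

blog_id = "blog-42"

# What used to be the "blog" supercolumn lives in the row keyed by the blog id.
insert(blog_id, "author", "aj")
insert(blog_id, "title", "Is SuperColumn necessary?")

# What used to be the "comments" supercolumn gets its own row:
# the fixed supercolumn name is appended to the row key.
comments_key = blog_id + "-comments"
insert(comments_key, "2010-05-10T16:39:00", "william")
insert(comments_key, "2010-05-10T16:31:00", "aj")

print(get_slice(blog_id))
print(get_slice(comments_key, count=1))  # first page of comments
```

Because the comments are now the top-level columns of their own row, the
usual column slicing and count limits apply to them directly.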


On May 10, 2010, at 4:31 PM, AJ Chen wrote:

> supercolumn is good for modeling profile type of data. simple example is blog:
> blog { blog {author,  title, ...}
>          comments   {time: commenter}  //sort by TimeUUID
> }
> when retrieving a blog, you get all the comments sorted by time already.
> without supercolumn, you would need to concatenate multiple comment times together as you suggested. 
> 
> requiring user to concatenating data fields together is not only an extra burden on user but also a less clean design.  there will be cases where the list property of a profile data is a long list (say a million items). in such cases, user wants to be able to directly insert/delete an item in that list because it's more efficient.  Retrieving the whole list, updating it, concatenating again, and then putting it back to datastore is awkward and less efficient.
> 
> -aj
> 
> 
> On Mon, May 10, 2010 at 2:20 PM, Mike Malone <mi...@simplegeo.com> wrote:
> On Mon, May 10, 2010 at 1:38 PM, AJ Chen <aj...@web2express.org> wrote:
> Could someone confirm this discussion is not about abandoning the supercolumn family? I have found that modeling data with supercolumn families is actually an advantage of Cassandra compared to relational databases. I hope you are not going to drop this important concept. How it's implemented internally is a different matter.
> 
> SuperColumns are useful as a convenience mechanism. That's pretty much it. There's _nothing_ (as far as I can tell) that you can do with SuperColumns that you can't do by manually concatenating key names with a separator on the client side and implementing a custom comparator on the server (as ugly as that is).
> 
> This discussion is about getting rid of SuperColumns and adding a more generic mechanism that will actually be useful and interesting and will continue to be convenient for the types of use cases for which people use SuperColumns.
> 
> If there's a particular use case that you feel you can only implement with SuperColumns, please share! I honestly can't think of any.
> 
> Mike
> 
> 
> On Mon, May 10, 2010 at 10:08 AM, Jonathan Shook <js...@gmail.com> wrote:
> Agreed
> 
> On Mon, May 10, 2010 at 12:01 PM, Mike Malone <mi...@simplegeo.com> wrote:
> > On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook <js...@gmail.com> wrote:
> >>
> >> I have to disagree about the naming of things. The name of something
> >> isn't just a literal identifier. It affects the way people think about
> >> it. For new users, the whole naming thing has been a persistent
> >> barrier.
> >
> > I'm saying we shouldn't be worried too much about coming up with names and
> > analogies until we've decided what it is we're naming.
> >
> >>
> >> As for your suggestions, I'm all for simplifying or generalizing the
> >> "how it works" part down to a more generalized set of operations. I'm
> >> not sure it's a good idea to require users to think in terms building
> >> up a fluffy query structure just to thread it through a needle of an
> >> API, even for the simplest of queries. At some point, the level of
> >> generic boilerplate takes away from the semantic hand rails that
> >> developers like. So I guess I'm suggesting that "how it works" and
> >> "how we use it" are not always exactly the same. At least they should
> >> both hinge on a common conceptual model, which is where the naming
> >> becomes an important anchoring point.
> >
> > If things are done properly, client libraries could expose simplified query
> > interfaces without much effort. Most ORMs these days work by building a
> > propositional directed acyclic graph that's serialized to SQL. This would
> > work the same way, but it wouldn't be converted into a 4GL.
> > Mike
> >
> >>
> >> Jonathan
> >>
> >> On Mon, May 10, 2010 at 11:37 AM, Mike Malone <mi...@simplegeo.com> wrote:
> >> > Maybe... but honestly, it doesn't affect the architecture or interface
> >> > at
> >> > all. I'm more interested in thinking about how the system should work
> >> > than
> >> > what things are called. Naming things are important, but that can happen
> >> > later.
> >> > Does anyone have any thoughts or comments on the architecture I
> >> > suggested
> >> > earlier?
> >> >
> >> > Mike
> >> >
> >> > On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang <zs...@gmail.com>
> >> > wrote:
> >> >>
> >> >> Yes, the "column" here is not appropriate.
> >> >> Maybe we need not to create new terms, in Google's Bigtable, the term
> >> >> "qualifier" is a good one.
> >> >>
> >> >> On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn <da...@lookin2.com>
> >> >> wrote:
> >> >>>
> >> >>> That would be a good time to get rid of the confusing "column" term,
> >> >>> which incorrectly suggests a two-dimensional tabular structure.
> >> >>>
> >> >>> Suggestions:
> >> >>>
> >> >>> 1. A hypercube (or hypocube, if only two dimensions): replace "key"
> >> >>> and
> >> >>> "column" with "1st dimension", "2nd dimension", etc.
> >> >>>
> >> >>> 2. A file system: replace "key" and "column" with "directory" and
> >> >>> "subdirectory"
> >> >>>
> >> >>> 3. A tuple tree: "Column family" replaced by top-level tuple, whose
> >> >>> value
> >> >>> is the set of keys, whose value is the set of supercolumns of the key,
> >> >>> whose
> >> >>> value is the set of columns for the supercolumn, etc.
> >> >>>
> >> >>> 4. Etc.
> >> >>>
> >> >>> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mi...@simplegeo.com>
> >> >>> wrote:
> >> >>>>
> >> >>>> Nice, Ed, we're doing something very similar but less generic.
> >> >>>> Now replace all of the various methods for querying with a simple
> >> >>>> query
> >> >>>> interface that takes a Predicate, allow the user to specify (in
> >> >>>> storage-conf) which levels of the nested Columns should be indexed,
> >> >>>> and
> >> >>>> completely remove Comparators and have people subclass Column /
> >> >>>> implement
> >> >>>> IColumn and we'd really be on to something ;).
> >> >>>> Mock storage-conf.xml:
> >> >>>>   <Column Name="ThingThatsNowKey" Indexed="True"
> >> >>>> ClusterPartitioned="True" Type="UTF8">
> >> >>>>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True"
> >> >>>> Type="UTF8">
> >> >>>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
> >> >>>>         <Column Name="ThingThatsNowColumnName" Indexed="True"
> >> >>>> Type="ASCII">
> >> >>>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
> >> >>>>         </Column>
> >> >>>>       </Column>
> >> >>>>     </Column>
> >> >>>>   </Column>
> >> >>>> Thrift:
> >> >>>>   struct NamePredicate {
> >> >>>>     1: required list<binary> column_names,
> >> >>>>   }
> >> >>>>   struct SlicePredicate {
> >> >>>>     1: required binary start,
> >> >>>>     2: required binary end,
> >> >>>>   }
> >> >>>>   struct CountPredicate {
> >> >>>>     1: required struct predicate,
> >> >>>>     2: required i32 count=100,
> >> >>>>   }
> >> >>>>   struct AndPredicate {
> >> >>>>     1: required Predicate left,
> >> >>>>     2: required Predicate right,
> >> >>>>   }
> >> >>>>   struct SubColumnsPredicate {
> >> >>>>     1: required Predicate columns,
> >> >>>>     2: required Predicate subcolumns,
> >> >>>>   }
> >> >>>>   ... OrPredicate, OtherUsefulPredicates ...
> >> >>>>   query(predicate, count, consistency_level) # Count here would be
> >> >>>> total
> >> >>>> count of leaf values returned, whereas CountPredicate specifies a
> >> >>>> column
> >> >>>> count for a particular sub-slice.
> >> >>>> Not fully baked... but I think this could really simplify stuff and
> >> >>>> make
> >> >>>> it more flexible. Downside is it may give people enough rope to hang
> >> >>>> themselves, but at least the predicate stuff is easily distributable.
> >> >>>> I'm thinking I'll play around with implementing some of this stuff
> >> >>>> myself if I have any free time in the near future.
> >> >>>> Mike
> >> >>>>
> >> >>>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <jb...@gmail.com>
> >> >>>> wrote:
> >> >>>>>
> >> >>>>> Very interesting, thanks!
> >> >>>>>
> >> >>>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
> >> >>>>> > Follow-up from last weeks discussion, I've been playing around
> >> >>>>> > with a
> >> >>>>> > simple
> >> >>>>> > column comparator for composite column names that I put up on
> >> >>>>> > github.  I'd
> >> >>>>> > be interested to hear what people think of this approach.
> >> >>>>> >
> >> >>>>> > http://github.com/edanuff/CassandraCompositeType
> >> >>>>> >
> >> >>>>> > Ed
> >> >>>>> >
> >> >>>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com> wrote:
> >> >>>>> >>
> >> >>>>> >> It might make sense to create a CompositeType subclass of
> >> >>>>> >> AbstractType for
> >> >>>>> >> the purpose of constructing and comparing these types of
> >> >>>>> >> "composite"
> >> >>>>> >> column
> >> >>>>> >> names so that if you could more easily do that sort of thing
> >> >>>>> >> rather
> >> >>>>> >> than
> >> >>>>> >> having to concatenate into one big string.
> >> >>>>> >>
> >> >>>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone
> >> >>>>> >> <mi...@simplegeo.com>
> >> >>>>> >> wrote:
> >> >>>>> >>>
> >> >>>>> >>> The only thing SuperColumns appear to buy you (as someone
> >> >>>>> >>> pointed
> >> >>>>> >>> out to
> >> >>>>> >>> me at the Cassandra meetup - I think it was Eric Florenzano) is
> >> >>>>> >>> that you can
> >> >>>>> >>> use different comparator types for the Super/SubColumns, I
> >> >>>>> >>> guess..?
> >> >>>>> >>> But you
> >> >>>>> >>> should be able to do the same thing by creating your own Column
> >> >>>>> >>> comparator.
> >> >>>>> >>> I guess my point is that SuperColumns are mostly a convenience
> >> >>>>> >>> mechanism, as
> >> >>>>> >>> far as I can tell.
> >> >>>>> >>> Mike
> >> >>>>> >
> >> >>>>> >
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> --
> >> >>>>> Jonathan Ellis
> >> >>>>> Project Chair, Apache Cassandra
> >> >>>>> co-founder of Riptano, the source for professional Cassandra support
> >> >>>>> http://riptano.com
> >> >>>>
> >> >>>
> >> >>
> >> >
> >> >
> >
> >
> 
> 
> 
> -- 
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
> 

Re: Is SuperColumn necessary?

Posted by AJ Chen <aj...@web2express.org>.

In your implementation, are the comments still sorted by time?  Will UTF8Type
sort <TimeUUID>:author by time?
thanks,
-aj
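
One way to check the sorting question is sketched below. It relies on the standard version-1 UUID layout, whose canonical text form starts with the low 32 bits of the timestamp, so plain string comparison need not follow time order. The make_v1 helper, the node value, and the two timestamps are made up for illustration; nothing here calls Cassandra itself.

```python
import uuid

def make_v1(timestamp, node=0x123456789ABC):
    """Build a version-1 UUID from a 60-bit timestamp (hypothetical helper)."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # set version to 1
    # 0x80 sets the RFC 4122 variant bits; clock sequence is otherwise zero.
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0x00, node))

# Two timestamps one tick apart, chosen so the low 32 bits wrap around.
earlier = make_v1(0x1FFFFFFFF)
later = make_v1(0x200000000)

# Column names built the way the thread suggests: <TimeUUID> + ":author".
names = [f"{u}:author" for u in (earlier, later)]

# Order under plain string comparison versus order by embedded timestamp.
by_string = sorted(names)
by_time = [f"{u}:author" for u in sorted((earlier, later), key=lambda u: u.time)]

print(by_string)
print(by_time)
```

If the two orders differ, as they do for this pair, the string form of a TimeUUID is not a safe time-ordered prefix under a string comparator; a fixed-width, zero-padded time prefix or a comparator that decodes the UUID part would be needed instead.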

On Mon, May 10, 2010 at 5:02 PM, Mike Malone <mi...@simplegeo.com> wrote:

> On Mon, May 10, 2010 at 4:31 PM, AJ Chen <aj...@web2express.org> wrote:
>
>> SuperColumns are good for modeling profile-type data. A simple example is a
>> blog:
>> blog { blog {author,  title, ...}
>>          comments   {time: commenter}  //sort by TimeUUID
>> }
>> When retrieving a blog, you get all the comments already sorted by time.
>> Without SuperColumns, you would need to concatenate multiple comment times
>> together, as you suggested.
>>
>> Requiring users to concatenate data fields is not only an extra burden on
>> them but also a less clean design. There will be cases where a list property
>> of a profile is long (say, a million items). In such cases, users want to
>> insert or delete a single item in that list directly because it's more
>> efficient. Retrieving the whole list, updating it, concatenating again, and
>> then writing it back to the datastore is awkward and less efficient.
>>
>
> There's nothing you said here that can't be implemented efficiently using
> columns. You can slice rows and get a subset of Columns. In fact, this
> example is particularly easy to implement. If you have a Blog with Entries
> and Comments you'd do:
>
>   <ColumnFamily Name="Blog" CompareWith="UTF8Type" />
>
>   Insert blog post:
>     batch_mutate(key=<blog post id>, [{name="~post:author", value=<author>},
>                  {name="~post:title", value=<title>}, ...])
>   Insert comment:
>     batch_mutate(key=<blog post id>, [{name=<TimeUUID> + ":author", ... }])
>
> Then you can get the Post only (slice for ["~", ""]), the comments only
> (slice for ["", "~"]), or the post _and_ comments (slice for ["", ""]).
> Inserting a comment does _not_ require a get/concatenate/insert.
>
> Yes, concatenating the names on the client side is hacky, clunky, and
> inconvenient. That's why we _should_ build an interface that doesn't require
> the client to concatenate names. But SuperColumns aren't the right way to do
> it. They add no value. They could be implemented in client libraries, for
> example, and nobody would know the difference.
>
> To really understand the problem with SuperColumns, though, you need to
> look at the Cassandra source. Removing SuperColumns would make the code-base
> much cleaner and tighter, and would probably reduce SLOC by 20%. I think a
> replacement that assumed nested Columns (or Entries, or Thingies) would be
> much cleaner. That's what Stu is working on.
>
> Mike
>
> On Mon, May 10, 2010 at 2:20 PM, Mike Malone <mi...@simplegeo.com> wrote:
>>
>>> On Mon, May 10, 2010 at 1:38 PM, AJ Chen <aj...@web2express.org> wrote:
>>>
>>>> Could someone confirm this discussion is not about abandoning
>>>> supercolumn family? I have found modeling data with supercolumn family is
>>>> actually an advantage of Cassandra compared to relational databases. Hope
>>>> you are not going to drop this important concept.  How it's implemented
>>>> internally is a different matter.
>>>>
>>>
>>> SuperColumns are useful as a convenience mechanism. That's pretty much
>>> it. There's _nothing_ (as far as I can tell) that you can do with
>>> SuperColumns that you can't do by manually concatenating key names with a
>>> separator on the client side and implementing a custom comparator on the
>>> server (as ugly as that is).
>>>
>>> This discussion is about getting rid of SuperColumns and adding a more
>>> generic mechanism that will actually be useful and interesting and will
>>> continue to be convenient for the types of use cases for which people use
>>> SuperColumns.
>>>
>>> If there's a particular use case that you feel you can only implement
>>> with SuperColumns, please share! I honestly can't think of any.
>>>
>>> Mike
>>>
>>>
>>>> On Mon, May 10, 2010 at 10:08 AM, Jonathan Shook <js...@gmail.com>wrote:
>>>>
>>>>> Agreed
>>>>>
>>>>> On Mon, May 10, 2010 at 12:01 PM, Mike Malone <mi...@simplegeo.com>
>>>>> wrote:
>>>>> > On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook <js...@gmail.com>
>>>>> wrote:
>>>>> >>
>>>>> >> I have to disagree about the naming of things. The name of something
>>>>> >> isn't just a literal identifier. It affects the way people think
>>>>> about
>>>>> >> it. For new users, the whole naming thing has been a persistent
>>>>> >> barrier.
>>>>> >
>>>>> > I'm saying we shouldn't be worried too much about coming up with
>>>>> names and
>>>>> > analogies until we've decided what it is we're naming.
>>>>> >
>>>>> >>
>>>>> >> As for your suggestions, I'm all for simplifying or generalizing the
>>>>> >> "how it works" part down to a more generalized set of operations.
>>>>> I'm
>>>>> >> not sure it's a good idea to require users to think in terms
>>>>> building
>>>>> >> up a fluffy query structure just to thread it through a needle of an
>>>>> >> API, even for the simplest of queries. At some point, the level of
>>>>> >> generic boilerplate takes away from the semantic hand rails that
>>>>> >> developers like. So I guess I'm suggesting that "how it works" and
>>>>> >> "how we use it" are not always exactly the same. At least they
>>>>> should
>>>>> >> both hinge on a common conceptual model, which is where the naming
>>>>> >> becomes an important anchoring point.
>>>>> >
>>>>> > If things are done properly, client libraries could expose simplified
>>>>> query
>>>>> > interfaces without much effort. Most ORMs these days work by building
>>>>> a
>>>>> > propositional directed acyclic graph that's serialized to SQL. This
>>>>> would
>>>>> > work the same way, but it wouldn't be converted into a 4GL.
>>>>> > Mike
>>>>> >
>>>>> >>
>>>>> >> Jonathan
>>>>> >>
>>>>> >> On Mon, May 10, 2010 at 11:37 AM, Mike Malone <mi...@simplegeo.com>
>>>>> wrote:
>>>>> >> > Maybe... but honestly, it doesn't affect the architecture or
>>>>> interface
>>>>> >> > at
>>>>> >> > all. I'm more interested in thinking about how the system should
>>>>> work
>>>>> >> > than
>>>>> >> > what things are called. Naming things are important, but that can
>>>>> happen
>>>>> >> > later.
>>>>> >> > Does anyone have any thoughts or comments on the architecture I
>>>>> >> > suggested
>>>>> >> > earlier?
>>>>> >> >
>>>>> >> > Mike
>>>>> >> >
>>>>> >> > On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang <
>>>>> zsongbo@gmail.com>
>>>>> >> > wrote:
>>>>> >> >>
>>>>> >> >> Yes, the "column" here is not appropriate.
>>>>> >> >> Maybe we need not to create new terms, in Google's Bigtable, the
>>>>> term
>>>>> >> >> "qualifier" is a good one.
>>>>> >> >>
>>>>> >> >> On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn <
>>>>> david@lookin2.com>
>>>>> >> >> wrote:
>>>>> >> >>>
>>>>> >> >>> That would be a good time to get rid of the confusing "column"
>>>>> term,
>>>>> >> >>> which incorrectly suggests a two-dimensional tabular structure.
>>>>> >> >>>
>>>>> >> >>> Suggestions:
>>>>> >> >>>
>>>>> >> >>> 1. A hypercube (or hypocube, if only two dimensions): replace
>>>>> "key"
>>>>> >> >>> and
>>>>> >> >>> "column" with "1st dimension", "2nd dimension", etc.
>>>>> >> >>>
>>>>> >> >>> 2. A file system: replace "key" and "column" with "directory"
>>>>> and
>>>>> >> >>> "subdirectory"
>>>>> >> >>>
>>>>> >> >>> 3. A tuple tree: "Column family" replaced by top-level tuple,
>>>>> whose
>>>>> >> >>> value
>>>>> >> >>> is the set of keys, whose value is the set of supercolumns of
>>>>> the key,
>>>>> >> >>> whose
>>>>> >> >>> value is the set of columns for the supercolumn, etc.
>>>>> >> >>>
>>>>> >> >>> 4. Etc.
>>>>> >> >>>
>>>>> >> >>> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mike@simplegeo.com
>>>>> >
>>>>> >> >>> wrote:
>>>>> >> >>>>
>>>>> >> >>>> Nice, Ed, we're doing something very similar but less generic.
>>>>> >> >>>> Now replace all of the various methods for querying with a
>>>>> simple
>>>>> >> >>>> query
>>>>> >> >>>> interface that takes a Predicate, allow the user to specify (in
>>>>> >> >>>> storage-conf) which levels of the nested Columns should be
>>>>> indexed,
>>>>> >> >>>> and
>>>>> >> >>>> completely remove Comparators and have people subclass Column /
>>>>> >> >>>> implement
>>>>> >> >>>> IColumn and we'd really be on to something ;).
>>>>> >> >>>> Mock storage-conf.xml:
>>>>> >> >>>>   <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" Type="UTF8">
>>>>> >> >>>>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True" Type="UTF8">
>>>>> >> >>>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>>>>> >> >>>>         <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
>>>>> >> >>>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>>>>> >> >>>>         </Column>
>>>>> >> >>>>       </Column>
>>>>> >> >>>>     </Column>
>>>>> >> >>>>   </Column>
>>>>> >> >>>> Thrift:
>>>>> >> >>>>   struct NamePredicate {
>>>>> >> >>>>     1: required list<binary> column_names,
>>>>> >> >>>>   }
>>>>> >> >>>>   struct SlicePredicate {
>>>>> >> >>>>     1: required binary start,
>>>>> >> >>>>     2: required binary end,
>>>>> >> >>>>   }
>>>>> >> >>>>   struct CountPredicate {
>>>>> >> >>>>     1: required struct predicate,
>>>>> >> >>>>     2: required i32 count=100,
>>>>> >> >>>>   }
>>>>> >> >>>>   struct AndPredicate {
>>>>> >> >>>>     1: required Predicate left,
>>>>> >> >>>>     2: required Predicate right,
>>>>> >> >>>>   }
>>>>> >> >>>>   struct SubColumnsPredicate {
>>>>> >> >>>>     1: required Predicate columns,
>>>>> >> >>>>     2: required Predicate subcolumns,
>>>>> >> >>>>   }
>>>>> >> >>>>   ... OrPredicate, OtherUsefulPredicates ...
>>>>> >> >>>>   query(predicate, count, consistency_level) # Count here would be total
>>>>> >> >>>> count of leaf values returned, whereas CountPredicate specifies a column
>>>>> >> >>>> count for a particular sub-slice.
>>>>> >> >>>> Not fully baked... but I think this could really simplify stuff
>>>>> and
>>>>> >> >>>> make
>>>>> >> >>>> it more flexible. Downside is it may give people enough rope to
>>>>> hang
>>>>> >> >>>> themselves, but at least the predicate stuff is easily
>>>>> distributable.
>>>>> >> >>>> I'm thinking I'll play around with implementing some of this
>>>>> stuff
>>>>> >> >>>> myself if I have any free time in the near future.
>>>>> >> >>>> Mike
>>>>> >> >>>>
>>>>> >> >>>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <
>>>>> jbellis@gmail.com>
>>>>> >> >>>> wrote:
>>>>> >> >>>>>
>>>>> >> >>>>> Very interesting, thanks!
>>>>> >> >>>>>
>>>>> >> >>>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com>
>>>>> wrote:
>>>>> >> >>>>> > Follow-up from last weeks discussion, I've been playing
>>>>> around
>>>>> >> >>>>> > with a
>>>>> >> >>>>> > simple
>>>>> >> >>>>> > column comparator for composite column names that I put up
>>>>> on
>>>>> >> >>>>> > github.  I'd
>>>>> >> >>>>> > be interested to hear what people think of this approach.
>>>>> >> >>>>> >
>>>>> >> >>>>> > http://github.com/edanuff/CassandraCompositeType
>>>>> >> >>>>> >
>>>>> >> >>>>> > Ed
>>>>> >> >>>>> >
>>>>> >> >>>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com>
>>>>> wrote:
>>>>> >> >>>>> >>
>>>>> >> >>>>> >> It might make sense to create a CompositeType subclass of
>>>>> >> >>>>> >> AbstractType for
>>>>> >> >>>>> >> the purpose of constructing and comparing these types of
>>>>> >> >>>>> >> "composite"
>>>>> >> >>>>> >> column
>>>>> >> >>>>> >> names so that if you could more easily do that sort of
>>>>> thing
>>>>> >> >>>>> >> rather
>>>>> >> >>>>> >> than
>>>>> >> >>>>> >> having to concatenate into one big string.
>>>>> >> >>>>> >>
>>>>> >> >>>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone
>>>>> >> >>>>> >> <mi...@simplegeo.com>
>>>>> >> >>>>> >> wrote:
>>>>> >> >>>>> >>>
>>>>> >> >>>>> >>> The only thing SuperColumns appear to buy you (as someone
>>>>> >> >>>>> >>> pointed
>>>>> >> >>>>> >>> out to
>>>>> >> >>>>> >>> me at the Cassandra meetup - I think it was Eric
>>>>> Florenzano) is
>>>>> >> >>>>> >>> that you can
>>>>> >> >>>>> >>> use different comparator types for the Super/SubColumns, I
>>>>> >> >>>>> >>> guess..?
>>>>> >> >>>>> >>> But you
>>>>> >> >>>>> >>> should be able to do the same thing by creating your own
>>>>> Column
>>>>> >> >>>>> >>> comparator.
>>>>> >> >>>>> >>> I guess my point is that SuperColumns are mostly a
>>>>> convenience
>>>>> >> >>>>> >>> mechanism, as
>>>>> >> >>>>> >>> far as I can tell.
>>>>> >> >>>>> >>> Mike
>>>>> >> >>>>> >
>>>>> >> >>>>> >
>>>>> >> >>>>>
>>>>> >> >>>>>
>>>>> >> >>>>>
>>>>> >> >>>>> --
>>>>> >> >>>>> Jonathan Ellis
>>>>> >> >>>>> Project Chair, Apache Cassandra
>>>>> >> >>>>> co-founder of Riptano, the source for professional Cassandra
>>>>> support
>>>>> >> >>>>> http://riptano.com
>>>>> >> >>>>
>>>>> >> >>>
>>>>> >> >>
>>>>> >> >
>>>>> >> >
>>>>> >
>>>>> >
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> AJ Chen, PhD
>>>> Chair, Semantic Web SIG, sdforum.org
>>>> http://web2express.org
>>>> twitter @web2express
>>>> Palo Alto, CA, USA
>>>>
>>>
>>>
>>
>>
>> --
>> AJ Chen, PhD
>> Chair, Semantic Web SIG, sdforum.org
>> http://web2express.org
>> twitter @web2express
>> Palo Alto, CA, USA
>>
>
>


-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

Re: Is SuperColumn necessary?

Posted by Mike Malone <mi...@simplegeo.com>.

On Mon, May 10, 2010 at 4:31 PM, AJ Chen <aj...@web2express.org> wrote:

> SuperColumns are good for modeling profile-type data. A simple example is a
> blog:
> blog { blog {author,  title, ...}
>          comments   {time: commenter}  //sort by TimeUUID
> }
> When retrieving a blog, you get all the comments already sorted by time.
> Without SuperColumns, you would need to concatenate multiple comment times
> together, as you suggested.
>
> Requiring users to concatenate data fields is not only an extra burden on
> them but also a less clean design. There will be cases where a list property
> of a profile is long (say, a million items). In such cases, users want to
> insert or delete a single item in that list directly because it's more
> efficient. Retrieving the whole list, updating it, concatenating again, and
> then writing it back to the datastore is awkward and less efficient.
>

There's nothing you said here that can't be implemented efficiently using
columns. You can slice rows and get a subset of Columns. In fact, this
example is particularly easy to implement. If you have a Blog with Entries
and Comments you'd do:

  <ColumnFamily Name="Blog" CompareWith="UTF8Type" />

  Insert blog post:
    batch_mutate(key=<blog post id>, [{name="~post:author", value=<author>},
                 {name="~post:title", value=<title>}, ...])
  Insert comment:
    batch_mutate(key=<blog post id>, [{name=<TimeUUID> + ":author", ... }])

Then you can get the Post only (slice for ["~", ""]), the comments only
(slice for ["", "~"]), or the post _and_ comments (slice for ["", ""]).
Inserting a comment does _not_ require a get/concatenate/insert.
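
A rough way to see the three slices is to model one row as a sorted set of column names and slice it with the same bounds. This is only a sketch in plain Python, not the Thrift API: slice_names and the sample column names are hypothetical, an empty string stands for an open bound as in the slices above, and a zero-padded timestamp stands in for the TimeUUID so that string order matches time order.

```python
from bisect import bisect_left, bisect_right

# One hypothetical "Blog" row: column name -> value.
row = {
    "~post:author": "mike",
    "~post:title": "Hello",
    "0000001273500000:author": "aj",
    "0000001273500000:body": "first",
    "0000001273600000:author": "ed",
}

def slice_names(row, start, finish):
    """Return the column names in [start, finish]; "" leaves that side open."""
    names = sorted(row)  # stands in for the comparator's on-disk ordering
    lo = bisect_left(names, start) if start else 0
    hi = bisect_right(names, finish) if finish else len(names)
    return names[lo:hi]

post_only = slice_names(row, "~", "")      # names that sort at or after "~"
comments_only = slice_names(row, "", "~")  # names that sort at or before "~"
everything = slice_names(row, "", "")

# Adding a comment is a single column write, with no read-modify-write of a list.
row["0000001273700000:author"] = "stu"
comments_after_insert = slice_names(row, "", "~")

print(post_only)
print(comments_only)
print(comments_after_insert)
```

The "~" prefix works here because it sorts after digits and lowercase hex characters in ASCII, which keeps the post columns together at the end of the row.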

Yes, concatenating the names on the client side is hacky, clunky, and
inconvenient. That's why we _should_ build an interface that doesn't require
the client to concatenate names. But SuperColumns aren't the right way to do
it. They add no value. They could be implemented in client libraries, for
example, and nobody would know the difference.

To really understand the problem with SuperColumns, though, you need to look
at the Cassandra source. Removing SuperColumns would make the code-base much
cleaner and tighter, and would probably reduce SLOC by 20%. I think a
replacement that assumed nested Columns (or Entries, or Thingies) would be
much cleaner. That's what Stu is working on.

Mike

On Mon, May 10, 2010 at 2:20 PM, Mike Malone <mi...@simplegeo.com> wrote:
>
>> On Mon, May 10, 2010 at 1:38 PM, AJ Chen <aj...@web2express.org> wrote:
>>
>>> Could someone confirm this discussion is not about abandoning supercolumn
>>> family? I have found modeling data with supercolumn family is actually an
>>> advantage of Cassandra compared to relational databases. Hope you are not
>>> going to drop this important concept.  How it's implemented internally is a
>>> different matter.
>>>
>>
>> SuperColumns are useful as a convenience mechanism. That's pretty much it.
>> There's _nothing_ (as far as I can tell) that you can do with SuperColumns
>> that you can't do by manually concatenating key names with a separator on
>> the client side and implementing a custom comparator on the server (as ugly
>> as that is).
>>
>> This discussion is about getting rid of SuperColumns and adding a more
>> generic mechanism that will actually be useful and interesting and will
>> continue to be convenient for the types of use cases for which people use
>> SuperColumns.
>>
>> If there's a particular use case that you feel you can only implement with
>> SuperColumns, please share! I honestly can't think of any.
>>
>> Mike
>>
>>
>>> On Mon, May 10, 2010 at 10:08 AM, Jonathan Shook <js...@gmail.com>wrote:
>>>
>>>> Agreed
>>>>
>>>> On Mon, May 10, 2010 at 12:01 PM, Mike Malone <mi...@simplegeo.com>
>>>> wrote:
>>>> > On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook <js...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> I have to disagree about the naming of things. The name of something
>>>> >> isn't just a literal identifier. It affects the way people think
>>>> about
>>>> >> it. For new users, the whole naming thing has been a persistent
>>>> >> barrier.
>>>> >
>>>> > I'm saying we shouldn't be worried too much about coming up with names
>>>> and
>>>> > analogies until we've decided what it is we're naming.
>>>> >
>>>> >>
>>>> >> As for your suggestions, I'm all for simplifying or generalizing the
>>>> >> "how it works" part down to a more generalized set of operations. I'm
>>>> >> not sure it's a good idea to require users to think in terms building
>>>> >> up a fluffy query structure just to thread it through a needle of an
>>>> >> API, even for the simplest of queries. At some point, the level of
>>>> >> generic boilerplate takes away from the semantic hand rails that
>>>> >> developers like. So I guess I'm suggesting that "how it works" and
>>>> >> "how we use it" are not always exactly the same. At least they should
>>>> >> both hinge on a common conceptual model, which is where the naming
>>>> >> becomes an important anchoring point.
>>>> >
>>>> > If things are done properly, client libraries could expose simplified
>>>> query
>>>> > interfaces without much effort. Most ORMs these days work by building
>>>> a
>>>> > propositional directed acyclic graph that's serialized to SQL. This
>>>> would
>>>> > work the same way, but it wouldn't be converted into a 4GL.
>>>> > Mike
>>>> >
>>>> >>
>>>> >> Jonathan
>>>> >>
>>>> >> On Mon, May 10, 2010 at 11:37 AM, Mike Malone <mi...@simplegeo.com>
>>>> wrote:
>>>> >> > Maybe... but honestly, it doesn't affect the architecture or
>>>> interface
>>>> >> > at
>>>> >> > all. I'm more interested in thinking about how the system should
>>>> work
>>>> >> > than
>>>> >> > what things are called. Naming things are important, but that can
>>>> happen
>>>> >> > later.
>>>> >> > Does anyone have any thoughts or comments on the architecture I
>>>> >> > suggested
>>>> >> > earlier?
>>>> >> >
>>>> >> > Mike
>>>> >> >
>>>> >> > On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang <zsongbo@gmail.com
>>>> >
>>>> >> > wrote:
>>>> >> >>
>>>> >> >> Yes, the "column" here is not appropriate.
>>>> >> >> Maybe we need not to create new terms, in Google's Bigtable, the
>>>> term
>>>> >> >> "qualifier" is a good one.
>>>> >> >>
>>>> >> >> On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn <
>>>> david@lookin2.com>
>>>> >> >> wrote:
>>>> >> >>>
>>>> >> >>> That would be a good time to get rid of the confusing "column"
>>>> term,
>>>> >> >>> which incorrectly suggests a two-dimensional tabular structure.
>>>> >> >>>
>>>> >> >>> Suggestions:
>>>> >> >>>
>>>> >> >>> 1. A hypercube (or hypocube, if only two dimensions): replace
>>>> "key"
>>>> >> >>> and
>>>> >> >>> "column" with "1st dimension", "2nd dimension", etc.
>>>> >> >>>
>>>> >> >>> 2. A file system: replace "key" and "column" with "directory" and
>>>> >> >>> "subdirectory"
>>>> >> >>>
>>>> >> >>> 3. A tuple tree: "Column family" replaced by top-level tuple,
>>>> whose
>>>> >> >>> value
>>>> >> >>> is the set of keys, whose value is the set of supercolumns of the
>>>> key,
>>>> >> >>> whose
>>>> >> >>> value is the set of columns for the supercolumn, etc.
>>>> >> >>>
>>>> >> >>> 4. Etc.
>>>> >> >>>
>>>> >> >>> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mi...@simplegeo.com>
>>>> >> >>> wrote:
>>>> >> >>>>
>>>> >> >>>> Nice, Ed, we're doing something very similar but less generic.
>>>> >> >>>> Now replace all of the various methods for querying with a
>>>> simple
>>>> >> >>>> query
>>>> >> >>>> interface that takes a Predicate, allow the user to specify (in
>>>> >> >>>> storage-conf) which levels of the nested Columns should be
>>>> indexed,
>>>> >> >>>> and
>>>> >> >>>> completely remove Comparators and have people subclass Column /
>>>> >> >>>> implement
>>>> >> >>>> IColumn and we'd really be on to something ;).
>>>> >> >>>> Mock storage-conf.xml:
>>>> >> >>>>   <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" Type="UTF8">
>>>> >> >>>>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True" Type="UTF8">
>>>> >> >>>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>>>> >> >>>>         <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
>>>> >> >>>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>>>> >> >>>>         </Column>
>>>> >> >>>>       </Column>
>>>> >> >>>>     </Column>
>>>> >> >>>>   </Column>
>>>> >> >>>> Thrift:
>>>> >> >>>>   struct NamePredicate {
>>>> >> >>>>     1: required list<binary> column_names,
>>>> >> >>>>   }
>>>> >> >>>>   struct SlicePredicate {
>>>> >> >>>>     1: required binary start,
>>>> >> >>>>     2: required binary end,
>>>> >> >>>>   }
>>>> >> >>>>   struct CountPredicate {
>>>> >> >>>>     1: required struct predicate,
>>>> >> >>>>     2: required i32 count=100,
>>>> >> >>>>   }
>>>> >> >>>>   struct AndPredicate {
>>>> >> >>>>     1: required Predicate left,
>>>> >> >>>>     2: required Predicate right,
>>>> >> >>>>   }
>>>> >> >>>>   struct SubColumnsPredicate {
>>>> >> >>>>     1: required Predicate columns,
>>>> >> >>>>     2: required Predicate subcolumns,
>>>> >> >>>>   }
>>>> >> >>>>   ... OrPredicate, OtherUsefulPredicates ...
>>>> >> >>>>   query(predicate, count, consistency_level) # Count here would be total
>>>> >> >>>> count of leaf values returned, whereas CountPredicate specifies a column
>>>> >> >>>> count for a particular sub-slice.
>>>> >> >>>> Not fully baked... but I think this could really simplify stuff
>>>> and
>>>> >> >>>> make
>>>> >> >>>> it more flexible. Downside is it may give people enough rope to
>>>> hang
>>>> >> >>>> themselves, but at least the predicate stuff is easily
>>>> distributable.
>>>> >> >>>> I'm thinking I'll play around with implementing some of this
>>>> stuff
>>>> >> >>>> myself if I have any free time in the near future.
>>>> >> >>>> Mike
>>>> >> >>>>
>>>> >> >>>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <
>>>> jbellis@gmail.com>
>>>> >> >>>> wrote:
>>>> >> >>>>>
>>>> >> >>>>> Very interesting, thanks!
>>>> >> >>>>>
>>>> >> >>>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
>>>> >> >>>>> > Follow-up from last weeks discussion, I've been playing
>>>> around
>>>> >> >>>>> > with a
>>>> >> >>>>> > simple
>>>> >> >>>>> > column comparator for composite column names that I put up on
>>>> >> >>>>> > github.  I'd
>>>> >> >>>>> > be interested to hear what people think of this approach.
>>>> >> >>>>> >
>>>> >> >>>>> > http://github.com/edanuff/CassandraCompositeType
>>>> >> >>>>> >
>>>> >> >>>>> > Ed
>>>> >> >>>>> >
>>>> >> >>>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com>
>>>> wrote:
>>>> >> >>>>> >>
>>>> >> >>>>> >> It might make sense to create a CompositeType subclass of
>>>> >> >>>>> >> AbstractType for
>>>> >> >>>>> >> the purpose of constructing and comparing these types of
>>>> >> >>>>> >> "composite"
>>>> >> >>>>> >> column
>>>> >> >>>>> >> names so that if you could more easily do that sort of thing
>>>> >> >>>>> >> rather
>>>> >> >>>>> >> than
>>>> >> >>>>> >> having to concatenate into one big string.
>>>> >> >>>>> >>
>>>> >> >>>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone
>>>> >> >>>>> >> <mi...@simplegeo.com>
>>>> >> >>>>> >> wrote:
>>>> >> >>>>> >>>
>>>> >> >>>>> >>> The only thing SuperColumns appear to buy you (as someone
>>>> >> >>>>> >>> pointed
>>>> >> >>>>> >>> out to
>>>> >> >>>>> >>> me at the Cassandra meetup - I think it was Eric
>>>> Florenzano) is
>>>> >> >>>>> >>> that you can
>>>> >> >>>>> >>> use different comparator types for the Super/SubColumns, I
>>>> >> >>>>> >>> guess..?
>>>> >> >>>>> >>> But you
>>>> >> >>>>> >>> should be able to do the same thing by creating your own
>>>> Column
>>>> >> >>>>> >>> comparator.
>>>> >> >>>>> >>> I guess my point is that SuperColumns are mostly a
>>>> convenience
>>>> >> >>>>> >>> mechanism, as
>>>> >> >>>>> >>> far as I can tell.
>>>> >> >>>>> >>> Mike
>>>> >> >>>>> >
>>>> >> >>>>> >
>>>> >> >>>>>
>>>> >> >>>>>
>>>> >> >>>>>
>>>> >> >>>>> --
>>>> >> >>>>> Jonathan Ellis
>>>> >> >>>>> Project Chair, Apache Cassandra
>>>> >> >>>>> co-founder of Riptano, the source for professional Cassandra
>>>> support
>>>> >> >>>>> http://riptano.com
>>>> >> >>>>
>>>> >> >>>
>>>> >> >>
>>>> >> >
>>>> >> >
>>>> >
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> AJ Chen, PhD
>>> Chair, Semantic Web SIG, sdforum.org
>>> http://web2express.org
>>> twitter @web2express
>>> Palo Alto, CA, USA
>>>
>>
>>
>
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
>

Re: Is SuperColumn necessary?

Posted by AJ Chen <aj...@web2express.org>.

SuperColumns are good for modeling profile-type data. A simple example is a
blog:
blog { blog {author,  title, ...}
         comments   {time: commenter}  //sort by TimeUUID
}
When retrieving a blog, you get all the comments already sorted by time.
Without SuperColumns, you would need to concatenate multiple comment times
together, as you suggested.

Requiring users to concatenate data fields is not only an extra burden on
them but also a less clean design. There will be cases where a list property
of a profile is long (say, a million items). In such cases, users want to
insert or delete a single item in that list directly because it's more
efficient. Retrieving the whole list, updating it, concatenating again, and
then writing it back to the datastore is awkward and less efficient.

-aj


On Mon, May 10, 2010 at 2:20 PM, Mike Malone <mi...@simplegeo.com> wrote:

> On Mon, May 10, 2010 at 1:38 PM, AJ Chen <aj...@web2express.org> wrote:
>
>> Could someone confirm this discussion is not about abandoning supercolumn
>> family? I have found modeling data with supercolumn family is actually an
>> advantage of Cassandra compared to relational databases. Hope you are not
>> going to drop this important concept.  How it's implemented internally is a
>> different matter.
>>
>
> SuperColumns are useful as a convenience mechanism. That's pretty much it.
> There's _nothing_ (as far as I can tell) that you can do with SuperColumns
> that you can't do by manually concatenating key names with a separator on
> the client side and implementing a custom comparator on the server (as ugly
> as that is).
>
> This discussion is about getting rid of SuperColumns and adding a more
> generic mechanism that will actually be useful and interesting and will
> continue to be convenient for the types of use cases for which people use
> SuperColumns.
>
> If there's a particular use case that you feel you can only implement with
> SuperColumns, please share! I honestly can't think of any.
>
> Mike
>
>
>> On Mon, May 10, 2010 at 10:08 AM, Jonathan Shook <js...@gmail.com>wrote:
>>
>>> Agreed
>>>
>>> On Mon, May 10, 2010 at 12:01 PM, Mike Malone <mi...@simplegeo.com>
>>> wrote:
>>> > On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook <js...@gmail.com>
>>> wrote:
>>> >>
>>> >> I have to disagree about the naming of things. The name of something
>>> >> isn't just a literal identifier. It affects the way people think about
>>> >> it. For new users, the whole naming thing has been a persistent
>>> >> barrier.
>>> >
>>> > I'm saying we shouldn't be worried too much about coming up with names
>>> and
>>> > analogies until we've decided what it is we're naming.
>>> >
>>> >>
>>> >> As for your suggestions, I'm all for simplifying or generalizing the
>>> >> "how it works" part down to a more generalized set of operations. I'm
>>> >> not sure it's a good idea to require users to think in terms building
>>> >> up a fluffy query structure just to thread it through a needle of an
>>> >> API, even for the simplest of queries. At some point, the level of
>>> >> generic boilerplate takes away from the semantic hand rails that
>>> >> developers like. So I guess I'm suggesting that "how it works" and
>>> >> "how we use it" are not always exactly the same. At least they should
>>> >> both hinge on a common conceptual model, which is where the naming
>>> >> becomes an important anchoring point.
>>> >
>>> > If things are done properly, client libraries could expose simplified
>>> query
>>> > interfaces without much effort. Most ORMs these days work by building a
>>> > propositional directed acyclic graph that's serialized to SQL. This
>>> would
>>> > work the same way, but it wouldn't be converted into a 4GL.
>>> > Mike
>>> >
>>> >>
>>> >> Jonathan
>>> >>
>>> >> On Mon, May 10, 2010 at 11:37 AM, Mike Malone <mi...@simplegeo.com>
>>> wrote:
>>> >> > Maybe... but honestly, it doesn't affect the architecture or
>>> interface
>>> >> > at
>>> >> > all. I'm more interested in thinking about how the system should
>>> work
>>> >> > than
>>> >> > what things are called. Naming things are important, but that can
>>> happen
>>> >> > later.
>>> >> > Does anyone have any thoughts or comments on the architecture I
>>> >> > suggested
>>> >> > earlier?
>>> >> >
>>> >> > Mike
>>> >> >
>>> >> > On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang <zs...@gmail.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> Yes, the "column" here is not appropriate.
>>> >> >> Maybe we need not to create new terms, in Google's Bigtable, the
>>> term
>>> >> >> "qualifier" is a good one.
>>> >> >>
>>> >> >> On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn <david@lookin2.com
>>> >
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> That would be a good time to get rid of the confusing "column"
>>> term,
>>> >> >>> which incorrectly suggests a two-dimensional tabular structure.
>>> >> >>>
>>> >> >>> Suggestions:
>>> >> >>>
>>> >> >>> 1. A hypercube (or hypocube, if only two dimensions): replace
>>> "key"
>>> >> >>> and
>>> >> >>> "column" with "1st dimension", "2nd dimension", etc.
>>> >> >>>
>>> >> >>> 2. A file system: replace "key" and "column" with "directory" and
>>> >> >>> "subdirectory"
>>> >> >>>
>>> >> >>> 3. A tuple tree: "Column family" replaced by top-level tuple,
>>> whose
>>> >> >>> value
>>> >> >>> is the set of keys, whose value is the set of supercolumns of the
>>> key,
>>> >> >>> whose
>>> >> >>> value is the set of columns for the supercolumn, etc.
>>> >> >>>
>>> >> >>> 4. Etc.
>>> >> >>>
>>> >> >>> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mi...@simplegeo.com>
>>> >> >>> wrote:
>>> >> >>>>
>>> >> >>>> Nice, Ed, we're doing something very similar but less generic.
>>> >> >>>> Now replace all of the various methods for querying with a simple
>>> >> >>>> query
>>> >> >>>> interface that takes a Predicate, allow the user to specify (in
>>> >> >>>> storage-conf) which levels of the nested Columns should be
>>> indexed,
>>> >> >>>> and
>>> >> >>>> completely remove Comparators and have people subclass Column /
>>> >> >>>> implement
>>> >> >>>> IColumn and we'd really be on to something ;).
>>> >> >>>> Mock storage-conf.xml:
>>> >> >>>>   <Column Name="ThingThatsNowKey" Indexed="True"
>>> >> >>>> ClusterPartitioned="True" Type="UTF8">
>>> >> >>>>     <Column Name="ThingThatsNowColumnFamily"
>>> DiskPartitioned="True"
>>> >> >>>> Type="UTF8">
>>> >> >>>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>>> >> >>>>         <Column Name="ThingThatsNowColumnName" Indexed="True"
>>> >> >>>> Type="ASCII">
>>> >> >>>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>>> >> >>>>         </Column>
>>> >> >>>>       </Column>
>>> >> >>>>     </Column>
>>> >> >>>>   </Column>
>>> >> >>>> Thrift:
>>> >> >>>>   struct NamePredicate {
>>> >> >>>>     1: required list<binary> column_names,
>>> >> >>>>   }
>>> >> >>>>   struct SlicePredicate {
>>> >> >>>>     1: required binary start,
>>> >> >>>>     2: required binary end,
>>> >> >>>>   }
>>> >> >>>>   struct CountPredicate {
>>> >> >>>>     1: required struct predicate,
>>> >> >>>>     2: required i32 count=100,
>>> >> >>>>   }
>>> >> >>>>   struct AndPredicate {
>>> >> >>>>     1: required Predicate left,
>>> >> >>>>     2: required Predicate right,
>>> >> >>>>   }
>>> >> >>>>   struct SubColumnsPredicate {
>>> >> >>>>     1: required Predicate columns,
>>> >> >>>>     2: required Predicate subcolumns,
>>> >> >>>>   }
>>> >> >>>>   ... OrPredicate, OtherUsefulPredicates ...
>>> >> >>>>   query(predicate, count, consistency_level) # Count here would
>>> be
>>> >> >>>> total
>>> >> >>>> count of leaf values returned, whereas CountPredicate specifies a
>>> >> >>>> column
>>> >> >>>> count for a particular sub-slice.
>>> >> >>>> Not fully baked... but I think this could really simplify stuff
>>> and
>>> >> >>>> make
>>> >> >>>> it more flexible. Downside is it may give people enough rope to
>>> hang
>>> >> >>>> themselves, but at least the predicate stuff is easily
>>> distributable.
>>> >> >>>> I'm thinking I'll play around with implementing some of this
>>> stuff
>>> >> >>>> myself if I have any free time in the near future.
>>> >> >>>> Mike
>>> >> >>>>
>>> >> >>>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <
>>> jbellis@gmail.com>
>>> >> >>>> wrote:
>>> >> >>>>>
>>> >> >>>>> Very interesting, thanks!
>>> >> >>>>>
>>> >> >>>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
>>> >> >>>>> > Follow-up from last week's discussion, I've been playing around
>>> >> >>>>> > with a
>>> >> >>>>> > simple
>>> >> >>>>> > column comparator for composite column names that I put up on
>>> >> >>>>> > github.  I'd
>>> >> >>>>> > be interested to hear what people think of this approach.
>>> >> >>>>> >
>>> >> >>>>> > http://github.com/edanuff/CassandraCompositeType
>>> >> >>>>> >
>>> >> >>>>> > Ed
>>> >> >>>>> >
>>> >> >>>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com>
>>> wrote:
>>> >> >>>>> >>
>>> >> >>>>> >> It might make sense to create a CompositeType subclass of
>>> >> >>>>> >> AbstractType for
>>> >> >>>>> >> the purpose of constructing and comparing these types of
>>> >> >>>>> >> "composite"
>>> >> >>>>> >> column
>>> >> >>>>> >> names so that you could more easily do that sort of thing
>>> >> >>>>> >> rather
>>> >> >>>>> >> than
>>> >> >>>>> >> having to concatenate into one big string.
>>> >> >>>>> >>
>>> >> >>>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone
>>> >> >>>>> >> <mi...@simplegeo.com>
>>> >> >>>>> >> wrote:
>>> >> >>>>> >>>
>>> >> >>>>> >>> The only thing SuperColumns appear to buy you (as someone
>>> >> >>>>> >>> pointed
>>> >> >>>>> >>> out to
>>> >> >>>>> >>> me at the Cassandra meetup - I think it was Eric Florenzano)
>>> is
>>> >> >>>>> >>> that you can
>>> >> >>>>> >>> use different comparator types for the Super/SubColumns, I
>>> >> >>>>> >>> guess..?
>>> >> >>>>> >>> But you
>>> >> >>>>> >>> should be able to do the same thing by creating your own
>>> Column
>>> >> >>>>> >>> comparator.
>>> >> >>>>> >>> I guess my point is that SuperColumns are mostly a
>>> convenience
>>> >> >>>>> >>> mechanism, as
>>> >> >>>>> >>> far as I can tell.
>>> >> >>>>> >>> Mike
>>> >> >>>>> >
>>> >> >>>>> >
>>> >> >>>>>
>>> >> >>>>>
>>> >> >>>>>
>>> >> >>>>> --
>>> >> >>>>> Jonathan Ellis
>>> >> >>>>> Project Chair, Apache Cassandra
>>> >> >>>>> co-founder of Riptano, the source for professional Cassandra
>>> support
>>> >> >>>>> http://riptano.com
>>> >> >>>>
>>> >> >>>
>>> >> >>
>>> >> >
>>> >> >
>>> >
>>> >
>>>
>>
>>
>>
>> --
>> AJ Chen, PhD
>> Chair, Semantic Web SIG, sdforum.org
>> http://web2express.org
>> twitter @web2express
>> Palo Alto, CA, USA
>>
>
>


-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

Re: Is SuperColumn necessary?

Posted by Mike Malone <mi...@simplegeo.com>.

On Mon, May 10, 2010 at 1:38 PM, AJ Chen <aj...@web2express.org> wrote:

> Could someone confirm this discussion is not about abandoning supercolumn
> family? I have found modeling data with supercolumn family is actually an
> advantage of Cassandra compared to a relational database. Hope you are not going to
> drop this important concept.  How it's implemented internally is a different
> matter.
>

SuperColumns are useful as a convenience mechanism. That's pretty much it.
There's _nothing_ (as far as I can tell) that you can do with SuperColumns
that you can't do by manually concatenating key names with a separator on
the client side and implementing a custom comparator on the server (as ugly
as that is).
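
As a rough sketch of that concatenation approach (Python purely for illustration; the separator, the helper names, and the choice of a numeric first component are all assumptions, and none of this is Cassandra's API): the client flattens the two levels into one column name, and a comparator that understands the layout restores the ordering that a SuperColumn with a Long comparator would have given.

```python
from functools import cmp_to_key

SEP = ":"  # assumed separator; it must never appear inside the first component


def make_name(super_name, sub_name):
    """Client side: flatten (supercolumn, subcolumn) into one column name."""
    return f"{super_name}{SEP}{sub_name}"


def compare_names(a, b):
    """Stand-in for a server-side comparator: order by the first component
    as an integer, then by the second component as text."""
    a_super, a_sub = a.split(SEP, 1)
    b_super, b_sub = b.split(SEP, 1)
    key_a = (int(a_super), a_sub)
    key_b = (int(b_super), b_sub)
    return (key_a > key_b) - (key_a < key_b)


names = [make_name(10, "title"), make_name(9, "body"), make_name(10, "author")]
ordered = sorted(names, key=cmp_to_key(compare_names))
print(ordered)
```

A plain string sort of the same names would put "10:..." before "9:...", which is why the comparator has to know how the name was built.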

This discussion is about getting rid of SuperColumns and adding a more
generic mechanism that will actually be useful and interesting and will
continue to be convenient for the types of use cases for which people use
SuperColumns.

If there's a particular use case that you feel you can only implement with
SuperColumns, please share! I honestly can't think of any.

Mike


> On Mon, May 10, 2010 at 10:08 AM, Jonathan Shook <js...@gmail.com> wrote:
>
>> Agreed
>>
>> On Mon, May 10, 2010 at 12:01 PM, Mike Malone <mi...@simplegeo.com> wrote:
>> > On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook <js...@gmail.com>
>> wrote:
>> >>
>> >> I have to disagree about the naming of things. The name of something
>> >> isn't just a literal identifier. It affects the way people think about
>> >> it. For new users, the whole naming thing has been a persistent
>> >> barrier.
>> >
>> > I'm saying we shouldn't be worried too much about coming up with names
>> and
>> > analogies until we've decided what it is we're naming.
>> >
>> >>
>> >> As for your suggestions, I'm all for simplifying or generalizing the
>> >> "how it works" part down to a more generalized set of operations. I'm
>> >> not sure it's a good idea to require users to think in terms of building
>> >> up a fluffy query structure just to thread it through a needle of an
>> >> API, even for the simplest of queries. At some point, the level of
>> >> generic boilerplate takes away from the semantic hand rails that
>> >> developers like. So I guess I'm suggesting that "how it works" and
>> >> "how we use it" are not always exactly the same. At least they should
>> >> both hinge on a common conceptual model, which is where the naming
>> >> becomes an important anchoring point.
>> >
>> > If things are done properly, client libraries could expose simplified
>> query
>> > interfaces without much effort. Most ORMs these days work by building a
>> > propositional directed acyclic graph that's serialized to SQL. This
>> would
>> > work the same way, but it wouldn't be converted into a 4GL.
>> > Mike
>> >
>> >>
>> >> Jonathan
>> >>
>> >> On Mon, May 10, 2010 at 11:37 AM, Mike Malone <mi...@simplegeo.com>
>> wrote:
>> >> > Maybe... but honestly, it doesn't affect the architecture or
>> interface
>> >> > at
>> >> > all. I'm more interested in thinking about how the system should work
>> >> > than
>> >> > what things are called. Naming things is important, but that can
>> happen
>> >> > later.
>> >> > Does anyone have any thoughts or comments on the architecture I
>> >> > suggested
>> >> > earlier?
>> >> >
>> >> > Mike
>> >> >
>> >> > On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang <zs...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Yes, the "column" here is not appropriate.
>> >> >> Maybe we need not create new terms; in Google's Bigtable, the
>> term
>> >> >> "qualifier" is a good one.
>> >> >>
>> >> >> On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn <da...@lookin2.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> That would be a good time to get rid of the confusing "column"
>> term,
>> >> >>> which incorrectly suggests a two-dimensional tabular structure.
>> >> >>>
>> >> >>> Suggestions:
>> >> >>>
>> >> >>> 1. A hypercube (or hypocube, if only two dimensions): replace "key"
>> >> >>> and
>> >> >>> "column" with "1st dimension", "2nd dimension", etc.
>> >> >>>
>> >> >>> 2. A file system: replace "key" and "column" with "directory" and
>> >> >>> "subdirectory"
>> >> >>>
>> >> >>> 3. A tuple tree: "Column family" replaced by top-level tuple, whose
>> >> >>> value
>> >> >>> is the set of keys, whose value is the set of supercolumns of the
>> key,
>> >> >>> whose
>> >> >>> value is the set of columns for the supercolumn, etc.
>> >> >>>
>> >> >>> 4. Etc.
>> >> >>>
>> >> >>> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mi...@simplegeo.com>
>> >> >>> wrote:
>> >> >>>>
>> >> >>>> Nice, Ed, we're doing something very similar but less generic.
>> >> >>>> Now replace all of the various methods for querying with a simple
>> >> >>>> query
>> >> >>>> interface that takes a Predicate, allow the user to specify (in
>> >> >>>> storage-conf) which levels of the nested Columns should be
>> indexed,
>> >> >>>> and
>> >> >>>> completely remove Comparators and have people subclass Column /
>> >> >>>> implement
>> >> >>>> IColumn and we'd really be on to something ;).
>> >> >>>> Mock storage-conf.xml:
>> >> >>>>   <Column Name="ThingThatsNowKey" Indexed="True"
>> >> >>>> ClusterPartitioned="True" Type="UTF8">
>> >> >>>>     <Column Name="ThingThatsNowColumnFamily"
>> DiskPartitioned="True"
>> >> >>>> Type="UTF8">
>> >> >>>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>> >> >>>>         <Column Name="ThingThatsNowColumnName" Indexed="True"
>> >> >>>> Type="ASCII">
>> >> >>>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>> >> >>>>         </Column>
>> >> >>>>       </Column>
>> >> >>>>     </Column>
>> >> >>>>   </Column>
>> >> >>>> Thrift:
>> >> >>>>   struct NamePredicate {
>> >> >>>>     1: required list<binary> column_names,
>> >> >>>>   }
>> >> >>>>   struct SlicePredicate {
>> >> >>>>     1: required binary start,
>> >> >>>>     2: required binary end,
>> >> >>>>   }
>> >> >>>>   struct CountPredicate {
>> >> >>>>     1: required struct predicate,
>> >> >>>>     2: required i32 count=100,
>> >> >>>>   }
>> >> >>>>   struct AndPredicate {
>> >> >>>>     1: required Predicate left,
>> >> >>>>     2: required Predicate right,
>> >> >>>>   }
>> >> >>>>   struct SubColumnsPredicate {
>> >> >>>>     1: required Predicate columns,
>> >> >>>>     2: required Predicate subcolumns,
>> >> >>>>   }
>> >> >>>>   ... OrPredicate, OtherUsefulPredicates ...
>> >> >>>>   query(predicate, count, consistency_level) # Count here would be
>> >> >>>> total
>> >> >>>> count of leaf values returned, whereas CountPredicate specifies a
>> >> >>>> column
>> >> >>>> count for a particular sub-slice.
>> >> >>>> Not fully baked... but I think this could really simplify stuff
>> and
>> >> >>>> make
>> >> >>>> it more flexible. Downside is it may give people enough rope to
>> hang
>> >> >>>> themselves, but at least the predicate stuff is easily
>> distributable.
>> >> >>>> I'm thinking I'll play around with implementing some of this stuff
>> >> >>>> myself if I have any free time in the near future.
>> >> >>>> Mike
>> >> >>>>
>> >> >>>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <jbellis@gmail.com
>> >
>> >> >>>> wrote:
>> >> >>>>>
>> >> >>>>> Very interesting, thanks!
>> >> >>>>>
>> >> >>>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
>> >> >>>>> > Follow-up from last week's discussion, I've been playing around
>> >> >>>>> > with a
>> >> >>>>> > simple
>> >> >>>>> > column comparator for composite column names that I put up on
>> >> >>>>> > github.  I'd
>> >> >>>>> > be interested to hear what people think of this approach.
>> >> >>>>> >
>> >> >>>>> > http://github.com/edanuff/CassandraCompositeType
>> >> >>>>> >
>> >> >>>>> > Ed
>> >> >>>>> >
>> >> >>>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com>
>> wrote:
>> >> >>>>> >>
>> >> >>>>> >> It might make sense to create a CompositeType subclass of
>> >> >>>>> >> AbstractType for
>> >> >>>>> >> the purpose of constructing and comparing these types of
>> >> >>>>> >> "composite"
>> >> >>>>> >> column
>> >> >>>>> >> names so that you could more easily do that sort of thing
>> >> >>>>> >> rather
>> >> >>>>> >> than
>> >> >>>>> >> having to concatenate into one big string.
>> >> >>>>> >>
>> >> >>>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone
>> >> >>>>> >> <mi...@simplegeo.com>
>> >> >>>>> >> wrote:
>> >> >>>>> >>>
>> >> >>>>> >>> The only thing SuperColumns appear to buy you (as someone
>> >> >>>>> >>> pointed
>> >> >>>>> >>> out to
>> >> >>>>> >>> me at the Cassandra meetup - I think it was Eric Florenzano)
>> is
>> >> >>>>> >>> that you can
>> >> >>>>> >>> use different comparator types for the Super/SubColumns, I
>> >> >>>>> >>> guess..?
>> >> >>>>> >>> But you
>> >> >>>>> >>> should be able to do the same thing by creating your own
>> Column
>> >> >>>>> >>> comparator.
>> >> >>>>> >>> I guess my point is that SuperColumns are mostly a
>> convenience
>> >> >>>>> >>> mechanism, as
>> >> >>>>> >>> far as I can tell.
>> >> >>>>> >>> Mike
>> >> >>>>> >
>> >> >>>>> >
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> --
>> >> >>>>> Jonathan Ellis
>> >> >>>>> Project Chair, Apache Cassandra
>> >> >>>>> co-founder of Riptano, the source for professional Cassandra
>> support
>> >> >>>>> http://riptano.com
>> >> >>>>
>> >> >>>
>> >> >>
>> >> >
>> >> >
>> >
>> >
>>
>
>
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
>

Re: Is SuperColumn necessary?

Posted by Mike Malone <mi...@simplegeo.com>.

On Tue, May 11, 2010 at 7:46 AM, David Boxenhorn <da...@lookin2.com> wrote:

> I would like an API with a variable number of arguments. Using Java
> varargs, something like
>
> value = keyspace.get("articles", "cars", "John Smith", "2010-05-01",
> "comment-25");
>
> or
>
> valueArray = keyspace.get("articles", predicate1, predicate2, predicate3,
> predicate4);
>

Hrm. I haven't dug that deeply into the joys of predicate logic,
propositional DAGs, etc., but couldn't this also be represented as a nested
tree of predicates and other primitives? It would be something like:

   SubColumns = Transformation that takes a predicate, applies it to a
Column, then gets its SubColumns
   keyspace.get("articles", SubColumns(predicate1, SubColumns(predicate2,
SubColumns(predicate3, predicate4))));

It's more functional-programming-ish, I suppose, but I think that model
might apply more cleanly here. FP does tend to result in nice clean
algorithms for manipulating large data sets.
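
To make the nesting concrete, here is a minimal sketch in Python over plain nested dictionaries. Every name in it (names, name_range, leaf, sub_columns, get) is hypothetical and chosen only for illustration; it is not a proposed API, just one way the "apply a predicate, then descend into the subcolumns" idea could compose.

```python
def names(*wanted):
    """Predicate matching any of the given column names."""
    return lambda name: name in wanted


def name_range(start, end):
    """Predicate matching names in the inclusive range [start, end]."""
    return lambda name: start <= name <= end


def leaf(predicate):
    """Innermost level: keep the columns whose name matches."""
    return lambda columns: {n: v for n, v in columns.items() if predicate(n)}


def sub_columns(predicate, inner):
    """Keep the columns whose name matches `predicate`, then apply
    `inner` to the subcolumns of each column that was kept."""
    def apply(columns):
        return {n: inner(sub) for n, sub in columns.items() if predicate(n)}
    return apply


def get(data, query):
    """Run a composed query against a nested dict."""
    return query(data)


articles = {
    "cars": {
        "John Smith": {"2010-05-01": "first", "2010-06-01": "second"},
        "Jane Doe": {"2010-05-02": "third"},
    },
    "boats": {"John Smith": {"2010-05-03": "fourth"}},
}

query = sub_columns(
    names("cars"),
    sub_columns(names("John Smith"), leaf(name_range("2010-05-01", "2010-05-31"))),
)
result = get(articles, query)
print(result)
```

Because each level is just a function from columns to columns, adding a fourth or fifth level needs no new machinery, which is the appeal of the functional framing.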

Mike


>
>
> The storage layout would be determined by the configuration, as below:
>
> <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True"
> ...
>
>
>
>
> On Tue, May 11, 2010 at 5:26 PM, Jonathan Shook <js...@gmail.com> wrote:
>
>> This is one of the sticking points with the key concatenation
>> argument. You can't simply access subpartitions of data along an
>> aggregate name using a concatenated key unless you can efficiently
>> address a range of the keys according to a property of a subset. I'm
>> hoping this will bear out with more of this discussion.
>>
>> Another facet of this issue is performance with respect to storage
>> layout. Presently columns within a row are inherently organized for
>> efficient range operations. The key space is not generally optimal in
>> this way. I'm hoping to see some discussion of this, as well.
>>
>> On Tue, May 11, 2010 at 6:17 AM, vd <vi...@gmail.com> wrote:
>> > Hi
>> >
>> > Can we do a range search on the ID:ID format, given that the API would treat
>> > it as a single ID, or can it split on ':'? If not, then how can we avoid
>> > using supercolumns where we need to associate 'n' rows with a single ID?
>> > Like
>> >          CatID1-> articleID1
>> >          CatID1-> articleID2
>> >          CatID1-> articleID3
>> >          CatID1-> articleID4
>> > How can we map such scenarios with simple column families.
>> >
>> > Rgds.
>> >
>> > On Tue, May 11, 2010 at 2:11 PM, Torsten Curdt <tc...@vafer.org>
>> wrote:
>> >> Exactly.
>> >>
>> >> On Tue, May 11, 2010 at 10:20, David Boxenhorn <da...@lookin2.com>
>> wrote:
>> >>> Don't think of it as getting rid of supercolumns. Think of it as adding
>> >>> superdupercolumns, supertriplecolumns, etc. Or, in sparse array
>> terminology:
>> >>> array[dim1][dim2][dim3].....[dimN] = value
>> >>>
>> >>> Or, as said above:
>> >>>
>> >>>   <Column Name="ThingThatsNowKey" Indexed="True"
>> ClusterPartitioned="True"
>> >>> Type="UTF8">
>> >>>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True"
>> >>> Type="UTF8">
>> >>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>> >>>         <Column Name="ThingThatsNowColumnName" Indexed="True"
>> Type="ASCII">
>> >>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>> >>>         </Column>
>> >>>       </Column>
>> >>>     </Column>
>> >>>   </Column>
>> >>
>> >
>>
>
>

Re: Is SuperColumn necessary?

Posted by David Boxenhorn <da...@lookin2.com>.

I would like an API with a variable number of arguments. Using Java varargs,
something like

value = keyspace.get("articles", "cars", "John Smith", "2010-05-01",
"comment-25");

or

valueArray = keyspace.get("articles", predicate1, predicate2, predicate3,
predicate4);


The storage layout would be determined by the configuration, as below:

<Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" ...
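
The variable-argument get described above could be sketched roughly as follows. This is Python rather than Java varargs, and the Keyspace class here is a toy in-memory stand-in invented for illustration, not a real client class; the stored comment text is made up as well.

```python
class Keyspace:
    """Toy in-memory stand-in for a keyspace of arbitrarily nested columns."""

    def __init__(self, data):
        self._data = data

    def get(self, *path):
        """Descend one level per argument; raises KeyError if a level is missing."""
        node = self._data
        for part in path:
            node = node[part]
        return node


keyspace = Keyspace({
    "articles": {"cars": {"John Smith": {"2010-05-01": {"comment-25": "Nice car!"}}}}
})

# Full path: a single leaf value.
value = keyspace.get("articles", "cars", "John Smith", "2010-05-01", "comment-25")
# Shorter path: everything stored beneath that level.
comments = keyspace.get("articles", "cars", "John Smith", "2010-05-01")
print(value, comments)
```

Stopping the path early returns the whole subtree, so the same call covers both the single-value and the slice-like cases.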



On Tue, May 11, 2010 at 5:26 PM, Jonathan Shook <js...@gmail.com> wrote:

> This is one of the sticking points with the key concatenation
> argument. You can't simply access subpartitions of data along an
> aggregate name using a concatenated key unless you can efficiently
> address a range of the keys according to a property of a subset. I'm
> hoping this will bear out with more of this discussion.
>
> Another facet of this issue is performance with respect to storage
> layout. Presently columns within a row are inherently organized for
> efficient range operations. The key space is not generally optimal in
> this way. I'm hoping to see some discussion of this, as well.
>
> On Tue, May 11, 2010 at 6:17 AM, vd <vi...@gmail.com> wrote:
> > Hi
> >
> > Can we do a range search on the ID:ID format, given that the API would treat
> > it as a single ID, or can it split on ':'? If not, then how can we avoid
> > using supercolumns where we need to associate 'n' rows with a single ID?
> > Like
> >          CatID1-> articleID1
> >          CatID1-> articleID2
> >          CatID1-> articleID3
> >          CatID1-> articleID4
> > How can we map such scenarios with simple column families.
> >
> > Rgds.
> >
> > On Tue, May 11, 2010 at 2:11 PM, Torsten Curdt <tc...@vafer.org> wrote:
> >> Exactly.
> >>
> >> On Tue, May 11, 2010 at 10:20, David Boxenhorn <da...@lookin2.com>
> wrote:
> >>> Don't think of it as getting rid of supercolumns. Think of it as adding
> >>> superdupercolumns, supertriplecolumns, etc. Or, in sparse array
> terminology:
> >>> array[dim1][dim2][dim3].....[dimN] = value
> >>>
> >>> Or, as said above:
> >>>
> >>>   <Column Name="ThingThatsNowKey" Indexed="True"
> ClusterPartitioned="True"
> >>> Type="UTF8">
> >>>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True"
> >>> Type="UTF8">
> >>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
> >>>         <Column Name="ThingThatsNowColumnName" Indexed="True"
> Type="ASCII">
> >>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
> >>>         </Column>
> >>>       </Column>
> >>>     </Column>
> >>>   </Column>
> >>
> >
>

Re: Is SuperColumn necessary?

Posted by Jonathan Shook <js...@gmail.com>.

This is one of the sticking points with the key concatenation
argument. You can't simply access subpartitions of data along an
aggregate name using a concatenated key unless you can efficiently
address a range of the keys according to a property of a subset. I'm
hoping this will bear out with more of this discussion.
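
A small sketch of what "efficiently address a range of the keys" means for concatenated names, in Python for illustration only. It assumes the names are kept in sorted order and that ":" is the separator; the prefix_range helper is hypothetical. With those assumptions, every name under one prefix forms a contiguous run that two binary searches can find.

```python
import bisect

SEP = ":"  # assumed separator between the two parts of a concatenated name


def prefix_range(sorted_keys, prefix):
    """Return every key of the form prefix + SEP + rest.

    Works only because sorted_keys is sorted: the matching keys are
    contiguous, so two binary searches bound them."""
    low_key = prefix + SEP
    # The character right after SEP gives the smallest string that sorts
    # after every key starting with prefix + SEP.
    high_key = prefix + chr(ord(SEP) + 1)
    low = bisect.bisect_left(sorted_keys, low_key)
    high = bisect.bisect_left(sorted_keys, high_key)
    return sorted_keys[low:high]


keys = sorted([
    "CatID1:articleID1",
    "CatID1:articleID2",
    "CatID10:articleID9",
    "CatID2:articleID3",
])
print(prefix_range(keys, "CatID1"))
```

If the keys are not stored in sorted order, there is no contiguous run to bound and the lookup degrades to a full scan, which is the sticking point.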

Another facet of this issue is performance with respect to storage
layout. Presently columns within a row are inherently organized for
efficient range operations. The key space is not generally optimal in
this way. I'm hoping to see some discussion of this, as well.

On Tue, May 11, 2010 at 6:17 AM, vd <vi...@gmail.com> wrote:
> Hi
>
> Can we do a range search on the ID:ID format, given that the API would treat
> it as a single ID, or can it split on ':'? If not, then how can we avoid
> using supercolumns where we need to associate 'n' rows with a single ID?
> Like
>          CatID1-> articleID1
>          CatID1-> articleID2
>          CatID1-> articleID3
>          CatID1-> articleID4
> How can we map such scenarios with simple column families.
>
> Rgds.
>
> On Tue, May 11, 2010 at 2:11 PM, Torsten Curdt <tc...@vafer.org> wrote:
>> Exactly.
>>
>> On Tue, May 11, 2010 at 10:20, David Boxenhorn <da...@lookin2.com> wrote:
>>> Don't think of it as getting rid of supercolumns. Think of it as adding
>>> superdupercolumns, supertriplecolumns, etc. Or, in sparse array terminology:
>>> array[dim1][dim2][dim3].....[dimN] = value
>>>
>>> Or, as said above:
>>>
>>>   <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True"
>>> Type="UTF8">
>>>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True"
>>> Type="UTF8">
>>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>>>         <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
>>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>>>         </Column>
>>>       </Column>
>>>     </Column>
>>>   </Column>
>>
>

Re: Is SuperColumn necessary?

Posted by vd <vi...@gmail.com>.

Hi

Can we do a range search on the ID:ID format, given that the API would treat
it as a single ID, or can it split on ':'? If not, then how can we avoid
using supercolumns where we need to associate 'n' rows with a single ID?
Like
          CatID1-> articleID1
          CatID1-> articleID2
          CatID1-> articleID3
          CatID1-> articleID4
How can we map such scenarios with simple column families?

Rgds.

On Tue, May 11, 2010 at 2:11 PM, Torsten Curdt <tc...@vafer.org> wrote:
> Exactly.
>
> On Tue, May 11, 2010 at 10:20, David Boxenhorn <da...@lookin2.com> wrote:
>> Don't think of it as getting rid of supercolumns. Think of it as adding
>> superdupercolumns, supertriplecolumns, etc. Or, in sparse array terminology:
>> array[dim1][dim2][dim3].....[dimN] = value
>>
>> Or, as said above:
>>
>>   <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True"
>> Type="UTF8">
>>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True"
>> Type="UTF8">
>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>>         <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>>         </Column>
>>       </Column>
>>     </Column>
>>   </Column>
>

Re: Is SuperColumn necessary?

Posted by Torsten Curdt <tc...@vafer.org>.

Exactly.

On Tue, May 11, 2010 at 10:20, David Boxenhorn <da...@lookin2.com> wrote:
> Don't think of it as getting rid of supercolumns. Think of it as adding
> superdupercolumns, supertriplecolumns, etc. Or, in sparse array terminology:
> array[dim1][dim2][dim3].....[dimN] = value
>
> Or, as said above:
>
>   <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True"
> Type="UTF8">
>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True"
> Type="UTF8">
>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>         <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>         </Column>
>       </Column>
>     </Column>
>   </Column>

Re: Is SuperColumn necessary?

Posted by David Boxenhorn <da...@lookin2.com>.

Don't think of it as getting rid of supercolumns. Think of it as adding
superdupercolumns, supertriplecolumns, etc. Or, in sparse array terminology:
array[dim1][dim2][dim3].....[dimN] = value

Or, as said above:

  <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True"
Type="UTF8">
    <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True"
Type="UTF8">
      <Column Name="ThingThatsNowSuperColumnName" Type="Long">
        <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
          <Column Name="ThingThatCantCurrentlyBeRepresented"/>
        </Column>
      </Column>
    </Column>
  </Column>
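
The sparse-array view above can be illustrated with a few lines of Python. This is only a toy model of the idea, not of how storage would work; the sparse helper and the sample names are made up for the example.

```python
from collections import defaultdict


def sparse():
    """A sparse array of unbounded depth: each missing index
    creates another empty level on first access."""
    return defaultdict(sparse)


array = sparse()
# Four dimensions here, but nothing fixes the depth in advance.
array["key1"]["family1"][42]["colA"] = "value"

print(array["key1"]["family1"][42]["colA"])
```

Only the paths that were actually written take up space, and a fifth or sixth dimension is just one more pair of brackets.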




On Mon, May 10, 2010 at 11:38 PM, AJ Chen <aj...@web2express.org> wrote:

> Could someone confirm this discussion is not about abandoning supercolumn
> family? I have found modeling data with supercolumn family is actually an
> advantage of Cassandra compared to a relational database. Hope you are not going to
> drop this important concept.  How it's implemented internally is a different
> matter.
> -aj
>
> On Mon, May 10, 2010 at 10:08 AM, Jonathan Shook <js...@gmail.com> wrote:
>
>> Agreed
>>
>> On Mon, May 10, 2010 at 12:01 PM, Mike Malone <mi...@simplegeo.com> wrote:
>> > On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook <js...@gmail.com>
>> wrote:
>> >>
>> >> I have to disagree about the naming of things. The name of something
>> >> isn't just a literal identifier. It affects the way people think about
>> >> it. For new users, the whole naming thing has been a persistent
>> >> barrier.
>> >
>> > I'm saying we shouldn't be worried too much about coming up with names
>> and
>> > analogies until we've decided what it is we're naming.
>> >
>> >>
>> >> As for your suggestions, I'm all for simplifying or generalizing the
>> >> "how it works" part down to a more generalized set of operations. I'm
>> >> not sure it's a good idea to require users to think in terms of building
>> >> up a fluffy query structure just to thread it through a needle of an
>> >> API, even for the simplest of queries. At some point, the level of
>> >> generic boilerplate takes away from the semantic hand rails that
>> >> developers like. So I guess I'm suggesting that "how it works" and
>> >> "how we use it" are not always exactly the same. At least they should
>> >> both hinge on a common conceptual model, which is where the naming
>> >> becomes an important anchoring point.
>> >
>> > If things are done properly, client libraries could expose simplified
>> query
>> > interfaces without much effort. Most ORMs these days work by building a
>> > propositional directed acyclic graph that's serialized to SQL. This
>> would
>> > work the same way, but it wouldn't be converted into a 4GL.
>> > Mike
>> >
>> >>
>> >> Jonathan
>> >>
>> >> On Mon, May 10, 2010 at 11:37 AM, Mike Malone <mi...@simplegeo.com>
>> wrote:
>> >> > Maybe... but honestly, it doesn't affect the architecture or
>> interface
>> >> > at
>> >> > all. I'm more interested in thinking about how the system should work
>> >> > than
>> >> > what things are called. Naming things is important, but that can
>> happen
>> >> > later.
>> >> > Does anyone have any thoughts or comments on the architecture I
>> >> > suggested
>> >> > earlier?
>> >> >
>> >> > Mike
>> >> >
>> >> > On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang <zs...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Yes, the "column" here is not appropriate.
>> >> >> Maybe we need not create new terms; in Google's Bigtable, the
>> term
>> >> >> "qualifier" is a good one.
>> >> >>
>> >> >> On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn <da...@lookin2.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> That would be a good time to get rid of the confusing "column"
>> term,
>> >> >>> which incorrectly suggests a two-dimensional tabular structure.
>> >> >>>
>> >> >>> Suggestions:
>> >> >>>
>> >> >>> 1. A hypercube (or hypocube, if only two dimensions): replace "key"
>> >> >>> and
>> >> >>> "column" with "1st dimension", "2nd dimension", etc.
>> >> >>>
>> >> >>> 2. A file system: replace "key" and "column" with "directory" and
>> >> >>> "subdirectory"
>> >> >>>
>> >> >>> 3. A tuple tree: "Column family" replaced by top-level tuple, whose
>> >> >>> value
>> >> >>> is the set of keys, whose value is the set of supercolumns of the
>> key,
>> >> >>> whose
>> >> >>> value is the set of columns for the supercolumn, etc.
>> >> >>>
>> >> >>> 4. Etc.
>> >> >>>
>> >> >>> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mi...@simplegeo.com>
>> >> >>> wrote:
>> >> >>>>
>> >> >>>> Nice, Ed, we're doing something very similar but less generic.
>> >> >>>> Now replace all of the various methods for querying with a simple
>> >> >>>> query
>> >> >>>> interface that takes a Predicate, allow the user to specify (in
>> >> >>>> storage-conf) which levels of the nested Columns should be
>> indexed,
>> >> >>>> and
>> >> >>>> completely remove Comparators and have people subclass Column /
>> >> >>>> implement
>> >> >>>> IColumn and we'd really be on to something ;).
>> >> >>>> Mock storage-conf.xml:
>> >> >>>>   <Column Name="ThingThatsNowKey" Indexed="True"
>> >> >>>> ClusterPartitioned="True" Type="UTF8">
>> >> >>>>     <Column Name="ThingThatsNowColumnFamily"
>> DiskPartitioned="True"
>> >> >>>> Type="UTF8">
>> >> >>>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>> >> >>>>         <Column Name="ThingThatsNowColumnName" Indexed="True"
>> >> >>>> Type="ASCII">
>> >> >>>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>> >> >>>>         </Column>
>> >> >>>>       </Column>
>> >> >>>>     </Column>
>> >> >>>>   </Column>
>> >> >>>> Thrift:
>> >> >>>>   struct NamePredicate {
>> >> >>>>     1: required list<binary> column_names,
>> >> >>>>   }
>> >> >>>>   struct SlicePredicate {
>> >> >>>>     1: required binary start,
>> >> >>>>     2: required binary end,
>> >> >>>>   }
>> >> >>>>   struct CountPredicate {
>> >> >>>>     1: required struct predicate,
>> >> >>>>     2: required i32 count=100,
>> >> >>>>   }
>> >> >>>>   struct AndPredicate {
>> >> >>>>     1: required Predicate left,
>> >> >>>>     2: required Predicate right,
>> >> >>>>   }
>> >> >>>>   struct SubColumnsPredicate {
>> >> >>>>     1: required Predicate columns,
>> >> >>>>     2: required Predicate subcolumns,
>> >> >>>>   }
>> >> >>>>   ... OrPredicate, OtherUsefulPredicates ...
>> >> >>>>   query(predicate, count, consistency_level) # Count here would be
>> >> >>>> total
>> >> >>>> count of leaf values returned, whereas CountPredicate specifies a
>> >> >>>> column
>> >> >>>> count for a particular sub-slice.
>> >> >>>> Not fully baked... but I think this could really simplify stuff
>> and
>> >> >>>> make
>> >> >>>> it more flexible. Downside is it may give people enough rope to
>> hang
>> >> >>>> themselves, but at least the predicate stuff is easily
>> distributable.
>> >> >>>> I'm thinking I'll play around with implementing some of this stuff
>> >> >>>> myself if I have any free time in the near future.
>> >> >>>> Mike
>> >> >>>>
>> >> >>>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <jbellis@gmail.com
>> >
>> >> >>>> wrote:
>> >> >>>>>
>> >> >>>>> Very interesting, thanks!
>> >> >>>>>
>> >> >>>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
>> >> >>>>> > Follow-up from last weeks discussion, I've been playing around
>> >> >>>>> > with a
>> >> >>>>> > simple
>> >> >>>>> > column comparator for composite column names that I put up on
>> >> >>>>> > github.  I'd
>> >> >>>>> > be interested to hear what people think of this approach.
>> >> >>>>> >
>> >> >>>>> > http://github.com/edanuff/CassandraCompositeType
>> >> >>>>> >
>> >> >>>>> > Ed
>> >> >>>>> >
>> >> >>>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com>
>> wrote:
>> >> >>>>> >>
>> >> >>>>> >> It might make sense to create a CompositeType subclass of
>> >> >>>>> >> AbstractType for
>> >> >>>>> >> the purpose of constructing and comparing these types of
>> >> >>>>> >> "composite"
>> >> >>>>> >> column
>> >> >>>>> >> names so that if you could more easily do that sort of thing
>> >> >>>>> >> rather
>> >> >>>>> >> than
>> >> >>>>> >> having to concatenate into one big string.
>> >> >>>>> >>
>> >> >>>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone
>> >> >>>>> >> <mi...@simplegeo.com>
>> >> >>>>> >> wrote:
>> >> >>>>> >>>
>> >> >>>>> >>> The only thing SuperColumns appear to buy you (as someone
>> >> >>>>> >>> pointed
>> >> >>>>> >>> out to
>> >> >>>>> >>> me at the Cassandra meetup - I think it was Eric Florenzano)
>> is
>> >> >>>>> >>> that you can
>> >> >>>>> >>> use different comparator types for the Super/SubColumns, I
>> >> >>>>> >>> guess..?
>> >> >>>>> >>> But you
>> >> >>>>> >>> should be able to do the same thing by creating your own
>> Column
>> >> >>>>> >>> comparator.
>> >> >>>>> >>> I guess my point is that SuperColumns are mostly a
>> convenience
>> >> >>>>> >>> mechanism, as
>> >> >>>>> >>> far as I can tell.
>> >> >>>>> >>> Mike
>> >> >>>>> >
>> >> >>>>> >
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> --
>> >> >>>>> Jonathan Ellis
>> >> >>>>> Project Chair, Apache Cassandra
>> >> >>>>> co-founder of Riptano, the source for professional Cassandra
>> support
>> >> >>>>> http://riptano.com
>> >> >>>>
>> >> >>>
>> >> >>
>> >> >
>> >> >
>> >
>> >
>>
>
>
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
>

Re: Is SuperColumn necessary?

Posted by AJ Chen <aj...@web2express.org>.

Could someone confirm this discussion is not about abandoning the supercolumn
family? I have found that modeling data with supercolumn families is actually
an advantage of Cassandra compared to relational databases. I hope you are not
going to drop this important concept.  How it's implemented internally is a
different matter.
-aj

On Mon, May 10, 2010 at 10:08 AM, Jonathan Shook <js...@gmail.com> wrote:

> Agreed
>
> On Mon, May 10, 2010 at 12:01 PM, Mike Malone <mi...@simplegeo.com> wrote:
> > On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook <js...@gmail.com>
> wrote:
> >>
> >> I have to disagree about the naming of things. The name of something
> >> isn't just a literal identifier. It affects the way people think about
> >> it. For new users, the whole naming thing has been a persistent
> >> barrier.
> >
> > I'm saying we shouldn't be worried too much about coming up with names
> and
> > analogies until we've decided what it is we're naming.
> >
> >>
> >> As for your suggestions, I'm all for simplifying or generalizing the
> >> "how it works" part down to a more generalized set of operations. I'm
> >> not sure it's a good idea to require users to think in terms building
> >> up a fluffy query structure just to thread it through a needle of an
> >> API, even for the simplest of queries. At some point, the level of
> >> generic boilerplate takes away from the semantic hand rails that
> >> developers like. So I guess I'm suggesting that "how it works" and
> >> "how we use it" are not always exactly the same. At least they should
> >> both hinge on a common conceptual model, which is where the naming
> >> becomes an important anchoring point.
> >
> > If things are done properly, client libraries could expose simplified
> query
> > interfaces without much effort. Most ORMs these days work by building a
> > propositional directed acyclic graph that's serialized to SQL. This would
> > work the same way, but it wouldn't be converted into a 4GL.
> > Mike
> >
> >>
> >> Jonathan
> >>
> >> On Mon, May 10, 2010 at 11:37 AM, Mike Malone <mi...@simplegeo.com>
> wrote:
> >> > Maybe... but honestly, it doesn't affect the architecture or interface
> >> > at
> >> > all. I'm more interested in thinking about how the system should work
> >> > than
> >> > what things are called. Naming things are important, but that can
> happen
> >> > later.
> >> > Does anyone have any thoughts or comments on the architecture I
> >> > suggested
> >> > earlier?
> >> >
> >> > Mike
> >> >
> >> > On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang <zs...@gmail.com>
> >> > wrote:
> >> >>
> >> >> Yes, the "column" here is not appropriate.
> >> >> Maybe we need not to create new terms, in Google's Bigtable, the term
> >> >> "qualifier" is a good one.
> >> >>
> >> >> On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn <da...@lookin2.com>
> >> >> wrote:
> >> >>>
> >> >>> That would be a good time to get rid of the confusing "column" term,
> >> >>> which incorrectly suggests a two-dimensional tabular structure.
> >> >>>
> >> >>> Suggestions:
> >> >>>
> >> >>> 1. A hypercube (or hypocube, if only two dimensions): replace "key"
> >> >>> and
> >> >>> "column" with "1st dimension", "2nd dimension", etc.
> >> >>>
> >> >>> 2. A file system: replace "key" and "column" with "directory" and
> >> >>> "subdirectory"
> >> >>>
> >> >>> 3. A tuple tree: "Column family" replaced by top-level tuple, whose
> >> >>> value
> >> >>> is the set of keys, whose value is the set of supercolumns of the
> key,
> >> >>> whose
> >> >>> value is the set of columns for the supercolumn, etc.
> >> >>>
> >> >>> 4. Etc.
> >> >>>
> >> >>> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mi...@simplegeo.com>
> >> >>> wrote:
> >> >>>>
> >> >>>> Nice, Ed, we're doing something very similar but less generic.
> >> >>>> Now replace all of the various methods for querying with a simple
> >> >>>> query
> >> >>>> interface that takes a Predicate, allow the user to specify (in
> >> >>>> storage-conf) which levels of the nested Columns should be indexed,
> >> >>>> and
> >> >>>> completely remove Comparators and have people subclass Column /
> >> >>>> implement
> >> >>>> IColumn and we'd really be on to something ;).
> >> >>>> Mock storage-conf.xml:
> >> >>>>   <Column Name="ThingThatsNowKey" Indexed="True"
> >> >>>> ClusterPartitioned="True" Type="UTF8">
> >> >>>>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True"
> >> >>>> Type="UTF8">
> >> >>>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
> >> >>>>         <Column Name="ThingThatsNowColumnName" Indexed="True"
> >> >>>> Type="ASCII">
> >> >>>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
> >> >>>>         </Column>
> >> >>>>       </Column>
> >> >>>>     </Column>
> >> >>>>   </Column>
> >> >>>> Thrift:
> >> >>>>   struct NamePredicate {
> >> >>>>     1: required list<binary> column_names,
> >> >>>>   }
> >> >>>>   struct SlicePredicate {
> >> >>>>     1: required binary start,
> >> >>>>     2: required binary end,
> >> >>>>   }
> >> >>>>   struct CountPredicate {
> >> >>>>     1: required struct predicate,
> >> >>>>     2: required i32 count=100,
> >> >>>>   }
> >> >>>>   struct AndPredicate {
> >> >>>>     1: required Predicate left,
> >> >>>>     2: required Predicate right,
> >> >>>>   }
> >> >>>>   struct SubColumnsPredicate {
> >> >>>>     1: required Predicate columns,
> >> >>>>     2: required Predicate subcolumns,
> >> >>>>   }
> >> >>>>   ... OrPredicate, OtherUsefulPredicates ...
> >> >>>>   query(predicate, count, consistency_level) # Count here would be
> >> >>>> total
> >> >>>> count of leaf values returned, whereas CountPredicate specifies a
> >> >>>> column
> >> >>>> count for a particular sub-slice.
> >> >>>> Not fully baked... but I think this could really simplify stuff and
> >> >>>> make
> >> >>>> it more flexible. Downside is it may give people enough rope to
> hang
> >> >>>> themselves, but at least the predicate stuff is easily
> distributable.
> >> >>>> I'm thinking I'll play around with implementing some of this stuff
> >> >>>> myself if I have any free time in the near future.
> >> >>>> Mike
> >> >>>>
> >> >>>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <jb...@gmail.com>
> >> >>>> wrote:
> >> >>>>>
> >> >>>>> Very interesting, thanks!
> >> >>>>>
> >> >>>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
> >> >>>>> > Follow-up from last weeks discussion, I've been playing around
> >> >>>>> > with a
> >> >>>>> > simple
> >> >>>>> > column comparator for composite column names that I put up on
> >> >>>>> > github.  I'd
> >> >>>>> > be interested to hear what people think of this approach.
> >> >>>>> >
> >> >>>>> > http://github.com/edanuff/CassandraCompositeType
> >> >>>>> >
> >> >>>>> > Ed
> >> >>>>> >
> >> >>>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com>
> wrote:
> >> >>>>> >>
> >> >>>>> >> It might make sense to create a CompositeType subclass of
> >> >>>>> >> AbstractType for
> >> >>>>> >> the purpose of constructing and comparing these types of
> >> >>>>> >> "composite"
> >> >>>>> >> column
> >> >>>>> >> names so that if you could more easily do that sort of thing
> >> >>>>> >> rather
> >> >>>>> >> than
> >> >>>>> >> having to concatenate into one big string.
> >> >>>>> >>
> >> >>>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone
> >> >>>>> >> <mi...@simplegeo.com>
> >> >>>>> >> wrote:
> >> >>>>> >>>
> >> >>>>> >>> The only thing SuperColumns appear to buy you (as someone
> >> >>>>> >>> pointed
> >> >>>>> >>> out to
> >> >>>>> >>> me at the Cassandra meetup - I think it was Eric Florenzano)
> is
> >> >>>>> >>> that you can
> >> >>>>> >>> use different comparator types for the Super/SubColumns, I
> >> >>>>> >>> guess..?
> >> >>>>> >>> But you
> >> >>>>> >>> should be able to do the same thing by creating your own
> Column
> >> >>>>> >>> comparator.
> >> >>>>> >>> I guess my point is that SuperColumns are mostly a convenience
> >> >>>>> >>> mechanism, as
> >> >>>>> >>> far as I can tell.
> >> >>>>> >>> Mike
> >> >>>>> >
> >> >>>>> >
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> --
> >> >>>>> Jonathan Ellis
> >> >>>>> Project Chair, Apache Cassandra
> >> >>>>> co-founder of Riptano, the source for professional Cassandra
> support
> >> >>>>> http://riptano.com
> >> >>>>
> >> >>>
> >> >>
> >> >
> >> >
> >
> >
>



-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

Re: Is SuperColumn necessary?

Posted by Jonathan Shook <js...@gmail.com>.

Agreed

On Mon, May 10, 2010 at 12:01 PM, Mike Malone <mi...@simplegeo.com> wrote:
> On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook <js...@gmail.com> wrote:
>>
>> I have to disagree about the naming of things. The name of something
>> isn't just a literal identifier. It affects the way people think about
>> it. For new users, the whole naming thing has been a persistent
>> barrier.
>
> I'm saying we shouldn't be worried too much about coming up with names and
> analogies until we've decided what it is we're naming.
>
>>
>> As for your suggestions, I'm all for simplifying or generalizing the
>> "how it works" part down to a more generalized set of operations. I'm
>> not sure it's a good idea to require users to think in terms building
>> up a fluffy query structure just to thread it through a needle of an
>> API, even for the simplest of queries. At some point, the level of
>> generic boilerplate takes away from the semantic hand rails that
>> developers like. So I guess I'm suggesting that "how it works" and
>> "how we use it" are not always exactly the same. At least they should
>> both hinge on a common conceptual model, which is where the naming
>> becomes an important anchoring point.
>
> If things are done properly, client libraries could expose simplified query
> interfaces without much effort. Most ORMs these days work by building a
> propositional directed acyclic graph that's serialized to SQL. This would
> work the same way, but it wouldn't be converted into a 4GL.
> Mike
>
>>
>> Jonathan
>>
>> On Mon, May 10, 2010 at 11:37 AM, Mike Malone <mi...@simplegeo.com> wrote:
>> > Maybe... but honestly, it doesn't affect the architecture or interface
>> > at
>> > all. I'm more interested in thinking about how the system should work
>> > than
>> > what things are called. Naming things are important, but that can happen
>> > later.
>> > Does anyone have any thoughts or comments on the architecture I
>> > suggested
>> > earlier?
>> >
>> > Mike
>> >
>> > On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang <zs...@gmail.com>
>> > wrote:
>> >>
>> >> Yes, the "column" here is not appropriate.
>> >> Maybe we need not to create new terms, in Google's Bigtable, the term
>> >> "qualifier" is a good one.
>> >>
>> >> On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn <da...@lookin2.com>
>> >> wrote:
>> >>>
>> >>> That would be a good time to get rid of the confusing "column" term,
>> >>> which incorrectly suggests a two-dimensional tabular structure.
>> >>>
>> >>> Suggestions:
>> >>>
>> >>> 1. A hypercube (or hypocube, if only two dimensions): replace "key"
>> >>> and
>> >>> "column" with "1st dimension", "2nd dimension", etc.
>> >>>
>> >>> 2. A file system: replace "key" and "column" with "directory" and
>> >>> "subdirectory"
>> >>>
>> >>> 3. A tuple tree: "Column family" replaced by top-level tuple, whose
>> >>> value
>> >>> is the set of keys, whose value is the set of supercolumns of the key,
>> >>> whose
>> >>> value is the set of columns for the supercolumn, etc.
>> >>>
>> >>> 4. Etc.
>> >>>
>> >>> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mi...@simplegeo.com>
>> >>> wrote:
>> >>>>
>> >>>> Nice, Ed, we're doing something very similar but less generic.
>> >>>> Now replace all of the various methods for querying with a simple
>> >>>> query
>> >>>> interface that takes a Predicate, allow the user to specify (in
>> >>>> storage-conf) which levels of the nested Columns should be indexed,
>> >>>> and
>> >>>> completely remove Comparators and have people subclass Column /
>> >>>> implement
>> >>>> IColumn and we'd really be on to something ;).
>> >>>> Mock storage-conf.xml:
>> >>>>   <Column Name="ThingThatsNowKey" Indexed="True"
>> >>>> ClusterPartitioned="True" Type="UTF8">
>> >>>>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True"
>> >>>> Type="UTF8">
>> >>>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>> >>>>         <Column Name="ThingThatsNowColumnName" Indexed="True"
>> >>>> Type="ASCII">
>> >>>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>> >>>>         </Column>
>> >>>>       </Column>
>> >>>>     </Column>
>> >>>>   </Column>
>> >>>> Thrift:
>> >>>>   struct NamePredicate {
>> >>>>     1: required list<binary> column_names,
>> >>>>   }
>> >>>>   struct SlicePredicate {
>> >>>>     1: required binary start,
>> >>>>     2: required binary end,
>> >>>>   }
>> >>>>   struct CountPredicate {
>> >>>>     1: required struct predicate,
>> >>>>     2: required i32 count=100,
>> >>>>   }
>> >>>>   struct AndPredicate {
>> >>>>     1: required Predicate left,
>> >>>>     2: required Predicate right,
>> >>>>   }
>> >>>>   struct SubColumnsPredicate {
>> >>>>     1: required Predicate columns,
>> >>>>     2: required Predicate subcolumns,
>> >>>>   }
>> >>>>   ... OrPredicate, OtherUsefulPredicates ...
>> >>>>   query(predicate, count, consistency_level) # Count here would be
>> >>>> total
>> >>>> count of leaf values returned, whereas CountPredicate specifies a
>> >>>> column
>> >>>> count for a particular sub-slice.
>> >>>> Not fully baked... but I think this could really simplify stuff and
>> >>>> make
>> >>>> it more flexible. Downside is it may give people enough rope to hang
>> >>>> themselves, but at least the predicate stuff is easily distributable.
>> >>>> I'm thinking I'll play around with implementing some of this stuff
>> >>>> myself if I have any free time in the near future.
>> >>>> Mike
>> >>>>
>> >>>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <jb...@gmail.com>
>> >>>> wrote:
>> >>>>>
>> >>>>> Very interesting, thanks!
>> >>>>>
>> >>>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
>> >>>>> > Follow-up from last weeks discussion, I've been playing around
>> >>>>> > with a
>> >>>>> > simple
>> >>>>> > column comparator for composite column names that I put up on
>> >>>>> > github.  I'd
>> >>>>> > be interested to hear what people think of this approach.
>> >>>>> >
>> >>>>> > http://github.com/edanuff/CassandraCompositeType
>> >>>>> >
>> >>>>> > Ed
>> >>>>> >
>> >>>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com> wrote:
>> >>>>> >>
>> >>>>> >> It might make sense to create a CompositeType subclass of
>> >>>>> >> AbstractType for
>> >>>>> >> the purpose of constructing and comparing these types of
>> >>>>> >> "composite"
>> >>>>> >> column
>> >>>>> >> names so that if you could more easily do that sort of thing
>> >>>>> >> rather
>> >>>>> >> than
>> >>>>> >> having to concatenate into one big string.
>> >>>>> >>
>> >>>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone
>> >>>>> >> <mi...@simplegeo.com>
>> >>>>> >> wrote:
>> >>>>> >>>
>> >>>>> >>> The only thing SuperColumns appear to buy you (as someone
>> >>>>> >>> pointed
>> >>>>> >>> out to
>> >>>>> >>> me at the Cassandra meetup - I think it was Eric Florenzano) is
>> >>>>> >>> that you can
>> >>>>> >>> use different comparator types for the Super/SubColumns, I
>> >>>>> >>> guess..?
>> >>>>> >>> But you
>> >>>>> >>> should be able to do the same thing by creating your own Column
>> >>>>> >>> comparator.
>> >>>>> >>> I guess my point is that SuperColumns are mostly a convenience
>> >>>>> >>> mechanism, as
>> >>>>> >>> far as I can tell.
>> >>>>> >>> Mike
>> >>>>> >
>> >>>>> >
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>> Jonathan Ellis
>> >>>>> Project Chair, Apache Cassandra
>> >>>>> co-founder of Riptano, the source for professional Cassandra support
>> >>>>> http://riptano.com
>> >>>>
>> >>>
>> >>
>> >
>> >
>
>

Re: Is SuperColumn necessary?

Posted by Mike Malone <mi...@simplegeo.com>.

On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook <js...@gmail.com> wrote:

> I have to disagree about the naming of things. The name of something
> isn't just a literal identifier. It affects the way people think about
> it. For new users, the whole naming thing has been a persistent
> barrier.
>

I'm saying we shouldn't be worried too much about coming up with names and
analogies until we've decided what it is we're naming.


> As for your suggestions, I'm all for simplifying or generalizing the
> "how it works" part down to a more generalized set of operations. I'm
> not sure it's a good idea to require users to think in terms building
> up a fluffy query structure just to thread it through a needle of an
> API, even for the simplest of queries. At some point, the level of
> generic boilerplate takes away from the semantic hand rails that
> developers like. So I guess I'm suggesting that "how it works" and
> "how we use it" are not always exactly the same. At least they should
> both hinge on a common conceptual model, which is where the naming
> becomes an important anchoring point.
>

If things are done properly, client libraries could expose simplified query
interfaces without much effort. Most ORMs these days work by building a
propositional directed acyclic graph that's serialized to SQL. This would
work the same way, but it wouldn't be converted into a 4GL.
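
To make the client-library point concrete, here is a minimal sketch of how a
wrapper might build a predicate tree behind a simple call. It assumes
predicate types shaped like the mock Thrift structs quoted below; the Python
class names and the get() helper are hypothetical and are not part of any
existing Cassandra client or API.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class NamePredicate:
    # Matches an explicit set of names at one level of nesting.
    column_names: Tuple[bytes, ...]


@dataclass(frozen=True)
class SlicePredicate:
    # Matches a contiguous range of names at one level of nesting.
    start: bytes
    end: bytes


@dataclass(frozen=True)
class SubColumnsPredicate:
    # Applies `columns` at this level and `subcolumns` one level down.
    columns: "Predicate"
    subcolumns: "Predicate"


Predicate = Union[NamePredicate, SlicePredicate, SubColumnsPredicate]


def get(key: bytes, family: bytes, *names: bytes) -> Predicate:
    """Hypothetical convenience call: fetch named columns of one row.

    Expands to nested predicates: key -> column family -> column names.
    """
    return SubColumnsPredicate(
        columns=NamePredicate((key,)),
        subcolumns=SubColumnsPredicate(
            columns=NamePredicate((family,)),
            subcolumns=NamePredicate(tuple(names)),
        ),
    )
```

A caller would write get(b"user1", b"Profile", b"name", b"email") and never
see the nesting, while the server would still receive one generic predicate.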

Mike


>
> Jonathan
>
> On Mon, May 10, 2010 at 11:37 AM, Mike Malone <mi...@simplegeo.com> wrote:
> > Maybe... but honestly, it doesn't affect the architecture or interface at
> > all. I'm more interested in thinking about how the system should work
> than
> > what things are called. Naming things are important, but that can happen
> > later.
> > Does anyone have any thoughts or comments on the architecture I suggested
> > earlier?
> >
> > Mike
> >
> > On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang <zs...@gmail.com>
> wrote:
> >>
> >> Yes, the "column" here is not appropriate.
> >> Maybe we need not to create new terms, in Google's Bigtable, the term
> >> "qualifier" is a good one.
> >>
> >> On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn <da...@lookin2.com>
> wrote:
> >>>
> >>> That would be a good time to get rid of the confusing "column" term,
> >>> which incorrectly suggests a two-dimensional tabular structure.
> >>>
> >>> Suggestions:
> >>>
> >>> 1. A hypercube (or hypocube, if only two dimensions): replace "key" and
> >>> "column" with "1st dimension", "2nd dimension", etc.
> >>>
> >>> 2. A file system: replace "key" and "column" with "directory" and
> >>> "subdirectory"
> >>>
> >>> 3. A tuple tree: "Column family" replaced by top-level tuple, whose
> value
> >>> is the set of keys, whose value is the set of supercolumns of the key,
> whose
> >>> value is the set of columns for the supercolumn, etc.
> >>>
> >>> 4. Etc.
> >>>
> >>> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mi...@simplegeo.com>
> wrote:
> >>>>
> >>>> Nice, Ed, we're doing something very similar but less generic.
> >>>> Now replace all of the various methods for querying with a simple
> query
> >>>> interface that takes a Predicate, allow the user to specify (in
> >>>> storage-conf) which levels of the nested Columns should be indexed,
> and
> >>>> completely remove Comparators and have people subclass Column /
> implement
> >>>> IColumn and we'd really be on to something ;).
> >>>> Mock storage-conf.xml:
> >>>>   <Column Name="ThingThatsNowKey" Indexed="True"
> >>>> ClusterPartitioned="True" Type="UTF8">
> >>>>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True"
> >>>> Type="UTF8">
> >>>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
> >>>>         <Column Name="ThingThatsNowColumnName" Indexed="True"
> >>>> Type="ASCII">
> >>>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
> >>>>         </Column>
> >>>>       </Column>
> >>>>     </Column>
> >>>>   </Column>
> >>>> Thrift:
> >>>>   struct NamePredicate {
> >>>>     1: required list<binary> column_names,
> >>>>   }
> >>>>   struct SlicePredicate {
> >>>>     1: required binary start,
> >>>>     2: required binary end,
> >>>>   }
> >>>>   struct CountPredicate {
> >>>>     1: required struct predicate,
> >>>>     2: required i32 count=100,
> >>>>   }
> >>>>   struct AndPredicate {
> >>>>     1: required Predicate left,
> >>>>     2: required Predicate right,
> >>>>   }
> >>>>   struct SubColumnsPredicate {
> >>>>     1: required Predicate columns,
> >>>>     2: required Predicate subcolumns,
> >>>>   }
> >>>>   ... OrPredicate, OtherUsefulPredicates ...
> >>>>   query(predicate, count, consistency_level) # Count here would be
> total
> >>>> count of leaf values returned, whereas CountPredicate specifies a
> column
> >>>> count for a particular sub-slice.
> >>>> Not fully baked... but I think this could really simplify stuff and
> make
> >>>> it more flexible. Downside is it may give people enough rope to hang
> >>>> themselves, but at least the predicate stuff is easily distributable.
> >>>> I'm thinking I'll play around with implementing some of this stuff
> >>>> myself if I have any free time in the near future.
> >>>> Mike
> >>>>
> >>>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <jb...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> Very interesting, thanks!
> >>>>>
> >>>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
> >>>>> > Follow-up from last weeks discussion, I've been playing around with
> a
> >>>>> > simple
> >>>>> > column comparator for composite column names that I put up on
> >>>>> > github.  I'd
> >>>>> > be interested to hear what people think of this approach.
> >>>>> >
> >>>>> > http://github.com/edanuff/CassandraCompositeType
> >>>>> >
> >>>>> > Ed
> >>>>> >
> >>>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com> wrote:
> >>>>> >>
> >>>>> >> It might make sense to create a CompositeType subclass of
> >>>>> >> AbstractType for
> >>>>> >> the purpose of constructing and comparing these types of
> "composite"
> >>>>> >> column
> >>>>> >> names so that if you could more easily do that sort of thing
> rather
> >>>>> >> than
> >>>>> >> having to concatenate into one big string.
> >>>>> >>
> >>>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone <mike@simplegeo.com
> >
> >>>>> >> wrote:
> >>>>> >>>
> >>>>> >>> The only thing SuperColumns appear to buy you (as someone pointed
> >>>>> >>> out to
> >>>>> >>> me at the Cassandra meetup - I think it was Eric Florenzano) is
> >>>>> >>> that you can
> >>>>> >>> use different comparator types for the Super/SubColumns, I
> guess..?
> >>>>> >>> But you
> >>>>> >>> should be able to do the same thing by creating your own Column
> >>>>> >>> comparator.
> >>>>> >>> I guess my point is that SuperColumns are mostly a convenience
> >>>>> >>> mechanism, as
> >>>>> >>> far as I can tell.
> >>>>> >>> Mike
> >>>>> >
> >>>>> >
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Jonathan Ellis
> >>>>> Project Chair, Apache Cassandra
> >>>>> co-founder of Riptano, the source for professional Cassandra support
> >>>>> http://riptano.com
> >>>>
> >>>
> >>
> >
> >
>

Re: Is SuperColumn necessary?

Posted by Jonathan Shook <js...@gmail.com>.

I have to disagree about the naming of things. The name of something
isn't just a literal identifier. It affects the way people think about
it. For new users, the whole naming thing has been a persistent
barrier.

As for your suggestions, I'm all for simplifying or generalizing the
"how it works" part down to a more generalized set of operations. I'm
not sure it's a good idea to require users to think in terms of building
up a fluffy query structure just to thread it through a needle of an
API, even for the simplest of queries. At some point, the level of
generic boilerplate takes away from the semantic hand rails that
developers like. So I guess I'm suggesting that "how it works" and
"how we use it" are not always exactly the same. At least they should
both hinge on a common conceptual model, which is where the naming
becomes an important anchoring point.

Jonathan

On Mon, May 10, 2010 at 11:37 AM, Mike Malone <mi...@simplegeo.com> wrote:
> Maybe... but honestly, it doesn't affect the architecture or interface at
> all. I'm more interested in thinking about how the system should work than
> what things are called. Naming things are important, but that can happen
> later.
> Does anyone have any thoughts or comments on the architecture I suggested
> earlier?
>
> Mike
>
> On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang <zs...@gmail.com> wrote:
>>
>> Yes, the "column" here is not appropriate.
>> Maybe we need not to create new terms, in Google's Bigtable, the term
>> "qualifier" is a good one.
>>
>> On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn <da...@lookin2.com> wrote:
>>>
>>> That would be a good time to get rid of the confusing "column" term,
>>> which incorrectly suggests a two-dimensional tabular structure.
>>>
>>> Suggestions:
>>>
>>> 1. A hypercube (or hypocube, if only two dimensions): replace "key" and
>>> "column" with "1st dimension", "2nd dimension", etc.
>>>
>>> 2. A file system: replace "key" and "column" with "directory" and
>>> "subdirectory"
>>>
>>> 3. A tuple tree: "Column family" replaced by top-level tuple, whose value
>>> is the set of keys, whose value is the set of supercolumns of the key, whose
>>> value is the set of columns for the supercolumn, etc.
>>>
>>> 4. Etc.
>>>
>>> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mi...@simplegeo.com> wrote:
>>>>
>>>> Nice, Ed, we're doing something very similar but less generic.
>>>> Now replace all of the various methods for querying with a simple query
>>>> interface that takes a Predicate, allow the user to specify (in
>>>> storage-conf) which levels of the nested Columns should be indexed, and
>>>> completely remove Comparators and have people subclass Column / implement
>>>> IColumn and we'd really be on to something ;).
>>>> Mock storage-conf.xml:
>>>>   <Column Name="ThingThatsNowKey" Indexed="True"
>>>> ClusterPartitioned="True" Type="UTF8">
>>>>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True"
>>>> Type="UTF8">
>>>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>>>>         <Column Name="ThingThatsNowColumnName" Indexed="True"
>>>> Type="ASCII">
>>>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>>>>         </Column>
>>>>       </Column>
>>>>     </Column>
>>>>   </Column>
>>>> Thrift:
>>>>   struct NamePredicate {
>>>>     1: required list<binary> column_names,
>>>>   }
>>>>   struct SlicePredicate {
>>>>     1: required binary start,
>>>>     2: required binary end,
>>>>   }
>>>>   struct CountPredicate {
>>>>     1: required struct predicate,
>>>>     2: required i32 count=100,
>>>>   }
>>>>   struct AndPredicate {
>>>>     1: required Predicate left,
>>>>     2: required Predicate right,
>>>>   }
>>>>   struct SubColumnsPredicate {
>>>>     1: required Predicate columns,
>>>>     2: required Predicate subcolumns,
>>>>   }
>>>>   ... OrPredicate, OtherUsefulPredicates ...
>>>>   query(predicate, count, consistency_level) # Count here would be total
>>>> count of leaf values returned, whereas CountPredicate specifies a column
>>>> count for a particular sub-slice.
>>>> Not fully baked... but I think this could really simplify stuff and make
>>>> it more flexible. Downside is it may give people enough rope to hang
>>>> themselves, but at least the predicate stuff is easily distributable.
>>>> I'm thinking I'll play around with implementing some of this stuff
>>>> myself if I have any free time in the near future.
>>>> Mike
>>>>
>>>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <jb...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Very interesting, thanks!
>>>>>
>>>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
>>>>> > Follow-up from last weeks discussion, I've been playing around with a
>>>>> > simple
>>>>> > column comparator for composite column names that I put up on
>>>>> > github.  I'd
>>>>> > be interested to hear what people think of this approach.
>>>>> >
>>>>> > http://github.com/edanuff/CassandraCompositeType
>>>>> >
>>>>> > Ed
>>>>> >
>>>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com> wrote:
>>>>> >>
>>>>> >> It might make sense to create a CompositeType subclass of
>>>>> >> AbstractType for
>>>>> >> the purpose of constructing and comparing these types of "composite"
>>>>> >> column
>>>>> >> names so that if you could more easily do that sort of thing rather
>>>>> >> than
>>>>> >> having to concatenate into one big string.
>>>>> >>
>>>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone <mi...@simplegeo.com>
>>>>> >> wrote:
>>>>> >>>
>>>>> >>> The only thing SuperColumns appear to buy you (as someone pointed
>>>>> >>> out to
>>>>> >>> me at the Cassandra meetup - I think it was Eric Florenzano) is
>>>>> >>> that you can
>>>>> >>> use different comparator types for the Super/SubColumns, I guess..?
>>>>> >>> But you
>>>>> >>> should be able to do the same thing by creating your own Column
>>>>> >>> comparator.
>>>>> >>> I guess my point is that SuperColumns are mostly a convenience
>>>>> >>> mechanism, as
>>>>> >>> far as I can tell.
>>>>> >>> Mike
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jonathan Ellis
>>>>> Project Chair, Apache Cassandra
>>>>> co-founder of Riptano, the source for professional Cassandra support
>>>>> http://riptano.com
>>>>
>>>
>>
>
>

Re: Is SuperColumn necessary?

Posted by Mike Malone <mi...@simplegeo.com>.

Maybe... but honestly, it doesn't affect the architecture or interface at
all. I'm more interested in thinking about how the system should work than
what things are called. Naming things is important, but that can happen
later.

Does anyone have any thoughts or comments on the architecture I suggested
earlier?

Mike

On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang <zs...@gmail.com> wrote:

> Yes, the "column" here is not appropriate.
> Maybe we need not to create new terms, in Google's Bigtable, the term
> "qualifier" is a good one.
>
>
> On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn <da...@lookin2.com> wrote:
>
>> That would be a good time to get rid of the confusing "column" term, which
>> incorrectly suggests a two-dimensional tabular structure.
>>
>> Suggestions:
>>
>> 1. A hypercube (or hypocube, if only two dimensions): replace "key" and
>> "column" with "1st dimension", "2nd dimension", etc.
>>
>> 2. A file system: replace "key" and "column" with "directory" and
>> "subdirectory"
>>
>> 3. A tuple tree: "Column family" replaced by top-level tuple, whose value
>> is the set of keys, whose value is the set of supercolumns of the key, whose
>> value is the set of columns for the supercolumn, etc.
>>
>> 4. Etc.
>>
>> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mi...@simplegeo.com> wrote:
>>
>>> Nice, Ed, we're doing something very similar but less generic.
>>>
>>> Now replace all of the various methods for querying with a simple query
>>> interface that takes a Predicate, allow the user to specify (in
>>> storage-conf) which levels of the nested Columns should be indexed, and
>>> completely remove Comparators and have people subclass Column / implement
>>> IColumn and we'd really be on to something ;).
>>>
>>> Mock storage-conf.xml:
>>>   <Column Name="ThingThatsNowKey" Indexed="True"
>>> ClusterPartitioned="True" Type="UTF8">
>>>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True"
>>> Type="UTF8">
>>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>>>         <Column Name="ThingThatsNowColumnName" Indexed="True"
>>> Type="ASCII">
>>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>>>         </Column>
>>>       </Column>
>>>     </Column>
>>>   </Column>
>>>
>>> Thrift:
>>>   struct NamePredicate {
>>>     1: required list<binary> column_names,
>>>   }
>>>   struct SlicePredicate {
>>>     1: required binary start,
>>>     2: required binary end,
>>>   }
>>>   struct CountPredicate {
>>>     1: required struct predicate,
>>>     2: required i32 count=100,
>>>   }
>>>   struct AndPredicate {
>>>     1: required Predicate left,
>>>     2: required Predicate right,
>>>   }
>>>   struct SubColumnsPredicate {
>>>     1: required Predicate columns,
>>>     2: required Predicate subcolumns,
>>>   }
>>>   ... OrPredicate, OtherUsefulPredicates ...
>>>   query(predicate, count, consistency_level) # Count here would be total
>>> count of leaf values returned, whereas CountPredicate specifies a column
>>> count for a particular sub-slice.
>>>
>>> Not fully baked... but I think this could really simplify stuff and make
>>> it more flexible. Downside is it may give people enough rope to hang
>>> themselves, but at least the predicate stuff is easily distributable.
>>>
>>> I'm thinking I'll play around with implementing some of this stuff myself
>>> if I have any free time in the near future.
>>>
>>> Mike
>>>
>>>
>>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <jb...@gmail.com>wrote:
>>>
>>>> Very interesting, thanks!
>>>>
>>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
>>>> > Follow-up from last weeks discussion, I've been playing around with a
>>>> simple
>>>> > column comparator for composite column names that I put up on github.
>>>> I'd
>>>> > be interested to hear what people think of this approach.
>>>> >
>>>> > http://github.com/edanuff/CassandraCompositeType
>>>> >
>>>> > Ed
>>>> >
>>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com> wrote:
>>>> >>
>>>> >> It might make sense to create a CompositeType subclass of
>>>> AbstractType for
>>>> >> the purpose of constructing and comparing these types of "composite"
>>>> column
>>>> >> names so that if you could more easily do that sort of thing rather
>>>> than
>>>> >> having to concatenate into one big string.
>>>> >>
>>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone <mi...@simplegeo.com>
>>>> wrote:
>>>> >>>
>>>> >>> The only thing SuperColumns appear to buy you (as someone pointed
>>>> out to
>>>> >>> me at the Cassandra meetup - I think it was Eric Florenzano) is that
>>>> you can
>>>> >>> use different comparator types for the Super/SubColumns, I guess..?
>>>> But you
>>>> >>> should be able to do the same thing by creating your own Column
>>>> comparator.
>>>> >>> I guess my point is that SuperColumns are mostly a convenience
>>>> mechanism, as
>>>> >>> far as I can tell.
>>>> >>> Mike
>>>> >
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Jonathan Ellis
>>>> Project Chair, Apache Cassandra
>>>> co-founder of Riptano, the source for professional Cassandra support
>>>> http://riptano.com
>>>>
>>>
>>>
>>
>

Re: Is SuperColumn necessary?

Posted by Schubert Zhang <zs...@gmail.com>.

Yes, the term "column" is not appropriate here.
Maybe we don't need to create new terms; Google's Bigtable uses the term
"qualifier", which is a good one.

On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn <da...@lookin2.com> wrote:

> That would be a good time to get rid of the confusing "column" term, which
> incorrectly suggests a two-dimensional tabular structure.
>
> Suggestions:
>
> 1. A hypercube (or hypocube, if only two dimensions): replace "key" and
> "column" with "1st dimension", "2nd dimension", etc.
>
> 2. A file system: replace "key" and "column" with "directory" and
> "subdirectory"
>
> 3. A tuple tree: "Column family" replaced by top-level tuple, whose value
> is the set of keys, whose value is the set of supercolumns of the key, whose
> value is the set of columns for the supercolumn, etc.
>
> 4. Etc.
>
> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mi...@simplegeo.com> wrote:
>
>> Nice, Ed, we're doing something very similar but less generic.
>>
>> Now replace all of the various methods for querying with a simple query
>> interface that takes a Predicate, allow the user to specify (in
>> storage-conf) which levels of the nested Columns should be indexed, and
>> completely remove Comparators and have people subclass Column / implement
>> IColumn and we'd really be on to something ;).
>>
>> Mock storage-conf.xml:
>>   <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True"
>> Type="UTF8">
>>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True"
>> Type="UTF8">
>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>>         <Column Name="ThingThatsNowColumnName" Indexed="True"
>> Type="ASCII">
>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>>         </Column>
>>       </Column>
>>     </Column>
>>   </Column>
>>
>> Thrift:
>>   struct NamePredicate {
>>     1: required list<binary> column_names,
>>   }
>>   struct SlicePredicate {
>>     1: required binary start,
>>     2: required binary end,
>>   }
>>   struct CountPredicate {
>>     1: required struct predicate,
>>     2: required i32 count=100,
>>   }
>>   struct AndPredicate {
>>     1: required Predicate left,
>>     2: required Predicate right,
>>   }
>>   struct SubColumnsPredicate {
>>     1: required Predicate columns,
>>     2: required Predicate subcolumns,
>>   }
>>   ... OrPredicate, OtherUsefulPredicates ...
>>   query(predicate, count, consistency_level) # Count here would be total
>> count of leaf values returned, whereas CountPredicate specifies a column
>> count for a particular sub-slice.
>>
>> Not fully baked... but I think this could really simplify stuff and make
>> it more flexible. Downside is it may give people enough rope to hang
>> themselves, but at least the predicate stuff is easily distributable.
>>
>> I'm thinking I'll play around with implementing some of this stuff myself
>> if I have any free time in the near future.
>>
>> Mike
>>
>>
>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>
>>> Very interesting, thanks!
>>>
>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
>>> > Follow-up from last weeks discussion, I've been playing around with a
>>> simple
>>> > column comparator for composite column names that I put up on github.
>>> I'd
>>> > be interested to hear what people think of this approach.
>>> >
>>> > http://github.com/edanuff/CassandraCompositeType
>>> >
>>> > Ed
>>> >
>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com> wrote:
>>> >>
>>> >> It might make sense to create a CompositeType subclass of AbstractType
>>> for
>>> >> the purpose of constructing and comparing these types of "composite"
>>> column
>>> >> names so that if you could more easily do that sort of thing rather
>>> than
>>> >> having to concatenate into one big string.
>>> >>
>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone <mi...@simplegeo.com>
>>> wrote:
>>> >>>
>>> >>> The only thing SuperColumns appear to buy you (as someone pointed out
>>> to
>>> >>> me at the Cassandra meetup - I think it was Eric Florenzano) is that
>>> you can
>>> >>> use different comparator types for the Super/SubColumns, I guess..?
>>> But you
>>> >>> should be able to do the same thing by creating your own Column
>>> comparator.
>>> >>> I guess my point is that SuperColumns are mostly a convenience
>>> mechanism, as
>>> >>> far as I can tell.
>>> >>> Mike
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of Riptano, the source for professional Cassandra support
>>> http://riptano.com
>>>
>>
>>
>

Re: Is SuperColumn necessary?

Posted by philip andrew <ph...@gmail.com>.

Please create a new term if the existing terms are misleading. If it's
not a file system, then it's not good to call it a file system.

On Thu, May 6, 2010 at 3:50 PM, Torsten Curdt <tc...@vafer.org> wrote:

> +1 on all of that
>
> On Thu, May 6, 2010 at 09:04, David Boxenhorn <da...@lookin2.com> wrote:
> > That would be a good time to get rid of the confusing "column" term,
> which
> > incorrectly suggests a two-dimensional tabular structure.
> >
> > Suggestions:
> >
> > 1. A hypercube (or hypocube, if only two dimensions): replace "key" and
> > "column" with "1st dimension", "2nd dimension", etc.
> >
> > 2. A file system: replace "key" and "column" with "directory" and
> > "subdirectory"
> >
> > 3. A tuple tree: "Column family" replaced by top-level tuple, whose value
> is
> > the set of keys, whose value is the set of supercolumns of the key, whose
> > value is the set of columns for the supercolumn, etc.
> >
> > 4. Etc.
> >
> > On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mi...@simplegeo.com> wrote:
> >>
> >> Nice, Ed, we're doing something very similar but less generic.
> >> Now replace all of the various methods for querying with a simple query
> >> interface that takes a Predicate, allow the user to specify (in
> >> storage-conf) which levels of the nested Columns should be indexed, and
> >> completely remove Comparators and have people subclass Column /
> implement
> >> IColumn and we'd really be on to something ;).
> >> Mock storage-conf.xml:
> >>   <Column Name="ThingThatsNowKey" Indexed="True"
> ClusterPartitioned="True"
> >> Type="UTF8">
> >>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True"
> >> Type="UTF8">
> >>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
> >>         <Column Name="ThingThatsNowColumnName" Indexed="True"
> >> Type="ASCII">
> >>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
> >>         </Column>
> >>       </Column>
> >>     </Column>
> >>   </Column>
> >> Thrift:
> >>   struct NamePredicate {
> >>     1: required list<binary> column_names,
> >>   }
> >>   struct SlicePredicate {
> >>     1: required binary start,
> >>     2: required binary end,
> >>   }
> >>   struct CountPredicate {
> >>     1: required struct predicate,
> >>     2: required i32 count=100,
> >>   }
> >>   struct AndPredicate {
> >>     1: required Predicate left,
> >>     2: required Predicate right,
> >>   }
> >>   struct SubColumnsPredicate {
> >>     1: required Predicate columns,
> >>     2: required Predicate subcolumns,
> >>   }
> >>   ... OrPredicate, OtherUsefulPredicates ...
> >>   query(predicate, count, consistency_level) # Count here would be total
> >> count of leaf values returned, whereas CountPredicate specifies a column
> >> count for a particular sub-slice.
> >> Not fully baked... but I think this could really simplify stuff and make
> >> it more flexible. Downside is it may give people enough rope to hang
> >> themselves, but at least the predicate stuff is easily distributable.
> >> I'm thinking I'll play around with implementing some of this stuff
> myself
> >> if I have any free time in the near future.
> >> Mike
> >>
> >> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <jb...@gmail.com>
> wrote:
> >>>
> >>> Very interesting, thanks!
> >>>
> >>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
> >>> > Follow-up from last weeks discussion, I've been playing around with a
> >>> > simple
> >>> > column comparator for composite column names that I put up on github.
> >>> > I'd
> >>> > be interested to hear what people think of this approach.
> >>> >
> >>> > http://github.com/edanuff/CassandraCompositeType
> >>> >
> >>> > Ed
> >>> >
> >>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com> wrote:
> >>> >>
> >>> >> It might make sense to create a CompositeType subclass of
> AbstractType
> >>> >> for
> >>> >> the purpose of constructing and comparing these types of "composite"
> >>> >> column
> >>> >> names so that if you could more easily do that sort of thing rather
> >>> >> than
> >>> >> having to concatenate into one big string.
> >>> >>
> >>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone <mi...@simplegeo.com>
> >>> >> wrote:
> >>> >>>
> >>> >>> The only thing SuperColumns appear to buy you (as someone pointed
> out
> >>> >>> to
> >>> >>> me at the Cassandra meetup - I think it was Eric Florenzano) is
> that
> >>> >>> you can
> >>> >>> use different comparator types for the Super/SubColumns, I guess..?
> >>> >>> But you
> >>> >>> should be able to do the same thing by creating your own Column
> >>> >>> comparator.
> >>> >>> I guess my point is that SuperColumns are mostly a convenience
> >>> >>> mechanism, as
> >>> >>> far as I can tell.
> >>> >>> Mike
> >>> >
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Jonathan Ellis
> >>> Project Chair, Apache Cassandra
> >>> co-founder of Riptano, the source for professional Cassandra support
> >>> http://riptano.com
> >>
> >
> >
>

Re: Is SuperColumn necessary?

Posted by Torsten Curdt <tc...@vafer.org>.

+1 on all of that

On Thu, May 6, 2010 at 09:04, David Boxenhorn <da...@lookin2.com> wrote:
> That would be a good time to get rid of the confusing "column" term, which
> incorrectly suggests a two-dimensional tabular structure.
>
> Suggestions:
>
> 1. A hypercube (or hypocube, if only two dimensions): replace "key" and
> "column" with "1st dimension", "2nd dimension", etc.
>
> 2. A file system: replace "key" and "column" with "directory" and
> "subdirectory"
>
> 3. A tuple tree: "Column family" replaced by top-level tuple, whose value is
> the set of keys, whose value is the set of supercolumns of the key, whose
> value is the set of columns for the supercolumn, etc.
>
> 4. Etc.
>
> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mi...@simplegeo.com> wrote:
>>
>> Nice, Ed, we're doing something very similar but less generic.
>> Now replace all of the various methods for querying with a simple query
>> interface that takes a Predicate, allow the user to specify (in
>> storage-conf) which levels of the nested Columns should be indexed, and
>> completely remove Comparators and have people subclass Column / implement
>> IColumn and we'd really be on to something ;).
>> Mock storage-conf.xml:
>>   <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True"
>> Type="UTF8">
>>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True"
>> Type="UTF8">
>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>>         <Column Name="ThingThatsNowColumnName" Indexed="True"
>> Type="ASCII">
>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>>         </Column>
>>       </Column>
>>     </Column>
>>   </Column>
>> Thrift:
>>   struct NamePredicate {
>>     1: required list<binary> column_names,
>>   }
>>   struct SlicePredicate {
>>     1: required binary start,
>>     2: required binary end,
>>   }
>>   struct CountPredicate {
>>     1: required struct predicate,
>>     2: required i32 count=100,
>>   }
>>   struct AndPredicate {
>>     1: required Predicate left,
>>     2: required Predicate right,
>>   }
>>   struct SubColumnsPredicate {
>>     1: required Predicate columns,
>>     2: required Predicate subcolumns,
>>   }
>>   ... OrPredicate, OtherUsefulPredicates ...
>>   query(predicate, count, consistency_level) # Count here would be total
>> count of leaf values returned, whereas CountPredicate specifies a column
>> count for a particular sub-slice.
>> Not fully baked... but I think this could really simplify stuff and make
>> it more flexible. Downside is it may give people enough rope to hang
>> themselves, but at least the predicate stuff is easily distributable.
>> I'm thinking I'll play around with implementing some of this stuff myself
>> if I have any free time in the near future.
>> Mike
>>
>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>>
>>> Very interesting, thanks!
>>>
>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
>>> > Follow-up from last weeks discussion, I've been playing around with a
>>> > simple
>>> > column comparator for composite column names that I put up on github.
>>> > I'd
>>> > be interested to hear what people think of this approach.
>>> >
>>> > http://github.com/edanuff/CassandraCompositeType
>>> >
>>> > Ed
>>> >
>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com> wrote:
>>> >>
>>> >> It might make sense to create a CompositeType subclass of AbstractType
>>> >> for
>>> >> the purpose of constructing and comparing these types of "composite"
>>> >> column
>>> >> names so that if you could more easily do that sort of thing rather
>>> >> than
>>> >> having to concatenate into one big string.
>>> >>
>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone <mi...@simplegeo.com>
>>> >> wrote:
>>> >>>
>>> >>> The only thing SuperColumns appear to buy you (as someone pointed out
>>> >>> to
>>> >>> me at the Cassandra meetup - I think it was Eric Florenzano) is that
>>> >>> you can
>>> >>> use different comparator types for the Super/SubColumns, I guess..?
>>> >>> But you
>>> >>> should be able to do the same thing by creating your own Column
>>> >>> comparator.
>>> >>> I guess my point is that SuperColumns are mostly a convenience
>>> >>> mechanism, as
>>> >>> far as I can tell.
>>> >>> Mike
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of Riptano, the source for professional Cassandra support
>>> http://riptano.com
>>
>
>

Re: Is SuperColumn necessary?

Posted by David Boxenhorn <da...@lookin2.com>.

That would be a good time to get rid of the confusing "column" term, which
incorrectly suggests a two-dimensional tabular structure.

Suggestions:

1. A hypercube (or hypocube, if only two dimensions): replace "key" and
"column" with "1st dimension", "2nd dimension", etc.

2. A file system: replace "key" and "column" with "directory" and
"subdirectory"

3. A tuple tree: "Column family" replaced by top-level tuple, whose value is
the set of keys, whose value is the set of supercolumns of the key, whose
value is the set of columns for the supercolumn, etc.

4. Etc.
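
To make suggestion 3 concrete, here is a minimal sketch of the "tuple tree" view written as nested Python dicts. The row key, supercolumn names, and values are invented for illustration, and `lookup` and `depth` are hypothetical helpers, not anything in Cassandra.

```python
# Hypothetical data, for illustration only: one super column family
# viewed as a tree of nested dicts.
column_family = {
    "user:42": {                 # key
        "address": {             # supercolumn
            "city": "Paris",     # column -> value
            "zip": "75001",
        },
        "phone": {
            "home": "555-0100",
        },
    },
}


def lookup(tree, *path):
    """Walk the tree one level per path element and return what is found."""
    node = tree
    for name in path:
        node = node[name]
    return node


def depth(tree):
    """Number of nested levels below this node (0 for a leaf value)."""
    if not isinstance(tree, dict) or not tree:
        return 0
    return 1 + max(depth(child) for child in tree.values())
```

In this view, "key", "supercolumn" and "column" are just the names of levels 1, 2 and 3 of the same structure, which is the point of the suggestion.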

On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mi...@simplegeo.com> wrote:

> Nice, Ed, we're doing something very similar but less generic.
>
> Now replace all of the various methods for querying with a simple query
> interface that takes a Predicate, allow the user to specify (in
> storage-conf) which levels of the nested Columns should be indexed, and
> completely remove Comparators and have people subclass Column / implement
> IColumn and we'd really be on to something ;).
>
> Mock storage-conf.xml:
>   <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True"
> Type="UTF8">
>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True"
> Type="UTF8">
>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>         <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>         </Column>
>       </Column>
>     </Column>
>   </Column>
>
> Thrift:
>   struct NamePredicate {
>     1: required list<binary> column_names,
>   }
>   struct SlicePredicate {
>     1: required binary start,
>     2: required binary end,
>   }
>   struct CountPredicate {
>     1: required struct predicate,
>     2: required i32 count=100,
>   }
>   struct AndPredicate {
>     1: required Predicate left,
>     2: required Predicate right,
>   }
>   struct SubColumnsPredicate {
>     1: required Predicate columns,
>     2: required Predicate subcolumns,
>   }
>   ... OrPredicate, OtherUsefulPredicates ...
>   query(predicate, count, consistency_level) # Count here would be total
> count of leaf values returned, whereas CountPredicate specifies a column
> count for a particular sub-slice.
>
> Not fully baked... but I think this could really simplify stuff and make it
> more flexible. Downside is it may give people enough rope to hang
> themselves, but at least the predicate stuff is easily distributable.
>
> I'm thinking I'll play around with implementing some of this stuff myself
> if I have any free time in the near future.
>
> Mike
>
>
> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>
>> Very interesting, thanks!
>>
>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
>> > Follow-up from last weeks discussion, I've been playing around with a
>> simple
>> > column comparator for composite column names that I put up on github.
>> I'd
>> > be interested to hear what people think of this approach.
>> >
>> > http://github.com/edanuff/CassandraCompositeType
>> >
>> > Ed
>> >
>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com> wrote:
>> >>
>> >> It might make sense to create a CompositeType subclass of AbstractType
>> for
>> >> the purpose of constructing and comparing these types of "composite"
>> column
>> >> names so that if you could more easily do that sort of thing rather
>> than
>> >> having to concatenate into one big string.
>> >>
>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone <mi...@simplegeo.com>
>> wrote:
>> >>>
>> >>> The only thing SuperColumns appear to buy you (as someone pointed out
>> to
>> >>> me at the Cassandra meetup - I think it was Eric Florenzano) is that
>> you can
>> >>> use different comparator types for the Super/SubColumns, I guess..?
>> But you
>> >>> should be able to do the same thing by creating your own Column
>> comparator.
>> >>> I guess my point is that SuperColumns are mostly a convenience
>> mechanism, as
>> >>> far as I can tell.
>> >>> Mike
>> >
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
>>
>
>

Re: Is SuperColumn necessary?

Posted by Mike Malone <mi...@simplegeo.com>.

Nice, Ed, we're doing something very similar but less generic.

Now replace all of the various methods for querying with a simple query
interface that takes a Predicate, allow the user to specify (in
storage-conf) which levels of the nested Columns should be indexed, and
completely remove Comparators and have people subclass Column / implement
IColumn and we'd really be on to something ;).

Mock storage-conf.xml:
  <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" Type="UTF8">
    <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True" Type="UTF8">
      <Column Name="ThingThatsNowSuperColumnName" Type="Long">
        <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
          <Column Name="ThingThatCantCurrentlyBeRepresented"/>
        </Column>
      </Column>
    </Column>
  </Column>

Thrift:
  struct NamePredicate {
    1: required list<binary> column_names,
  }
  struct SlicePredicate {
    1: required binary start,
    2: required binary end,
  }
  struct CountPredicate {
    1: required Predicate predicate,
    2: required i32 count=100,
  }
  struct AndPredicate {
    1: required Predicate left,
    2: required Predicate right,
  }
  struct SubColumnsPredicate {
    1: required Predicate columns,
    2: required Predicate subcolumns,
  }
  ... OrPredicate, OtherUsefulPredicates ...
  query(predicate, count, consistency_level)
    # Count here would be the total count of leaf values returned, whereas
    # CountPredicate specifies a column count for a particular sub-slice.
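
A minimal sketch of how predicates like these might compose, evaluated in memory over nested dicts. This is an illustration under assumptions, not an implementation of the proposal: names are plain strings rather than binary, slice bounds are assumed inclusive, and CountPredicate and consistency_level are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NamePredicate:
    column_names: List[str]

    def matches(self, name):
        return name in self.column_names


@dataclass
class SlicePredicate:
    start: str
    end: str

    def matches(self, name):
        # Assumption: both bounds are inclusive.
        return self.start <= name <= self.end


@dataclass
class AndPredicate:
    left: object
    right: object

    def matches(self, name):
        return self.left.matches(name) and self.right.matches(name)


@dataclass
class SubColumnsPredicate:
    columns: object      # applied to the names at this level
    subcolumns: object   # applied one level down


def query(tree, predicate, count):
    """Return up to `count` (path, value) pairs selected by `predicate`."""
    results = []

    def walk(node, pred, path):
        if isinstance(pred, SubColumnsPredicate):
            level_pred, child_pred = pred.columns, pred.subcolumns
        else:
            level_pred, child_pred = pred, None
        for name in sorted(node):
            if len(results) >= count:
                return
            if not level_pred.matches(name):
                continue
            child = node[name]
            if child_pred is not None and isinstance(child, dict):
                walk(child, child_pred, path + (name,))
            else:
                # No deeper predicate: return whatever sits under this name.
                results.append((path + (name,), child))

    walk(tree, predicate, ())
    return results
```

For example, `SubColumnsPredicate(SlicePredicate("a", "b"), NamePredicate(["x"]))` selects the `x` entry under every top-level name between `a` and `b`.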

Not fully baked... but I think this could really simplify stuff and make it
more flexible. Downside is it may give people enough rope to hang
themselves, but at least the predicate stuff is easily distributable.

I'm thinking I'll play around with implementing some of this stuff myself if
I have any free time in the near future.

Mike

On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> Very interesting, thanks!
>
> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
> > Follow-up from last weeks discussion, I've been playing around with a
> simple
> > column comparator for composite column names that I put up on github.
> I'd
> > be interested to hear what people think of this approach.
> >
> > http://github.com/edanuff/CassandraCompositeType
> >
> > Ed
> >
> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com> wrote:
> >>
> >> It might make sense to create a CompositeType subclass of AbstractType
> for
> >> the purpose of constructing and comparing these types of "composite"
> column
> >> names so that if you could more easily do that sort of thing rather than
> >> having to concatenate into one big string.
> >>
> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone <mi...@simplegeo.com>
> wrote:
> >>>
> >>> The only thing SuperColumns appear to buy you (as someone pointed out
> to
> >>> me at the Cassandra meetup - I think it was Eric Florenzano) is that
> you can
> >>> use different comparator types for the Super/SubColumns, I guess..? But
> you
> >>> should be able to do the same thing by creating your own Column
> comparator.
> >>> I guess my point is that SuperColumns are mostly a convenience
> mechanism, as
> >>> far as I can tell.
> >>> Mike
> >
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>

Re: Is SuperColumn necessary?

Posted by Jonathan Ellis <jb...@gmail.com>.

Very interesting, thanks!

On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed...@anuff.com> wrote:
> Follow-up from last weeks discussion, I've been playing around with a simple
> column comparator for composite column names that I put up on github.  I'd
> be interested to hear what people think of this approach.
>
> http://github.com/edanuff/CassandraCompositeType
>
> Ed
>
> On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com> wrote:
>>
>> It might make sense to create a CompositeType subclass of AbstractType for
>> the purpose of constructing and comparing these types of "composite" column
>> names so that if you could more easily do that sort of thing rather than
>> having to concatenate into one big string.
>>
>> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone <mi...@simplegeo.com> wrote:
>>>
>>> The only thing SuperColumns appear to buy you (as someone pointed out to
>>> me at the Cassandra meetup - I think it was Eric Florenzano) is that you can
>>> use different comparator types for the Super/SubColumns, I guess..? But you
>>> should be able to do the same thing by creating your own Column comparator.
>>> I guess my point is that SuperColumns are mostly a convenience mechanism, as
>>> far as I can tell.
>>> Mike
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Is SuperColumn necessary?

Posted by Stu Hood <st...@rackspace.com>.

Hey Ed,

I've been working on a similar approach for arbitrarily nested/compound column names in #998. See: http://github.com/stuhood/cassandra/blob/998/src/java/org/apache/cassandra/db/ColumnKey.java

The goal is to provide native support and potentially (in the very long term), API support for nested/compound names. The difference between our approaches boils down to needing to define a comparator for every level in #998, versus having dynamic types per name in your approach.
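
A rough sketch of the "comparator for every level" idea, assuming compound names are tuples with one component per level. This is not the #998 code; `text_cmp`, `long_cmp`, and `make_comparator` are hypothetical names used only for illustration.

```python
from functools import cmp_to_key


def cmp_values(a, b):
    """Three-way comparison returning -1, 0, or 1."""
    return (a > b) - (a < b)


def text_cmp(a, b):
    return cmp_values(a, b)


def long_cmp(a, b):
    # Components arrive as strings here; compare them numerically.
    return cmp_values(int(a), int(b))


def make_comparator(level_comparators):
    """Build a comparator for compound names from one comparator per level."""
    def compare(left, right):
        for level_cmp, l, r in zip(level_comparators, left, right):
            result = level_cmp(l, r)
            if result != 0:
                return result
        # Equal on all shared levels: the shorter name (a parent) sorts first.
        return cmp_values(len(left), len(right))
    return compare


compare = make_comparator([text_cmp, long_cmp])
names = [("b", "2"), ("a", "10"), ("a", "9"), ("a",)]
ordered = sorted(names, key=cmp_to_key(compare))
```

Here the comparator list is fixed up front for the whole structure. With dynamic types per name, each component would carry its own type instead.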

Thanks,
Stu

-----Original Message-----
From: "Ed Anuff" <ed...@anuff.com>
Sent: Wednesday, May 5, 2010 1:31pm
To: user@cassandra.apache.org
Subject: Re: Is SuperColumn necessary?

Follow-up from last week's discussion: I've been playing around with a simple
column comparator for composite column names that I put up on GitHub. I'd
be interested to hear what people think of this approach.

http://github.com/edanuff/CassandraCompositeType
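
To illustrate why a composite comparator can be preferable to concatenating into one big string, here is a small sketch. It is not the code in the repository above; `concat_name` and `composite_key` are hypothetical helpers, and the type-tag scheme is an assumption made for the example.

```python
def concat_name(*parts):
    """The 'one big string' approach: join the components with a separator."""
    return ":".join(str(p) for p in parts)


def composite_key(parts):
    """Sort key that compares component by component, each in its own type.

    Each component is paired with a type tag so that components of
    different types are never compared with each other directly.
    """
    return tuple((type(p).__name__, p) for p in parts)


names = [("user", 10), ("user", 9), ("group", 2)]

# String order puts "user:10" before "user:9".
by_string = sorted(names, key=lambda parts: concat_name(*parts))

# Component-wise order compares 9 and 10 as integers.
by_component = sorted(names, key=composite_key)
```

With concatenation, numeric components need zero-padding or a similar trick to sort correctly. Comparing typed components avoids that.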

Ed

On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com> wrote:

> It might make sense to create a CompositeType subclass of AbstractType for
> the purpose of constructing and comparing these types of "composite" column
> names so that if you could more easily do that sort of thing rather than
> having to concatenate into one big string.
>
>
> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone <mi...@simplegeo.com> wrote:
>
>> The only thing SuperColumns appear to buy you (as someone pointed out to
>> me at the Cassandra meetup - I think it was Eric Florenzano) is that you can
>> use different comparator types for the Super/SubColumns, I guess..? But you
>> should be able to do the same thing by creating your own Column comparator.
>> I guess my point is that SuperColumns are mostly a convenience mechanism, as
>> far as I can tell.
>>
>> Mike
>>
>
>

Re: Is SuperColumn necessary?

Posted by Jonathan Shook <js...@gmail.com>.

I'm not sure this is much of an improvement. It does illustrate,
however, the desire to couch the concepts in terms that each of us is
already comfortable with. Nearly every set of terms that comes from an
existing system will have baggage that doesn't map appropriately. It's
not that "sparse multidimensional arrays" is an unfamiliar construct.
It's more that "sparse" may or may not apply depending on the part of
your data you are describing. "Multidimensional" implies uniformity of
structure, which is not to be taken for granted. Arrays are just one
way to think of the structures. They also serve well as maps and sets
(which can be modeled using arrays as well). There are certain
semantics of sets, lists, and maps which people have wired into their
brains, and reducing it all to "arrays" is likely to create more
confusion.
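
As a small illustration of that point, the same sorted name-to-value structure can be read as a map, a set, or a list depending on how the names and values are used. The rows below are invented examples, and `as_set` and `as_list` are hypothetical helpers.

```python
# Map: names are keys, values carry the payload.
profile = {"email": "alice@example.com", "name": "Alice"}

# Set: names are the members, values are left empty.
followers = {"bob": "", "carol": ""}

# List: names are sortable positions (timestamps here), values are the items.
timeline = {1002: "second post", 1001: "first post"}


def as_set(row):
    """Read a row as a set: only the names matter."""
    return set(row)


def as_list(row):
    """Read a row as a list: the values, ordered by name."""
    return [row[name] for name in sorted(row)]
```

Calling all three of these "arrays" would hide exactly the semantics that make each one useful.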

I think if we want to borrow terms from another system, it shouldn't
be a computing system, or at least should be so different or
fundamental that the terms have to be re-understood free of baggage.

On Sun, May 9, 2010 at 1:30 AM, David Boxenhorn <da...@lookin2.com> wrote:
> Guys, this is beginning to sound like MUMPS!
> http://en.wikipedia.org/wiki/MUMPS
>
> In MUMPS, all variables are sparse, multidimensional arrays, which can be
> stored to disk.
>
> It is an arcane, and archaic, language (does anyone but me remember it?),
> but it has been used successfully for years. Maybe we can learn something
> from it.
>
> I like the terminology of sparse multidimensional arrays very much - it
> really clarifies my thinking. A column family would just be a variable.
>
> On Fri, May 7, 2010 at 7:06 PM, Ed Anuff <ed...@anuff.com> wrote:
>>
>> On Thu, May 6, 2010 at 11:10 PM, Mike Malone <mi...@simplegeo.com> wrote:
>>>
>>> The upshot is, the Cassandra data model would go from being "it's a
>>> nested
>>> dictionary, just kidding no it's not!" to being "it's a nested
>>> dictionary,
>>> for serious." Again, these are all just ideas... but I think this
>>> simplified
>>> data model would allow you to express pretty much any query in a graph of
>>> simple primitives like Predicates, Filters, Aggregations,
>>> Transformations,
>>> etc. The indexes would allow you to cheat when evaluating certain types
>>> of
>>> queries - if you get a SlicePredicate on an indexed "thingy" you don't
>>> have
>>> to enumerate the entire set of "sub-thingies" for example.
>>>
>>
>> This would be my dream implementation. I'm working an an application that
>> needs that sort of capability.  SuperColumns lead you to thinking that
>> should be done in the cassandra tier but then fall short, so my thought was
>> that I was just going to do everything that was in Cassandra as regular
>> columnfamilies and columns using composite keys and composite column names
>> ala the code I shared above, and then implement the n-level hierarchy in the
>> app tier.  It looks like your suggestion is to take it in the other
>> direction and make it part of the fundamental data model, which would be
>> very useful if it could be made to work without big tradeoffs.
>>
>>
>

Re: Is SuperColumn necessary?

Posted by David Boxenhorn <da...@lookin2.com>.

Guys, this is beginning to sound like MUMPS!
http://en.wikipedia.org/wiki/MUMPS

In MUMPS, all variables are sparse, multidimensional arrays, which can be
stored to disk.

It is an arcane, and archaic, language (does anyone but me remember it?),
but it has been used successfully for years. Maybe we can learn something
from it.

I like the terminology of sparse multidimensional arrays very much - it
really clarifies my thinking. A column family would just be a variable.
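
As a rough illustration of that analogy (a Python sketch, not MUMPS, and
not how Cassandra stores anything), a sparse multidimensional array can be
modeled as a mapping from subscript tuples to values. The SparseArray class,
its children() helper, and the sample data are all made up for this example.

```python
class SparseArray:
    """A sparse, multidimensional array: only assigned subscripts are stored."""

    def __init__(self):
        self._cells = {}

    def __setitem__(self, subscripts, value):
        # subscripts is a tuple such as ("alice", "email")
        self._cells[subscripts] = value

    def __getitem__(self, subscripts):
        return self._cells[subscripts]

    def children(self, *prefix):
        """Return the sorted next-level subscripts found under a prefix."""
        n = len(prefix)
        return sorted({k[n] for k in self._cells
                       if len(k) > n and k[:n] == prefix})


users = SparseArray()  # plays the role of a column family "variable"
users["alice", "email"] = "alice@example.com"
users["alice", "dob"] = "1980-01-01"
users["bob", "email"] = "bob@example.com"

top_level = users.children()            # first-level subscripts (row keys)
alice_cols = users.children("alice")    # second-level subscripts (columns)
```

Nothing is allocated for subscripts that were never assigned, which is the
"sparse" part; bob simply has no "dob" entry.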

On Fri, May 7, 2010 at 7:06 PM, Ed Anuff <ed...@anuff.com> wrote:

> On Thu, May 6, 2010 at 11:10 PM, Mike Malone <mi...@simplegeo.com> wrote:
>
>>
>> The upshot is, the Cassandra data model would go from being "it's a nested
>> dictionary, just kidding no it's not!" to being "it's a nested dictionary,
>> for serious." Again, these are all just ideas... but I think this
>> simplified
>> data model would allow you to express pretty much any query in a graph of
>> simple primitives like Predicates, Filters, Aggregations, Transformations,
>> etc. The indexes would allow you to cheat when evaluating certain types of
>> queries - if you get a SlicePredicate on an indexed "thingy" you don't
>> have
>> to enumerate the entire set of "sub-thingies" for example.
>>
>>
> This would be my dream implementation. I'm working an an application that
> needs that sort of capability.  SuperColumns lead you to thinking that
> should be done in the cassandra tier but then fall short, so my thought was
> that I was just going to do everything that was in Cassandra as regular
> columnfamilies and columns using composite keys and composite column names
> ala the code I shared above, and then implement the n-level hierarchy in the
> app tier.  It looks like your suggestion is to take it in the other
> direction and make it part of the fundamental data model, which would be
> very useful if it could be made to work without big tradeoffs.
>
>
>

Re: Is SuperColumn necessary?

Posted by Ed Anuff <ed...@anuff.com>.

On Thu, May 6, 2010 at 11:10 PM, Mike Malone <mi...@simplegeo.com> wrote:

>
> The upshot is, the Cassandra data model would go from being "it's a nested
> dictionary, just kidding no it's not!" to being "it's a nested dictionary,
> for serious." Again, these are all just ideas... but I think this
> simplified
> data model would allow you to express pretty much any query in a graph of
> simple primitives like Predicates, Filters, Aggregations, Transformations,
> etc. The indexes would allow you to cheat when evaluating certain types of
> queries - if you get a SlicePredicate on an indexed "thingy" you don't have
> to enumerate the entire set of "sub-thingies" for example.
>
>
This would be my dream implementation. I'm working on an application that
needs that sort of capability.  SuperColumns lead you to think that this
should be done in the Cassandra tier but then fall short, so my thought was
that I would just store everything in Cassandra as regular
columnfamilies and columns using composite keys and composite column names
a la the code I shared above, and then implement the n-level hierarchy in the
app tier.  It looks like your suggestion is to take it in the other
direction and make it part of the fundamental data model, which would be
very useful if it could be made to work without big tradeoffs.

Re: Is SuperColumn necessary?

Posted by Mike Malone <mi...@simplegeo.com>.

On Thu, May 6, 2010 at 5:38 PM, Vijay <vi...@gmail.com> wrote:

> I would rather be interested in Tree type structure where supercolumns have
> supercolumns in it..... you dont need to compare all the columns to find a
> set of columns and will also reduce the bytes transfered for separator, at
> least string concatenation (Or something like that) for read and write
> column name generation. it is more logically stored and structured by this
> way.... and also we can make caching work better by selectively caching the
> tree (User defined if you will)....
>
> But nothing wrong in supporting both :)
>

I'm 99% sure we're talking about the same thing and we don't need to support
both. How names/values are separated is pretty irrelevant. It has to happen
somewhere. I agree that it'd be nice if it happened on the server, but doing
it in the client makes it easier to explore ideas.

On Thu, May 6, 2010 at 5:27 PM, philip andrew <ph...@gmail.com> wrote:

> Please create a new term word if the existing terms are misleading, if its
> not a file system then its not good to call it a file system.

While it's seriously bikesheddy, I guess you're right.

Let's call them "thingies" for now, then. So you can have a top-level
"thingy" and it can have an arbitrarily nested tree of sub-"thingies." Each
"thingy" has a "thingy type" [1]. You can also tell Cassandra if you want a
particular level of "thingy" to be indexed. At one (or maybe more) levels
you can tell Cassandra you want your "thingies" to be split onto separate
nodes in your cluster. At one (or maybe more) levels you could also tell
Cassandra that you want your "thingies" split into separate files [2].

The upshot is, the Cassandra data model would go from being "it's a nested
dictionary, just kidding no it's not!" to being "it's a nested dictionary,
for serious." Again, these are all just ideas... but I think this simplified
data model would allow you to express pretty much any query in a graph of
simple primitives like Predicates, Filters, Aggregations, Transformations,
etc. The indexes would allow you to cheat when evaluating certain types of
queries - if you get a SlicePredicate on an indexed "thingy" you don't have
to enumerate the entire set of "sub-thingies" for example.

So, you'd query your "thingies" by building out a predicate,
transformations, filters, etc., serializing the graph of primitives, and
sending it over the wire to Cassandra. Cassandra would rebuild the graph and
run it over your dataset.

So instead of:

  Cassandra.get_range_slices(
    keyspace="AwesomeApp",
    column_parent=ColumnParent(column_family="user"),
    slice_predicate=SlicePredicate(column_names=['username', 'dob']),
    range=KeyRange(start_key='a', end_key='m'),
    consistency_level=ONE
  )

You'd do something like:

  Cassandra.query(
    SubThingyTransformer(
        NamePredicate(names=["AwesomeApp"]),
        SubThingyTransformer(
            NamePredicate(names=["user"]),
            SubThingyTransformer(
                SlicePredicate(start="a", end="m"),
                NamePredicate(names=["username", "dob"])
            )
        )
    ),
    consistency_level=ONE
  )

Which seems complicated, but it's basically just [(user['username'],
user['dob']) for user in Cassandra['AwesomeApp']['user'].slice('a', 'm')]
and could probably be expressed that way in a client library.
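
A minimal sketch of what such a client-side wrapper might look like, using
an in-memory nested dictionary as a stand-in for the cluster. The Thingy
class and its slice() method are hypothetical, the sample users are made up,
and the slice bounds are treated as inclusive here, which is an assumption.

```python
from bisect import bisect_left, bisect_right


class Thingy(dict):
    """A nested dictionary whose keys can be sliced in sorted order."""

    def slice(self, start, end):
        """Return the values whose keys fall in [start, end], in key order."""
        keys = sorted(self)
        lo = bisect_left(keys, start)
        hi = bisect_right(keys, end)
        return [self[k] for k in keys[lo:hi]]


# In-memory stand-in for the cluster; all data here is invented.
Cassandra = Thingy({
    "AwesomeApp": Thingy({
        "user": Thingy({
            "alice": Thingy(username="alice", dob="1980-01-01"),
            "bob": Thingy(username="bob", dob="1975-05-05"),
            "zed": Thingy(username="zed", dob="1990-03-03"),
        }),
    }),
})

result = [(user["username"], user["dob"])
          for user in Cassandra["AwesomeApp"]["user"].slice("a", "m")]
```

A real client would translate slice() into a query sent to the server
rather than sorting keys locally.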

I think batch_mutate is awesome the way it is and should be the only way to
insert/update data. I'd rename it mutate. So our interface becomes:

  Cassandra.query(query, consistency_level)
  Cassandra.mutate(mutation, consistency_level)

Ta-da.
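
To make the query half of that interface a little more concrete, here is a
rough sketch of the serialize-and-rebuild round trip. Everything in it is an
assumption made for illustration: JSON as the wire format, the helper names
(name_predicate, slice_predicate, sub_thingy, evaluate), the inclusive slice
bounds, and the sample data. It is not an existing Cassandra API.

```python
import json


def name_predicate(names):
    return {"op": "names", "names": list(names)}


def slice_predicate(start, end):
    return {"op": "slice", "start": start, "end": end}


def sub_thingy(select, then):
    """Apply `select` at this level, then `then` to each selected child."""
    return {"op": "sub", "select": select, "then": then}


def evaluate(node, data):
    """Run a rebuilt query graph over a nested dict."""
    op = node["op"]
    if op == "names":
        return {k: data[k] for k in node["names"] if k in data}
    if op == "slice":
        return {k: data[k] for k in sorted(data)
                if node["start"] <= k <= node["end"]}
    if op == "sub":
        selected = evaluate(node["select"], data)
        return {k: evaluate(node["then"], v) for k, v in selected.items()}
    raise ValueError(f"unknown op: {op}")


query = sub_thingy(
    name_predicate(["AwesomeApp"]),
    sub_thingy(
        name_predicate(["user"]),
        sub_thingy(slice_predicate("a", "m"),
                   name_predicate(["username", "dob"])),
    ),
)

wire = json.dumps(query)    # what the client would send
rebuilt = json.loads(wire)  # what the server would reconstruct

data = {"AwesomeApp": {"user": {
    "alice": {"username": "alice", "dob": "1980-01-01", "email": "a@example.com"},
    "zed": {"username": "zed", "dob": "1990-03-03", "email": "z@example.com"},
}}}
result = evaluate(rebuilt, data)
```

The point is only that the graph is plain data, so it survives the trip over
the wire and can be evaluated on the other side.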

Anyways, I was trying to avoid writing all of this out in prose and to try
mocking some of it up in code instead. I guess this works too. Either
way, I do think something like this would simplify the codebase, simplify
the data model, simplify the interface, make the entire system more
flexible, and be generally awesome.

Mike

[1] These can be subclasses of Thingy in Java... or maybe they'd implement
IThingy. But either way they'd handle serialization and probably implement
compareTo to define natural ordering. So you'd have classes like
ASCIIThingy, UTF8Thingy, and LongThingy (ahem) - these would replace
comparators.
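
A sketch of that idea, written here in Python rather than Java: each type
wraps a value, knows how to serialize it, and defines its own ordering, so
no separate comparator is needed. The class names follow the footnote, but
the methods and the byte layouts are assumptions for illustration.

```python
import struct
from functools import total_ordering


@total_ordering
class Thingy:
    """Base type: wraps a value and defines its natural ordering."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        if type(self) is not type(other):
            return NotImplemented
        return self.value == other.value

    def __lt__(self, other):
        if type(self) is not type(other):
            return NotImplemented
        return self.value < other.value


class UTF8Thingy(Thingy):
    def serialize(self):
        return self.value.encode("utf-8")

    @classmethod
    def deserialize(cls, raw):
        return cls(raw.decode("utf-8"))


class LongThingy(Thingy):
    def serialize(self):
        return struct.pack(">q", self.value)  # 8-byte big-endian signed integer

    @classmethod
    def deserialize(cls, raw):
        return cls(struct.unpack(">q", raw)[0])


# The same digits sort differently depending on the thingy type.
longs = sorted(LongThingy(n) for n in [10, 9, 100])
texts = sorted(UTF8Thingy(s) for s in ["10", "9", "100"])
```

The ordering lives with the type, which is the role a comparator plays
today.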

[2] I think there's another simplification here. Splitting into separate
files is really very similar to splitting onto separate nodes. There might
be a way around some of the row size limitations with this sort of concept.
And we may be able to get better utilization of multiple disks by giving
each disk (or data directory) a subset of the node's token range. Caveat:
thought not fully baked.

Re: Is SuperColumn necessary?

Posted by Vijay <vi...@gmail.com>.

I would rather be interested in a tree-type structure where supercolumns have
supercolumns in them. You don't need to compare all the columns to find a
set of columns, and it will also reduce the bytes transferred for the separator, at
least the string concatenation (or something like that) for read and write
column name generation. It is more logically stored and structured this
way, and we can also make caching work better by selectively caching the
tree (user defined, if you will).

But nothing wrong in supporting both :)

Regards,
</VJ>

On Wed, May 5, 2010 at 11:31 AM, Ed Anuff <ed...@anuff.com> wrote:

> Follow-up from last weeks discussion, I've been playing around with a
> simple column comparator for composite column names that I put up on
> github.  I'd be interested to hear what people think of this approach.
>
> http://github.com/edanuff/CassandraCompositeType
>
> Ed
>
> On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com> wrote:
>
>> It might make sense to create a CompositeType subclass of AbstractType for
>> the purpose of constructing and comparing these types of "composite" column
>> names so that if you could more easily do that sort of thing rather than
>> having to concatenate into one big string.
>>
>>
>> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone <mi...@simplegeo.com> wrote:
>>
>>> The only thing SuperColumns appear to buy you (as someone pointed out to
>>> me at the Cassandra meetup - I think it was Eric Florenzano) is that you can
>>> use different comparator types for the Super/SubColumns, I guess..? But you
>>> should be able to do the same thing by creating your own Column comparator.
>>> I guess my point is that SuperColumns are mostly a convenience mechanism, as
>>> far as I can tell.
>>>
>>> Mike
>>>
>>
>>
>

Re: Is SuperColumn necessary?

Posted by Ed Anuff <ed...@anuff.com>.

Follow-up from last week's discussion: I've been playing around with a simple
column comparator for composite column names that I put up on github.  I'd
be interested to hear what people think of this approach.

http://github.com/edanuff/CassandraCompositeType

Ed

On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed...@anuff.com> wrote:

> It might make sense to create a CompositeType subclass of AbstractType for
> the purpose of constructing and comparing these types of "composite" column
> names so that if you could more easily do that sort of thing rather than
> having to concatenate into one big string.
>
>
> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone <mi...@simplegeo.com> wrote:
>
>> The only thing SuperColumns appear to buy you (as someone pointed out to
>> me at the Cassandra meetup - I think it was Eric Florenzano) is that you can
>> use different comparator types for the Super/SubColumns, I guess..? But you
>> should be able to do the same thing by creating your own Column comparator.
>> I guess my point is that SuperColumns are mostly a convenience mechanism, as
>> far as I can tell.
>>
>> Mike
>>
>
>

Re: Is SuperColumn necessary?

Posted by Ed Anuff <ed...@anuff.com>.

It might make sense to create a CompositeType subclass of AbstractType for
the purpose of constructing and comparing these types of "composite" column
names, so that you could more easily do that sort of thing rather than
having to concatenate into one big string.
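
To show why a component-wise comparison might be preferable, here is a small
Python sketch (not the actual AbstractType API, which is Java). The
compare_composite function and the sample names are hypothetical. It
compares composite names one component at a time, so a numeric component
sorts numerically instead of as text.

```python
from functools import cmp_to_key


def compare_composite(a, b):
    """Compare two composite names (tuples) component by component."""
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    # All shared components are equal: the shorter name sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


names = [("alice", 10), ("alice", 9), ("bob", 1)]
by_composite = sorted(names, key=cmp_to_key(compare_composite))

# The concatenated-string alternative compares "10" and "9" as text.
by_string = sorted(f"{user}:{n}" for user, n in names)
```

With one big string, ("alice", 10) lands before ("alice", 9); with the
composite comparison it lands after.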

On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone <mi...@simplegeo.com> wrote:

> The only thing SuperColumns appear to buy you (as someone pointed out to me
> at the Cassandra meetup - I think it was Eric Florenzano) is that you can
> use different comparator types for the Super/SubColumns, I guess..? But you
> should be able to do the same thing by creating your own Column comparator.
> I guess my point is that SuperColumns are mostly a convenience mechanism, as
> far as I can tell.
>
> Mike
>

Re: Is SuperColumn necessary?

Posted by Mike Malone <mi...@simplegeo.com>.

On Wed, Apr 28, 2010 at 5:24 AM, David Boxenhorn <da...@lookin2.com> wrote:

> If I understand correctly, the distinction between supercolumns and
> subcolumns is critical to good database design if you want to use random
> partitioning: you can do range queries on subcolumns but not on
> supercolumns.
>
> Is this correct?
>

You can do efficient range queries of normal (not super) columns in a
ColumnFamily. I think SuperColumns are not indexed, so it's less efficient
to do a slice of subcolumns from a column, if there are lots of subcolumns.

I agree that SuperColumns are technically unnecessary. There aren't any use
cases I can come up with that a SuperColumn satisfies that normal Columns
can't. You can simulate SuperColumn behavior by concatenating key parts with
a separator and using the concatenated key as your column name, then doing a
slice. So if you had a SuperColumn that stored usernames, and sub-columns
that stored document IDs, you could instead have a normal CF that stores
<username>:<document-id>.
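
A small sketch of that trick in Python, using a sorted list in place of a
row's columns. The make_name and slice_prefix helpers are made up for this
example, and it assumes the separator never appears inside a username.

```python
from bisect import bisect_left

SEP = ":"  # assumed separator; must not occur inside a username


def make_name(username, doc_id):
    """Build a composite column name of the form <username>:<document-id>."""
    return f"{username}{SEP}{doc_id}"


def slice_prefix(sorted_names, username):
    """Return all column names for one user from a sorted list of names."""
    start = username + SEP
    # chr(ord(SEP) + 1) is the first character sorting after the separator,
    # so [start, finish) covers exactly the names that begin with start.
    finish = username + chr(ord(SEP) + 1)
    lo = bisect_left(sorted_names, start)
    hi = bisect_left(sorted_names, finish)
    return sorted_names[lo:hi]


columns = sorted([
    make_name("alice", "doc1"),
    make_name("alice", "doc2"),
    make_name("alicia", "doc9"),
    make_name("bob", "doc1"),
])
alice_docs = slice_prefix(columns, "alice")
```

Note that "alicia" is not picked up by the slice for "alice", because the
separator is part of the start bound.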

The only thing SuperColumns appear to buy you (as someone pointed out to me
at the Cassandra meetup - I think it was Eric Florenzano) is that you can
use different comparator types for the Super/SubColumns, I guess..? But you
should be able to do the same thing by creating your own Column comparator.
I guess my point is that SuperColumns are mostly a convenience mechanism, as
far as I can tell.

Mike

Re: Is SuperColumn necessary?

Posted by David Boxenhorn <da...@lookin2.com>.

If I understand correctly, the distinction between supercolumns and
subcolumns is critical to good database design if you want to use random
partitioning: you can do range queries on subcolumns but not on
supercolumns.

Is this correct?

On Mon, Apr 26, 2010 at 7:11 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> I think that once we have built-in indexing (CASSANDRA-749) you can
> make a good case for dropping supercolumns (at least, dropping them
> from the public API and reserving them for internal use).
>
> On Mon, Apr 26, 2010 at 11:05 AM, Schubert Zhang <zs...@gmail.com>
> wrote:
> > I don't think the SuperColumn is so necessary.
> > I think this level of logic can be leaved to application.
> >
> > Do you think so?
> >
> > If SuperColumn is needed,  as
> > https://issues.apache.org/jira/browse/CASSANDRA-598, we should build
> index
> > in SuperColumns level and SubColumns level.
> > Thus, the levels of index is too many.
> >
> >
>

Re: Is SuperColumn necessary?

Posted by Jonathan Ellis <jb...@gmail.com>.

I think that once we have built-in indexing (CASSANDRA-749) you can
make a good case for dropping supercolumns (at least, dropping them
from the public API and reserving them for internal use).

On Mon, Apr 26, 2010 at 11:05 AM, Schubert Zhang <zs...@gmail.com> wrote:
> I don't think the SuperColumn is so necessary.
> I think this level of logic can be leaved to application.
>
> Do you think so?
>
> If SuperColumn is needed,  as
> https://issues.apache.org/jira/browse/CASSANDRA-598, we should build index
> in SuperColumns level and SubColumns level.
> Thus, the levels of index is too many.
>
>