Posted to dev@cassandra.apache.org by Evan Weaver <ew...@gmail.com> on 2009/08/11 19:37:29 UTC

Fixing the data model names

Dear Cassandra Developers,

In my experience, the naming of the data model has been a huge barrier
to entry for users of Cassandra. This goes both for people familiar
with SQL, and for people familiar with BigTable. I would like to
change this before 0.4, since the 0.3 to 0.4 transition is the Great
API Breakening.

I (that is, all of us at Twitter) am willing to write all the patches
and update the wiki, if I get the necessary community buy-in. My plan is
one patch per external interface change, and then, after those are
complete, one patch per internal interface change as phase 2.

So technically this is not a bikeshed, because I'm happy to do all the
work. I'll even submit a patch for Digg's Python client. Since there
are no production deployments of ASF, and only a couple
well-maintained clients, now is the time to break the world. A few
hours of work now will pay off richly in terms of community
involvement and reduced noob-explanation-time.

In general, I think the data model names should have the following goals:

 * Use existing, widely understood terms.
 * Do not use terms that have conflicting meanings.
 * Express analogies in the data model, where useful.
 * Be unambiguous.

Are these goals valid? Clearly I think they are, because I wrote you a
very long email about it. Also, I don't think the current names meet
these goals. Currently, we have:

  Cluster: contains keyspaces.

  This is fine.

  Keyspace: contains column families.

There was some discussion of this change on the list a while back.
Keyspace beats Table by a mile, due to the "conflicting existing
usage" rule, but I think we can do better.

  Column family: containing a name, keys, column type, column sort,
and sub column sort.

  This name is from BigTable, and not in wide usage. It does not
express the hierarchy of storage, rather referring to a side effect of
the storage hierarchy by talking about the most granular data objects.
Confusing.

  Key: associated with columns.

  Since there's no word for the entire
key-and-columns-in-a-column-family thing ("row"), it's hard to talk
about this level of the data model clearly.

  Column: containing a name, value, and timestamp.

  This is from BigTable. In most cases, except when contained within a
super column, the data is row-oriented. There is nothing inherently
columnar about the storage. Furthermore, column is widely understood
from SQL to mean a table-enforced, strongly typed slot. Since
Cassandra does not have a tabular model, this is straight-up wrong.
Timestamps are an additional unexpected innovation in the normal use
of "column".

  Super column: containing a name and columns.

  This is a container of columns. However, the name expresses some
kind of priority order, but nothing about the container nature, even
though that's the most important property. This is not in any other
usage anywhere, and will always require explanation. Despite being a
type of column, it cannot be updated or overwritten like a standard
column, and does not have a timestamp.

Try to approach the naming with the mind of a beginner. For what it's
worth, it took me at least 6 weeks to become comfortable with the
current Cassandra terminology, and I had many false assumptions based
on the names. I remember it took far less than that when starting out
with SQL. At least there you can defer the confusing parts until
later; Cassandra hits you with the confusion all up front. Just
because we are comfortable now doesn't mean that the current names
are a good thing.

So, on to the new proposed naming. In Cassandra's implementation, each
level of the data model contains the totality of the lower levels.
I've tried to express that in the new names.

  Cluster.

  No change.

  Database (formerly keyspace, formerly table).

  Since this is quite literally the same as a database in an RDBMS,
there's no reason to change the term. It's a namespace with a specific
set of storage flags flipped, and it is used the same way as a database
in an RDBMS.

  Record collection (formerly column family).

  This expresses the container nature--an ordered set. The word
"collection" is used in document databases to mean the same thing.

  Record (formerly a-thing-without-a-name)

  This is the row itself. It has a key, and attributes, but the thing
itself is not a key. It is not a "document" because it does not
arbitrarily nest, and it's not "row" because that might imply the
tabular nature of an RDBMS. Record has a history in databases which is
reasonable in this context. It does not imply that a record
necessarily corresponds to a complete object in the application, but
it doesn't rule it out. Since this is the only thing that has a key,
it's still valid to refer to a "key" in isolation, when convenient.

  Attribute (formerly column).

  It has a name, value, and a timestamp. It does not imply anything
about the storage. It does not imply a tabular model. It's more
specific than "tuple", but easier to talk about than "timestamped
key/value pair". It's the same as attributes in any object system.

  Attribute collection (formerly super column).

  This is clearly a container of attributes. That is all it implies,
and that is what it is. It is analogous to record collection.

In short:

  Cluster
  Database
  Record collection
  Record
  Attribute collection
  Attribute

We could call the cluster "database collection", but even I'm not
going to go that far. I realize that each level is merely a collection
of the collections under it, but an "attribute collection collection
collection collection" is no help to day-to-day usage. ;-)
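
For readers who think in code, the nesting above can be sketched roughly
as plain Python dictionaries. This is only an illustration of the
proposed terms, not Cassandra's API: the function `insert_attribute`,
the database name "example_db", the sample keys, and the microsecond
timestamp default are all made up for the example. It models the case
where a record holds attribute collections (today's super columns).

```python
import time

# Hypothetical illustration only: each level of the proposed naming is a
# mapping that contains the level below it.
# cluster -> database -> record collection -> record (by key)
#         -> attribute collection -> attribute name -> (value, timestamp)
cluster = {}


def insert_attribute(cluster, database, record_collection, key,
                     attribute_collection, name, value, timestamp=None):
    """Store one attribute, creating the intermediate containers as needed."""
    if timestamp is None:
        # Microseconds since the epoch; the unit is an assumption.
        timestamp = int(time.time() * 1_000_000)
    record = (cluster.setdefault(database, {})
                     .setdefault(record_collection, {})
                     .setdefault(key, {}))
    attributes = record.setdefault(attribute_collection, {})
    attributes[name] = (value, timestamp)
    return record


record = insert_attribute(cluster, "example_db", "user_associations",
                          "user:42", "user_timeline", "uuid-1",
                          "tweet:1001", timestamp=1)
```

Each argument to `insert_attribute` names exactly one level of the
hierarchy, which is the property the new names are meant to make
obvious.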

As a heuristic, do the current names help, or get in the way? I'm not
married to the new proposal, but I want us to move in the right
direction, and not act like the current unusual naming is a badge of
honor, or forget our own difficulties in getting started.

Keep in mind that BigTable, as an internal Google project, did not
have API clarity as a primary goal; witness the colon-string-API that
got copied by Cassandra originally.

Comments please!

Thanks,

Evan

-- 
Evan Weaver

Re: Fixing the data model names

Posted by Eric Evans <ee...@rackspace.com>.

On Tue, 2009-08-11 at 22:48 +0100, Bill de hOra wrote:
> "Cluster" has a lot of meaning in the Java world already (a collection
> of app servers) and is tied to the physical model - all the others are
> tied to the logical model of the data.
>
> Putting "Database" underneath "Cluster" misses the point that the
> database is distributed across the cluster -- even if it's not right
> for Cassandra, "BigTable" captures this concept well. For me, that the
> database remains the uppermost concept even after physical distribution
> is largely the point of Cassandra.

+1

-- 
Eric Evans
eevans@rackspace.com

Re: Fixing the data model names

Posted by Bill de hOra <bi...@dehora.net>.

Evan Weaver wrote:
> So technically this is not a bikeshed, because I'm happy to do all the
> work. I'll even submit a patch for Digg's Python client. Since there
> are no production deployments of ASF, and only a couple
> well-maintained clients, now is the time to break the world. A few
> hours of work now will pay off richly in terms of community
> involvement and reduced noob-explanation-time.

Post-keyspace, we have this situation:

1: objects with table in their name:

   http://www.flickr.com/photos/dehora/3812812718/sizes/l/

2: objects with keyspace in their name

   http://www.flickr.com/photos/dehora/3812812498/sizes/l/

What I take from this is that either the code is dissonant with the
current consensus, or the current consensus is a hallucination :)

So if we go through with this, the community needs to commit to renaming
objects and clearing out dead concepts. IME, patch-based processes resist
this kind of high-level rework unless the community, and especially the
reviewers, are up for it.

> In short:
> 
>   Cluster
>   Database
>   Record collection
>   Record
>   Attribute collection
>   Attribute
> 
> We could call the cluster "database collection", but even I'm not
> going to go that far. I realize that each level is merely a collection
> of the collections under it, but an "attribute collection collection
> collection collection" is no help to day-to-day usage. ;-)

"Cluster" has a lot of meaning in the Java world already (a collection 
of app servers) and is tied to the physical model - all the others are 
tied to the logical model of the data.

Putting "Database" underneath "Cluster" misses the point that the
database is distributed across the cluster -- even if it's not right
for Cassandra, "BigTable" captures this concept well. For me, that the
database remains the uppermost concept even after physical distribution
is largely the point of Cassandra.

> A few
> hours of work now will pay off richly in terms of community
> involvement and reduced noob-explanation-time.

For usage and the API, there are other concepts that need to be
articulated properly for Cassandra users, such as slice, reverse, range,
mutation, consistency, path, parent. I'd like to believe these matter
in the domain and are not fallout from using thrift/rpc ;)

Bill

Re: Fixing the data model names

Posted by Eric Evans <ee...@rackspace.com>.

On Wed, 2009-08-12 at 01:09 -0400, Curt Micol wrote:
> In response to Mr. Evans's comment regarding the Bigtable paper, does
> the Cassandra community want this to be a requirement for using the
> software? I would think not.  Sure, most early adopters are coming
> from that paper, but it shouldn't be a source of entry to use the
> database, but rather to develop it.

I never meant to imply that it was a requirement[0], only that the
construct in question has already been named, that name has gained a
certain amount of acceptance/traction, and that name is "Column family".

Jonathan's argument for renaming "table" to "keyspace" (which I found
compelling), was that "table" was loaded with meaning for people coming
from a relational database background. The same is true here, "Column
family" comes loaded with meaning, only in this case it means exactly
what it's supposed to.

[0] Though now that you mention it, I would *strongly* recommend that
people read the BigTable and Dynamo papers before getting started.

-- 
Eric Evans
eevans@rackspace.com

Re: Fixing the data model names

Posted by Jonathan Ellis <jb...@gmail.com>.

On Wed, Aug 12, 2009 at 7:23 PM, Evan Weaver<ew...@gmail.com> wrote:
> Re. Jonathan on "database": oracle/sqlserver/mysql/postgres call it a
> database.

No.  With a database (ignoring things like user accounts that don't
apply), you decide at connection time which database you want to talk
to, and once you do that there is no way* to access data from other
databases.  With schemas you have namespaced objects, but you can
operate on other schemas within the same db.  So schema is the
appropriate analogue for Cassandra, where once connected you can
operate on any keyspace.

*Yes, I know there are some vendor-specific hacks to make this less true.

> Re. Jonathan on "columns": it would make more sense if "column family"
> was actually called "sparse table". But super columns break the
> tabular model, so I don't think pretending to be tabular is a good
> answer. Personally I prefer the terms borrowed from document databases
> (I didn't realize that "attribute" was the relational-theory term).
> Maybe "field" and "field set" is better.

Maybe.  Maybe not.  That's my point; if you can't pick something
that's Clearly Better then you might as well leave well enough alone.

> I agree that individually, the current names are technically accurate
> in their specific contexts. But taken as a whole, they make
> practically no sense to someone starting out, as Ryan mentions. I'll
> poke around and try to come up with some other possible term sets. The
> point isn't that they are *this* specific set, just that they are
> internally consistent, and analogous to things widely understood.

Analogies are dangerous things.  See the concept-formerly-known-as-table. :)

I'd rather focus on documenting a Pretty Good set of terminology than
try to bikeshed my way to perfection.

-Jonathan

Re: Fixing the data model names

Posted by Michael Greene <mi...@gmail.com>.

The internals of Thrift are not scary.  They're lex/yacc, so they're a
little opaque at first, but once you understand the model, it
applies to many parser generators.  Really, try hacking something
together for the Ruby generator if there's something you see missing.

With regard to recursive structures, unfortunately *that* would be
difficult in Thrift because of a decision made for the C++ library.
You could get parser support for it, and implement it for many of the
other language libraries, but it cannot be done with the current C++
library, for reasons best found by searching the Thrift mailing list
archives or by asking someone who knows more about it.
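
To make the recursive shape concrete, here is a rough Python rendering
of the hypothetical Column struct Jonathan sketches in the quoted text
below. It is not Thrift IDL and not anything in the Cassandra tree; the
class and its validation are only my reading of his rule that "you can
either have the subcolumns, or the timestamp and value".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Column:
    """Hypothetical recursive column: either a leaf or a container."""
    name: bytes
    subcolumns: Optional[List["Column"]] = None
    timestamp: Optional[int] = None
    value: Optional[bytes] = None

    def __post_init__(self):
        has_children = self.subcolumns is not None
        has_leaf = self.timestamp is not None and self.value is not None
        has_partial_leaf = self.timestamp is not None or self.value is not None
        # A container must not also carry leaf data.
        if has_children and has_partial_leaf:
            raise ValueError("a column cannot have both subcolumns and a value")
        # A leaf needs both halves of its data.
        if not has_children and not has_leaf:
            raise ValueError("a leaf column needs both a timestamp and a value")


# A leaf column, and a container column holding it.
leaf = Column(name=b"uuid-1", timestamp=1, value=b"tweet:1001")
timeline = Column(name=b"user_timeline", subcolumns=[leaf])
```

The self-reference in `subcolumns` is exactly the part that the current
Thrift C++ library cannot express.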

Avro is not nearly there.  Good work is being done on it, but only
C++, Java, and Python implementations have any reasonable progress,
and it is still being hashed out.  It could fit Cassandra well for
longer-held connections, once it's mature.

Michael

On Wed, Aug 12, 2009 at 11:39 PM, Evan Weaver<ew...@gmail.com> wrote:
> PS. How's Avro these days? Or could we patch Thrift? Haven't looked at
> the internals but assume they're scary.
>
> On Thu, Aug 13, 2009 at 12:23 AM, Evan Weaver<ew...@gmail.com> wrote:
>> Incidentally, is there any specific reason the collation has to be
>> pre-defined at the CF? What if any column could be an optional
>> supercolumn with a collation set at runtime? Then all CFs would be the
>> same.
>>
>> Evan
>>
>> On Wed, Aug 12, 2009 at 10:02 PM, Jonathan Ellis<jb...@gmail.com> wrote:
>>> If thrift were sane it would look something like
>>>
>>> struct Column {
>>>  byte[] name,
>>>  optional list<Column> subcolumns,
>>>  optional int64 timestamp,
>>>  optional byte[] value
>>> }
>>>
>>> "you can either have the subcolumns, or the timestamp and value" seems
>>> reasonable to me.
>>>
>>> of course in the real world, thrift can't do recursive structures, so
>>> we'd have to go with Column/SubColumn like SuperColumn/Column today.
>>> So... maybe not really an improvement after all. :)
>>>
>>> (Why am I not surprised to find out that protocol buffers does support
>>> this?  Sigh.)
>>>
>>> On Wed, Aug 12, 2009 at 8:51 PM, Evan Weaver<ew...@gmail.com> wrote:
>>>> Hmm, my Ruby client internally refers to columns and subcolumns,
>>>> rather than supercolumns and columns...mainly because the subcolumn
>>>> position is optional, but the column_or_supercolumn position is not.
>>>> So there is something we agree on.
>>>>
>>>> Do you think the lack of a timestamp in the supercolumn is confusing?
>>>> It's still not exactly a kind of column.
>>>>
>>>> Evan
>>>>
>>>> On Wed, Aug 12, 2009 at 9:47 PM, Jonathan Ellis<jb...@gmail.com> wrote:
>>>>> I agree with the proposition that the SuperColumn name is weak.
>>>>> (Although not, as I mentioned, Column or ColumnFamily.)  And I could
>>>>> go with schema over keyspace.
>>>>>
>>>>> One option to deal with SC would be to excise the term SC (and SCF
>>>>> from the config) and instead just have Columns, which may or may not
>>>>> have SubColumns.  You would define this as
>>>>>
>>>>> <ColumnFamily withSubColumns="true" .../>
>>>>>
>>>>> "Insert a subcolumn named A into the Column named B" fits pretty well
>>>>> with how I think of things working.  And now you just have Rows and
>>>>> Columns!  Just like a RDB! :P
>>>>>
>>>>> -Jonathan
>>>>>
>>>>> On Wed, Aug 12, 2009 at 8:34 PM, Evan Weaver<ew...@gmail.com> wrote:
>>>>>> Points taken, and I agree, except in my experience the current names
>>>>>> are not Pretty Good but rather Pretty Weird; the primary issues being
>>>>>> column family and super column.
>>>>>>
>>>>>> If we go by the shorter-is-better principle, we might get:
>>>>>>
>>>>>> Cluster
>>>>>> Schema
>>>>>> Row set
>>>>>> Row w/key
>>>>>> Field set
>>>>>> Field
>>>>>>
>>>>>> "You take the user's key, and use that to insert into the Row Set
>>>>>> 'user_associations' at Field Set 'user_timeline,' a field named with a
>>>>>> time-based UUID representing now, and with a value of the new tweet's
>>>>>> key."
>>>>>>
>>>>>> But let me study for a while and come up with a more researched proposal.
>>>>>>
>>>>>> Evan
>>>>>>
>>>>>> On Wed, Aug 12, 2009 at 9:21 PM, Jonathan Ellis<jb...@gmail.com> wrote:
>>>>>>> On Wed, Aug 12, 2009 at 7:52 PM, Michael Koziarski<mi...@koziarski.com> wrote:
>>>>>>>> However I think it's worth considering this from a strategic
>>>>>>>> perspective, looking at how we want the project to grow and change,
>>>>>>>> rather than just as it is right now.  The key to successful adoption
>>>>>>>> is having a successful elevator pitch,  you can start using a database
>>>>>>>> without understanding relational-algebra because 'table' and 'column'
>>>>>>>> are such simple ways to reason about the tool.  As it stands
>>>>>>>> Cassandra takes a whiteboard and 15 minutes, before people get what
>>>>>>>> you're talking about.
>>>>>>>
>>>>>>> If you want to explain it as "sort of like a relational db" then
>>>>>>>
>>>>>>> table -> CF
>>>>>>> column -> column
>>>>>>> key -> key
>>>>>>> row -> row
>>>>>>>
>>>>>>> That's the simple case, then all you have is "supercolumns can contain
>>>>>>> a list of simple columns."
>>>>>>>
>>>>>>> That really doesn't seem so hard to me.  I have explained this to *managers*.
>>>>>>>
>>>>>>>> Assuming the project gets anything like the adoption it deserves, the
>>>>>>>> users we have today will be a *tiny minority* of the users we have in
>>>>>>>> the future.  So imposing costs on the current userbase which will give
>>>>>>>> huge benefits to future users, should be something we're willing to
>>>>>>>> do.  In fact it's something that has been done repeatedly over the
>>>>>>>> last few weeks.
>>>>>>>
>>>>>>> I agree.  But as I said before I just don't see this as being an improvement.
>>>>>>>
>>>>>>>> Given those changes went in without debate, I'm not sure what the
>>>>>>>> reluctance is for making changes to the nomenclature for the project.
>>>>>>>
>>>>>>> As above.
>>>>>>>
>>>>>>>> Speaking as someone who's only been doing this a month, the naming is
>>>>>>>> *still* confusing, and when I talk with people who wonder what
>>>>>>>> cassandra is all about I get blank looks when telling them what things
>>>>>>>> are called.  If you step back and want to tell someone how you'd
>>>>>>>> insert a tweet into someone's timeline using evan's weblog post:
>>>>>>>>
>>>>>>>>  "You just take the user's key, and use that to insert into the
>>>>>>>> SuperColumnFamily 'UserAssociations' at SubColumn 'user_timeline', a
>>>>>>>> ColumnName of a time based uuid representing now, and a value of the
>>>>>>>> new tweet's key"
>>>>>>>>
>>>>>>>> Column is in the name of 3 of the 5 concepts expressed, and in each
>>>>>>>> case it's different.
>>>>>>>
>>>>>>> When you're inserting something nested 3 levels deep a certain amount
>>>>>>> of verbosity is unavoidable.  With Evan's nomenclature,
>>>>>>>
>>>>>>> "You take the user's record ID, and use that to insert into the Record
>>>>>>> Collection 'user associations' at Attribute Collection
>>>>>>> 'user_timeline,' an Attribute named with a time based uuid
>>>>>>> representing now, and with a value of the new tweet's key."
>>>>>>>
>>>>>>> I think that is a negative improvement.  Yay, now we are talking about
>>>>>>> Attribute Collections and Attributes instead of SuperColumns and
>>>>>>> Columns.  The same objections ("one object's name contains the
>>>>>>> other's!") apply, plus the new one of sounding so generic that it could
>>>>>>> apply to practically any system.
>>>>>>>
>>>>>>> -Jonathan

Re: Fixing the data model names

Posted by Jonathan Ellis <jb...@gmail.com>.

A row is the data associated with a key in a given CF.
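
As a rough sketch of that definition (a toy in-memory stand-in, not the
real client API; `keyspace`, `get_row`, and the sample data are invented
for illustration):

```python
# Hypothetical in-memory stand-in for one keyspace:
# column family name -> key -> column name -> (value, timestamp).
keyspace = {
    "Users": {
        "user:42": {"name": ("alice", 1), "location": ("somewhere", 2)},
    },
    "Tweets": {
        "user:42": {"latest": ("tweet:1001", 3)},
    },
}


def get_row(keyspace, column_family, key):
    """Return the row: every column stored under `key` in `column_family`."""
    return keyspace.get(column_family, {}).get(key, {})


users_row = get_row(keyspace, "Users", "user:42")
tweets_row = get_row(keyspace, "Tweets", "user:42")
```

The same key yields a different row in each CF, so a row is always
identified by the (CF, key) pair.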

On Thu, Aug 13, 2009 at 12:17 AM, Arin Sarkissian<ar...@rspot.net> wrote:
> Row? What are you guys referring to as a row?
>
> no - this isnt a joke
>
> Arin

Re: Fixing the data model names

Posted by Arin Sarkissian <ar...@rspot.net>.

Row? What are you guys referring to as a row?

no - this isnt a joke

Arin

On Wed, Aug 12, 2009 at 9:39 PM, Evan Weaver<ew...@gmail.com> wrote:
> PS. How's Avro these days? Or could we patch Thrift? Haven't looked at
> the internals but assume they're scary.
>
> On Thu, Aug 13, 2009 at 12:23 AM, Evan Weaver<ew...@gmail.com> wrote:
>> Incidentally, is there any specific reason the collation has to be
>> pre-defined at the CF? What if any column could be an optional
>> supercolumn with a collation set at runtime? Then all CFs would be the
>> same.
>>
>> Evan
>>
>> On Wed, Aug 12, 2009 at 10:02 PM, Jonathan Ellis<jb...@gmail.com> wrote:
>>> If thrift were sane it would look something like
>>>
>>> struct Column {
>>>  byte[] name,
>>>  optional list<Column> subcolumns,
>>>  optional int64 timestamp,
>>>  optional byte[] value
>>> }
>>>
>>> "you can either have the subcolumns, or the timestamp and value" seems
>>> reasonable to me.
>>>
>>> of course in the real world, thrift can't do recursive structures, so
>>> we'd have to go with Column/SubColumn like SuperColumn/Column today.
>>> So... maybe not really an improvement after all. :)
>>>
>>> (Why am I not surprised to find out that protocol buffers does support
>>> this?  Sigh.)
>>>
>>> On Wed, Aug 12, 2009 at 8:51 PM, Evan Weaver<ew...@gmail.com> wrote:
>>>> Hmm, my Ruby client internally refers to columns and subcolumns,
>>>> rather than supercolumns and columns...mainly because the subcolumn
>>>> position is optional, but the column_or_supercolumn position is not.
>>>> So there is something we agree on.
>>>>
>>>> Do you think the lack of a timestamp in the supercolumn is confusing?
>>>> It's still not exactly a kind of column.
>>>>
>>>> Evan
>>>>
>>>> On Wed, Aug 12, 2009 at 9:47 PM, Jonathan Ellis<jb...@gmail.com> wrote:
>>>>> I agree with the proposition that the SuperColumn name is weak.
>>>>> (Although not, as I mentioned, Column or ColumnFamily.)  And I could
>>>>> go with schema over keyspace.
>>>>>
>>>>> One option to deal with SC would be to excise the term SC (and SCF
>>>>> from the config) and instead just have Columns, which may or may not
>>>>> have SubColumns.  You would define this as
>>>>>
>>>>> <ColumnFamily withSubColumns="true" .../>
>>>>>
>>>>> "Insert a subcolumn named A into the Column named B" fits pretty well
>>>>> with how I think of things working.  And now you just have Rows and
>>>>> Columns!  Just like a RDB! :P
>>>>>
>>>>> -Jonathan
>>>>>
>>>>> On Wed, Aug 12, 2009 at 8:34 PM, Evan Weaver<ew...@gmail.com> wrote:
>>>>>> Points taken, and I agree, except in my experience the current names
>>>>>> are not Pretty Good but rather Pretty Weird; the primary issues being
>>>>>> column family and super column.
>>>>>>
>>>>>> If we go by the shorter-is-better principle, we might get:
>>>>>>
>>>>>> Cluster
>>>>>> Schema
>>>>>> Row set
>>>>>> Row w/key
>>>>>> Field set
>>>>>> Field
>>>>>>
>>>>>> "You take the user's key, and use that to insert into the Row Set
>>>>>> 'user_associations' at Field Set 'user_timeline,' a field named with a
>>>>>> time-based UUID representing now, and with a value of the new tweet's
>>>>>> key."
>>>>>>
>>>>>> But let me study for a while and come up with a more researched proposal.
>>>>>>
>>>>>> Evan
>>>>>>
>>>>>> On Wed, Aug 12, 2009 at 9:21 PM, Jonathan Ellis<jb...@gmail.com> wrote:
>>>>>>> On Wed, Aug 12, 2009 at 7:52 PM, Michael Koziarski<mi...@koziarski.com> wrote:
>>>>>>>> However I think it's worth considering this from a strategic
>>>>>>>> perspective, looking at how we want the project to grow and change,
>>>>>>>> rather than just as it is right now.  The key to successful adoption
>>>>>>>> is having a successful elevator pitch,  you can start using a database
>>>>>>>> without understanding relational-algebra because 'table' and 'column'
>>>>>>>> are such simple ways to reason about the tool.  As it stands
>>>>>>>> Cassandra takes a whiteboard and 15 minutes, before people get what
>>>>>>>> you're talking about.
>>>>>>>
>>>>>>> If you want to explain it as "sort of like a relational db" then
>>>>>>>
>>>>>>> table -> CF
>>>>>>> column -> column
>>>>>>> key -> key
>>>>>>> row -> row
>>>>>>>
>>>>>>> That's the simple case, then all you have is "supercolumns can contain
>>>>>>> a list of simple columns."
>>>>>>>
>>>>>>> That really doesn't seem so hard to me.  I have explained this to *managers*.
>>>>>>>
>>>>>>>> Assuming the project gets anything like the adoption it deserves, the
>>>>>>>> users we have today will be a *tiny minority* of the users we have in
>>>>>>>> the future.  So imposing costs on the current userbase which will give
>>>>>>>> huge benefits to future users, should be something we're willing to
>>>>>>>> do.  In fact it's something that has been done repeatedly over the
>>>>>>>> last few weeks.
>>>>>>>
>>>>>>> I agree.  But as I said before I just don't see this as being an improvement.
>>>>>>>
>>>>>>>> Given those changes went in without debate, I'm not sure what the
>>>>>>>> reluctance is for making changes to the nomenclature for the project.
>>>>>>>
>>>>>>> As above.
>>>>>>>
>>>>>>>> Speaking as someone who's only been doing this a month, the naming is
>>>>>>>> *still* confusing, and when I talk with people who wonder what
>>>>>>>> cassandra is all about I get blank looks when telling them what things
>>>>>>>> are called.  If you step back and want to tell someone how you'd
>>>>>>>> insert a tweet into someone's timeline using evan's weblog post:
>>>>>>>>
>>>>>>>>  "You just take the user's key, and use that to insert into the
>>>>>>>> SuperColumnFamily 'UserAssociations' at SubColumn 'user_timeline', a
>>>>>>>> ColumnName of a time based uuid representing now, and a value of the
>>>>>>>> new tweet's key"
>>>>>>>>
>>>>>>>> Column is in the name of 3 of the 5 concepts expressed, and in each
>>>>>>>> cases it's different.
>>>>>>>
>>>>>>> When you're inserting something nested 3 levels deep a certain amount
>>>>>>> of verbosity is unavoidable.  With Evan's nomenclature,
>>>>>>>
>>>>>>> "You take the user's record ID, and use that to insert into the Record
>>>>>>> Collection 'user associations' at Attribute Collection
>>>>>>> 'user_timeline,' an Attribute named with a time based uuid
>>>>>>> representing now, and with a value of the new tweet's key."
>>>>>>>
>>>>>>> I think that is a negative improvement.  Yay, now we are talking about
>>>>>>> Attribute Collections and Attributes instead of SuperColumns and
>>>>>>> Columns.  The same objections ("one object's name contains the
>>>>>>> other's!) apply, plus the new one of sounding so generic that it could
>>>>>>> apply to practically any system.
>>>>>>>
>>>>>>> -Jonathan
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Evan Weaver
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Evan Weaver
>>>>
>>>
>>
>>
>>
>> --
>> Evan Weaver
>>
>
>
>
> --
> Evan Weaver
>

Re: Fixing the data model names

Posted by Evan Weaver <ew...@gmail.com>.

PS. How's Avro these days? Or could we patch Thrift? Haven't looked at
the internals but assume they're scary.

On Thu, Aug 13, 2009 at 12:23 AM, Evan Weaver<ew...@gmail.com> wrote:
> Incidentally, is there any specific reason the collation has to be
> pre-defined at the CF? What if any column could be an optional
> supercolumn with a collation set at runtime? Then all CFs would be the
> same.
>
> Evan
>



-- 
Evan Weaver

Re: Fixing the data model names

Posted by Jonathan Ellis <jb...@gmail.com>.

The assumption that within a CF only IColumns of the same type (C or
SC) will be compared is baked in pretty deeply.
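To make that assumption concrete, here is a minimal sketch (not Cassandra's actual Java code; ColumnFamily, column_type, and sort_key are hypothetical names used only for illustration). It assumes each CF is configured up front with one column kind and one sort order, so the comparator never has to handle a mix of columns and supercolumns:

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Column:
    name: bytes
    value: bytes
    timestamp: int


@dataclass
class SuperColumn:
    name: bytes
    subcolumns: List[Column] = field(default_factory=list)


class ColumnFamily:
    """Hypothetical CF that admits exactly one column kind and one sort order."""

    def __init__(self, name: str, column_type: type,
                 sort_key: Callable[[bytes], object] = lambda n: n) -> None:
        self.name = name
        self.column_type = column_type  # fixed when the CF is defined
        self.sort_key = sort_key        # collation, also fixed at definition
        self._columns: list = []

    def add(self, column) -> None:
        # Reject mixed kinds so sorting only ever compares like with like.
        if type(column) is not self.column_type:
            raise TypeError(f"{self.name} only holds {self.column_type.__name__}")
        self._columns.append(column)
        self._columns.sort(key=lambda c: self.sort_key(c.name))

    def names(self) -> List[bytes]:
        return [c.name for c in self._columns]
```

Letting any column optionally be a supercolumn with its own runtime collation would mean dropping the type check in add() and carrying a comparator per column rather than per CF.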

-Jonathan

On Wed, Aug 12, 2009 at 11:23 PM, Evan Weaver<ew...@gmail.com> wrote:
> Incidentally, is there any specific reason the collation has to be
> pre-defined at the CF? What if any column could be an optional
> supercolumn with a collation set at runtime? Then all CFs would be the
> same.
>
> Evan
>

Re: Fixing the data model names

Posted by Evan Weaver <ew...@gmail.com>.

Incidentally, is there any specific reason the collation has to be
pre-defined at the CF? What if any column could be an optional
supercolumn with a collation set at runtime? Then all CFs would be the
same.

Evan

On Wed, Aug 12, 2009 at 10:02 PM, Jonathan Ellis<jb...@gmail.com> wrote:
> If thrift were sane it would look something like
>
> struct Column {
>  byte[] name,
>  optional list<Column> subcolumns,
>  optional int64 timestamp,
>  optional byte[] value
> }
>
> "you can either have the subcolumns, or the timestamp and value" seems
> reasonable to me.
>
> of course in the real world, thrift can't do recursive structures, so
> we'd have to go with Column/SubColumn like SuperColumn/Column today.
> So... maybe not really an improvement after all. :)
>
> (Why am I not surprised to find out that protocol buffers does support
> this?  Sigh.)
>



-- 
Evan Weaver

Re: Fixing the data model names

Posted by Jonathan Ellis <jb...@gmail.com>.

If thrift were sane it would look something like

struct Column {
  byte[] name,
  optional list<Column> subcolumns,
  optional int64 timestamp,
  optional byte[] value
}

"you can either have the subcolumns, or the timestamp and value" seems
reasonable to me.

of course in the real world, thrift can't do recursive structures, so
we'd have to go with Column/SubColumn like SuperColumn/Column today.
So... maybe not really an improvement after all. :)

(Why am I not surprised to find out that protocol buffers does support
this?  Sigh.)
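
As a rough illustration of the "either subcolumns, or timestamp and value" rule, here is a minimal Python sketch of the recursive shape described above. It is not Thrift and not Cassandra's API; the validation in __post_init__ is an assumption about how the either/or constraint might be enforced:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Column:
    name: bytes
    subcolumns: Optional[List["Column"]] = None  # recursive reference
    timestamp: Optional[int] = None
    value: Optional[bytes] = None

    def __post_init__(self) -> None:
        is_container = self.subcolumns is not None
        has_timestamp = self.timestamp is not None
        has_value = self.value is not None
        # A container column carries only subcolumns.
        if is_container and (has_timestamp or has_value):
            raise ValueError("a column with subcolumns cannot carry a timestamp or value")
        # A leaf column carries both a timestamp and a value.
        if not is_container and not (has_timestamp and has_value):
            raise ValueError("a leaf column needs both a timestamp and a value")


# Usage: a leaf column nested inside a container column.
leaf = Column(name=b"uuid-1", timestamp=1, value=b"tweet-key")
timeline = Column(name=b"user_timeline", subcolumns=[leaf])
```

Nothing in this sketch limits nesting to two levels; a real schema would presumably add that restriction.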

On Wed, Aug 12, 2009 at 8:51 PM, Evan Weaver<ew...@gmail.com> wrote:
> Hmm, my Ruby client internally refers to columns and subcolumns,
> rather than supercolumns and columns...mainly because the subcolumn
> position is optional, but the column_or_supercolumn position is not.
> So there is something we agree on.
>
> Do you think the lack of a timestamp in the supercolumn is confusing?
> It's still not exactly a kind of column.
>
> Evan
>

Re: Fixing the data model names

Posted by Evan Weaver <ew...@gmail.com>.

Hmm, my Ruby client internally refers to columns and subcolumns,
rather than supercolumns and columns...mainly because the subcolumn
position is optional, but the column_or_supercolumn position is not.
So there is something we agree on.

Do you think the lack of a timestamp in the supercolumn is confusing?
It's still not exactly a kind of column.
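
A small sketch of the "column required, subcolumn optional" addressing, written in Python rather than Ruby and not taken from the actual client; build_path and its parameters are hypothetical names:

```python
from typing import Optional, Tuple


def build_path(column_family: str, key: str, column: str,
               subcolumn: Optional[str] = None) -> Tuple[str, ...]:
    """Return the address of a value: the column position is mandatory,
    the subcolumn position is appended only when the caller supplies it."""
    path = [column_family, key, column]
    if subcolumn is not None:
        path.append(subcolumn)
    return tuple(path)


# Usage: a three-part path for a plain column, four parts with a subcolumn.
plain = build_path("Users", "user_1", "name")
nested = build_path("UserAssociations", "user_1", "user_timeline", "uuid-1")
```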

Evan

On Wed, Aug 12, 2009 at 9:47 PM, Jonathan Ellis<jb...@gmail.com> wrote:
> I agree with the proposition that the SuperColumn name is weak.
> (Although not, as I mentioned, Column or ColumnFamily.)  And I could
> go with schema over keyspace.
>
> One option to deal with SC would be to excise the term SC (and SCF
> from the config) and instead just have Columns, which may or may not
> have SubColumns.  You would define this as
>
> <ColumnFamily withSubColumns="true" .../>
>
> "Insert a subcolumn named A into the Column named B" fits pretty well
> with how I think of things working.  And now you just have Rows and
> Columns!  Just like an RDB! :P
>
> -Jonathan
>
> On Wed, Aug 12, 2009 at 8:34 PM, Evan Weaver<ew...@gmail.com> wrote:
>> Points taken, and I agree, except in my experience the current names
>> are not Pretty Good but rather Pretty Weird; the primary issues being
>> column family and super column.
>>
>> If we go by the shorter-is-better principle, we might get:
>>
>> Cluster
>> Schema
>> Row set
>> Row w/key
>> Field set
>> Field
>>
>> "You take the user's key, and use that to insert into the Row Set
>> 'user_associations' at Field Set 'user_timeline,' a field named with a
>> time-based UUID representing now, and with a value of the new tweet's
>> key."
>>
>> But let me study for a while and come up with a more researched proposal.
>>
>> Evan
>>
>> On Wed, Aug 12, 2009 at 9:21 PM, Jonathan Ellis<jb...@gmail.com> wrote:
>>> On Wed, Aug 12, 2009 at 7:52 PM, Michael Koziarski<mi...@koziarski.com> wrote:
>>>> However I think it's worth considering this from a strategic
>>>> perspective, looking at how we want the project to grow and change,
>>>> rather than just as it is right now.  The key to successful adoption
>>>> is having a successful elevator pitch.  You can start using a database
>>>> without understanding relational algebra because 'table' and 'column'
>>>> are such simple ways to reason about the tool.  As it stands,
>>>> Cassandra's pitch takes a whiteboard and 15 minutes before people get
>>>> what you're talking about.
>>>
>>> If you want to explain it as "sort of like a relational db" then
>>>
>>> table -> CF
>>> column -> column
>>> key -> key
>>> row -> row
>>>
>>> That's the simple case, then all you have is "supercolumns can contain
>>> a list of simple columns."
>>>
>>> That really doesn't seem so hard to me.  I have explained this to *managers*.
>>>
>>>> Assuming the project gets anything like the adoption it deserves, the
>>>> users we have today will be a *tiny minority* of the users we have in
>>>> the future.  So imposing costs on the current userbase which will give
>>>> huge benefits to future users, should be something we're willing to
>>>> do.  In fact it's something that has been done repeatedly over the
>>>> last few weeks.
>>>
>>> I agree.  But as I said before I just don't see this as being an improvement.
>>>
>>>> Given those changes went in without debate, I'm not sure what the
>>>> reluctance is for making changes to the nomenclature for the project.
>>>
>>> As above.
>>>
>>>> Speaking as someone who's only been doing this a month, the naming is
>>>> *still* confusing, and when I talk with people who wonder what
>>>> cassandra is all about I get blank looks when telling them what things
>>>> are called.  If you step back and want to tell someone how you'd
>>>> insert a tweet into someone's timeline using evan's weblog post:
>>>>
>>>>  "You just take the user's key, and use that to insert into the
>>>> SuperColumnFamily 'UserAssociations' at SubColumn 'user_timeline', a
>>>> ColumnName of a time based uuid representing now, and a value of the
>>>> new tweet's key"
>>>>
>>>> Column is in the name of 3 of the 5 concepts expressed, and in each
>>>> case it's different.
>>>
>>> When you're inserting something nested 3 levels deep a certain amount
>>> of verbosity is unavoidable.  With Evan's nomenclature,
>>>
>>> "You take the user's record ID, and use that to insert into the Record
>>> Collection 'user associations' at Attribute Collection
>>> 'user_timeline,' an Attribute named with a time based uuid
>>> representing now, and with a value of the new tweet's key."
>>>
>>> I think that is a negative improvement.  Yay, now we are talking about
>>> Attribute Collections and Attributes instead of SuperColumns and
>>> Columns.  The same objections ("one object's name contains the
>>> other's!) apply, plus the new one of sounding so generic that it could
>>> apply to practically any system.
>>>
>>> -Jonathan
>>>
>>
>>
>>
>> --
>> Evan Weaver
>>
>



-- 
Evan Weaver

Re: Fixing the data model names

Posted by Evan Weaver <ew...@gmail.com>.

Keys are part of the row in a RDB. So a table has columns and rows,
which is everything...

Evan

On Thu, Aug 13, 2009 at 1:43 PM, Jonathan Ellis<jb...@gmail.com> wrote:
> On Thu, Aug 13, 2009 at 12:38 PM, Evan Weaver<ew...@gmail.com> wrote:
>> I understand the BigTable precedent issue...but it's also a group of
>> rows, and a group of keys, just as much. "Column Family" leaves out
>> the keys
>
> Well, it does, or does not, to the same degree that a Table in a rbd
> does or does not leave out the keys.
>



-- 
Evan Weaver

Re: Fixing the data model names

Posted by Jonathan Ellis <jb...@gmail.com>.

On Thu, Aug 13, 2009 at 12:38 PM, Evan Weaver<ew...@gmail.com> wrote:
> I understand the BigTable precedent issue...but it's also a group of
> rows, and a group of keys, just as much. "Column Family" leaves out
> the keys

Well, it does, or does not, to the same degree that a Table in a rdb
does or does not leave out the keys.

Re: Fixing the data model names

Posted by Evan Weaver <ew...@gmail.com>.

I understand the BigTable precedent issue...but it's also a group of
rows, and a group of keys, just as much. "Column Family" leaves out
the keys which I think is a big part of the noob confusion. It sounds
like it could be the same thing as a Super Column, which is also a
group of columns.

Evan

On Thu, Aug 13, 2009 at 1:32 PM, Jonathan Ellis<jb...@gmail.com> wrote:
> On Thu, Aug 13, 2009 at 12:24 PM, Evan Weaver<ew...@gmail.com> wrote:
>> What do you see as the benefit of ColumnFamily?
>
> It correctly implies "group of columns" w/o sounding excessively
> generic like ColumnCollection or something, and it means mostly the
> same thing as it does in Bigtable, which can be useful in
> understanding it.
>
> -Jonathan
>

-- 
Evan Weaver

Re: Fixing the data model names

Posted by Jonathan Ellis <jb...@gmail.com>.

On Thu, Aug 13, 2009 at 12:24 PM, Evan Weaver<ew...@gmail.com> wrote:
> What do you see as the benefit of ColumnFamily?

It correctly implies "group of columns" w/o sounding excessively
generic like ColumnCollection or something, and it means mostly the
same thing as it does in Bigtable, which can be useful in
understanding it.

-Jonathan

Re: Fixing the data model names

Posted by Evan Weaver <ew...@gmail.com>.

Hmm...

What do you see as the benefit of ColumnFamily? As you mentioned on
the -users list, it's not the common case in Cassandra for a document
to span multiple column families. But that seemed to be the motivation
for naming it that in Bigtable--a "row" usually spanned multiple files
(to get the semi-column-oriented thing going on), so you had to have
some name for the individual groups of columns across the distributed
row. This is also the reason that the API parameter order was merely
not storage order.

When anyone asks me (and they always ask) I just tell them it's like a
table. Though I am warming up to the subcolumn thing, I still don't
think it makes sense to talk about columns in a non-tabular
multidimensional space. If "attributes" mean columns in
relational-theory land, then attributes are wrong too.

It's true that in Cassandra you can talk to any keyspace. On the other
hand, you told me to design my client API as if you couldn't, and to
declare the keyspace at client instantiation. From a usability
perspective I agree that that is correct, but I'm confused about the
intention. So it seems like the clients should not treat the Thrift
API as a design foundation.

I'm going to make a list of data model APIs for other databases and
see if anything falls out.

Evan

On Wed, Aug 12, 2009 at 9:47 PM, Jonathan Ellis<jb...@gmail.com> wrote:
> I agree with the proposition that the SuperColumn name is weak.
> (Although not, as I mentioned, Column or ColumnFamily.)  And I could
> go with schema over keyspace.
>
> One option to deal with SC would be to excise the term SC (and SCF
> from the config) and instead just have Columns, which may or may not
> have SubColumns.  You would define this as
>
> <ColumnFamily withSubColumns="true" .../>
>
> "Insert a subcolumn named A into the Column named B" fits pretty well
> with how I think of things working.  And now you just have Rows and
> Columns!  Just like a RDB! :P
>
> -Jonathan
>
> On Wed, Aug 12, 2009 at 8:34 PM, Evan Weaver<ew...@gmail.com> wrote:
>> Points taken, and I agree, except in my experience the current names
>> are not Pretty Good but rather Pretty Weird; the primary issues being
>> column family and super column.
>>
>> If we go by the shorter-is-better principle, we might get:
>>
>> Cluster
>> Schema
>> Row set
>> Row w/key
>> Field set
>> Field
>>
>> "You take the user's key, and use that to insert into the Row Set
>> 'user_associations' at Field Set 'user_timeline,' a field named with a
>> time-based UUID representing now, and with a value of the new tweet's
>> key."
>>
>> But let me study for a while and come up with a more researched proposal.
>>
>> Evan
>>
>> On Wed, Aug 12, 2009 at 9:21 PM, Jonathan Ellis<jb...@gmail.com> wrote:
>>> On Wed, Aug 12, 2009 at 7:52 PM, Michael Koziarski<mi...@koziarski.com> wrote:
>>>> However I think it's worth considering this from a strategic
>>>> perspective, looking at how we want the project do grow and change,
>>>> rather than just as it is right now.  The key to successful adoption
>>>> is having a successful elevator pitch,  you can start using a database
>>>> without understanding relational-algebra because 'table' and 'column'
>>>> are such simple ways to reason about the tool.  As it stands
>>>> cassandra's takes a whiteboard and 15 minutes, before people get what
>>>> you're talking about.
>>>
>>> If you want to explain it as "sort of like a relational db" then
>>>
>>> table -> CF
>>> column -> column
>>> key -> key
>>> row -> row
>>>
>>> That's the simple case, then all you have is "supercolumns can contain
>>> a list of simple columns."
>>>
>>> That really doesn't seem so hard to me.  I have explained this to *managers*.
>>>
>>>> Assuming the project gets anything like the adoption it deserves, the
>>>> users we have today will be a *tiny minority* of the users we have in
>>>> the future.  So imposing costs on the current userbase which will give
>>>> huge benefits to future users, should be something we're willing to
>>>> do.  In fact it's something that has been done repeatedly over the
>>>> last few weeks.
>>>
>>> I agree.  But as I said before I just don't see this as being an improvement.
>>>
>>>> Given those changes went in without debate, I'm not sure what the
>>>> reluctance is for making changes to the nomenclature for the project.
>>>
>>> As above.
>>>
>>>> Speaking as someone who's only been doing this a month, the naming is
>>>> *still* confusing, and when I talk with people who wonder what
>>>> cassandra is all about I get blank looks when telling them what things
>>>> are called.  If you step back and want to tell someone how you'd
>>>> insert a tweet into someone's timeline using evan's weblog post:
>>>>
>>>>  "You just take the user's key, and use that to insert into the
>>>> SuperColumnFamily 'UserAssociations' at SubColumn 'user_timeline', a
>>>> ColumnName of a time based uuid representing now, and a value of the
>>>> new tweet's key"
>>>>
>>>> Column is in the name of 3 of the 5 concepts expressed, and in each
>>>> cases it's different.
>>>
>>> When you're inserting something nested 3 levels deep a certain amount
>>> of verbosity is unavoidable.  With Evan's nomenclature,
>>>
>>> "You take the user's record ID, and use that to insert into the Record
>>> Collection 'user associations' at Attribute Collection
>>> 'user_timeline,' an Attribute named with a time based uuid
>>> representing now, and with a value of the new tweet's key."
>>>
>>> I think that is a negative improvement.  Yay, now we are talking about
>>> Attribute Collections and Attributes instead of SuperColumns and
>>> Columns.  The same objections ("one object's name contains the
>>> other's!) apply, plus the new one of sounding so generic that it could
>>> apply to practically any system.
>>>
>>> -Jonathan
>>>
>>
>>
>>
>> --
>> Evan Weaver
>>
>



-- 
Evan Weaver

Re: Fixing the data model names

Posted by Jonathan Ellis <jb...@gmail.com>.

I agree with the proposition that the SuperColumn name is weak.
(Although not, as I mentioned, Column or ColumnFamily.)  And I could
go with schema over keyspace.

One option to deal with SC would be to excise the term SC (and SCF
from the config) and instead just have Columns, which may or may not
have SubColumns.  You would define this as

<ColumnFamily withSubColumns="true" .../>

"Insert a subcolumn named A into the Column named B" fits pretty well
with how I think of things working.  And now you just have Rows and
Columns!  Just like a RDB! :P
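
As a rough sketch of how that might behave, assuming a hypothetical
Python ColumnFamily class whose with_sub_columns flag mirrors the
config attribute above (neither is a real Cassandra API, and the rule
that a subcolumn is required exactly when the flag is set is an
assumption made for illustration):

```python
# Illustration only: `ColumnFamily` and `with_sub_columns` are made-up names
# mirroring the proposed config attribute, not part of any real Cassandra API.

class ColumnFamily:
    def __init__(self, name, with_sub_columns=False):
        self.name = name
        self.with_sub_columns = with_sub_columns
        # row key -> {column name -> value, or {subcolumn name -> value}}
        self.rows = {}

    def insert(self, key, column, value, subcolumn=None):
        """Insert a value into a Column, or into a SubColumn of that Column."""
        # Assumed rule: a subcolumn is given exactly when the family allows them.
        if (subcolumn is not None) != self.with_sub_columns:
            raise ValueError(
                f"{self.name}: subcolumn must be given exactly when "
                "with_sub_columns is true"
            )
        row = self.rows.setdefault(key, {})
        if subcolumn is None:
            row[column] = value
        else:
            row.setdefault(column, {})[subcolumn] = value


# "Insert a subcolumn named A into the Column named B"
cf = ColumnFamily("Example", with_sub_columns=True)
cf.insert("row1", "B", "some value", subcolumn="A")
```

With the flag left off, the same class stores plain rows and columns,
which is the "just like a RDB" case.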

-Jonathan

On Wed, Aug 12, 2009 at 8:34 PM, Evan Weaver<ew...@gmail.com> wrote:
> Points taken, and I agree, except in my experience the current names
> are not Pretty Good but rather Pretty Weird; the primary issues being
> column family and super column.
>
> If we go by the shorter-is-better principle, we might get:
>
> Cluster
> Schema
> Row set
> Row w/key
> Field set
> Field
>
> "You take the user's key, and use that to insert into the Row Set
> 'user_associations' at Field Set 'user_timeline,' a field named with a
> time-based UUID representing now, and with a value of the new tweet's
> key."
>
> But let me study for a while and come up with a more researched proposal.
>
> Evan
>
> On Wed, Aug 12, 2009 at 9:21 PM, Jonathan Ellis<jb...@gmail.com> wrote:
>> On Wed, Aug 12, 2009 at 7:52 PM, Michael Koziarski<mi...@koziarski.com> wrote:
>>> However I think it's worth considering this from a strategic
>>> perspective, looking at how we want the project do grow and change,
>>> rather than just as it is right now.  The key to successful adoption
>>> is having a successful elevator pitch,  you can start using a database
>>> without understanding relational-algebra because 'table' and 'column'
>>> are such simple ways to reason about the tool.  As it stands
>>> cassandra's takes a whiteboard and 15 minutes, before people get what
>>> you're talking about.
>>
>> If you want to explain it as "sort of like a relational db" then
>>
>> table -> CF
>> column -> column
>> key -> key
>> row -> row
>>
>> That's the simple case, then all you have is "supercolumns can contain
>> a list of simple columns."
>>
>> That really doesn't seem so hard to me.  I have explained this to *managers*.
>>
>>> Assuming the project gets anything like the adoption it deserves, the
>>> users we have today will be a *tiny minority* of the users we have in
>>> the future.  So imposing costs on the current userbase which will give
>>> huge benefits to future users, should be something we're willing to
>>> do.  In fact it's something that has been done repeatedly over the
>>> last few weeks.
>>
>> I agree.  But as I said before I just don't see this as being an improvement.
>>
>>> Given those changes went in without debate, I'm not sure what the
>>> reluctance is for making changes to the nomenclature for the project.
>>
>> As above.
>>
>>> Speaking as someone who's only been doing this a month, the naming is
>>> *still* confusing, and when I talk with people who wonder what
>>> cassandra is all about I get blank looks when telling them what things
>>> are called.  If you step back and want to tell someone how you'd
>>> insert a tweet into someone's timeline using evan's weblog post:
>>>
>>>  "You just take the user's key, and use that to insert into the
>>> SuperColumnFamily 'UserAssociations' at SubColumn 'user_timeline', a
>>> ColumnName of a time based uuid representing now, and a value of the
>>> new tweet's key"
>>>
>>> Column is in the name of 3 of the 5 concepts expressed, and in each
>>> cases it's different.
>>
>> When you're inserting something nested 3 levels deep a certain amount
>> of verbosity is unavoidable.  With Evan's nomenclature,
>>
>> "You take the user's record ID, and use that to insert into the Record
>> Collection 'user associations' at Attribute Collection
>> 'user_timeline,' an Attribute named with a time based uuid
>> representing now, and with a value of the new tweet's key."
>>
>> I think that is a negative improvement.  Yay, now we are talking about
>> Attribute Collections and Attributes instead of SuperColumns and
>> Columns.  The same objections ("one object's name contains the
>> other's!) apply, plus the new one of sounding so generic that it could
>> apply to practically any system.
>>
>> -Jonathan
>>
>
>
>
> --
> Evan Weaver
>

Re: Fixing the data model names

Posted by Evan Weaver <ew...@gmail.com>.

Points taken, and I agree, except in my experience the current names
are not Pretty Good but rather Pretty Weird; the primary issues being
column family and super column.

If we go by the shorter-is-better principle, we might get:

Cluster
Schema
Row set
Row w/key
Field set
Field

"You take the user's key, and use that to insert into the Row Set
'user_associations' at Field Set 'user_timeline,' a field named with a
time-based UUID representing now, and with a value of the new tweet's
key."

But let me study for a while and come up with a more researched proposal.

Evan

On Wed, Aug 12, 2009 at 9:21 PM, Jonathan Ellis<jb...@gmail.com> wrote:
> On Wed, Aug 12, 2009 at 7:52 PM, Michael Koziarski<mi...@koziarski.com> wrote:
>> However I think it's worth considering this from a strategic
>> perspective, looking at how we want the project do grow and change,
>> rather than just as it is right now.  The key to successful adoption
>> is having a successful elevator pitch,  you can start using a database
>> without understanding relational-algebra because 'table' and 'column'
>> are such simple ways to reason about the tool.  As it stands
>> cassandra's takes a whiteboard and 15 minutes, before people get what
>> you're talking about.
>
> If you want to explain it as "sort of like a relational db" then
>
> table -> CF
> column -> column
> key -> key
> row -> row
>
> That's the simple case, then all you have is "supercolumns can contain
> a list of simple columns."
>
> That really doesn't seem so hard to me.  I have explained this to *managers*.
>
>> Assuming the project gets anything like the adoption it deserves, the
>> users we have today will be a *tiny minority* of the users we have in
>> the future.  So imposing costs on the current userbase which will give
>> huge benefits to future users, should be something we're willing to
>> do.  In fact it's something that has been done repeatedly over the
>> last few weeks.
>
> I agree.  But as I said before I just don't see this as being an improvement.
>
>> Given those changes went in without debate, I'm not sure what the
>> reluctance is for making changes to the nomenclature for the project.
>
> As above.
>
>> Speaking as someone who's only been doing this a month, the naming is
>> *still* confusing, and when I talk with people who wonder what
>> cassandra is all about I get blank looks when telling them what things
>> are called.  If you step back and want to tell someone how you'd
>> insert a tweet into someone's timeline using evan's weblog post:
>>
>>  "You just take the user's key, and use that to insert into the
>> SuperColumnFamily 'UserAssociations' at SubColumn 'user_timeline', a
>> ColumnName of a time based uuid representing now, and a value of the
>> new tweet's key"
>>
>> Column is in the name of 3 of the 5 concepts expressed, and in each
>> cases it's different.
>
> When you're inserting something nested 3 levels deep a certain amount
> of verbosity is unavoidable.  With Evan's nomenclature,
>
> "You take the user's record ID, and use that to insert into the Record
> Collection 'user associations' at Attribute Collection
> 'user_timeline,' an Attribute named with a time based uuid
> representing now, and with a value of the new tweet's key."
>
> I think that is a negative improvement.  Yay, now we are talking about
> Attribute Collections and Attributes instead of SuperColumns and
> Columns.  The same objections ("one object's name contains the
> other's!) apply, plus the new one of sounding so generic that it could
> apply to practically any system.
>
> -Jonathan
>



-- 
Evan Weaver

Re: Fixing the data model names

Posted by Jonathan Ellis <jb...@gmail.com>.

On Wed, Aug 12, 2009 at 7:52 PM, Michael Koziarski<mi...@koziarski.com> wrote:
> However I think it's worth considering this from a strategic
> perspective, looking at how we want the project do grow and change,
> rather than just as it is right now.  The key to successful adoption
> is having a successful elevator pitch,  you can start using a database
> without understanding relational-algebra because 'table' and 'column'
> are such simple ways to reason about the tool.  As it stands
> cassandra's takes a whiteboard and 15 minutes, before people get what
> you're talking about.

If you want to explain it as "sort of like a relational db" then

table -> CF
column -> column
key -> key
row -> row

That's the simple case, then all you have is "supercolumns can contain
a list of simple columns."

That really doesn't seem so hard to me.  I have explained this to *managers*.
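
A minimal sketch of that mapping, assuming plain Python dicts stand in
for the storage layer (the names users, user_associations, get_column
and get_subcolumns are invented for illustration and are not Cassandra
or Thrift calls):

```python
# Illustration only: nested dicts standing in for the data model.

# Simple case: a ColumnFamily behaves like a sparse table.
# row key -> {column name -> value}
users = {
    "alice": {"email": "alice@example.com", "city": "SF"},
    "bob": {"email": "bob@example.com"},  # sparse: no "city" column
}

# Super case: each supercolumn contains a set of simple columns.
# row key -> {supercolumn name -> {column name -> value}}
user_associations = {
    "alice": {
        "user_timeline": {"uuid-1": "tweet-17", "uuid-2": "tweet-42"},
    },
}


def get_column(cf, key, column):
    """Look up one column in a simple ColumnFamily; None if absent."""
    return cf.get(key, {}).get(column)


def get_subcolumns(scf, key, supercolumn):
    """Return the simple columns stored under one supercolumn."""
    return scf.get(key, {}).get(supercolumn, {})
```

The only structural difference between the two cases is one extra
level of nesting.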

> Assuming the project gets anything like the adoption it deserves, the
> users we have today will be a *tiny minority* of the users we have in
> the future.  So imposing costs on the current userbase which will give
> huge benefits to future users, should be something we're willing to
> do.  In fact it's something that has been done repeatedly over the
> last few weeks.

I agree.  But as I said before I just don't see this as being an improvement.

> Given those changes went in without debate, I'm not sure what the
> reluctance is for making changes to the nomenclature for the project.

As above.

> Speaking as someone who's only been doing this a month, the naming is
> *still* confusing, and when I talk with people who wonder what
> cassandra is all about I get blank looks when telling them what things
> are called.  If you step back and want to tell someone how you'd
> insert a tweet into someone's timeline using evan's weblog post:
>
>  "You just take the user's key, and use that to insert into the
> SuperColumnFamily 'UserAssociations' at SubColumn 'user_timeline', a
> ColumnName of a time based uuid representing now, and a value of the
> new tweet's key"
>
> Column is in the name of 3 of the 5 concepts expressed, and in each
> cases it's different.

When you're inserting something nested 3 levels deep a certain amount
of verbosity is unavoidable.  With Evan's nomenclature,

"You take the user's record ID, and use that to insert into the Record
Collection 'user associations' at Attribute Collection
'user_timeline,' an Attribute named with a time based uuid
representing now, and with a value of the new tweet's key."

I think that is a negative improvement.  Yay, now we are talking about
Attribute Collections and Attributes instead of SuperColumns and
Columns.  The same objections ("one object's name contains the
other's!") apply, plus the new one of sounding so generic that it could
apply to practically any system.

-Jonathan

Re: Fixing the data model names

Posted by Michael Koziarski <mi...@koziarski.com>.

> I agree that individually, the current names are technically accurate
> in their specific contexts. But taken as a whole, they make
> practically no sense to someone starting out, as Ryan mentions. I'll
> poke around try to come up with some other possible term sets. The
> point isn't that they are *this* specific set, just that they are
> internally consistent, and analogous to things widely understood.

I don't want to weigh in on the cost-side of the equation, I'm not
qualified to know how much work is involved in that.

However I think it's worth considering this from a strategic
perspective, looking at how we want the project to grow and change,
rather than just as it is right now.  The key to successful adoption
is having a successful elevator pitch: you can start using a database
without understanding relational algebra because 'table' and 'column'
are such simple ways to reason about the tool.  As it stands,
Cassandra takes a whiteboard and 15 minutes before people get what
you're talking about.

Assuming the project gets anything like the adoption it deserves, the
users we have today will be a *tiny minority* of the users we have in
the future.  So imposing costs on the current userbase which will give
huge benefits to future users, should be something we're willing to
do.  In fact it's something that has been done repeatedly over the
last few weeks.

SuperColumnFamilies have had their behaviour *completely* change with
the addition of comparators for subcolumns; the on-disk format has
changed; and the configuration file format is completely different.
All of these changes have been great and are huge positive
improvements, but have imposed significant taxes on existing users.
Given those changes went in without debate, I'm not sure what the
reluctance is for making changes to the nomenclature for the project.

Speaking as someone who's only been doing this a month, the naming is
*still* confusing, and when I talk with people who wonder what
cassandra is all about I get blank looks when telling them what things
are called.  If you step back and want to tell someone how you'd
insert a tweet into someone's timeline using evan's weblog post:

  "You just take the user's key, and use that to insert into the
SuperColumnFamily 'UserAssociations' at SubColumn 'user_timeline', a
ColumnName of a time based uuid representing now, and a value of the
new tweet's key"

Column is in the name of 3 of the 5 concepts expressed, and in each
case it's different.  In none of the cases does it correspond to what
users coming from an RDBMS background think of as a column.  Additionally
the names SuperColumnFamily and ColumnFamily don't convey the main
difference; they just make one sound scalier than the other.
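
To make the three levels in that sentence concrete, here is a small
sketch, assuming an in-memory dict in place of the real
SuperColumnFamily (the helper names insert and add_to_timeline and the
key formats "user:42" and "tweet:1001" are hypothetical):

```python
import uuid

# Hypothetical in-memory stand-in for the SuperColumnFamily 'UserAssociations'.
# Layout: row key -> supercolumn name -> column name -> value
user_associations = {}


def insert(scf, key, supercolumn, column_name, value):
    """Insert one column three levels deep, creating the parents as needed."""
    scf.setdefault(key, {}).setdefault(supercolumn, {})[column_name] = value


def add_to_timeline(scf, user_key, tweet_key):
    """Store a tweet's key in a user's timeline under a time-based UUID."""
    column_name = uuid.uuid1()  # version 1 UUIDs embed a timestamp
    insert(scf, user_key, "user_timeline", column_name, tweet_key)
    return column_name


name = add_to_timeline(user_associations, "user:42", "tweet:1001")
```

Here the user's key selects the row, 'user_timeline' selects the
supercolumn, and the UUID is the column name whose value is the
tweet's key.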

I have no idea what the alternative names should be, and am reluctant
to try as I'm still a newbie here and have precisely one rejected patch
to my name, but I do think that at the very least we should
strongly consider renaming anything with 'Column' in the name.



-- 
Cheers

Koz

Re: Fixing the data model names

Posted by Evan Weaver <ew...@gmail.com>.

Re. Jonathan on "database": oracle/sqlserver/mysql/postgres call it a
database. I guess "schema" is ok, but it seems like a case of "why be
different" (see, I can play both sides here :-p ). I didn't know that
"row" was considered a real thing in Cassandra.

Re. Jonathan on "columns": it would make more sense if "column family"
was actually called "sparse table". But super columns break the
tabular model, so I don't think pretending to be tabular is a good
answer. Personally I prefer the terms borrowed from document databases
(I didn't realize that "attribute" was the relational-theory term).
Maybe "field" and "field set" is better.

I agree that individually, the current names are technically accurate
in their specific contexts. But taken as a whole, they make
practically no sense to someone starting out, as Ryan mentions. I'll
poke around and try to come up with some other possible term sets. The
point isn't that they are *this* specific set, just that they are
internally consistent, and analogous to things widely understood.

Evan

On Wed, Aug 12, 2009 at 6:05 PM, Ryan King<ry...@twitter.com> wrote:
> I'm not going to go into my full position on this issue, because I
> agree with Evan (we developed the proposal together).
>
> I would like to reiterate, one of our main motivations behind renaming
> the data model is to make it easier for people to get up to speed with
> Cassandra.
>
> Evan and I both had problems understanding the data model and we've
> seen the same struggles over and over as we try and explain the data
> model to other engineers here at twitter. So, after developing this
> proposal for a new naming scheme, we tested it with more engineers, to
> see if it was, in fact, easier to explain. We didn't do a rigorous
> study, but without a doubt it was clearer and easier to understand.
> And these are all people who've read the BigTable and Dynamo papers,
> most of whom have CS (bachelors' or masters') degrees and are
> generally smart.
>
> I'm not saying this is a definitive study, but I think we need to try
> and understand the perspective of the n00bs.
>
> On Wed, Aug 12, 2009 at 11:52 AM, Jonathan Ellis<jb...@gmail.com> wrote:
>> My brief two cents:
>>
>> I think terminology + api changes need to be a big improvement to be
>> worth breaking things at this point, and I don't think this proposal
>> meets that bar.  In fact I'm not sure any proposal could.
>>
>> On the specifics:
>>
>> * Keyspace vs Database
>>
>> Actually the right concept from the rdb world is "schema."  (Maybe it
>> is a mysql-ism to call these "databases?")
>>
>> I deliberately avoided that term though, possibly mistakenly.
>>
>> * ColumnFamily vs Record collection
>>
>> -1.  CF correctly implies "group of columns" to me without being so
>> generic it could apply to anything.
>
> But a CF isn't a "group of columns", it's a group of <thing without a
> name>'s, which contain columns. This naming caused me to believe that
> you have something (row/record) that spans multiple column families.
>
>> * Record vs Row
>>
>> I don't really care, I guess, but row never really seemed confusing to me.
>>
>> * Column vs Attribute
>>
>> Definitely -1 on this too.  Both imply "a named value" but column is
>> from the database world but attribute is from OO.  The connotations
>> are wrong.  Here the baggage from a relational background is mostly
>> correct.  As Evan notes the difference is that ColumnFamilies are
>> sparse, but that is a difference between CFs and Tables not between
>> the different concepts of Columns per se.
>
> I think my problem with using column here is that it implies that you
> can do stuff with columns from multiple rows/records.
>
>> * SuperColumn vs Attribute Collection
>>
>> SuperColumn is probably the worst name here, but calling it a
>> ColumnCollection would not be an improvement.  (I can have a
>> Collection<Column> in my own code, and do, but that is not the same
>> thing at all.)
>>
>> So having thought it through I think I would have to say I think the
>> current names, if not perfect, are underrated.  Even if making the
>> change were free, and it's obviously not, I would prefer the existing
>> terminology.
>
> I think, overall, the naming is a significant barrier to entry for new
> cassandra users. This proposal will certainly be expensive, both in
> terms of the work (which we at twitter are willing to do) and the
> disruption. However, we're still early in Cassandra's life and this
> may be our only chance to improve this situation.
>
> -ryan
>



-- 
Evan Weaver

Re: Fixing the data model names

Posted by Evan Weaver <ew...@gmail.com>.

I think we understand the concepts clearly. They're not hard once you
clear away the misconceptions and mis-assumptions. Of course it's part
of the barrier, but I don't think it's the main barrier.

Maybe we should start with a list of common misconceptions, and work
bottom up to the best way to prevent them.

Evan

On Thu, Aug 13, 2009 at 1:55 PM, Evan Weaver<ew...@gmail.com> wrote:
>> For the terminology to be considered a barrier to entry, I think you
>> need to demonstrate obviously superior terminology.
>
> I agree with that and am happy to accept that our proposal is not good
> enough. We'll work on another.
>
> Evan
>
> On Thu, Aug 13, 2009 at 1:52 PM, Eric Evans<ee...@rackspace.com> wrote:
>> On Wed, 2009-08-12 at 15:05 -0700, Ryan King wrote:
>>> I would like to reiterate, one of our main motivations behind renaming
>>> the data model is to make it easier for people to get up to speed with
>>> Cassandra.
>>
>> This has been repeated several times during this thread. I hope it's not
>> meant to imply that those opposed do not care about our users, or about
>> making Cassandra easier to understand.
>>
>>> Evan and I both had problems understanding the data model and we've
>>> seen the same struggles over and over as we try and explain the data
>>> model to other engineers here at twitter. So, after developing this
>>> proposal for a new naming scheme, we tested it with more engineers, to
>>> see if it was, in fact, easier to explain. We didn't do a rigorous
>>> study, but without a doubt it was clearer and easier to understand.
>>> And these are all people who've read the BigTable and Dynamo papers,
>>> most of whom have CS (bachelors' or masters') degrees and are
>>> generally smart.
>>
>> Yeah, that's anecdotal. I could counter with anecdotal evidence to the
>> contrary but I don't think it would be very helpful or productive.
>>
>> I honestly feel like you guys are confounding the concepts, and the
>> terminology used to describe them. Granted, the right choice of
>> terminology could certainly make it easier to convey how things work,
>> but there is a sort of minimum overhead here. In other words, you can
>> call things whatever you want, it's not going to change how they
>> actually work. At least some portion of the difficulty people have in
>> conceptualizing Cassandra, are in fact the concepts themselves.
>>
>> [ ... ]
>>
>>> > So having thought it through I think I would have to say I think the
>>> > current names, if not perfect, are underrated.  Even if making the
>>> > change were free, and it's obviously not, I would prefer the existing
>>> > terminology.
>>>
>>> I think, overall, the naming is a significant barrier to entry for new
>>> cassandra users. This proposal will certainly be expensive, both in
>>> terms of the work (which we at twitter are willing to do) and the
>>> disruption. However, we're still early in Cassandra's life and this
>>> may be our only chance to improve this situation.
>>
>> For the terminology to be considered a barrier to entry, I think you
>> need to demonstrate obviously superior terminology.
>>
>> --
>> Eric Evans
>> eevans@rackspace.com
>>
>>
>
>
>
> --
> Evan Weaver
>



-- 
Evan Weaver

Re: Fixing the data model names

Posted by Evan Weaver <ew...@gmail.com>.

> For the terminology to be considered a barrier to entry, I think you
> need to demonstrate obviously superior terminology.

I agree with that and am happy to accept that our proposal is not good
enough. We'll work on another.

Evan

On Thu, Aug 13, 2009 at 1:52 PM, Eric Evans<ee...@rackspace.com> wrote:
> On Wed, 2009-08-12 at 15:05 -0700, Ryan King wrote:
>> I would like to reiterate, one of our main motivations behind renaming
>> the data model is to make it easier for people to get up to speed with
>> Cassandra.
>
> This has been repeated several times during this thread. I hope it's not
> meant to imply that those opposed do not care about our users, or about
> making Cassandra easier to understand.
>
>> Evan and I both had problems understanding the data model and we've
>> seen the same struggles over and over as we try and explain the data
>> model to other engineers here at twitter. So, after developing this
>> proposal for a new naming scheme, we tested it with more engineers, to
>> see if it was, in fact, easier to explain. We didn't do a rigorous
>> study, but without a doubt it was clearer and easier to understand.
>> And these are all people who've read the BigTable and Dynamo papers,
>> most of whom have CS (bachelors' or masters') degrees and are
>> generally smart.
>
> Yeah, that's anecdotal. I could counter with anecdotal evidence to the
> contrary but I don't think it would be very helpful or productive.
>
> I honestly feel like you guys are confounding the concepts, and the
> terminology used to describe them. Granted, the right choice of
> terminology could certainly make it easier to convey how things work,
> but there is a sort of minimum overhead here. In other words, you can
> call things whatever you want, it's not going to change how they
> actually work. At least some portion of the difficulty people have in
> conceptualizing Cassandra is in fact the concepts themselves.
>
> [ ... ]
>
>> > So having thought it through I think I would have to say I think the
>> > current names, if not perfect, are underrated.  Even if making the
>> > change were free, and it's obviously not, I would prefer the existing
>> > terminology.
>>
>> I think, overall, the naming is a significant barrier to entry for new
>> cassandra users. This proposal will certainly be expensive, both in
>> terms of the work (which we at twitter are willing to do) and the
>> disruption. However, we're still early in Cassandra's life and this
>> may be our only chance to improve this situation.
>
> For the terminology to be considered a barrier to entry, I think you
> need to demonstrate obviously superior terminology.
>
> --
> Eric Evans
> eevans@rackspace.com
>
>



-- 
Evan Weaver

Re: Fixing the data model names

Posted by Eric Evans <ee...@rackspace.com>.

On Wed, 2009-08-12 at 15:05 -0700, Ryan King wrote:
> I would like to reiterate, one of our main motivations behind renaming
> the data model is to make it easier for people to get up to speed with
> Cassandra.

This has been repeated several times during this thread. I hope it's not
meant to imply that those opposed do not care about our users, or about
making Cassandra easier to understand.

> Evan and I both had problems understanding the data model and we've
> seen the same struggles over and over as we try and explain the data
> model to other engineers here at twitter. So, after developing this
> proposal for a new naming scheme, we tested it with more engineers, to
> see if it was, in fact, easier to explain. We didn't do a rigorous
> study, but without a doubt it was clearer and easier to understand.
> And these are all people who've read the BigTable and Dynamo papers,
> most of whom have CS (bachelors' or masters') degrees and are
> generally smart.

Yeah, that's anecdotal. I could counter with anecdotal evidence to the
contrary but I don't think it would be very helpful or productive.

I honestly feel like you guys are confounding the concepts, and the
terminology used to describe them. Granted, the right choice of
terminology could certainly make it easier to convey how things work,
but there is a sort of minimum overhead here. In other words, you can
call things whatever you want, it's not going to change how they
actually work. At least some portion of the difficulty people have in
conceptualizing Cassandra is in fact the concepts themselves.

[ ... ]

> > So having thought it through I think I would have to say I think the
> > current names, if not perfect, are underrated.  Even if making the
> > change were free, and it's obviously not, I would prefer the existing
> > terminology.
> 
> I think, overall, the naming is a significant barrier to entry for new
> cassandra users. This proposal will certainly be expensive, both in
> terms of the work (which we at twitter are willing to do) and the
> disruption. However, we're still early in Cassandra's life and this
> may be our only chance to improve this situation.

For the terminology to be considered a barrier to entry, I think you
need to demonstrate obviously superior terminology.

-- 
Eric Evans
eevans@rackspace.com

Re: Fixing the data model names

Posted by Ryan King <ry...@twitter.com>.

I'm not going to go into my full position on this issue, because I
agree with Evan (we developed the proposal together).

I would like to reiterate, one of our main motivations behind renaming
the data model is to make it easier for people to get up to speed with
Cassandra.

Evan and I both had problems understanding the data model and we've
seen the same struggles over and over as we try and explain the data
model to other engineers here at twitter. So, after developing this
proposal for a new naming scheme, we tested it with more engineers, to
see if it was, in fact, easier to explain. We didn't do a rigorous
study, but without a doubt it was clearer and easier to understand.
And these are all people who've read the BigTable and Dynamo papers,
most of whom have CS (bachelors' or masters') degrees and are
generally smart.

I'm not saying this is a definitive study, but I think we need to try
and understand the perspective of the n00bs.

On Wed, Aug 12, 2009 at 11:52 AM, Jonathan Ellis<jb...@gmail.com> wrote:
> My brief two cents:
>
> I think terminology + api changes need to be a big improvement to be
> worth breaking things at this point, and I don't think this proposal
> meets that bar.  In fact I'm not sure any proposal could.
>
> On the specifics:
>
> * Keyspace vs Database
>
> Actually the right concept from the rdb world is "schema."  (Maybe it
> is a mysql-ism to call these "databases?")
>
> I deliberately avoided that term though, possibly mistakenly.
>
> * ColumnFamily vs Record collection
>
> -1.  CF correctly implies "group of columns" to me without being so
> generic it could apply to anything.

But a CF isn't a "group of columns", it's a group of <thing without a
name>'s, which contain columns. This naming caused me to believe that
you have something (row/record) that spans multiple column families.

> * Record vs Row
>
> I don't really care, I guess, but row never really seemed confusing to me.
>
> * Column vs Attribute
>
> Definitely -1 on this too.  Both imply "a named value" but column is
> from the database world while attribute is from OO.  The connotations
> are wrong.  Here the baggage from a relational background is mostly
> correct.  As Evan notes the difference is that ColumnFamilies are
> sparse, but that is a difference between CFs and Tables not between
> the different concepts of Columns per se.

I think my problem with using column here is that it implies that you
can do stuff with columns from multiple rows/records.

> * SuperColumn vs Attribute Collection
>
> SuperColumn is probably the worst name here, but calling it a
> ColumnCollection would not be an improvement.  (I can have a
> Collection<Column> in my own code, and do, but that is not the same
> thing at all.)
>
> So having thought it through I think I would have to say I think the
> current names, if not perfect, are underrated.  Even if making the
> change were free, and it's obviously not, I would prefer the existing
> terminology.

I think, overall, the naming is a significant barrier to entry for new
cassandra users. This proposal will certainly be expensive, both in
terms of the work (which we at twitter are willing to do) and the
disruption. However, we're still early in Cassandra's life and this
may be our only chance to improve this situation.

-ryan

Re: Fixing the data model names

Posted by Jonathan Ellis <jb...@gmail.com>.

My brief two cents:

I think terminology + api changes need to be a big improvement to be
worth breaking things at this point, and I don't think this proposal
meets that bar.  In fact I'm not sure any proposal could.

On the specifics:

* Keyspace vs Database

Actually the right concept from the rdb world is "schema."  (Maybe it
is a mysql-ism to call these "databases?")

I deliberately avoided that term though, possibly mistakenly.

* ColumnFamily vs Record collection

-1.  CF correctly implies "group of columns" to me without being so
generic it could apply to anything.

* Record vs Row

I don't really care, I guess, but row never really seemed confusing to me.

* Column vs Attribute

Definitely -1 on this too.  Both imply "a named value" but column is
from the database world while attribute is from OO.  The connotations
are wrong.  Here the baggage from a relational background is mostly
correct.  As Evan notes the difference is that ColumnFamilies are
sparse, but that is a difference between CFs and Tables not between
the different concepts of Columns per se.

* SuperColumn vs Attribute Collection

SuperColumn is probably the worst name here, but calling it a
ColumnCollection would not be an improvement.  (I can have a
Collection<Column> in my own code, and do, but that is not the same
thing at all.)

So having thought it through I think I would have to say I think the
current names, if not perfect, are underrated.  Even if making the
change were free, and it's obviously not, I would prefer the existing
terminology.

-Jonathan

Re: Fixing the data model names

Posted by Ben Standefer <be...@gmail.com>.

View from a converting user (i.e., a non-committing lurker): I have spent 2-3
hours having Cassandra's data model explained to me in-person at the
hackathon, and the newly proposed language makes a lot more sense to me
right off the bat.  I strongly agree that specifically the naming and
verbiage of the data model poses a high barrier to entry.  The newly
proposed naming scheme conveys the concepts of Cassandra much more clearly.
Converting the column family -> thing with no name -> super column -> column
hierarchy to record collection -> record -> attribute collection ->
attribute removes incorrect connotations and analogies to tables, making it
easier for n00bs to understand that Cassandra is a structured key-value
store with a data model somewhere between memcached/BerkeleyDB and a folder
structure, rather than a table-based storage engine.

I really think the costs of renaming the data model (which Evan has
volunteered to bear the brunt of) should be weighed carefully against the
benefits gained from ease of adoption and increased interest.  If every new
Cassandra user has to power through 4 hours of in-person question-asking
with Cassandra experts to get the data model down, it could easily gain a
reputation for being overly complex to understand and use, when it's really
not too bad.

-Ben Standefer


On Wed, Aug 12, 2009 at 10:58 AM, Evan Weaver <ew...@gmail.com> wrote:

> It seems so far we have Eric strongly against, and a few others as
> tentatively in favor, with caveats.
>
> Before I address the points specifically, I'd like to refer you to
> this API design manual from the Qt team:
> http://chaos.troll.no/~shausman/api-design/api-design.pdf.
> Specifically, a quote: "It is better to have a system omit certain
> anomalous features and improvements, but to reflect one set of design
> ideas, than to have one that contains many good but independent and
> uncoordinated ideas." Right now we have the second, which is
> understandable, historically.
>
> Ok, onward.
>
> Re. Bill, I said cluster contains keyspaces/tables/databases, because
> multiple keyspaces can be defined within a cluster, as per the
> storage-conf.xml. That is all. I also mean it to refer to a physical
> collection of networked machines performing the same work.
>
> Re. Mark, I think collection is a mouthful too. Sets in math are not
> ordered, though, which makes me reluctant to support the use of the
> word "set".
>
> Re. Evans, it is true that Cassandra was influenced by Dynamo and
> BigTable. However, it is not merely a merge of those two. When I was
> getting started, everyone would say "Cassandra uses the BigTable
> model," even though this was not actually the case. Super columns,
> local storage, and no column versioning are all significant and
> confusing divergences. Hypertable and HBase cargo-cult^H^H^H^Hfollow
> that model strictly, so it makes more sense for them to keep the
> terminology.
>
> Database developers have read the BigTable and Dynamo papers. Database
> users have not. They will not, unless they are confused, and if they
> are confused, it will lead them further astray, because Cassandra's
> implementation has diverged.
>
> I disagree that the change would have a huge cost. A couple blog posts
> will be out of date. The Cassandra contributors (all 10 of them) will
> have to do a straightforward mental translation of terms for a few
> days before the new ones become comfortable. In my (statistically
> unsound) polls, the users, who don't even have a full grasp on the
> *current* terminology, will rejoice.
>
> BigTable's innovation was the data model, not the API. The source of
> our API problem is that in the BigTable paper, the API is directed
> towards a specific use case: a semi-column-oriented index store.
> However the data model itself is actually general, and that's what is
> interesting to our project. Things in the BigTable API that cause us
> significant problems:
>  * String-concatenated colon API (we fixed this).
>  * "Table", which prioritizes the column-oriented use, in direct
> opposition to the current use of the terminology (we fixed this
> in some places).
>  * Being called a "column store", again prioritizing the specific use
> case, which is falsely analogous to relational column stores (this was
> never really enshrined in Cassandra).
>  * Column "families", again prioritizing the specific use case
> (because it assumes that a document is spread across multiple
> families, and that a key, in isolation, refers to a globally unique
> document). Also a phrase used nowhere else in CS.
>  * Having "columns" which are neither tabular columns, nor attributes
> stored in column-major order, but attributes stored in (surprise!)
> row-major order.
>
> Maybe "attribute" is interchangeable with "column" in the relational
> world, but it's used in the (even more widely known) object-oriented
> world too, to mean exactly what we need it to mean. In regards to
> "column family", maybe "attribute family" would be a suitable
> compromise, and be familiar to BigTable people. It's also a grouping
> of keys, and a grouping of records, so I don't know why "column
> family" makes more sense than "key family" or "record family", except
> for historical reasons. If we went with "attribute family", then we
> would have Cluster, Database, Attribute family, Record, Attribute, and
> Attribute collection. What's the difference between "Attribute family"
> and "Attribute collection"? We'd have to revert to the meaningless
> "super" to avoid a conflict, and it breaks the downward hierarchy of
> terms.
>
> For the things which do not have official names, "row", "record",
> etc., I don't think saying "you can call it what you want" is
> workable. I run across this currently at my job, trying to talk about
> things to other people. We settled on "row" but feel weird about it,
> because you can never quite be sure if someone else means the same
> thing you do. So it always requires an explanation.
>
> In regards to the better examples, I did the best possible job I could
> at
> http://blog.evanweaver.com/articles/2009/07/06/up-and-running-with-cassandra/
> to give multiple clear examples. The post is very long, in large
> part because the terminology is so foreign. Specifically for column
> and super column, I have to quite literally say "column: this is a
> tuple" and "super column: this is a named list". We should call them
> something that actually means that. More examples would help, but we
> need something else, because good examples are already there. This
> post got lots of readership, and is linked on the wiki, yet we still
> have confused users.
>
> I've been happily attacking other messed-up things in the Thrift API,
> with Jonathan's help. Those generate less controversy, so he's already
> committed tons of improvements. No more colon-concatenated API, no
> more modeling choices enshrined in the RPC names, abstracting and
> normalizing column references, etc. There's no reason this won't
> continue.
>
> Evan
>
> PS. It doesn't matter which version it goes into as long as it's before
> 1.0.
>
> On Wed, Aug 12, 2009 at 7:27 AM, Eric Evans<ee...@rackspace.com> wrote:
> > On Tue, 2009-08-11 at 22:34 -0700, Arin Sarkissian wrote:
> >> But realistically how much of this confusion could be avoided with a
> >> legit example? Once you see a good example you start getting it. A lot
> >> of people have been pointed towards the ThriftInterface page on the
> >> wiki which clears up next to nothing:
> >> http://wiki.apache.org/cassandra/ThriftInterface . There's stuff like
> >> "edges", "base_attributes" etc. It's next door to nonsensical..
> >>
> >> What if we had a real example that people could relate to... a model of a
> >> blog or something along those lines & update the
> >> http://wiki.apache.org/cassandra/ThriftInterface page to show how each
> >> of the API methods would be used to accomplish basic tasks... ex: get
> >> all comments for a blog entry, list entries in time order, list
> >> entries tagged "bar", find all entries with "foo" in the body (kinda
> >> like the Facebook mail search example).
> >
> > Full ACK.
> >
> > Renaming everything carries a huge cost for (IMO) dubious benefit.
> > However, the cost-to-benefit ratio for better documentation and samples
> > seems excellent.
> >
> > --
> > Eric Evans
> > eevans@rackspace.com
> >
> >
>
>
>
> --
> Evan Weaver
>

Re: Fixing the data model names

Posted by Evan Weaver <ew...@gmail.com>.

It seems so far we have Eric strongly against, and a few others as
tentatively in favor, with caveats.

Before I address the points specifically, I'd like to refer you to
this API design manual from the Qt team:
http://chaos.troll.no/~shausman/api-design/api-design.pdf.
Specifically, a quote: "It is better to have a system omit certain
anomalous features and improvements, but to reflect one set of design
ideas, than to have one that contains many good but independent and
uncoordinated ideas." Right now we have the second, which is
understandable, historically.

Ok, onward.

Re. Bill, I said cluster contains keyspaces/tables/databases, because
multiple keyspaces can be defined within a cluster, as per the
storage-conf.xml. That is all. I also mean it to refer to a physical
collection of networked machines performing the same work.

Re. Mark, I think collection is a mouthful too. Sets in math are not
ordered, though, which makes me reluctant to support the use of the
word "set".

Re. Evans, it is true that Cassandra was influenced by Dynamo and
BigTable. However, it is not merely a merge of those two. When I was
getting started, everyone would say "Cassandra uses the BigTable
model," even though this was not actually the case. Super columns,
local storage, and no column versioning are all significant and
confusing divergences. Hypertable and HBase cargo-cult^H^H^H^Hfollow
that model strictly, so it makes more sense for them to keep the
terminology.

Database developers have read the BigTable and Dynamo papers. Database
users have not. They will not, unless they are confused, and if they
are confused, it will lead them further astray, because Cassandra's
implementation has diverged.

I disagree that the change would have a huge cost. A couple blog posts
will be out of date. The Cassandra contributors (all 10 of them) will
have to do a straightforward mental translation of terms for a few
days before the new ones become comfortable. In my (statistically
unsound) polls, the users, who don't even have a full grasp on the
*current* terminology, will rejoice.

BigTable's innovation was the data model, not the API. The source of
our API problem is that in the BigTable paper, the API is directed
towards a specific use case: a semi-column-oriented index store.
However the data model itself is actually general, and that's what is
interesting to our project. Things in the BigTable API that cause us
significant problems:
  * String-concatenated colon API (we fixed this).
  * "Table", which prioritizes the column-oriented use, in direct
opposition to the current use of the terminology (we fixed this
in some places).
  * Being called a "column store", again prioritizing the specific use
case, which is falsely analogous to relational column stores (this was
never really enshrined in Cassandra).
  * Column "families", again prioritizing the specific use case
(because it assumes that a document is spread across multiple
families, and that a key, in isolation, refers to a globally unique
document). Also a phrase used nowhere else in CS.
  * Having "columns" which are neither tabular columns, nor attributes
stored in column-major order, but attributes stored in (surprise!)
row-major order.
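
To make the last bullet concrete, here is a minimal sketch in Python of
the two layouts being contrasted. The records, attribute names, and both
helper functions are made up for illustration and are not Cassandra
code; the only point is the order in which the stored triples come out.

```python
# Hypothetical records: each key maps to a sparse set of named attributes.
records = {
    "alice": {"email": "alice@example.com", "city": "Oakland"},
    "bob": {"email": "bob@example.com"},
}


def row_major(records):
    """Order values record by record: all attributes of one key sit together."""
    return [
        (key, name, value)
        for key in sorted(records)
        for name, value in sorted(records[key].items())
    ]


def column_major(records):
    """Order values attribute by attribute: one attribute across all keys."""
    names = sorted({name for attrs in records.values() for name in attrs})
    return [
        (name, key, records[key][name])
        for name in names
        for key in sorted(records)
        if name in records[key]
    ]
```

In the row-major ordering everything stored under "alice" is adjacent,
which is the behaviour described above; a relational column store would
resemble the second ordering instead.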

Maybe "attribute" is interchangeable with "column" in the relational
world, but it's used in the (even more widely known) object-oriented
world too, to mean exactly what we need it to mean. In regards to
"column family", maybe "attribute family" would be a suitable
compromise, and be familiar to BigTable people. It's also a grouping
of keys, and a grouping of records, so I don't know why "column
family" makes more sense than "key family" or "record family", except
for historical reasons. If we went with "attribute family", then we
would have Cluster, Database, Attribute family, Record, Attribute, and
Attribute collection. What's the difference between "Attribute family"
and "Attribute collection"? We'd have to revert to the meaningless
"super" to avoid a conflict, and it breaks the downward hierarchy of
terms.

For the things which do not have official names, "row", "record",
etc., I don't think saying "you can call it what you want" is
workable. I run across this currently at my job, trying to talk about
things to other people. We settled on "row" but feel weird about it,
because you can never quite be sure if someone else means the same
thing you do. So it always requires an explanation.

In regards to the better examples, I did the best possible job I could
at http://blog.evanweaver.com/articles/2009/07/06/up-and-running-with-cassandra/
to give multiple clear examples. The post is very long, in large
part because the terminology is so foreign. Specifically for column
and super column, I have to quite literally say "column: this is a
tuple" and "super column: this is a named list". We should call them
something that actually means that. More examples would help, but we
need something else, because good examples are already there. This
post got lots of readership, and is linked on the wiki, yet we still
have confused users.
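
As a rough illustration of those two descriptions, the sketch below
spells out a column as a tuple and a super column as a named list, in
Python. The class names, the timestamp field, and the find helper are
assumptions made for this example; they are not the Thrift API.

```python
from typing import List, NamedTuple, Optional


class Column(NamedTuple):
    """A column is just a tuple: a name, a value, and a write timestamp."""
    name: str
    value: str
    timestamp: int  # assumed field, included for illustration


class SuperColumn(NamedTuple):
    """A super column is a named list of columns."""
    name: str
    columns: List[Column]


def find(super_column: SuperColumn, column_name: str) -> Optional[str]:
    """Return the value of the named column, or None if it is absent."""
    for column in super_column.columns:
        if column.name == column_name:
            return column.value
    return None


# Hypothetical data: an "address" super column holding two columns.
address = SuperColumn(
    "address",
    [Column("city", "Oakland", 1), Column("zip", "94612", 1)],
)
```

Calling find(address, "city") walks the list and returns the value of
the matching tuple, which is all the two terms amount to structurally.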

I've been happily attacking other messed-up things in the Thrift API,
with Jonathan's help. Those generate less controversy, so he's already
committed tons of improvements. No more colon-concatenated API, no
more modeling choices enshrined in the RPC names, abstracting and
normalizing column references, etc. There's no reason this won't
continue.

Evan

PS. It doesn't matter which version it goes into as long as it's before 1.0.

On Wed, Aug 12, 2009 at 7:27 AM, Eric Evans<ee...@rackspace.com> wrote:
> On Tue, 2009-08-11 at 22:34 -0700, Arin Sarkissian wrote:
>> But realistically how much of this confusion could be avoided with a
>> legit example? Once you see a good example you start getting it. A lot
>> of people have been pointed towards the ThriftInterface page on the
>> wiki which clears up next to nothing:
>> http://wiki.apache.org/cassandra/ThriftInterface . There's stuff like
>> "edges", "base_attributes" etc. It's next door to nonsensical..
>>
>> What if we had a real example that people could relate to... a model of a
>> blog or something along those lines & update the
>> http://wiki.apache.org/cassandra/ThriftInterface page to show how each
>> of the API methods would be used to accomplish basic tasks... ex: get
>> all comments for a blog entry, list entries in time order, list
>> entries tagged "bar", find all entries with "foo" in the body (kinda
>> like the Facebook mail search example).
>
> Full ACK.
>
> Renaming everything carries a huge cost for (IMO) dubious benefit.
> However, the cost-to-benefit ratio for better documentation and samples
> seems excellent.
>
> --
> Eric Evans
> eevans@rackspace.com
>
>

-- 
Evan Weaver

Re: Fixing the data model names

Posted by Eric Evans <ee...@rackspace.com>.

On Tue, 2009-08-11 at 22:34 -0700, Arin Sarkissian wrote:
> But realistically how much of this confusion could be avoided with a
> legit example? Once you see a good example you start getting it. A lot
> of people have been pointed towards the ThriftInterface page on the
> wiki which clears up next to nothing:
> http://wiki.apache.org/cassandra/ThriftInterface . There's stuff like
> "edges", "base_attributes" etc. It's next door to nonsensical..
> 
> What if we had a real example that people could relate to... a model of a
> blog or something along those lines & update the
> http://wiki.apache.org/cassandra/ThriftInterface page to show how each
> of the API methods would be used to accomplish basic tasks... ex: get
> all comments for a blog entry, list entries in time order, list
> entries tagged "bar", find all entries with "foo" in the body (kinda
> like the Facebook mail search example).

Full ACK.

Renaming everything carries a huge cost for (IMO) dubious benefit.
However, the cost-to-benefit ratio for better documentation and samples
seems excellent.
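
For illustration, here is a rough sketch of the kind of blog example
Arin describes, using plain Python dictionaries rather than the Thrift
API. Every column family name, key, and helper below is hypothetical,
and the sketch assumes column names sort as strings. It only shows how
the data could be shaped so that the listed tasks become simple lookups.

```python
# Hypothetical keyspace, modelled as nested dicts:
# column family -> key -> column name -> value.
blog = {
    "Entries": {
        "first-post": {"title": "Hello", "body": "foo and more"},
        "second-post": {"title": "Again", "body": "nothing here"},
    },
    # Super column family: key -> super column name -> columns.
    # Comment timestamps are used as super column names so they sort by time.
    "Comments": {
        "first-post": {
            "2009-08-02T10:00": {"author": "arin", "text": "Nice"},
            "2009-08-03T09:30": {"author": "curt", "text": "Thanks"},
        },
    },
    # One index row per tag; column name = posting date, value = entry key.
    # The "__all__" row is an assumed convention for the untagged listing.
    "TaggedEntries": {
        "__all__": {"2009-08-01": "first-post", "2009-08-05": "second-post"},
        "bar": {"2009-08-05": "second-post"},
    },
}


def comments_for(entry_key):
    """Get all comments for a blog entry, oldest first."""
    row = blog["Comments"].get(entry_key, {})
    return [row[name] for name in sorted(row)]


def entries_in_time_order(tag="__all__"):
    """List entry keys in time order, optionally restricted to one tag."""
    row = blog["TaggedEntries"].get(tag, {})
    return [row[name] for name in sorted(row)]
```

Finding entries with "foo" in the body is left out on purpose: in this
shape it would need a separate index row per word, maintained by the
application.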

-- 
Eric Evans
eevans@rackspace.com

Re: Fixing the data model names

Posted by Arin Sarkissian <ar...@rspot.net>.

well, if we do a delicious clone at least we won't have to worry about
making it look good ;-)

Hit me up on IRC. We can coordinate

Arin

On Tue, Aug 11, 2009 at 11:06 PM, Curt Micol<as...@gmail.com> wrote:
> On Wed, Aug 12, 2009 at 1:40 AM, Arin Sarkissian<ar...@rspot.net> wrote:
>> Mark I can work on that with you.
>> We should do this regardless of naming changes etc.
>> I'll even volunteer to do a PHP app based on the data model we mock up.
>>
>> if you wanna coordinate some work on this you can reach me at:
>> email: arin@rspot.net (or arin@digg.com)
>> IM/Twitter/IRC/just_about_everything_online: phatduckk
>
> Arin,
>
> I saw your post in IRC, and was going to mention I am finishing up a
> client's project soon and was going to begin work on a Delicious clone
> in Cassandra.  My goal was to scrape Delicious for data and place it
> into Cassandra.  Additionally, I wanted to make that data available, as
> another issue I think newcomers run into is that they don't have data
> to play with.  I intend to do this in Python with ieure's lazyboy
> code.
>
> I think doing something other than a blog will give another dimension
> to newcomers understanding the data model of Cassandra.  I hope to
> contribute some tutorials around this also.  Hopefully by the end of
> August I will have at least something thrown together to contribute.
>
> Either way, I am willing to assist with documentation (granted up to
> my level of understanding at this point :)).
>
> I can be reached at asenchi@asenchi.com if anyone wants to throw stuff
> my way.  I am a n00b in this area of tech, which may be beneficial for
> documentation.
>
> Thanks,
>
> --
> # Curt Micol
>

Re: Fixing the data model names

Posted by Curt Micol <as...@gmail.com>.

On Wed, Aug 12, 2009 at 1:40 AM, Arin Sarkissian<ar...@rspot.net> wrote:
> Mark I can work on that with you.
> We should do this regardless of naming changes etc.
> I'll even volunteer to do a PHP app based on the data model we mock up.
>
> if you wanna coordinate some work on this you can reach me at:
> email: arin@rspot.net (or arin@digg.com)
> IM/Twitter/IRC/just_about_everything_online: phatduckk

Arin,

I saw your post in IRC, and was going to mention I am finishing up a
client's project soon and was going to begin work on a Delicious clone
in Cassandra.  My goal was to scrape Delicious for data and place it
into Cassandra.  Additionally, I wanted to make that data available, as
another issue I think newcomers run into is that they don't have data
to play with.  I intend to do this in Python with ieure's lazyboy
code.

I think doing something other than a blog will give another dimension
to newcomers understanding the data model of Cassandra.  I hope to
contribute some tutorials around this also.  Hopefully by the end of
August I will have at least something thrown together to contribute.

Either way, I am willing to assist with documentation (granted up to
my level of understanding at this point :)).

I can be reached at asenchi@asenchi.com if anyone wants to throw stuff
my way.  I am a n00b in this area of tech, which may be beneficial for
documentation.

Thanks,

-- 
# Curt Micol

Re: Fixing the data model names

Posted by Arin Sarkissian <ar...@rspot.net>.

Mark I can work on that with you.
We should do this regardless of naming changes etc.
I'll even volunteer to do a PHP app based on the data model we mock up.

if you wanna coordinate some work on this you can reach me at:
email: arin@rspot.net (or arin@digg.com)
IM/Twitter/IRC/just_about_everything_online: phatduckk

- Arin


On Tue, Aug 11, 2009 at 10:36 PM, Mark McBride<ma...@gmail.com> wrote:
> It seems to me that what would be most helpful, regardless of changes,
> is having a document that describes the data model in more detail than
> the current data model wiki page.  I can take a stab at creating a new
> page that includes examples if that would be useful.
>
> On Tue, Aug 11, 2009 at 10:34 PM, Arin Sarkissian<ar...@rspot.net> wrote:
>> I agree that the names are pretty horrible for a newbie...
>>
>> I'll echo the concerns that the RDBMS vernacular messes with a
>> newcomer's head. I feel like the words "Row" and "Column" are way too
>> loaded since most people have an RDBMS background... BUT
>>
>> In the BigTable paper we've got the term "Column Family". This term is
>> also used in HBase and Hypertable. Since the term's out there in the
>> wild I wouldn't feel comfortable ditching it and making something up
>> to fill its spot. That would lead to a scenario where folks with
>> experience with Hbase, Hypertable and Bigtable get confused (or think
>> the naming is dumb) but would lessen the confusion for RDBMS peeps.
>> Doesn't sound like the right tradeoff: 4 sets of folks have something
>> new to digest instead of 1.
>>
>> The "bad" terms are "column" and "row". That's where the real issues
>> arise... but given the fact that I believe we should keep "column
>> family" I have no idea what we'd call the things inside the CF? It
>> would be odd as hell to have a CF contain "records" etc. Does that
>> mean we should keep it called "column"? IMO w/o an awesome
>> alternative, yes.
>>
>> The word "row" should go away tho...
>> When I first started using cassandra I thought that: a key pointed to
>> a row and that row had one of each column family. This isn't the case
>> but the RDBMS terms + SQL-ish thinking caused me and many others to
>> assume as much. Took us a while to figure that out...
>>
>> But realistically how much of this confusion could be avoided with a
>> legit example? Once you see a good example you start getting it. A lot
>> of people have been pointed towards the ThriftInterface page on the
>> wiki which clears up next to nothing:
>> http://wiki.apache.org/cassandra/ThriftInterface . There's stuff like
>> "edges", "base_attributes" etc. It's next door to nonsensical..
>>
>> What if we had a real example that people could relate to... a model of a
>> blog or something along those lines & update the
>> http://wiki.apache.org/cassandra/ThriftInterface page to show how each
>> of the API methods would be used to accomplish basic tasks... ex: get
>> all comments for a blog entry, list entries in time order, list
>> entries tagged "bar", find all entries with "foo" in the body (kinda
>> like the Facebook mail search example).
>>
>> -Arin
>>
>>
>>
>> On Tue, Aug 11, 2009 at 10:09 PM, Curt Micol<as...@gmail.com> wrote:
>>> Hello,
>>>
>>> I am hardly a developer, so this isn't directly addressed to me, but
>>> if I may comment on a couple of things from an outsider's
>>> (non-developer, new to this scale of database) perspective.
>>>
>>> On Wed, Aug 12, 2009 at 12:38 AM, Eric Evans<ee...@rackspace.com> wrote:
>>>> On Tue, 2009-08-11 at 10:37 -0700, Evan Weaver wrote:
>>>>> In my experience, the naming of the data model has been a huge barrier
>>>>> to entry for users of Cassandra. This goes both for people familiar
>>>>> with SQL, and for people familiar with BigTable. I would like to
>>>>> change this before 0.4, since the 0.3 to 0.4 transition is the Great
>>>>> API Breakening.
>>>
>>> I agree that there is a barrier, specifically because most people have
>>> no experience with this type of data structure and as you mention are
>>> coming from SQL.  Clearer names along with more documentation/examples
>>> will help grow the user base of Cassandra quite a bit.
>>>
>>>>> So technically this is not a bikeshed, because I'm happy to do all the
>>>>> work. I'll even submit a patch for Digg's Python client. Since there
>>>>> are no production deployments of ASF, and only a couple
>>>>> well-maintained clients, now is the time to break the world. A few
>>>>> hours of work now will pay off richly in terms of community
>>>>> involvement and reduced noob-explanation-time.
>>>
>>> I would offer my services here also if a change were accepted.
>>>
>>> And while I don't know what the exact names should be (nor am I
>>> qualified tbh), I think they should be clearer than they are. At this
>>> point they seem to be a mixture of RDBMS and Document DB terms.  The
>>> change to 'keyspace' from 'table' I think was a first step in this
>>> process, but it should be taken further and all names normalized
>>> across the board to properly represent their relationship with each
>>> other. At least that's my very humble opinion.
>>>
>>> In response to Mr. Evans's comment regarding the Bigtable paper, does
>>> the Cassandra community want this to be a requirement for using the
>>> software? I would think not.  Sure, most early adopters are coming
>>> from that paper, but it shouldn't be a source of entry to use the
>>> database, but rather to develop it.
>>>
>>> Again, my opinion carries little weight, but +1 from this user.
>>>
>>> Thanks for everyone's hard work, I am really excited to see how this
>>> project continues to progress.
>>>
>>> --
>>> # Curt Micol
>>>
>>
>

Re: Fixing the data model names

Posted by Mark McBride <ma...@gmail.com>.

It seems to me that what would be most helpful, regardless of changes,
is having a document that describes the data model in more detail than
the current data model wiki page.  I can take a stab at creating a new
page that includes examples if that would be useful.

On Tue, Aug 11, 2009 at 10:34 PM, Arin Sarkissian<ar...@rspot.net> wrote:
> I agree that the names are pretty horrible for a newbie...
>
> I'll echo the concerns that the RDBMS vernacular messes with a
> newcomer's head. I feel like the words "Row" and "Column" are way too
> loaded since most people have an RDBMS background... BUT
>
> In the BigTable paper we've got the term "Column Family". This term is
> also used in HBase and Hypertable. Since the term's out there in the
> wild I wouldn't feel comfortable ditching it and making something up
> to fill its spot. That would lead to a scenario where folks with
> experience with Hbase, Hypertable and Bigtable get confused (or think
> the naming is dumb) but would lessen the confusion for RDBMS peeps.
> Doesn't sound like the right tradeoff: 4 sets of folks have something
> new to digest instead of 1.
>
> The "bad" terms are "column" and "row". That's where the real issues
> arise... but given the fact that I believe we should keep "column
> family" I have no idea what we'd call the things inside the CF. It
> would be odd as hell to have a CF contain "records" etc. Does that
> mean we should keep it called "column"? IMO w/o an awesome
> alternative, yes.
>
> The word "row" should go away tho...
> When I first started using cassandra I thought that: a key pointed to
> a row and that row had one of each column family. This isn't the case
> but the RDBMS terms + SQL-ish thinking caused me and many others to
> assume as much. Took us a while to figure that out...
>
> But realistically how much of this confusion could be avoided with a
> legit example? Once you see a good example you start getting it. A lot
> of people have been pointed towards the ThriftInterface page on the
> wiki which clears up next to nothing:
> http://wiki.apache.org/cassandra/ThriftInterface . There's stuff like
> "edges", "base_attributes" etc. It's next door to nonsensical.
>
> What if we had a real example that people could relate to... model a
> blog or something along those lines & update the
> http://wiki.apache.org/cassandra/ThriftInterface page to show how each
> of the API methods would be used to accomplish basic tasks... ex: get
> all comments for a blog entry, list entries in time order, list
> entries tagged "bar", find all entries with "foo" in the body (kinda
> like the Facebook mail search example).
>
> -Arin
>
>
>
> On Tue, Aug 11, 2009 at 10:09 PM, Curt Micol<as...@gmail.com> wrote:
>> Hello,
>>
>> I am hardly a developer, so this isn't directly addressed to me, but
>> if I may comment on a couple of things from an outsider's
>> (non-developer, new to this scale of database) perspective.
>>
>> On Wed, Aug 12, 2009 at 12:38 AM, Eric Evans<ee...@rackspace.com> wrote:
>>> On Tue, 2009-08-11 at 10:37 -0700, Evan Weaver wrote:
>>>> In my experience, the naming of the data model has been a huge barrier
>>>> to entry for users of Cassandra. This goes both for people familiar
>>>> with SQL, and for people familiar with BigTable. I would like to
>>>> change this before 0.4, since the 0.3 to 0.4 transition is the Great
>>>> API Breakening.
>>
>> I agree that there is a barrier, specifically because most people have
>> no experience with this type of data structure and as you mention are
>> coming from SQL.  Clearer names along with more documentation/examples
>> will help grow the user base of Cassandra quite a bit.
>>
>>>> So technically this is not a bikeshed, because I'm happy to do all the
>>>> work. I'll even submit a patch for Digg's Python client. Since there
>>>> are no production deployments of ASF, and only a couple
>>>> well-maintained clients, now is the time to break the world. A few
>>>> hours of work now will pay off richly in terms of community
>>>> involvement and reduced noob-explanation-time.
>>
>> I would offer my services here also if a change were accepted.
>>
>> And while I don't know what the exact names should be (nor am I
>> qualified tbh), I think they should be clearer than they are. At this
>> point they seem to be a mixture of RDBMS and Document DB terms.  The
>> change to 'keyspace' from 'table' I think was a first step in this
>> process, but it should be taken further and all names normalized
>> across the board to properly represent their relationship with each
>> other. At least that's my very humble opinion.
>>
>> In response to Mr. Evans's comment regarding the Bigtable paper, does
>> the Cassandra community want this to be a requirement for using the
>> software? I would think not.  Sure, most early adopters are coming
>> from that paper, but it shouldn't be a source of entry to use the
>> database, but rather to develop it.
>>
>> Again, my opinion carries little weight, but +1 from this user.
>>
>> Thanks for everyone's hard work, I am really excited to see how this
>> project continues to progress.
>>
>> --
>> # Curt Micol
>>
>

Re: Fixing the data model names

Posted by Arin Sarkissian <ar...@rspot.net>.

I agree that the names are pretty horrible for a newbie...

I'll echo the concerns that the RDBMS vernacular messes with a
newcomer's head. I feel like the words "Row" and "Column" are way too
loaded since most people have an RDBMS background... BUT

In the BigTable paper we've got the term "Column Family". This term is
also used in HBase and Hypertable. Since the term's out there in the
wild I wouldn't feel comfortable ditching it and making something up
to fill its spot. That would lead to a scenario where folks with
experience with Hbase, Hypertable and Bigtable get confused (or think
the naming is dumb) but would lessen the confusion for RDBMS peeps.
Doesn't sound like the right tradeoff: 4 sets of folks have something
new to digest instead of 1.

The "bad" terms are "column" and "row". That's where the real issues
arise... but given the fact that I believe we should keep "column
family" I have no idea what we'd call the things inside the CF. It
would be odd as hell to have a CF contain "records" etc. Does that
mean we should keep it called "column"? IMO w/o an awesome
alternative, yes.

The word "row" should go away tho...
When I first started using cassandra I thought that: a key pointed to
a row and that row had one of each column family. This isn't the case
but the RDBMS terms + SQL-ish thinking caused me and many others to
assume as much. Took us a while to figure that out...

But realistically how much of this confusion could be avoided with a
legit example? Once you see a good example you start getting it. A lot
of people have been pointed towards the ThriftInterface page on the
wiki which clears up next to nothing:
http://wiki.apache.org/cassandra/ThriftInterface . There's stuff like
"edges", "base_attributes" etc. It's next door to nonsensical.

What if we had a real example that people could relate to... model a
blog or something along those lines & update the
http://wiki.apache.org/cassandra/ThriftInterface page to show how each
of the API methods would be used to accomplish basic tasks... ex: get
all comments for a blog entry, list entries in time order, list
entries tagged "bar", find all entries with "foo" in the body (kinda
like the Facebook mail search example).
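
A rough sketch of what such a blog model might look like, using plain
Python dicts rather than the actual Thrift API. Everything named here
(the ENTRIES and COMMENTS column families, the keys, the helper
functions) is hypothetical and only meant to show the nesting of
key -> column -> (value, timestamp), plus the extra level a super
column adds:

```python
# Hypothetical blog model; plain dicts stand in for the store.
# Column family layout: key -> column name -> (value, timestamp)
ENTRIES = {
    "entry-1": {"title": ("First post", 1), "body": ("hello foo", 1),
                "tags": ("bar,intro", 1), "posted": ("2009-08-10", 1)},
    "entry-2": {"title": ("Second post", 2), "body": ("nothing here", 2),
                "tags": ("misc", 2), "posted": ("2009-08-11", 2)},
}

# A super column family adds one level:
# key -> super column name -> column name -> (value, timestamp)
COMMENTS = {
    "entry-1": {
        "comment-001": {"author": ("arin", 3), "text": ("nice", 3)},
        "comment-002": {"author": ("curt", 4), "text": ("+1", 4)},
    },
}


def value(columns, name):
    """Return the value part of a (value, timestamp) column."""
    return columns[name][0]


def comments_for(entry_key):
    """All comment texts for one blog entry, ordered by super column name."""
    supers = COMMENTS.get(entry_key, {})
    return [value(supers[name], "text") for name in sorted(supers)]


def entries_in_time_order():
    """Entry keys ordered by their 'posted' column."""
    return sorted(ENTRIES, key=lambda k: value(ENTRIES[k], "posted"))


def entries_tagged(tag):
    """Entry keys whose comma-separated 'tags' column contains the tag."""
    return sorted(k for k, cols in ENTRIES.items()
                  if tag in value(cols, "tags").split(","))


def entries_containing(word):
    """Entry keys whose 'body' column contains the word (naive scan)."""
    return sorted(k for k, cols in ENTRIES.items()
                  if word in value(cols, "body"))
```

The scans in the last two helpers are only there to keep the sketch
short; a real model would more likely keep a separate column family per
lookup (by tag, by word) instead of walking every entry.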

-Arin

On Tue, Aug 11, 2009 at 10:09 PM, Curt Micol<as...@gmail.com> wrote:
> Hello,
>
> I am hardly a developer, so this isn't directly addressed to me, but
> if I may comment on a couple of things from an outsider's
> (non-developer, new to this scale of database) perspective.
>
> On Wed, Aug 12, 2009 at 12:38 AM, Eric Evans<ee...@rackspace.com> wrote:
>> On Tue, 2009-08-11 at 10:37 -0700, Evan Weaver wrote:
>>> In my experience, the naming of the data model has been a huge barrier
>>> to entry for users of Cassandra. This goes both for people familiar
>>> with SQL, and for people familiar with BigTable. I would like to
>>> change this before 0.4, since the 0.3 to 0.4 transition is the Great
>>> API Breakening.
>
> I agree that there is a barrier, specifically because most people have
> no experience with this type of data structure and as you mention are
> coming from SQL.  Clearer names along with more documentation/examples
> will help grow the user base of Cassandra quite a bit.
>
>>> So technically this is not a bikeshed, because I'm happy to do all the
>>> work. I'll even submit a patch for Digg's Python client. Since there
>>> are no production deployments of ASF, and only a couple
>>> well-maintained clients, now is the time to break the world. A few
>>> hours of work now will pay off richly in terms of community
>>> involvement and reduced noob-explanation-time.
>
> I would offer my services here also if a change were accepted.
>
> And while I don't know what the exact names should be (nor am I
> qualified tbh), I think they should be clearer than they are. At this
> point they seem to be a mixture of RDBMS and Document DB terms.  The
> change to 'keyspace' from 'table' I think was a first step in this
> process, but it should be taken further and all names normalized
> across the board to properly represent their relationship with each
> other. At least that's my very humble opinion.
>
> In response to Mr. Evans's comment regarding the Bigtable paper, does
> the Cassandra community want this to be a requirement for using the
> software? I would think not.  Sure, most early adopters are coming
> from that paper, but it shouldn't be a source of entry to use the
> database, but rather to develop it.
>
> Again, my opinion carries little weight, but +1 from this user.
>
> Thanks for everyone's hard work, I am really excited to see how this
> project continues to progress.
>
> --
> # Curt Micol
>

Re: Fixing the data model names

Posted by Curt Micol <as...@gmail.com>.

Hello,

I am hardly a developer, so this isn't directly addressed to me, but
if I may comment on a couple of things from an outsider's
(non-developer, new to this scale of database) perspective.

On Wed, Aug 12, 2009 at 12:38 AM, Eric Evans<ee...@rackspace.com> wrote:
> On Tue, 2009-08-11 at 10:37 -0700, Evan Weaver wrote:
>> In my experience, the naming of the data model has been a huge barrier
>> to entry for users of Cassandra. This goes both for people familiar
>> with SQL, and for people familiar with BigTable. I would like to
>> change this before 0.4, since the 0.3 to 0.4 transition is the Great
>> API Breakening.

I agree that there is a barrier, specifically because most people have
no experience with this type of data structure and as you mention are
coming from SQL.  Clearer names along with more documentation/examples
will help grow the user base of Cassandra quite a bit.

>> So technically this is not a bikeshed, because I'm happy to do all the
>> work. I'll even submit a patch for Digg's Python client. Since there
>> are no production deployments of ASF, and only a couple
>> well-maintained clients, now is the time to break the world. A few
>> hours of work now will pay off richly in terms of community
>> involvement and reduced noob-explanation-time.

I would offer my services here also if a change were accepted.

And while I don't know what the exact names should be (nor am I
qualified tbh), I think they should be clearer than they are. At this
point they seem to be a mixture of RDBMS and Document DB terms.  The
change to 'keyspace' from 'table' I think was a first step in this
process, but it should be taken further and all names normalized
across the board to properly represent their relationship with each
other. At least that's my very humble opinion.

In response to Mr. Evans's comment regarding the Bigtable paper, does
the Cassandra community want this to be a requirement for using the
software? I would think not.  Sure, most early adopters are coming
from that paper, but it shouldn't be a source of entry to use the
database, but rather to develop it.

Again, my opinion carries little weight, but +1 from this user.

Thanks for everyone's hard work, I am really excited to see how this
project continues to progress.

-- 
# Curt Micol

Re: Fixing the data model names

Posted by Eric Evans <ee...@rackspace.com>.

On Tue, 2009-08-11 at 10:37 -0700, Evan Weaver wrote:
> In my experience, the naming of the data model has been a huge barrier
> to entry for users of Cassandra. This goes both for people familiar
> with SQL, and for people familiar with BigTable. I would like to
> change this before 0.4, since the 0.3 to 0.4 transition is the Great
> API Breakening.
> 
> I (that is, all of us at Twitter) are willing to write all the patches
> and update the wiki, if I get the necessary community buy-in. I hoped
> that I could do one patch per each external interface change, and then
> after those are complete, a patch for each internal interface change
> as a phase 2.
> 
> So technically this is not a bikeshed, because I'm happy to do all the
> work. I'll even submit a patch for Digg's Python client. Since there
> are no production deployments of ASF, and only a couple
> well-maintained clients, now is the time to break the world. A few
> hours of work now will pay off richly in terms of community
> involvement and reduced noob-explanation-time.
> 
> In general, I think the data model names should have the following goals:
> 
>  * Use existing, widely understood terms.
>  * Do not use terms that have conflicting meanings.
>  * Express analogies in the data model, where useful.
>  * Be unambiguous.
> 
> Are these goals valid? Clearly I think they are, because I wrote you a
> very long email about it. Also, I don't think the current names meet
> these goals. Currently, we have:
> 
>   Cluster, contains keyspaces:
> 
>   This is fine.
> 
>   Keyspace: contains column families.
> 
> There was some discussion of this change on the list a while back.
> Keyspace beats Table by a mile, due to the "conflicting existing
> usage" rule, but I think we can do better.
> 
>   Column family: containing a name, keys, column type, column sort,
> and sub column sort.
> 
>   This name is from BigTable, and not in wide usage. It does not
> express the hierarchy of storage, rather referring to a side effect of
> the storage hierarchy by talking about the most granular data objects.
> Confusing.

I disagree. 

We have a lot of people coming to us that have read the BigTable paper
(most? all?), and who are already familiar with the term "Column
family". If we change this, people will forever be mapping it from what
we call it, to "Column family", and that is not good.

To put it another way, a widely recognized publication has already
established the terminology for this.

It's also descriptive since the thing we call "Column family" is in fact
a grouping, or family, of columns.
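
To make the grouping concrete, here is a rough sketch in Python. The
Column type, the "users" column family and the put/row helpers are made
up for illustration and are not Cassandra's API. A column carries a
name, a value and a timestamp; a column family groups columns under
keys; and "row" is just the informal word for the columns stored under
one key:

```python
from typing import Dict, List, NamedTuple


class Column(NamedTuple):
    name: str
    value: str
    timestamp: int


# Column family: key -> {column name -> Column}
ColumnFamily = Dict[str, Dict[str, Column]]


def put(cf: ColumnFamily, key: str, column: Column) -> None:
    """Store a column under a key, replacing any column of the same name."""
    cf.setdefault(key, {})[column.name] = column


def row(cf: ColumnFamily, key: str) -> List[Column]:
    """Informal 'row': the columns under one key in one column family."""
    return sorted(cf.get(key, {}).values())


users: ColumnFamily = {}
put(users, "evan", Column("email", "evan@example.com", 1))
put(users, "evan", Column("city", "SF", 1))
put(users, "eric", Column("city", "San Antonio", 2))
```

Note that the two keys do not hold the same set of columns; nothing in
the column family enforces one.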

>   Key: associated with columns.
> 
>   Since there's no word for the entire
> key-and-columns-in-a-column-family thing ("row"), it's hard to talk
> about this level of the data model clearly.

Actually, I think "row" works just fine, and without being enshrined in
the interface.

>   Column: containing a name, value, and timestamp.
> 
>   This is from BigTable. In most cases, except when contained within a
> super column, the data is row-oriented. There is nothing inherently
> columnar about the storage. Furthermore, column is widely understood
> from SQL to mean a table-enforced, strongly typed slot. Since
> Cassandra does not have a tabular model, this is straight-up wrong.
> Timestamps are an additional unexpected innovation in the normal use
> of "column".

Another word for column in object relational parlance is "attribute".

>   Super column, containing a name and columns.
> 
>   This is a container of columns. However, the name expresses some
> kind of priority order, but nothing about the container nature, even
> though that's the most important property. This is not in any other
> usage anywhere, and will always require explanation. Despite being a
> type of column, it cannot be updated or overwritten like a standard
> column, and does not have a timestamp.
> 
> Try to approach the naming with the mind of a beginner. For what it's
> worth, it took me at least 6 weeks to become comfortable with the
> current Cassandra terminology, and I had many false assumptions based
> on the names. I remember it took far less than that when starting out
> with SQL. At least there you can defer the confusing parts until
> later; Cassandra hits you with the confusion all up front. Just
> because we are comfortable now, doesn't mean that the current names
> are a good thing.

> So, on to the new proposed naming. In Cassandra's implementation, each
> level of the data model contains the totality of the lower levels.
> I've tried to express that in the new names.
> 
>   Cluster.
> 
>   No change.
> 
>   Database (formerly keyspace formerly table).
> 
>   Since this is quite literally the same as a database in an RDBMS,
> there's no reason to change the term. It's a namespace with a specific
> set of storage flags flipped. Its usage is analogous to the same usage
> in an RDBMS.
> 
>   Record collection (formerly column family).

If a record is analogous to a row, then a "record collection" seems to
be a very confusing way of describing a column family (or attributes if
you will).

>   This expresses the container nature--an ordered set. The word
> "collection" is used in document databases to mean the same thing.
> 
>   Record (formerly a-thing-without-a-name)
> 
>   This is the row itself. It has a key, and attributes, but the thing
> itself is not a key. It is not a "document" because it does not
> arbitrarily nest, and it's not "row" because that might imply the
> tabular nature of an RDBMS. Record has a history in databases which is
> reasonable in this context. It does not imply that a record
> necessarily corresponds to a complete object in the application, but
> it doesn't rule it out. Since this is the only thing that has a key,
> it's still valid to refer to a "key" in isolation, when convenient.

Like "row" above, I think you can use the term "record" when describing
the unit of storage without enshrining it in the interface.

>  Attribute (formerly column).
> 
>  It has a name, value, and a timestamp. It does not imply anything
> about the storage. It does not imply a tabular model. It's more
> specific than "tuple", but easier to talk about than "timestamped
> key/value pair". It's the same as attributes in any object system.
> 
>  Attribute collection (formerly super column).
> 
>  This is clearly a container of attributes. That is all it implies,
> and that is what it is. It is analogous to record collection.

As noted earlier, "attribute" is another way of referring to a column
when talking about relational databases. IMO, if column is confusing,
attribute is worse.

> In short:
> 
>   Cluster
>   Database
>   Record collection
>   Record
>   Attribute collection
>   Attribute
> 
> We could call the cluster "database collection", but even I'm not
> going to go that far. I realize that each level is merely a collection
> of the collections under it, but an "attribute collection collection
> collection collection" is no help to day-to-day usage. ;-)
> 
> As a heuristic, do the current names help, or get in the way? I'm not
> married to the new proposal, but I want us to move in the right
> direction, and not act like the current unusual naming is a badge of
> honor, or forget our own difficulties in getting started.
> 
> Keep in mind that BigTable, as an internal Google project, did not
> have API clarity as a primary goal; witness the colon-string-API that
> got copied by Cassandra originally.
> 
> Comments please!

You're proposing some pretty disruptive changes, and as such the benefit
needs to be clear and obvious. IMO it's not.

The timing is also pretty bad considering we're nearing the end of the
0.4 roadmap, and this wasn't on the list.

-- 
Eric Evans
eevans@rackspace.com

Re: Fixing the data model names

Posted by Mark McBride <ma...@gmail.com>.

+1 on this, although I don't know if it's feasible to hold up 0.4 for
it.  I'll echo the difficulties in getting familiar with Cassandra
terminology.  My only issue is "Attribute Collection" is a mouthful.
Something like AttributeSet might be more concise and still convey
roughly the same meaning.

   ---Mark

On Tue, Aug 11, 2009 at 10:37 AM, Evan Weaver<ew...@gmail.com> wrote:
> Dear Cassandra Developers,
>
> In my experience, the naming of the data model has been a huge barrier
> to entry for users of Cassandra. This goes both for people familiar
> with SQL, and for people familiar with BigTable. I would like to
> change this before 0.4, since the 0.3 to 0.4 transition is the Great
> API Breakening.
>
> I (that is, all of us at Twitter) are willing to write all the patches
> and update the wiki, if I get the necessary community buy-in. I hoped
> that I could do one patch per each external interface change, and then
> after those are complete, a patch for each internal interface change
> as a phase 2.
>
> So technically this is not a bikeshed, because I'm happy to do all the
> work. I'll even submit a patch for Digg's Python client. Since there
> are no production deployments of ASF, and only a couple
> well-maintained clients, now is the time to break the world. A few
> hours of work now will pay off richly in terms of community
> involvement and reduced noob-explanation-time.
>
> In general, I think the data model names should have the following goals:
>
>  * Use existing, widely understood terms.
>  * Do not use terms that have conflicting meanings.
>  * Express analogies in the data model, where useful.
>  * Be unambiguous.
>
> Are these goals valid? Clearly I think they are, because I wrote you a
> very long email about it. Also, I don't think the current names meet
> these goals. Currently, we have:
>
>  Cluster, contains keyspaces:
>
>  This is fine.
>
>  Keyspace: contains column families.
>
> There was some discussion of this change on the list a while back.
> Keyspace beats Table by a mile, due to the "conflicting existing
> usage" rule, but I think we can do better.
>
>  Column family: containing a name, keys, column type, column sort,
> and sub column sort.
>
>  This name is from BigTable, and not in wide usage. It does not
> express the hierarchy of storage, rather referring to a side effect of
> the storage hierarchy by talking about the most granular data objects.
> Confusing.
>
>  Key: associated with columns.
>
>  Since there's no word for the entire
> key-and-columns-in-a-column-family thing ("row"), it's hard to talk
> about this level of the data model clearly.
>
>  Column: containing a name, value, and timestamp.
>
>  This is from BigTable. In most cases, except when contained within a
> super column, the data is row-oriented. There is nothing inherently
> columnar about the storage. Furthermore, column is widely understood
> from SQL to mean a table-enforced, strongly typed slot. Since
> Cassandra does not have a tabular model, this is straight-up wrong.
> Timestamps are an additional unexpected innovation in the normal use
> of "column".
>
>  Super column, containing a name and columns.
>
>  This is a container of columns. However, the name expresses some
> kind of priority order, but nothing about the container nature, even
> though that's the most important property. This is not in any other
> usage anywhere, and will always require explanation. Despite being a
> type of column, it cannot be updated or overwritten like a standard
> column, and does not have a timestamp.
>
> Try to approach the naming with the mind of a beginner. For what it's
> worth, it took me at least 6 weeks to become comfortable with the
> current Cassandra terminology, and I had many false assumptions based
> on the names. I remember it took far less than that when starting out
> with SQL. At least there you can defer the confusing parts until
> later; Cassandra hits you with the confusion all up front. Just
> because we are comfortable now, doesn't mean that the current names
> are a good thing.
>
> So, on to the new proposed naming. In Cassandra's implementation, each
> level of the data model contains the totality of the lower levels.
> I've tried to express that in the new names.
>
>  Cluster.
>
>  No change.
>
>  Database (formerly keyspace formerly table).
>
>  Since this is quite literally the same as a database in an RDBMS,
> there's no reason to change the term. It's a namespace with a specific
> set of storage flags flipped. Its usage is analogous to the same usage
> in an RDBMS.
>
>  Record collection (formerly column family).
>
>  This expresses the container nature--an ordered set. The word
> "collection" is used in document databases to mean the same thing.
>
>  Record (formerly a-thing-without-a-name)
>
>  This is the row itself. It has a key, and attributes, but the thing
> itself is not a key. It is not a "document" because it does not
> arbitrarily nest, and it's not "row" because that might imply the
> tabular nature of an RDBMS. Record has a history in databases which is
> reasonable in this context. It does not imply that a record
> necessarily corresponds to a complete object in the application, but
> it doesn't rule it out. Since this is the only thing that has a key,
> it's still valid to refer to a "key" in isolation, when convenient.
>
>  Attribute (formerly column).
>
>  It has a name, value, and a timestamp. It does not imply anything
> about the storage. It does not imply a tabular model. It's more
> specific than "tuple", but easier to talk about than "timestamped
> key/value pair". It's the same as attributes in any object system.
>
>  Attribute collection (formerly super column).
>
>  This is clearly a container of attributes. That is all it implies,
> and that is what it is. It is analogous to record collection.
>
> In short:
>
>  Cluster
>  Database
>  Record collection
>  Record
>  Attribute collection
>  Attribute
>
> We could call the cluster "database collection", but even I'm not
> going to go that far. I realize that each level is merely a collection
> of the collections under it, but an "attribute collection collection
> collection collection" is no help to day-to-day usage. ;-)
>
> As a heuristic, do the current names help, or get in the way? I'm not
> married to the new proposal, but I want us to move in the right
> direction, and not act like the current unusual naming is a badge of
> honor, or forget our own difficulties in getting started.
>
> Keep in mind that BigTable, as an internal Google project, did not
> have API clarity as a primary goal; witness the colon-string-API that
> got copied by Cassandra originally.
>
> Comments please!
>
> Thanks,
>
> Evan
>
> --
> Evan Weaver
>