Posted to dev@cassandra.apache.org by Evan Weaver <ew...@gmail.com> on 2009/08/17 20:10:23 UTC

Cassandra data model misconceptions, and their sources

Ok, here are the common Cassandra misconceptions, and their sources,
gleaned from experience and talking to various people.

Not listed in any particular order.

1. A key is global, and data in different column families must be related.
  - BigTable paper
  - key precedence in Thrift API

2. Table is like a row-oriented table
  - the name
  - somewhat fixed by changing to keyspace

3. Keyspace is not like a database (in SQL/CouchDB/MongoDB)
  - because it's not called that

4. Columns are literally columnar
  - the name
  - column sets are stored per key, not per column family (unlike
relational DBs)
  - column name as a piece of data is unusual (esp. in relational DBs)

5. Columns are versioned
  - BigTable paper

6. Super columns are magical
  - Name has no precedent anywhere
  - Super columns do not have timestamps unlike columns
  - Other MVAs are not fully recursive; just have values

7. Difference between column family, column, and super column is not clear
  - Everything has "column" in the name
  - "super", "family", and "" are not well-understood

8. Cassandra uses Paxos
  - BigTable paper

9. Cassandra uses client-side conflict resolution
  - Dynamo paper

A lot of things to get wrong, right off the bat.

Maybe this makes it clear why the BigTable references were not helpful
to us? For a new user, it provides as many wrong assumptions as
correct assumptions.
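
To make items 1, 4, 5 and 9 concrete, here is a minimal sketch in
Python, assuming nested dicts as a stand-in for the real storage. The
names (keyspace, write, the "Users" and "Tweets" column families) are
made up for illustration and are not Cassandra's API.

```python
# Toy model only: plain dicts standing in for Cassandra's storage.
# keyspace -> column family -> row key -> column name -> (value, timestamp)
keyspace = {
    "Users": {
        "42": {"email": ("a@example.com", 1), "city": ("SF", 1)},
        "43": {"email": ("b@example.com", 1)},  # no "city" column: item 4
    },
    "Tweets": {
        # Same key string as a user above, but an unrelated row: item 1
        "42": {"body": ("hello", 5)},
    },
}


def write(cf, key, name, value, timestamp):
    """Keep only the newest value per column (last write wins)."""
    row = keyspace[cf].setdefault(key, {})
    current = row.get(name)
    # No version history is kept (item 5), and the store itself picks
    # the winner by timestamp instead of asking the client (item 9).
    if current is None or timestamp > current[1]:
        row[name] = (value, timestamp)


write("Users", "42", "city", "NYC", 3)
write("Users", "42", "city", "LA", 2)  # older timestamp, ignored
```

After both writes only the ("NYC", 3) pair survives for that column;
nothing older is retained.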

Evan

-- 
Evan Weaver

Re: Cassandra data model misconceptions, and their sources

Posted by Curt Micol <as...@gmail.com>.

On Mon, Aug 17, 2009 at 6:32 PM, Mark McBride<ma...@gmail.com> wrote:
> My first attempt at a revamped data model wiki page is up here
>
> http://wiki.apache.org/cassandra/DataModel2

Awesome.

> This one follows phatduckk's approach of describing the data model
> bottom up, which I found more intuitive.  I'm interested to hear if
>
> 1) It corrects some of the misconceptions people have run into
> 2) The bottom up approach is more approachable than top down.

Completely agree.

> 3) I got everything covered and everything right :)

Thanks Mark, it looks great so far.

-- 
# Curt Micol

Re: Cassandra data model misconceptions, and their sources

Posted by Adam Rosien <ad...@rosien.net>.

While I interpret Evan as arguing against using RDB terms like row and
column, I would favor keeping those terms. Cassandra's data model is
typically *initially* described as a table--without relational
aspects!--and then the distinction of its storage strategy
(column-oriented, mostly, sort of, with qualifiers) is explained. This
has helped me understand the similarities and differences quite well,
i.e., a very simplistic view of Cassandra is (RDB without relational,
column-oriented-ish, extensible column ids, ...). Understanding the
data layout makes it natural to understand the scaling and trade-offs.

As a philosophical aside, the "no-sql" meme emphasizing the exposure
of how data is actually stored is a great leap forward. We all need to
know all these details and what trade-offs are being made.

.. Adam

On Tue, Aug 18, 2009 at 10:42 AM, Eric Evans<ee...@rackspace.com> wrote:
> On Mon, 2009-08-17 at 15:32 -0700, Mark McBride wrote:
>> My first attempt at a revamped data model wiki page is up here
>>
>> http://wiki.apache.org/cassandra/DataModel2
>
> I think you are on the right track. Very nice.
>
> --
> Eric Evans
> eevans@rackspace.com
>
>

Re: Cassandra data model misconceptions, and their sources

Posted by Eric Evans <ee...@rackspace.com>.

On Mon, 2009-08-17 at 15:32 -0700, Mark McBride wrote:
> My first attempt at a revamped data model wiki page is up here
> 
> http://wiki.apache.org/cassandra/DataModel2

I think you are on the right track. Very nice.

-- 
Eric Evans
eevans@rackspace.com

Re: Cassandra data model misconceptions, and their sources

Posted by Adam Rosien <ad...@rosien.net>.

I find the diagrams of Evan and folks
(http://blog.evanweaver.com/files/cassandra/twitter.jpg) much easier
to grok than any particular naming scheme. Annotating that diagram
with specific implementations or constraints, like your wiki page, is
a great addition.

.. Adam

On Mon, Aug 17, 2009 at 3:32 PM, Mark McBride<ma...@gmail.com> wrote:
> My first attempt at a revamped data model wiki page is up here
>
> http://wiki.apache.org/cassandra/DataModel2
>
> This one follows phatduckk's approach of describing the data model
> bottom up, which I found more intuitive.  I'm interested to hear if
>
> 1) It corrects some of the misconceptions people have run into
> 2) The bottom up approach is more approachable than top down.
> 3) I got everything covered and everything right :)
>
>   ---Mark
>
> On Mon, Aug 17, 2009 at 11:31 AM, Edward
> Ribeiro<ed...@gmail.com> wrote:
>> Right on target, Evan!
>>
>> When I first downloaded Cassandra, three months ago, I tried to make
>> the analogy with BigTable, whose paper I'd already read, but the
>> differences between Cassandra and BigTable made it quite hard to grasp
>> some Cassandra concepts.
>>
>> Imho, as Table was renamed to Keyspace, Column should be the next
>> concept to be renamed, as shown by numbers 4, 5, 6, and 7 of your
>> list. I would suggest renaming Column to Attribute (with the
>> corresponding AttributeFamily or AttributeSet). It's not the best
>> name, but right off the bat it is what I can suggest.
>>
>> Edward
>>
>

Re: Cassandra data model misconceptions, and their sources

Posted by Mark McBride <ma...@gmail.com>.

My first attempt at a revamped data model wiki page is up here

http://wiki.apache.org/cassandra/DataModel2

This one follows phatduckk's approach of describing the data model
bottom up, which I found more intuitive.  I'm interested to hear if

1) It corrects some of the misconceptions people have run into
2) The bottom up approach is more approachable than top down.
3) I got everything covered and everything right :)

   ---Mark

On Mon, Aug 17, 2009 at 11:31 AM, Edward
Ribeiro<ed...@gmail.com> wrote:
> Right on target, Evan!
>
> When I first downloaded Cassandra, three months ago, I tried to make
> the analogy with BigTable, whose paper I'd already read, but the
> differences between Cassandra and BigTable made it quite hard to grasp
> some Cassandra concepts.
>
> Imho, as Table was renamed to Keyspace, Column should be the next
> concept to be renamed, as shown by numbers 4, 5, 6, and 7 of your
> list. I would suggest renaming Column to Attribute (with the
> corresponding AttributeFamily or AttributeSet). It's not the best
> name, but right off the bat it is what I can suggest.
>
> Edward
>

Re: Cassandra data model misconceptions, and their sources

Posted by Edward Ribeiro <ed...@gmail.com>.

Right on target, Evan!

When I first downloaded Cassandra, three months ago, I tried to make
the analogy with BigTable, whose paper I'd already read, but the
differences between Cassandra and BigTable made it quite hard to grasp
some Cassandra concepts.

Imho, as Table was renamed to Keyspace, Column should be the next
concept to be renamed, as shown by numbers 4, 5, 6, and 7 of your
list. I would suggest renaming Column to Attribute (with the
corresponding AttributeFamily or AttributeSet). It's not the best
name, but right off the bat it is what I can suggest.

Edward

Re: Cassandra data model misconceptions, and their sources

Posted by Curt Micol <as...@gmail.com>.

On Tue, Aug 18, 2009 at 10:36 AM, Evan Weaver<ew...@gmail.com> wrote:
> Did you read the previous thread about this?
>
> http://markmail.org/thread/qbocotgkan4mg73w
>
> I don't think your proposals are too good...I have a new proposal
> based on feedback in the previous thread, that I will send soon. But I
> wanted some comments on the misconceptions themselves.

Fair enough. I have read that thread, though it seems nothing
suggested there caught on, which is why I was trying to brainstorm a
bit further out. I assumed this thread was more evidence of the need
for a name change; sorry to take it OT.

I do think you've highlighted a number of key misconceptions.

-- 
# Curt Micol

Re: Cassandra data model misconceptions, and their sources

Posted by Evan Weaver <ew...@gmail.com>.

Did you read the previous thread about this?

http://markmail.org/thread/qbocotgkan4mg73w

I don't think your proposals are too good...I have a new proposal
based on feedback in the previous thread, that I will send soon. But I
wanted some comments on the misconceptions themselves.

Evan

On Tue, Aug 18, 2009 at 1:33 AM, Curt Micol<as...@gmail.com> wrote:
> I've been thinking about this for a number of days, and again, while I am not a
> developer I thought I might toss in a proposal if that's okay.
>
> Since putting together a schema diagram and having a number of people review
> it, I think a change is warranted. Too many people are coming from the RDBMS
> world and the terms used by Cassandra are conflicting with those terms they
> are already familiar with.
>
> The TLDR version is as follows:
>
> Object (Column)
> ObjectFamily (ColumnFamily)
> Directory (Row)
> ObjectContainer (SuperColumn)
> Namespace (Keyspace)
>
> The long version...
>
> Object (Column)
> As Evan has stated repeatedly, column is a bit misleading especially when
> compared to other types of database systems.  I think this is probably the
> most important change to the data model names, and exactly where I started
> since this is the 'core' of Cassandra.  Object gives the impression that this
> is a piece of data; it's relatively structured, but the name gives no
> impression of how strict that structure is. 'Objects' have names, values
> and timestamps. Simple and to the point. 'Object' doesn't come with the
> preconceived notions that 'column' comes with and leaves room for Cassandra to
> define what an 'object' is without any conflict to preexisting data
> structures.
>
> By changing this, we can move up the ladder to other data types and
> easily rename them to something that 'contains objects' or 'accesses objects'.
> This allows us to describe the data model in the name structure without
> having to get too deep into the definition.
>
> Directory (Row)
> 'row' is currently unnamed, but still a structure that exists in the model.
> It's not specifically data itself, but more of a mapping of how to get to
> objects (using keys). 'Directory' fills this void quite well. It is easily
> explained as a path to get to data and not data itself.
>
> ObjectFamily (ColumnFamily)
> There's no argument that the one direct link to the BigTable paper is 'column
> families'. It's perhaps the only structure that is virtually the same in both
> pieces of software.  Considering this, I think we need to avoid too drastic a
> change.  With that said, I think a change is necessary due to the differences
> in columns between the two databases. 'object family' is descriptive of the
> relation between objects and removes any reference to tabular structures while
> keeping a loose relationship to 'column family' in the BigTable paper.
>
> ObjectContainer (SuperColumn)
> I could see this being shortened to 'container' in everyday conversation.
> However, 'objectcontainer' fits nicely with the rest of the data model names
> and is descriptive of its purpose and use. Ultimately a 'supercolumn' is
> nothing more than a named container of columns (and I've seen on at least 3
> different occasions the word container used to describe supercolumns).
> 'supercolumn' had no real connection to what exactly it was defining, but with
> 'object container' we have a clear understanding that we are naming the
> structure that holds objects. Or as I explained it to a friend, we are naming
> the 'jar' and not the 'honey'. :)
>
> Namespace (Keyspace)
> This one I go back and forth on. I know it's been changed from 'Table' to
> 'keyspace' and Evan proposed 'database', but I think that 'namespace' is
> really what it is we are talking about. Wikipedia has this as the first line
> to describe 'namespace':
>
> A namespace is an abstract container or environment created to hold a
> logical grouping of unique identifiers or symbols (i.e., names).
>
> Originally I thought 'objectspace' would fit better, but I think 'namespace'
> comes with a better history and is clearer about what this structure really is.
> Especially when you relate the name namespace to how it is used in Ruby, Python
> and Java. Ultimately though, I think I prefer 'keyspace' over 'table'
> or 'database'.
>
> The only issue I see with all of these names is the potential conflict with
> programming languages and their objects. I know next to nothing about Java so
> I don't know if there would be a conflict here. I've run the following Google
> search 'reserved words in *' where '*' is Ruby, Python, Java and C++ and
> received no mention of 'object' being a reserved word in any of those
> languages.
>
> I also grep'd through current source code and there don't seem to be any
> real conflicts that couldn't be named something else so as not to conflict
> with this naming structure.
>
> In the end, I think it's a good idea to look at this and work out a solution.
> Documentation and tutorials are going to help, but I think people are so
> entrenched in the RDBMS world that there is somewhat of a barrier to
> understanding Cassandra's data model.
>
> Thanks for your time,
>
> --
> # Curt Micol
>



-- 
Evan Weaver

Re: Cassandra data model misconceptions, and their sources

Posted by Joe Stump <jo...@joestump.net>.

On Aug 18, 2009, at 11:51 AM, Wilson Mar wrote:

> I say process because there may be a perfect word in Hebrew, Nigerian,
> or other language we can borrow that implies the perfect nuance we
> need.

I've found Finnish to have pronounceable, English-looking words that
have great meanings. For instance, I named a big SQL queue (for
running hundreds of thousands of SQL queries across Digg's clusters)
Muuttaa, which means "to move or migrate" in Finnish.

Wiktionary is a great source for this.

--Joe

Re: Cassandra data model misconceptions, and their sources

Posted by Matthias Wessendorf <ma...@apache.org>.

On Tue, Aug 18, 2009 at 8:24 PM, Jonathan Ellis<jb...@gmail.com> wrote:
> That's really outside our scope here.  Anyone who wants to write docs
> in a non-English language is welcome to start another thread to
> discuss terminology in that language, but we shouldn't hold up the
> canonical English docs for that.

+1

>
> -Jonathan
>
> On Tue, Aug 18, 2009 at 10:51 AM, Wilson Mar<wi...@gmail.com> wrote:
>> This is a good discussion.
>>
>> I would like to add that whatever English names we end up with we
>> should also get non-English versions of those words as part of our
>> process.
>>
>> I say process because there may be a perfect word in Hebrew, Nigerian,
>> or other language we can borrow that implies the perfect nuance we
>> need.
>>
>> We are expecting great things from this technology, so having
>> translations of the key words from the get-go would help toward
>> quicker, wider worldwide adoption.
>>
>> - Wilson Mar
>>
>



-- 
Matthias Wessendorf

blog: http://matthiaswessendorf.wordpress.com/
sessions: http://www.slideshare.net/mwessendorf
twitter: http://twitter.com/mwessendorf

Re: Cassandra data model misconceptions, and their sources

Posted by Jonathan Ellis <jb...@gmail.com>.

That's really outside our scope here.  Anyone who wants to write docs
in a non-English language is welcome to start another thread to
discuss terminology in that language, but we shouldn't hold up the
canonical English docs for that.

-Jonathan

On Tue, Aug 18, 2009 at 10:51 AM, Wilson Mar<wi...@gmail.com> wrote:
> This is a good discussion.
>
> I would like to add that whatever English names we end up with we
> should also get non-English versions of those words as part of our
> process.
>
> I say process because there may be a perfect word in Hebrew, Nigerian,
> or other language we can borrow that implies the perfect nuance we
> need.
>
> We are expecting great things from this technology, so having
> translations of the key words from the get-go would help toward
> quicker, wider worldwide adoption.
>
> - Wilson Mar
>

Re: Cassandra data model misconceptions, and their sources

Posted by Wilson Mar <wi...@gmail.com>.

This is a good discussion.

I would like to add that whatever English names we end up with we
should also get non-English versions of those words as part of our
process.

I say process because there may be a perfect word in Hebrew, Nigerian,
or other language we can borrow that implies the perfect nuance we
need.

We are expecting great things from this technology, so having
translations of the key words from the get-go would help toward
quicker, wider worldwide adoption.

- Wilson Mar

Re: Cassandra data model misconceptions, and their sources

Posted by Curt Micol <as...@gmail.com>.

I've been thinking about this for a number of days, and again, while I am not a
developer I thought I might toss in a proposal if that's okay.

Since putting together a schema diagram and having a number of people review
it, I think a change is warranted. Too many people are coming from the RDBMS
world and the terms used by Cassandra are conflicting with those terms they
are already familiar with.

The TLDR version is as follows:

Object (Column)
ObjectFamily (ColumnFamily)
Directory (Row)
ObjectContainer (SuperColumn)
Namespace (Keyspace)

The long version...

Object (Column)
As Evan has stated repeatedly, column is a bit misleading especially when
compared to other types of database systems.  I think this is probably the
most important change to the data model names, and exactly where I started
since this is the 'core' of Cassandra.  Object gives the impression that this
is a piece of data; it's relatively structured, but the name gives no
impression of how strict that structure is. 'Objects' have names, values
and timestamps. Simple and to the point. 'Object' doesn't come with the
preconceived notions that 'column' comes with and leaves room for Cassandra to
define what an 'object' is without any conflict to preexisting data
structures.
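
As a rough illustration of the name/value/timestamp shape described
above, here is a minimal sketch in Python. The class name Column and
the field types are assumptions for illustration only; this is not how
Cassandra itself declares the structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """A single 'Object' in the proposed naming: three fields, no more."""
    name: str       # the name is itself data, not a fixed schema label
    value: str
    timestamp: int  # supplied by the writer; newer timestamps win


# Hypothetical example values.
col = Column(name="email", value="someone@example.com", timestamp=1)
```

Nothing in the structure says which names are allowed, which is the
sense in which the name gives no hint of how strict the structure is.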

By changing this, we can move up the ladder to other data types and
easily rename them to something that 'contains objects' or 'accesses objects'.
This allows us to describe the data model in the name structure without
having to get too deep into the definition.

Directory (Row)
'row' is currently unnamed, but still a structure that exists in the model.
It's not specifically data itself, but more of a mapping of how to get to
objects (using keys). 'Directory' fills this void quite well. It is easily
explained as a path to get to data and not data itself.
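
The "path to data" idea can be sketched as a lookup through nested
maps. This is only a toy in Python; the function name lookup and the
sample store are hypothetical, and real access goes through the Thrift
API rather than anything like this.

```python
def lookup(store, column_family, key, column_name):
    """Follow the path column family -> key -> column name."""
    row = store[column_family][key]  # the 'Directory': a mapping, not data
    return row[column_name]          # the data itself lives at the leaf


# Hypothetical sample data.
store = {"Users": {"curt": {"email": "someone@example.com"}}}
email = lookup(store, "Users", "curt", "email")
```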

ObjectFamily (ColumnFamily)
There's no argument that the one direct link to the BigTable paper is 'column
families'. It's perhaps the only structure that is virtually the same in both
pieces of software.  Considering this, I think we need to avoid too drastic a
change.  With that said, I think a change is necessary due to the differences
in columns between the two databases. 'object family' is descriptive of the
relation between objects and removes any reference to tabular structures while
keeping a loose relationship to 'column family' in the BigTable paper.

ObjectContainer (SuperColumn)
I could see this being shortened to 'container' in everyday conversation.
However, 'objectcontainer' fits nicely with the rest of the data model names
and is descriptive of its purpose and use. Ultimately a 'supercolumn' is
nothing more than a named container of columns (and I've seen on at least 3
different occasions the word container used to describe supercolumns).
'supercolumn' had no real connection to what exactly it was defining, but with
'object container' we have a clear understanding that we are naming the
structure that holds objects. Or as I explained it to a friend, we are naming
the 'jar' and not the 'honey'. :)
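
A minimal sketch of the 'jar' versus 'honey' distinction, in Python.
The class and method names here are made up for illustration and are
not taken from the Cassandra source. Note that the container carries a
name but no timestamp of its own, in line with Evan's point that super
columns do not have timestamps.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Column:
    """The 'honey': a name, a value and a timestamp."""
    name: str
    value: str
    timestamp: int


@dataclass
class SuperColumn:
    """The 'jar': a named container of columns, with no timestamp."""
    name: str
    columns: dict = field(default_factory=dict)  # column name -> Column

    def add(self, column):
        # Columns are keyed by name inside the container.
        self.columns[column.name] = column


# Hypothetical example values.
jar = SuperColumn(name="address")
jar.add(Column(name="city", value="Chicago", timestamp=1))
jar.add(Column(name="zip", value="60601", timestamp=1))
```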

Namespace (Keyspace)
This one I go back and forth on. I know it's been changed from 'Table' to
'keyspace' and Evan proposed 'database', but I think that 'namespace' is
really what it is we are talking about. Wikipedia has this as the first line
to describe 'namespace':

A namespace is an abstract container or environment created to hold a
logical grouping of unique identifiers or symbols (i.e., names).

Originally I thought 'objectspace' would fit better, but I think 'namespace'
comes with a better history and is clearer about what this structure really is.
Especially when you relate the name namespace to how it is used in Ruby, Python
and Java. Ultimately though, I think I prefer 'keyspace' over 'table'
or 'database'.

The only issue I see with all of these names is the potential conflict with
programming languages and their objects. I know next to nothing about Java so
I don't know if there would be a conflict here. I've run the following Google
search 'reserved words in *' where '*' is Ruby, Python, Java and C++ and
received no mention of 'object' being a reserved word in any of those
languages.

I also grep'd through current source code and there don't seem to be any
real conflicts that couldn't be named something else so as not to conflict
with this naming structure.

In the end, I think it's a good idea to look at this and work out a solution.
Documentation and tutorials are going to help, but I think people are so
entrenched in the RDBMS world that there is somewhat of a barrier to
understanding Cassandra's data model.

Thanks for your time,

-- 
# Curt Micol