Posted to user@cassandra.apache.org by Karoly Negyesi <ch...@gmail.com> on 2010/07/16 01:26:47 UTC

A very short summary on Cassandra for a book

Hi,

I am writing a scalability chapter in a book and I need to mention
Apache Cassandra, although only in passing. Still, I would not like
to be sloppy and would like to verify that my summary is
accurate. "Cassandra stores four or five dimension associated arrays.
The first dimension is fixed on creation of the database but the rest
can be infinitely large. Inserts are super fast and can happen to any
database server in the cluster. However, the system is append only
there so there is no in-place update operation like increment. Also
sorting happens on insert time."

Thanks

Karoly Negyesi

RE: A very short summary on Cassandra for a book

Posted by Sanjay Sharma <sa...@impetus.co.in>.

Hi Jonathan,
I fear 'row-oriented' could fuel the holy war between 'row-based RDBMS' and 'column-oriented NoSQL databases'.

Some related reads:
- http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html
- http://en.wikipedia.org/wiki/Column-oriented_DBMS
- http://en.wikipedia.org/wiki/Apache_Cassandra, which says: "The values from a column family for each key are stored together, making Cassandra a hybrid between a column-oriented DBMS and a row-oriented store"

http://en.wikipedia.org/wiki/Apache_Cassandra certainly needs some cleanup!

Cheers,
Sanjay

-----Original Message-----
From: Jonathan Ellis [mailto:jbellis@gmail.com]
Sent: Tuesday, July 20, 2010 8:11 AM
To: user@cassandra.apache.org
Subject: Re: A very short summary on Cassandra for a book

Keep it simple.  Something like "Cassandra is a row-oriented, fully
distributed database designed for scalability, availability, and
durability."

Trying to explain the data model in two sentences is not going to
work, and "4 or 5 dimension associated arrays" is the wrong tree to
bark up entirely.  ("row-oriented" is the right one. :)

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: A very short summary on Cassandra for a book

Posted by Jonathan Ellis <jb...@gmail.com>.

Keep it simple.  Something like "Cassandra is a row-oriented, fully
distributed database designed for scalability, availability, and
durability."

Trying to explain the data model in two sentences is not going to
work, and "4 or 5 dimension associated arrays" is the wrong tree to
bark up entirely.  ("row-oriented" is the right one. :)

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: A very short summary on Cassandra for a book

Posted by David Strauss <da...@fourkitchens.com>.

On 2010-07-16 01:57, Dave Viner wrote:
> I am no expert... but parts seem accurate, parts not.
> 
> "Cassandra stores four or five dimension associated arrays"
> not sure what you're counting as a dimension of the associated array,
> but here are the 2 associative array-like syntaxes:
> 
> ColumnFamily[row-key][column-name] = value1
> ColumnFamily[row-key][super-column-name][column-name] = value2

You're forgetting the first dimension: the keyspace. However, that
dimension is mostly a scope for configuration and administration, just
like MySQL "databases" on a single MySQL instance.
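
To make the dimension counting concrete, here is a minimal sketch that
models the layout as nested Python dictionaries. It only illustrates the
shape of the data, not Cassandra's API; every name in it (Keyspace1,
Users, Friends, the row keys and columns) is made up for illustration.

```python
from collections import defaultdict


def nested():
    """Return a dictionary that creates deeper levels on first access."""
    return defaultdict(nested)


# Hypothetical in-memory stand-in for the data model, not a client API.
store = nested()

# Standard column family:
# keyspace -> column family -> row key -> column name = value
store["Keyspace1"]["Users"]["alice"]["email"] = "alice@example.com"

# Super column family adds one level:
# keyspace -> column family -> row key -> super column -> column name = value
store["Keyspace1"]["Friends"]["alice"]["coworkers"]["bob"] = "since 2009"

# The number of keys needed to reach a value is the "dimension" count.
standard_path = ("Keyspace1", "Users", "alice", "email")
super_path = ("Keyspace1", "Friends", "alice", "coworkers", "bob")
```

Counting the keyspace, a standard column family takes four keys to reach
a value and a super column family takes five, which is presumably where
"four or five" comes from.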

> "The first dimension is fixed on creation of the database but the
> rest can be infinitely large"
> I don't understand this sentence.  The definition of a ColumnFamily is
> set by the configuration file (storage-conf.xml).  If you change it, and
> restart a node, that node will use the new definition of the CF.

For a book, I would avoid pinning down what's dynamic at runtime and
what's fixed at startup because that's changing rapidly with upcoming
versions. Cassandra 0.7 features dynamic keyspace and column family
creation, and its release is going to happen well before the end of 2010.

Even now, it's possible to modify most configurations with no disruption
via a rolling cluster restart.

> It is true that the number of columns can be large.  I have no idea if
> it's actually infinite - but more or less.

There is no hard cap on the number of columns in a row. Real-world
systems are known to comfortably scale to millions of columns per row.

In current Cassandra releases, however, each super-column must fit into
memory. This is because the current architecture treats super-columns
and columns very similarly. While it's planned to change this for future
releases, there's interest in a broader overhaul allowing arbitrary
dimensionality; I wouldn't count on any change soon.

Also -- and this isn't much of a restriction -- each row must fit on a
single node's disk.

> Also, it's probably not precise to call it a database, since that tends
> to evoke images of things like MySQL, Oracle, Postgres, etc.

Those are *relational* databases. Historically, "database" has been a
general term for persistent data stores.

> "Inserts are super fast and can happen to any
> database server in the cluster."
> Yes, this is true.

Not 100% true. The sharding/partitioning mechanism in Cassandra assigns
each row to at least one server in the cluster (more if the replication
level is higher than one). It's possible to "write" to any server in the
cluster, but the write will only complete once confirmed on an
appropriate number of nodes (based on ConsistencyLevel).

ConsistencyLevel.ZERO is a special exception that allows nearly blind
writes to any node in the cluster, asynchronously replicating the data
to the proper nodes, but most applications use at least
ConsistencyLevel.ONE for any serious writes.
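
As a rough sketch of how the consistency level translates into "an
appropriate number of nodes", the function below computes how many
replica acknowledgements a write would wait for. The function name and
the string constants are hypothetical, and QUORUM and ALL are levels I
am adding for completeness; they are not discussed above. Treat this as
an illustration, not the server's actual logic.

```python
def required_acks(consistency_level, replication_factor):
    """Replica acknowledgements a write waits for before returning.

    Hypothetical helper for illustration; level names are plain strings.
    """
    if consistency_level == "ZERO":
        # Nearly blind write: return without waiting for any replica.
        return 0
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        # A majority of the replicas for the row.
        return replication_factor // 2 + 1
    if consistency_level == "ALL":
        return replication_factor
    raise ValueError("unknown consistency level: %r" % (consistency_level,))


# With three replicas per row:
acks_by_level = {
    level: required_acks(level, 3)
    for level in ("ZERO", "ONE", "QUORUM", "ALL")
}
```

The more acknowledgements a write waits for, the higher its latency,
which is the tradeoff the application gets to choose.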

The replication topology also affects write latency. Using a RackAware
approach, Cassandra will often require a confirmed write at a remote
location.

Cassandra intentionally allows applications to dynamically decide read
and write latency tradeoffs against consistency guarantees. So, I'd say
writes in Cassandra are "as fast as your consistency and durability
requirements allow."

> "However, the system is append only there so there is no in-place update
> operation like increment"
> The first part is not quite true.  There is appending, but there is no
> increment that's guaranteed universal.  Cassandra is "eventually
> consistent".  So atomic increment doesn't really work in the "eventual"
> world.  But, more precisely, one can add, update, change, modify, delete
> rows, columns, and values at any time from any node.

The lack of increment support has little to do with eventual consistency
and everything to do with timestamp-based conflict resolution. With
vector clocks (likely landing in 0.7 as a result of Digg's work), it
will be possible to support increment and decrement operations, just not
ones that give you an instant, unique result. The actual inc and dec
support probably won't be in 0.7, though.
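
To see why timestamp-based resolution gets in the way of increments,
here is a minimal sketch of two clients doing read-modify-write on the
same column. The resolve function and the (value, timestamp) tuples are
a simplification made up for this example, not Cassandra internals.

```python
def resolve(a, b):
    """Keep whichever (value, timestamp) pair has the newer timestamp."""
    return a if a[1] >= b[1] else b


# The column currently holds 10, written at timestamp 100.
stored = (10, 100)

# Two clients read 10 at about the same time; each adds 1 and writes back.
write_a = (stored[0] + 1, 101)
write_b = (stored[0] + 1, 102)

# Replicas reconcile by timestamp alone, so one write simply wins.
final = resolve(resolve(stored, write_a), write_b)

# Two increments were issued, but the surviving value reflects only one.
lost_updates = (stored[0] + 2) - final[0]
```

The newest timestamp wins regardless of what the values mean, so the
other client's increment is silently dropped.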

> "Also sorting happens on insert time"
> Yes, I believe this is true.

Basically true. I could nitpick, but it wouldn't add much clarity to the
discussion.
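
For readers who want a picture of what "sorting at insert time" buys
you, here is a toy sketch: columns are kept ordered by name as they are
written, so a range read needs no sort. The Row class and its methods
are hypothetical and say nothing about how Cassandra actually stores
data.

```python
import bisect


class Row:
    """Toy row that keeps its column names sorted as they are written."""

    def __init__(self):
        self._names = []    # column names, always in sorted order
        self._values = {}   # column name -> value

    def insert(self, name, value):
        # Pay the ordering cost on write: place new names in sorted position.
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, finish):
        # Reads return an already ordered range without sorting.
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


row = Row()
for name, value in [("c", 3), ("a", 1), ("b", 2)]:
    row.insert(name, value)

first_two = row.slice("a", "b")
```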

-- 
David Strauss
   | david@fourkitchens.com
   | +1 512 577 5827 [mobile]
Four Kitchens
   | http://fourkitchens.com
   | +1 512 454 6659 [office]
   | +1 512 870 8453 [direct]

Re: A very short summary on Cassandra for a book

Posted by Dave Viner <da...@pobox.com>.

I am no expert... but parts seem accurate, parts not.

"Cassandra stores four or five dimension associated arrays"
not sure what you're counting as a dimension of the associated array, but
here are the 2 associative array-like syntaxes:

ColumnFamily[row-key][column-name] = value1
ColumnFamily[row-key][super-column-name][column-name] = value2

"The first dimension is fixed on creation of the database but the rest can
be infinitely large"
I don't understand this sentence.  The definition of a ColumnFamily is set
by the configuration file (storage-conf.xml).  If you change it, and restart
a node, that node will use the new definition of the CF.

It is true that the number of columns can be large.  I have no idea if it's
actually infinite - but more or less.

Also, it's probably not precise to call it a database, since that tends to
evoke images of things like MySQL, Oracle, Postgres, etc.

"Inserts are super fast and can happen to any
database server in the cluster."
Yes, this is true.

"However, the system is append only there so there is no in-place update
operation like increment"
The first part is not quite true.  There is appending, but there is no
increment that's guaranteed universal.  Cassandra is "eventually
consistent".  So atomic increment doesn't really work in the "eventual"
world.  But, more precisely, one can add, update, change, modify, delete
rows, columns, and values at any time from any node.

"Also sorting happens on insert time"
Yes, I believe this is true.

Dave Viner
