Posted to user@cassandra.apache.org by mrevilgnome <mr...@gmail.com> on 2013/01/09 17:51:04 UTC

Wide rows in CQL 3

We use the thrift bindings for our current production cluster, so I haven't
been tracking the developments regarding CQL3. I just discovered when
speaking to another potential DSE customer that wide rows, or rather
columns not defined in the metadata aren't supported in CQL 3.

I'm curious to understand the reasoning behind this, whether this is an
intentional direction shift away from the big table paradigm, and what's
supposed to happen to those of us who have already bought into
C* specifically because of the wide row support. What is our upgrade path?

Re: Wide rows in CQL 3

Posted by Janne Jalkanen <ja...@ecyrd.com>.

On 10 Jan 2013, at 01:30, Edward Capriolo <ed...@gmail.com> wrote:

> Column families that mix static and dynamic columns are pretty common. In fact it is pretty much the default case, you have a default validator then some columns have specific validators. In the old days people used to say "You only need one column family" you would subdivide your row key into parts username=username, password=password, friend-friene = friends, pet-pets = pets. It's very efficient and very easy if you understand what a slice is. Is everyone else just adding a column family every time they have new data? :) Sounds very un-no-sql-like. 

Well, we for sure are heavily mixing static and dynamic columns; it's quite useful, really. Which is why upgrading to CQL3 isn't really something I've considered seriously at any point.

> Most people are probably going to store column names as tersely as possible. You're not going to store "password" as a multibyte UTF8("password"). You store it as ascii("password") (or really ascii('pw')).

UTF8('password') === ascii('password'), actually - as long as you're within ascii range, UTF8 and ascii are equal byte for byte. It's not until code points >= 128 that you start getting multibyte sequences.
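
The byte-for-byte claim is easy to check. Here is a minimal Python sketch; the column names are just the ones mentioned in this thread, and the non-ascii character is an arbitrary example:

```python
# Column names used in the thread; "pw" is the terse variant suggested above.
names = ["password", "pw", "username"]

for name in names:
    # Within the ascii range both encodings produce identical bytes.
    assert name.encode("utf-8") == name.encode("ascii")
    # One byte per character, so UTF-8 costs nothing extra here.
    assert len(name.encode("utf-8")) == len(name)

# Outside the ascii range UTF-8 needs more than one byte per character.
non_ascii = "\u00e9"  # e with acute accent, code point 233
utf8_len = len(non_ascii.encode("utf-8"))
print(utf8_len)
```

So declaring a column name as UTF8 only costs extra bytes if the name actually contains non-ascii characters.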

Having said that, doesn't the sparse storage lend itself really well for further column name optimisation - like using a single byte to denote the column name and then have a lookup table?  The server could do a lot of nice tricks in this area, when afforded so by a tighter schema. Also, I think that compression pretty much does this already - effect is the same even if mechanism is different.
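
To illustrate the lookup-table idea, here is a toy Python sketch. It is purely hypothetical: it does not describe anything Cassandra does, and the column list and the helper encoded_name_size are made up for the example.

```python
# Hypothetical schema: the declared column names of a static table.
declared_columns = ["username", "password", "email"]

# Assign each declared name a one-byte id (enough for up to 256 columns).
name_to_id = {name: bytes([i]) for i, name in enumerate(declared_columns)}
id_to_name = {v: k for k, v in name_to_id.items()}

def encoded_name_size(name: str, use_lookup: bool) -> int:
    """Bytes needed to store one column name in this toy model."""
    if use_lookup:
        return len(name_to_id[name])
    return len(name.encode("utf-8"))

# Total bytes spent on column names for one row, with and without the table.
full = sum(encoded_name_size(n, False) for n in declared_columns)
short = sum(encoded_name_size(n, True) for n in declared_columns)
print(full, short)
```

The saving only works because the schema declares the names up front, which is the point about a tighter schema giving the server room for such tricks.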

/Janne

Re: Wide rows in CQL 3

Posted by Edward Capriolo <ed...@gmail.com>.

Also I have to say I do not get that blank sparse column.

Ghost ranges are a little weird but they don't bother me.

1 It's a row of nothing. The definition of a waste.

2 Suppose I have 1 billion rows and my distribution is mostly rows of 1 or
2 columns. My database is now significantly bigger. That stinks.

3 Suppose I write columns frequently. Do I have to constantly keep writing
this sparse empty row? It seems like I would. Worst case, each sstable with a
write to a rowkey also has this sparse column, meaning multiple blank, empty,
wasteful columns on disk to solve ghosts, which do not bother me anyway.

4 Are these sparse columns also taking memtable space?

These questions would give me serious pause about using sparse tables.






Re: Wide rows in CQL 3

Posted by Edward Capriolo <ed...@gmail.com>.

By "no upgrade path" I mean to say that if I have a table with compact storage
I can not upgrade it to sparse storage. If I have an existing COMPACT table
and I want to add a Map to it, I can not. This is what I mean by no upgrade
path.

Column families that mix static and dynamic columns are pretty common. In
fact it is pretty much the default case, you have a default validator then
some columns have specific validators. In the old days people used to say
"You only need one column family" you would subdivide your row key into
parts username=username, password=password, friend-friene = friends,
pet-pets = pets. It's very efficient and very easy if you understand what a
slice is. Is everyone else just adding a column family every time they have
new data? :) Sounds very un-no-sql-like.

Most people are probably going to store column names as tersely as
possible. You're not going to store "password" as a multibyte
UTF8("password"). You store it as ascii("password") (or really ascii('pw')).

Also for the rest of my comment I meant that the comparator of any sparse
tables always seems to be a COMPOSITE even if it is only one part (last I
checked). Everything is -COMPOSITE(UTF-8(colname))- at minimum, when in a
compact table it is -colname-

My overarching point is that the 5 things I listed do have a cost, and the
user by default gets sparse storage unless they are smart enough to know they
do not want it. This is naturally going to force people away from compact
storage.

Basically for any column family: two possible decision paths:

1) use compact
2) use sparse

Other than ease of use, why would I choose sparse? Why should it be the
default?

On Wed, Jan 9, 2013 at 5:14 PM, Sylvain Lebresne <sy...@datastax.com> wrote:

> [...] dynamic way. Now I can't pretend knowing what every user is doing, but
> from my experience and what I've seen, this is not such a common thing and
> CF are either static or dynamic in nature, not both.
>

Re: Wide rows in CQL 3

Posted by aaron morton <aa...@thelastpickle.com>.

> Is this possible without using multiple rows in CQL3 non compact tables?  
Depending on the number of (log record) keys you *could* do this as a map type in your CQL Table. 

create table log_row (
  -- a primary key is required; using sequence for it is an assumption
  sequence timestamp PRIMARY KEY,
  props map<text, text>
);

Cheers


-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

Re: Wide rows in CQL 3

Posted by Vegard Berget <po...@fantasista.no>.

Thanks for explaining, Sylvain.

You say that it is "not a mandatory one", how long could we expect it to be
"not mandatory"?

I think the new CQL stuff is great and I will probably use it heavily. I
understand the upgrade path, but my question is if I should start
planning for an all-CQL future, or if I still could make some CFs with
thrift and also expect it to work in 3 years time. You say "you
should see CQL3 non compact tables as the new stuff, the thing that
you use post-upgrade" - but doesn't that mean that we also have to
suddenly depend on a schema? Let us for example say you have a
logger, which logs all kinds of different stuff - typically key-value
- and that each row could contain different keys.

ROWKEY1:  key1: val1, key2: val2, key3: val3
ROWKEY2:  key4: val4, key1: val2, keyN: valN

Is this possible without using multiple rows in CQL3 non compact
tables?

.vegard,

Re: Wide rows in CQL 3

Posted by Sylvain Lebresne <sy...@datastax.com>.

> There is no "upgrade path".

I don't think that's true. The goal of the blog post you've linked is to
discuss that upgrade path (and in particular show that for the most part, you
can access your thrift data from CQL3 without any modification whatsoever).

> You adopt CQL3's sparse tables as soon as you start creating column families
> from CQL.

That's not true, you can create non sparse tables from CQL3 (using COMPACT
STORAGE) and so you can work with both CQL3 and thrift side by side for the
time it takes you to upgrade from thrift to CQL3. Then, for things that you
know you will only access through CQL3 (i.e. when the "upgrade is complete"),
you can start using non compact tables and enjoy their convenience (like
collections for instance).

> There is not much backwards compatibility. CQL3 can query compact tables, but
> you may have to remove the metadata from them so they can be transposed.

I think "not much backwards compatibility" is a tad unfair. The only case where
you "may have to remove the metadata" is if you are using a CF in both a static
and dynamic way. Now I can't pretend to know what every user is doing, but from
my experience and what I've seen, this is not such a common thing and CFs are
either static or dynamic in nature, not both.

I do think that for most users upgrading from thrift to CQL3 won't require any
data migration or messing with metadata. But more importantly, things are not
completely closed. If you have *concrete* difficulties moving from thrift to
CQL3, please do share them on this mailing list and we'll try to help you out.

> Thrift can not write into CQL tables easily, because of how the primary keys
> and column names are encoded into the key column and compact metadata is not
> equal to cql3's metadata.

To be clear, CQL3 is meant as an upgrade from thrift. Not a mandatory one, you
can stick to thrift if you don't think CQL3 is better. But if you do decide to
upgrade, you should see CQL3 non compact tables as the new stuff, the thing
that you use post-upgrade. While you upgrade, stick to compact tables. Once
you've upgraded, then you can start using the new stuff, and accessing the new
stuff the old way doesn't matter.

> My biggest beefs are:
> 1) column names are UTF8 (seems wasteful in most cases)

That's largely not true, the "wasteful in most cases" part at least. A column
name in CQL3 does not always translate to an internal column name. You can
still do your time series where the internal column name is an int and you
don't waste space.

As for the static cases, yes, CQL3 forces UTF8, but I'm pretty certain that
people overwhelmingly use UTF8 or ascii in those cases. And because CQL3 forces
you to declare your column names in those static cases, we may actually be able
to optimize the size used internally for those in the future, which is harder
with thrift, so I think we actually have the potential to make it less wasteful
in most cases.

> 2) sparse empty row to ghost (seems like tiny rows with one column have much
> overhead now)

It is true that for non compact CQL3 we've focused on flexibility and on making
the behavior predictable, which does add some slight space overhead. However:
- that's why compact storage is here. There is zero overhead over thrift if
  you use compact storage. That's even why we named it like that, it's compact.
- we know that most of the overhead of non compact tables can be won back by
  optimization of the storage engine. That's an advantage of having an API
  that is not too tied to the underlying storage: it gives room for
  optimizations.

> 3) using composites (with (compound primary keys) in some table designs) is
> wasteful. Composite adds two unsigned bytes for size and one unsigned byte as
> 0 per part.

See above.

> 4) many lines of code between user/request and actual disk. (tracing a CQL
> select VS a slice, young gen, etc)

If you are saying the implementation of CQL3 is more lines of code than the
thrift part, then you're probably right, but given how much more convenient
CQL3 is compared to thrift, I happily take that criticism.

But in terms of overhead, provided you use prepared statements (which you
should if you care about performance), it remains to be proven that CQL3 has
more overhead than thrift. In particular in terms of garbage (since you're
citing young gen), while I haven't tested it, I'd be *really* surprised if
thrift is generating less garbage than CQL3. And in terms of the query tracing
there is almost no difference whatsoever between the two.

> 5) not sure if "collections" can be used in REALLY wide row scenarios. aka
> 1,000,000 entry set?

Lists have their downsides (listed in the documentation) but for sets and maps,
they have no more limitations than wide rows have in theory. They do have the
limitation that the API currently doesn't allow fetching parts of a
collection. But that will change.

That being said and possibly more importantly, collections are *not* meant to
be very wide. They are *not* meant for wide row scenarios. CQL3 has wide row
support (in the sense of thrift) *without* collections and for true wide row
scenarios you want to dedicate a CF to it, because that is the right thing to
do.

--
Sylvain

Re: Wide rows in CQL 3

Posted by Edward Capriolo <ed...@gmail.com>.

I ask myself this every day. CQL3 is a "new way" to do things, including wide
rows with collections. There is no "upgrade path". You adopt CQL3's sparse
tables as soon as you start creating column families from CQL. There is not
much backwards compatibility. CQL3 can query compact tables, but you may
have to remove the metadata from them so they can be transposed. Thrift can
not write into CQL tables easily, because of how the primary keys and
column names are encoded into the key column, and compact metadata is not
equal to cql3's metadata.

http://www.datastax.com/dev/blog/thrift-to-cql3

For a large swath of problems I like how CQL3 deals with them. For example
you do not really need CQL3 to store a collection in a column family along
side other data. You can use wide rows for this, but the integrated
solution with CQL3 metadata is interesting.

My biggest beefs are:
1) column names are UTF8 (seems wasteful in most cases)
2) sparse empty row to ghost (seems like tiny rows with one column have
much overhead now)
3) using composites (with compound primary keys in some table designs) is
wasteful. Composite adds two unsigned bytes for size and one unsigned byte
as 0 per part.
4) many lines of code between user/request and actual disk. (tracing a CQL
select VS a slice, young gen, etc)
5) not sure if "collections" can be used in REALLY wide row scenarios. aka
1,000,000 entry set?
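
The per-part cost in item 3 can be made concrete with a small Python sketch. It follows only the layout described in that item (two bytes of size, the component, one trailing 0 byte); encode_composite is a hypothetical helper written for this example, not a Cassandra API.

```python
import struct

def encode_composite(parts):
    """Encode parts as: 2-byte big-endian length, the bytes, then one 0 byte."""
    out = bytearray()
    for part in parts:
        out += struct.pack(">H", len(part))  # two bytes for the size
        out += part                          # the component itself
        out += b"\x00"                       # one trailing byte, 0 here
    return bytes(out)

plain = b"pw"                           # compact-style column name
composite = encode_composite([b"pw"])   # the same name as a one-part composite
overhead = len(composite) - len(plain)  # extra bytes per column name

two_part = encode_composite([b"a", b"bc"])
print(overhead, len(two_part))
```

Under this layout every part adds three bytes, so a short name like 'pw' more than doubles in size, while the relative cost shrinks for longer names.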

I feel that in an effort to be newbie friendly, sparse+CQL is presented as
the best default option. However the 5 above items are not minor, and in
several use cases could make CQL's sparse tables a bad choice for certain
applications. Those users would get better performance from compact
storage. I feel that message sometimes gets washed away in all the CQL
coolness. "What is that you say? This is not actually the most efficient
way to store this data? Well who cares I can do an IN CLAUSE! WooHoo!"

Re: Wide rows in CQL 3

Posted by Ben Hood <0x...@gmail.com>.

I'm currently in the process of porting my app from Thrift to CQL3 and it
seems to me that the underlying storage layout hasn't really changed
fundamentally. The difference appears to be that CQL3 offers a neater
abstraction on top of the wide row format. For example, in CQL3, your query
results are bound to a specific schema, so you get named columns back -
previously you had to process the slices procedurally. The insert path
appears to be tighter as well - you don't seem to get away with leaving out
key attributes.
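
As a rough illustration of that abstraction, here is a Python sketch of how one internal wide row could be presented as several CQL-style rows. It is a toy model, not driver code: the row data is taken from the logger example elsewhere in this thread, and the (key, column1, value) shape is an assumption about how a dynamic column family shows up in CQL3.

```python
# A toy "wide row" store as seen through thrift: row key -> {column name: value}.
wide_rows = {
    "ROWKEY1": {"key1": "val1", "key2": "val2", "key3": "val3"},
    "ROWKEY2": {"key4": "val4", "key1": "val2"},
}

def as_cql_rows(rows):
    """Transpose each internal cell into one (key, column1, value) tuple."""
    result = []
    for row_key, cells in rows.items():
        for name in sorted(cells):  # cells within a row are ordered by name
            result.append((row_key, name, cells[name]))
    return result

cql_rows = as_cql_rows(wide_rows)
for row in cql_rows:
    print(row)
```

The storage is the same in both views; what changes is that each cell comes back as a named, typed row instead of a slice you walk through by hand.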

I'm sure somebody more knowledgeable can explain this better though.

Cheers,

Ben

Re: Wide rows in CQL 3

Posted by "Hiller, Dean" <De...@nrel.gov>.

Probably should read this
http://www.datastax.com/dev/blog/cql3-for-cassandra-experts

I don't see wide row support going away since they specifically made the change to enable 2 billion columns in a row according to that paper.

Dean
