You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@cassandra.apache.org by Eric Evans <ee...@acunu.com> on 2012/01/02 17:53:41 UTC

Re: CQL support for compound columns

On Sat, Dec 31, 2011 at 1:12 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> On Fri, Dec 30, 2011 at 12:30 PM, Eric Evans <ee...@acunu.com> wrote:
>>> CREATE TABLE timeline (
>>>    user_id int,
>>>    posted_at uuid,
>>>    body string,
>>>    posted_by string,
>>>    PRIMARY KEY(user_id, posted_at, posted_by),
>>>    VALUE(body)
>>> );
>>
>> I think the value declaration also helps in that it's one more thing
>> that provides cues as to the data model it creates (more expressive).
>> But this got me thinking, why not introduce something special for the
>> composite name as well?  That way the PRIMARY KEY syntax (which comes
>> preloaded with meaning and expectations) could be kept more SQLish,
>> and the whole thing looks more like an extension to the language as
>> opposed to a modification.
>>
>> Say:
>>
>> CREATE TABLE timeline (
>>  user_id int PRIMARY KEY,
>>  posted_at uuid,
>>  body text,
>>  posted_by text,
>>  COMPOSITE_NAME(posted_at, posted_by),
>>  COMPOSITE_VALUE(body)
>> )
>
> I went back and forth on this mentally, but I come down as -0 on CN
> instead of PK.  For two reasons:
>
> First, the composite PRIMARY KEY is a better description of what you
> can actually do with the data.  In a relational model, a PK of user_id
> means there is only one (user_id, posted_at, body, posted_by) row with
> a given user_id.  Which is not the case here.  PK = (row key +
> composite components) captures exactly what is "immutable and unique"
> in a given object, so it's actually exactly what it's meant for and
> not an abuse at all.  (It even fits nicely with the "queries involving
> the PK are always indexed" assumption that isn't required by the SQL
> standard but every other database does anyway because it makes the
> most sense.)

Yeah, you're right, PK is a better fit for this.

Now that I'm forced to think about it a bit more, I think my un-SQL
reaction is probably rooted more in the abuse of the PRIMARY KEY
syntax, than the meaning it conveys.

In SQL, PRIMARY KEY is a modifier to a column spec, and here PRIMARY
KEY(user_id, posted_at, posted_by) reads like a PRIMARY modifier
applied to a KEY() function.  It's also a little strange the way it
appears in the grouping of column specs, when it's actually defining a
grouping or relationship of them (maybe this is what you meant about
using TRANSPOSED WITH <options> to emphasize the non-standard).

I wonder if there isn't a way to keep the PRIMARY KEY connection while
making it a little more SQL (and hence more intuitive).  Maybe
something like:


CREATE TABLE timeline (
  (user_id int, posted_at uuid, posted_by) PRIMARY KEY,
  body text
)


-- 
Eric Evans
Acunu | http://www.acunu.com | @acunu

Re: CQL support for compound columns

Posted by Sylvain Lebresne <sy...@datastax.com>.
I had complains that my preceding mail was unreadable (thanks gmailfor
fucking my formatting up), so I've posted the same thing with nice
formatting on the JIRA ticket.

--
Sylvain

On Tue, Jan 3, 2012 at 7:08 PM, Sylvain Lebresne <sy...@datastax.com> wrote:
> Ok, I think I'm warming up to what we're getting at. I would change
> thesyntax of the VALUE() thing however. Instead of:CREATE TABLE
> timeline (   userid int,   posted_at uuid,   body string,   PRIMARY
> KEY(user_id, posted_at),   VALUE(body))I would prefer:CREATE COMPACT
> TABLE timeline (   userid int,   posted_at uuid,   body string,
> PRIMARY KEY(user_id, posted_at),)
> The reasons being that it really influences the implementation layout
> of theCF in C*. Namely, the non-compact CF defined by CREATE TABLE
> timeline (   userid int,   posted_at uuid,   body string,   PRIMARY
> KEY(user_id, posted_at),)would look in C* like:<userid> : {
> <posted_at>:'body' -> <value>}while the COMPACT variant would
> be:<userid> : {    <posted_at> -> <value>}which is using the fact that
> there is only 1 field not part of the key to"optimize" the layout. And
> I believe making the COMPACT part of the CREATEemphasis better that
> it's a property of the definition itself (that cannot bechanged)
> rather than of that specific 'body' field. It also make the rule
> forCOMPACT table rather simple: "a compact table should have only one
> field notpart of the primary key"; you don't have to deal with errors
> like someonedefining two VALUE() for instance.
>
> That being said, I'd like to try to resume where we're at (including
> theCOMPACT change above) and add a few random ideas along the way.
> Please correctme if I've got something wrong.
> I think we have 4 different cases, 2 for 'standard' CF without
> composites:- static CFs (the only case CQL handle really well today)-
> dynamic CFs (wide rows, time series if you prefer)and 2 for CF with
> composite column names:- 'dense' composite (typically time series but
> where the key is naturally  multi-parts)- 'sparse' composite (aka
> super columns)
> Let me try to take an example for which, with how it would
> translateinternally and example queries.
>
> Standard "static" CF--------------------
> "For each user, holds his infos"
> CREATE TABLE users (    userid uuid PRIMARY KEY,    firstname text,
> lastname text,    age int)
> In C*:<userid> : {    'firstname' -> <value>    'lastname' -> <value>
>   'age' -> <value>}
> Query:SELECT firstname, lastname FROM users WHERE userid = '...';
> Standard "dynamic" CF---------------------
> "For each user, keep each url he clicked on with the date of last click"
> CREATE COMPACT TABLE clicks (    userid uuid,    url text,
> timestamp date    PRIMARY KEY (userid, url))
> In C*:<userid> : {    <url> -> <timestamp>}
> Query:SELECT url, timestamp FROM clicks WHERE userid = '...';SELECT
> timestamp FROM clicks WHERE userid = '...' and url = 'http://...';
> 'dense' composite-----------------
> "For each user, keep ip and port from where he connected with the date
> of lastconnection"
> CREATE COMPACT TABLE connections (    userid uuid,    ip binary,
> port int,    timestamp date    PRIMARY KEY (userid, ip, port))
> In C*:<userid> : {    <ip>:<port> -> <timestamp>}
> Query:SELECT ip, port, timestamp FROM connections WHERE userid = '...';
> 'sparse' composite------------------
> "User timeline"
> CREATE TABLE timeline (    userid uuid,    posted_at date,    body
> text,    posted_by text,    PRIMARY KEY (user_id, posted_at),);
> In C*:<userid> : {    <posted_at>:'body' -> <value>
> <posted_at>:'posted_by' -> <value>}
> Query:SELECT body, posted_by FROM timeline WHERE userid = '...' and
> posted_at = '2 janvier 2010'
> Note: I think we really should also be able to do queries like:SELECT
> posted_ad, body, posted_by FROM timeline WHERE userid = '...' and
> posted_at > '2 janvier 2010'but that's more akin to the modification
> of the syntax for slices.
>
> Random other ideas------------------
> 1) We could allow something like:    CONSTRAINT key PRIMARY KEY
> (userid, ip, port)  which would then allow to write    SELECT
> timestamp FROM users WHERE key = ('...', 192.168.0.1, 80);  (I believe
> this is the 'standard' notation to name a 'composite' key in SQL)
> - Above we're ony handling the use of composites for column names, but
> they  can be useful for value (and row keys) and it could be nice to
> have an easy  notation for that (clearly a following ticket however).
> What about:
> CREATE COMPACT TABLE timeline (    userid_part1 text,    userid_part2
> int,    posted_at date,    posted_by uuid,    body text    header text
>    GROUP (userid_part1, userid_part2) AS userid,    PRIMARY KEY
> (userid, posted_at, posted_by)    GROUP (header, body))
> In C*:<userid_part1>:<userid_part2> : {    <posted_at>:<posted_by> ->
> <header>:<body>}
> Query:SELECT posted_at, posted_by, body, header FROM timeline WHERE
> userid = ('john', 32)
>
> --
> Sylvain
> On Mon, Jan 2, 2012 at 8:29 PM, Eric Evans <ee...@acunu.com> wrote:
>> On Mon, Jan 2, 2012 at 12:55 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>> On Mon, Jan 2, 2012 at 10:53 AM, Eric Evans <ee...@acunu.com> wrote:
>>>> In SQL, PRIMARY KEY is a modifier to a column spec, and here PRIMARY
>>>> KEY(user_id, posted_at, posted_by) reads like a PRIMARY modifier
>>>> applied to a KEY() function.  It's also a little strange the way it
>>>> appears in the grouping of column specs, when it's actually defining a
>>>> grouping or relationship of them (maybe this is what you meant about
>>>> using TRANSPOSED WITH <options> to emphasize the non-standard).
>>>
>>> Fear not, I can set your mind at ease. :)
>>>
>>> Personally I think the syntax works reasonably well in its own right,
>>> but my main reason for the proposed syntax is that it is actually
>>> standard SQL for composite primary keys at least as far back as SQL
>>> 92, as a subcategory of table constraints.  The SQL standard is not
>>> freely linkable, but see
>>> http://www.postgresql.org/docs/9.1/static/sql-createtable.html for a
>>> real-world example.
>>
>> OK, I stand corrected (and my mind is at ease :) ).
>>
>>
>> --
>> Eric Evans
>> Acunu | http://www.acunu.com | @acunu

Re: CQL support for compound columns

Posted by Sylvain Lebresne <sy...@datastax.com>.
Ok, I think I'm warming up to what we're getting at. I would change
thesyntax of the VALUE() thing however. Instead of:CREATE TABLE
timeline (   userid int,   posted_at uuid,   body string,   PRIMARY
KEY(user_id, posted_at),   VALUE(body))I would prefer:CREATE COMPACT
TABLE timeline (   userid int,   posted_at uuid,   body string,
PRIMARY KEY(user_id, posted_at),)
The reasons being that it really influences the implementation layout
of theCF in C*. Namely, the non-compact CF defined by CREATE TABLE
timeline (   userid int,   posted_at uuid,   body string,   PRIMARY
KEY(user_id, posted_at),)would look in C* like:<userid> : {
<posted_at>:'body' -> <value>}while the COMPACT variant would
be:<userid> : {    <posted_at> -> <value>}which is using the fact that
there is only 1 field not part of the key to"optimize" the layout. And
I believe making the COMPACT part of the CREATEemphasis better that
it's a property of the definition itself (that cannot bechanged)
rather than of that specific 'body' field. It also make the rule
forCOMPACT table rather simple: "a compact table should have only one
field notpart of the primary key"; you don't have to deal with errors
like someonedefining two VALUE() for instance.

That being said, I'd like to try to resume where we're at (including
theCOMPACT change above) and add a few random ideas along the way.
Please correctme if I've got something wrong.
I think we have 4 different cases, 2 for 'standard' CF without
composites:- static CFs (the only case CQL handle really well today)-
dynamic CFs (wide rows, time series if you prefer)and 2 for CF with
composite column names:- 'dense' composite (typically time series but
where the key is naturally  multi-parts)- 'sparse' composite (aka
super columns)
Let me try to take an example for which, with how it would
translateinternally and example queries.

Standard "static" CF--------------------
"For each user, holds his infos"
CREATE TABLE users (    userid uuid PRIMARY KEY,    firstname text,
lastname text,    age int)
In C*:<userid> : {    'firstname' -> <value>    'lastname' -> <value>
  'age' -> <value>}
Query:SELECT firstname, lastname FROM users WHERE userid = '...';
Standard "dynamic" CF---------------------
"For each user, keep each url he clicked on with the date of last click"
CREATE COMPACT TABLE clicks (    userid uuid,    url text,
timestamp date    PRIMARY KEY (userid, url))
In C*:<userid> : {    <url> -> <timestamp>}
Query:SELECT url, timestamp FROM clicks WHERE userid = '...';SELECT
timestamp FROM clicks WHERE userid = '...' and url = 'http://...';
'dense' composite-----------------
"For each user, keep ip and port from where he connected with the date
of lastconnection"
CREATE COMPACT TABLE connections (    userid uuid,    ip binary,
port int,    timestamp date    PRIMARY KEY (userid, ip, port))
In C*:<userid> : {    <ip>:<port> -> <timestamp>}
Query:SELECT ip, port, timestamp FROM connections WHERE userid = '...';
'sparse' composite------------------
"User timeline"
CREATE TABLE timeline (    userid uuid,    posted_at date,    body
text,    posted_by text,    PRIMARY KEY (user_id, posted_at),);
In C*:<userid> : {    <posted_at>:'body' -> <value>
<posted_at>:'posted_by' -> <value>}
Query:SELECT body, posted_by FROM timeline WHERE userid = '...' and
posted_at = '2 janvier 2010'
Note: I think we really should also be able to do queries like:SELECT
posted_ad, body, posted_by FROM timeline WHERE userid = '...' and
posted_at > '2 janvier 2010'but that's more akin to the modification
of the syntax for slices.

Random other ideas------------------
1) We could allow something like:    CONSTRAINT key PRIMARY KEY
(userid, ip, port)  which would then allow to write    SELECT
timestamp FROM users WHERE key = ('...', 192.168.0.1, 80);  (I believe
this is the 'standard' notation to name a 'composite' key in SQL)
- Above we're ony handling the use of composites for column names, but
they  can be useful for value (and row keys) and it could be nice to
have an easy  notation for that (clearly a following ticket however).
What about:
CREATE COMPACT TABLE timeline (    userid_part1 text,    userid_part2
int,    posted_at date,    posted_by uuid,    body text    header text
   GROUP (userid_part1, userid_part2) AS userid,    PRIMARY KEY
(userid, posted_at, posted_by)    GROUP (header, body))
In C*:<userid_part1>:<userid_part2> : {    <posted_at>:<posted_by> ->
<header>:<body>}
Query:SELECT posted_at, posted_by, body, header FROM timeline WHERE
userid = ('john', 32)

--
Sylvain
On Mon, Jan 2, 2012 at 8:29 PM, Eric Evans <ee...@acunu.com> wrote:
> On Mon, Jan 2, 2012 at 12:55 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>> On Mon, Jan 2, 2012 at 10:53 AM, Eric Evans <ee...@acunu.com> wrote:
>>> In SQL, PRIMARY KEY is a modifier to a column spec, and here PRIMARY
>>> KEY(user_id, posted_at, posted_by) reads like a PRIMARY modifier
>>> applied to a KEY() function.  It's also a little strange the way it
>>> appears in the grouping of column specs, when it's actually defining a
>>> grouping or relationship of them (maybe this is what you meant about
>>> using TRANSPOSED WITH <options> to emphasize the non-standard).
>>
>> Fear not, I can set your mind at ease. :)
>>
>> Personally I think the syntax works reasonably well in its own right,
>> but my main reason for the proposed syntax is that it is actually
>> standard SQL for composite primary keys at least as far back as SQL
>> 92, as a subcategory of table constraints.  The SQL standard is not
>> freely linkable, but see
>> http://www.postgresql.org/docs/9.1/static/sql-createtable.html for a
>> real-world example.
>
> OK, I stand corrected (and my mind is at ease :) ).
>
>
> --
> Eric Evans
> Acunu | http://www.acunu.com | @acunu

Re: CQL support for compound columns

Posted by Eric Evans <ee...@acunu.com>.
On Mon, Jan 2, 2012 at 12:55 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> On Mon, Jan 2, 2012 at 10:53 AM, Eric Evans <ee...@acunu.com> wrote:
>> In SQL, PRIMARY KEY is a modifier to a column spec, and here PRIMARY
>> KEY(user_id, posted_at, posted_by) reads like a PRIMARY modifier
>> applied to a KEY() function.  It's also a little strange the way it
>> appears in the grouping of column specs, when it's actually defining a
>> grouping or relationship of them (maybe this is what you meant about
>> using TRANSPOSED WITH <options> to emphasize the non-standard).
>
> Fear not, I can set your mind at ease. :)
>
> Personally I think the syntax works reasonably well in its own right,
> but my main reason for the proposed syntax is that it is actually
> standard SQL for composite primary keys at least as far back as SQL
> 92, as a subcategory of table constraints.  The SQL standard is not
> freely linkable, but see
> http://www.postgresql.org/docs/9.1/static/sql-createtable.html for a
> real-world example.

OK, I stand corrected (and my mind is at ease :) ).


-- 
Eric Evans
Acunu | http://www.acunu.com | @acunu

Re: CQL support for compound columns

Posted by Jonathan Ellis <jb...@gmail.com>.
On Mon, Jan 2, 2012 at 10:53 AM, Eric Evans <ee...@acunu.com> wrote:
> In SQL, PRIMARY KEY is a modifier to a column spec, and here PRIMARY
> KEY(user_id, posted_at, posted_by) reads like a PRIMARY modifier
> applied to a KEY() function.  It's also a little strange the way it
> appears in the grouping of column specs, when it's actually defining a
> grouping or relationship of them (maybe this is what you meant about
> using TRANSPOSED WITH <options> to emphasize the non-standard).

Fear not, I can set your mind at ease. :)

Personally I think the syntax works reasonably well in its own right,
but my main reason for the proposed syntax is that it is actually
standard SQL for composite primary keys at least as far back as SQL
92, as a subcategory of table constraints.  The SQL standard is not
freely linkable, but see
http://www.postgresql.org/docs/9.1/static/sql-createtable.html for a
real-world example.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: CQL support for compound columns

Posted by Edward Capriolo <ed...@gmail.com>.
Maybe the ship on this has sailed, but I am a bit miffed  on "create
table". CQL is going out of its way to make things so easy for people. But
if someone does not understand the concept of a column family making it
easy for them to design something that is an anti-pattern is odd to me.

As an admin I have been called many times to troubleshoot database
performance issues databases. It sometimes boils down to a bad schema
choice. At later/production stages these become hard to dig out of. It
usually takes more hardware, converting GB or TB of data, application cut
overs.

I do not call "column families" "tables". If someone newer to cassandra did
I would correct them. Why not call Java references pointers? I hate being
ambiguous on key terminology.



On Mon, Jan 2, 2012 at 11:53 AM, Eric Evans <ee...@acunu.com> wrote:

> On Sat, Dec 31, 2011 at 1:12 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> > On Fri, Dec 30, 2011 at 12:30 PM, Eric Evans <ee...@acunu.com> wrote:
> >>> CREATE TABLE timeline (
> >>>    user_id int,
> >>>    posted_at uuid,
> >>>    body string,
> >>>    posted_by string,
> >>>    PRIMARY KEY(user_id, posted_at, posted_by),
> >>>    VALUE(body)
> >>> );
> >>
> >> I think the value declaration also helps in that it's one more thing
> >> that provides cues as to the data model it creates (more expressive).
> >> But this got me thinking, why not introduce something special for the
> >> composite name as well?  That way the PRIMARY KEY syntax (which comes
> >> preloaded with meaning and expectations) could be kept more SQLish,
> >> and the whole thing looks more like an extension to the language as
> >> opposed to a modification.
> >>
> >> Say:
> >>
> >> CREATE TABLE timeline (
> >>  user_id int PRIMARY KEY,
> >>  posted_at uuid,
> >>  body text,
> >>  posted_by text,
> >>  COMPOSITE_NAME(posted_at, posted_by),
> >>  COMPOSITE_VALUE(body)
> >> )
> >
> > I went back and forth on this mentally, but I come down as -0 on CN
> > instead of PK.  For two reasons:
> >
> > First, the composite PRIMARY KEY is a better description of what you
> > can actually do with the data.  In a relational model, a PK of user_id
> > means there is only one (user_id, posted_at, body, posted_by) row with
> > a given user_id.  Which is not the case here.  PK = (row key +
> > composite components) captures exactly what is "immutable and unique"
> > in a given object, so it's actually exactly what it's meant for and
> > not an abuse at all.  (It even fits nicely with the "queries involving
> > the PK are always indexed" assumption that isn't required by the SQL
> > standard but every other database does anyway because it makes the
> > most sense.)
>
> Yeah, you're right, PK is a better fit for this.
>
> Now that I'm forced to think about it a bit more, I think my un-SQL
> reaction is probably rooted more in the abuse of the PRIMARY KEY
> syntax, than the meaning it conveys.
>
> In SQL, PRIMARY KEY is a modifier to a column spec, and here PRIMARY
> KEY(user_id, posted_at, posted_by) reads like a PRIMARY modifier
> applied to a KEY() function.  It's also a little strange the way it
> appears in the grouping of column specs, when it's actually defining a
> grouping or relationship of them (maybe this is what you meant about
> using TRANSPOSED WITH <options> to emphasize the non-standard).
>
> I wonder if there isn't a way to keep the PRIMARY KEY connection while
> making it a little more SQL (and hence more intuitive).  Maybe
> something like:
>
>
> CREATE TABLE timeline (
>  (user_id int, posted_at uuid, posted_by) PRIMARY KEY,
>  body text
> )
>
>
> --
> Eric Evans
> Acunu | http://www.acunu.com | @acunu
>