You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by David Savage <da...@gmail.com> on 2014/03/13 12:12:29 UTC

CQL Select Map using an IN relationship

Hi there,

I'm experimenting using cassandra and have run across an error message
which I need a little more information on.

The use case I'm experimenting with is a series of document updates
(documents being an arbitrary map of key value pairs), I would like to find
the latest document updates after a specified time period. I don't want to
store many copies of the documents (one per update) as the updates are
often only to single keys in the map so that would involve a lot of
duplicated data.

The solution I've found that seems to fit best in terms of performance is
to have two tables.

One that has an event log of timeuuid -> docid and a second that stores the
documents themselves stored by docid -> map<string, string>. I then run two
queries, one to select ids that have changed after a certain time:

SELECT id FROM eventlog WHERE timestamp>=minTimeuuid($minimumTime)

and then a second to select the actual documents themselves

SELECT id, data FROM documents WHERE id IN (0, 1, 2, 3, 4, 5, 6, 7...)

However this then explodes on query with the error message:

"Cannot restrict PRIMARY KEY part id by IN relation as a collection is
selected by the query"

Detective work lead me to these lines in
org.apache.cassandra.cql3.statementsSelectStatement:

                    // We only support IN for the last name and for compact
storage so far
                    // TODO: #3885 allows us to extend to non compact as
well, but that remains to be done
                    if (i != stmt.columnRestrictions.length - 1)
                        throw new
InvalidRequestException(String.format("PRIMARY KEY part %s cannot be
restricted by IN relation", cname));
                    else if (stmt.selectACollection())
                        throw new
InvalidRequestException(String.format("Cannot restrict PRIMARY KEY part %s
by IN relation as a collection is selected by the query", cname));

It seems like #3885 will allow support for the first IF block above, but I
don't think it will allow the second, am I correct?

Any pointers on how I can work around this would be greatly appreciated.

Kind regards,

Dave

Re: CQL Select Map using an IN relationship

Posted by Peter Lin <wo...@gmail.com>.
probably a good idea to open a jira ticket to explain this better in the
docs. the downside of moving so fast is the docs often fall behind and
users have to dig around to figure things out. not everyone wants to read
the CQL3 antlr grammar to figure things out.


On Thu, Mar 13, 2014 at 11:27 AM, David Savage <da...@gmail.com>wrote:

> Great that works, thx! I probably would have never found that...
>
> It now makes me wonder in general when to use PRIMARY KEY (key1, key2) or
> PRIMARY KEY ((key1, key2)), any examples would be welcome if you have the
> time.
>
> Kind regards,
>
> Dave
>
>
> On Thu, Mar 13, 2014 at 2:56 PM, Laing, Michael <michael.laing@nytimes.com
> > wrote:
>
>> Create your table like this and it will work:
>>
>> CREATE TABLE test.documents (group text,id bigint,data
>> map<text,text>,PRIMARY KEY ((group, id)));
>>
>> The extra parens catenate 'group' and 'id' into the partition key - IN
>> will work on the last component of a partition key.
>>
>> ml
>>
>>
>> On Thu, Mar 13, 2014 at 10:40 AM, David Savage <da...@gmail.com>wrote:
>>
>>> Nope, upgraded to 2.0.5 and still get the same problem, I actually
>>> simplified the problem a little in my first post, there's a composite
>>> primary key involved as I need to partition ids into groups
>>>
>>> So the full CQL statements are:
>>>
>>> CREATE KEYSPACE test WITH replication = {'class':'SimpleStrategy',
>>> 'replication_factor':3};
>>>
>>>
>>> CREATE TABLE test.documents (group text,id bigint,data
>>> map<text,text>,PRIMARY KEY (group, id));
>>>
>>>
>>> INSERT INTO test.documents(id,group,data) VALUES
>>> (0,'test',{'count':'0'});
>>>
>>> INSERT INTO test.documents(id,group,data) VALUES
>>> (1,'test',{'count':'1'});
>>>
>>> INSERT INTO test.documents(id,group,data) VALUES
>>> (2,'test',{'count':'2'});
>>>
>>>
>>> SELECT id,data FROM test.documents WHERE group='test' AND id IN (0,1,2);
>>>
>>>
>>> Thanks for your help.
>>>
>>>
>>> Kind regards,
>>>
>>>
>>> /Dave
>>>
>>>
>>> On Thu, Mar 13, 2014 at 2:00 PM, David Savage <da...@gmail.com>wrote:
>>>
>>>> Hmmm that maybe the problem, I'm currently testing with 2.0.2 which got
>>>> dragged in by the cassandra unit library I'm using for testing [1] I will
>>>> try to fix my build dependencies and retry, thx.
>>>>
>>>> /Dave
>>>>
>>>> [1] https://github.com/jsevellec/cassandra-unit
>>>>
>>>>
>>>> On Thu, Mar 13, 2014 at 1:56 PM, Laing, Michael <
>>>> michael.laing@nytimes.com> wrote:
>>>>
>>>>> I have no problem doing this w 2.0.5 - what version of C* are you
>>>>> using? Or maybe I don't understand your data model... attach 'creates' if
>>>>> you don't mind.
>>>>>
>>>>> ml
>>>>>
>>>>>
>>>>> On Thu, Mar 13, 2014 at 9:24 AM, David Savage <da...@gmail.com>wrote:
>>>>>
>>>>>> Hi Peter,
>>>>>>
>>>>>> Thanks for the help, unfortunately I'm not sure that's the problem,
>>>>>> the id is the primary key on the documents table and the timestamp
>>>>>> is the primary key on the eventlog table
>>>>>>
>>>>>> Kind regards,
>>>>>>
>>>>>>
>>>>>> Dave
>>>>>>
>>>>>> On Thursday, 13 March 2014, Peter Lin <wo...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> it's not clear to me if your "id" column is the KEY or just a
>>>>>>> regular column with secondary index.
>>>>>>>
>>>>>>> queries that have IN on non primary key columns isn't supported yet.
>>>>>>> not sure if that answers your question.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Mar 13, 2014 at 7:12 AM, David Savage <
>>>>>>> davemssavage@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> I'm experimenting using cassandra and have run across an error
>>>>>>>> message which I need a little more information on.
>>>>>>>>
>>>>>>>> The use case I'm experimenting with is a series of document updates
>>>>>>>> (documents being an arbitrary map of key value pairs), I would like to find
>>>>>>>> the latest document updates after a specified time period. I don't want to
>>>>>>>> store many copies of the documents (one per update) as the updates are
>>>>>>>> often only to single keys in the map so that would involve a lot of
>>>>>>>> duplicated data.
>>>>>>>>
>>>>>>>> The solution I've found that seems to fit best in terms of
>>>>>>>> performance is to have two tables.
>>>>>>>>
>>>>>>>> One that has an event log of timeuuid -> docid and a second that
>>>>>>>> stores the documents themselves stored by docid -> map<string, string>. I
>>>>>>>> then run two queries, one to select ids that have changed after a certain
>>>>>>>> time:
>>>>>>>>
>>>>>>>> SELECT id FROM eventlog WHERE timestamp>=minTimeuuid($minimumTime)
>>>>>>>>
>>>>>>>> and then a second to select the actual documents themselves
>>>>>>>>
>>>>>>>> SELECT id, data FROM documents WHERE id IN (0, 1, 2, 3, 4, 5, 6, 7...)
>>>>>>>>
>>>>>>>> However this then explodes on query with the error message:
>>>>>>>>
>>>>>>>> "Cannot restrict PRIMARY KEY part id by IN relation as a collection
>>>>>>>> is selected by the query"
>>>>>>>>
>>>>>>>> Detective work lead me to these lines in
>>>>>>>> org.apache.cassandra.cql3.statementsSelectStatement:
>>>>>>>>
>>>>>>>>                     // We only support IN for the last name and for
>>>>>>>> compact storage so far
>>>>>>>>                     // TODO: #3885 allows us to extend to non
>>>>>>>> compact as well, but that remains to be done
>>>>>>>>                     if (i != stmt.columnRestrictions.length - 1)
>>>>>>>>                         throw new
>>>>>>>> InvalidRequestException(String.format("PRIMARY KEY part %s cannot be
>>>>>>>> restricted by IN relation", cname));
>>>>>>>>                     else if (stmt.selectACollection())
>>>>>>>>                         throw new
>>>>>>>> InvalidRequestException(String.format("Cannot restrict PRIMARY KEY part %s
>>>>>>>> by IN relation as a collection is selected by the query", cname));
>>>>>>>>
>>>>>>>> It seems like #3885 will allow support for the first IF block
>>>>>>>> above, but I don't think it will allow the second, am I correct?
>>>>>>>>
>>>>>>>> Any pointers on how I can work around this would be greatly
>>>>>>>> appreciated.
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>>
>>>>>>>> Dave
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: CQL Select Map using an IN relationship

Posted by "Laing, Michael" <mi...@nytimes.com>.
I have found that range_key communicates better what you can actually do
with them, whereas clustering is more passive.

ml


On Thu, Mar 13, 2014 at 2:08 PM, Jack Krupansky <ja...@basetechnology.com>wrote:

>   “range key” is formally known as “clustering column”. One or more
> clustering columns can be specified to identify individual rows in a
> partition. Without clustering columns, one partition is one row. So, it’s a
> matter of whether you want your rows to be in the same partition or
> distributed.
>
> -- Jack Krupansky
>
>  *From:* Laing, Michael <mi...@nytimes.com>
> *Sent:* Thursday, March 13, 2014 1:39 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: CQL Select Map using an IN relationship
>
>  Think of them as:
>
>
> PRIMARY KEY (partition_key[, range_key])
>
>
> where the partition_key can be compounded as:
>
>
> (partition_key0 [, partition_key1, ...])
>
>
> and the optional range_key can be compounded as:
>
>
> range_key0 [, range_key1 ...]
>
>
> If you do this: PRIMARY KEY (key1, key2) - then key1 is the partition_key
> and key2 is the range_key and queries will work that hash to key1 (the
> partition) using = or IN and specify a range on key2.
>
> But if you do this: PRIMARY key ((key1, key2)) then (key1, key2) is the
> compound partition key - there is no range key - and you can specify = on
> key1 and = or IN on key2 (but not a range).
>
> Anyway that's what I remember! Hope it helps.
>
> ml
>
>
> On Thu, Mar 13, 2014 at 11:27 AM, David Savage <da...@gmail.com>wrote:
>
>> Great that works, thx! I probably would have never found that...
>>
>> It now makes me wonder in general when to use PRIMARY KEY (key1, key2) or
>> PRIMARY KEY ((key1, key2)), any examples would be welcome if you have the
>> time.
>>
>> Kind regards,
>>
>> Dave
>>
>>
>> On Thu, Mar 13, 2014 at 2:56 PM, Laing, Michael <
>> michael.laing@nytimes.com> wrote:
>>
>>> Create your table like this and it will work:
>>>
>>> CREATE TABLE test.documents (group text,id bigint,data
>>> map<text,text>,PRIMARY KEY ((group, id)));
>>>
>>> The extra parens catenate 'group' and 'id' into the partition key - IN
>>> will work on the last component of a partition key.
>>>
>>> ml
>>>
>>>
>>> On Thu, Mar 13, 2014 at 10:40 AM, David Savage <da...@gmail.com>wrote:
>>>
>>>> Nope, upgraded to 2.0.5 and still get the same problem, I actually
>>>> simplified the problem a little in my first post, there's a composite
>>>> primary key involved as I need to partition ids into groups
>>>>
>>>> So the full CQL statements are:
>>>>
>>>>
>>>> CREATE KEYSPACE test WITH replication = {'class':'SimpleStrategy',
>>>> 'replication_factor':3};
>>>>
>>>>
>>>>
>>>> CREATE TABLE test.documents (group text,id bigint,data
>>>> map<text,text>,PRIMARY KEY (group, id));
>>>>
>>>>
>>>>
>>>> INSERT INTO test.documents(id,group,data) VALUES
>>>> (0,'test',{'count':'0'});
>>>>
>>>> INSERT INTO test.documents(id,group,data) VALUES
>>>> (1,'test',{'count':'1'});
>>>>
>>>> INSERT INTO test.documents(id,group,data) VALUES
>>>> (2,'test',{'count':'2'});
>>>>
>>>>
>>>>
>>>> SELECT id,data FROM test.documents WHERE group='test' AND id IN (0,1,2);
>>>>
>>>>
>>>>
>>>> Thanks for your help.
>>>>
>>>>
>>>>
>>>> Kind regards,
>>>>
>>>>
>>>>
>>>> /Dave
>>>>
>>>>
>>>> On Thu, Mar 13, 2014 at 2:00 PM, David Savage <da...@gmail.com>wrote:
>>>>
>>>>>  Hmmm that maybe the problem, I'm currently testing with 2.0.2 which
>>>>> got dragged in by the cassandra unit library I'm using for testing [1] I
>>>>> will try to fix my build dependencies and retry, thx.
>>>>>
>>>>> /Dave
>>>>>
>>>>> [1] https://github.com/jsevellec/cassandra-unit
>>>>>
>>>>>
>>>>> On Thu, Mar 13, 2014 at 1:56 PM, Laing, Michael <
>>>>> michael.laing@nytimes.com> wrote:
>>>>>
>>>>>> I have no problem doing this w 2.0.5 - what version of C* are you
>>>>>> using? Or maybe I don't understand your data model... attach 'creates' if
>>>>>> you don't mind.
>>>>>>
>>>>>> ml
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 13, 2014 at 9:24 AM, David Savage <davemssavage@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> Thanks for the help, unfortunately I'm not sure that's the problem,
>>>>>>> the id is the primary key on the documents table and the timestamp
>>>>>>> is the primary key on the eventlog table
>>>>>>>
>>>>>>>
>>>>>>> Kind regards,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Dave
>>>>>>>
>>>>>>> On Thursday, 13 March 2014, Peter Lin <wo...@gmail.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> it's not clear to me if your "id" column is the KEY or just a
>>>>>>>> regular column with secondary index.
>>>>>>>>
>>>>>>>> queries that have IN on non primary key columns isn't supported
>>>>>>>> yet. not sure if that answers your question.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Mar 13, 2014 at 7:12 AM, David Savage <
>>>>>>>> davemssavage@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi there,
>>>>>>>>>
>>>>>>>>> I'm experimenting using cassandra and have run across an error
>>>>>>>>> message which I need a little more information on.
>>>>>>>>>
>>>>>>>>> The use case I'm experimenting with is a series of document
>>>>>>>>> updates (documents being an arbitrary map of key value pairs), I would like
>>>>>>>>> to find the latest document updates after a specified time period. I don't
>>>>>>>>> want to store many copies of the documents (one per update) as the updates
>>>>>>>>> are often only to single keys in the map so that would involve a lot of
>>>>>>>>> duplicated data.
>>>>>>>>>
>>>>>>>>> The solution I've found that seems to fit best in terms of
>>>>>>>>> performance is to have two tables.
>>>>>>>>>
>>>>>>>>> One that has an event log of timeuuid -> docid and a second that
>>>>>>>>> stores the documents themselves stored by docid -> map<string, string>. I
>>>>>>>>> then run two queries, one to select ids that have changed after a certain
>>>>>>>>> time:
>>>>>>>>>
>>>>>>>>> SELECT id FROM eventlog WHERE timestamp>=minTimeuuid($minimumTime)
>>>>>>>>>
>>>>>>>>> and then a second to select the actual documents themselves
>>>>>>>>>
>>>>>>>>> SELECT id, data FROM documents WHERE id IN (0, 1, 2, 3, 4, 5, 6,
>>>>>>>>> 7…)
>>>>>>>>>
>>>>>>>>> However this then explodes on query with the error message:
>>>>>>>>>
>>>>>>>>> "Cannot restrict PRIMARY KEY part id by IN relation as a
>>>>>>>>> collection is selected by the query"
>>>>>>>>>
>>>>>>>>> Detective work lead me to these lines in
>>>>>>>>> org.apache.cassandra.cql3.statementsSelectStatement:
>>>>>>>>>
>>>>>>>>>                      // We only support IN for the last name and
>>>>>>>>> for compact storage so far
>>>>>>>>>                     // TODO: #3885 allows us to extend to non
>>>>>>>>> compact as well, but that remains to be done
>>>>>>>>>                     if (i != stmt.columnRestrictions.length - 1)
>>>>>>>>>                         throw new
>>>>>>>>> InvalidRequestException(String.format("PRIMARY KEY part %s cannot be
>>>>>>>>> restricted by IN relation", cname));
>>>>>>>>>                     else if (stmt.selectACollection())
>>>>>>>>>                         throw new
>>>>>>>>> InvalidRequestException(String.format("Cannot restrict PRIMARY KEY part %s
>>>>>>>>> by IN relation as a collection is selected by the query", cname));
>>>>>>>>>
>>>>>>>>> It seems like #3885 will allow support for the first IF block
>>>>>>>>> above, but I don't think it will allow the second, am I correct?
>>>>>>>>>
>>>>>>>>> Any pointers on how I can work around this would be greatly
>>>>>>>>> appreciated.
>>>>>>>>>
>>>>>>>>> Kind regards,
>>>>>>>>>
>>>>>>>>> Dave
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

Re: CQL Select Map using an IN relationship

Posted by David Savage <da...@gmail.com>.
Hi Michael,

Thanks for the feedback, all makes sense.

If anyone wants me to raise a jira ticket for docs on (key1, key2) vs
((key1,key2)) and their implications, or fixing that if block in
SelectStatement, let me know - though for the if block possibly best if
that jira is raised by a C* expert so it uses the right terminology to
describe the problem ;)

Kind regards,

Dave


On Fri, Mar 14, 2014 at 12:49 AM, Laing, Michael
<mi...@nytimes.com>wrote:

> These are my personal opinions, reflecting both my long experience w
> database systems, and my newness to Cassandra...
>
> [tl;dr]
>
> The Cassandra contributors, having made its history, tend to describe it
> in terms of implementation rather than action. And its implementation has a
> history, all relatively recent, that many know, but which to newcomers like
> me is obscure and, frankly, not particularly relevant.
>
> Note: we are all trying to understand Crimea now, and to really
> understand, you have to ingest several hundred years of history. Luckily,
> Cassandra has not been around quite so long!
>
> But Cassandra's history creeps into the nomenclature of CQL3. So what
> might logically be called a 'hash key' is called a 'partition key', what is
> called a 'clustering key' might be better termed a 'range key' IMHO.
>
> The 'official' terms in the nomenclature are important to know, they are
> just not descriptive of the actions one takes as a user of them. However,
> they have meaning to those who have 'lived' the history of Cassandra, and
> form an important bridge to the past.
>
> As a new user I found them non-intuitive. Amazon has done a much better
> job with DynamoDB - muddled, however, by bad syntax choices.
>
> But you adjust and mentally map... I am still bumfuzzled when people talk
> of slices and other C* cruft but just let it slide by like lectures from my
> mother. That and thrift can just fade into history with gopher and lynx as
> far as I am concerned - CQL3 is where it's at.
>
> But another thing to remember is that performance is king - and to get
> performance you fly 'close to the metal': Cassandra does that and you
> should know the code paths, the physical structures, and the
> characteristics of your 'metal' to understand how to build high-performing
> apps.
>
> ***
>
> The answer to both asterisks is Yes. You should use the term 'clustering
> column' because that is what is in the docs - but you should think 'range
> key' for how you use it. Similarly 'partition key' : 'hash key'.
>
> Good luck,
>
> ml
>

Re: CQL Select Map using an IN relationship

Posted by "Laing, Michael" <mi...@nytimes.com>.
These are my personal opinions, reflecting both my long experience w
database systems, and my newness to Cassandra...

[tl;dr]

The Cassandra contributors, having made its history, tend to describe it in
terms of implementation rather than action. And its implementation has a
history, all relatively recent, that many know, but which to newcomers like
me is obscure and, frankly, not particularly relevant.

Note: we are all trying to understand Crimea now, and to really understand,
you have to ingest several hundred years of history. Luckily, Cassandra has
not been around quite so long!

But Cassandra's history creeps into the nomenclature of CQL3. So what might
logically be called a 'hash key' is called a 'partition key', what is
called a 'clustering key' might be better termed a 'range key' IMHO.

The 'official' terms in the nomenclature are important to know, they are
just not descriptive of the actions one takes as a user of them. However,
they have meaning to those who have 'lived' the history of Cassandra, and
form an important bridge to the past.

As a new user I found them non-intuitive. Amazon has done a much better job
with DynamoDB - muddled, however, by bad syntax choices.

But you adjust and mentally map... I am still bumfuzzled when people talk
of slices and other C* cruft but just let it slide by like lectures from my
mother. That and thrift can just fade into history with gopher and lynx as
far as I am concerned - CQL3 is where it's at.

But another thing to remember is that performance is king - and to get
performance you fly 'close to the metal': Cassandra does that and you
should know the code paths, the physical structures, and the
characteristics of your 'metal' to understand how to build high-performing
apps.

***

The answer to both asterisks is Yes. You should use the term 'clustering
column' because that is what is in the docs - but you should think 'range
key' for how you use it. Similarly 'partition key' : 'hash key'.

Good luck,

ml

Re: CQL Select Map using an IN relationship

Posted by David Savage <da...@gmail.com>.
Thanks for the explanations.

To confirm I understand, Michaels explanation seems to say that that :

* the partition key supports =/IN but not >,>=,<,<=
* the range key (or clustering column) supports =,>,>=,<,<= but not IN. Is
that correct?

Jacks explanation seems to say that by grouping the two columns in the
primary key ((key1, key2)) this will prevent data from being partitioned
across nodes in the cluster, is that correct?

Also in another response thread Sylvian seemed to hint that it's historical
that IN is not supported on the range key / clustering column [1]. If I've
understood that correctly I'm happy to raise a jira ticket to track this so
it can be fixed.

Thanks for your help.

Kind regards,

Dave

[1] Please let me know if I should pick one of these terms for clarity...


On Thu, Mar 13, 2014 at 6:08 PM, Jack Krupansky <ja...@basetechnology.com>wrote:

>   "range key" is formally known as "clustering column". One or more
> clustering columns can be specified to identify individual rows in a
> partition. Without clustering columns, one partition is one row. So, it's a
> matter of whether you want your rows to be in the same partition or
> distributed.
>
> -- Jack Krupansky
>
>  *From:* Laing, Michael <mi...@nytimes.com>
> *Sent:* Thursday, March 13, 2014 1:39 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: CQL Select Map using an IN relationship
>
>  Think of them as:
>
>
> PRIMARY KEY (partition_key[, range_key])
>
>
> where the partition_key can be compounded as:
>
>
> (partition_key0 [, partition_key1, ...])
>
>
> and the optional range_key can be compounded as:
>
>
> range_key0 [, range_key1 ...]
>
>
> If you do this: PRIMARY KEY (key1, key2) - then key1 is the partition_key
> and key2 is the range_key and queries will work that hash to key1 (the
> partition) using = or IN and specify a range on key2.
>
> But if you do this: PRIMARY key ((key1, key2)) then (key1, key2) is the
> compound partition key - there is no range key - and you can specify = on
> key1 and = or IN on key2 (but not a range).
>
> Anyway that's what I remember! Hope it helps.
>
> ml
>
>
> On Thu, Mar 13, 2014 at 11:27 AM, David Savage <da...@gmail.com>wrote:
>
>> Great that works, thx! I probably would have never found that...
>>
>> It now makes me wonder in general when to use PRIMARY KEY (key1, key2) or
>> PRIMARY KEY ((key1, key2)), any examples would be welcome if you have the
>> time.
>>
>> Kind regards,
>>
>> Dave
>>
>>
>> On Thu, Mar 13, 2014 at 2:56 PM, Laing, Michael <
>> michael.laing@nytimes.com> wrote:
>>
>>> Create your table like this and it will work:
>>>
>>> CREATE TABLE test.documents (group text,id bigint,data
>>> map<text,text>,PRIMARY KEY ((group, id)));
>>>
>>> The extra parens catenate 'group' and 'id' into the partition key - IN
>>> will work on the last component of a partition key.
>>>
>>> ml
>>>
>>>
>>> On Thu, Mar 13, 2014 at 10:40 AM, David Savage <da...@gmail.com>wrote:
>>>
>>>> Nope, upgraded to 2.0.5 and still get the same problem, I actually
>>>> simplified the problem a little in my first post, there's a composite
>>>> primary key involved as I need to partition ids into groups
>>>>
>>>> So the full CQL statements are:
>>>>
>>>>
>>>> CREATE KEYSPACE test WITH replication = {'class':'SimpleStrategy',
>>>> 'replication_factor':3};
>>>>
>>>>
>>>>
>>>> CREATE TABLE test.documents (group text,id bigint,data
>>>> map<text,text>,PRIMARY KEY (group, id));
>>>>
>>>>
>>>>
>>>> INSERT INTO test.documents(id,group,data) VALUES
>>>> (0,'test',{'count':'0'});
>>>>
>>>> INSERT INTO test.documents(id,group,data) VALUES
>>>> (1,'test',{'count':'1'});
>>>>
>>>> INSERT INTO test.documents(id,group,data) VALUES
>>>> (2,'test',{'count':'2'});
>>>>
>>>>
>>>>
>>>> SELECT id,data FROM test.documents WHERE group='test' AND id IN (0,1,2);
>>>>
>>>>
>>>>
>>>> Thanks for your help.
>>>>
>>>>
>>>>
>>>> Kind regards,
>>>>
>>>>
>>>>
>>>> /Dave
>>>>
>>>>
>>>> On Thu, Mar 13, 2014 at 2:00 PM, David Savage <da...@gmail.com>wrote:
>>>>
>>>>>  Hmmm that maybe the problem, I'm currently testing with 2.0.2 which
>>>>> got dragged in by the cassandra unit library I'm using for testing [1] I
>>>>> will try to fix my build dependencies and retry, thx.
>>>>>
>>>>> /Dave
>>>>>
>>>>> [1] https://github.com/jsevellec/cassandra-unit
>>>>>
>>>>>
>>>>> On Thu, Mar 13, 2014 at 1:56 PM, Laing, Michael <
>>>>> michael.laing@nytimes.com> wrote:
>>>>>
>>>>>> I have no problem doing this w 2.0.5 - what version of C* are you
>>>>>> using? Or maybe I don't understand your data model... attach 'creates' if
>>>>>> you don't mind.
>>>>>>
>>>>>> ml
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 13, 2014 at 9:24 AM, David Savage <davemssavage@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> Thanks for the help, unfortunately I'm not sure that's the problem,
>>>>>>> the id is the primary key on the documents table and the timestamp
>>>>>>> is the primary key on the eventlog table
>>>>>>>
>>>>>>>
>>>>>>> Kind regards,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Dave
>>>>>>>
>>>>>>> On Thursday, 13 March 2014, Peter Lin <wo...@gmail.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> it's not clear to me if your "id" column is the KEY or just a
>>>>>>>> regular column with secondary index.
>>>>>>>>
>>>>>>>> queries that have IN on non primary key columns isn't supported
>>>>>>>> yet. not sure if that answers your question.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Mar 13, 2014 at 7:12 AM, David Savage <
>>>>>>>> davemssavage@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi there,
>>>>>>>>>
>>>>>>>>> I'm experimenting using cassandra and have run across an error
>>>>>>>>> message which I need a little more information on.
>>>>>>>>>
>>>>>>>>> The use case I'm experimenting with is a series of document
>>>>>>>>> updates (documents being an arbitrary map of key value pairs), I would like
>>>>>>>>> to find the latest document updates after a specified time period. I don't
>>>>>>>>> want to store many copies of the documents (one per update) as the updates
>>>>>>>>> are often only to single keys in the map so that would involve a lot of
>>>>>>>>> duplicated data.
>>>>>>>>>
>>>>>>>>> The solution I've found that seems to fit best in terms of
>>>>>>>>> performance is to have two tables.
>>>>>>>>>
>>>>>>>>> One that has an event log of timeuuid -> docid and a second that
>>>>>>>>> stores the documents themselves stored by docid -> map<string, string>. I
>>>>>>>>> then run two queries, one to select ids that have changed after a certain
>>>>>>>>> time:
>>>>>>>>>
>>>>>>>>> SELECT id FROM eventlog WHERE timestamp>=minTimeuuid($minimumTime)
>>>>>>>>>
>>>>>>>>> and then a second to select the actual documents themselves
>>>>>>>>>
>>>>>>>>> SELECT id, data FROM documents WHERE id IN (0, 1, 2, 3, 4, 5, 6,
>>>>>>>>> 7...)
>>>>>>>>>
>>>>>>>>> However this then explodes on query with the error message:
>>>>>>>>>
>>>>>>>>> "Cannot restrict PRIMARY KEY part id by IN relation as a
>>>>>>>>> collection is selected by the query"
>>>>>>>>>
>>>>>>>>> Detective work lead me to these lines in
>>>>>>>>> org.apache.cassandra.cql3.statementsSelectStatement:
>>>>>>>>>
>>>>>>>>>                      // We only support IN for the last name and
>>>>>>>>> for compact storage so far
>>>>>>>>>                     // TODO: #3885 allows us to extend to non
>>>>>>>>> compact as well, but that remains to be done
>>>>>>>>>                     if (i != stmt.columnRestrictions.length - 1)
>>>>>>>>>                         throw new
>>>>>>>>> InvalidRequestException(String.format("PRIMARY KEY part %s cannot be
>>>>>>>>> restricted by IN relation", cname));
>>>>>>>>>                     else if (stmt.selectACollection())
>>>>>>>>>                         throw new
>>>>>>>>> InvalidRequestException(String.format("Cannot restrict PRIMARY KEY part %s
>>>>>>>>> by IN relation as a collection is selected by the query", cname));
>>>>>>>>>
>>>>>>>>> It seems like #3885 will allow support for the first IF block
>>>>>>>>> above, but I don't think it will allow the second, am I correct?
>>>>>>>>>
>>>>>>>>> Any pointers on how I can work around this would be greatly
>>>>>>>>> appreciated.
>>>>>>>>>
>>>>>>>>> Kind regards,
>>>>>>>>>
>>>>>>>>> Dave
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

Re: CQL Select Map using an IN relationship

Posted by Jack Krupansky <ja...@basetechnology.com>.
“range key” is formally known as “clustering column”. One or more clustering columns can be specified to identify individual rows in a partition. Without clustering columns, one partition is one row. So, it’s a matter of whether you want your rows to be in the same partition or distributed.

-- Jack Krupansky

From: Laing, Michael 
Sent: Thursday, March 13, 2014 1:39 PM
To: user@cassandra.apache.org 
Subject: Re: CQL Select Map using an IN relationship

Think of them as: 

  PRIMARY KEY (partition_key[, range_key])

where the partition_key can be compounded as:


  (partition_key0 [, partition_key1, ...])

and the optional range_key can be compounded as: 

  range_key0 [, range_key1 ...]

If you do this: PRIMARY KEY (key1, key2) - then key1 is the partition_key and key2 is the range_key and queries will work that hash to key1 (the partition) using = or IN and specify a range on key2.

But if you do this: PRIMARY key ((key1, key2)) then (key1, key2) is the compound partition key - there is no range key - and you can specify = on key1 and = or IN on key2 (but not a range).

Anyway that's what I remember! Hope it helps.

ml



On Thu, Mar 13, 2014 at 11:27 AM, David Savage <da...@gmail.com> wrote:

  Great that works, thx! I probably would have never found that... 

  It now makes me wonder in general when to use PRIMARY KEY (key1, key2) or PRIMARY KEY ((key1, key2)), any examples would be welcome if you have the time.

  Kind regards,

  Dave


  On Thu, Mar 13, 2014 at 2:56 PM, Laing, Michael <mi...@nytimes.com> wrote:

    Create your table like this and it will work: 

    CREATE TABLE test.documents (group text,id bigint,data map<text,text>,PRIMARY KEY ((group, id)));



    The extra parens catenate 'group' and 'id' into the partition key - IN will work on the last component of a partition key.


    ml



    On Thu, Mar 13, 2014 at 10:40 AM, David Savage <da...@gmail.com> wrote:

      Nope, upgraded to 2.0.5 and still get the same problem, I actually simplified the problem a little in my first post, there's a composite primary key involved as I need to partition ids into groups 

      So the full CQL statements are:

      CREATE KEYSPACE test WITH replication = {'class':'SimpleStrategy', 'replication_factor':3};



      CREATE TABLE test.documents (group text,id bigint,data map<text,text>,PRIMARY KEY (group, id));



      INSERT INTO test.documents(id,group,data) VALUES (0,'test',{'count':'0'});

      INSERT INTO test.documents(id,group,data) VALUES (1,'test',{'count':'1'});

      INSERT INTO test.documents(id,group,data) VALUES (2,'test',{'count':'2'});



      SELECT id,data FROM test.documents WHERE group='test' AND id IN (0,1,2);



      Thanks for your help.



      Kind regards,



      /Dave




      On Thu, Mar 13, 2014 at 2:00 PM, David Savage <da...@gmail.com> wrote:

        Hmmm that maybe the problem, I'm currently testing with 2.0.2 which got dragged in by the cassandra unit library I'm using for testing [1] I will try to fix my build dependencies and retry, thx.

        /Dave


        [1] https://github.com/jsevellec/cassandra-unit



        On Thu, Mar 13, 2014 at 1:56 PM, Laing, Michael <mi...@nytimes.com> wrote:

          I have no problem doing this w 2.0.5 - what version of C* are you using? Or maybe I don't understand your data model... attach 'creates' if you don't mind. 

          ml



          On Thu, Mar 13, 2014 at 9:24 AM, David Savage <da...@gmail.com> wrote:

            Hi Peter, 

            Thanks for the help, unfortunately I'm not sure that's the problem, the id is the primary key on the documents table and the timestamp is the primary key on the eventlog table

            Kind regards,



            Dave


            On Thursday, 13 March 2014, Peter Lin <wo...@gmail.com> wrote:


              it's not clear to me if your "id" column is the KEY or just a regular column with secondary index.


              queries that have IN on non primary key columns isn't supported yet. not sure if that answers your question.




              On Thu, Mar 13, 2014 at 7:12 AM, David Savage <da...@gmail.com> wrote:

                Hi there, 

                I'm experimenting using cassandra and have run across an error message which I need a little more information on.

                The use case I'm experimenting with is a series of document updates (documents being an arbitrary map of key value pairs), I would like to find the latest document updates after a specified time period. I don't want to store many copies of the documents (one per update) as the updates are often only to single keys in the map so that would involve a lot of duplicated data.

                The solution I've found that seems to fit best in terms of performance is to have two tables.

                One that has an event log of timeuuid -> docid and a second that stores the documents themselves stored by docid -> map<string, string>. I then run two queries, one to select ids that have changed after a certain time:

                SELECT id FROM eventlog WHERE timestamp>=minTimeuuid($minimumTime)

                and then a second to select the actual documents themselves

                SELECT id, data FROM documents WHERE id IN (0, 1, 2, 3, 4, 5, 6, 7…)


                However this then explodes on query with the error message:

                "Cannot restrict PRIMARY KEY part id by IN relation as a collection is selected by the query"

                Detective work lead me to these lines in org.apache.cassandra.cql3.statementsSelectStatement:

                                    // We only support IN for the last name and for compact storage so far
                                    // TODO: #3885 allows us to extend to non compact as well, but that remains to be done
                                    if (i != stmt.columnRestrictions.length - 1)
                                        throw new InvalidRequestException(String.format("PRIMARY KEY part %s cannot be restricted by IN relation", cname));
                                    else if (stmt.selectACollection())
                                        throw new InvalidRequestException(String.format("Cannot restrict PRIMARY KEY part %s by IN relation as a collection is selected by the query", cname));

                It seems like #3885 will allow support for the first IF block above, but I don't think it will allow the second, am I correct? 

                Any pointers on how I can work around this would be greatly appreciated.

                Kind regards,

                Dave







Re: CQL Select Map using an IN relationship

Posted by "Laing, Michael" <mi...@nytimes.com>.
Think of them as:

PRIMARY KEY (partition_key[, range_key])

where the partition_key can be compounded as:

(partition_key0 [, partition_key1, ...])

and the optional range_key can be compounded as:

range_key0 [, range_key1 ...]

If you do this: PRIMARY KEY (key1, key2) - then key1 is the partition_key
and key2 is the range_key and queries will work that hash to key1 (the
partition) using = or IN and specify a range on key2.

But if you do this: PRIMARY key ((key1, key2)) then (key1, key2) is the
compound partition key - there is no range key - and you can specify = on
key1 and = or IN on key2 (but not a range).

Anyway that's what I remember! Hope it helps.

ml


On Thu, Mar 13, 2014 at 11:27 AM, David Savage <da...@gmail.com>wrote:

> Great that works, thx! I probably would have never found that...
>
> It now makes me wonder in general when to use PRIMARY KEY (key1, key2) or
> PRIMARY KEY ((key1, key2)), any examples would be welcome if you have the
> time.
>
> Kind regards,
>
> Dave
>
>
> On Thu, Mar 13, 2014 at 2:56 PM, Laing, Michael <michael.laing@nytimes.com
> > wrote:
>
>> Create your table like this and it will work:
>>
>> CREATE TABLE test.documents (group text,id bigint,data
>> map<text,text>,PRIMARY KEY ((group, id)));
>>
>> The extra parens catenate 'group' and 'id' into the partition key - IN
>> will work on the last component of a partition key.
>>
>> ml
>>
>>
>> On Thu, Mar 13, 2014 at 10:40 AM, David Savage <da...@gmail.com>wrote:
>>
>>> Nope, upgraded to 2.0.5 and still get the same problem, I actually
>>> simplified the problem a little in my first post, there's a composite
>>> primary key involved as I need to partition ids into groups
>>>
>>> So the full CQL statements are:
>>>
>>> CREATE KEYSPACE test WITH replication = {'class':'SimpleStrategy',
>>> 'replication_factor':3};
>>>
>>>
>>> CREATE TABLE test.documents (group text,id bigint,data
>>> map<text,text>,PRIMARY KEY (group, id));
>>>
>>>
>>> INSERT INTO test.documents(id,group,data) VALUES
>>> (0,'test',{'count':'0'});
>>>
>>> INSERT INTO test.documents(id,group,data) VALUES
>>> (1,'test',{'count':'1'});
>>>
>>> INSERT INTO test.documents(id,group,data) VALUES
>>> (2,'test',{'count':'2'});
>>>
>>>
>>> SELECT id,data FROM test.documents WHERE group='test' AND id IN (0,1,2);
>>>
>>>
>>> Thanks for your help.
>>>
>>>
>>> Kind regards,
>>>
>>>
>>> /Dave
>>>
>>>
>>> On Thu, Mar 13, 2014 at 2:00 PM, David Savage <da...@gmail.com>wrote:
>>>
>>>> Hmmm that maybe the problem, I'm currently testing with 2.0.2 which got
>>>> dragged in by the cassandra unit library I'm using for testing [1] I will
>>>> try to fix my build dependencies and retry, thx.
>>>>
>>>> /Dave
>>>>
>>>> [1] https://github.com/jsevellec/cassandra-unit
>>>>
>>>>
>>>> On Thu, Mar 13, 2014 at 1:56 PM, Laing, Michael <
>>>> michael.laing@nytimes.com> wrote:
>>>>
>>>>> I have no problem doing this w 2.0.5 - what version of C* are you
>>>>> using? Or maybe I don't understand your data model... attach 'creates' if
>>>>> you don't mind.
>>>>>
>>>>> ml
>>>>>
>>>>>
>>>>> On Thu, Mar 13, 2014 at 9:24 AM, David Savage <da...@gmail.com>wrote:
>>>>>
>>>>>> Hi Peter,
>>>>>>
>>>>>> Thanks for the help, unfortunately I'm not sure that's the problem,
>>>>>> the id is the primary key on the documents table and the timestamp
>>>>>> is the primary key on the eventlog table
>>>>>>
>>>>>> Kind regards,
>>>>>>
>>>>>>
>>>>>> Dave
>>>>>>
>>>>>> On Thursday, 13 March 2014, Peter Lin <wo...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> it's not clear to me if your "id" column is the KEY or just a
>>>>>>> regular column with secondary index.
>>>>>>>
>>>>>>> queries that have IN on non primary key columns isn't supported yet.
>>>>>>> not sure if that answers your question.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Mar 13, 2014 at 7:12 AM, David Savage <
>>>>>>> davemssavage@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> I'm experimenting using cassandra and have run across an error
>>>>>>>> message which I need a little more information on.
>>>>>>>>
>>>>>>>> The use case I'm experimenting with is a series of document updates
>>>>>>>> (documents being an arbitrary map of key value pairs), I would like to find
>>>>>>>> the latest document updates after a specified time period. I don't want to
>>>>>>>> store many copies of the documents (one per update) as the updates are
>>>>>>>> often only to single keys in the map so that would involve a lot of
>>>>>>>> duplicated data.
>>>>>>>>
>>>>>>>> The solution I've found that seems to fit best in terms of
>>>>>>>> performance is to have two tables.
>>>>>>>>
>>>>>>>> One that has an event log of timeuuid -> docid and a second that
>>>>>>>> stores the documents themselves stored by docid -> map<string, string>. I
>>>>>>>> then run two queries, one to select ids that have changed after a certain
>>>>>>>> time:
>>>>>>>>
>>>>>>>> SELECT id FROM eventlog WHERE timestamp>=minTimeuuid($minimumTime)
>>>>>>>>
>>>>>>>> and then a second to select the actual documents themselves
>>>>>>>>
>>>>>>>> SELECT id, data FROM documents WHERE id IN (0, 1, 2, 3, 4, 5, 6, 7…)
>>>>>>>>
>>>>>>>> However this then explodes on query with the error message:
>>>>>>>>
>>>>>>>> "Cannot restrict PRIMARY KEY part id by IN relation as a collection
>>>>>>>> is selected by the query"
>>>>>>>>
>>>>>>>> Detective work lead me to these lines in
>>>>>>>> org.apache.cassandra.cql3.statementsSelectStatement:
>>>>>>>>
>>>>>>>>                     // We only support IN for the last name and for
>>>>>>>> compact storage so far
>>>>>>>>                     // TODO: #3885 allows us to extend to non
>>>>>>>> compact as well, but that remains to be done
>>>>>>>>                     if (i != stmt.columnRestrictions.length - 1)
>>>>>>>>                         throw new
>>>>>>>> InvalidRequestException(String.format("PRIMARY KEY part %s cannot be
>>>>>>>> restricted by IN relation", cname));
>>>>>>>>                     else if (stmt.selectACollection())
>>>>>>>>                         throw new
>>>>>>>> InvalidRequestException(String.format("Cannot restrict PRIMARY KEY part %s
>>>>>>>> by IN relation as a collection is selected by the query", cname));
>>>>>>>>
>>>>>>>> It seems like #3885 will allow support for the first IF block
>>>>>>>> above, but I don't think it will allow the second, am I correct?
>>>>>>>>
>>>>>>>> Any pointers on how I can work around this would be greatly
>>>>>>>> appreciated.
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>>
>>>>>>>> Dave
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: CQL Select Map using an IN relationship

Posted by David Savage <da...@gmail.com>.
Great that works, thx! I probably would have never found that...

It now makes me wonder in general when to use PRIMARY KEY (key1, key2) or
PRIMARY KEY ((key1, key2)), any examples would be welcome if you have the
time.

Kind regards,

Dave


On Thu, Mar 13, 2014 at 2:56 PM, Laing, Michael
<mi...@nytimes.com>wrote:

> Create your table like this and it will work:
>
> CREATE TABLE test.documents (group text,id bigint,data
> map<text,text>,PRIMARY KEY ((group, id)));
>
> The extra parens catenate 'group' and 'id' into the partition key - IN
> will work on the last component of a partition key.
>
> ml
>
>
> On Thu, Mar 13, 2014 at 10:40 AM, David Savage <da...@gmail.com>wrote:
>
>> Nope, upgraded to 2.0.5 and still get the same problem, I actually
>> simplified the problem a little in my first post, there's a composite
>> primary key involved as I need to partition ids into groups
>>
>> So the full CQL statements are:
>>
>> CREATE KEYSPACE test WITH replication = {'class':'SimpleStrategy',
>> 'replication_factor':3};
>>
>>
>> CREATE TABLE test.documents (group text,id bigint,data
>> map<text,text>,PRIMARY KEY (group, id));
>>
>>
>> INSERT INTO test.documents(id,group,data) VALUES (0,'test',{'count':'0'});
>>
>> INSERT INTO test.documents(id,group,data) VALUES (1,'test',{'count':'1'});
>>
>> INSERT INTO test.documents(id,group,data) VALUES (2,'test',{'count':'2'});
>>
>>
>> SELECT id,data FROM test.documents WHERE group='test' AND id IN (0,1,2);
>>
>>
>> Thanks for your help.
>>
>>
>> Kind regards,
>>
>>
>> /Dave
>>
>>
>> On Thu, Mar 13, 2014 at 2:00 PM, David Savage <da...@gmail.com>wrote:
>>
>>> Hmmm that maybe the problem, I'm currently testing with 2.0.2 which got
>>> dragged in by the cassandra unit library I'm using for testing [1] I will
>>> try to fix my build dependencies and retry, thx.
>>>
>>> /Dave
>>>
>>> [1] https://github.com/jsevellec/cassandra-unit
>>>
>>>
>>> On Thu, Mar 13, 2014 at 1:56 PM, Laing, Michael <
>>> michael.laing@nytimes.com> wrote:
>>>
>>>> I have no problem doing this w 2.0.5 - what version of C* are you
>>>> using? Or maybe I don't understand your data model... attach 'creates' if
>>>> you don't mind.
>>>>
>>>> ml
>>>>
>>>>
>>>> On Thu, Mar 13, 2014 at 9:24 AM, David Savage <da...@gmail.com>wrote:
>>>>
>>>>> Hi Peter,
>>>>>
>>>>> Thanks for the help, unfortunately I'm not sure that's the problem,
>>>>> the id is the primary key on the documents table and the timestamp is
>>>>> the primary key on the eventlog table
>>>>>
>>>>> Kind regards,
>>>>>
>>>>>
>>>>> Dave
>>>>>
>>>>> On Thursday, 13 March 2014, Peter Lin <wo...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> it's not clear to me if your "id" column is the KEY or just a regular
>>>>>> column with secondary index.
>>>>>>
>>>>>> queries that have IN on non primary key columns isn't supported yet.
>>>>>> not sure if that answers your question.
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 13, 2014 at 7:12 AM, David Savage <davemssavage@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>> I'm experimenting using cassandra and have run across an error
>>>>>>> message which I need a little more information on.
>>>>>>>
>>>>>>> The use case I'm experimenting with is a series of document updates
>>>>>>> (documents being an arbitrary map of key value pairs), I would like to find
>>>>>>> the latest document updates after a specified time period. I don't want to
>>>>>>> store many copies of the documents (one per update) as the updates are
>>>>>>> often only to single keys in the map so that would involve a lot of
>>>>>>> duplicated data.
>>>>>>>
>>>>>>> The solution I've found that seems to fit best in terms of
>>>>>>> performance is to have two tables.
>>>>>>>
>>>>>>> One that has an event log of timeuuid -> docid and a second that
>>>>>>> stores the documents themselves stored by docid -> map<string, string>. I
>>>>>>> then run two queries, one to select ids that have changed after a certain
>>>>>>> time:
>>>>>>>
>>>>>>> SELECT id FROM eventlog WHERE timestamp>=minTimeuuid($minimumTime)
>>>>>>>
>>>>>>> and then a second to select the actual documents themselves
>>>>>>>
>>>>>>> SELECT id, data FROM documents WHERE id IN (0, 1, 2, 3, 4, 5, 6, 7...)
>>>>>>>
>>>>>>> However this then explodes on query with the error message:
>>>>>>>
>>>>>>> "Cannot restrict PRIMARY KEY part id by IN relation as a collection
>>>>>>> is selected by the query"
>>>>>>>
>>>>>>> Detective work lead me to these lines in
>>>>>>> org.apache.cassandra.cql3.statementsSelectStatement:
>>>>>>>
>>>>>>>                     // We only support IN for the last name and for
>>>>>>> compact storage so far
>>>>>>>                     // TODO: #3885 allows us to extend to non
>>>>>>> compact as well, but that remains to be done
>>>>>>>                     if (i != stmt.columnRestrictions.length - 1)
>>>>>>>                         throw new
>>>>>>> InvalidRequestException(String.format("PRIMARY KEY part %s cannot be
>>>>>>> restricted by IN relation", cname));
>>>>>>>                     else if (stmt.selectACollection())
>>>>>>>                         throw new
>>>>>>> InvalidRequestException(String.format("Cannot restrict PRIMARY KEY part %s
>>>>>>> by IN relation as a collection is selected by the query", cname));
>>>>>>>
>>>>>>> It seems like #3885 will allow support for the first IF block above,
>>>>>>> but I don't think it will allow the second, am I correct?
>>>>>>>
>>>>>>> Any pointers on how I can work around this would be greatly
>>>>>>> appreciated.
>>>>>>>
>>>>>>> Kind regards,
>>>>>>>
>>>>>>> Dave
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>>
>

Re: CQL Select Map using an IN relationship

Posted by "Laing, Michael" <mi...@nytimes.com>.
Create your table like this and it will work:

CREATE TABLE test.documents (group text,id bigint,data
map<text,text>,PRIMARY KEY ((group, id)));

The extra parens catenate 'group' and 'id' into the partition key - IN will
work on the last component of a partition key.

ml


On Thu, Mar 13, 2014 at 10:40 AM, David Savage <da...@gmail.com>wrote:

> Nope, upgraded to 2.0.5 and still get the same problem, I actually
> simplified the problem a little in my first post, there's a composite
> primary key involved as I need to partition ids into groups
>
> So the full CQL statements are:
>
> CREATE KEYSPACE test WITH replication = {'class':'SimpleStrategy',
> 'replication_factor':3};
>
>
> CREATE TABLE test.documents (group text,id bigint,data
> map<text,text>,PRIMARY KEY (group, id));
>
>
> INSERT INTO test.documents(id,group,data) VALUES (0,'test',{'count':'0'});
>
> INSERT INTO test.documents(id,group,data) VALUES (1,'test',{'count':'1'});
>
> INSERT INTO test.documents(id,group,data) VALUES (2,'test',{'count':'2'});
>
>
> SELECT id,data FROM test.documents WHERE group='test' AND id IN (0,1,2);
>
>
> Thanks for your help.
>
>
> Kind regards,
>
>
> /Dave
>
>
> On Thu, Mar 13, 2014 at 2:00 PM, David Savage <da...@gmail.com>wrote:
>
>> Hmmm that maybe the problem, I'm currently testing with 2.0.2 which got
>> dragged in by the cassandra unit library I'm using for testing [1] I will
>> try to fix my build dependencies and retry, thx.
>>
>> /Dave
>>
>> [1] https://github.com/jsevellec/cassandra-unit
>>
>>
>> On Thu, Mar 13, 2014 at 1:56 PM, Laing, Michael <
>> michael.laing@nytimes.com> wrote:
>>
>>> I have no problem doing this w 2.0.5 - what version of C* are you using?
>>> Or maybe I don't understand your data model... attach 'creates' if you
>>> don't mind.
>>>
>>> ml
>>>
>>>
>>> On Thu, Mar 13, 2014 at 9:24 AM, David Savage <da...@gmail.com>wrote:
>>>
>>>> Hi Peter,
>>>>
>>>> Thanks for the help, unfortunately I'm not sure that's the problem, the
>>>> id is the primary key on the documents table and the timestamp is the
>>>> primary key on the eventlog table
>>>>
>>>> Kind regards,
>>>>
>>>>
>>>> Dave
>>>>
>>>> On Thursday, 13 March 2014, Peter Lin <wo...@gmail.com> wrote:
>>>>
>>>>>
>>>>> it's not clear to me if your "id" column is the KEY or just a regular
>>>>> column with secondary index.
>>>>>
>>>>> queries that have IN on non primary key columns isn't supported yet.
>>>>> not sure if that answers your question.
>>>>>
>>>>>
>>>>> On Thu, Mar 13, 2014 at 7:12 AM, David Savage <da...@gmail.com>wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> I'm experimenting using cassandra and have run across an error
>>>>>> message which I need a little more information on.
>>>>>>
>>>>>> The use case I'm experimenting with is a series of document updates
>>>>>> (documents being an arbitrary map of key value pairs), I would like to find
>>>>>> the latest document updates after a specified time period. I don't want to
>>>>>> store many copies of the documents (one per update) as the updates are
>>>>>> often only to single keys in the map so that would involve a lot of
>>>>>> duplicated data.
>>>>>>
>>>>>> The solution I've found that seems to fit best in terms of
>>>>>> performance is to have two tables.
>>>>>>
>>>>>> One that has an event log of timeuuid -> docid and a second that
>>>>>> stores the documents themselves stored by docid -> map<string, string>. I
>>>>>> then run two queries, one to select ids that have changed after a certain
>>>>>> time:
>>>>>>
>>>>>> SELECT id FROM eventlog WHERE timestamp>=minTimeuuid($minimumTime)
>>>>>>
>>>>>> and then a second to select the actual documents themselves
>>>>>>
>>>>>> SELECT id, data FROM documents WHERE id IN (0, 1, 2, 3, 4, 5, 6, 7…)
>>>>>>
>>>>>> However this then explodes on query with the error message:
>>>>>>
>>>>>> "Cannot restrict PRIMARY KEY part id by IN relation as a collection
>>>>>> is selected by the query"
>>>>>>
>>>>>> Detective work lead me to these lines in
>>>>>> org.apache.cassandra.cql3.statementsSelectStatement:
>>>>>>
>>>>>>                     // We only support IN for the last name and for
>>>>>> compact storage so far
>>>>>>                     // TODO: #3885 allows us to extend to non compact
>>>>>> as well, but that remains to be done
>>>>>>                     if (i != stmt.columnRestrictions.length - 1)
>>>>>>                         throw new
>>>>>> InvalidRequestException(String.format("PRIMARY KEY part %s cannot be
>>>>>> restricted by IN relation", cname));
>>>>>>                     else if (stmt.selectACollection())
>>>>>>                         throw new
>>>>>> InvalidRequestException(String.format("Cannot restrict PRIMARY KEY part %s
>>>>>> by IN relation as a collection is selected by the query", cname));
>>>>>>
>>>>>> It seems like #3885 will allow support for the first IF block above,
>>>>>> but I don't think it will allow the second, am I correct?
>>>>>>
>>>>>> Any pointers on how I can work around this would be greatly
>>>>>> appreciated.
>>>>>>
>>>>>> Kind regards,
>>>>>>
>>>>>> Dave
>>>>>>
>>>>>
>>>>>
>>>
>>
>

Re: CQL Select Map using an IN relationship

Posted by David Savage <da...@gmail.com>.
Nope, upgraded to 2.0.5 and still get the same problem, I actually
simplified the problem a little in my first post, there's a composite
primary key involved as I need to partition ids into groups

So the full CQL statements are:

CREATE KEYSPACE test WITH replication = {'class':'SimpleStrategy',
'replication_factor':3};


CREATE TABLE test.documents (group text,id bigint,data
map<text,text>,PRIMARY KEY (group, id));


INSERT INTO test.documents(id,group,data) VALUES (0,'test',{'count':'0'});

INSERT INTO test.documents(id,group,data) VALUES (1,'test',{'count':'1'});

INSERT INTO test.documents(id,group,data) VALUES (2,'test',{'count':'2'});


SELECT id,data FROM test.documents WHERE group='test' AND id IN (0,1,2);


Thanks for your help.


Kind regards,


/Dave


On Thu, Mar 13, 2014 at 2:00 PM, David Savage <da...@gmail.com>wrote:

> Hmmm that maybe the problem, I'm currently testing with 2.0.2 which got
> dragged in by the cassandra unit library I'm using for testing [1] I will
> try to fix my build dependencies and retry, thx.
>
> /Dave
>
> [1] https://github.com/jsevellec/cassandra-unit
>
>
> On Thu, Mar 13, 2014 at 1:56 PM, Laing, Michael <michael.laing@nytimes.com
> > wrote:
>
>> I have no problem doing this w 2.0.5 - what version of C* are you using?
>> Or maybe I don't understand your data model... attach 'creates' if you
>> don't mind.
>>
>> ml
>>
>>
>> On Thu, Mar 13, 2014 at 9:24 AM, David Savage <da...@gmail.com>wrote:
>>
>>> Hi Peter,
>>>
>>> Thanks for the help, unfortunately I'm not sure that's the problem, the
>>> id is the primary key on the documents table and the timestamp is the
>>> primary key on the eventlog table
>>>
>>> Kind regards,
>>>
>>>
>>> Dave
>>>
>>> On Thursday, 13 March 2014, Peter Lin <wo...@gmail.com> wrote:
>>>
>>>>
>>>> it's not clear to me if your "id" column is the KEY or just a regular
>>>> column with secondary index.
>>>>
>>>> queries that have IN on non primary key columns isn't supported yet.
>>>> not sure if that answers your question.
>>>>
>>>>
>>>> On Thu, Mar 13, 2014 at 7:12 AM, David Savage <da...@gmail.com>wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> I'm experimenting using cassandra and have run across an error message
>>>>> which I need a little more information on.
>>>>>
>>>>> The use case I'm experimenting with is a series of document updates
>>>>> (documents being an arbitrary map of key value pairs), I would like to find
>>>>> the latest document updates after a specified time period. I don't want to
>>>>> store many copies of the documents (one per update) as the updates are
>>>>> often only to single keys in the map so that would involve a lot of
>>>>> duplicated data.
>>>>>
>>>>> The solution I've found that seems to fit best in terms of performance
>>>>> is to have two tables.
>>>>>
>>>>> One that has an event log of timeuuid -> docid and a second that
>>>>> stores the documents themselves stored by docid -> map<string, string>. I
>>>>> then run two queries, one to select ids that have changed after a certain
>>>>> time:
>>>>>
>>>>> SELECT id FROM eventlog WHERE timestamp>=minTimeuuid($minimumTime)
>>>>>
>>>>> and then a second to select the actual documents themselves
>>>>>
>>>>> SELECT id, data FROM documents WHERE id IN (0, 1, 2, 3, 4, 5, 6, 7...)
>>>>>
>>>>> However this then explodes on query with the error message:
>>>>>
>>>>> "Cannot restrict PRIMARY KEY part id by IN relation as a collection is
>>>>> selected by the query"
>>>>>
>>>>> Detective work lead me to these lines in
>>>>> org.apache.cassandra.cql3.statementsSelectStatement:
>>>>>
>>>>>                     // We only support IN for the last name and for
>>>>> compact storage so far
>>>>>                     // TODO: #3885 allows us to extend to non compact
>>>>> as well, but that remains to be done
>>>>>                     if (i != stmt.columnRestrictions.length - 1)
>>>>>                         throw new
>>>>> InvalidRequestException(String.format("PRIMARY KEY part %s cannot be
>>>>> restricted by IN relation", cname));
>>>>>                     else if (stmt.selectACollection())
>>>>>                         throw new
>>>>> InvalidRequestException(String.format("Cannot restrict PRIMARY KEY part %s
>>>>> by IN relation as a collection is selected by the query", cname));
>>>>>
>>>>> It seems like #3885 will allow support for the first IF block above,
>>>>> but I don't think it will allow the second, am I correct?
>>>>>
>>>>> Any pointers on how I can work around this would be greatly
>>>>> appreciated.
>>>>>
>>>>> Kind regards,
>>>>>
>>>>> Dave
>>>>>
>>>>
>>>>
>>
>

Re: CQL Select Map using an IN relationship

Posted by David Savage <da...@gmail.com>.
Hmmm that maybe the problem, I'm currently testing with 2.0.2 which got
dragged in by the cassandra unit library I'm using for testing [1] I will
try to fix my build dependencies and retry, thx.

/Dave

[1] https://github.com/jsevellec/cassandra-unit


On Thu, Mar 13, 2014 at 1:56 PM, Laing, Michael
<mi...@nytimes.com>wrote:

> I have no problem doing this w 2.0.5 - what version of C* are you using?
> Or maybe I don't understand your data model... attach 'creates' if you
> don't mind.
>
> ml
>
>
> On Thu, Mar 13, 2014 at 9:24 AM, David Savage <da...@gmail.com>wrote:
>
>> Hi Peter,
>>
>> Thanks for the help, unfortunately I'm not sure that's the problem, the
>> id is the primary key on the documents table and the timestamp is the
>> primary key on the eventlog table
>>
>> Kind regards,
>>
>>
>> Dave
>>
>> On Thursday, 13 March 2014, Peter Lin <wo...@gmail.com> wrote:
>>
>>>
>>> it's not clear to me if your "id" column is the KEY or just a regular
>>> column with secondary index.
>>>
>>> queries that have IN on non primary key columns isn't supported yet. not
>>> sure if that answers your question.
>>>
>>>
>>> On Thu, Mar 13, 2014 at 7:12 AM, David Savage <da...@gmail.com>wrote:
>>>
>>>> Hi there,
>>>>
>>>> I'm experimenting using cassandra and have run across an error message
>>>> which I need a little more information on.
>>>>
>>>> The use case I'm experimenting with is a series of document updates
>>>> (documents being an arbitrary map of key value pairs), I would like to find
>>>> the latest document updates after a specified time period. I don't want to
>>>> store many copies of the documents (one per update) as the updates are
>>>> often only to single keys in the map so that would involve a lot of
>>>> duplicated data.
>>>>
>>>> The solution I've found that seems to fit best in terms of performance
>>>> is to have two tables.
>>>>
>>>> One that has an event log of timeuuid -> docid and a second that stores
>>>> the documents themselves stored by docid -> map<string, string>. I then run
>>>> two queries, one to select ids that have changed after a certain time:
>>>>
>>>> SELECT id FROM eventlog WHERE timestamp>=minTimeuuid($minimumTime)
>>>>
>>>> and then a second to select the actual documents themselves
>>>>
>>>> SELECT id, data FROM documents WHERE id IN (0, 1, 2, 3, 4, 5, 6, 7...)
>>>>
>>>> However this then explodes on query with the error message:
>>>>
>>>> "Cannot restrict PRIMARY KEY part id by IN relation as a collection is
>>>> selected by the query"
>>>>
>>>> Detective work lead me to these lines in
>>>> org.apache.cassandra.cql3.statementsSelectStatement:
>>>>
>>>>                     // We only support IN for the last name and for
>>>> compact storage so far
>>>>                     // TODO: #3885 allows us to extend to non compact
>>>> as well, but that remains to be done
>>>>                     if (i != stmt.columnRestrictions.length - 1)
>>>>                         throw new
>>>> InvalidRequestException(String.format("PRIMARY KEY part %s cannot be
>>>> restricted by IN relation", cname));
>>>>                     else if (stmt.selectACollection())
>>>>                         throw new
>>>> InvalidRequestException(String.format("Cannot restrict PRIMARY KEY part %s
>>>> by IN relation as a collection is selected by the query", cname));
>>>>
>>>> It seems like #3885 will allow support for the first IF block above,
>>>> but I don't think it will allow the second, am I correct?
>>>>
>>>> Any pointers on how I can work around this would be greatly appreciated.
>>>>
>>>> Kind regards,
>>>>
>>>> Dave
>>>>
>>>
>>>
>

Re: CQL Select Map using an IN relationship

Posted by "Laing, Michael" <mi...@nytimes.com>.
I have no problem doing this w 2.0.5 - what version of C* are you using? Or
maybe I don't understand your data model... attach 'creates' if you don't
mind.

ml


On Thu, Mar 13, 2014 at 9:24 AM, David Savage <da...@gmail.com>wrote:

> Hi Peter,
>
> Thanks for the help, unfortunately I'm not sure that's the problem, the id
> is the primary key on the documents table and the timestamp is the
> primary key on the eventlog table
>
> Kind regards,
>
>
> Dave
>
> On Thursday, 13 March 2014, Peter Lin <wo...@gmail.com> wrote:
>
>>
>> it's not clear to me if your "id" column is the KEY or just a regular
>> column with secondary index.
>>
>> queries that have IN on non primary key columns isn't supported yet. not
>> sure if that answers your question.
>>
>>
>> On Thu, Mar 13, 2014 at 7:12 AM, David Savage <da...@gmail.com>wrote:
>>
>>> Hi there,
>>>
>>> I'm experimenting using cassandra and have run across an error message
>>> which I need a little more information on.
>>>
>>> The use case I'm experimenting with is a series of document updates
>>> (documents being an arbitrary map of key value pairs), I would like to find
>>> the latest document updates after a specified time period. I don't want to
>>> store many copies of the documents (one per update) as the updates are
>>> often only to single keys in the map so that would involve a lot of
>>> duplicated data.
>>>
>>> The solution I've found that seems to fit best in terms of performance
>>> is to have two tables.
>>>
>>> One that has an event log of timeuuid -> docid and a second that stores
>>> the documents themselves stored by docid -> map<string, string>. I then run
>>> two queries, one to select ids that have changed after a certain time:
>>>
>>> SELECT id FROM eventlog WHERE timestamp>=minTimeuuid($minimumTime)
>>>
>>> and then a second to select the actual documents themselves
>>>
>>> SELECT id, data FROM documents WHERE id IN (0, 1, 2, 3, 4, 5, 6, 7…)
>>>
>>> However this then explodes on query with the error message:
>>>
>>> "Cannot restrict PRIMARY KEY part id by IN relation as a collection is
>>> selected by the query"
>>>
>>> Detective work lead me to these lines in
>>> org.apache.cassandra.cql3.statementsSelectStatement:
>>>
>>>                     // We only support IN for the last name and for
>>> compact storage so far
>>>                     // TODO: #3885 allows us to extend to non compact as
>>> well, but that remains to be done
>>>                     if (i != stmt.columnRestrictions.length - 1)
>>>                         throw new
>>> InvalidRequestException(String.format("PRIMARY KEY part %s cannot be
>>> restricted by IN relation", cname));
>>>                     else if (stmt.selectACollection())
>>>                         throw new
>>> InvalidRequestException(String.format("Cannot restrict PRIMARY KEY part %s
>>> by IN relation as a collection is selected by the query", cname));
>>>
>>> It seems like #3885 will allow support for the first IF block above, but
>>> I don't think it will allow the second, am I correct?
>>>
>>> Any pointers on how I can work around this would be greatly appreciated.
>>>
>>> Kind regards,
>>>
>>> Dave
>>>
>>
>>

Re: CQL Select Map using an IN relationship

Posted by David Savage <da...@gmail.com>.
Hi Peter,

Thanks for the help, unfortunately I'm not sure that's the problem, the id
is the primary key on the documents table and the timestamp is the primary
key on the eventlog table

Kind regards,


Dave

On Thursday, 13 March 2014, Peter Lin <wo...@gmail.com> wrote:

>
> it's not clear to me if your "id" column is the KEY or just a regular
> column with secondary index.
>
> queries that have IN on non primary key columns isn't supported yet. not
> sure if that answers your question.
>
>
> On Thu, Mar 13, 2014 at 7:12 AM, David Savage <da...@gmail.com>wrote:
>
>> Hi there,
>>
>> I'm experimenting using cassandra and have run across an error message
>> which I need a little more information on.
>>
>> The use case I'm experimenting with is a series of document updates
>> (documents being an arbitrary map of key value pairs), I would like to find
>> the latest document updates after a specified time period. I don't want to
>> store many copies of the documents (one per update) as the updates are
>> often only to single keys in the map so that would involve a lot of
>> duplicated data.
>>
>> The solution I've found that seems to fit best in terms of performance is
>> to have two tables.
>>
>> One that has an event log of timeuuid -> docid and a second that stores
>> the documents themselves stored by docid -> map<string, string>. I then run
>> two queries, one to select ids that have changed after a certain time:
>>
>> SELECT id FROM eventlog WHERE timestamp>=minTimeuuid($minimumTime)
>>
>> and then a second to select the actual documents themselves
>>
>> SELECT id, data FROM documents WHERE id IN (0, 1, 2, 3, 4, 5, 6, 7...)
>>
>> However this then explodes on query with the error message:
>>
>> "Cannot restrict PRIMARY KEY part id by IN relation as a collection is
>> selected by the query"
>>
>> Detective work lead me to these lines in
>> org.apache.cassandra.cql3.statementsSelectStatement:
>>
>>                     // We only support IN for the last name and for
>> compact storage so far
>>                     // TODO: #3885 allows us to extend to non compact as
>> well, but that remains to be done
>>                     if (i != stmt.columnRestrictions.length - 1)
>>                         throw new
>> InvalidRequestException(String.format("PRIMARY KEY part %s cannot be
>> restricted by IN relation", cname));
>>                     else if (stmt.selectACollection())
>>                         throw new
>> InvalidRequestException(String.format("Cannot restrict PRIMARY KEY part %s
>> by IN relation as a collection is selected by the query", cname));
>>
>> It seems like #3885 will allow support for the first IF block above, but
>> I don't think it will allow the second, am I correct?
>>
>> Any pointers on how I can work around this would be greatly appreciated.
>>
>> Kind regards,
>>
>> Dave
>>
>
>

Re: CQL Select Map using an IN relationship

Posted by Peter Lin <wo...@gmail.com>.
it's not clear to me if your "id" column is the KEY or just a regular
column with secondary index.

queries that have IN on non primary key columns isn't supported yet. not
sure if that answers your question.


On Thu, Mar 13, 2014 at 7:12 AM, David Savage <da...@gmail.com>wrote:

> Hi there,
>
> I'm experimenting using cassandra and have run across an error message
> which I need a little more information on.
>
> The use case I'm experimenting with is a series of document updates
> (documents being an arbitrary map of key value pairs), I would like to find
> the latest document updates after a specified time period. I don't want to
> store many copies of the documents (one per update) as the updates are
> often only to single keys in the map so that would involve a lot of
> duplicated data.
>
> The solution I've found that seems to fit best in terms of performance is
> to have two tables.
>
> One that has an event log of timeuuid -> docid and a second that stores
> the documents themselves stored by docid -> map<string, string>. I then run
> two queries, one to select ids that have changed after a certain time:
>
> SELECT id FROM eventlog WHERE timestamp>=minTimeuuid($minimumTime)
>
> and then a second to select the actual documents themselves
>
> SELECT id, data FROM documents WHERE id IN (0, 1, 2, 3, 4, 5, 6, 7...)
>
> However this then explodes on query with the error message:
>
> "Cannot restrict PRIMARY KEY part id by IN relation as a collection is
> selected by the query"
>
> Detective work lead me to these lines in
> org.apache.cassandra.cql3.statementsSelectStatement:
>
>                     // We only support IN for the last name and for
> compact storage so far
>                     // TODO: #3885 allows us to extend to non compact as
> well, but that remains to be done
>                     if (i != stmt.columnRestrictions.length - 1)
>                         throw new
> InvalidRequestException(String.format("PRIMARY KEY part %s cannot be
> restricted by IN relation", cname));
>                     else if (stmt.selectACollection())
>                         throw new
> InvalidRequestException(String.format("Cannot restrict PRIMARY KEY part %s
> by IN relation as a collection is selected by the query", cname));
>
> It seems like #3885 will allow support for the first IF block above, but I
> don't think it will allow the second, am I correct?
>
> Any pointers on how I can work around this would be greatly appreciated.
>
> Kind regards,
>
> Dave
>

Re: CQL Select Map using an IN relationship

Posted by Sylvain Lebresne <sy...@datastax.com>.
On Thu, Mar 13, 2014 at 12:12 PM, David Savage <da...@gmail.com>wrote:

> Hi there,
>
> I'm experimenting using cassandra and have run across an error message
> which I need a little more information on.
>
> The use case I'm experimenting with is a series of document updates
> (documents being an arbitrary map of key value pairs), I would like to find
> the latest document updates after a specified time period. I don't want to
> store many copies of the documents (one per update) as the updates are
> often only to single keys in the map so that would involve a lot of
> duplicated data.
>
> The solution I've found that seems to fit best in terms of performance is
> to have two tables.
>
> One that has an event log of timeuuid -> docid and a second that stores
> the documents themselves stored by docid -> map<string, string>. I then run
> two queries, one to select ids that have changed after a certain time:
>
> SELECT id FROM eventlog WHERE timestamp>=minTimeuuid($minimumTime)
>
> and then a second to select the actual documents themselves
>
> SELECT id, data FROM documents WHERE id IN (0, 1, 2, 3, 4, 5, 6, 7...)
>
> However this then explodes on query with the error message:
>
> "Cannot restrict PRIMARY KEY part id by IN relation as a collection is
> selected by the query"
>
> Detective work lead me to these lines in
> org.apache.cassandra.cql3.statementsSelectStatement:
>
>                     // We only support IN for the last name and for
> compact storage so far
>                     // TODO: #3885 allows us to extend to non compact as
> well, but that remains to be done
>                     if (i != stmt.columnRestrictions.length - 1)
>                         throw new
> InvalidRequestException(String.format("PRIMARY KEY part %s cannot be
> restricted by IN relation", cname));
>                     else if (stmt.selectACollection())
>                         throw new
> InvalidRequestException(String.format("Cannot restrict PRIMARY KEY part %s
> by IN relation as a collection is selected by the query", cname));
>
> It seems like #3885 will allow support for the first IF block above, but I
> don't think it will allow the second, am I correct?
>

Right, #3885 is about the first one. Tbh, the 2nd limitation is kind of
historical and unless I'm forgetting something, we should be able to lift
that pretty easily. If you don't mind opening a JIRA ticket, I'll have a
look at removing said limitation.

--
Sylvain



>
> Any pointers on how I can work around this would be greatly appreciated.
>
> Kind regards,
>
> Dave
>