You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by Subodh Nijsure <su...@gmail.com> on 2014/09/01 01:43:16 UTC

Help with select IN query in cassandra

I have following database schema

CREATE TABLE sensor_info_table (
  asset_id text,
  event_time timestamp,
  "timestamp" timeuuid,
  sensor_reading map<text, text>,
  sensor_serial_number text,
  sensor_type int,
  PRIMARY KEY ((asset_id), event_time, "timestamp")
);

CREATE INDEX event_time_index ON sensor_info_table (event_time);

CREATE INDEX timestamp_index ON sensor_info_table ("timestamp");

Now I am able to insert the data into this table, however I am unable
to do following query where I want to select items with specific
timeuuid values.

It gives me following error.

SELECT * from mydb.sensor_info_table where timestamp IN (
bfdfa614-3166-11e4-a61d-b888e30f5d17 ,
bf4521ac-3166-11e4-87a3-b888e30f5d17) ;

Bad Request: PRIMARY KEY column "timestamp" cannot be restricted
(preceding column "event_time" is either not restricted or by a non-EQ
relation)

What do I have to do to make this work?

For what its worth I am using django for my front end development and
I am using "timestamp timeuuid" field as unique indentifier to
reference specific sensor reading from django framework -- since
cassandra doesn't have way to generate unique id upon insert (like
old-style rdms's auto-fields).


Below is software version info.

show VERSION ; [cqlsh 4.1.1 | Cassandra 2.0.9 | CQL spec 3.1.1 |
Thrift protocol 19.39.0]

I really don't understand what the error message preceeding column
"event_time" is either not restricted or by no-EQ relation?

-Subodh Nijsure

Re: Help with select IN query in cassandra

Posted by Subodh Nijsure <su...@gmail.com>.

Thanks Michael I will certainly go with this approach for now.

-Subodh

On Mon, Sep 1, 2014 at 6:33 AM, Laing, Michael
<mi...@nytimes.com> wrote:
> This should work for your query requirements - 2 tables w same info because
> disk is cheap and writes are fast so optimize for reads:
>
> CREATE TABLE sensor_asset (
>   asset_id text,
>   event_time timestamp,
>   tuuid timeuuid,
>   sensor_reading map<text, text>,
>   sensor_serial_number text,
>   sensor_type int,
>   PRIMARY KEY ((asset_id), event_time)
> );
>
> CREATE TABLE sensor_tuuid (
>   asset_id text,
>   event_time timestamp,
>   tuuid timeuuid,
>   sensor_reading map<text, text>,
>   sensor_serial_number text,
>   sensor_type int,
>   PRIMARY KEY (tuuid)
> );
>
> 1. Give me all sensor data for an asset:
>
> select * from sensor_asset where asset_id = <asset>;
>
> 2. Give me sensor data that matches a set of timeuuids:
>
> select * from sensor_tuuid where tuuid in (<tuuid1>, <tuuid2>, ...);
>
> 3. Give me all sensor data for an asset collected after | before | between
> event_time(s):
>
> select * from sensor_asset where asset_id = <asset> and event_time > <ts1>;
> select * from sensor_asset where asset_id = <asset> and event_time < <ts1>;
> select * from sensor_asset where asset_id = <asset> and event_time < <ts1>
> and event_time > <ts2>;
>
> ***
>
> Many people (not me) handle sensor data, so there may be better overall
> approaches considering volumes, deletion, compaction etc.
>
> But the above is simple and should make your current approach workable as
> you iterate toward a complete solution.
>
> Cheers,
> ml
>
>
>
> On Sun, Aug 31, 2014 at 11:08 PM, Subodh Nijsure <su...@gmail.com>
> wrote:
>>
>> Thanks for your help Michael.
>>
>> If specifying asset_id would help I can construct queries that can
>> include asset_id
>>
>> So I have been "playing" around with PRIMARY KEY definition and
>> following table definition
>>
>> CREATE TABLE sensor_info_table (
>>   asset_id text,
>>   event_time timestamp,
>>   "timestamp" timeuuid,
>>   sensor_reading map<text, text>,
>>   sensor_serial_number text,
>>   sensor_type int,
>>   PRIMARY KEY ((asset_id, "timestamp"), event_time)
>> );
>>
>> It does what I want to do, and I removed the index for timestamp item
>> since now it is part of primary key and thus my query like this works.
>>
>> SELECT * from sigsense.sensor_info_table where  asset_id='3' AND
>> timestamp IN (
>> 17830bb0-316a-11e4-800f-b888e30f5d17,16ddbdfe-316a-11e4-9f50-b888e30f5d17
>> );
>>
>> But now this doesn't work it give
>>
>> SELECT * from sensor_info_table where  asset_id='3' ;
>>
>> Bad Request: Partition key part timestamp must be restricted since
>> preceding part is
>>
>> I am keeping index on event_time as I sometime need to query something
>> "give me all data since time x" i.e. something like this works.
>>
>>  SELECT * from sensor_info_table where  event_time > '2014-08-31
>> 16:54:02-0700' ALLOW FILTERING;
>>
>> However if I do this things then this don't work:
>>
>> SELECT * from sensor_info_table where  asset_id='3' AND event_time >
>> '2014-08-31 16:54:02-0700';
>>
>> Bad Request: Partition key part timestamp must be restricted since
>> preceding part is
>>
>> Also  I am not conformable with fact that I need to specify ALLOW
>> FILTERING.
>>
>> I guess cassandra schema design task asks designer to write down
>> queries before designing schema.
>>
>> For the above table definition I want to do following queries:
>>
>> - Give me all sensor data for given asset.
>> - Give me sensor data that matches given set of timeuuids
>> - Give me all sendor data for a given asset, that were collected after
>> | before | between  certain event_time.
>>
>> Given these query criteria how should  I construct my schema? One
>> thought has occurred to me is make three tables with each item
>> asset_id , event_time, timeuuid as primary keys and depending on type
>> of query choose the table to do query upon. That seems like a waste of
>> resources (disk, cpu ), also increasing insert times(!) but thats the
>> way things need to happen in cassandra world its okay. ( I am
>> two-three weeks into learning about cassandra).
>>
>> -Subodh
>>
>> On Sun, Aug 31, 2014 at 6:44 PM, Laing, Michael
>> <mi...@nytimes.com> wrote:
>> > Oh it must be late - I missed the fact that you didn't want to specify
>> > asset_id. The above queries will still work but you have to use 'allow
>> > filtering' - generally not a good idea. I'll look again in the morning.
>> >
>> >
>> > On Sun, Aug 31, 2014 at 9:41 PM, Laing, Michael
>> > <mi...@nytimes.com>
>> > wrote:
>> >>
>> >> Hmm. Because the clustering key is (event_time, "timestamp"),
>> >> event_time
>> >> must be specified as well - hopefully that info is available to the ux.
>> >>
>> >> Unfortunately you will then hit another problem with your query: you
>> >> are
>> >> selecting a collection field... this will not work with IN on
>> >> "timestamp".
>> >>
>> >> So you could select all the "timestamp"s for an asset_id/event_time:
>> >>>
>> >>> SELECT * from sensor_info_table where asset_id = 'a' and event_time =
>> >>> 1231234;
>> >>
>> >>
>> >> Or you could apply a range of "timestamp"s:
>> >>>
>> >>> SELECT * from sensor_info_table where asset_id = 'a' and event_time =
>> >>> 1231234 and "timestamp" > 1d934af3-3178-11e4-ba8d-406c8f1838fa and
>> >>> "timestamp" < 20b82021-3178-11e4-abc2-406c8f1838fa;
>> >>
>> >>
>> >> BTW the secondary indices are not a good idea: high cardinality and of
>> >> no
>> >> use in this query that I can see.
>> >>
>> >> ml
>> >>
>> >>
>> >> On Sun, Aug 31, 2014 at 8:40 PM, Subodh Nijsure
>> >> <su...@gmail.com>
>> >> wrote:
>> >>>
>> >>> Not really event time stamp is created by the sensor when it reads
>> >>> data
>> >>> and  timestamp is something server creates when inserting data into
>> >>> cassandra db.  At later point in time my django ux allows users to
>> >>> browse
>> >>> this data and reference interesting data points via the timestamp
>> >>> field. The
>> >>> timestamp field is my bridge between Sal and nosql world.
>> >>>
>> >>> Subodh
>> >>>
>> >>> On Aug 31, 2014 5:33 PM, "Laing, Michael" <mi...@nytimes.com>
>> >>> wrote:
>> >>>>
>> >>>> Are event_time and timestamp essentially representing the same
>> >>>> datetime?
>> >>>>
>> >>>> On Sunday, August 31, 2014, Subodh Nijsure <su...@gmail.com>
>> >>>> wrote:
>> >>>>>
>> >>>>> I have following database schema
>> >>>>>
>> >>>>> CREATE TABLE sensor_info_table (
>> >>>>>   asset_id text,
>> >>>>>   event_time timestamp,
>> >>>>>   "timestamp" timeuuid,
>> >>>>>   sensor_reading map<text, text>,
>> >>>>>   sensor_serial_number text,
>> >>>>>   sensor_type int,
>> >>>>>   PRIMARY KEY ((asset_id), event_time, "timestamp")
>> >>>>> );
>> >>>>>
>> >>>>> CREATE INDEX event_time_index ON sensor_info_table (event_time);
>> >>>>>
>> >>>>> CREATE INDEX timestamp_index ON sensor_info_table ("timestamp");
>> >>>>>
>> >>>>> Now I am able to insert the data into this table, however I am
>> >>>>> unable
>> >>>>> to do following query where I want to select items with specific
>> >>>>> timeuuid values.
>> >>>>>
>> >>>>> It gives me following error.
>> >>>>>
>> >>>>> SELECT * from mydb.sensor_info_table where timestamp IN (
>> >>>>> bfdfa614-3166-11e4-a61d-b888e30f5d17 ,
>> >>>>> bf4521ac-3166-11e4-87a3-b888e30f5d17) ;
>> >>>>>
>> >>>>> Bad Request: PRIMARY KEY column "timestamp" cannot be restricted
>> >>>>> (preceding column "event_time" is either not restricted or by a
>> >>>>> non-EQ
>> >>>>> relation)
>> >>>>>
>> >>>>> What do I have to do to make this work?
>> >>>>>
>> >>>>> For what its worth I am using django for my front end development
>> >>>>> and
>> >>>>> I am using "timestamp timeuuid" field as unique indentifier to
>> >>>>> reference specific sensor reading from django framework -- since
>> >>>>> cassandra doesn't have way to generate unique id upon insert (like
>> >>>>> old-style rdms's auto-fields).
>> >>>>>
>> >>>>>
>> >>>>> Below is software version info.
>> >>>>>
>> >>>>> show VERSION ; [cqlsh 4.1.1 | Cassandra 2.0.9 | CQL spec 3.1.1 |
>> >>>>> Thrift protocol 19.39.0]
>> >>>>>
>> >>>>> I really don't understand what the error message preceeding column
>> >>>>> "event_time" is either not restricted or by no-EQ relation?
>> >>>>>
>> >>>>> -Subodh Nijsure
>> >>
>> >>
>> >
>
>

Re: Help with select IN query in cassandra

Posted by Subodh Nijsure <su...@gmail.com>.

I am guessing OP => Original Poster. I had not asked about deletion,
but Jack thanks for your suggestion for overall handling of deletion.
For the type of sensor data I am collecting we really don't want to
delete the data as we want to figure out data trend over real long
term ( 1, 2 years).

BTW - I found a good link that describes behavior of where clause in
cassandra, just posting that link in case someone stumbles upon this
thread in the future

http://mechanics.flite.com/blog/2013/11/05/breaking-down-the-cql-where-clause/

-Subodh

On Mon, Sep 1, 2014 at 8:56 AM, Jack Krupansky <ja...@basetechnology.com> wrote:
> I did see a reference to deletions: “overall approaches considering volumes,
> deletion, compaction etc.” Did I merely misunderstand the reference? That’s
> all I was responding to... sorry if my misunderstanding added any confusion!
>
> -- Jack Krupansky
>
> From: Laing, Michael
> Sent: Monday, September 1, 2014 11:34 AM
> To: user@cassandra.apache.org
> Subject: Re: Help with select IN query in cassandra
>
> Did the OP propose that?
>
>
> On Mon, Sep 1, 2014 at 10:53 AM, Jack Krupansky <ja...@basetechnology.com>
> wrote:
>>
>> One comment on deletions – aren’t deletions kind of an anti-pattern for
>> modern data processing, such as sensor data, time series data, and social
>> media? I mean, isn’t it usually better to return a full history of the data,
>> with some aging scheme, and manage the tracking of which values are
>> “current” (or “recent”)? Shouldn’t we be looking for and promoting “write
>> once” approaches as a much stronger preference/pattern? Or maybe I should
>> say “write once and bulk delete on aging” rather than the exercise in
>> futility of doing a massive number of deletes and updates in place?
>>
>> -- Jack Krupansky
>>
>> From: Laing, Michael
>> Sent: Monday, September 1, 2014 9:33 AM
>> To: user@cassandra.apache.org
>> Subject: Re: Help with select IN query in cassandra
>>
>> This should work for your query requirements - 2 tables w same info
>> because disk is cheap and writes are fast so optimize for reads:
>>
>> CREATE TABLE sensor_asset (
>>   asset_id text,
>>   event_time timestamp,
>>   tuuid timeuuid,
>>   sensor_reading map<text, text>,
>>   sensor_serial_number text,
>>   sensor_type int,
>>   PRIMARY KEY ((asset_id), event_time)
>> );
>>
>> CREATE TABLE sensor_tuuid (
>>   asset_id text,
>>   event_time timestamp,
>>   tuuid timeuuid,
>>   sensor_reading map<text, text>,
>>   sensor_serial_number text,
>>   sensor_type int,
>>   PRIMARY KEY (tuuid)
>> );
>>
>> 1. Give me all sensor data for an asset:
>>
>> select * from sensor_asset where asset_id = <asset>;
>>
>> 2. Give me sensor data that matches a set of timeuuids:
>>
>> select * from sensor_tuuid where tuuid in (<tuuid1>, <tuuid2>, ...);
>>
>> 3. Give me all sensor data for an asset collected after | before | between
>> event_time(s):
>>
>> select * from sensor_asset where asset_id = <asset> and event_time >
>> <ts1>;
>> select * from sensor_asset where asset_id = <asset> and event_time <
>> <ts1>;
>> select * from sensor_asset where asset_id = <asset> and event_time < <ts1>
>> and event_time > <ts2>;
>>
>> ***
>>
>> Many people (not me) handle sensor data, so there may be better overall
>> approaches considering volumes, deletion, compaction etc.
>>
>> But the above is simple and should make your current approach workable as
>> you iterate toward a complete solution.
>>
>> Cheers,
>> ml
>>
>>
>>
>> On Sun, Aug 31, 2014 at 11:08 PM, Subodh Nijsure
>> <su...@gmail.com> wrote:
>>>
>>> Thanks for your help Michael.
>>>
>>> If specifying asset_id would help I can construct queries that can
>>> include asset_id
>>>
>>> So I have been "playing" around with PRIMARY KEY definition and
>>> following table definition
>>>
>>> CREATE TABLE sensor_info_table (
>>>   asset_id text,
>>>   event_time timestamp,
>>>   "timestamp" timeuuid,
>>>   sensor_reading map<text, text>,
>>>   sensor_serial_number text,
>>>   sensor_type int,
>>>   PRIMARY KEY ((asset_id, "timestamp"), event_time)
>>> );
>>>
>>> It does what I want to do, and I removed the index for timestamp item
>>> since now it is part of primary key and thus my query like this works.
>>>
>>> SELECT * from sigsense.sensor_info_table where  asset_id='3' AND
>>> timestamp IN (
>>> 17830bb0-316a-11e4-800f-b888e30f5d17,16ddbdfe-316a-11e4-9f50-b888e30f5d17
>>> );
>>>
>>> But now this doesn't work it give
>>>
>>> SELECT * from sensor_info_table where  asset_id='3' ;
>>>
>>> Bad Request: Partition key part timestamp must be restricted since
>>> preceding part is
>>>
>>> I am keeping index on event_time as I sometime need to query something
>>> "give me all data since time x" i.e. something like this works.
>>>
>>> SELECT * from sensor_info_table where  event_time > '2014-08-31
>>> 16:54:02-0700' ALLOW FILTERING;
>>>
>>> However if I do this things then this don't work:
>>>
>>> SELECT * from sensor_info_table where  asset_id='3' AND event_time >
>>> '2014-08-31 16:54:02-0700';
>>>
>>> Bad Request: Partition key part timestamp must be restricted since
>>> preceding part is
>>>
>>> Also  I am not conformable with fact that I need to specify ALLOW
>>> FILTERING.
>>>
>>> I guess cassandra schema design task asks designer to write down
>>> queries before designing schema.
>>>
>>> For the above table definition I want to do following queries:
>>>
>>> - Give me all sensor data for given asset.
>>> - Give me sensor data that matches given set of timeuuids
>>> - Give me all sendor data for a given asset, that were collected after
>>> | before | between  certain event_time.
>>>
>>> Given these query criteria how should  I construct my schema? One
>>> thought has occurred to me is make three tables with each item
>>> asset_id , event_time, timeuuid as primary keys and depending on type
>>> of query choose the table to do query upon. That seems like a waste of
>>> resources (disk, cpu ), also increasing insert times(!) but thats the
>>> way things need to happen in cassandra world its okay. ( I am
>>> two-three weeks into learning about cassandra).
>>>
>>> -Subodh
>>>
>>> On Sun, Aug 31, 2014 at 6:44 PM, Laing, Michael
>>> <mi...@nytimes.com> wrote:
>>> > Oh it must be late - I missed the fact that you didn't want to specify
>>> > asset_id. The above queries will still work but you have to use 'allow
>>> > filtering' - generally not a good idea. I'll look again in the morning.
>>> >
>>> >
>>> > On Sun, Aug 31, 2014 at 9:41 PM, Laing, Michael
>>> > <mi...@nytimes.com>
>>> > wrote:
>>> >>
>>> >> Hmm. Because the clustering key is (event_time, "timestamp"),
>>> >> event_time
>>> >> must be specified as well - hopefully that info is available to the
>>> >> ux.
>>> >>
>>> >> Unfortunately you will then hit another problem with your query: you
>>> >> are
>>> >> selecting a collection field... this will not work with IN on
>>> >> "timestamp".
>>> >>
>>> >> So you could select all the "timestamp"s for an asset_id/event_time:
>>> >>>
>>> >>> SELECT * from sensor_info_table where asset_id = 'a' and event_time =
>>> >>> 1231234;
>>> >>
>>> >>
>>> >> Or you could apply a range of "timestamp"s:
>>> >>>
>>> >>> SELECT * from sensor_info_table where asset_id = 'a' and event_time =
>>> >>> 1231234 and "timestamp" > 1d934af3-3178-11e4-ba8d-406c8f1838fa and
>>> >>> "timestamp" < 20b82021-3178-11e4-abc2-406c8f1838fa;
>>> >>
>>> >>
>>> >> BTW the secondary indices are not a good idea: high cardinality and of
>>> >> no
>>> >> use in this query that I can see.
>>> >>
>>> >> ml
>>> >>
>>> >>
>>> >> On Sun, Aug 31, 2014 at 8:40 PM, Subodh Nijsure
>>> >> <su...@gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> Not really event time stamp is created by the sensor when it reads
>>> >>> data
>>> >>> and  timestamp is something server creates when inserting data into
>>> >>> cassandra db.  At later point in time my django ux allows users to
>>> >>> browse
>>> >>> this data and reference interesting data points via the timestamp
>>> >>> field. The
>>> >>> timestamp field is my bridge between Sal and nosql world.
>>> >>>
>>> >>> Subodh
>>> >>>
>>> >>> On Aug 31, 2014 5:33 PM, "Laing, Michael" <mi...@nytimes.com>
>>> >>> wrote:
>>> >>>>
>>> >>>> Are event_time and timestamp essentially representing the same
>>> >>>> datetime?
>>> >>>>
>>> >>>> On Sunday, August 31, 2014, Subodh Nijsure
>>> >>>> <su...@gmail.com>
>>> >>>> wrote:
>>> >>>>>
>>> >>>>> I have following database schema
>>> >>>>>
>>> >>>>> CREATE TABLE sensor_info_table (
>>> >>>>>   asset_id text,
>>> >>>>>   event_time timestamp,
>>> >>>>>   "timestamp" timeuuid,
>>> >>>>>   sensor_reading map<text, text>,
>>> >>>>>   sensor_serial_number text,
>>> >>>>>   sensor_type int,
>>> >>>>>   PRIMARY KEY ((asset_id), event_time, "timestamp")
>>> >>>>> );
>>> >>>>>
>>> >>>>> CREATE INDEX event_time_index ON sensor_info_table (event_time);
>>> >>>>>
>>> >>>>> CREATE INDEX timestamp_index ON sensor_info_table ("timestamp");
>>> >>>>>
>>> >>>>> Now I am able to insert the data into this table, however I am
>>> >>>>> unable
>>> >>>>> to do following query where I want to select items with specific
>>> >>>>> timeuuid values.
>>> >>>>>
>>> >>>>> It gives me following error.
>>> >>>>>
>>> >>>>> SELECT * from mydb.sensor_info_table where timestamp IN (
>>> >>>>> bfdfa614-3166-11e4-a61d-b888e30f5d17 ,
>>> >>>>> bf4521ac-3166-11e4-87a3-b888e30f5d17) ;
>>> >>>>>
>>> >>>>> Bad Request: PRIMARY KEY column "timestamp" cannot be restricted
>>> >>>>> (preceding column "event_time" is either not restricted or by a
>>> >>>>> non-EQ
>>> >>>>> relation)
>>> >>>>>
>>> >>>>> What do I have to do to make this work?
>>> >>>>>
>>> >>>>> For what its worth I am using django for my front end development
>>> >>>>> and
>>> >>>>> I am using "timestamp timeuuid" field as unique indentifier to
>>> >>>>> reference specific sensor reading from django framework -- since
>>> >>>>> cassandra doesn't have way to generate unique id upon insert (like
>>> >>>>> old-style rdms's auto-fields).
>>> >>>>>
>>> >>>>>
>>> >>>>> Below is software version info.
>>> >>>>>
>>> >>>>> show VERSION ; [cqlsh 4.1.1 | Cassandra 2.0.9 | CQL spec 3.1.1 |
>>> >>>>> Thrift protocol 19.39.0]
>>> >>>>>
>>> >>>>> I really don't understand what the error message preceeding column
>>> >>>>> "event_time" is either not restricted or by no-EQ relation?
>>> >>>>>
>>> >>>>> -Subodh Nijsure
>>> >>
>>> >>
>>> >
>>
>>
>
>