Posted to user@cassandra.apache.org by Raj N <ra...@gmail.com> on 2015/02/14 03:06:38 UTC

Storing bi-temporal data in Cassandra

Has anyone designed a bi-temporal table in Cassandra? It doesn't look like I
can do this using CQL for now. Taking the time series example from well-known
modeling tutorials in Cassandra -

CREATE TABLE temperatures (
weatherstation_id text,
event_time timestamp,
temperature text,
PRIMARY KEY (weatherstation_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

If I add another column transaction_time

CREATE TABLE temperatures (
weatherstation_id text,
event_time timestamp,
transaction_time timestamp,
temperature text,
PRIMARY KEY (weatherstation_id, event_time, transaction_time)
) WITH CLUSTERING ORDER BY (event_time DESC, transaction_time DESC);

If I try to run a query using the following CQL, it throws an error -

select * from temperatures where weatherstation_id = 'foo' and event_time
>= '2015-01-01 00:00:00' and event_time < '2015-01-02 00:00:00' and
transaction_time < '2015-01-02 00:00:00'

It works if I use an equals clause for the event_time. I am trying to get
the state as of a particular transaction_time.

-Raj

Re: Storing bi-temporal data in Cassandra

Posted by Jack Krupansky <ja...@gmail.com>.
The simple, easy way to look at this is that you can use a range when the
data will be contiguous. Only by allowing a range on just the last clustering
column referenced can Cassandra be assured that the rows selected by the range
will be contiguous (a "slice"). The point is that Cassandra is designed for
high-performance operations, with the trade-off that low-performance
operations are simply not supported.
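
To illustrate against the table from the original post (a sketch, reusing the
same placeholder values): equality on event_time plus a range on
transaction_time selects one contiguous slice, whereas ranges on both
clustering columns would need a separate slice per matching event_time and is
rejected by CQL.

-- Accepted: one contiguous slice
select * from temperatures
where weatherstation_id = 'foo'
  and event_time = '2015-01-01 00:00:00'
  and transaction_time < '2015-01-02 00:00:00';

-- Rejected: would need one slice per distinct event_time in the range
select * from temperatures
where weatherstation_id = 'foo'
  and event_time >= '2015-01-01 00:00:00'
  and event_time < '2015-01-02 00:00:00'
  and transaction_time < '2015-01-02 00:00:00';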

DataStax Enterprise with Solr for search and Spark for analytics, or Lucene
indexing of Cassandra data through Stratio or Stargate, will give you much
more flexible query capabilities.

-- Jack Krupansky

On Sat, Feb 14, 2015 at 6:05 PM, Raj N <ra...@gmail.com> wrote:

> I don't think thats solves my problem. The question really is why can't we
> use ranges for both time columns when they are part of the primary key.
> They are on 1 row after all. Is this just a CQL limitation?
>
> -Raj
>
> On Sat, Feb 14, 2015 at 3:35 AM, DuyHai Doan <do...@gmail.com> wrote:
>
>> "I am trying to get the state as of a particular transaction_time"
>>
>>  --> In that case you should probably define your primary key in another
>> order for clustering columns
>>
>> PRIMARY KEY (weatherstation_id,transaction_time,event_time)
>>
>> Then, select * from temperatures where weatherstation_id = 'foo' and
>> event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02
>> 00:00:00' and transaction_time = 'xxxx'
>>
>>
>>
>> On Sat, Feb 14, 2015 at 3:06 AM, Raj N <ra...@gmail.com> wrote:
>>
>>> Has anyone designed a bi-temporal table in Cassandra? Doesn't look like
>>> I can do this using CQL for now. Taking the time series example from well
>>> known modeling tutorials in Cassandra -
>>>
>>> CREATE TABLE temperatures (
>>> weatherstation_id text,
>>> event_time timestamp,
>>> temperature text,
>>> PRIMARY KEY (weatherstation_id,event_time),
>>> ) WITH CLUSTERING ORDER BY (event_time DESC);
>>>
>>> If I add another column transaction_time
>>>
>>> CREATE TABLE temperatures (
>>> weatherstation_id text,
>>> event_time timestamp,
>>> transaction_time timestamp,
>>> temperature text,
>>> PRIMARY KEY (weatherstation_id,event_time,transaction_time),
>>> ) WITH CLUSTERING ORDER BY (event_time DESC, transaction_time DESC);
>>>
>>> If I try to run a query using the following CQL, it throws an error -
>>>
>>> select * from temperatures where weatherstation_id = 'foo' and
>>> event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02
>>> 00:00:00' and transaction_time < '2015-01-02 00:00:00'
>>>
>>> It works if I use an equals clause for the event_time. I am trying to
>>> get the state as of a particular transaction_time
>>>
>>> -Raj
>>>
>>
>>
>

Re: Storing bi-temporal data in Cassandra

Posted by "Laing, Michael" <mi...@nytimes.com>.
Perhaps you should learn more about Cassandra before you ask such questions.

It's easy if you just look at the readily accessible docs.

ml

On Sat, Feb 14, 2015 at 6:05 PM, Raj N <ra...@gmail.com> wrote:

> I don't think thats solves my problem. The question really is why can't we
> use ranges for both time columns when they are part of the primary key.
> They are on 1 row after all. Is this just a CQL limitation?
>
> -Raj
>
> On Sat, Feb 14, 2015 at 3:35 AM, DuyHai Doan <do...@gmail.com> wrote:
>
>> "I am trying to get the state as of a particular transaction_time"
>>
>>  --> In that case you should probably define your primary key in another
>> order for clustering columns
>>
>> PRIMARY KEY (weatherstation_id,transaction_time,event_time)
>>
>> Then, select * from temperatures where weatherstation_id = 'foo' and
>> event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02
>> 00:00:00' and transaction_time = 'xxxx'
>>
>>
>>
>> On Sat, Feb 14, 2015 at 3:06 AM, Raj N <ra...@gmail.com> wrote:
>>
>>> Has anyone designed a bi-temporal table in Cassandra? Doesn't look like
>>> I can do this using CQL for now. Taking the time series example from well
>>> known modeling tutorials in Cassandra -
>>>
>>> CREATE TABLE temperatures (
>>> weatherstation_id text,
>>> event_time timestamp,
>>> temperature text,
>>> PRIMARY KEY (weatherstation_id,event_time),
>>> ) WITH CLUSTERING ORDER BY (event_time DESC);
>>>
>>> If I add another column transaction_time
>>>
>>> CREATE TABLE temperatures (
>>> weatherstation_id text,
>>> event_time timestamp,
>>> transaction_time timestamp,
>>> temperature text,
>>> PRIMARY KEY (weatherstation_id,event_time,transaction_time),
>>> ) WITH CLUSTERING ORDER BY (event_time DESC, transaction_time DESC);
>>>
>>> If I try to run a query using the following CQL, it throws an error -
>>>
>>> select * from temperatures where weatherstation_id = 'foo' and
>>> event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02
>>> 00:00:00' and transaction_time < '2015-01-02 00:00:00'
>>>
>>> It works if I use an equals clause for the event_time. I am trying to
>>> get the state as of a particular transaction_time
>>>
>>> -Raj
>>>
>>
>>
>

Re: Storing bi-temporal data in Cassandra

Posted by Peter Lin <wo...@gmail.com>.
I think I get the basics of what you want to achieve. Side note: the sample
insert seems to have a typo in the transaction time.

For the first query, I would store the data using weatherstation_id as the
key. The create table statement might look like this.

CREATE TABLE weatherstation (
weatherstation_id text,
latest_temperature int,
PRIMARY KEY (weatherstation_id)
);

Each new sensor reading would be a new dynamic column and it would update
the value for latest_temperature.
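
A minimal sketch of that upsert, assuming the table above and an illustrative
station id:

-- Each reading overwrites the single latest_temperature cell for the station
INSERT INTO weatherstation (weatherstation_id, latest_temperature)
VALUES ('1234ABCD', 72);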

For the second query, I get the impression the transactions are done in
batches. Is that interpretation correct?

Assuming inserts are in batches, I would store the same sensor data in a
second table. It might look like this:

CREATE TABLE readings_by_transaction (
transaction_time timestamp,
PRIMARY KEY (transaction_time)
);

You could add a static column and use a collection type to store the values,
or you can use dynamic columns. To get all values for a given transaction
time, you'd simply select all columns where the row key is equal to the
timestamp. With the new collections, it's possible to use a static column
that holds a map<weatherstation_id, temperature>. Each new sensor reading
would be an entry in the map.
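
A sketch of the map-based variant, extending the table above with a regular
(non-static) map column; names and values are illustrative:

-- weatherstation_id -> temperature
ALTER TABLE readings_by_transaction ADD temperatures map<text, int>;

-- Each sensor reading adds or overwrites one entry in the map
UPDATE readings_by_transaction
SET temperatures['1234ABCD'] = 72
WHERE transaction_time = '2015-02-18 07:01:00';

-- All values for a given transaction time
SELECT temperatures FROM readings_by_transaction
WHERE transaction_time = '2015-02-18 07:01:00';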

Taking a step back and looking at the larger problem space: if we use a
loose definition of bi-temporal database, the use case fits, even if it
isn't like classic bi-temporal examples. In this specific case, the
temperature readings are valid only for the instant in time the value was
read. It's not possible to go back in time and re-read the temperature
value, so it's purely write once and read many.

One of the classic bi-temporal query examples is "get me all valid
addresses for [person x] between 2010 and 2012". My guess is your use case
will never need this type of query, given it's temperature data. The
approach I tend to go with is to store the raw data with a unique
immutable key and then have one or more summary tables. It does increase
the storage needs, but there are ways to handle that.
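
A sketch of that raw-plus-summary shape, with illustrative names:

-- Raw readings, keyed by an immutable unique id
CREATE TABLE raw_readings (
reading_id timeuuid,
weatherstation_id text,
event_time timestamp,
transaction_time timestamp,
temperature text,
PRIMARY KEY (reading_id)
);

-- Summary table serving the "latest per event time" query
CREATE TABLE latest_by_event_time (
weatherstation_id text,
event_time timestamp,
temperature text,
PRIMARY KEY (weatherstation_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);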

hope that helps




On Fri, Feb 20, 2015 at 10:20 AM, Raj N <ra...@gmail.com> wrote:

> Thanks for the response Peter. I used the temperature table because its
> the most common example on CQL timeseries and I thought I would reuse it.
> From some of the responses, looks like I was wrong.
>
> event_time is the time the event happened. So yes it is valid time. I was
> trying to see if I can get away with not having valid_from and valid_to in
> Cassandra.
> transaction_time is the time the database record was written.
>
> Let's take an example -
>
> INSERT INTO
> temperature(weatherstation_id,event_time,transaction_time,temperature)
> VALUES (’1234ABCD’,’2015-02-18 07:01:00′,’2015-02-18 07:01:00′,’72F’);
>
> INSERT INTO
> temperature(weatherstation_id,event_time,transaction_time,temperature)
> VALUES (’1234ABCD’,’2015-02-18 08:01:00′,’2015-02-18 08:01:00′,’72F’);
>
> And I get an update for the first record tomorrow, I want to keep both
> versions. So I would have -
>
> INSERT INTO
> temperature(weatherstation_id,event_time,transaction_time,temperature)
> VALUES (’1234ABCD’,’2015-02-18 07:01:00′,’2015-02-*19* 07:01:00′,’72F’);
>
> I fundamentally need to execute 2 types of queries -
>
> 1. select the latest values for the weatherstation for a given event time
> period which should ideally just return the first and third record.
> 2. select the values as of particular transaction time(say ’2015-02-18
> 08:01:00′), in which case I would expect to return first and second
> record.
>
> About your comment on having valid_time in the keys, do I have a choice in
> Cassandra, unless you are suggesting to use secondary indexes.
>
> I am new to bi-temporal data modeling. So please advise if you think
> building this on top of Cassandra is a stupid idea.
>
> -Rajesh
>
>
> On Sun, Feb 15, 2015 at 10:03 AM, Peter Lin <wo...@gmail.com> wrote:
>
>>
>> I've built several different bi-temporal databases over the year for a
>> variety of applications, so I have to ask "why are you modeling it this
>> way?"
>>
>> Having a temperatures table doesn't make sense to me. Normally a
>> bi-temporal database has transaction time and valid time. The transaction
>> time is the timestamp of when the data is saved and valid time is usually
>> expressed as effective & expiration time.
>>
>> In the example, is event time suppose to be valid time and what is the
>> granularity (seconds, hours, or day)?
>> Which kind of queries do you need to query most of the time?
>> Why is event_time and transaction_time part of the key?
>>
>> If you take time to study temporal databases, having valid time in the
>> keys will cause a lot of headaches. Basically, it's an anti-pattern for
>> temporal databases. The key should only be the unique identifier and it
>> shouldn't have transaction or valid time. The use case you describe looks
>> more like regular time series data and not really bi-temporal.
>>
>> I would suggest take time to understand the kinds of queries you need to
>> run and then change the table.
>>
>> peter
>>
>>
>> On Sun, Feb 15, 2015 at 9:00 AM, Jack Krupansky <jack.krupansky@gmail.com
>> > wrote:
>>
>>> I had forgotten, but there is a new tuple notation to iterate over more
>>> than one clustering column in C* 2.0.6:
>>>
>>> https://issues.apache.org/jira/browse/CASSANDRA-4851
>>>
>>> For example,
>>>
>>> SELECT ... WHERE (c1, c2) > (1, 0)
>>>
>>> There's an example in the CQL spec:
>>> https://cassandra.apache.org/doc/cql3/CQL.html
>>>
>>>
>>> -- Jack Krupansky
>>>
>>> On Sat, Feb 14, 2015 at 6:29 PM, Dave Brosius <db...@mebigfatguy.com>
>>> wrote:
>>>
>>>>  As you point out, there's not really a node-based problem with your
>>>> query from a performance point of view. This is a limitation of CQL in
>>>> that, cql wants to slice one section of a partition's row (no matter how
>>>> big the section is). In your case, you are asking to slice multiple
>>>> sections of a partition's row, which currently isn't supported.
>>>>
>>>> It seems silly perhaps that this is the case, as certainly in your
>>>> example it would seem useful, and not to difficult, but the problem is that
>>>> you can wind up with n-depth slicing of that partitioned row given an
>>>> arbitrary query syntax if range queries on clustering keys was allowed
>>>> anywhere.
>>>>
>>>> At present, you can either duplicate the data using the other
>>>> clustering key (transaction_time) as primary clusterer for this use
>>>> case, or omit the 3rd criterion (transaction_time = 'xxxx')in the
>>>> query and get all the range query results and filter on the client.
>>>>
>>>> hth,
>>>> dave
>>>>
>>>>
>>>>
>>>> On 02/14/2015 06:05 PM, Raj N wrote:
>>>>
>>>> I don't think thats solves my problem. The question really is why can't
>>>> we use ranges for both time columns when they are part of the primary key.
>>>> They are on 1 row after all. Is this just a CQL limitation?
>>>>
>>>>  -Raj
>>>>
>>>> On Sat, Feb 14, 2015 at 3:35 AM, DuyHai Doan <do...@gmail.com>
>>>> wrote:
>>>>
>>>>>  "I am trying to get the state as of a particular transaction_time"
>>>>>
>>>>>   --> In that case you should probably define your primary key in
>>>>> another order for clustering columns
>>>>>
>>>>>  PRIMARY KEY (weatherstation_id,transaction_time,event_time)
>>>>>
>>>>>  Then, select * from temperatures where weatherstation_id = 'foo' and
>>>>> event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02
>>>>> 00:00:00' and transaction_time = 'xxxx'
>>>>>
>>>>>
>>>>>
>>>>>  On Sat, Feb 14, 2015 at 3:06 AM, Raj N <ra...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Has anyone designed a bi-temporal table in Cassandra? Doesn't look
>>>>>> like I can do this using CQL for now. Taking the time series example from
>>>>>> well known modeling tutorials in Cassandra -
>>>>>>
>>>>>>  CREATE TABLE temperatures (
>>>>>> weatherstation_id text,
>>>>>> event_time timestamp,
>>>>>> temperature text,
>>>>>> PRIMARY KEY (weatherstation_id,event_time),
>>>>>> ) WITH CLUSTERING ORDER BY (event_time DESC);
>>>>>>
>>>>>>  If I add another column transaction_time
>>>>>>
>>>>>>  CREATE TABLE temperatures (
>>>>>> weatherstation_id text,
>>>>>> event_time timestamp,
>>>>>> transaction_time timestamp,
>>>>>>  temperature text,
>>>>>> PRIMARY KEY (weatherstation_id,event_time,transaction_time),
>>>>>> ) WITH CLUSTERING ORDER BY (event_time DESC, transaction_time DESC);
>>>>>>
>>>>>>  If I try to run a query using the following CQL, it throws an error
>>>>>> -
>>>>>>
>>>>>>  select * from temperatures where weatherstation_id = 'foo' and
>>>>>> event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02
>>>>>> 00:00:00' and transaction_time < '2015-01-02 00:00:00'
>>>>>>
>>>>>>  It works if I use an equals clause for the event_time. I am trying
>>>>>> to get the state as of a particular transaction_time
>>>>>>
>>>>>>  -Raj
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Storing bi-temporal data in Cassandra

Posted by Raj N <ra...@gmail.com>.
Thanks for the response, Peter. I used the temperature table because it's
the most common example of CQL time series and I thought I would reuse it.
From some of the responses, it looks like I was wrong.

event_time is the time the event happened, so yes, it is valid time. I was
trying to see if I can get away with not having valid_from and valid_to in
Cassandra.
transaction_time is the time the database record was written.

Let's take an example -

INSERT INTO
temperature(weatherstation_id,event_time,transaction_time,temperature)
VALUES ('1234ABCD','2015-02-18 07:01:00','2015-02-18 07:01:00','72F');

INSERT INTO
temperature(weatherstation_id,event_time,transaction_time,temperature)
VALUES ('1234ABCD','2015-02-18 08:01:00','2015-02-18 08:01:00','72F');

And if I get an update for the first record tomorrow, I want to keep both
versions. So I would have -

INSERT INTO
temperature(weatherstation_id,event_time,transaction_time,temperature)
VALUES ('1234ABCD','2015-02-18 07:01:00','2015-02-19 07:01:00','72F');

I fundamentally need to execute 2 types of queries -

1. Select the latest values for the weatherstation for a given event time
period, which should ideally just return the first and third records.
2. Select the values as of a particular transaction time (say '2015-02-18
08:01:00'), in which case I would expect to return the first and second
records, as sketched below.
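
For query 2, a sketch of the "as of" read against a copy of the data
clustered by transaction_time first (the table name and layout are
assumptions, along the lines others suggest in this thread); collapsing to
the newest version per event_time would still happen client-side:

select * from temperatures_by_transaction
where weatherstation_id = '1234ABCD'
  and transaction_time <= '2015-02-18 08:01:00';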

About your comment on having valid_time in the keys: do I have a choice in
Cassandra, unless you are suggesting I use secondary indexes?

I am new to bi-temporal data modeling. So please advise if you think
building this on top of Cassandra is a stupid idea.

-Rajesh


On Sun, Feb 15, 2015 at 10:03 AM, Peter Lin <wo...@gmail.com> wrote:

>
> I've built several different bi-temporal databases over the year for a
> variety of applications, so I have to ask "why are you modeling it this
> way?"
>
> Having a temperatures table doesn't make sense to me. Normally a
> bi-temporal database has transaction time and valid time. The transaction
> time is the timestamp of when the data is saved and valid time is usually
> expressed as effective & expiration time.
>
> In the example, is event time suppose to be valid time and what is the
> granularity (seconds, hours, or day)?
> Which kind of queries do you need to query most of the time?
> Why is event_time and transaction_time part of the key?
>
> If you take time to study temporal databases, having valid time in the
> keys will cause a lot of headaches. Basically, it's an anti-pattern for
> temporal databases. The key should only be the unique identifier and it
> shouldn't have transaction or valid time. The use case you describe looks
> more like regular time series data and not really bi-temporal.
>
> I would suggest take time to understand the kinds of queries you need to
> run and then change the table.
>
> peter
>
>
> On Sun, Feb 15, 2015 at 9:00 AM, Jack Krupansky <ja...@gmail.com>
> wrote:
>
>> I had forgotten, but there is a new tuple notation to iterate over more
>> than one clustering column in C* 2.0.6:
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-4851
>>
>> For example,
>>
>> SELECT ... WHERE (c1, c2) > (1, 0)
>>
>> There's an example in the CQL spec:
>> https://cassandra.apache.org/doc/cql3/CQL.html
>>
>>
>> -- Jack Krupansky
>>
>> On Sat, Feb 14, 2015 at 6:29 PM, Dave Brosius <db...@mebigfatguy.com>
>> wrote:
>>
>>>  As you point out, there's not really a node-based problem with your
>>> query from a performance point of view. This is a limitation of CQL in
>>> that, cql wants to slice one section of a partition's row (no matter how
>>> big the section is). In your case, you are asking to slice multiple
>>> sections of a partition's row, which currently isn't supported.
>>>
>>> It seems silly perhaps that this is the case, as certainly in your
>>> example it would seem useful, and not to difficult, but the problem is that
>>> you can wind up with n-depth slicing of that partitioned row given an
>>> arbitrary query syntax if range queries on clustering keys was allowed
>>> anywhere.
>>>
>>> At present, you can either duplicate the data using the other clustering
>>> key (transaction_time) as primary clusterer for this use case, or omit
>>> the 3rd criterion (transaction_time = 'xxxx')in the query and get all
>>> the range query results and filter on the client.
>>>
>>> hth,
>>> dave
>>>
>>>
>>>
>>> On 02/14/2015 06:05 PM, Raj N wrote:
>>>
>>> I don't think thats solves my problem. The question really is why can't
>>> we use ranges for both time columns when they are part of the primary key.
>>> They are on 1 row after all. Is this just a CQL limitation?
>>>
>>>  -Raj
>>>
>>> On Sat, Feb 14, 2015 at 3:35 AM, DuyHai Doan <do...@gmail.com>
>>> wrote:
>>>
>>>>  "I am trying to get the state as of a particular transaction_time"
>>>>
>>>>   --> In that case you should probably define your primary key in
>>>> another order for clustering columns
>>>>
>>>>  PRIMARY KEY (weatherstation_id,transaction_time,event_time)
>>>>
>>>>  Then, select * from temperatures where weatherstation_id = 'foo' and
>>>> event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02
>>>> 00:00:00' and transaction_time = 'xxxx'
>>>>
>>>>
>>>>
>>>>  On Sat, Feb 14, 2015 at 3:06 AM, Raj N <ra...@gmail.com>
>>>> wrote:
>>>>
>>>>> Has anyone designed a bi-temporal table in Cassandra? Doesn't look
>>>>> like I can do this using CQL for now. Taking the time series example from
>>>>> well known modeling tutorials in Cassandra -
>>>>>
>>>>>  CREATE TABLE temperatures (
>>>>> weatherstation_id text,
>>>>> event_time timestamp,
>>>>> temperature text,
>>>>> PRIMARY KEY (weatherstation_id,event_time),
>>>>> ) WITH CLUSTERING ORDER BY (event_time DESC);
>>>>>
>>>>>  If I add another column transaction_time
>>>>>
>>>>>  CREATE TABLE temperatures (
>>>>> weatherstation_id text,
>>>>> event_time timestamp,
>>>>> transaction_time timestamp,
>>>>>  temperature text,
>>>>> PRIMARY KEY (weatherstation_id,event_time,transaction_time),
>>>>> ) WITH CLUSTERING ORDER BY (event_time DESC, transaction_time DESC);
>>>>>
>>>>>  If I try to run a query using the following CQL, it throws an error -
>>>>>
>>>>>  select * from temperatures where weatherstation_id = 'foo' and
>>>>> event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02
>>>>> 00:00:00' and transaction_time < '2015-01-02 00:00:00'
>>>>>
>>>>>  It works if I use an equals clause for the event_time. I am trying
>>>>> to get the state as of a particular transaction_time
>>>>>
>>>>>  -Raj
>>>>>
>>>>
>>>>
>>>
>>>
>>
>

Re: Storing bi-temporal data in Cassandra

Posted by Peter Lin <wo...@gmail.com>.
I've built several different bi-temporal databases over the years for a
variety of applications, so I have to ask "why are you modeling it this
way?"

Having a temperatures table doesn't make sense to me. Normally a
bi-temporal database has transaction time and valid time. The transaction
time is the timestamp of when the data is saved and valid time is usually
expressed as effective & expiration time.

In the example, is event time supposed to be valid time, and what is the
granularity (seconds, hours, or days)?
Which kinds of queries do you need to run most of the time?
Why are event_time and transaction_time part of the key?

If you take time to study temporal databases, having valid time in the keys
will cause a lot of headaches. Basically, it's an anti-pattern for temporal
databases. The key should only be the unique identifier and it shouldn't
have transaction or valid time. The use case you describe looks more like
regular time series data and not really bi-temporal.
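
As a hypothetical illustration of those column roles (not Raj's table, and
the names are made up): the key carries only an identifier, while both time
dimensions live in ordinary columns.

CREATE TABLE person_address (
person_id uuid,
record_id timeuuid,          -- unique, immutable identifier for this version
tx_time timestamp,           -- transaction time: when the row was written
effective_time timestamp,    -- valid time: when the address became true
expiration_time timestamp,   -- valid time: when it stopped being true
address text,
PRIMARY KEY (person_id, record_id)
);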

I would suggest taking time to understand the kinds of queries you need to
run, and then changing the table.

peter


On Sun, Feb 15, 2015 at 9:00 AM, Jack Krupansky <ja...@gmail.com>
wrote:

> I had forgotten, but there is a new tuple notation to iterate over more
> than one clustering column in C* 2.0.6:
>
> https://issues.apache.org/jira/browse/CASSANDRA-4851
>
> For example,
>
> SELECT ... WHERE (c1, c2) > (1, 0)
>
> There's an example in the CQL spec:
> https://cassandra.apache.org/doc/cql3/CQL.html
>
>
> -- Jack Krupansky
>
> On Sat, Feb 14, 2015 at 6:29 PM, Dave Brosius <db...@mebigfatguy.com>
> wrote:
>
>>  As you point out, there's not really a node-based problem with your
>> query from a performance point of view. This is a limitation of CQL in
>> that, cql wants to slice one section of a partition's row (no matter how
>> big the section is). In your case, you are asking to slice multiple
>> sections of a partition's row, which currently isn't supported.
>>
>> It seems silly perhaps that this is the case, as certainly in your
>> example it would seem useful, and not to difficult, but the problem is that
>> you can wind up with n-depth slicing of that partitioned row given an
>> arbitrary query syntax if range queries on clustering keys was allowed
>> anywhere.
>>
>> At present, you can either duplicate the data using the other clustering
>> key (transaction_time) as primary clusterer for this use case, or omit
>> the 3rd criterion (transaction_time = 'xxxx')in the query and get all
>> the range query results and filter on the client.
>>
>> hth,
>> dave
>>
>>
>>
>> On 02/14/2015 06:05 PM, Raj N wrote:
>>
>> I don't think thats solves my problem. The question really is why can't
>> we use ranges for both time columns when they are part of the primary key.
>> They are on 1 row after all. Is this just a CQL limitation?
>>
>>  -Raj
>>
>> On Sat, Feb 14, 2015 at 3:35 AM, DuyHai Doan <do...@gmail.com>
>> wrote:
>>
>>>  "I am trying to get the state as of a particular transaction_time"
>>>
>>>   --> In that case you should probably define your primary key in
>>> another order for clustering columns
>>>
>>>  PRIMARY KEY (weatherstation_id,transaction_time,event_time)
>>>
>>>  Then, select * from temperatures where weatherstation_id = 'foo' and
>>> event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02
>>> 00:00:00' and transaction_time = 'xxxx'
>>>
>>>
>>>
>>>  On Sat, Feb 14, 2015 at 3:06 AM, Raj N <ra...@gmail.com> wrote:
>>>
>>>> Has anyone designed a bi-temporal table in Cassandra? Doesn't look like
>>>> I can do this using CQL for now. Taking the time series example from well
>>>> known modeling tutorials in Cassandra -
>>>>
>>>>  CREATE TABLE temperatures (
>>>> weatherstation_id text,
>>>> event_time timestamp,
>>>> temperature text,
>>>> PRIMARY KEY (weatherstation_id,event_time),
>>>> ) WITH CLUSTERING ORDER BY (event_time DESC);
>>>>
>>>>  If I add another column transaction_time
>>>>
>>>>  CREATE TABLE temperatures (
>>>> weatherstation_id text,
>>>> event_time timestamp,
>>>> transaction_time timestamp,
>>>>  temperature text,
>>>> PRIMARY KEY (weatherstation_id,event_time,transaction_time),
>>>> ) WITH CLUSTERING ORDER BY (event_time DESC, transaction_time DESC);
>>>>
>>>>  If I try to run a query using the following CQL, it throws an error -
>>>>
>>>>  select * from temperatures where weatherstation_id = 'foo' and
>>>> event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02
>>>> 00:00:00' and transaction_time < '2015-01-02 00:00:00'
>>>>
>>>>  It works if I use an equals clause for the event_time. I am trying to
>>>> get the state as of a particular transaction_time
>>>>
>>>>  -Raj
>>>>
>>>
>>>
>>
>>
>

Re: Storing bi-temporal data in Cassandra

Posted by Jack Krupansky <ja...@gmail.com>.
I had forgotten, but there is a new tuple notation to iterate over more
than one clustering column in C* 2.0.6:

https://issues.apache.org/jira/browse/CASSANDRA-4851

For example,

SELECT ... WHERE (c1, c2) > (1, 0)

There's an example in the CQL spec:
https://cassandra.apache.org/doc/cql3/CQL.html
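
Applied to the table from the original post, the tuple form might look like
the sketch below. Note that the tuple comparison is lexicographic over the
clustering order, so it still describes a single contiguous slice rather than
independent ranges on event_time and transaction_time.

select * from temperatures
where weatherstation_id = 'foo'
  and (event_time, transaction_time) >= ('2015-01-01 00:00:00', '2015-01-01 00:00:00')
  and (event_time, transaction_time) < ('2015-01-02 00:00:00', '2015-01-02 00:00:00');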


-- Jack Krupansky

On Sat, Feb 14, 2015 at 6:29 PM, Dave Brosius <db...@mebigfatguy.com>
wrote:

>  As you point out, there's not really a node-based problem with your query
> from a performance point of view. This is a limitation of CQL in that, cql
> wants to slice one section of a partition's row (no matter how big the
> section is). In your case, you are asking to slice multiple sections of a
> partition's row, which currently isn't supported.
>
> It seems silly perhaps that this is the case, as certainly in your example
> it would seem useful, and not to difficult, but the problem is that you can
> wind up with n-depth slicing of that partitioned row given an arbitrary
> query syntax if range queries on clustering keys was allowed anywhere.
>
> At present, you can either duplicate the data using the other clustering
> key (transaction_time) as primary clusterer for this use case, or omit
> the 3rd criterion (transaction_time = 'xxxx')in the query and get all the
> range query results and filter on the client.
>
> hth,
> dave
>
>
>
> On 02/14/2015 06:05 PM, Raj N wrote:
>
> I don't think thats solves my problem. The question really is why can't we
> use ranges for both time columns when they are part of the primary key.
> They are on 1 row after all. Is this just a CQL limitation?
>
>  -Raj
>
> On Sat, Feb 14, 2015 at 3:35 AM, DuyHai Doan <do...@gmail.com> wrote:
>
>>  "I am trying to get the state as of a particular transaction_time"
>>
>>   --> In that case you should probably define your primary key in
>> another order for clustering columns
>>
>>  PRIMARY KEY (weatherstation_id,transaction_time,event_time)
>>
>>  Then, select * from temperatures where weatherstation_id = 'foo' and
>> event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02
>> 00:00:00' and transaction_time = 'xxxx'
>>
>>
>>
>>  On Sat, Feb 14, 2015 at 3:06 AM, Raj N <ra...@gmail.com> wrote:
>>
>>> Has anyone designed a bi-temporal table in Cassandra? Doesn't look like
>>> I can do this using CQL for now. Taking the time series example from well
>>> known modeling tutorials in Cassandra -
>>>
>>>  CREATE TABLE temperatures (
>>> weatherstation_id text,
>>> event_time timestamp,
>>> temperature text,
>>> PRIMARY KEY (weatherstation_id,event_time),
>>> ) WITH CLUSTERING ORDER BY (event_time DESC);
>>>
>>>  If I add another column transaction_time
>>>
>>>  CREATE TABLE temperatures (
>>> weatherstation_id text,
>>> event_time timestamp,
>>> transaction_time timestamp,
>>>  temperature text,
>>> PRIMARY KEY (weatherstation_id,event_time,transaction_time),
>>> ) WITH CLUSTERING ORDER BY (event_time DESC, transaction_time DESC);
>>>
>>>  If I try to run a query using the following CQL, it throws an error -
>>>
>>>  select * from temperatures where weatherstation_id = 'foo' and
>>> event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02
>>> 00:00:00' and transaction_time < '2015-01-02 00:00:00'
>>>
>>>  It works if I use an equals clause for the event_time. I am trying to
>>> get the state as of a particular transaction_time
>>>
>>>  -Raj
>>>
>>
>>
>
>

Re: Storing bi-temporal data in Cassandra

Posted by Dave Brosius <db...@mebigfatguy.com>.
As you point out, there's not really a node-based problem with your
query from a performance point of view. This is a limitation of CQL in
that CQL wants to slice one section of a partition's row (no matter how
big the section is). In your case, you are asking to slice multiple
sections of a partition's row, which currently isn't supported.

It seems silly perhaps that this is the case, as certainly in your
example it would seem useful and not too difficult, but the problem is
that you could wind up with n-depth slicing of that partitioned row given
an arbitrary query syntax if range queries on clustering keys were
allowed anywhere.

At present, you can either duplicate the data using the other clustering
key (transaction_time) as the primary clusterer for this use case, or omit
the 3rd criterion (transaction_time = 'xxxx') in the query, get all
the range query results, and filter on the client.
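
A sketch of the second option against the original table: run the allowed
event_time slice and apply the transaction_time filter in the application.

-- Fetch the whole event_time slice; filter transaction_time client-side
select * from temperatures
where weatherstation_id = 'foo'
  and event_time >= '2015-01-01 00:00:00'
  and event_time < '2015-01-02 00:00:00';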

hth,
dave


On 02/14/2015 06:05 PM, Raj N wrote:
> I don't think thats solves my problem. The question really is why 
> can't we use ranges for both time columns when they are part of the 
> primary key. They are on 1 row after all. Is this just a CQL limitation?
>
> -Raj
>
> On Sat, Feb 14, 2015 at 3:35 AM, DuyHai Doan <doanduyhai@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     "I am trying to get the state as of a particular transaction_time"
>
>      --> In that case you should probably define your primary key in
>     another order for clustering columns
>
>     PRIMARY KEY (weatherstation_id,transaction_time,event_time)
>
>     Then, select * from temperatures where weatherstation_id = 'foo'
>     and event_time >= '2015-01-01 00:00:00' and event_time <
>     '2015-01-02 00:00:00' and transaction_time = 'xxxx'
>
>
>
>     On Sat, Feb 14, 2015 at 3:06 AM, Raj N <raj.cassandra@gmail.com
>     <ma...@gmail.com>> wrote:
>
>         Has anyone designed a bi-temporal table in Cassandra? Doesn't
>         look like I can do this using CQL for now. Taking the time
>         series example from well known modeling tutorials in Cassandra -
>
>         CREATE TABLE temperatures (
>         weatherstation_id text,
>         event_time timestamp,
>         temperature text,
>         PRIMARY KEY (weatherstation_id,event_time),
>         ) WITH CLUSTERING ORDER BY (event_time DESC);
>
>         If I add another column transaction_time
>
>         CREATE TABLE temperatures (
>         weatherstation_id text,
>         event_time timestamp,
>         transaction_time timestamp,
>         temperature text,
>         PRIMARY KEY (weatherstation_id,event_time,transaction_time),
>         ) WITH CLUSTERING ORDER BY (event_time DESC, transaction_time
>         DESC);
>
>         If I try to run a query using the following CQL, it throws an
>         error -
>
>         select * from temperatures where weatherstation_id = 'foo' and
>         event_time >= '2015-01-01 00:00:00' and event_time <
>         '2015-01-02 00:00:00' and transaction_time < '2015-01-02 00:00:00'
>
>         It works if I use an equals clause for the event_time. I am
>         trying to get the state as of a particular transaction_time
>
>         -Raj
>
>
>


Re: Storing bi-temporal data in Cassandra

Posted by Raj N <ra...@gmail.com>.
I don't think that solves my problem. The question really is why we can't
use ranges for both time columns when they are part of the primary key.
They are in one row after all. Is this just a CQL limitation?

-Raj

On Sat, Feb 14, 2015 at 3:35 AM, DuyHai Doan <do...@gmail.com> wrote:

> "I am trying to get the state as of a particular transaction_time"
>
>  --> In that case you should probably define your primary key in another
> order for clustering columns
>
> PRIMARY KEY (weatherstation_id,transaction_time,event_time)
>
> Then, select * from temperatures where weatherstation_id = 'foo' and
> event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02
> 00:00:00' and transaction_time = 'xxxx'
>
>
>
> On Sat, Feb 14, 2015 at 3:06 AM, Raj N <ra...@gmail.com> wrote:
>
>> Has anyone designed a bi-temporal table in Cassandra? Doesn't look like I
>> can do this using CQL for now. Taking the time series example from well
>> known modeling tutorials in Cassandra -
>>
>> CREATE TABLE temperatures (
>> weatherstation_id text,
>> event_time timestamp,
>> temperature text,
>> PRIMARY KEY (weatherstation_id,event_time),
>> ) WITH CLUSTERING ORDER BY (event_time DESC);
>>
>> If I add another column transaction_time
>>
>> CREATE TABLE temperatures (
>> weatherstation_id text,
>> event_time timestamp,
>> transaction_time timestamp,
>> temperature text,
>> PRIMARY KEY (weatherstation_id,event_time,transaction_time),
>> ) WITH CLUSTERING ORDER BY (event_time DESC, transaction_time DESC);
>>
>> If I try to run a query using the following CQL, it throws an error -
>>
>> select * from temperatures where weatherstation_id = 'foo' and event_time
>> >= '2015-01-01 00:00:00' and event_time < '2015-01-02 00:00:00' and
>> transaction_time < '2015-01-02 00:00:00'
>>
>> It works if I use an equals clause for the event_time. I am trying to get
>> the state as of a particular transaction_time
>>
>> -Raj
>>
>
>

Re: Storing bi-temporal data in Cassandra

Posted by DuyHai Doan <do...@gmail.com>.
"I am trying to get the state as of a particular transaction_time"

 --> In that case you should probably define your primary key in another
order for clustering columns

PRIMARY KEY (weatherstation_id,transaction_time,event_time)

Then, select * from temperatures where weatherstation_id = 'foo' and
event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02
00:00:00' and transaction_time = 'xxxx'
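
The table behind that key order might look like the sketch below (the name
and the DESC clustering order are assumptions):

CREATE TABLE temperatures_by_transaction (
weatherstation_id text,
transaction_time timestamp,
event_time timestamp,
temperature text,
PRIMARY KEY (weatherstation_id, transaction_time, event_time)
) WITH CLUSTERING ORDER BY (transaction_time DESC, event_time DESC);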



On Sat, Feb 14, 2015 at 3:06 AM, Raj N <ra...@gmail.com> wrote:

> Has anyone designed a bi-temporal table in Cassandra? Doesn't look like I
> can do this using CQL for now. Taking the time series example from well
> known modeling tutorials in Cassandra -
>
> CREATE TABLE temperatures (
> weatherstation_id text,
> event_time timestamp,
> temperature text,
> PRIMARY KEY (weatherstation_id,event_time),
> ) WITH CLUSTERING ORDER BY (event_time DESC);
>
> If I add another column transaction_time
>
> CREATE TABLE temperatures (
> weatherstation_id text,
> event_time timestamp,
> transaction_time timestamp,
> temperature text,
> PRIMARY KEY (weatherstation_id,event_time,transaction_time),
> ) WITH CLUSTERING ORDER BY (event_time DESC, transaction_time DESC);
>
> If I try to run a query using the following CQL, it throws an error -
>
> select * from temperatures where weatherstation_id = 'foo' and event_time
> >= '2015-01-01 00:00:00' and event_time < '2015-01-02 00:00:00' and
> transaction_time < '2015-01-02 00:00:00'
>
> It works if I use an equals clause for the event_time. I am trying to get
> the state as of a particular transaction_time
>
> -Raj
>