You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Denis Mikhaylov <no...@gmail.com> on 2016/09/23 08:42:11 UTC

Help on temporal data modeling

Hi!

I have question regarding data modelling.

Let’s say that I have `subscriptions` table with two columns `subscription_id text` and `next_billing_date timestamp`.

How do I model a table to efficiently query all subscriptions due today (something like `where next_billing_date <= today`)

Re: Help on temporal data modeling

Posted by Peter Lin <wo...@gmail.com>.
yes it would. Whether next_billing_date is timestamp or date wouldn't make
any difference on scanning all partitions. If you want to them to be on the
same node, you can use composite key, but there's a trade off. The nodes
may get unbalanced, so you have to do the math to figure out if your
specific use case would lead to unbalanced nodes.

These kinds of partitioning problems have been around for over 30 yrs, so
it's not a problem that will magically go away. In theory, if each
partition is small enough the query times "should" be acceptable. Having a
few large partitions isn't good in practice regardless of whether it's
RDBMS or NoSql

https://docs.datastax.com/en/cql/3.1/cql/cql_reference/refCompositePk.html



On Fri, Sep 23, 2016 at 7:36 AM, Denis Mikhaylov <no...@gmail.com> wrote:

> Thanks, for you answer.
>
> Wouldn’t simple `select * from subscriptions where next_billing_date =
> '2016-10-25’` require full scan of all partitions?
>
>

Re: Help on temporal data modeling

Posted by Denis Mikhaylov <no...@gmail.com>.
Thanks, for you answer.

Wouldn’t simple `select * from subscriptions where next_billing_date = '2016-10-25’` require full scan of all partitions?

> On 23 Sep 2016, at 14:28, Peter Lin <wo...@gmail.com> wrote:
> 
> 
> Ignoring noSql for a minute, the standard way of modeling this in car and health insurance is with effective/expiration day. Commonly called bi-temporal data modeling.
> 
> How people model bi-temporal models varies quite a bit from first hand experience, but the common thing is to have transaction timestamp, effective day and expiration day. This way, when the system needs to get all the policies that expires in 30 days, it would run a query like this
> 
> select * from policies where expiration = '2016-10-25'
> 
> the batch application is responsible for calculating the expiration by getting today's date + 30 days. The first thing I would change is modify next_billing_date from timestamp to date and set the hour to 00:00:00. It doesn't make sense to use timestamp datatype for this particular use case.
> 
> Getting back to Cassandra and CQL. You should be able to just query it without having to resort to materialized views and secondary indexes. If you really want to index it, use Solr or buy a license of DSE. The temporal database I built on top of Cassandra uses Solr, which gives me greater flexibility and control over indexes and query performance.
> 
> 
> On Fri, Sep 23, 2016 at 4:42 AM, Denis Mikhaylov <no...@gmail.com> wrote:
> Hi!
> 
> I have question regarding data modelling.
> 
> Let’s say that I have `subscriptions` table with two columns `subscription_id text` and `next_billing_date timestamp`.
> 
> How do I model a table to efficiently query all subscriptions due today (something like `where next_billing_date <= today`)
> 


Re: Help on temporal data modeling

Posted by Peter Lin <wo...@gmail.com>.
Ignoring noSql for a minute, the standard way of modeling this in car and
health insurance is with effective/expiration day. Commonly called
bi-temporal data modeling.

How people model bi-temporal models varies quite a bit from first hand
experience, but the common thing is to have transaction timestamp,
effective day and expiration day. This way, when the system needs to get
all the policies that expires in 30 days, it would run a query like this

select * from policies where expiration = '2016-10-25'

the batch application is responsible for calculating the expiration by
getting today's date + 30 days. The first thing I would change is modify
next_billing_date from timestamp to date and set the hour to 00:00:00. It
doesn't make sense to use timestamp datatype for this particular use case.

Getting back to Cassandra and CQL. You should be able to just query it
without having to resort to materialized views and secondary indexes. If
you really want to index it, use Solr or buy a license of DSE. The temporal
database I built on top of Cassandra uses Solr, which gives me greater
flexibility and control over indexes and query performance.


On Fri, Sep 23, 2016 at 4:42 AM, Denis Mikhaylov <no...@gmail.com> wrote:

> Hi!
>
> I have question regarding data modelling.
>
> Let’s say that I have `subscriptions` table with two columns
> `subscription_id text` and `next_billing_date timestamp`.
>
> How do I model a table to efficiently query all subscriptions due today
> (something like `where next_billing_date <= today`)

Re: Help on temporal data modeling

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Hi Denis,

You might want to have a look at

- Materialized views
http://www.datastax.com/dev/blog/new-in-cassandra-3-0-materialized-views
- Secondary index
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useWhenIndex.html

My 2 cents: make sure to understand the implications before moving forward,
what is happening under the hood or at least that your use case fits with
the recommendations. Indexes and views are not free. If you want to know
more, you might want to watch this:
https://www.youtube.com/watch?v=dxiuQ2CkXfM from DuyHai at the summit this
year :-).

C*heers,
-----------------------
Alain Rodriguez - @arodream - alain@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-09-23 10:42 GMT+02:00 Denis Mikhaylov <no...@gmail.com>:

> Hi!
>
> I have question regarding data modelling.
>
> Let’s say that I have `subscriptions` table with two columns
> `subscription_id text` and `next_billing_date timestamp`.
>
> How do I model a table to efficiently query all subscriptions due today
> (something like `where next_billing_date <= today`)