Posted to user@cassandra.apache.org by Abhishek Singh Bailoo <ab...@gmail.com> on 2015/06/05 23:10:30 UTC

Avoiding Data Duplication

Hello!

I have a column family to log data coming from my GPS devices.

CREATE TABLE log(
  imei ascii,
  date ascii,
  dtime timestamp,
  data ascii,
  stime timestamp,
  PRIMARY KEY ((imei, date), dtime))
  WITH CLUSTERING ORDER BY (dtime DESC)
;

It is the standard schema for modeling time series data where
imei is the unique ID associated with each GPS device
date is the date taken from dtime
dtime is the date-time coming from the device
data is all the latitude, longitude etc that the device is sending us
stime is the date-time stamp of the server

The reason why I put dtime in the primary key as the clustering column is
because most of our queries are done on device time. There can be a delay
of a few minutes to a few hours (or a few days! in rare cases) between
dtime and stime if the network is not available.

However, now we want to query on server time as well, for the purpose of
debugging. These queries will not be as common as queries on device time.
Say for every 100 queries on dtime there will be just 1 query on stime.

What options do I have?

1. Secondary Index - not possible because stime is a timestamp and CQL does
not allow me to put < or > in the query on a secondary-indexed column
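For illustration, here is roughly what the rejected approach would look
like (the index name is made up; secondary indexes in CQL only support
equality predicates on the indexed column):

CREATE INDEX log_stime_idx ON log (stime);

-- equality on the indexed column works
SELECT * FROM log WHERE stime = '2015-06-05 12:00:00';

-- a range predicate on the indexed column is rejected
SELECT * FROM log WHERE stime > '2015-06-05 00:00:00';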

2. Data duplication - I can build another column family where I will index
by stime but that means I am storing twice as much data. I know everyone
says that write operations are cheap and storage is cheap but how? If I
have to buy twice as many machines on AWS EC2 each with their own ephemeral
storage, then my bill doubles up!
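For concreteness, the duplicated table in option 2 would look something
like this (the name log_by_stime and the sdate bucket column are
illustrative, not from the original post):

CREATE TABLE log_by_stime(
  imei ascii,
  sdate ascii,       -- date bucket taken from stime
  stime timestamp,
  data ascii,
  dtime timestamp,
  PRIMARY KEY ((imei, sdate), stime))
  WITH CLUSTERING ORDER BY (stime DESC)
;

Every incoming row would then be written to both tables (e.g. in a logged
batch), which is exactly where the doubled storage comes from: the data
field is stored twice.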

Any other ideas I can try?

Many Thanks,
Abhishek

Re: Avoiding Data Duplication

Posted by Paulo Motta <pa...@gmail.com>.
Some options I can think of:

1 - depending on your data size and stime query frequency, you may use
Spark to perform queries filtering by server time on the log table, perhaps
within a device-time window to reduce the dataset your Spark job needs
to go through. More info on the Spark connector:
https://github.com/datastax/spark-cassandra-connector

2 - if dtime and stime almost always fall in the same date bucket
(day/hour/minute/second), you may create an additional table stime_log
with the same structure, but whose date bucket is derived from the stime
field. When an entry's stime and dtime are not from the same bucket, you
insert that entry in both the log and stime_log tables. When you want to
query entries by stime, you take the distinct union of the queries against
both tables in your client application. This way, you only duplicate
delayed data.
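A rough sketch of that layout (table and column names are illustrative):

CREATE TABLE stime_log(
  imei ascii,
  sdate ascii,       -- date bucket derived from stime, not dtime
  dtime timestamp,
  data ascii,
  stime timestamp,
  PRIMARY KEY ((imei, sdate), dtime))
  WITH CLUSTERING ORDER BY (dtime DESC)
;

-- written only when the stime bucket differs from the dtime bucket:
INSERT INTO stime_log (imei, sdate, dtime, data, stime)
VALUES ('123456789012345', '2015-06-05', '2015-06-03 10:00:00',
        '...', '2015-06-05 09:58:12');

To query by stime you would then read both log and stime_log for the
server-date bucket and take the distinct union client side.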

3 - if your "data" field is big and you can't afford duplicating it,
create an additional table stime_log, but do not store the data field,
only the metadata (imei, date, dtime, stime). When you want to query by
stime, first query stime_log, and then query the original log table to
fetch the data field.
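A sketch of that metadata-only variant (names and sample values are
illustrative); dtime is kept so the matching rows can be fetched back
from log:

CREATE TABLE stime_log(
  imei ascii,
  sdate ascii,       -- date bucket derived from stime
  stime timestamp,
  dtime timestamp,
  PRIMARY KEY ((imei, sdate), stime))
  WITH CLUSTERING ORDER BY (stime DESC)
;

-- step 1: find the matching dtimes by server time
SELECT dtime FROM stime_log
WHERE imei = '123456789012345' AND sdate = '2015-06-05'
  AND stime > '2015-06-05 09:00:00';

-- step 2: fetch the data field from the original table,
-- using the date bucket derived from each returned dtime
SELECT data FROM log
WHERE imei = '123456789012345' AND date = '2015-06-03'
  AND dtime = '2015-06-03 10:00:00';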
