Posted to user@cassandra.apache.org by Chin Ko <ck...@gmail.com> on 2012/12/11 15:23:54 UTC

Selecting rows efficiently from a Cassandra CF containing time series data

I would like to get some opinions on how to select an incremental range of
rows efficiently from a Cassandra CF containing time series data.

Background:
We have a web application that uses a Cassandra CF as logging storage. We
insert a row into the CF for every "event" of each user of the web
application. The row key is timestamp+userid. The column values are
unstructured data. We only insert rows but never update or delete any rows
in the CF.

Data volume:
The CF grows by about 0.5 million rows per day. We have a 4 node cluster
and use the RandomPartitioner to spread the rows across the nodes.

Requirements:
There is a need to transfer the Cassandra data to another relational
database periodically. Due to the large size of the CF, instead of
truncating the relational table and reloading all rows into it each time,
we plan to run a job to select the "delta" rows since the last run and
insert them into the relational database.

We would like to have some flexibility in how often the data transfer job
is run. It may run several times a day, or not at all on a given day.

Options considered:
- We are using RandomPartitioner, so range scan by row key is not feasible.
- Add a secondary index on the timestamp column, but reading rows via
secondary index still requires an equality condition and does not support
range scan.
- Add a secondary index on a column containing the date and hour of the
timestamp. Iterate over each hour between the time the job was last run and
now, and fetch all rows for each hour (a sketch follows this list).
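
To make the third option concrete, here is a rough sketch of the hour
iteration we have in mind (the "yyyyMMddHH" bucket format is just an
assumption for illustration, not a schema we have settled on):

import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.TimeZone;

public class HourBuckets {

    private static final long HOUR_MS = 60L * 60L * 1000L;

    // Enumerate every "yyyyMMddHH" bucket from lastRun up to now. Each bucket
    // value would then be used as the equality predicate against the
    // secondary-indexed hour column.
    public static List<String> bucketsBetween(long lastRunMs, long nowMs) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHH");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        List<String> buckets = new ArrayList<String>();
        // Truncate the start down to the top of its hour so the first bucket is whole.
        for (long t = (lastRunMs / HOUR_MS) * HOUR_MS; t <= nowMs; t += HOUR_MS) {
            buckets.add(fmt.format(new Date(t)));
        }
        return buckets;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        for (String bucket : bucketsBetween(now - 5 * HOUR_MS, now)) {
            // Real job: fetch all rows whose hour column equals this bucket,
            // then insert them into the relational database.
            System.out.println("would fetch rows for hour bucket " + bucket);
        }
    }
}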

I would appreciate any ideas of other design options of the Cassandra CF to
enable extracting the rows efficiently.

Besides Java, has anyone used any ETL tools to do this kind of delta
extraction from Cassandra?

Thanks,
Chin

Re: Selecting rows efficiently from a Cassandra CF containing time series data

Posted by aaron morton <aa...@thelastpickle.com>.
Couple of ideas. One is to multiplex the event log stream (using Flume or Kafka) and feed it straight into your secondary system. The event system should allow you to rate limit inserts if that is a concern (a rough sketch follows).
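
As an illustration of the multiplexing idea (using the modern Kafka producer API purely for illustration; the broker address, topic name, and payload are made up):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventMultiplexer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<String, String>(props);
        try {
            // Publish each application event once. One consumer writes it to
            // Cassandra; a second, independent consumer feeds the relational
            // database and can lag or rate-limit without affecting the first.
            String userId = "user42"; // hypothetical event
            String eventJson = "{\"ts\":1355236954000,\"action\":\"login\"}";
            producer.send(new ProducerRecord<String, String>("user-events", userId, eventJson));
        } finally {
            producer.close();
        }
    }
}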

The other is to use partitioning.

Group the log entries per user into some sensible partition, e.g. per day or per week. So your row key is "user_id : partition_start". 

You can then keep a record of dirty partitions; this can be tricky depending on scale. It could be a row for each user, with a column for each dirty partition. Loading the delta then requires a range scan over the dirty-partitions CF to read all of its rows, and then reading each dirty partition for each user (see the sketch below). You would want to look at a low gc_grace_seconds and leveled compaction for the dirty-partitions CF.
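
A quick sketch of the bookkeeping, with an in-memory map standing in for the dirty-partitions CF (the weekly partition size and all names are illustrative only):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DirtyPartitionTracker {

    private static final long WEEK_MS = 7L * 24L * 60L * 60L * 1000L;

    // Stand-in for the dirty-partitions CF: a row per user whose columns are
    // the start times of that user's dirty partitions.
    private final Map<String, Set<Long>> dirty = new HashMap<String, Set<Long>>();

    // Row key for the event CF: "user_id : partition_start".
    static String eventRowKey(String userId, long timestampMs) {
        return userId + ":" + partitionStart(timestampMs);
    }

    // Truncate a timestamp to the start of its weekly partition.
    static long partitionStart(long timestampMs) {
        return (timestampMs / WEEK_MS) * WEEK_MS;
    }

    // Called on every event write: mark the partition the event landed in dirty.
    void recordEvent(String userId, long timestampMs) {
        Set<Long> partitions = dirty.get(userId);
        if (partitions == null) {
            partitions = new HashSet<Long>();
            dirty.put(userId, partitions);
        }
        partitions.add(partitionStart(timestampMs));
        // Real version: insert a column named partitionStart(timestampMs) into
        // the user's row in the dirty-partitions CF. The delta job range scans
        // that CF, reads each dirty event partition, and deletes the markers
        // it has processed (hence the low gc_grace_seconds suggestion).
    }
}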

Hope that helps.  


-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 12/12/2012, at 7:20 AM, "Hiller, Dean" <De...@nrel.gov> wrote:

> [Dean's reply quoted in full; snipped. His post appears in full below.]


Re: Selecting rows efficiently from a Cassandra CF containing time series data

Posted by "Hiller, Dean" <De...@nrel.gov>.
Wide rows do not work well once you get past about 10,000,000 columns, though, so be very, very careful there. PlayOrm maintains some wide-row indices for us, and each index row grows as long as the number of rows in its partition, so without PlayOrm you could do the partitioning yourself, by the way. It's as simple as storing every row and then adding an entry to the partition's index (see the sketch below).
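
A bare-bones sketch of that do-it-yourself partition index, with an in-memory map standing in for the index CF (the daily partition size is just an example):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionIndex {

    private static final long DAY_MS = 24L * 60L * 60L * 1000L;

    // Stand-in for the index CF: one wide row per partition whose columns are
    // the keys of every event row falling inside that partition. Capping the
    // partition at a day (or week) keeps each index row well under the point
    // where wide rows get painful.
    private final Map<Long, List<String>> index = new HashMap<Long, List<String>>();

    void writeEvent(String eventRowKey, long timestampMs) {
        // 1. Store the event row itself (omitted here).
        // 2. Append its key to the index row for the partition it falls in.
        long partition = (timestampMs / DAY_MS) * DAY_MS;
        List<String> keys = index.get(partition);
        if (keys == null) {
            keys = new ArrayList<String>();
            index.put(partition, keys);
        }
        keys.add(eventRowKey);
    }
}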

Later,
Dean


From: Andrey Ilinykh <ai...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Tuesday, December 11, 2012 10:45 AM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Selecting rows efficiently from a Cassandra CF containing time series data

I would consider using wide rows. If you add a timestamp to your column name you have naturally sorted data. You can easily select any time range without any indexes.

Re: Selecting rows efficiently from a Cassandra CF containing time series data

Posted by Andrey Ilinykh <ai...@gmail.com>.
I would consider using wide rows. If you add a timestamp to your column
name, you have naturally sorted data. You can easily select any time range
without any indexes (a sketch is below).
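
A small sketch of the idea, using a TreeMap as an in-memory analogue of one wide row whose column names are timestamps (a real slice query against Cassandra works the same way with start/end column names):

import java.util.SortedMap;
import java.util.TreeMap;

public class WideRowSketch {

    public static void main(String[] args) {
        // In-memory analogue of one wide row: with a LongType comparator the
        // columns stay sorted by the timestamp embedded in the column name.
        TreeMap<Long, String> row = new TreeMap<Long, String>();
        row.put(1355236954000L, "login");
        row.put(1355240554000L, "click");
        row.put(1355244154000L, "logout");

        // "Select any time range" is then just a column slice between two
        // timestamps; a real client issues a slice query with start and end
        // column names instead of subMap.
        SortedMap<Long, String> slice = row.subMap(1355236954000L, 1355244000000L);
        System.out.println(slice); // columns come back in time order, no index needed
    }
}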

Thank you,
  Andrey


On Tue, Dec 11, 2012 at 6:23 AM, Chin Ko <ck...@gmail.com> wrote:

> [original message quoted in full; snipped]

Re: Selecting rows efficiently from a Cassandra CF containing time series data

Posted by "Hiller, Dean" <De...@nrel.gov>.
We use PlayOrm to do something similar.

We have an object like this (typing all this from memory):

public class TimeSeries {

    @NoSqlPartitionedByField
    private long beginOfMonth;

    @NoSqlIndexed
    private long timestamp;
}

Then we just use PlayOrm's Scalable-SQL to query into the partition itself. This is all on the RandomPartitioner as well. We could partition by day if we had a much bigger data load, but we tend not to need that. The query looks something like "PARTITIONS s(:beginOfMonth) select s from TimeSeries as s" OR "PARTITIONS s(:beginOfMonth) select s from TimeSeries as s where s.timestamp > :start and s.timestamp < :end" (a rough sketch of deriving :beginOfMonth follows).
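
A rough sketch of deriving the beginOfMonth value (again from memory, so treat it as illustrative only):

import java.util.Calendar;
import java.util.TimeZone;

public class MonthPartition {

    // Truncate an event timestamp to midnight on the first of its month; this
    // is the kind of value the beginOfMonth field above would hold.
    static long beginOfMonth(long timestampMs) {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        cal.setTimeInMillis(timestampMs);
        cal.set(Calendar.DAY_OF_MONTH, 1);
        cal.set(Calendar.HOUR_OF_DAY, 0);
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);
        return cal.getTimeInMillis();
    }

    public static void main(String[] args) {
        System.out.println(beginOfMonth(System.currentTimeMillis()));
    }
}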

Later,
Dean

From: Chin Ko <ck...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Tuesday, December 11, 2012 7:23 AM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Selecting rows efficiently from a Cassandra CF containing time series data

[original message quoted in full; snipped]