Posted to user@cassandra.apache.org by "Rumph, Frens Jan" <ma...@frensjan.nl> on 2015/03/23 08:04:52 UTC

Cassandra time series + Spark

Hi,

I'm working on a system that has to deal with time series data. I've been
happy using Cassandra for time series, and Spark looks promising as a
computational platform.

I consider chunking time series in Cassandra necessary, e.g. by three weeks as
KairosDB does. This allows an 8-byte chunk start timestamp with 4-byte offsets
for the individual measurements, and it keeps the number of cells per partition
below the 2x10^9 limit even at 1000 Hz.
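
As a rough sketch of what I mean by chunking, assuming millisecond timestamps
(the names here are made up):

    object Chunking {
      // Three weeks in milliseconds: 3 * 7 * 24 * 60 * 60 * 1000 = 1,814,400,000
      val ChunkSizeMs: Long = 3L * 7 * 24 * 60 * 60 * 1000

      // Partition key component: 8-byte chunk start timestamp (epoch millis).
      def chunkStart(timestampMs: Long): Long =
        timestampMs - (timestampMs % ChunkSizeMs)

      // Clustering column: 4-byte offset within the chunk. Fits in an Int
      // because ChunkSizeMs (~1.8e9) is below Int.MaxValue (~2.1e9).
      def offset(timestampMs: Long): Int =
        (timestampMs - chunkStart(timestampMs)).toInt
    }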

This schema works quite well when dealing with one time series at a time.
Because the data is partitioned by time series id and by chunk of time (the
three weeks mentioned above), it takes a little client-side logic to retrieve
the relevant partitions and glue them together, but that is manageable.
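
For example, reading a time range for a single series looks roughly like this
with the DataStax Java driver, reusing the Chunking helper above (keyspace,
table and column names are again made up):

    import com.datastax.driver.core.Session
    import scala.collection.JavaConverters._

    object ReadRange {
      // Fetch [fromMs, toMs) for one series by querying each three-week
      // partition the range overlaps and gluing the results together
      // client-side.
      def read(session: Session, seriesId: String,
               fromMs: Long, toMs: Long): Seq[(Long, Double)] = {
        val first = Chunking.chunkStart(fromMs)
        val last  = Chunking.chunkStart(toMs - 1)
        (first to last by Chunking.ChunkSizeMs).flatMap { chunk =>
          val rs = session.execute(
            "SELECT offset, value FROM measurements " +
            "WHERE series_id = ? AND chunk_start = ?",
            seriesId, java.lang.Long.valueOf(chunk))
          rs.asScala.map(row =>
            (chunk + row.getInt("offset"), row.getDouble("value")))
        }.filter { case (ts, _) => fromMs <= ts && ts < toMs }
      }
    }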

However, when working with many / all of the time series in a table at
once, e.g. in Spark, the story changes dramatically. Say I want to compute
something as simple as a moving average: the data is spread all over the
place, and I currently can't think of anything other than an aggregateByKey,
which causes a shuffle every time.
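
To make this concrete, the kind of thing I have in mind with the
spark-cassandra-connector looks roughly like the sketch below (again with
made-up names), and it shuffles every measurement across the network:

    import com.datastax.spark.connector._
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    object MovingAverage {
      // Simple moving average per series, window = number of points.
      def movingAverage(sc: SparkContext,
                        window: Int): RDD[(String, Seq[(Long, Double)])] =
        sc.cassandraTable("timeseries", "measurements")
          .map(row => (row.getString("series_id"),
                       (row.getLong("chunk_start") + row.getInt("offset"),
                        row.getDouble("value"))))
          // Gather all measurements of a series on one executor; this is
          // the shuffle I'd like to avoid.
          .aggregateByKey(Vector.empty[(Long, Double)])(_ :+ _, _ ++ _)
          .mapValues { points =>
            val sorted = points.sortBy(_._1)
            sorted.sliding(window).map { w =>
              (w.last._1, w.map(_._2).sum / w.size)
            }.toSeq
          }
    }

This works, but every point gets serialized and moved across the network once
per computation, and a whole series has to fit in memory on a single executor.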

Does anyone have experience combining time series chunking with computation
on many / all time series at once? Any advice?

Cheers,
Frens Jan