Posted to user@cassandra.apache.org by Tim Wintle <ti...@gmail.com> on 2012/05/01 19:20:02 UTC

Data modeling advice (time series)

I believe that the general design for time-series schemas looks
something like this (correct me if I'm wrong):

(storing time series for X dimensions for Y different users)

Row Keys:  "{USER_ID}_{TIMESTAMP/BUCKETSIZE}"
Columns: "{DIMENSION_ID}_{TIMESTAMP%BUCKETSIZE}" -> {Counter}
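
(For concreteness, a minimal sketch of how those keys could be built,
assuming epoch-second timestamps and a one-day bucket; the names and
bucket size are illustrative, not prescriptive.)

    # Illustrative key/column construction for the scheme above.
    BUCKET_SIZE = 86400  # seconds per row bucket (here: one day)

    def row_key(user_id, ts):
        # One row per user per time bucket.
        return "%s_%d" % (user_id, ts // BUCKET_SIZE)

    def column_name(dimension_id, ts):
        # Column name is the dimension plus the offset within the bucket.
        return "%s_%d" % (dimension_id, ts % BUCKET_SIZE)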

But I've not found much advice on calculating optimal bucket sizes (i.e.
optimal number of columns per row), and how that decision might be
affected by compression (or how significant the performance differences
between the two options might be).

Are the calculations here still considered valid (proportionally) in
1.X, with the changes to SSTables, or are they significantly different?

<http://btoddb-cass-storage.blogspot.co.uk/2011/07/column-overhead-and-sizing-every-column.html> 
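
(For a rough sense of scale, the kind of arithmetic involved looks like
this; the ~15 bytes of per-column overhead is the figure commonly cited
for the older on-disk format, and whether it still holds proportionally
in 1.X is exactly the question above. Counter columns also carry extra
replication context, so the value size here is a lower bound.)

    # Back-of-the-envelope row size for one bucket (all numbers approximate).
    COLUMN_OVERHEAD = 15        # bytes: timestamp, flags, name/value lengths
    NAME_BYTES = 10             # e.g. "{DIMENSION_ID}_{OFFSET}"
    VALUE_BYTES = 8             # a 64-bit value (counters store more)

    columns_per_row = 10 * 288  # 10 dimensions at 5-min resolution, 1-day bucket
    row_bytes = columns_per_row * (COLUMN_OVERHEAD + NAME_BYTES + VALUE_BYTES)
    print(row_bytes)            # ~95 KB per row, before compression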


Thanks,

Tim


Re: Data modeling advice (time series)

Posted by Aaron Turner <sy...@gmail.com>.
On Wed, May 2, 2012 at 8:22 AM, Tim Wintle <ti...@gmail.com> wrote:
> On Tue, 2012-05-01 at 11:00 -0700, Aaron Turner wrote:
>> Tens or a few hundred MB per row seems reasonable.  You could do
>> thousands of MB if you wanted to, but that can make things harder to
>> manage.
>
> thanks (Both Aarons)
>
>> Depending on the size of your data, you may find that the overhead of
>> each column becomes significant; far more than the per-row overhead.
>> Since all of my data is just 64-bit integers, I ended up taking a day's
>> worth of values (288/day @ 5min intervals) and storing it as a single
>> column as a vector.
>
> By "vector" do you mean a raw binary array of long ints?

Yep.  I've also done a few small optimizations for when an entire day's
data is 0, etc.
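
(Something like the sketch below, assuming big-endian 64-bit packing;
the exact encoding isn't stated in the thread, so this is illustrative
only.)

    import struct

    def pack_day(values):
        # Pack a day's samples (e.g. 288 ints) into one opaque column value.
        return struct.pack(">%dq" % len(values), *values)

    def unpack_day(blob):
        # Inverse: recover the list of 64-bit samples from the column value.
        return list(struct.unpack(">%dq" % (len(blob) // 8), blob))

    # 288 samples * 8 bytes = 2304 bytes of payload behind a single column's
    # overhead, instead of 288 columns each paying their own overhead.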

> That sounds very nice for reducing overhead - but I'd like to work
> with counters (I was going to rely on them for streaming "real-time"
> updates).

I was going to use counters for aggregates... but I ended up doing all
the work in the client and storing them the same way as individual
data sources.  Depends on what you're counting really.  Basically with
counters, if you get an error incrementing them, you have no idea if
the value changed or not.  There are other issues too, which have been
discussed here on the list and should be in the archives.  Not a big deal
if you're just counting the number of times people have clicked
"Like", but if you're building network traffic aggregates and you fail
to include or double count a 10-slot switch full of 10Gbps ports, your
graphs end up looking really bad!

> Is that why you've got the two CFs described below (to have an archived
> summary and a live version that can have counters), or do you have no
> contention over writes/increments for individual values?

Basically if I inserted data as it came in as a vector, I'd have to do
a read for every write (read the current vector, and then write a new
vector with the new value appended to it).  That would destroy
performance, hence the two CF's.  By doing it nightly, it's a lot more
efficient.
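
(Roughly, the nightly job walks each finished StatsDaily row, packs it,
and writes it as one column in StatsDailyVector before dropping the
source row. The key formats and the read_row/write_column/delete_row
callables below are assumptions standing in for whatever client library
is actually used, not Aaron's actual code.)

    import struct

    SAMPLES_PER_DAY = 288  # 5-minute intervals

    def roll_up_day(stat_id, day, year, read_row, write_column, delete_row):
        # Read the finished day's columns from StatsDaily (offset -> value).
        cols = read_row("StatsDaily", "%s_%s" % (stat_id, day))
        values = [cols.get(i, 0) for i in range(SAMPLES_PER_DAY)]
        # Write the whole day as a single packed column in StatsDailyVector.
        blob = struct.pack(">%dq" % SAMPLES_PER_DAY, *values)
        write_column("StatsDailyVector", "%s_%s" % (stat_id, year), day, blob)
        # One row-level tombstone instead of 288 column tombstones.
        delete_row("StatsDaily", "%s_%s" % (stat_id, day))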

-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"

Re: Data modeling advice (time series)

Posted by Tim Wintle <ti...@gmail.com>.
On Tue, 2012-05-01 at 11:00 -0700, Aaron Turner wrote:
> Tens or a few hundred MB per row seems reasonable.  You could do
> thousands of MB if you wanted to, but that can make things harder to
> manage.

thanks (Both Aarons)

> Depending on the size of your data, you may find that the overhead of
> each column becomes significant; far more than the per-row overhead.
> Since all of my data is just 64-bit integers, I ended up taking a day's
> worth of values (288/day @ 5min intervals) and storing it as a single
> column as a vector.

By "vector" do you mean a raw binary array of long ints?

That sounds very nice for reducing overhead - but I'd like to work
with counters (I was going to rely on them for streaming "real-time"
updates).

Is that why you've got the two CFs described below (to have an archived
summary and a live version that can have counters), or do you have no
contention over writes/increments for individual values?

>   Hence I have two CF's:
> 
> StatsDaily  -- each row == 1 day, each column = 1 stat @ 5min intervals
> StatsDailyVector -- each row == 1 year, each column = 288 stats @ 1
> day intervals
> 
> Every night a job kicks off and converts each row's worth of
> StatsDaily into a column in StatsDailyVector.  By doing it 1:1 this
> way, I also reduce the number of tombstones I need to write in
> StatsDaily since I only need one tombstone for the row delete, rather
> than 288 for each column deleted.
> 
> I don't use compression.



Re: Data modeling advice (time series)

Posted by aaron morton <aa...@thelastpickle.com>.
I would try to avoid 100's of MB's per row. It will take longer to compact and repair.

10's is fine. Take a look at in_memory_compaction_limit and thrift_frame_size in the yaml file for some guidance.
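
(For reference, the corresponding settings in a 1.x cassandra.yaml go by
their full names; the defaults shown here are approximate and vary by
version, so check your own yaml.)

    # cassandra.yaml excerpt (approximate 1.x defaults)
    in_memory_compaction_limit_in_mb: 64
    thrift_framed_transport_size_in_mb: 15
    thrift_max_message_length_in_mb: 16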

Cheers
 
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 2/05/2012, at 6:00 AM, Aaron Turner wrote:

> On Tue, May 1, 2012 at 10:20 AM, Tim Wintle <ti...@gmail.com> wrote:
>> I believe that the general design for time-series schemas looks
>> something like this (correct me if I'm wrong):
>> 
>> (storing time series for X dimensions for Y different users)
>> 
>> Row Keys:  "{USER_ID}_{TIMESTAMP/BUCKETSIZE}"
>> Columns: "{DIMENSION_ID}_{TIMESTAMP%BUCKETSIZE}" -> {Counter}
>> 
>> But I've not found much advice on calculating optimal bucket sizes (i.e.
>> optimal number of columns per row), and how that decision might be
>> affected by compression (or how significant the performance differences
>> between the two options might be).
>> 
>> Are the calculations here still considered valid (proportionally) in
>> 1.X, with the changes to SSTables, or are they significantly different?
>> 
>> <http://btoddb-cass-storage.blogspot.co.uk/2011/07/column-overhead-and-sizing-every-column.html>
> 
> 
> Tens or a few hundred MB per row seems reasonable.  You could do
> thousands of MB if you wanted to, but that can make things harder to
> manage.
> 
> Depending on the size of your data, you may find that the overhead of
> each column becomes significant; far more than the per-row overhead.
> Since all of my data is just 64-bit integers, I ended up taking a day's
> worth of values (288/day @ 5min intervals) and storing it as a single
> column as a vector.  Hence I have two CF's:
> 
> StatsDaily  -- each row == 1 day, each column = 1 stat @ 5min intervals
> StatsDailyVector -- each row == 1 year, each column = 288 stats @ 1
> day intervals
> 
> Every night a job kicks off and converts each row's worth of
> StatsDaily into a column in StatsDailyVector.  By doing it 1:1 this
> way, I also reduce the number of tombstones I need to write in
> StatsDaily since I only need one tombstone for the row delete, rather
> than 288 for each column deleted.
> 
> I don't use compression.
> 
> 
> 
> -- 
> Aaron Turner
> http://synfin.net/         Twitter: @synfinatic
> http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
> Those who would give up essential Liberty, to purchase a little temporary
> Safety, deserve neither Liberty nor Safety.
>     -- Benjamin Franklin
> "carpe diem quam minimum credula postero"


Re: Data modeling advice (time series)

Posted by Aaron Turner <sy...@gmail.com>.
On Tue, May 1, 2012 at 10:20 AM, Tim Wintle <ti...@gmail.com> wrote:
> I believe that the general design for time-series schemas looks
> something like this (correct me if I'm wrong):
>
> (storing time series for X dimensions for Y different users)
>
> Row Keys:  "{USER_ID}_{TIMESTAMP/BUCKETSIZE}"
> Columns: "{DIMENSION_ID}_{TIMESTAMP%BUCKETSIZE}" -> {Counter}
>
> But I've not found much advice on calculating optimal bucket sizes (i.e.
> optimal number of columns per row), and how that decision might be
> affected by compression (or how significant the performance differences
> between the two options might be).
>
> Are the calculations here still considered valid (proportionally) in
> 1.X, with the changes to SSTables, or are they significantly different?
>
> <http://btoddb-cass-storage.blogspot.co.uk/2011/07/column-overhead-and-sizing-every-column.html>


Tens or a few hundred MB per row seems reasonable.  You could do
thousands of MB if you wanted to, but that can make things harder to
manage.

Depending on the size of your data, you may find that the overhead of
each column becomes significant; far more than the per-row overhead.
Since all of my data is just 64-bit integers, I ended up taking a day's
worth of values (288/day @ 5min intervals) and storing it as a single
column as a vector.  Hence I have two CF's:

StatsDaily  -- each row == 1 day, each column = 1 stat @ 5min intervals
StatsDailyVector -- each row == 1 year, each column = 288 stats @ 1
day intervals

Every night a job kicks off and converts each row's worth of
StatsDaily into a column in StatsDailyVector.  By doing it 1:1 this
way, I also reduce the number of tombstones I need to write in
StatsDaily since I only need one tombstone for the row delete, rather
than 288 for each column deleted.

I don't use compression.



-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"