Posted to user@cassandra.apache.org by Jean-Pierre Bergamin <ja...@ractive.ch> on 2010/04/14 15:02:29 UTC

Time-series data model

Hello everyone

We are currently evaluating a new DB system (replacing MySQL) to store
massive amounts of time-series data. The data are various metrics from
various network and IT devices and systems. Metrics could be, for example,
the CPU usage of server "xy" in percent, the memory usage of server "xy" in
MB, the ping response time of server "foo" in milliseconds, the network
traffic of router "bar" in MB/s, and so on. Different metrics can be
collected for different devices at different intervals.

The metrics are stored together with a timestamp. The queries we want to
perform are:
 * The last value of a specific metric of a device
 * The values of a specific metric of a device between two timestamps t1 and
t2

I stumbled across this blog post which describes a very similar setup with
Cassandra:
https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/
This post gave me confidence that what we want is definitely doable with
Cassandra.

But since I'm just digging into columns and super-columns and their
families, I still have some problems understanding everything.

Our data model could look like this in JSON-ish notation:
{
"my_server_1": {
	"cpu_usage": {
		{ts: 1271248215, value: 87 },
		{ts: 1271248220, value: 34 },
		{ts: 1271248225, value: 23 },
		{ts: 1271248230, value: 49 }
	},
	"ping_response": {
		{ts: 1271248201, value: 0.345 },
		{ts: 1271248211, value: 0.423 },
		{ts: 1271248221, value: 0.311 },
		{ts: 1271248232, value: 0.582 }
	}
},

"my_server_2": {
	"cpu_usage": {
		{ts: 1271248215, value: 23 },
		...
	},
	"disk_usage": {
		{ts: 1271243451, value: 123445 },
		...
	}
},

"my_router_1": {
	"bytes_in": {
		{ts: 1271243451, value: 2452346 },
		...
	},
	"bytes_out": {
		{ts: 1271243451, value: 13468 },
		...
	},
	"errors": {
		{ts: 1271243451, value: 24 },
		...
	}
}
}

What I don't get is how to create the two-level hierarchy [device][metric].

Am I right that the devices would be kept in a super column family? The
ordering of those is not important.

But the metrics per device are also a super column, where the columns would
be the metric values ({ts: 1271243451, value: 24 }), right?

So I'd need a super column in a super column... Hm.
My brain is definitely RDBMS-damaged and I don't see through columns and
super-columns yet. :-)

How could this be modeled in Cassandra?


Thank you very much
James



Re: Time-series data model

Posted by alex kamil <al...@gmail.com>.
James,

I'm a big fan of Cassandra, but have you looked at
http://en.wikipedia.org/wiki/RRDtool ?
It is natively built for this type of problem.

Alex

Re: Time-series data model

Posted by Dan Di Spaltro <da...@gmail.com>.
This is actually fairly similar to how we store metrics at Cloudkick.
The post below has a much more in-depth explanation of some of that:

https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/

So we store each natural data point in the NumericArchive column family:

<ColumnFamily CompareWith="LongType"
              Name="NumericArchive" />

<ColumnFamily CompareWith="LongType" Name="Rollup5m"
              ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup20m"
              ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup30m"
              ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup60m"
              ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup4h"
              ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup12h"
              ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup1d"
              ColumnType="Super" CompareSubcolumnsWith="BytesType" />

Our keys look like:
<serviceuuid>.<metric-name>

Anyway, this has been working out very well for us.
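
For illustration, here is a minimal sketch of writing one natural point and
reading the newest value back. It assumes the pycassa client; the keyspace
name, server address, and the service UUID are invented:

import time
import pycassa

# Hypothetical keyspace/server names -- adjust to your cluster.
pool = pycassa.ConnectionPool('Monitoring', ['localhost:9160'])
archive = pycassa.ColumnFamily(pool, 'NumericArchive')

# Row key is <serviceuuid>.<metric-name>; the UUID here is made up.
key = '5f2d9a1c-0c4b-4d21-9d3a-1b2c3d4e5f60.cpu_usage'

# Column names are LongType timestamps, values are opaque bytes.
now = int(time.time())
archive.insert(key, {now: str(87)})

# Latest point: a reversed slice of length 1.
latest = archive.get(key, column_count=1, column_reversed=True)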




-- 
Dan Di Spaltro

Re: Time-series data model

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Thu, 15 Apr 2010 11:27:47 +0200 Jean-Pierre Bergamin <ja...@ractive.ch> wrote: 

JB> Am 14.04.2010 15:22, schrieb Ted Zlatanov:
>> On Wed, 14 Apr 2010 15:02:29 +0200 "Jean-Pierre Bergamin"<ja...@ractive.ch>  wrote:
>> 
JB> The metrics are stored together with a timestamp. The queries we want to
JB> perform are:
JB> * The last value of a specific metric of a device
JB> * The values of a specific metric of a device between two timestamps t1 and
JB> t2
>> 
>> Make your key "devicename-metricname-YYYYMMDD-HHMM" (with whatever time
>> sharding makes sense to you; I use UTC by-hours and by-day in my
>> environment).  Then your supercolumn is the collection time as a
>> LongType and your columns inside the supercolumn can express the metric
>> in detail (collector agent, detailed breakdown, etc.).
>> 
JB> Just for my understanding: what is "time sharding"? I couldn't find an
JB> explanation anywhere. Do you mean that the time-series data is rolled
JB> up into 5 minute, 1 hour, 1 day etc. slices?

Yes.  The usual meaning of "shard" in the RDBMS world is to segment your
database by some criteria, e.g. US vs. Europe in Amazon AWS, because
that's how their data centers are laid out.  I was taking a linguistic
shortcut to mean "break down your rows by some convenient criteria."  You
can actually set up your Partitioner in Cassandra to literally shard your
keyspace rows based on the key, but I just meant "slice" in my note.

JB> So this would be defined as:
JB> <ColumnFamily Name="measurements" ColumnType="Super"
JB> CompareWith="UTF8Type"  CompareSubcolumnsWith="LongType" />

JB> So when I want to read all values of one metric between two timestamps
JB> t0 and t1, I'd have to read the supercolumns that match a key range
JB> (device1:metric1:t0 - device1:metric1:t1) and then all the
JB> supercolumns for this key?

Yes.  This is a single multiget if you can construct the key range
explicitly.  Cassandra loads a lot of this into memory already and
filters it after the fact, which is why it pays to slice your keys and
to stitch them together on the client side if you have to go across a
time boundary.  You'll also get better key load balancing with deeper
slicing if you use the randomizing partitioner.

In the result set, you'll get each matching supercolumn with all the
columns inside it.  You may have to page through supercolumns.
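
As a rough sketch of that stitching (pycassa assumed; keyspace and server
names are invented; by-day shards and LongType supercolumn names as above):

import pycassa
from datetime import date, timedelta

pool = pycassa.ConnectionPool('Monitoring', ['localhost:9160'])  # hypothetical
measurements = pycassa.ColumnFamily(pool, 'measurements')

def shard_keys(device, metric, first_day, last_day):
    # One row key per by-day shard: devicename-metricname-YYYYMMDD
    d = first_day
    while d <= last_day:
        yield '%s-%s-%s' % (device, metric, d.strftime('%Y%m%d'))
        d += timedelta(days=1)

t0, t1 = 1271248201, 1271248232    # query window, epoch seconds
keys = list(shard_keys('my_server_1', 'cpu_usage',
                       date(2010, 4, 13), date(2010, 4, 15)))

# One multiget across the shards; the t0..t1 filter is a slice over the
# LongType supercolumn names (the collection times) of every row.
rows = measurements.multiget(keys, column_start=t0, column_finish=t1)

# Stitch client-side, in shard (= time) order; absent keys are skipped.
points = []
for k in keys:
    points.extend(rows.get(k, {}).items())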

Ted


Re: Time-series data model

Posted by Jean-Pierre Bergamin <ja...@ractive.ch>.
Am 14.04.2010 15:22, schrieb Ted Zlatanov:
> On Wed, 14 Apr 2010 15:02:29 +0200 "Jean-Pierre Bergamin"<ja...@ractive.ch>  wrote:
>
> JB>  The metrics are stored together with a timestamp. The queries we want to
> JB>  perform are:
> JB>   * The last value of a specific metric of a device
> JB>   * The values of a specific metric of a device between two timestamps t1 and
> JB>  t2
>
> Make your key "devicename-metricname-YYYYMMDD-HHMM" (with whatever time
> sharding makes sense to you; I use UTC by-hours and by-day in my
> environment).  Then your supercolumn is the collection time as a
> LongType and your columns inside the supercolumn can express the metric
> in detail (collector agent, detailed breakdown, etc.).
>    
Just for my understanding: what is "time sharding"? I couldn't find an
explanation anywhere. Do you mean that the time-series data is rolled
up into 5 minute, 1 hour, 1 day etc. slices?

So this would be defined as:
<ColumnFamily Name="measurements" ColumnType="Super" 
CompareWith="UTF8Type"  CompareSubcolumnsWith="LongType" />

So when I want to read all values of one metric between two timestamps
t0 and t1, I'd have to read the supercolumns that match a key range 
(device1:metric1:t0 - device1:metric1:t1) and then all the supercolumns 
for this key?


Regards
James

Re: Time-series data model

Posted by Ilya Maykov <iv...@gmail.com>.
Hi Jean-Pierre,

I'm investigating using Cassandra for a very similar use case; maybe
we can chat and compare notes sometime. But basically, I think you
want to pull the metric name into the row key and use a regular CF
instead of a SCF. So, your example:

"my_server_1": {
       "cpu_usage": {
               {ts: 1271248215, value: 87 },
               {ts: 1271248220, value: 34 },
               {ts: 1271248225, value: 23 },
               {ts: 1271248230, value: 49 }
       }
       "ping_response": {
               {ts: 1271248201, value: 0.345 },
               {ts: 1271248211, value: 0.423 },
               {ts: 1271248221, value: 0.311 },
               {ts: 1271248232, value: 0.582 }
       }
}

becomes:

"my_server_1:cpu_usage" : {
               {ts: 1271248215, value: 87 },
               {ts: 1271248220, value: 34 },
               {ts: 1271248225, value: 23 },
               {ts: 1271248230, value: 49 }
},
"my_server_1:ping_response": {
               {ts: 1271248201, value: 0.345 },
               {ts: 1271248211, value: 0.423 },
               {ts: 1271248221, value: 0.311 },
               {ts: 1271248232, value: 0.582 }
       }

This keeps your rows smaller and the row count higher (which I think
will load-balance better). It also avoids large super columns, which you
don't want because columns inside a super column are not indexed, so
accessing them can be expensive.

The time-based sharding will be necessary eventually if you plan to
keep your data forever, because without it your rows will get so big
that they don't fit in memory and crash Cassandra during a compaction.
But realistically, Cassandra can support A LOT of columns and pretty
big rows. Suppose you sample your stats every minute and use
"device-id:metric-name" as the row key. Google calculator claims there
are ~526k minutes in a year, so if you keep high-resolution data
forever you would only have half a million columns per row after 1
year. Assuming 128 bytes per data point (which seems way high for a
(long, double, long) 3-tuple), that's only 64MB of data per row. If
you thin out older, less relevant data, you could last a lot longer
before you have to split rows. Furthermore, splitting old data off
into another row is easy because you know old data is not being
modified at the time of the split, so you don't have to worry about
the RMW problem or external locking of any kind. So I would start
without time-based sharding instead of over-engineering for it; that
keeps everything else much simpler.
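
As a quick sanity check of the sizing arithmetic above:

minutes_per_year = 365 * 24 * 60          # 525,600 -- the "~526k" above
bytes_per_point = 128                     # generous for (long, double, long)
print(minutes_per_year * bytes_per_point / 2.0**20)   # ~64 MB per row-year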

-- Ilya

P.S. Credit for the above viewpoint goes to Ryan King, who made this
argument to me in a recent discussion about this exact problem.


Re: Time-series data model

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Wed, 14 Apr 2010 15:02:29 +0200 "Jean-Pierre Bergamin" <ja...@ractive.ch> wrote: 

JB> The metrics are stored together with a timestamp. The queries we want to
JB> perform are:
JB>  * The last value of a specific metric of a device
JB>  * The values of a specific metric of a device between two timestamps t1 and
JB> t2

Make your key "devicename-metricname-YYYYMMDD-HHMM" (with whatever time
sharding makes sense to you; I use UTC by-hours and by-day in my
environment).  Then your supercolumn is the collection time as a
LongType and your columns inside the supercolumn can express the metric
in detail (collector agent, detailed breakdown, etc.).
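
As a minimal write sketch of that layout (pycassa assumed; keyspace,
server, and collector names are invented):

import time
import pycassa

pool = pycassa.ConnectionPool('Monitoring', ['localhost:9160'])  # hypothetical
cf = pycassa.ColumnFamily(pool, 'measurements')  # super CF, LongType supercolumns

now = int(time.time())
# Key shard: devicename-metricname-YYYYMMDD-HHMM, in UTC.
key = 'my_server_1-cpu_usage-' + time.strftime('%Y%m%d-%H%M', time.gmtime(now))

# Supercolumn name is the collection time; subcolumns carry the detail.
cf.insert(key, {now: {'value': str(87), 'agent': 'collectd'}})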

If you want your clients to discover the available metrics, you may need
to keep an external index.  But from your spec that doesn't seem necessary.

Ted


Re: Time-series data model

Posted by Zhiguo Zhang <mi...@gmail.com>.
First of all, I'm a newbie to NoSQL; take my opinion just as one
reference.

If I were you, I would use two column families:

1. CF where the key is the device
2. CF where the key is a TimeUUID

What do you think about that?

Mike

