Posted to user@cassandra.apache.org by di...@gridcore.se on 2012/08/09 14:52:26 UTC

Cassandra data model help

Hi,
I am trying to create a Cassandra schema for a cluster monitoring system, where one cluster can have multiple nodes and I am monitoring multiple metrics per node. Values are taken at 5-minute intervals, and my raw data schema looks like this:

metric_name + daily timestamp as the row key, composite column name of node name and timestamp, and the metric value as the column value

The problem I am facing is that a node can move back and forth between clusters (the system can have more than one cluster), so if I need monthly statistics plotted for a cluster, I have to consider the nodes that leave and join during that period: one node might be part of the cluster for just 15 days, and another could join only for the last 10 days of the month. So, to plot data for a particular cluster over a time interval, I need to know which nodes were part of that cluster during that period. What would be the best schema for this? I have tried a few ideas, so far with no luck. Any suggestions?

thanks
Ds

Re: Cassandra data model help

Posted by Aaron Turner <sy...@gmail.com>.
You need to track node membership separately.  I do that in a SQL
database, but you can use Cassandra for it.  For example:

rowkey = cluster name
column name = Composite[<epoch_time>:<node_name>], column value = [join|leave]

Then every time a node joins or leaves a cluster, write an entry.
Then you can just read the row (ordered by epoch time) to build your
list of active nodes for a given time period.  Note: you can set an
ending read range, but you basically have to start reading from 0.

Note that this layout is really for figuring out which nodes are in a
cluster for a given period of time.  You wouldn't want to model it that
way if you wanted to know which cluster(s) a single node was in over a
given period of time.  In that case you'd model it this way:

rowkey = node name
column name = Composite[<epoch_time>:<cluster_name>], column value = [join|leave]

Depending on your needs, you may end up using both!
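To make that replay concrete, here is a minimal Python sketch of building the active-node list from such a row, assuming it has already been read into a list of (epoch, node, action) tuples. The function name and tuple layout are illustrative, not part of any Cassandra API:

```python
def nodes_active_in_window(events, start, end):
    """events: (epoch, node, action) tuples, action is 'join' or 'leave'.
    Returns every node that was a cluster member at any point in
    [start, end].  As noted above, history is replayed from epoch 0
    so the membership state at the window start is known."""
    current = set()   # membership state while replaying the history
    active = set()    # nodes seen as members inside the window
    for epoch, node, action in sorted(events):
        if epoch > end:
            break
        # inside the window, a join makes the node a member, and a
        # leave means it was a member up until this moment
        if epoch >= start and (action == 'join' or node in current):
            active.add(node)
        if action == 'join':
            current.add(node)
        else:
            current.discard(node)
    # anything still a member when the window closed also counts
    return active | current
```

For example, a node that joined before the window and left halfway through it is still reported, as is a node that joined in the window's final days.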



On Fri, Aug 10, 2012 at 1:34 AM,  <di...@gridcore.se> wrote:
> Thanks Aaron for your reply,
> Creating a vector for the raw data is a good workaround for reducing disk space, but I am still not clear on tracking time for nodes. Say we want a query like "give me the list of nodes for a cluster during this period of time": how do we get that information? Do we scan through each node's row, since we will have a row for each node?
>
> thanks
>
> -----Aaron Turner <sy...@gmail.com> wrote: -----
> To: user@cassandra.apache.org
> From: Aaron Turner <sy...@gmail.com>
> Date: 08/09/2012 07:38PM
> Subject: Re: Cassandra data model help
>
> On Thu, Aug 9, 2012 at 5:52 AM,  <di...@gridcore.se> wrote:
>> [snip: original question quoted above]
>
> Store each node stat in its own row.  Then decide if you want to
> track when a node joins/leaves a cluster so you can build the
> aggregates on the fly, or just store cluster aggregates in their own
> row as well.  If the latter, depending on your polling methodology,
> you may want to use counters for the cluster aggregates.
>
> Also, if you're doing 5-minute intervals with each row = 1 day, then
> your disk space usage is going to grow pretty quickly due to
> per-column overhead.  You didn't say what the values are that you're
> storing, but if they're just 64-bit integers or something like that,
> most of your disk space is actually being used for column overhead,
> not your data.
>
> I worked around this by creating a 2nd CF, where each row = 1 year's
> worth of data and each column = 1 day's worth of data.  The values are
> just a vector of the 5-minute values from the original CF.  Then I
> just have a cron job which reads the previous day's data and builds
> the vectors in the new CF and then deletes the original row.  By doing
> this, my disk space requirements (before replication) went from over
> 1.1TB/year to 305GB/year.
>



-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"

Re: Cassandra data model help

Posted by di...@gridcore.se.
Thanks Aaron for your reply,
Creating a vector for the raw data is a good workaround for reducing disk space, but I am still not clear on tracking time for nodes. Say we want a query like "give me the list of nodes for a cluster during this period of time": how do we get that information? Do we scan through each node's row, since we will have a row for each node?

thanks

-----Aaron Turner <sy...@gmail.com> wrote: -----
To: user@cassandra.apache.org
From: Aaron Turner <sy...@gmail.com>
Date: 08/09/2012 07:38PM
Subject: Re: Cassandra data model help

On Thu, Aug 9, 2012 at 5:52 AM,  <di...@gridcore.se> wrote:
> [snip: original question quoted above]

Store each node stat in its own row.  Then decide if you want to
track when a node joins/leaves a cluster so you can build the
aggregates on the fly, or just store cluster aggregates in their own
row as well.  If the latter, depending on your polling methodology,
you may want to use counters for the cluster aggregates.

Also, if you're doing 5-minute intervals with each row = 1 day, then
your disk space usage is going to grow pretty quickly due to
per-column overhead.  You didn't say what the values are that you're
storing, but if they're just 64-bit integers or something like that,
most of your disk space is actually being used for column overhead,
not your data.

I worked around this by creating a 2nd CF, where each row = 1 year's
worth of data and each column = 1 day's worth of data.  The values are
just a vector of the 5-minute values from the original CF.  Then I
just have a cron job which reads the previous day's data and builds
the vectors in the new CF and then deletes the original row.  By doing
this, my disk space requirements (before replication) went from over
1.1TB/year to 305GB/year.




Re: Cassandra data model help

Posted by Aaron Turner <sy...@gmail.com>.
On Thu, Aug 9, 2012 at 5:52 AM,  <di...@gridcore.se> wrote:
> [snip: original question quoted above]

Store each node stat in its own row.  Then decide if you want to
track when a node joins/leaves a cluster so you can build the
aggregates on the fly, or just store cluster aggregates in their own
row as well.  If the latter, depending on your polling methodology,
you may want to use counters for the cluster aggregates.

Also, if you're doing 5-minute intervals with each row = 1 day, then
your disk space usage is going to grow pretty quickly due to
per-column overhead.  You didn't say what the values are that you're
storing, but if they're just 64-bit integers or something like that,
most of your disk space is actually being used for column overhead,
not your data.

I worked around this by creating a 2nd CF, where each row = 1 year's
worth of data and each column = 1 day's worth of data.  The values are
just a vector of the 5-minute values from the original CF.  Then I
just have a cron job which reads the previous day's data and builds
the vectors in the new CF and then deletes the original row.  By doing
this, my disk space requirements (before replication) went from over
1.1TB/year to 305GB/year.
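The rollup itself can be sketched like this, assuming the 5-minute values are signed 64-bit integers and the daily "vector" is a fixed-width big-endian blob (the encoding is an assumption; the post doesn't say how the vector is serialized):

```python
import struct

def pack_day(samples):
    """Pack one day's 5-minute samples into a single byte string,
    suitable for storing as one column value in the rollup CF."""
    return struct.pack('>%dq' % len(samples), *samples)

def unpack_day(blob):
    """Inverse of pack_day: recover the list of 64-bit samples."""
    return list(struct.unpack('>%dq' % (len(blob) // 8), blob))
```

With 288 samples per day, this stores 2,304 bytes of payload against a single column's worth of overhead, which is where the bulk of the quoted space saving comes from.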

