Posted to user@cassandra.apache.org by RobinUs2 <ro...@us2.nl> on 2011/11/17 21:36:03 UTC

Datastructure time tracking

We're currently developing a system with a time tracking part. We need to
store the following details:
- user
- time (in minutes)
- description
- billable
- project
- task ID

What would be a proper data structure for this in Cassandra?

--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Datastructure-time-tracking-tp7005672p7005672.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: Datastructure time tracking

Posted by RobinUs2 <ro...@us2.nl>.
Thank you very much. This was very helpful. I'll post an update here once I
have finished my data structure design.


Re: Datastructure time tracking

Posted by Tyler Hobbs <ty...@datastax.com>.
On Fri, Nov 18, 2011 at 1:59 PM, RobinUs2 <ro...@us2.nl> wrote:

> We should be able to:
> - find all time records from all users within a given project (with
> optionally a certain date range)
>

You'll need a timeline per project.  I would use one row per day, week, or
month, depending on the size.  Your row keys could be something like
'fooproject:2011-11-18'.  The column names could be timestamps or
TimeUUIDs, which will cause the columns in the row to be chronologically
ordered.
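
As a rough illustration of the bucketing idea (plain Python, not Cassandra
client code), the sketch below builds day-sized row keys in the
'fooproject:2011-11-18' style and lists the rows a date-range query would
have to read. The function names are made up for this example, and it
assumes one row per day rather than per week or month.

```python
from datetime import date, timedelta

def project_row_key(project, day):
    """Row key for one project and one day, e.g. 'fooproject:2011-11-18'."""
    return f"{project}:{day.isoformat()}"

def row_keys_for_range(project, start, end):
    """Row keys for every day from start to end, inclusive."""
    days = (end - start).days
    return [project_row_key(project, start + timedelta(days=i))
            for i in range(days + 1)]

# A three-day range touches three rows of the project timeline.
keys = row_keys_for_range("fooproject", date(2011, 11, 17), date(2011, 11, 19))
print(keys)
```

With weekly or monthly buckets only the key format changes; the range query
still reads every bucket that overlaps the requested dates.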


> - find total time per task
>

One row per taskID, with each column being an entry that contributes to the
total time for that task.  You would need to read the entire row and
calculate the sum client-side.  Depending on how big these might get, you
might want a process that periodically processes the rows and reduces them
down to their current sum (or you might build it into your read behavior --
a write after read, if you will -- depending on your requirements).  I'm
assuming you want exact numbers here, so the built-in counters probably
aren't a suitable option.
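
To make the read-and-sum idea concrete, here is a minimal Python sketch that
models one task row as a dict of column name to minutes. The names
total_minutes, roll_up, and the "rollup" column are hypothetical, and the
sketch ignores concurrent writes, which a real roll-up process would have to
handle.

```python
def total_minutes(row):
    """Sum every column value in a task row (done client-side)."""
    return sum(row.values())

def roll_up(row, rollup_column="rollup"):
    """Collapse all entries into a single column holding the current sum."""
    total = total_minutes(row)
    row.clear()
    row[rollup_column] = total
    return row

# One row per task ID; each column is one time entry in minutes.
task_row = {"entry-1": 30, "entry-2": 45, "entry-3": 15}
before = total_minutes(task_row)   # reads the whole row

roll_up(task_row)                  # the row now holds one column
task_row["entry-4"] = 10           # later writes are added as usual
after = total_minutes(task_row)    # roll-up value plus new entries
```

The total stays exact because the roll-up column is just another value in
the sum; only the number of columns to read shrinks.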


> - find all time records from user X (with optionally a certain date range)
>

This will look almost identical to the project timeline described above.
You could probably split the timelines into bigger chunks, though (like one
month per row).
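
A small Python sketch of the same idea with month-sized rows, under the
assumption that columns are named by timestamp and kept in sorted order.
The helper names and the sample data are invented for illustration; the
slice stands in for a column range read within one row.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime

def user_row_key(user, when):
    """Row key for one user and one month, e.g. 'robin:2011-11'."""
    return f"{user}:{when:%Y-%m}"

def slice_columns(columns, start, end):
    """Return the (timestamp, value) columns with start <= timestamp <= end."""
    names = [name for name, _ in columns]
    lo = bisect_left(names, start)
    hi = bisect_right(names, end)
    return columns[lo:hi]

# Columns of the row 'robin:2011-11', sorted by timestamp.
columns = [
    (datetime(2011, 11, 3, 9, 0), "entry A"),
    (datetime(2011, 11, 17, 14, 30), "entry B"),
    (datetime(2011, 11, 18, 10, 15), "entry C"),
]

selected = slice_columns(columns,
                         datetime(2011, 11, 17),
                         datetime(2011, 11, 18, 23, 59))
print(user_row_key("robin", datetime(2011, 11, 18)), selected)
```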

> As I understand from your answer this would require at least 3 (or 5?) CFs
> for the queries?
>

Correct, it looks like you would need 3 CFs for this set of queries.
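
One way to picture the three CFs is that every new time entry gets written
three times, once per query. The Python sketch below uses plain dicts as
stand-ins for the column families; the CF names, the record_time function,
and the string entry ID (in place of a TimeUUID) are all assumptions made
for this example.

```python
from collections import defaultdict
from datetime import date

# Plain dicts standing in for the three column families (row key -> columns).
project_timeline = defaultdict(dict)  # row key: 'project:YYYY-MM-DD'
task_entries = defaultdict(dict)      # row key: task ID
user_timeline = defaultdict(dict)     # row key: 'user:YYYY-MM'

def record_time(entry_id, user, project, task_id, day, minutes,
                description, billable):
    """Write one time entry to all three column families."""
    entry = {
        "user": user, "project": project, "task_id": task_id,
        "minutes": minutes, "description": description, "billable": billable,
    }
    project_timeline[f"{project}:{day.isoformat()}"][entry_id] = entry
    task_entries[task_id][entry_id] = minutes
    user_timeline[f"{user}:{day:%Y-%m}"][entry_id] = entry

record_time("entry-1", "robin", "fooproject", "task-42",
            date(2011, 11, 18), 30, "data model design", True)
```

Each read then touches only the CF built for it, at the cost of storing the
entry more than once.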

Hopefully that helps to illustrate the thought process.

-- 
Tyler Hobbs
DataStax <http://datastax.com/>

Re: Datastructure time tracking

Posted by RobinUs2 <ro...@us2.nl>.
We should be able to:
- find all time records from all users within a given project (with
optionally a certain date range)
- find total time per task
- find all time records from user X (with optionally a certain date range)

As I understand from your answer this would require at least 3 (or 5?) CFs
for the queries?


Re: Datastructure time tracking

Posted by Tyler Hobbs <ty...@datastax.com>.
On Thu, Nov 17, 2011 at 2:36 PM, RobinUs2 <ro...@us2.nl> wrote:

> We're currently developing a system with a time tracking part. We need to
> store the following details:
> - user
> - time (in minutes)
> - description
> - billable
> - project
> - task ID
>
> What would be a proper data structure for this in Cassandra?


How do you need to be able to query the data? Specific details matter.
For example, do you just need to know what happened for a specific user
during a given time period?  Or do you need to know what happened across
all users during a given time period?  All users of a given project?

These details matter because in Cassandra you tend to have one column
family per type of query that you need to be able to answer efficiently
(i.e. in real time).  Ad-hoc queries aren't efficient on large, distributed
data sets like you tend to use Cassandra for; you need to know what your
reads will look like to know how to model your data.

-- 
Tyler Hobbs
DataStax <http://datastax.com/>