Posted to user@cassandra.apache.org by Colin Yates <co...@gmail.com> on 2010/04/12 11:07:02 UTC
Newbie modelling question

Hi all,

I have read the docs and lots of posts on this
forum and it feels like things are starting to 
click into place, but I wanted to make sure.  Bear
with me please :)

Basically we have the following model:

Client {
  projects: Collection<Project>
  users: Collection<User>
}

Project {
  owner: User (in a firm)
  cost: number
  startDate: Date
  endDate: Date
}

A project might last for 6 months and we snapshot it every day:

ProjectSnapshot {
  snapshotDate: Date
  cost: number  -- estimated cost of the project
  project: Project
  predictedEndDate: Date -- estimated endDate 
}

So quite simple: Client 1..* User 1..* Project 1..* ProjectSnapshot

Actually, not so simple - a Project might be decomposed into smaller 
projects - Project 1..* Project but let's not worry about that 
for now.

In terms of reporting we need to be able to answer the following 
questions:

 - for a client, how many projects were closed each month

 - for a client, how many projects were on-going each month

 - for a client, what is the sum of the costs of all on-going projects 
(where cost(project) == avg(value(snapshots(project))))

 - for *all* clients, how many projects were closed each month
etc.

Note: the choice of month is arbitrary - it could be 36 seconds :) so I 
cannot have an explicit grouping by that.
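
To make the "arbitrary interval" requirement concrete, here is a minimal
Python sketch of bucketing project end dates by any interval width. The
function name closed_per_bucket, the origin date and the sample dates are
made up for illustration; nothing here is Cassandra-specific.

```python
from collections import Counter
from datetime import datetime, timedelta

def closed_per_bucket(end_dates, origin, width):
    """Count projects closed in each bucket of length `width`, counted from `origin`."""
    counts = Counter()
    for end in end_dates:
        # timedelta // timedelta gives the integer bucket index
        counts[(end - origin) // width] += 1
    return dict(counts)

# Hypothetical sample data: three projects closing in early 2010.
origin = datetime(2010, 1, 1)
end_dates = [datetime(2010, 1, 5), datetime(2010, 1, 20), datetime(2010, 2, 15)]

# The width could just as well be timedelta(seconds=36).
buckets = closed_per_bucket(end_dates, origin, timedelta(days=30))
print(buckets)  # {0: 2, 1: 1}
```

Because the width is only a parameter, the grouping cannot be baked into
the keys, which is the constraint described above.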

In terms of numbers, there might be 100 clients with 10 users each
with 100 projects.

Each project lives for 6 months and they are snapshotted each day
while they are 'alive' so every year there are 
(100 * 10 * 100 * (365/2)) = 18,250,000 rows.  
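
As a quick check of that arithmetic (the variable names are mine, the
figures are the ones quoted above):

```python
# Back-of-the-envelope snapshot count per year.
clients = 100
users_per_client = 10
projects_per_user = 100
snapshots_per_project = 365 / 2   # roughly six months of daily snapshots

rows_per_year = clients * users_per_client * projects_per_user * snapshots_per_project
print(f"{rows_per_year:,.0f}")  # 18,250,000
```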

Of course some can be archived after a year or so.

I was thinking about the following structure:

KS:Client<clientId>.CF:Project[<projectId>]: {
  owner:<text>, cost:<number>, startDate:<date>, endDate:<date>
}

KS:Client<clientId>.CF:ProjectSnapshots[<snapshotDate_inNanos>]: {
  projectId:UUID, cost:<number>, predictedEndDate:<Date>
}

If I understand everything (big if!) this means that:

 - I can look up all snapshots for all projects (within a client) 
using a key range across the keys in Client<clientId>.ProjectSnapshots.
I can then shovel this into a map/reduce to group by projectId 
and aggregate the snapshots(project)

 - The keys in CF:Project and CF:ProjectSnapshots will spread equally 
over a cluster so that each node has a chunk of contiguous projects 
and/or a chunk of contiguous snapshots.  

 - Adding a new snapshot should be really quick
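
The key-range-then-group idea from the first point can be sketched in
plain Python. This is only an in-memory stand-in: the sorted list plays
the role of CF:ProjectSnapshots for one client, the integer keys stand in
for <snapshotDate_inNanos>, and the helper names key_range and
average_cost_by_project are hypothetical. It says nothing about how a
real cluster orders or distributes keys.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Rows sorted by snapshot key: (snapshot_key, columns).
snapshots = [
    (1, {"projectId": "p1", "cost": 100.0}),
    (2, {"projectId": "p2", "cost": 50.0}),
    (3, {"projectId": "p1", "cost": 120.0}),
    (4, {"projectId": "p2", "cost": 70.0}),
]

def key_range(rows, start, end):
    """Return the rows whose key lies in [start, end]."""
    keys = [key for key, _ in rows]
    return rows[bisect_left(keys, start):bisect_right(keys, end)]

def average_cost_by_project(rows):
    """Group snapshot costs by projectId, then average each group."""
    grouped = defaultdict(list)
    for _, columns in rows:                      # "map": emit (projectId, cost)
        grouped[columns["projectId"]].append(columns["cost"])
    return {pid: sum(costs) / len(costs)         # "reduce": average per project
            for pid, costs in grouped.items()}

averages = average_cost_by_project(key_range(snapshots, 1, 4))
print(averages)  # {'p1': 110.0, 'p2': 60.0}
```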

Some questions (assuming the above statements are true):

 - The cluster nodes with the most projectIds will become hot 
spots. I really do need to look up a project by its ID, so I cannot have
a random key for CF:Project.

I don't know how to handle this.

 - If I wanted to load all the snapshots for a project do I need 
to map/reduce to find the snapshots for a project and then bulk-delete
those projectIds or can I filter on CF:ProjectSnapshots[*]{projectId:X}?

I realise I could have a 
CF:SnapshotsForProject[<projectId>]: { 
  snapshotDate1:<snapshotUUID>, snapshotDate2:<snapshotUUID>
}.  
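
A rough Python sketch of how I imagine that index CF being maintained,
with dicts standing in for the two column families. The function names
add_snapshot and load_snapshots and the sample values are hypothetical;
this is not a Cassandra client API.

```python
from collections import defaultdict

project_snapshots = {}                     # snapshot key -> columns
snapshots_for_project = defaultdict(dict)  # projectId -> {snapshotDate: snapshot key}

def add_snapshot(snapshot_key, project_id, snapshot_date, cost):
    """Write the snapshot row, then record it in the per-project index."""
    project_snapshots[snapshot_key] = {"projectId": project_id, "cost": cost}
    snapshots_for_project[project_id][snapshot_date] = snapshot_key

def load_snapshots(project_id):
    """Follow the index to fetch one project's snapshots in date order."""
    index = snapshots_for_project.get(project_id, {})
    return [project_snapshots[key] for _, key in sorted(index.items())]

add_snapshot("s2", "p1", "2010-04-02", 120.0)
add_snapshot("s1", "p1", "2010-04-01", 100.0)
add_snapshot("s3", "p2", "2010-04-01", 50.0)

print([row["cost"] for row in load_snapshots("p1")])  # [100.0, 120.0]
```

The cost is that every new snapshot becomes two writes instead of one.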

 - what about removing a project?  I think I understand that updates
on a CF are atomic but I need to delete two entries from two 
CFs - each of those will be atomic but they are two separate operations
right?
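
To illustrate what worries me, here is a toy Python sketch where two
dicts stand in for the two CFs. remove_project and its fail_between flag
are invented purely to simulate a failure between the two deletes; this
is not a claim about how Cassandra itself behaves.

```python
projects = {"p1": {"owner": "alice"}}
snapshots_for_project = {"p1": {"2010-04-01": "s1"}}

def remove_project(project_id, fail_between=False):
    """Delete a project from both stand-in CFs as two separate operations."""
    projects.pop(project_id, None)               # operation 1
    if fail_between:
        raise RuntimeError("simulated failure between the two deletes")
    snapshots_for_project.pop(project_id, None)  # operation 2

try:
    remove_project("p1", fail_between=True)
except RuntimeError:
    pass

# The project row is gone but its snapshot index is left behind.
print("p1" in projects, "p1" in snapshots_for_project)  # False True
```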

I realise that any reports across all clients will need to be manually
aggregated (as operations are scoped within a keyspace right?) - that is
fine.

Finally, I think I can do this within a single SCF:

KS:Client<clientId>: SCF:Project[<projectId>]: {
  CF:Projects, CF:SnapshotsForProject
}.

But this isn't ideal for a few reasons (I think):

 - how can I use the keys to search for all snapshots across all projects? 
Would I need to do this in map/reduce?

 - adding a new snapshot is now quite expensive in terms of I/O
 
It does make all operations on a project atomic now, which is excellent.

Ok - many many thanks for reading this - all advice/thoughts/hints are
welcome.  After looking at this (and document DBs) for a few days
and just not getting it, thinking 'what is all the hype about?', I think it
is slowly starting to sink in and I am *very* excited about it.  

Assuming I haven't completely missed the point :)

Col