Posted to user@cassandra.apache.org by Colin Yates <co...@gmail.com> on 2010/04/12 11:07:02 UTC
Newbie modelling question
Hi all,
I have read the docs and lots of posts on this
forum and it feels like things are starting to
click into place, but I wanted to make sure. Bear
with me please :)
Basically we have the following model:
Client {
projects: Collection<Project>
users: Collection<User>
}
Project {
owner: User (in a firm)
cost: number
startDate: Date
endDate: Date
}
A project might last for 6 months and we snapshot it every day:
ProjectSnapshot {
snapshotDate: Date
cost: number -- estimated cost of the project
project: Project
predictedEndDate: Date -- estimated endDate
}
So quite simple: Client 1..* User 1..* Project 1..* ProjectSnapshot
Actually, not so simple - a Project might be decomposed into smaller
projects - Project 1..* Project but let's not worry about that
for now.
In terms of reporting we need to be able to answer the following
questions:
- for a client, how many projects were closed each month
- for a client, how many projects were on-going each month
- for a client, what is the sum of the costs of all on-going projects
(where cost(project) == avg(value(snapshots(project))))
- for *all* clients, how many projects were closed each month
etc.
Note: the choice of month is arbitrary - it could be 36 seconds :) so I
cannot have an explicit grouping by that.
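
To make the "arbitrary interval" requirement concrete, here is a minimal Python sketch of counting closed projects per bucket when the bucket width is only known at query time. The function name closed_per_bucket, the sample end dates, and the use of 30 days as a stand-in for a month are all assumptions for illustration, not part of the model above.

```python
from collections import Counter
from datetime import datetime, timedelta

def closed_per_bucket(end_dates, origin, bucket):
    """Count projects whose endDate falls in each bucket-wide window after origin."""
    counts = Counter()
    for end in end_dates:
        # timedelta // timedelta yields an integer bucket index
        index = (end - origin) // bucket
        counts[index] += 1
    return dict(counts)

# Hypothetical end dates for one client's projects.
end_dates = [
    datetime(2010, 1, 15),
    datetime(2010, 1, 20),
    datetime(2010, 3, 2),
]
origin = datetime(2010, 1, 1)

# The same data grouped at two very different granularities.
by_30_days = closed_per_bucket(end_dates, origin, timedelta(days=30))
by_36_seconds = closed_per_bucket(end_dates, origin, timedelta(seconds=36))
```

Because the bucket index is computed from the raw date, nothing about the interval has to be baked into the stored keys.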
In terms of numbers, there might be 100 clients with 10 users each
with 100 projects.
Each project lives for 6 months and they are snapshotted each day
while they are 'alive' so every year there are
(100 * 10 * 100 * (365/2)) = 18,250,000 rows.
Of course some can be archived after a year or so.
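
The estimate above can be checked with a few lines of Python. The variable names are made up for this sketch; the figures are the ones quoted in the text.

```python
clients = 100
users_per_client = 10
projects_per_user = 100
snapshots_per_project = 365 / 2  # one snapshot per day over roughly six months

projects = clients * users_per_client * projects_per_user
snapshot_rows = projects * snapshots_per_project

print(projects)       # number of projects
print(snapshot_rows)  # number of snapshot rows per year
```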
I was thinking about the following structure:
KS:Client<clientId>.CF:Project[<projectId>]: {
owner:<text>, cost:<number>, startDate:<date>, endDate: <Date>
}
KS:Client<clientId>.CF:ProjectSnapshots[<snapshotDate_inNanos>]: {
projectId:UUID, cost:<number>, predictedEndDate:<Date>
}
If I understand everything (big if!) this means that:
- I can lookup all snapshots for all projects (within a client)
using a key range across the keys in Client<clientId>.ProjectSnapshots.
I can then shovel this into a map/reduce to group by projectId
and aggregate the snapshots(project)
- The keys in CF:Project and CF:ProjectSnapshots will spread equally
over a cluster so that each node has a chunk of contiguous projects
and/or a chunk of contiguous snapshots.
- Adding a new snapshot should be really quick
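
As a rough illustration of the first point, the following Python sketch groups snapshot rows by projectId and averages their cost. It runs on an in-memory list rather than against Cassandra or a real map/reduce framework; the sample rows and the names map_phase and reduce_phase are assumptions for this example only.

```python
from collections import defaultdict

# Hypothetical rows from a key range over ProjectSnapshots:
# (snapshot key, columns)
snapshots = [
    (1, {"projectId": "p1", "cost": 100.0}),
    (2, {"projectId": "p2", "cost": 50.0}),
    (3, {"projectId": "p1", "cost": 120.0}),
]

def map_phase(rows):
    """Emit one (projectId, cost) pair per snapshot row."""
    for _key, columns in rows:
        yield columns["projectId"], columns["cost"]

def reduce_phase(pairs):
    """Group pairs by projectId and average the costs in each group."""
    grouped = defaultdict(list)
    for project_id, cost in pairs:
        grouped[project_id].append(cost)
    return {pid: sum(costs) / len(costs) for pid, costs in grouped.items()}

average_cost = reduce_phase(map_phase(snapshots))
# Sum of per-project averages, as in the reporting question above.
total_cost = sum(average_cost.values())
```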
Some questions (assuming the above statements are true):
- The cluster nodes with the most projectIds will become hot
spots. I really do need to lookup a project by its ID so I cannot have
a random key for CF:Project.
I don't know how to handle this.
- If I wanted to load all the snapshots for a project do I need
to map/reduce to find the snapshots for a project and then bulk-delete
those projectIds or can I filter on CF:ProjectSnapshots[*]{projectId:X}?
I realise I could have a
CF:SnapshotsForProject[<projectId>]: {
snapshotDate1:<snapshotUUID>, snapshotDate2:<snapshotUUID>
}.
- what about removing a project? I think I understand that updates
on a CF are atomic but I need to delete two entries from two
CFs - each of those will be atomic but they are two separate operations
right?
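
The SnapshotsForProject idea and the two-step removal can be sketched in Python with plain dictionaries standing in for the two column families. This is only an in-memory illustration under assumed sample data; it does not model Cassandra's atomicity guarantees, it just shows that the lookup goes through the index and that removal touches two structures in two separate steps. The helper names snapshots_of and delete_project are hypothetical.

```python
# In-memory stand-ins for the two column families (hypothetical data).
# ProjectSnapshots is keyed by snapshot date, as in the layout above.
project_snapshots = {
    1000: {"projectId": "p1", "cost": 100.0},
    2000: {"projectId": "p2", "cost": 50.0},
    3000: {"projectId": "p1", "cost": 120.0},
}

# SnapshotsForProject: projectId -> {snapshotDate: snapshot row key}
snapshots_for_project = {}
for key, columns in project_snapshots.items():
    snapshots_for_project.setdefault(columns["projectId"], {})[key] = key

def snapshots_of(project_id):
    """Load a project's snapshots via the index, in date order."""
    index_row = snapshots_for_project.get(project_id, {})
    return [project_snapshots[index_row[date]] for date in sorted(index_row)]

def delete_project(project_id):
    """Remove a project's snapshots and its index row as two separate steps."""
    # Step 1: delete the snapshot rows the index points at.
    for key in snapshots_for_project.get(project_id, {}).values():
        del project_snapshots[key]
    # Step 2: delete the index row itself.
    snapshots_for_project.pop(project_id, None)

p1_costs = [row["cost"] for row in snapshots_of("p1")]
delete_project("p1")
```

If the process stopped between step 1 and step 2, the index row would point at snapshots that no longer exist, which is the concern raised in the last question.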
I realise that any reports across all clients will need to be manually
aggregated (as operations are scoped within a keyspace, right?) - that is
fine.
Finally, I think I can do this within a single SCF:
KS:Client<clientId>: SCF:Project[<projectId>]: {
CF:Projects, CF:SnapshotsForProject
}.
But this isn't ideal for a few reasons (I think):
- how can I use the keys to search for all snapshots across all projects?
Would I need to do this in map/reduce?
- adding a new snapshot is now quite expensive in terms of I/O
It does make all operations on a project atomic now, which is excellent.
Ok - many many thanks for reading this - all advice/thoughts/hints are
welcome. After looking at this (and document DBs) for a few days
and just not getting it, thinking 'what is all the hype about?', I think it
is slowly starting to sink in and I am *very* excited about it.
Assuming I haven't completely missed the point :)
Col