Posted to user@hama.apache.org by chris h <ch...@gmail.com> on 2012/08/28 19:34:57 UTC

Looking for help analyzing Hama for our use case

Hello there,

I'm architecting an analytics system and I'm having trouble building an
elegant data model in SQL or MongoDB.  I THINK a distributed graph DB would
be more suited to the task, but I've never worked with one! :)  I have a
quick description of my data and the issues I'm having below.  I'd love to
get some input from someone who could tell me if this is a solid use case
for Hama, and/or point me in the right direction.


The system has users who click on links, and clients who get custom reports
on what happened.

Here are the data objects we're dealing with:

Clients: Up to a few thousand
 - The clients of the service; they are the root of all the other data

Users: 0 - 10,000,000 per client
 - The users whose actions are being tracked; they are owned by a client

Links: 0 - 10,000 per client
 - The links that a user can click on; they are owned by a client

Clicks: 0 - 5 generated per client each day
 - An instance where a user "clicks" on a link...


So far nothing is too crazy here, though the volume of records is pretty
high.  The problem is with reporting the data.  The app is VERY report
heavy: clients will run a ton of reports, and they will need results back in
under a few seconds (few = about 3).  I don't foresee an issue running reports
like this:

   - show total clicks on this link
   - show unique clicks on this link
   - show which users clicked on this link the most
   - show which users did not click on this link
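
To make those four reports concrete, here is a minimal in-memory sketch in
Python. The Click record and all function names are assumptions made up for
illustration; they are not part of Hama or of any existing system, and a real
deployment would run the same logic over a database or a distributed job.

```python
from collections import Counter
from typing import NamedTuple


class Click(NamedTuple):
    # Hypothetical record shape: one row per click event.
    user_id: int
    link_id: int


def total_clicks(clicks, link_id):
    """Count every click on the link, repeats included."""
    return sum(1 for c in clicks if c.link_id == link_id)


def unique_clicks(clicks, link_id):
    """Count distinct users who clicked the link at least once."""
    return len({c.user_id for c in clicks if c.link_id == link_id})


def top_clickers(clicks, link_id, n=10):
    """Return (user_id, count) pairs for the users with the most clicks."""
    counts = Counter(c.user_id for c in clicks if c.link_id == link_id)
    return counts.most_common(n)


def non_clickers(clicks, link_id, all_user_ids):
    """Return the client's users who never clicked the link."""
    clicked = {c.user_id for c in clicks if c.link_id == link_id}
    return set(all_user_ids) - clicked
```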

Those reports may take a map/reduce job, but they should be very doable in most
databases.  The problem is when we add filters (time and demographics) to
the data like this:

   - show unique clicks on this link between Oct 23rd 10:00 and Oct 24th 09:59
   - show unique clicks on this link by females between the ages of 35 and
   45
   - show unique clicks on this link by females inside date range "A".
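
The filtered reports above can be sketched the same way. This is only a rough
Python illustration under assumed record shapes: the User and Click fields
(gender, age, clicked_at) and the function unique_clicks_filtered are
hypothetical names, not an existing schema or a Hama API. It scans every click,
which is exactly the cost that becomes a problem at high volume.

```python
from datetime import datetime
from typing import NamedTuple


class User(NamedTuple):
    # Hypothetical demographic fields used by the filters.
    user_id: int
    gender: str
    age: int


class Click(NamedTuple):
    # Hypothetical record shape: one row per click event.
    user_id: int
    link_id: int
    clicked_at: datetime


def unique_clicks_filtered(clicks, users, link_id,
                           start=None, end=None, user_filter=None):
    """Count distinct users who clicked link_id, optionally restricted to
    an inclusive [start, end] time window and to users for whom
    user_filter(user) returns True."""
    users_by_id = {u.user_id: u for u in users}
    seen = set()
    for c in clicks:
        if c.link_id != link_id:
            continue
        if start is not None and c.clicked_at < start:
            continue
        if end is not None and c.clicked_at > end:
            continue
        if user_filter is not None:
            user = users_by_id.get(c.user_id)
            # Clicks from unknown users cannot satisfy a demographic filter.
            if user is None or not user_filter(user):
                continue
        seen.add(c.user_id)
    return len(seen)
```

For example, "females between the ages of 35 and 45" would be passed as
user_filter=lambda u: u.gender == "F" and 35 <= u.age <= 45, and date range
"A" as a start/end pair of datetime values.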


With basically an infinite number of time and demographic filters, a massive
amount of users and actions, and the necessity for results to be returned
back to the client in under a few seconds... is this something Hama is well
suited for?

Thanks for taking the time,
Chris.