Posted to user@hbase.apache.org by Wilm Schumacher <wi...@gmail.com> on 2015/02/13 15:57:33 UTC

hbase as logging dump => design for mapred

Hi,

I have a design question and I'm kind of stuck. I can't find an easy
solution, but I think there is one.

The problem: consider an application where users can "open" an
object. They can then perform an operation on that object, or go
further to another object. Now I want to compute some statistics over
all users and all objects, e.g. how long one object is "viewed" by all
users, or how long on average a user looks at one type of object
before an operation occurs (e.g. going further or doing something on
the object). I can log an "open event" and an "operation event", but
not a "close event".

The number of objects is around 1M objects, and the number of operations
on the objects is around 100M operations, perhaps more. And both numbers
can grow fast.

What I want to get is a result like:
object1 =>  users = { "userX" : 2s , "userY" : 60s , ...  } ,
operationby = "userX" , objecttype = "foobar"
for all objects. From that I can calculate the average time one user
spent on typeX etc. For now I plan to dump the above structure to another
hbase table and run a mapred job there to get my averages. But this could
change if one of you comes up with a better plan. I actually only need
the averaged and aggregated data.
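
For illustration, here is a minimal Python sketch of that last averaging step. It assumes each dumped record is a plain dict shaped like the structure above; the field names, the sample values and the function name are hypothetical, and plain Python stands in for the actual mapred job.

```python
from collections import defaultdict

def average_time_per_type(records):
    """Average viewing time (seconds) per object type over all users."""
    totals = defaultdict(float)  # objecttype -> summed seconds
    counts = defaultdict(int)    # objecttype -> number of (user, object) views
    for rec in records:
        for seconds in rec["users"].values():
            totals[rec["objecttype"]] += seconds
            counts[rec["objecttype"]] += 1
    return {t: totals[t] / counts[t] for t in totals}

# Hypothetical sample records mirroring the structure in the post.
records = [
    {"object": "object1", "users": {"userX": 2, "userY": 60},
     "operationby": "userX", "objecttype": "foobar"},
    {"object": "object2", "users": {"userX": 10},
     "operationby": None, "objecttype": "foobar"},
]

averages = average_time_per_type(records)
```

Here every (user, object) pair counts once; weighting by user or by object instead would be a different, equally plausible definition of "average".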

As I use one hbase table as a logging dump, I naively came up with a
structure like

rowkey = <timestamp> , data:user => <username> , data:object => <object
ref> , data:type => <object type> , data:operation => <(open|do
something)> ....

But now I'm stuck on how to write a clever mapred job to get my information.
E.g. if I want to know how long an object is viewed by a particular
user, I would have to find an "open" operation and then scan further for
either an "open operation" on another object or a "task operation" on
the same object.
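
The pairing rule just described can be sketched as a single pass over a timestamp-ordered log. This is only an illustration in plain Python, not HBase code: the tuple layout, the operation names "open" and "do", and the choice to treat a re-open of the same object as ending the view are all assumptions.

```python
def view_durations(log):
    """log: iterable of (timestamp, user, obj, operation), sorted by timestamp.

    Returns a list of (obj, user, seconds) for every view that was closed
    by a later event of the same user.
    """
    pending = {}    # user -> (obj, open_timestamp) of the view still open
    durations = []
    for ts, user, obj, op in log:
        if user in pending:
            open_obj, open_ts = pending[user]
            # A view ends at the user's next "open" (assumed: of any object)
            # or at an operation on the object currently being viewed.
            if op == "open" or obj == open_obj:
                durations.append((open_obj, user, ts - open_ts))
                del pending[user]
        if op == "open":
            pending[user] = (obj, ts)
    return durations

# Hypothetical sample log, timestamps in seconds.
log = [
    (0, "userX", "object1", "open"),
    (1, "userY", "object1", "open"),
    (2, "userX", "object1", "do"),
    (61, "userY", "object2", "open"),
]

durations = view_durations(log)
```

The last open of each user never gets a closing event, so it is simply dropped here. Note that the pass needs per-user state across the whole log, which is exactly what makes it awkward to split by time range.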

Thus I came up with

rowkey = <object-ref>-<timestamp> , data:user => <username> , data:type
=> <type> , data:operation => <(open|do something)>

With this I could scan over one object-ref and calculate all the
information for all users for ONE object. However, I have to do that for
all objects, not just a specific one.
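
As a sketch of how such a composite key behaves, the snippet below builds `<object-ref>-<timestamp>` keys and imitates a prefix scan over sorted keys. It is plain Python, not the HBase client API; the helper names are hypothetical, and the zero-padded millisecond timestamp is an assumption made so that lexicographic key order matches chronological order.

```python
def make_rowkey(object_ref, timestamp_ms):
    # Zero-pad the timestamp (assumed: epoch milliseconds, 13 digits) so
    # that byte-wise ordering of keys equals ordering by time.
    return f"{object_ref}-{timestamp_ms:013d}"

def scan_prefix(sorted_keys, object_ref):
    # Imitates a range scan over all rows of one object.
    prefix = object_ref + "-"
    return [k for k in sorted_keys if k.startswith(prefix)]

# Hypothetical object refs and timestamps.
keys = sorted([
    make_rowkey("obj2", 1423839453000),
    make_rowkey("obj1", 1423839460000),
    make_rowkey("obj1", 999),
    make_rowkey("obj10", 5),
])

obj1_rows = scan_prefix(keys, "obj1")
```

Including the "-" separator in the prefix keeps "obj10" out of the scan for "obj1".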

How can I run over that table in a mapred fashion? My first idea was to
mapred over the objects and then do an appropriate scan in the log table.
But this way I cannot detect a "go further", as the next open event
is in another scan range.

My next idea was a rowkey design of <user>-<timestamp>. But then I would
have to map over the users and make a full scan of each user's complete
log, which would kill the advantage of mapred (as the number of users is <<
number of objects), and a single map task would be huge. This would be
more of a parallel scan. Or is this a good idea? And if so ... what to
do to get the information from above?
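
One way to picture the per-user variant is a map step that emits each log row keyed by user and a reduce step that sorts one user's events by time and pairs each open with the event that follows it. The sketch below simulates this in plain Python; it is not Hadoop or HBase code, the function names and row layout are hypothetical, and it only pairs an open with the immediately following event of the same user.

```python
from collections import defaultdict

def map_phase(rows):
    # Emit (user, event) so that all events of one user meet in one reducer.
    for ts, user, obj, op in rows:
        yield user, (ts, obj, op)

def reduce_user(user, events):
    # Sort one user's events by timestamp, then pair neighbours.
    events = sorted(events)
    out = []
    for (ts, obj, op), (next_ts, next_obj, next_op) in zip(events, events[1:]):
        if op == "open" and (next_op == "open" or next_obj == obj):
            out.append((obj, user, next_ts - ts))
    return out

def run(rows):
    groups = defaultdict(list)  # stands in for the shuffle step
    for user, event in map_phase(rows):
        groups[user].append(event)
    result = []
    for user in sorted(groups):
        result.extend(reduce_user(user, groups[user]))
    return result

# Hypothetical rows, deliberately out of order.
rows = [
    (61, "userY", "object2", "open"),
    (0, "userX", "object1", "open"),
    (2, "userX", "object1", "do"),
    (1, "userY", "object1", "open"),
]

result = run(rows)
```

In this picture the concern raised above shows up directly: each reduce call holds one user's entire log, so with few, very active users the work is spread over only a few large tasks.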

Thus ... I'm stuck :(.

Any ideas how to achieve my goal?

Best wishes

Wilm

ps: It would be possible to create a "close event" if there is no other
solution. But I'd rather not do that.