Posted to user@hbase.apache.org by prasenjit mukherjee <pr...@gmail.com> on 2012/01/06 18:55:55 UTC

HBase for real-time data aggregation

I need to design a near-real-time system in which documents (with
fields: id, keywords, timestamp) are continuously added. The
requirement is to get the top-k keywords from the documents added in
the last x minutes. The typical document addition rate is around
100 documents/sec, which may increase in the future (hence the
technology should be horizontally scalable).

I am thinking of using HBase. For each document we can add a set of
row keys (one per keyword in that doc) of the form timestamp_keyword.
At query time we can run a map-reduce job over a key range (from
ts1_* to ts2_*) to compute the keyword frequency for that range.
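
The key layout and range query described above can be sketched in plain
Python. This is only an in-memory illustration under stated assumptions,
not HBase client code: SortedKeyStore is a hypothetical stand-in for a
table whose rows are kept sorted by key, and make_row_key, add_document
and top_k_keywords are made-up helper names. The sketch also assumes the
timestamp is zero-padded (so that string order matches time order), that a
doc id is appended to the key so two documents sharing a timestamp and a
keyword do not collide, and that doc ids contain no underscore.

```python
import bisect
from collections import Counter


def make_row_key(ts_seconds, keyword, doc_id):
    # Zero-pad the timestamp so lexicographic order equals time order.
    # Appending doc_id is an assumption made here to keep keys unique.
    return f"{ts_seconds:010d}_{keyword}_{doc_id}"


class SortedKeyStore:
    """Hypothetical in-memory stand-in for a table sorted by row key."""

    def __init__(self):
        self._keys = []

    def put(self, key):
        # Insert while keeping the key list sorted.
        bisect.insort(self._keys, key)

    def scan(self, start_key, stop_key):
        # Half-open range [start_key, stop_key).
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return self._keys[lo:hi]


def add_document(store, doc_id, keywords, ts_seconds):
    # One row key per distinct keyword in the document.
    for keyword in set(keywords):
        store.put(make_row_key(ts_seconds, keyword, doc_id))


def top_k_keywords(store, ts1, ts2, k):
    # Scan every key whose timestamp lies in [ts1, ts2] and count keywords.
    start = f"{ts1:010d}_"
    stop = f"{ts2 + 1:010d}_"
    counts = Counter()
    for key in store.scan(start, stop):
        _, rest = key.split("_", 1)          # drop the timestamp prefix
        keyword, _ = rest.rsplit("_", 1)     # drop the trailing doc id
        counts[keyword] += 1
    return counts.most_common(k)


if __name__ == "__main__":
    store = SortedKeyStore()
    add_document(store, "d1", ["hbase", "storm"], 100)
    add_document(store, "d2", ["hbase"], 105)
    add_document(store, "d3", ["cassandra", "hbase"], 200)
    print(top_k_keywords(store, 100, 110, 2))
```

In this sketch the counting happens in a single loop over the scanned
keys; in the design above that step would be the map-reduce job over the
same key range.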

Are there better technologies for this use case, such as MongoDB,
Cassandra, or Storm? The use case is primarily aggregation.

-prasen

Re: HBase for real-time data aggregation

Posted by shashwat shriparv <dw...@gmail.com>.
In my experience, HBase is not a bad choice. The only problem is that you
will not get ready-made solutions. If you go with it, you can look at the
indexing options available for HBase. You can try HSearch and the Lily
project for indexing and fast retrieval.


-- 
Shashwat Shriparv
09900059620
09663531241


