Posted to user@hbase.apache.org by kfarmer <kf...@camstar.com> on 2012/01/11 19:59:04 UTC

HBase for ad-hoc aggregate queries

I'm taking a look at moving our datastore from Oracle to HBase, and trying to
understand how HBase could be used for ad-hoc aggregation queries across our
data.

My understanding is MapReduce is more of a batch framework, so if we want a
query to come back to the user's request in a few seconds, that won't work
because of the overhead of running MR and because the MR jobs write back to
a new table.  Is that correct?

Instead should we be pre-aggregating data as we load into separate tables,
and then when a user queries instead just do a scan on these pre-aggregated
tables?
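
To make the pre-aggregation idea concrete, here is a minimal sketch in plain Python. It is not HBase client code: a dict stands in for the rollup table, and the row key layout (`product|hour`), the function names, and the sample facts are all assumptions made up for illustration. It only shows the shape of the approach: update a rollup row while loading each fact, then answer a query with a range scan over sorted row keys.

```python
from bisect import bisect_left
from collections import defaultdict

# Stand-in for a pre-aggregated table: row key -> running totals.
# A real HBase table keeps rows sorted by key, which is what makes
# range scans cheap; here the keys are sorted at scan time instead.
hourly_totals = defaultdict(lambda: {"count": 0, "sum": 0.0})

def row_key(product, hour):
    # Hypothetical key layout: dimension value first, then the time bucket.
    return f"{product}|{hour}"

def load_fact(product, timestamp, amount):
    """Update the rollup row while loading the raw fact."""
    hour = timestamp[:13]  # "2012-01-11T19:59:04" -> "2012-01-11T19"
    cell = hourly_totals[row_key(product, hour)]
    cell["count"] += 1
    cell["sum"] += amount

def scan_totals(product, start_hour, stop_hour):
    """Range scan over [start_hour, stop_hour) for one product."""
    keys = sorted(hourly_totals)
    lo = bisect_left(keys, row_key(product, start_hour))
    hi = bisect_left(keys, row_key(product, stop_hour))
    count = sum(hourly_totals[k]["count"] for k in keys[lo:hi])
    total = sum(hourly_totals[k]["sum"] for k in keys[lo:hi])
    return count, total

load_fact("widget", "2012-01-11T19:59:04", 10.0)
load_fact("widget", "2012-01-11T19:05:00", 5.0)
load_fact("widget", "2012-01-11T20:15:00", 2.5)
load_fact("gadget", "2012-01-11T19:30:00", 99.0)

print(scan_totals("widget", "2012-01-11T19", "2012-01-11T21"))
```

The trade-off is that only the dimensions baked into the row key can be queried this way, so every new kind of question needs its own rollup.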

Thanks.
-- 
View this message in context: http://old.nabble.com/HBase-for-ad-hoc-aggregate-queries-tp33123313p33123313.html
Sent from the HBase User mailing list archive at Nabble.com.

Re: HBase for ad-hoc aggregate queries

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

Bottom line, IMO: you have to consider how your data is organized. For
90% of a relational schema (but perhaps 10% of the volume), a move to an
HBase-based solution is not warranted.

However, for 10% of the schema (and 90% of the volume) you may
consider using HBase-based solutions, most typically for time series data
feeds.

-d


Re: HBase for ad-hoc aggregate queries

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

IMO you will never get the same flexibility. There are also numerous
differences in the data modelling approach (TTL, the need for
uniformly distributed ids to scale query volume, etc.).

The most flexibility we have reached so far w.r.t. aggregation
queries is with an OLAP-ish model (see the HBase-Lattice link under
supported projects on the HBase wiki).

This is for aggregating really high-QPS real-time fact streams. The list
of current limitations is huge, but it serves our purpose so far.

The most obvious benefits are that queries are fast (because of
precomputed cuboids in a lattice, similar to the cuboid lattice approach
in ROLAP), the incremental compilation cycle is short (one can grow and
update the cube within just a few minutes of a fact being fed into the
system), and compilation scales horizontally for high-volume
fact feeds. There's a fairly limited query language and a basic set of
aggregate functions (along with some weighted time series aggregates).
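
For readers unfamiliar with the term, the following is a rough sketch of a precomputed cuboid lattice in plain Python. It is not how HBase-Lattice is implemented; the dimension names, the `add_fact` and `query` helpers, and the sample facts are hypothetical. It only illustrates why queries are fast (one cell lookup in the matching cuboid) and why updates can be incremental (each new fact is folded into every cuboid as it arrives).

```python
from itertools import combinations

DIMENSIONS = ("region", "product", "day")  # hypothetical dimensions

# One dict per cuboid. A cuboid is identified by the tuple of dimensions
# it keeps; the empty tuple is the grand total. 3 dimensions -> 8 cuboids.
cuboids = {
    dims: {}
    for r in range(len(DIMENSIONS) + 1)
    for dims in combinations(DIMENSIONS, r)
}

def add_fact(fact, measure):
    """Incrementally fold one fact into every cuboid of the lattice."""
    for dims, cells in cuboids.items():
        key = tuple(fact[d] for d in dims)
        cells[key] = cells.get(key, 0.0) + measure

def query(**filters):
    """Answer SUM(measure) for equality filters with a single cell lookup."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    key = tuple(filters[d] for d in dims)
    return cuboids[dims].get(key, 0.0)

add_fact({"region": "EU", "product": "widget", "day": "2012-01-11"}, 10.0)
add_fact({"region": "EU", "product": "gadget", "day": "2012-01-11"}, 4.0)
add_fact({"region": "US", "product": "widget", "day": "2012-01-12"}, 6.0)

print(query(region="EU"))
print(query(product="widget"))
print(query())
```

The number of cuboids doubles with every added dimension, which is one reason such systems restrict which combinations they materialize.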

The most severe limitation right now is the lack of a commonly used
multidimensional query dialect such as MDX, which prevents use of
widely used exploratory pivoting UI clients such as Excel, JPivot or
Tableau. So it is either custom UI integration, custom data
source providers for canned reports with tools like Pentaho and
Jasper, or some real-time decisioning framework that doesn't require any UI
at all and can use the Java API. I also plan to enable R to run queries
against it (because I personally don't believe in doing ML or analytics
using Excel).

-d


Re: HBase for ad-hoc aggregate queries

Posted by Ian Varley <iv...@salesforce.com>.

And in case no one else says it ...

> I'm taking a look at moving our datastore from Oracle to HBase

This is a questionable project in the general case. HBase is not a relational store and lacks indexes, transactions, isolation, easy ad-hoc querying, and nearly everything else you get from Oracle. It may work for specific cases, but it's not usually prudent to think of it as "simply" converting from one database to another.




Re: HBase for ad-hoc aggregate queries

Posted by kisalay <ki...@gmail.com>.

You can have a look at OpenTSDB, which does aggregations on the data:
http://opentsdb.net/
Also, you can use endpoint coprocessors to do aggregations on a per-region
basis and then merge the results:
http://hbase-coprocessor-experiments.blogspot.com/2011/05/extending.html

Both of these approaches give you alternatives to traditional
MR.
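
The per-region aggregate-then-merge pattern can be sketched in plain Python as below. This does not use the HBase coprocessor API: the `regions` list, `region_partial`, and `merge` are hypothetical stand-ins, with a thread pool playing the role of the region servers. The point is that each region returns a small partial result and the client only combines those.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical regions: each holds a contiguous, non-empty slice of rows.
regions = [
    [("row-a", 3.0), ("row-b", 5.0)],
    [("row-m", 2.0)],
    [("row-x", 7.0), ("row-y", 1.0), ("row-z", 4.0)],
]

def region_partial(rows):
    """Runs next to the data: reduce one region to a small partial result."""
    values = [value for _, value in rows]
    return {"count": len(values), "sum": sum(values), "max": max(values)}

def merge(partials):
    """Runs on the client: combine the per-region partial results."""
    partials = list(partials)
    return {
        "count": sum(p["count"] for p in partials),
        "sum": sum(p["sum"] for p in partials),
        "max": max(p["max"] for p in partials),
    }

# The thread pool simulates regions computing their partials in parallel.
with ThreadPoolExecutor() as pool:
    result = merge(pool.map(region_partial, regions))

# An average is derived at the end from the merged sum and count,
# because per-region averages cannot be merged directly.
average = result["sum"] / result["count"]
print(result, average)
```

Only aggregates that decompose this way (count, sum, min, max, and anything derived from them) fit the pattern directly.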


Re: HBase for ad-hoc aggregate queries

Posted by Doug Meil <do...@explorysmedical.com>.

re:  "My understanding is MapReduce is more of a batch framework,"

Yes.

re:  "and because the MR jobs write back to a new table."

They can write wherever they need to write (HDFS, HBase, etc.).


You probably want to check out the Architecture, Data Model, and
MapReduce chapters of the HBase Book/RefGuide.

http://hbase.apache.org/book.html


