Posted to user@hbase.apache.org by Paul Nickerson <pa...@escapemg.com> on 2011/07/25 04:43:49 UTC

Fanning out hbase queries in parallel

I would like to implement a multidimensional query system that aggregates large amounts of data on the fly by fanning out queries in parallel. It should be fast enough for interactive exploration of the data and extensible enough to handle sets of hundreds or thousands of high-cardinality dimensions, aggregating them from high granularity to low granularity. Dimensions and their values are stored in the row key. For instance, row keys look like this:
Foo=bar,blah=123
and each row contains numerical values within its column families, such as plays=100, versioned by the date of calculation.
A user wants the top "Foo" values with blah=123, sorted in descending order by total plays in July. My current thinking is that a query would be executed by grouping all Foo-prefixed row keys by region server and sending the query to each of those. Each region server iterates through all of its row keys that start with Foo=something,blah=, and passes the query on to all regions containing blah values equal to 123, which in turn contain the play counts. Matching row keys, along with the sum of their play values within July, are passed back up the chain and sorted/truncated where possible.
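The fan-out plan described above can be sketched in plain Java. This is a minimal, hypothetical simulation: sorted in-memory maps stand in for region servers, a thread pool stands in for the parallel fan-out, and the names (FanOutSketch, scanPrefix) are illustrative, not HBase API.

```java
import java.util.*;
import java.util.concurrent.*;

public class FanOutSketch {
    // Each "region" holds row keys like "Foo=bar,blah=123" mapped to
    // total plays for July (per-date versions are assumed pre-summed).
    static Map<String, Long> scanPrefix(NavigableMap<String, Long> region,
                                        String prefix, String blah) {
        Map<String, Long> sums = new HashMap<>();
        // tailMap + startsWith emulates a prefix scan over sorted row keys
        for (Map.Entry<String, Long> e : region.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) break;
            if (e.getKey().endsWith(",blah=" + blah)) {
                sums.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return sums;
    }

    public static void main(String[] args) throws Exception {
        // Three fake regions, keys sorted as HBase would store them
        List<NavigableMap<String, Long>> regions = new ArrayList<>();
        NavigableMap<String, Long> r1 = new TreeMap<>();
        r1.put("Foo=bar,blah=123", 100L);
        r1.put("Foo=bar,blah=999", 7L);   // wrong blah, filtered out
        NavigableMap<String, Long> r2 = new TreeMap<>();
        r2.put("Foo=baz,blah=123", 250L);
        NavigableMap<String, Long> r3 = new TreeMap<>();
        r3.put("Foo=qux,blah=123", 50L);
        regions.add(r1); regions.add(r2); regions.add(r3);

        // Fan out one scan per region in parallel, then merge the partials
        ExecutorService pool = Executors.newFixedThreadPool(regions.size());
        List<Future<Map<String, Long>>> partials = new ArrayList<>();
        for (NavigableMap<String, Long> r : regions) {
            partials.add(pool.submit(() -> scanPrefix(r, "Foo=", "123")));
        }
        Map<String, Long> merged = new HashMap<>();
        for (Future<Map<String, Long>> f : partials) {
            f.get().forEach((k, v) -> merged.merge(k, v, Long::sum));
        }
        pool.shutdown();

        // Sort descending by total plays and truncate to the top N
        merged.entrySet().stream()
              .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
              .limit(2)  // Foo=baz (250 plays) first, then Foo=bar (100)
              .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}
```

The key property the sketch tries to show is that only small partial results (key, sum) travel back to the merge step, never raw rows.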


It seems quite complicated and would involve either modifying the HBase source code or, at the very least, using the deep internals of the API. Does this seem like a practical solution, or could someone offer some ideas? 


Thank you! 

Re: Fanning out hbase queries in parallel

Posted by Gary Helmling <gh...@gmail.com>.
Unfortunately there's no easy patch set to pull coprocessors into any 0.90
HBase version (including CDH3 HBase).  The changes are extensive and
invasive and include RPC protocol changes.  Internally at Trend Micro we run
a heavily, heavily patched 0.90-based version of HBase that includes
coprocessors and security.  But that is only possible with a lot of effort
to keep things up to date with the HBase 0.90 development.

At one point we had made a 0.90-coprocessor branch available, but it's
simply too much work to keep it up to date.  It's in everyone's best
interests if we instead focus on getting out a 0.92 release that includes
coprocessors.

HBase trunk (and by extension 0.92) of course supports running on CDH3, so
you should have no problem plugging in the new version once HBase 0.92 is
out.

--gh


On Mon, Jul 25, 2011 at 1:23 PM, Paul Nickerson <paul.nickerson@escapemg.com
> wrote:

> We currently run on the cloudera stack. Would this be something that we can
> pull, compile, and plug right into that stack?
>

Re: Fanning out hbase queries in parallel

Posted by Stack <st...@duboce.net>.
Yes.
St.Ack

On Mon, Jul 25, 2011 at 1:23 PM, Paul Nickerson
<pa...@escapemg.com> wrote:
> We currently run on the cloudera stack. Would this be something that we can pull, compile, and plug right into that stack?

Re: Fanning out hbase queries in parallel

Posted by Paul Nickerson <pa...@escapemg.com>.
We currently run on the cloudera stack. Would this be something that we can pull, compile, and plug right into that stack? 

----- Original Message -----

From: "Gary Helmling" <gh...@gmail.com> 
To: user@hbase.apache.org 
Sent: Monday, July 25, 2011 2:02:50 PM 
Subject: Re: Fanning out hbase queries in parallel 

Coprocessors are currently only in trunk. They will be in the 0.92 release 
once we get that out. There's no set date for that, but personally I'll be 
trying to help get it out sooner than later. 




Re: Fanning out hbase queries in parallel

Posted by Gary Helmling <gh...@gmail.com>.
Coprocessors are currently only in trunk.  They will be in the 0.92 release
once we get that out.  There's no set date for that, but personally I'll be
trying to help get it out sooner than later.
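What coprocessor endpoints buy in this use case: each region computes its partial aggregate server-side and ships back only a number. A hedged, in-memory sketch of that pattern follows; PlaySumProtocol and FakeRegion are illustrative stand-ins, not the actual coprocessor API.

```java
import java.util.*;

public class EndpointSketch {
    // Stand-in for a coprocessor protocol interface; with real endpoints,
    // an implementation of this would run inside each region server.
    interface PlaySumProtocol {
        long sumPlays(String prefix, String blah);
    }

    // Fake "region" backed by a sorted map of rowkey -> July plays
    static class FakeRegion implements PlaySumProtocol {
        private final NavigableMap<String, Long> rows;
        FakeRegion(NavigableMap<String, Long> rows) { this.rows = rows; }

        @Override
        public long sumPlays(String prefix, String blah) {
            long sum = 0;
            for (Map.Entry<String, Long> e : rows.tailMap(prefix).entrySet()) {
                if (!e.getKey().startsWith(prefix)) break;
                if (e.getKey().endsWith(",blah=" + blah)) sum += e.getValue();
            }
            return sum; // only the aggregate crosses the wire
        }
    }

    // Client-side merge of the per-region partial results
    static long query(List<PlaySumProtocol> regions, String prefix, String blah) {
        long total = 0;
        for (PlaySumProtocol r : regions) total += r.sumPlays(prefix, blah);
        return total;
    }

    public static void main(String[] args) {
        NavigableMap<String, Long> r1 = new TreeMap<>();
        r1.put("Foo=bar,blah=123", 100L);
        NavigableMap<String, Long> r2 = new TreeMap<>();
        r2.put("Foo=baz,blah=123", 250L);
        r2.put("Foo=baz,blah=999", 9L); // wrong blah, excluded
        System.out.println(query(List.of(new FakeRegion(r1), new FakeRegion(r2)),
                                 "Foo=", "123")); // prints 350
    }
}
```

With real endpoints the client-side loop would instead invoke the deployed protocol once per region in the key range and merge the returned map of partial sums; the division of labor is the same as in this sketch.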


On Mon, Jul 25, 2011 at 7:37 AM, Michel Segel <mi...@hotmail.com> wrote:

> Which release(s) have coprocessors enabled?
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel

Re: Fanning out hbase queries in parallel

Posted by Michel Segel <mi...@hotmail.com>.
Which release(s) have coprocessors enabled?

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jul 24, 2011, at 11:03 PM, Sonal Goyal <so...@gmail.com> wrote:

> Hi Paul,
> 
> Have you taken a look at HBase coprocessors? I think you will find them
> useful.
> 
> Best Regards,
> Sonal
> <https://github.com/sonalgoyal/hiho>Hadoop ETL and Data
> Integration<https://github.com/sonalgoyal/hiho>
> Nube Technologies <http://www.nubetech.co>
> 
> <http://in.linkedin.com/in/sonalgoyal>

Re: Fanning out hbase queries in parallel

Posted by Paul Nickerson <pa...@escapemg.com>.
This looks to be exactly what I need. Thanks :) 

----- Original Message -----

From: "Sonal Goyal" <so...@gmail.com> 
To: user@hbase.apache.org 
Sent: Monday, July 25, 2011 12:03:30 AM 
Subject: Re: Fanning out hbase queries in parallel 

Hi Paul, 

Have you taken a look at HBase coprocessors? I think you will find them 
useful. 

Best Regards, 
Sonal 
<https://github.com/sonalgoyal/hiho>Hadoop ETL and Data 
Integration<https://github.com/sonalgoyal/hiho> 
Nube Technologies <http://www.nubetech.co> 

<http://in.linkedin.com/in/sonalgoyal> 







Re: Fanning out hbase queries in parallel

Posted by Sonal Goyal <so...@gmail.com>.
Hi Paul,

Have you taken a look at HBase coprocessors? I think you will find them
useful.

Best Regards,
Sonal
Hadoop ETL and Data Integration <https://github.com/sonalgoyal/hiho>
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>




