Posted to user@hbase.apache.org by Piotr Praczyk <pi...@gmail.com> on 2009/06/12 15:12:46 UTC

Row filters

Hi

I am developing a MapReduce task that operates on a very large HBase table.
Each run, however, covers only a relatively small subset of this table.
I have read that HBase stores rows in alphabetical order. The rows of
interest for a particular MapReduce task always form contiguous areas
within the table stored in that order.
A binary search over the entire table would give the desired performance,
so I suppose there must be some mechanism to achieve this.

I am wondering how to provide the input for such a task. I have found the
RowFilters mechanism. What performance can I expect from a solution that uses
it?
I want only the rows whose keys start with a given prefix.
I want to avoid scanning the entire table, which filters seem to do.
Do you have any suggestions? I tried to find some information in the
archives of this mailing list, since this seems to be quite an obvious
problem, but I have not found anything.


Thank You
Piotr

Re: Row filters

Posted by Ryan Rawson <ry...@gmail.com>.
The scanner API does not support that. You can use multiple scanners to get
the same effect. The speed won't be much slower either (in 0.20).

In 0.21, with a new API, we will cut down on the number of server round trips,
thus improving the speed even more.

On Jun 15, 2009 2:04 AM, "Piotr Praczyk" <pi...@gmail.com> wrote:

Thanks. I meant something a little different, though. By fragment I meant
all the rows in the table lying (in lexicographical order) between
rows X and Y.
The getScanner calls of HTable allow me to specify such rows. However, I
want a sequence of such fragments: X_1 Y_1 ... X_n Y_n.
After finishing the range X_i Y_i, I would like the scanner to jump to X_{i+1}
Y_{i+1}. For example, let's assume we have a table with rows

aa
ab
ac
ad
ae
ba
bb
bc
bd
be

n=2
X_1 = aa
Y_1 = ac
X_2 = bc
Y_2 = bd

I would like the scanner to return the following rows: aa, ab, ac, bc, bd
[without using filters, to avoid a linear search].
It does not seem very difficult to implement this myself, but there is
probably some built-in mechanism, since this looks like a common use case.


cheers
Piotr

2009/6/15 Ryan Rawson <ry...@gmail.com>

> And let me follow up a bit... > > The best configuration for a m-r job is
to have the # of map ta...

Re: Row filters

Posted by Piotr Praczyk <pi...@gmail.com>.
Thanks. I meant something a little different, though. By fragment I meant
all the rows in the table lying (in lexicographical order) between
rows X and Y.
The getScanner calls of HTable allow me to specify such rows. However, I
want a sequence of such fragments: X_1 Y_1 ... X_n Y_n.
After finishing the range X_i Y_i, I would like the scanner to jump to X_{i+1}
Y_{i+1}. For example, let's assume we have a table with rows

aa
ab
ac
ad
ae
ba
bb
bc
bd
be

n=2
X_1 = aa
Y_1 = ac
X_2 = bc
Y_2 = bd

I would like the scanner to return the following rows: aa, ab, ac, bc, bd
[without using filters, to avoid a linear search].
It does not seem very difficult to implement this myself, but there is
probably some built-in mechanism, since this looks like a common use case.
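
The multi-range scan described above can be sketched in plain Java. This is
only an illustration of the access pattern, not HBase code: a TreeMap stands in
for the sorted table, and the class and method names (MultiRangeScan,
scanRanges, exampleTable) are made up for this example. Both ends of each range
are treated as inclusive, to match the rows listed in the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Stand-in for a sorted table: a TreeMap keeps row keys in lexicographic order.
// Nothing here is HBase API; it only illustrates jumping from range to range.
class MultiRangeScan {

    // Visit each {start, stop} pair in turn, both ends inclusive.
    static List<String> scanRanges(NavigableMap<String, String> table, List<String[]> ranges) {
        List<String> rows = new ArrayList<>();
        for (String[] range : ranges) {
            // subMap locates the range by tree lookup, so rows between two
            // ranges (ad, ae, ba, bb in the example) are never visited.
            rows.addAll(table.subMap(range[0], true, range[1], true).keySet());
        }
        return rows;
    }

    // The ten-row table from the example.
    static NavigableMap<String, String> exampleTable() {
        NavigableMap<String, String> table = new TreeMap<>();
        for (String key : new String[] {"aa", "ab", "ac", "ad", "ae", "ba", "bb", "bc", "bd", "be"}) {
            table.put(key, "value-" + key);
        }
        return table;
    }

    public static void main(String[] args) {
        List<String[]> ranges = Arrays.asList(new String[] {"aa", "ac"}, new String[] {"bc", "bd"});
        System.out.println(scanRanges(exampleTable(), ranges)); // [aa, ab, ac, bc, bd]
    }
}
```

Against a real table, each iteration of the loop would presumably open its own
scanner on one range, which is the "multiple scanners" approach.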


cheers
Piotr

2009/6/15 Ryan Rawson <ry...@gmail.com>

> And let me follow up a bit...
>
> The best configuration for a m-r job is to have the # of map tasks = # of
> regions in the table.  While a scanner can iterate between regions, once the
> table size gets really big, it's best in my experience, more reliable as
> well, to have a 1:1 correspondence between map tasks and regions.
>
> -ryan
>
> On Mon, Jun 15, 2009 at 1:55 AM, Ryan Rawson <ry...@gmail.com> wrote:
>
> > Hey,
> >
> > The client-side scanner code already will move it to the next region when
> > it hits the end of a region.
> >
> > -ryan
> >
> >
> >
> > On Mon, Jun 15, 2009 at 1:52 AM, Piotr Praczyk <pi...@gmail.com> wrote:
> >
> >> 2009/6/12 stack <st...@duboce.net>
> >>
> >> > On Fri, Jun 12, 2009 at 8:41 AM, Erik Holstad <er...@gmail.com>
> >> > wrote:
> >> >
> >> > > ...
> >> > > not really sure how this
> >> > > was done in 0.19 and earlier.
> >> >
> >> >
> >> > There's a stoprow filter in 0.19.x and earlier.  There is also a
> >> > getScanner override that takes a start and stop row in 0.19.x (under the
> >> > wraps it uses stop row filter -- check the client source).
> >> > St>Ack
> >> >
> >>
> >> Thanks :-) It was very helpful.
> >> Do you know if there is any standard Scanner allowing to iterate over more
> >> than one table fragments ? [when one chunk finishes, jumping to the
> >> beginning of another] Or rather should I implement it myself ?
> >>
> >>
> >> Piotr
> >>
> >
> >
>

Re: Row filters

Posted by Ryan Rawson <ry...@gmail.com>.
And let me follow up a bit...

The best configuration for a m-r job is to have the # of map tasks = # of
regions in the table.  While a scanner can iterate between regions, once the
table size gets really big, it's best in my experience, more reliable as
well, to have a 1:1 correspondence between map tasks and regions.
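
To make the 1:1 idea concrete, here is a rough sketch of cutting one requested
row range at region boundaries, so that each piece could go to its own map
task. It is plain Java, not HBase API: the region start keys are invented, the
names (RegionSplitSketch, splitByRegion, earlierStop) are hypothetical, ranges
are half-open [start, stop), and keys are compared as Java strings, which only
matches byte order for ASCII keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RegionSplitSketch {

    // Cut the half-open range [start, stop) at region boundaries.
    // regionStartKeys must be sorted and begin with "" (start of the table).
    // An empty stop key means "up to the end of the table".
    static List<String[]> splitByRegion(List<String> regionStartKeys, String start, String stop) {
        List<String[]> pieces = new ArrayList<>();
        for (int i = 0; i < regionStartKeys.size(); i++) {
            String regionStart = regionStartKeys.get(i);
            // A region ends where the next one starts; the last region is open-ended.
            String regionStop = i + 1 < regionStartKeys.size() ? regionStartKeys.get(i + 1) : "";
            String pieceStart = regionStart.compareTo(start) > 0 ? regionStart : start;
            String pieceStop = earlierStop(regionStop, stop);
            // Keep the piece only if the region actually overlaps the range.
            if (pieceStop.isEmpty() || pieceStart.compareTo(pieceStop) < 0) {
                pieces.add(new String[] {pieceStart, pieceStop});
            }
        }
        return pieces;
    }

    // Smaller of two stop keys, where "" means unbounded.
    static String earlierStop(String a, String b) {
        if (a.isEmpty()) return b;
        if (b.isEmpty()) return a;
        return a.compareTo(b) <= 0 ? a : b;
    }

    public static void main(String[] args) {
        // Three made-up regions: ["", "b"), ["b", "m"), ["m", end of table).
        List<String> regions = Arrays.asList("", "b", "m");
        for (String[] piece : splitByRegion(regions, "aa", "bd")) {
            System.out.println("[" + piece[0] + ", " + piece[1] + ")");
        }
        // prints "[aa, b)" and then "[b, bd)"; the third region is skipped
    }
}
```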

-ryan

On Mon, Jun 15, 2009 at 1:55 AM, Ryan Rawson <ry...@gmail.com> wrote:

> Hey,
>
> The client-side scanner code already will move it to the next region when
> it hits the end of a region.
>
> -ryan
>
>
>
> On Mon, Jun 15, 2009 at 1:52 AM, Piotr Praczyk <pi...@gmail.com> wrote:
>
>> 2009/6/12 stack <st...@duboce.net>
>>
>> > On Fri, Jun 12, 2009 at 8:41 AM, Erik Holstad <er...@gmail.com>
>> > wrote:
>> >
>> > > ...
>> > > not really sure how this
>> > > was done in 0.19 and earlier.
>> >
>> >
>> > There's a stoprow filter in 0.19.x and earlier.  There is also a
>> > getScanner override that takes a start and stop row in 0.19.x (under the
>> > wraps it uses stop row filter -- check the client source).
>> > St>Ack
>> >
>>
>> Thanks :-) It was very helpful.
>> Do you know if there is any standard Scanner allowing to iterate over more
>> than one table fragments ? [when one chunk finishes, jumping to the
>> beginning of another] Or rather should I implement it myself ?
>>
>>
>> Piotr
>>
>
>

Re: Row filters

Posted by Ryan Rawson <ry...@gmail.com>.
Hey,

The client-side scanner code already will move it to the next region when it
hits the end of a region.

-ryan


On Mon, Jun 15, 2009 at 1:52 AM, Piotr Praczyk <pi...@gmail.com> wrote:

> 2009/6/12 stack <st...@duboce.net>
>
> > On Fri, Jun 12, 2009 at 8:41 AM, Erik Holstad <er...@gmail.com>
> > wrote:
> >
> > > ...
> > > not really sure how this
> > > was done in 0.19 and earlier.
> >
> >
> > There's a stoprow filter in 0.19.x and earlier.  There is also a
> > getScanner override that takes a start and stop row in 0.19.x (under the
> > wraps it uses stop row filter -- check the client source).
> > St>Ack
> >
>
> Thanks :-) It was very helpful.
> Do you know if there is any standard Scanner allowing to iterate over more
> than one table fragments ? [when one chunk finishes, jumping to the
> beginning of another] Or rather should I implement it myself ?
>
>
> Piotr
>

Re: Row filters

Posted by Piotr Praczyk <pi...@gmail.com>.
2009/6/12 stack <st...@duboce.net>

> On Fri, Jun 12, 2009 at 8:41 AM, Erik Holstad <er...@gmail.com>
> wrote:
>
> > ...
> > not really sure how this
> > was done in 0.19 and earlier.
>
>
> There's a stoprow filter in 0.19.x and earlier.  There is also a getScanner
> override that takes a start and stop row in 0.19.x (under the wraps it uses
> stop row filter -- check the client source).
> St>Ack
>

Thanks :-) That was very helpful.
Do you know if there is any standard Scanner that can iterate over more
than one table fragment? [When one chunk finishes, it jumps to the
beginning of the next.] Or should I implement it myself?


Piotr

Re: Row filters

Posted by stack <st...@duboce.net>.
On Fri, Jun 12, 2009 at 8:41 AM, Erik Holstad <er...@gmail.com> wrote:

> ...
> not really sure how this
> was done in 0.19 and earlier.


There's a stoprow filter in 0.19.x and earlier.  There is also a getScanner
override that takes a start and stop row in 0.19.x (under the wraps it uses
stop row filter -- check the client source).
St>Ack

Re: Row filters

Posted by Erik Holstad <er...@gmail.com>.
Hi Piotr!
Yes, HBase stores data in sorted order, so rows with similar row keys
are stored close to each other. If you want to scan only the rows in a
specific range and use that as the input for your map-reduce job, there are
two ways to do it. You can either use a filter that filters out the rows
outside this range or, as soon as we get 0.20 out, specify the start and
stop row in the Scan object. Once you are past that range you can safely stop
scanning. I am not really sure how this was done in 0.19 and earlier.
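
For the prefix case from the original question, the start row is the prefix
itself, and a stop row can be derived from it. The sketch below shows one
common way to do that, assuming row keys are compared byte by byte as unsigned
values and that the stop row is exclusive. The helper name stopRowForPrefix is
made up for this example and is not part of any HBase API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PrefixRange {

    // Smallest key that sorts after every key starting with the given prefix.
    // Returns an empty array when no such key exists (empty prefix, or a
    // prefix made only of 0xFF bytes), meaning "scan to the end of the table".
    static byte[] stopRowForPrefix(byte[] prefix) {
        int end = prefix.length;
        // A trailing 0xFF byte cannot be incremented, so drop it and carry
        // the increment into the byte before it.
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++;
        return stop;
    }

    public static void main(String[] args) {
        byte[] stop = stopRowForPrefix("ab".getBytes(StandardCharsets.UTF_8));
        // Rows with prefix "ab" lie in the half-open range ["ab", "ac").
        System.out.println(new String(stop, StandardCharsets.UTF_8)); // ac
    }
}
```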

Hope that this can help.

Regards Erik