Posted to user@hbase.apache.org by Jerry Lam <ch...@gmail.com> on 2012/08/01 23:44:49 UTC

Filter with State

Hi HBase guru:

From Lars George's talk, I understand that filters have no state. What if I need
to scan rows where the decision to filter a row or not is based on the
previous row's column values? Any idea how one can implement this type of
logic?

Best Regards,

Jerry

Re: Filter with State

Posted by lars hofhansl <lh...@yahoo.com>.
Hi Jerry,

You could create a RegionObserver implementation that implements the postScannerOpen
hook and wraps the passed scanner with your own RegionScanner to do the filtering.

However, RegionObservers are still per region, so that would not help you across regions either.

Your best bet might be to do as much work as possible on the server (i.e. filter rows based on other rows in the same region),
and then do a final filtering step on the client.


-- Lars
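
A minimal sketch of the client-side post step suggested above, assuming rows reach the client in key order through an ordinary Java iterator (in an HBase client that would presumably be the iterator of a ResultScanner). ScannedRow, PreviousRowFilteringIterator and the "value must increase" rule are hypothetical names and logic made up for illustration; nothing here is HBase API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.BiPredicate;

/** Hypothetical stand-in for a scanned row: a key plus one column value. */
final class ScannedRow {
    final String key;
    final long value;

    ScannedRow(String key, long value) {
        this.key = key;
        this.value = value;
    }
}

/**
 * Wraps any row iterator and keeps a row only if the predicate accepts
 * (previousRow, currentRow). previousRow is null for the first row.
 */
final class PreviousRowFilteringIterator implements Iterator<ScannedRow> {
    private final Iterator<ScannedRow> source;
    private final BiPredicate<ScannedRow, ScannedRow> keep;
    private ScannedRow previous;  // state carried across rows, held on the client
    private ScannedRow nextMatch; // look-ahead buffer for hasNext()

    PreviousRowFilteringIterator(Iterator<ScannedRow> source,
                                 BiPredicate<ScannedRow, ScannedRow> keep) {
        this.source = source;
        this.keep = keep;
    }

    @Override
    public boolean hasNext() {
        while (nextMatch == null && source.hasNext()) {
            ScannedRow current = source.next();
            if (keep.test(previous, current)) {
                nextMatch = current;
            }
            previous = current; // "previous" is the last row seen, kept or not
        }
        return nextMatch != null;
    }

    @Override
    public ScannedRow next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        ScannedRow result = nextMatch;
        nextMatch = null;
        return result;
    }
}

final class ClientSideFilterDemo {
    /** Keeps rows whose value is strictly greater than the previous row's value. */
    static List<String> keysOfIncreasingRows(List<ScannedRow> rows) {
        Iterator<ScannedRow> it = new PreviousRowFilteringIterator(
            rows.iterator(),
            (prev, cur) -> prev == null || cur.value > prev.value);
        List<String> keys = new ArrayList<>();
        while (it.hasNext()) {
            keys.add(it.next().key);
        }
        return keys;
    }

    public static void main(String[] args) {
        List<ScannedRow> rows = Arrays.asList(
            new ScannedRow("r1", 5), new ScannedRow("r2", 3),
            new ScannedRow("r3", 7), new ScannedRow("r4", 7));
        System.out.println(keysOfIncreasingRows(rows)); // prints [r1, r3]
    }
}
```

Because the state lives in the client, it survives region boundaries, but it is lost if the client dies, so a restart would have to re-read at least the row before the resume point.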



----- Original Message -----
From: Jerry Lam <ch...@gmail.com>
To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
Cc: 
Sent: Thursday, August 2, 2012 6:50 AM
Subject: Re: Filter with State

Hi Lars:

That is useful. I appreciate it. The idea about cross-row transactions is an
interesting one.

Can I have an iterator on the client side that gets rows from a coprocessor?
(i.e. filtered rows are streamed to the client application, and the client can
access them via an iterator)

Best Regards,

Jerry




Re: Filter with State

Posted by Jerry Lam <ch...@gmail.com>.
Hi Lars:

That is useful. I appreciate it. The idea about cross-row transactions is an
interesting one.

Can I have an iterator on the client side that gets rows from a coprocessor?
(i.e. filtered rows are streamed to the client application, and the client can
access them via an iterator)

Best Regards,

Jerry


On Thu, Aug 2, 2012 at 12:13 AM, lars hofhansl <lh...@yahoo.com> wrote:

> The Filter is initialized per Region as part of a RegionScannerImpl.
>
> So as long as all the rows you are interested in are co-located in the same
> region you can keep that state in the Filter instance.
>
> You can use a custom RegionSplitPolicy to control (to some extent at
> least) how the rows are co-located (KeyPrefixRegionSplitPolicy is an
> example).
>
> I also blogged about this here (in the context of cross row transactions):
> http://hadoop-hbase.blogspot.com/2012/02/limited-cross-row-transactions-in-hbase.html
>
>
> Maybe what you really are looking for are coprocessors?
>
>
> -- Lars

Re: Filter with State

Posted by lars hofhansl <lh...@yahoo.com>.
The Filter is initialized per Region as part of a RegionScannerImpl.

So as long as all the rows you are interested in are co-located in the same region, you can keep that state in the Filter instance.

You can use a custom RegionSplitPolicy to control (to some extent at least) how the rows are co-located (KeyPrefixRegionSplitPolicy is an example).
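
To illustrate the prefix idea, here is a rough sketch in plain Java, not HBase code. It assumes that a prefix-based policy works by cutting a proposed split key down to a fixed-length prefix, so that all rows sharing that prefix sort on the same side of the split. PrefixSplitSketch, its method names, the "user42-" keys and the prefix length of 6 are all made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PrefixSplitSketch {
    /** Truncates a proposed split key to at most prefixLength bytes. */
    static byte[] prefixSplitPoint(byte[] proposedSplitKey, int prefixLength) {
        return Arrays.copyOf(proposedSplitKey,
                             Math.min(prefixLength, proposedSplitKey.length));
    }

    /** Lexicographic comparison treating bytes as unsigned values. */
    static int compareUnsignedBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** True if rowKey sorts at or after splitPoint, i.e. lands in the upper region. */
    static boolean inUpperRegion(byte[] rowKey, byte[] splitPoint) {
        return compareUnsignedBytes(rowKey, splitPoint) >= 0;
    }

    public static void main(String[] args) {
        byte[] first = "user42-0001".getBytes(StandardCharsets.UTF_8);
        byte[] second = "user42-0002".getBytes(StandardCharsets.UTF_8);

        // Splitting exactly at the second row separates the two rows.
        System.out.println(inUpperRegion(first, second));  // prints false
        System.out.println(inUpperRegion(second, second)); // prints true

        // Splitting at the 6-byte prefix "user42" keeps both rows together.
        byte[] split = prefixSplitPoint(second, 6);
        System.out.println(inUpperRegion(first, split));   // prints true
        System.out.println(inUpperRegion(second, split));  // prints true
    }
}
```

The row keys have to be designed so that dependent rows share the prefix; the split point can only respect a grouping that the keys already encode.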

I also blogged about this here (in the context of cross row transactions): http://hadoop-hbase.blogspot.com/2012/02/limited-cross-row-transactions-in-hbase.html


Maybe what you really are looking for are coprocessors?


-- Lars



----- Original Message -----
From: Jerry Lam <ch...@gmail.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>
Cc: 
Sent: Wednesday, August 1, 2012 7:06 PM
Subject: Re: Filter with State

Hi Lars,

I understand that it is more difficult to carry state across regions/servers, but how about within a single region? Knowing that the rows in a single region have dependencies, can we have a filter with state? If filters don't provide this ability, is there another mechanism in HBase that offers this kind of functionality?

I think this is a good feature because it allows efficient scanning of dependent rows. Instead of fetching each row to the client side and checking whether we should fetch the next row, the filter on the server side handles this logic.

Best Regards,

Jerry 

Sent from my iPad (sorry for spelling mistakes)



Re: Filter with State

Posted by Jerry Lam <ch...@gmail.com>.
Hi Lars,

I understand that it is more difficult to carry state across regions/servers, but how about within a single region? Knowing that the rows in a single region have dependencies, can we have a filter with state? If filters don't provide this ability, is there another mechanism in HBase that offers this kind of functionality?

I think this is a good feature because it allows efficient scanning of dependent rows. Instead of fetching each row to the client side and checking whether we should fetch the next row, the filter on the server side handles this logic.

Best Regards,

Jerry 

Sent from my iPad (sorry for spelling mistakes)

On 2012-08-01, at 21:52, lars hofhansl <lh...@yahoo.com> wrote:

> The issue here is that different rows can be located in different regions or even different region servers, so no local state will carry over all rows.

Re: Filter with State

Posted by lars hofhansl <lh...@yahoo.com>.
The issue here is that different rows can be located in different regions or even different region servers, so no local state will carry over all rows.
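
A toy simulation of this point in plain Java (no HBase classes). It assumes, as described elsewhere in this thread, that each region gets its own filter instance, so whatever the filter remembers starts empty at every region boundary. ChangedValueFilter, PerRegionStateDemo and the "keep a row only if its value differs from the previous row" rule are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy filter: keeps a row only if its value differs from the previous row it saw. */
final class ChangedValueFilter {
    private Long previousValue; // null until this instance has seen a row

    boolean keep(long value) {
        boolean changed = previousValue == null || previousValue != value;
        previousValue = value;
        return changed;
    }
}

final class PerRegionStateDemo {
    /** Scans each region with a fresh filter instance and returns the kept values. */
    static List<Long> scan(List<List<Long>> regions) {
        List<Long> kept = new ArrayList<>();
        for (List<Long> region : regions) {
            // A new instance per region: no state is carried over from the last region.
            ChangedValueFilter filter = new ChangedValueFilter();
            for (long value : region) {
                if (filter.keep(value)) {
                    kept.add(value);
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The same four rows, first in one region, then split across two regions.
        List<List<Long>> oneRegion = Arrays.asList(Arrays.asList(1L, 2L, 2L, 3L));
        List<List<Long>> twoRegions = Arrays.asList(
            Arrays.asList(1L, 2L), Arrays.asList(2L, 3L));

        System.out.println(scan(oneRegion));  // prints [1, 2, 3]
        System.out.println(scan(twoRegions)); // prints [1, 2, 2, 3]
    }
}
```

The duplicate 2 slips through in the second case because the first row of the second region has no previous row to compare against. A final pass on the client would be needed to catch rows like that.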



----- Original Message -----
From: Jerry Lam <ch...@gmail.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>
Cc: "user@hbase.apache.org" <us...@hbase.apache.org>
Sent: Wednesday, August 1, 2012 5:48 PM
Subject: Re: Filter with State

Hi St.Ack:

The schema cannot be changed to a single row.
The API documentation says "Do not rely on filters carrying state across rows; its not reliable in current hbase as we have no handlers in place for when regions split, close or server crashes." If we manage region splitting ourselves, the split issue doesn't apply. Other failures can be handled at the application level. Does each invocation of scanner.next instantiate a new filter on the server side, even on the same region (i.e. does scanning the same region use the same filter instance, or a different one for each scanner.next call)?

Best Regards,

Jerry 

Sent from my iPad (sorry for spelling mistakes)



Re: Filter with State

Posted by Jerry Lam <ch...@gmail.com>.
Hi St.Ack:

The schema cannot be changed to a single row.
The API documentation says "Do not rely on filters carrying state across rows; its not reliable in current hbase as we have no handlers in place for when regions split, close or server crashes." If we manage region splitting ourselves, the split issue doesn't apply. Other failures can be handled at the application level. Does each invocation of scanner.next instantiate a new filter on the server side, even on the same region (i.e. does scanning the same region use the same filter instance, or a different one for each scanner.next call)?

Best Regards,

Jerry 

Sent from my iPad (sorry for spelling mistakes)

On 2012-08-01, at 18:44, Stack <st...@duboce.net> wrote:

> On Wed, Aug 1, 2012 at 10:44 PM, Jerry Lam <ch...@gmail.com> wrote:
>> Hi HBase guru:
>> 
>> From Lars George talk, he mentions that filter has no state. What if I need
>> to scan rows in which the decision to filter one row or not is based on the
>> previous row's column values? Any idea how one can implement this type of
>> logic?
> 
> You could try carrying state in the client (but if client dies state dies).
> 
> You can't have scanners carry state across rows.  It says so in API
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/package-summary.html#package_description
> (Whatever about the API, if LarsG says it, it must be so!).
> 
> Here is the issue: If row X is in region A on server 1 there is
> nothing to prevent row X+1 from being on region B on server 2.  How do
> you carry the state between such rows reliably?
> 
> Can you redo your schema such that the state you need to carry remains
> within a row?
> St.Ack

Re: Filter with State

Posted by Stack <st...@duboce.net>.
On Wed, Aug 1, 2012 at 10:44 PM, Jerry Lam <ch...@gmail.com> wrote:
> Hi HBase guru:
>
> From Lars George talk, he mentions that filter has no state. What if I need
> to scan rows in which the decision to filter one row or not is based on the
> previous row's column values? Any idea how one can implement this type of
> logic?

You could try carrying state in the client (but if client dies state dies).

You can't have scanners carry state across rows.  It says so in API
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/package-summary.html#package_description
(Whatever about the API, if LarsG says it, it must be so!).

Here is the issue: If row X is in region A on server 1 there is
nothing to prevent row X+1 from being on region B on server 2.  How do
you carry the state between such rows reliably?

Can you redo your schema such that the state you need to carry remains
within a row?
St.Ack
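
The row X / row X+1 situation described above can be pictured with a small sketch in plain Java. It assumes only that each region covers a contiguous range of sorted row keys beginning at its start key, so the region for a row is the one with the greatest start key not exceeding the row key. RegionLookupSketch, the row keys and the region labels are made up for the example and are not HBase API.

```java
import java.util.TreeMap;

final class RegionLookupSketch {
    // Maps each region's start key to a label; the first region starts at "".
    private final TreeMap<String, String> regionByStartKey = new TreeMap<>();

    void addRegion(String startKey, String label) {
        regionByStartKey.put(startKey, label);
    }

    /** Returns the region whose start key is the greatest one <= rowKey. */
    String regionFor(String rowKey) {
        return regionByStartKey.floorEntry(rowKey).getValue();
    }

    /** Builds a hypothetical two-region table split at "row-500". */
    static RegionLookupSketch twoRegionTable() {
        RegionLookupSketch table = new RegionLookupSketch();
        table.addRegion("", "region A on server 1");
        table.addRegion("row-500", "region B on server 2");
        return table;
    }

    public static void main(String[] args) {
        RegionLookupSketch table = twoRegionTable();
        // Two adjacent rows in sort order end up in different regions.
        System.out.println(table.regionFor("row-499")); // prints region A on server 1
        System.out.println(table.regionFor("row-500")); // prints region B on server 2
    }
}
```

Any state built up while scanning row-499 sits on server 1, while row-500 is served by server 2, which is why the state has to travel through the client or stay within a single row.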