Posted to issues@hbase.apache.org by "Juhani Connolly (JIRA)" <ji...@apache.org> on 2010/04/19 09:23:50 UTC

[jira] Commented: (HBASE-2466) Improving filter API to allow for modification of keyvalue list by filter

    [ https://issues.apache.org/jira/browse/HBASE-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858406#action_12858406 ] 

Juhani Connolly commented on HBASE-2466:
----------------------------------------

Original exchange on mailing list:

-------------------------------------------------------------------------------------------------------------------

Yes, you are correct: filterRow() only offers the chance to reject the
row; editing the row was expected to be done in the filterKeyValue()
call.

The problem with the filter "interface" is that it is tightly tied to the
implementation, which is why things look perhaps a little weird and
not super generic. Previously the filter was expected to be run only
at the StoreScanner level, so that might explain a few things.

I think an additional edit call that gives a filter the final,
last-minute say over a row's worth of results might be
workable now.

I'd review such a patch.

-ryan

On Sun, Apr 18, 2010 at 10:30 PM, Juhani Connolly <ju...@ninja.co.jp> wrote:
> > Thanks for your response
> >
> > On 04/19/2010 12:59 PM, Ryan Rawson wrote:
>> >>
>> >> I think all the functionality is there between these 2 calls:
>> >>
>> >> Filter#filterKeyValue(KeyValue kv);
>> >> and
>> >> Filter#filterRow();
>> >>
>> >> In the first call you can cache the KeyValues locally in the filter
>> >> state (in a List<KeyValue>  for example).  In the last call you can do
>> >> your custom logic based on all the KeyValues you have seen.  There is
>> >> little to no cost to do this, since retaining references to a KeyValue
>> >> is cheap (ish, relatively, etc).
>> >>
>> >>
> >
> > But ultimately the only thing I can do with Filter#filterRow() is drop the
> > full row? Am I missing something here? Were I to store references to all the
> > key values that have passed through at most I could zero out their buffers
> > in the #filterRow call? I'm not sure what the consequences of this might be
> > afterwards as the scanner tries to send a load of empty cells. Looking at
> > HRegionServer#next(final long scannerId, int nbRows), it seems to me that
> > they would get packed into Result to get sent back to the client. I could
> > certainly cut down on a lot of transfer by just sending "empty" keyvalues,
> > but it still seems like a lot of overhead that could be avoided by a small API
> > change. Or am I missing something here?
> >
>> >> The filter implementation has changed a bit since August 2009, and it
>> >> might be possible to create a call like
>> >> Filter#filterRow(List<KeyValue>  results) that is called at the "end"
>> >> of a row... you can get the same effect as I noted above.  It is just
>> >> a matter of API, not of semantics.
>> >>
>> >>
> >
> > Having followed the code, it did seem like it would be trivial to implement
> > such an extra api either before or after the Filter#filterRow(). I believe
> > the option of having the ability to knock keyvals out of the list would save
> > on processing later.
> > I would be happy to try putting together the minor modification to
> > RegionScanner and adding a unit test if such a modification were welcome.
> >
>> >> I would generally discourage you from structuring your data to fit an
>> >> internal implementation detail.  While there are no current plans to
>> >> change sorting order, it would make your code more brittle.
>> >>
>> >>
> >
> > I certainly wouldn't want to do it :) I'm going to have to see how much
> > overhead I get with a) just dealing with it client end or b) keeping
> > references and zeroing the keyvals and go from there.
> >
>> >> -ryan
>> >>
>> >> On Sun, Apr 18, 2010 at 8:48 PM, Juhani Connolly<ju...@ninja.co.jp>
>> >>  wrote:
>> >>
>>> >>>
>>> >>> I've spent some time looking through the regionscanner logic, in
>>> >>> particular
>>> >>> the filter related parts and would want to check if a) my current
>>> >>> understanding is correct and b) if this may be subject to change.
>>> >>>
>>> >>> short/simplified version to avoid getting sidetracked:
>>> >>> - A RegionScanner is built from a series of scanners attached to each
>>> >>> Store.
>>> >>> - This list of scanners is stored in a KeyValueHeap which compares
>>> >>> KeyValues
>>> >>> to sort the order in which entries are retrieved by RegionScanner->next
>>> >>>  - To check the order in which keys will be returned, and thus filtered
>>> >>> one
>>> >>> can look at KeyValue.KeyComparator->compare. It's something like: sort by
>>> >>> row, then column family, then column, then timestamp
>>> >>>
>>> >>> Filters are applied as described in
>>> >>>
>>> >>> http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/filter/Filter.html
>>> >>>
>>> >>> In the end, when using filterKeyValue(KeyValue) one can expect the
>>> >>> keyValues
>>> >>> to be sent to it in a sorted order. Will this always be the case?
>>> >>>
>>> >>> I ask this because I currently plan to filter the values of col-b based
>>> >>> on
>>> >>> the values in col-a. This could be achieved by making sure col-a compares
>>> >>> lower than col-b and storing some kind of data (e.g. a list of "ok"
>>> >>> timestamps) within the custom filter. Does this all sound ok?
>>> >>>
>>> >>> Finally it would be nice to see the option to filter a full set, as
>>> >>> naming
>>> >>> columns to guarantee a certain sorting for filters seems pretty dubious:
>>> >>> - Probably in HRegion.RegionScanner->next after nextInternal, before
>>> >>> filterRow?
>>> >>> - This would allow a potential filter to go through the gathered results
>>> >>> and
>>> >>> prune them depending on intercolumn dependencies?
>>> >>> - I believe it would unlock a lot of possibilities for custom filters
>>> >>> that
>>> >>> could cut down on a significant amount of transfer where a row's data could
>>> >>> be
>>> >>> pruned regionserver side rather than at the client. My particular
>>> >>> application is to only store col-b where there is a col-a with a
>>> >>> corresponding timestamp that matches specific conditions. In my
>>> >>> particular
>>> >>> case this results in massive reductions in the amount of cells being sent
>>> >>> from the regionserver.
>>> >>>
>>> >>> Any thoughts would be appreciated.
>>> >>>
>>> >>> As an aside, I believe HRegion.RegionScanner->nextInternal is doing
>>> >>> filterRowKey for every key in a row even if it has passed once? Is this
>>> >>> intentional behaviour (it seems somewhat unexpected), as otherwise it
>>> >>> could
>>> >>> be optimised by just checking the samerow variable.
>>> >>>
>>> >>>
>> >>
>> >>
> >
> >
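
For reference, the approach Ryan describes in the exchange above (keep
references to the KeyValues in filterKeyValue(), then decide in filterRow())
can be sketched roughly as follows. This is a minimal illustration, not HBase
code: Cell is a hypothetical stand-in for KeyValue, RequireColumnFilter is an
invented name, and the boolean return values follow the "true means drop"
wording of the issue description rather than any particular HBase release.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the "cache in filterKeyValue(), decide in filterRow()" pattern.
 * Cell and the boolean-returning methods are stand-ins, not the HBase API.
 */
final class CachingFilterSketch {

    /** Hypothetical, simplified stand-in for KeyValue. */
    static final class Cell {
        final String column;
        final long timestamp;

        Cell(String column, long timestamp) {
            this.column = column;
            this.timestamp = timestamp;
        }
    }

    /** Drops a whole row unless it contains the required column. */
    static final class RequireColumnFilter {
        private final String requiredColumn;
        private final List<Cell> seen = new ArrayList<>();

        RequireColumnFilter(String requiredColumn) {
            this.requiredColumn = requiredColumn;
        }

        /** Called once per cell; returning true would drop that cell. */
        boolean filterKeyValue(Cell kv) {
            seen.add(kv);   // retain a reference only, no copy
            return false;   // keep every cell at this stage
        }

        /** Called at the end of a row; returning true drops the entire row. */
        boolean filterRow() {
            boolean found = false;
            for (Cell c : seen) {
                if (c.column.equals(requiredColumn)) {
                    found = true;
                    break;
                }
            }
            seen.clear();   // reset per-row state before the next row
            return !found;
        }
    }

    public static void main(String[] args) {
        RequireColumnFilter filter = new RequireColumnFilter("col-a");

        // Row 1 contains col-a.
        filter.filterKeyValue(new Cell("col-a", 10L));
        filter.filterKeyValue(new Cell("col-b", 10L));
        System.out.println("drop row 1: " + filter.filterRow());

        // Row 2 contains only col-b.
        filter.filterKeyValue(new Cell("col-b", 20L));
        System.out.println("drop row 2: " + filter.filterRow());
    }
}
```

Note that the only decision available in filterRow() here is keep-or-drop for
the whole row, which is the limitation the rest of the thread discusses.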


> Improving filter API to allow for modification of keyvalue list by filter
> -------------------------------------------------------------------------
>
>                 Key: HBASE-2466
>                 URL: https://issues.apache.org/jira/browse/HBASE-2466
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: filters, regionserver
>            Reporter: Juhani Connolly
>            Priority: Minor
>
> As it stands, the Filter interface allows filtering by:
> Filter#filterAllRemaining() -> true indicates the scan is over; false means keep going.
> Filter#filterRowKey(byte[],int,int) -> true to drop this row; if false, we will also call
> Filter#filterKeyValue(KeyValue) -> true to drop this key/value
> Filter#filterRow() -> last chance to drop the entire row based on the sequence of filterKeyValue() calls. E.g.: filter a row if it doesn't contain a specified column.
> It would be useful to add a step that prunes the list of KeyValues to be sent, by implementing an additional
> Filter#filterRow(List<KeyValue>)
> This would allow a user to write a custom filter against the API that drops unnecessary KeyValues according to user-defined rules.
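
A rough sketch of what a filter written against the proposed
Filter#filterRow(List<KeyValue>) hook might look like, using the col-a/col-b
case from the mailing list exchange. Everything here is an assumption made for
illustration: Cell stands in for KeyValue, RowListFilter and
MatchingTimestampFilter are invented names, and the "ok" value check is a
placeholder for whatever condition col-a must satisfy. It is not the patch
itself.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of a row-level pruning hook; none of these types are HBase classes. */
final class RowPruneSketch {

    /** Hypothetical, simplified stand-in for KeyValue. */
    static final class Cell {
        final String column;
        final long timestamp;
        final String value;

        Cell(String column, long timestamp, String value) {
            this.column = column;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Hypothetical hook with the shape proposed in the issue. */
    interface RowListFilter {
        /** May remove entries from the row's cell list in place. */
        void filterRow(List<Cell> rowCells);
    }

    /** Keeps a col-b cell only if a col-a cell with the same timestamp is "ok". */
    static final class MatchingTimestampFilter implements RowListFilter {
        @Override
        public void filterRow(List<Cell> rowCells) {
            // Pass 1: collect timestamps whose col-a cell meets the condition.
            Set<Long> accepted = new HashSet<>();
            for (Cell c : rowCells) {
                if (c.column.equals("col-a") && c.value.equals("ok")) {
                    accepted.add(c.timestamp);
                }
            }
            // Pass 2: prune col-b cells without an accepted timestamp.
            rowCells.removeIf(
                c -> c.column.equals("col-b") && !accepted.contains(c.timestamp));
        }
    }

    public static void main(String[] args) {
        List<Cell> row = new ArrayList<>();
        row.add(new Cell("col-a", 3L, "ok"));
        row.add(new Cell("col-a", 2L, "bad"));
        row.add(new Cell("col-b", 3L, "payload-3"));
        row.add(new Cell("col-b", 2L, "payload-2"));
        row.add(new Cell("col-b", 1L, "payload-1"));

        new MatchingTimestampFilter().filterRow(row);

        for (Cell c : row) {
            System.out.println(c.column + "@" + c.timestamp);
        }
    }
}
```

Because the filter sees the whole row at once and makes two passes, it does
not rely on col-a sorting before col-b, which avoids naming columns to force a
particular order.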

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.