Posted to user@hbase.apache.org by Dave Latham <la...@davelink.net> on 2009/02/05 01:09:31 UTC

Row Filters in TableInputFormatBase

In order to speed up a map reduce job operating on HBase input data, we
recently added a RowFilter to the input format.  However, when trying to
execute it, map tasks (one per region) that used to take 1-2 minutes began
timing out after 10 minutes.  So I dug into TableInputFormatBase to see how
it handles a row filter, and it appears to take our filter and combine it
with a StopRowFilter in order to scan the proper split, since there is no
getScanner method that can accept both a stop row and a row filter.  Digging
further into the scanning / filtering, it looks like it continues scanning
until filterAllRemaining returns true.  However,
StopRowFilter.filterAllRemaining() always returns false.  So if my
understanding is correct, every split in this task will end up scanning to
the end of the table and testing every row with the filter instead of simply
stopping at the end of its given split.  That would explain why my map
tasks began taking longer (instead of shorter).
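
To make the suspected behavior concrete, here is a minimal sketch of it in
plain Java.  It does not use the real HBase classes: SimpleRowFilter,
NaiveStopRowFilter and ScanModel are hypothetical stand-ins that only model
the contract as I understand it (a scan loop that stops only when
filterAllRemaining() returns true, and a stop-row filter that never says so).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a row filter; NOT the HBase interface.
interface SimpleRowFilter {
    boolean filterRowKey(String rowKey); // true means "exclude this row"
    boolean filterAllRemaining();        // true means "the scan may stop now"
}

// Models a stop-row filter that excludes rows at or past the stop row
// but never tells the scanner that it can stop.
class NaiveStopRowFilter implements SimpleRowFilter {
    private final String stopRow;

    NaiveStopRowFilter(String stopRow) {
        this.stopRow = stopRow;
    }

    public boolean filterRowKey(String rowKey) {
        return rowKey.compareTo(stopRow) >= 0;
    }

    public boolean filterAllRemaining() {
        return false; // assumption taken from the description above
    }
}

class ScanModel {
    // Builds a sorted toy table with keys row00, row01, ...
    static List<String> table(int n) {
        List<String> rows = new ArrayList<String>();
        for (int i = 0; i < n; i++) {
            rows.add(String.format("row%02d", i));
        }
        return rows;
    }

    // Counts how many rows the modeled scanner examines from startRow onward.
    static int rowsExamined(List<String> sortedRows, String startRow,
                            SimpleRowFilter filter) {
        int examined = 0;
        for (String row : sortedRows) {
            if (row.compareTo(startRow) < 0) {
                continue; // before the start of the split
            }
            if (filter.filterAllRemaining()) {
                break; // the only way this modeled scan ends early
            }
            examined++;
            filter.filterRowKey(row); // row is tested even if it is excluded
        }
        return examined;
    }

    public static void main(String[] args) {
        // The split covers [row02, row04), yet the scan runs to the table end.
        System.out.println(
            rowsExamined(table(10), "row02", new NaiveStopRowFilter("row04"))); // prints 8
    }
}
```

In this model a split of two rows examines eight, which matches the kind of
slowdown described above if the real scanner behaves the same way.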

1. Is my understanding correct?  (aka is this a bug?  If so, I don't see an
existing JIRA issue for it -- I can open one if no one else does.)
2. If so, should StopRowFilter.filterAllRemaining() return true once the stop
row has been reached?  Or should TableInputFormatBase wrap it in a
WhileMatchRowFilter for the same effect?
3. Is there a reason why HTable does not support requesting a scanner with
both an end row and a row filter - forcing all clients to add these extra
filters?

Thanks!
Dave

Hadoop / HBase 0.19.0

Re: Row Filters in TableInputFormatBase

Posted by Dave Latham <la...@davelink.net>.
I've opened HBASE-1190 for it.  Looking through the other code, it seems
the pattern is to wrap a StopRowFilter in a WhileMatchRowFilter so that
filterAllRemaining returns true once it hits the stop row, so I've submitted
a patch to do that.  It does seem, however, like the StopRowFilter should
know to return true from filterAllRemaining itself once the stop row is
reached, and not require a WhileMatchRowFilter.
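
As a rough illustration of the wrapping pattern, here is a self-contained
sketch in plain Java.  KeyFilter, StopAtRow, WhileMatch and SplitScan are
hypothetical names that only model the idea; they are not the HBase classes
and not the code in the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified filter contract; NOT the HBase interface.
interface KeyFilter {
    boolean filterRowKey(String rowKey); // true means "exclude this row"
    boolean filterAllRemaining();        // true means "the scan may stop now"
}

// Models a stop-row filter that never signals the end of the scan.
class StopAtRow implements KeyFilter {
    private final String stopRow;

    StopAtRow(String stopRow) {
        this.stopRow = stopRow;
    }

    public boolean filterRowKey(String rowKey) {
        return rowKey.compareTo(stopRow) >= 0;
    }

    public boolean filterAllRemaining() {
        return false;
    }
}

// Models the while-match idea: once the wrapped filter excludes a row,
// report that all remaining rows are filtered so the scan can end.
class WhileMatch implements KeyFilter {
    private final KeyFilter inner;
    private boolean done = false;

    WhileMatch(KeyFilter inner) {
        this.inner = inner;
    }

    public boolean filterRowKey(String rowKey) {
        if (done) {
            return true;
        }
        done = inner.filterRowKey(rowKey);
        return done;
    }

    public boolean filterAllRemaining() {
        return done || inner.filterAllRemaining();
    }
}

class SplitScan {
    // Builds a sorted toy table with keys row00, row01, ...
    static List<String> table(int n) {
        List<String> rows = new ArrayList<String>();
        for (int i = 0; i < n; i++) {
            rows.add(String.format("row%02d", i));
        }
        return rows;
    }

    // Counts rows examined by a modeled scan that stops only when the
    // filter reports filterAllRemaining().
    static int rowsExamined(List<String> sortedRows, String startRow,
                            KeyFilter filter) {
        int examined = 0;
        for (String row : sortedRows) {
            if (row.compareTo(startRow) < 0) {
                continue;
            }
            if (filter.filterAllRemaining()) {
                break;
            }
            examined++;
            filter.filterRowKey(row);
        }
        return examined;
    }

    public static void main(String[] args) {
        List<String> rows = table(10);
        // Bare stop filter: scans from row02 to the end of the table.
        System.out.println(rowsExamined(rows, "row02", new StopAtRow("row04"))); // prints 8
        // Wrapped: examines row02, row03 and the stop row itself, then ends.
        System.out.println(
            rowsExamined(rows, "row02", new WhileMatch(new StopAtRow("row04")))); // prints 3
    }
}
```

The wrapper keeps the stop-row filter unchanged and adds the "we are done"
signal on top, which is the trade-off raised above: it works, but every
caller has to remember to add it.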

Dave

On Sat, Feb 7, 2009 at 1:21 PM, stack <st...@duboce.net> wrote:

> On Wed, Feb 4, 2009 at 4:09 PM, Dave Latham <la...@davelink.net> wrote:
>
> > In order to speed up a map reduce job operating on HBase input data, we
> > recently added a RowFilter to the input format.  However, when trying to
> > execute it, map tasks (one per region) that used to take 1-2 minutes
> > began timing out after 10 minutes.  So I dug into TableInputFormatBase
> > to see how it handles a row filter, and it appears to take our filter and
> > combine it with a StopRowFilter in order to scan the proper split, since
> > there is no getScanner method that can accept both a stop row and a row
> > filter.  Digging further into the scanning / filtering, it looks like it
> > continues scanning until filterAllRemaining returns true.  However,
> > StopRowFilter.filterAllRemaining() always returns false.  So if my
> > understanding is correct, every split in this task will end up scanning
> > to the end of the table and testing every row with the filter instead of
> > simply stopping at the end of its given split.  That would explain why my
> > map tasks began taking longer (instead of shorter).
> >
> > 1. Is my understanding correct?  (aka is this a bug?  If so, I don't see
> > an existing JIRA issue for it -- I can open one if no one else does.)
>
> Sounds like a bug (and an explanation for long-running jobs) but, IIUC, the
> stop row filter is supposed to have a 'stop row' embedded and once the
> filter passes it, then we stop filtering?  If that's not going on, let's
> fix it.
>
> St.Ack
> P.S. Thanks for digging in.

Re: Row Filters in TableInputFormatBase

Posted by stack <st...@duboce.net>.
On Wed, Feb 4, 2009 at 4:09 PM, Dave Latham <la...@davelink.net> wrote:

> In order to speed up a map reduce job operating on HBase input data, we
> recently added a RowFilter to the input format.  However, when trying to
> execute it, map tasks (one per region) that used to take 1-2 minutes began
> timing out after 10 minutes.  So I dug into TableInputFormatBase to see
> how it handles a row filter, and it appears to take our filter and combine
> it with a StopRowFilter in order to scan the proper split, since there is
> no getScanner method that can accept both a stop row and a row filter.
> Digging further into the scanning / filtering, it looks like it continues
> scanning until filterAllRemaining returns true.  However,
> StopRowFilter.filterAllRemaining() always returns false.  So if my
> understanding is correct, every split in this task will end up scanning to
> the end of the table and testing every row with the filter instead of
> simply stopping at the end of its given split.  That would explain why my
> map tasks began taking longer (instead of shorter).
>
> 1. Is my understanding correct?  (aka is this a bug?  If so, I don't see an
> existing JIRA issue for it -- I can open one if no one else does.)


Sounds like a bug (and an explanation for long-running jobs) but, IIUC, the
stop row filter is supposed to have a 'stop row' embedded and once the filter
passes it, then we stop filtering?  If that's not going on, let's fix it.

St.Ack
P.S. Thanks for digging in.