Posted to user@hbase.apache.org by Kevin M <ke...@gmail.com> on 2012/04/18 15:50:02 UTC

Applying filters to ResultScanner

Hello,

I am running HBase 0.92.0, and I am wondering if there is a way to scan a
table, cache the ResultScanner, and then continuously filter the
ResultScanner. The use case is to represent the functionality of a faceted
search. The client would select an attribute/facet, a scan would be done
and a ResultScanner would be returned. If the user applied another facet
(forming a breadcrumb trail), then a filter would need to be applied to the
ResultScanner and only those rows that passed the filter would remain.

I am having trouble thinking about how to do this because I read that
the ResultScanner instance needs to be released as quickly as possible
because of the amount of remote resources it holds on the server-side, and
I don't see a filter mechanism that provides this type of functionality on
the ResultScanner object. I went through the mailing list but I was unable
to find a post that resembled this idea.

Thanks.

Re: Applying filters to ResultScanner

Posted by Alok Kumar <al...@gmail.com>.
Thanks for pointing out setCacheBlocks().
Its HBase default value should provide better performance for subsequent
filters as well as for Kevin's multiple-facet search.

-Alok

On Fri, Apr 20, 2012 at 7:02 AM, Kevin M <ke...@gmail.com> wrote:

> Thanks for pointing me towards setCacheBlocks() and explaining the
> difference between those two types of caching in HBase.
>
> According to the API documentation, setCacheBlocks defaults to true, so it
> looks like HBase will take care of what I am looking for automatically.
> Thanks so much for your answer, Alex.
>
> -Kevin
>
> On Thu, Apr 19, 2012 at 9:03 PM, Alex Baranau <alex.baranov.v@gmail.com>
> wrote:
>
> > Regarding caching during scans there are two types of caches:
> > * caching (buffering) the records before returning them to the client,
> > enabled via scan.setCaching(numRows)
> > * block cache on a regionserver, enabled via setCacheBlocks(true)
> >
> > The latter one (block cache) is what you are looking for.
> >
> > Note: setCacheBlocks(true) will not override your column family settings,
> > so do not disable block caching at that level.
> >
> > Alex Baranau
> > ------
> > Sematext :: http://blog.sematext.com/ :: Solr - Lucene - Hadoop - HBase
> >
> > On Thu, Apr 19, 2012 at 12:52 PM, Kevin M <ke...@gmail.com>
> > wrote:
> >
> > > Thanks for the reply.
> > >
> > > I see. Would HBase cache the results of the first scan so it wouldn't
> > > take as long to collect the results? Say there were 5 facets selected
> > > one after another. A new scan would take place with stricter filtering
> > > each time on the whole table, rather than using the results of the
> > > previous scan?
> > >
> > > Thank you.
> > >
> > >
> > >
> > >
> >
>

Re: Applying filters to ResultScanner

Posted by Kevin M <ke...@gmail.com>.
Thanks for pointing me towards setCacheBlocks() and explaining the
difference between those two types of caching in HBase.

According to the API documentation, setCacheBlocks defaults to true, so it
looks like HBase will take care of what I am looking for automatically.
Thanks so much for your answer, Alex.

-Kevin

On Thu, Apr 19, 2012 at 9:03 PM, Alex Baranau <al...@gmail.com> wrote:

> Regarding caching during scans there are two types of caches:
> * caching (buffering) the records before returning them to the client,
> enabled via scan.setCaching(numRows)
> * block cache on a regionserver, enabled via setCacheBlocks(true)
>
> The latter one (block cache) is what you are looking for.
>
> Note: setCacheBlocks(true) will not override your column family settings, so
> do not disable block caching at that level.
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Solr - Lucene - Hadoop - HBase
>
> On Thu, Apr 19, 2012 at 12:52 PM, Kevin M <ke...@gmail.com>
> wrote:
>
> > Thanks for the reply.
> >
> > I see. Would HBase cache the results of the first scan so it wouldn't
> > take as long to collect the results? Say there were 5 facets selected
> > one after another. A new scan would take place with stricter filtering
> > each time on the whole table, rather than using the results of the
> > previous scan?
> >
> > Thank you.
> >
> >
> >
> >
>

Re: Applying filters to ResultScanner

Posted by Alex Baranau <al...@gmail.com>.
Regarding caching during scans there are two types of caches:
* caching (buffering) the records before returning them to the client,
enabled via scan.setCaching(numRows)
* block cache on a regionserver, enabled via setCacheBlocks(true)

The latter one (block cache) is what you are looking for.

Note: setCacheBlocks(true) will not override your column family settings, so
do not disable block caching at that level.
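
For illustration, a minimal sketch of how the two settings look on the client
side (the table name "mytable" here is just a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class CachingScanSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable htable = new HTable(conf, "mytable");  // placeholder table name
    try {
      Scan scan = new Scan();
      scan.setCaching(500);       // client-side buffering: rows fetched per RPC
      scan.setCacheBlocks(true);  // server-side block cache (true is the default)
      ResultScanner rs = htable.getScanner(scan);
      try {
        for (Result r : rs) {
          // process result...
        }
      } finally {
        rs.close();
      }
    } finally {
      htable.close();
    }
  }
}

Block caching can also be enabled or disabled per column family via
HColumnDescriptor.setBlockCacheEnabled(), which is the level the note above
refers to.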

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Solr - Lucene - Hadoop - HBase

On Thu, Apr 19, 2012 at 12:52 PM, Kevin M <ke...@gmail.com> wrote:

> Thanks for the reply.
>
> I see. Would HBase cache the results of the first scan so it wouldn't take
> as long to collect the results? Say there were 5 facets selected one after
> another. A new scan would take place with stricter filtering each time on
> the whole table, rather than using the results of the previous scan?
>
> Thank you.
>
>
>
>

Re: Applying filters to ResultScanner

Posted by Kevin M <ke...@gmail.com>.
Thanks for the reply.

I see. Would HBase cache the results of the first scan so it wouldn't take as
long to collect the results? Say there were 5 facets selected one after another.
A new scan would take place with stricter filtering each time on the whole
table, rather than using the results of the previous scan?

Thank you.




Re: Applying filters to ResultScanner

Posted by Alok Kumar <al...@gmail.com>.
Hi,

I think you need to recreate a Filter, attach it to a Scan, and make a new
call to HBase in order to get a new set of results or ResultScanner.
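
Something along these lines might work: each facet the user has selected so
far becomes one more condition in a FilterList, and the whole Scan is simply
re-issued against the table. This is only a sketch; the column family "attrs"
and the facet-name-to-qualifier mapping are placeholders, not your actual
schema.

import java.util.Map;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FacetScanSketch {

  // Build a fresh Scan whose filter ANDs one condition per selected facet.
  public static Scan buildFacetScan(Map<String, String> selectedFacets) {
    FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
    for (Map.Entry<String, String> facet : selectedFacets.entrySet()) {
      SingleColumnValueFilter f = new SingleColumnValueFilter(
          Bytes.toBytes("attrs"),            // column family (placeholder)
          Bytes.toBytes(facet.getKey()),     // facet attribute as qualifier
          CompareOp.EQUAL,
          Bytes.toBytes(facet.getValue()));  // selected facet value
      f.setFilterIfMissing(true);            // drop rows without that column
      filters.addFilter(f);
    }
    Scan scan = new Scan();
    scan.setFilter(filters);
    return scan;
  }
}

Each time the user adds a facet to the breadcrumb trail, you rebuild the Scan
with the longer filter list and open a new ResultScanner from it.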

You are right, the ResultScanner object needs to be released quickly when you
are done with it at the middle tier. Below is the text from the HBase book...
10.8.4. Close ResultScanners

This isn't so much about improving performance but rather avoiding
performance problems. If you forget to close ResultScanners
(http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/ResultScanner.html)
you can cause problems on the RegionServers. Always have ResultScanner
processing enclosed in try/catch blocks...

Scan scan = new Scan();
// set attrs...
ResultScanner rs = htable.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // process result...
  }
} finally {
  rs.close();  // always close the ResultScanner!
}
htable.close();


Thanks,


On Wed, Apr 18, 2012 at 7:20 PM, Kevin M <ke...@gmail.com> wrote:

> Hello,
>
> I am running HBase 0.92.0, and I am wondering if there is a way to scan a
> table, cache the ResultScanner, and then continuously filter the
> ResultScanner. The use case is to represent the functionality of a faceted
> search. The client would select an attribute/facet, a scan would be done
> and a ResultScanner would be returned. If the user applied another facet
> (forming a breadcrumb trail), then a filter would need to be applied to the
> ResultScanner and only those rows that passed the filter would remain.
>
> I am having trouble thinking about how to do this because I read that
> the ResultScanner instance needs to be released as quickly as possible
> because of the amount of remote resources it holds on the server-side, and
> I don't see a filter mechanism that provides this type of functionality on
> the ResultScanner object. I went through the mailing list but I was unable
> to find a post that resembled this idea.
>
> Thanks.
>



-- 
Alok Kumar