Posted to user@hbase.apache.org by David Koch <og...@googlemail.com> on 2012/08/17 14:58:03 UTC

Eliminating rows with many KVs using a custom filter.

Hello,

I implemented and deployed a custom HBase filter. All it does is omit rows
which contain more than <max> KeyValue pairs. The central part is the
implementation of Filter's filterKeyValue():

// "excludeRow" and "numKVs" are reset in reset() method.
@Override
public ReturnCode filterKeyValue(KeyValue kv) {
if (++numKVs > maxKVs) {
excludeRow = true;
return ReturnCode.NEXT_ROW;
}
return ReturnCode.INCLUDE;
}
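
For context, here is a minimal sketch of how such a filter class can be wired
up end to end. Everything outside filterKeyValue() - the class name, the
reset()/filterRow() plumbing and the Writable serialization - is illustrative
only, not the exact production code:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.FilterBase;

public class MaxKVsFilter extends FilterBase {

    private int maxKVs;
    private int numKVs = 0;
    private boolean excludeRow = false;

    // No-arg constructor is required for deserialization on the region server.
    public MaxKVsFilter() {
    }

    public MaxKVsFilter(int maxKVs) {
        this.maxKVs = maxKVs;
    }

    @Override
    public void reset() {
        // Called at the start of each new row: clear the per-row state.
        numKVs = 0;
        excludeRow = false;
    }

    @Override
    public ReturnCode filterKeyValue(KeyValue kv) {
        if (++numKVs > maxKVs) {
            excludeRow = true;
            return ReturnCode.NEXT_ROW;
        }
        return ReturnCode.INCLUDE;
    }

    @Override
    public boolean filterRow() {
        // Drop the row entirely once the threshold was exceeded.
        return excludeRow;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(maxKVs);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        maxKVs = in.readInt();
    }
}

filterRow() is what actually drops the partial row once the threshold is
exceeded; without it, the first maxKVs KeyValues of an oversized row would
still be returned.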

I was wondering if, from a performance point of view, it would be faster to
instead override filterRow(List<KeyValue> kvs) and do something like:

@Override
public void filterRow(List<KeyValue> kvs) {
    if (kvs.size() > maxKVs) {
        excludeRow = true;
    }
}

The disadvantage I see with this method is that the entire list of KVs for
each row would have to be loaded first just to establish whether or not to
drop the row. This is potentially enough to bring down our cluster - see
below. My implementation, on the other hand, has the overhead of being called
once per KeyValue.

I use this filter to eliminate abnormally large rows from the scan - rows
contain about 10 KeyValues on average with low variance, but a few outlier
rows contain 1 million+ KeyValue pairs. A simple scan/get of those large rows
brings down our region servers (using batch is not an option). Hence the need
to eliminate these rows as efficiently as possible from the processing
pipeline.
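
For reference, attaching the filter to the scan looks roughly like this - a
minimal sketch, where the table name, caching value and threshold are
placeholders and MaxKVsFilter is the illustrative class name from the sketch
above:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class FilteredScanExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "my_table");   // "my_table" is a placeholder
        Scan scan = new Scan();
        scan.setCaching(100);                          // keep each RPC response bounded
        scan.setFilter(new MaxKVsFilter(100));         // drop rows with more than 100 KVs
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result result : scanner) {
                // process the (now size-bounded) rows here
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}

(The filter class of course also has to be deployed on the region servers'
classpath.)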

Thank you,

/David


PS: My options to compare both filter variants on big data are limited
since we have only one HBase cluster - the production one ;-)

RE: Eliminating rows with many KVs using a custom filter.

Posted by "Ramkrishna.S.Vasudevan" <ra...@huawei.com>.
Hi David

The first approach should be better.  If you know which columns you will
always be retrieving, you can also use scan.addColumn(), which is much
better.  Maybe you have already tried this.
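
Something like the following, where the family and qualifier names are just
placeholders:

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

Scan scan = new Scan();
// Only KeyValues from the named column are returned for each row.
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("qualifier"));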

Regards
Ram
