Posted to user@hbase.apache.org by "Vincent Poon (vinpoon)" <vi...@cisco.com> on 2009/04/09 23:45:19 UTC

Scan across multiple columns

Say I want to scan down a table that looks like this:
 
            Col A      Col B
row1        x          x
row2                   x
row3        x          x
 
Normally a scanner would return all three rows, but what's the best way
to scan so that only row1 and row3 are returned?  i.e. only the rows
with data in both columns.
 
Thanks,
Vincent

Re: Scan across multiple columns

Posted by Lars George <la...@worldlingo.com>.
Vincent,

As Ryan says, you apparently need a working index now. Why not use 
Lucene to query for the row keys you need and then do random reads 
on the HBase table? Given the small record size you report, the index 
should be manageable.

Lars

Ryan Rawson wrote:
> Unless the row is read from disk, how can one know it's not the one you want?
> This is true for any db system; relational dbs just hide the extra reads
> better.
>
> HBase doesn't provide any query language, so the full cost is realized and
> apparent. Server-side filters can help reduce network IO, but ultimately
> you'll need to build secondary indexes if this becomes a primary use case
> with high volume. If it's analysis, typically people just throw a MapReduce
> job at it and call it a day.
>
> Good luck!
>
> On Apr 11, 2009 9:34 AM, "Lars George" <la...@worldlingo.com> wrote:
>
> Hi Vincent,
>
> What I did was also have a custom getSplits() implementation in the
> TableInputFormat. When the splits are determined, I mask out the regions
> that have no key of interest. Since the start and end keys form a total
> order, I can safely assume that if I only need to scan the last few thousand
> entries I can skip the regions before them. Of course, if you have a
> completely random key, or the rows are spread across every region, then this
> is futile.
>
> Lars
>
> Vincent Poon (vinpoon) wrote: > > Thanks for the reply.  I have been using
> ColumnValueFilter, but ...
>
>   
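
The Lucene-plus-random-reads pattern Lars suggests might look roughly like
the sketch below. It is only an illustration: the index path, the stored
"rowkey" field, and the HBase table name "demo" are made up, and the code
targets a reasonably recent Lucene (5.x+) and HBase (2.x) client API rather
than what was current in 2009.

    import java.io.IOException;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class LuceneThenHBase {
      // 1) ask the Lucene index for the matching row keys, 2) turn each key
      // into an HBase Get, 3) fetch them as batched point reads by row key.
      static Result[] lookup(Connection conn, String field, String value)
          throws IOException {
        List<String> rowKeys = new ArrayList<>();
        try (DirectoryReader reader =
                 DirectoryReader.open(FSDirectory.open(Paths.get("/data/index")))) {
          IndexSearcher searcher = new IndexSearcher(reader);
          TopDocs hits = searcher.search(new TermQuery(new Term(field, value)), 10000);
          for (ScoreDoc sd : hits.scoreDocs) {
            Document doc = searcher.doc(sd.doc);
            rowKeys.add(doc.get("rowkey"));  // each indexed doc stores its HBase row key
          }
        }

        try (Table table = conn.getTable(TableName.valueOf("demo"))) {
          List<Get> gets = new ArrayList<>(rowKeys.size());
          for (String key : rowKeys) {
            gets.add(new Get(Bytes.toBytes(key)));
          }
          return table.get(gets);  // batched point reads instead of a full scan
        }
      }
    }

The index only has to hold the row keys plus the queried fields, which is why
the small record size matters; the bulk of the data stays in HBase.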

Re: Scan across multiple columns

Posted by Ryan Rawson <ry...@gmail.com>.
Unless the row is read from disk, how can one know it's not the one you want?
This is true for any db system; relational dbs just hide the extra reads
better.

HBase doesn't provide any query language, so the full cost is realized and
apparent. Server-side filters can help reduce network IO, but ultimately
you'll need to build secondary indexes if this becomes a primary use case
with high volume. If it's analysis, typically people just throw a MapReduce
job at it and call it a day.

Good luck!

On Apr 11, 2009 9:34 AM, "Lars George" <la...@worldlingo.com> wrote:

Hi Vincent,

What I did was also have a custom getSplits() implementation in the
TableInputFormat. When the splits are determined, I mask out the regions
that have no key of interest. Since the start and end keys form a total
order, I can safely assume that if I only need to scan the last few thousand
entries I can skip the regions before them. Of course, if you have a
completely random key, or the rows are spread across every region, then this
is futile.

Lars

Vincent Poon (vinpoon) wrote: > > Thanks for the reply.  I have been using
ColumnValueFilter, but ...
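
The "throw a MapReduce job at it" option could be a map-only job along the
lines below, written against the current org.apache.hadoop.hbase.mapreduce
API rather than the 0.19-era one in this thread; the table name "demo",
family "fam", and qualifiers "a"/"b" are placeholders for Vincent's columns.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BothColumnsJob {

      // The mapper sees one Result per row; emit the row key only when both
      // columns are present, so the job's output is the row1/row3 list.
      static class BothColumnsMapper extends TableMapper<Text, NullWritable> {
        private static final byte[] FAM = Bytes.toBytes("fam");
        private static final byte[] COL_A = Bytes.toBytes("a");
        private static final byte[] COL_B = Bytes.toBytes("b");

        @Override
        protected void map(ImmutableBytesWritable key, Result row, Context context)
            throws IOException, InterruptedException {
          if (row.containsColumn(FAM, COL_A) && row.containsColumn(FAM, COL_B)) {
            context.write(new Text(Bytes.toString(key.copyBytes())), NullWritable.get());
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "rows-with-both-columns");
        job.setJarByClass(BothColumnsJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // bigger batches per RPC for a full-table scan
        scan.setCacheBlocks(false);  // don't churn the region servers' block cache

        TableMapReduceUtil.initTableMapperJob(
            "demo", scan, BothColumnsMapper.class, Text.class, NullWritable.class, job);
        job.setNumReduceTasks(0);    // map-only: just list the matching row keys
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }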

Re: Scan across multiple columns

Posted by Lars George <la...@worldlingo.com>.
Hi Vincent,

What I did was also have a custom getSplits() implementation in the 
TableInputFormat. When the splits are determined, I mask out the 
regions that have no key of interest. Since the start and end keys form 
a total order, I can safely assume that if I only need to scan the last 
few thousand entries I can skip the regions before them. Of course, if 
you have a completely random key, or the rows are spread across every 
region, then this is futile.

Lars

Vincent Poon (vinpoon) wrote:
> Thanks for the reply.  I have been using ColumnValueFilter, but was
> wondering if there was a faster solution, as it seems ColumnValueFilter
> must apply the filter across the entire row range (in my case I need to scan
> the entire table, with millions of rows).  I also tried using indirect
> queries - scanning down Col A and then using the row IDs to get the cell
> under Col B.  This works until the number of values under Col A gets very
> large.
>
> Vincent 
>
> -----Original Message-----
> From: Ryan Rawson [mailto:ryanobjc@gmail.com] 
> Sent: Thursday, April 09, 2009 6:34 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: Scan across multiple columns
>
> Check out the org.apache.hadoop.hbase.filter package.  The
> ColumnValueFilter might be of help specifically.
>
> The other solution is to do it client side.
>
> -ryan
>
> On Thu, Apr 9, 2009 at 2:45 PM, Vincent Poon (vinpoon)
> <vi...@cisco.com>wrote:
>
>   
>> Say I want to scan down a table that looks like this:
>>
>>            Col A      Col B
>> row1        x             x
>> row2                       x
>> row3        x             x
>>
>> Normally a scanner would return all three rows, but what's the best 
>> way to scan so that only row1 and row3 are returned?  i.e. only the 
>> rows with data in both columns.
>>
>> Thanks,
>> Vincent
>>
>>     
>
>   
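
Lars's split-masking code isn't shown in the thread, but a rough sketch of
the idea against the current mapreduce TableInputFormat could look like the
following; the cut-off key below is a placeholder, and real code would derive
it from the keys of interest.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
    import org.apache.hadoop.hbase.mapreduce.TableSplit;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;

    // Keep only the splits (regions) whose key range can contain rows of
    // interest; regions ending before LOWEST_KEY_OF_INTEREST are dropped,
    // so the job never opens scanners on them.
    public class MaskedTableInputFormat extends TableInputFormat {

      private static final byte[] LOWEST_KEY_OF_INTEREST = Bytes.toBytes("row-2009-04");

      @Override
      public List<InputSplit> getSplits(JobContext context) throws IOException {
        List<InputSplit> all = super.getSplits(context);
        List<InputSplit> kept = new ArrayList<>();
        for (InputSplit split : all) {
          TableSplit ts = (TableSplit) split;
          byte[] end = ts.getEndRow();
          // An empty end row marks the last region of the table; always keep it.
          boolean lastRegion = end.length == 0;
          if (lastRegion || Bytes.compareTo(end, LOWEST_KEY_OF_INTEREST) > 0) {
            kept.add(ts);
          }
        }
        return kept;
      }
    }

This only pays off when the interesting keys cluster at one end of the key
space, which is exactly the caveat Lars gives about randomly distributed keys.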

Re: Scan across multiple columns

Posted by Vaibhav Puranik <vp...@gmail.com>.
I tried to solve the same problem a week ago. Here is what I learned:

There are no good indexing solutions. 0.19.1 has indexing in it, but it's
not very helpful if you are using the column name as data.

All the other current solutions involve iterating over rows.

The only good way is to denormalize your schema and store your data
redundantly in multiple tables so that you can get to it via the row key
(even the current indexing creates separate tables for its indexes).

Regards,
Vaibhav

On Fri, Apr 10, 2009 at 10:09 AM, Vincent Poon (vinpoon)
<vi...@cisco.com>wrote:

> Thanks for the reply.  I have been using ColumnValueFilter, but was
> wondering if there was a faster solution, as it seems ColumnValueFilter
> must apply the filter across the entire row range (in my case I need to scan
> the entire table, with millions of rows).  I also tried using indirect
> queries - scanning down Col A and then using the row IDs to get the cell
> under Col B.  This works until the number of values under Col A gets very
> large.
>
> Vincent
>
> -----Original Message-----
> From: Ryan Rawson [mailto:ryanobjc@gmail.com]
> Sent: Thursday, April 09, 2009 6:34 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: Scan across multiple columns
>
> Check out the org.apache.hadoop.hbase.filter package.  The
> ColumnValueFilter might be of help specifically.
>
> The other solution is to do it client side.
>
> -ryan
>
> On Thu, Apr 9, 2009 at 2:45 PM, Vincent Poon (vinpoon)
> <vi...@cisco.com>wrote:
>
> > Say I want to scan down a table that looks like this:
> >
> >            Col A      Col B
> > row1        x             x
> > row2                       x
> > row3        x             x
> >
> > Normally a scanner would return all three rows, but what's the best
> > way to scan so that only row1 and row3 are returned?  i.e. only the
> > rows with data in both columns.
> >
> > Thanks,
> > Vincent
> >
>
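
The denormalization Vaibhav describes amounts to writing the same record into
a second table whose row key is the value you later want to look up on. A
minimal sketch with made-up table names ("demo", "demo_by_value") and key
layout; note the two puts are not atomic, so real code has to cope with a
failed second write.

    import java.io.IOException;

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DenormalizedWrite {
      // Write the record into the main table under its natural row key, and
      // redundantly into an "index" table keyed by the value we want to query
      // on, so that the later lookup is a plain get/scan by row key.
      static void write(Connection conn, String rowKey, String value) throws IOException {
        byte[] fam = Bytes.toBytes("fam");

        try (Table main = conn.getTable(TableName.valueOf("demo"));
             Table index = conn.getTable(TableName.valueOf("demo_by_value"))) {

          Put mainPut = new Put(Bytes.toBytes(rowKey));
          mainPut.addColumn(fam, Bytes.toBytes("a"), Bytes.toBytes(value));
          main.put(mainPut);

          // Index row key: value + original row key, so lookups by value are a
          // prefix scan and the original key is recoverable from the index row.
          Put indexPut = new Put(Bytes.toBytes(value + "|" + rowKey));
          indexPut.addColumn(fam, Bytes.toBytes("ref"), Bytes.toBytes(rowKey));
          index.put(indexPut);
        }
      }
    }

Reads by value then become a row-key prefix scan on demo_by_value, which is
the access path HBase serves cheaply.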

RE: Scan across multiple columns

Posted by "Vincent Poon (vinpoon)" <vi...@cisco.com>.
Thanks for the reply.  I have been using ColumnValueFilter, but was
wondering if there was a faster solution, as it seems ColumnValueFilter
must apply the filter across the entire row range (in my case I need to scan
the entire table, with millions of rows).  I also tried using indirect
queries - scanning down Col A and then using the row IDs to get the cell
under Col B.  This works until the number of values under Col A gets very
large.

Vincent 

-----Original Message-----
From: Ryan Rawson [mailto:ryanobjc@gmail.com] 
Sent: Thursday, April 09, 2009 6:34 PM
To: hbase-user@hadoop.apache.org
Subject: Re: Scan across multiple columns

Check out the org.apache.hadoop.hbase.filter package.  The
ColumnValueFilter might be of help specifically.

The other solution is to do it client side.

-ryan

On Thu, Apr 9, 2009 at 2:45 PM, Vincent Poon (vinpoon)
<vi...@cisco.com>wrote:

> Say I want to scan down a table that looks like this:
>
>            Col A      Col B
> row1        x             x
> row2                       x
> row3        x             x
>
> Normally a scanner would return all three rows, but what's the best 
> way to scan so that only row1 and row3 are returned?  i.e. only the 
> rows with data in both columns.
>
> Thanks,
> Vincent
>
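
Vincent's indirect query - scan Col A, then fetch Col B for those row keys -
might look like the sketch below with today's client API (the 0.19 Scanner
API differed in the details). The table and column names are hypothetical,
and the Gets are issued in fixed-size chunks so the client never holds every
Col A key in memory at once, which is the part that hurts when Col A is very
large.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class IndirectQuery {
      private static final int BATCH = 1000;  // Gets issued per round of multi-Get

      // Scan only Col A (rows without it never come back), and for every batch
      // of row keys do a multi-Get on Col B; rows whose Get comes back
      // non-empty have data in both columns.
      static List<String> rowsWithBothColumns(Connection conn) throws IOException {
        byte[] fam = Bytes.toBytes("fam");
        byte[] colA = Bytes.toBytes("a");
        byte[] colB = Bytes.toBytes("b");

        List<String> matches = new ArrayList<>();
        try (Table table = conn.getTable(TableName.valueOf("demo"))) {
          Scan scan = new Scan().addColumn(fam, colA);
          try (ResultScanner scanner = table.getScanner(scan)) {
            List<Get> batch = new ArrayList<>(BATCH);
            for (Result r : scanner) {
              batch.add(new Get(r.getRow()).addColumn(fam, colB));
              if (batch.size() == BATCH) {
                drain(table, batch, matches);
              }
            }
            drain(table, batch, matches);  // flush the final partial batch
          }
        }
        return matches;
      }

      private static void drain(Table table, List<Get> batch, List<String> matches)
          throws IOException {
        if (batch.isEmpty()) {
          return;
        }
        for (Result r : table.get(batch)) {
          if (!r.isEmpty()) {              // Col B exists for this row too
            matches.add(Bytes.toString(r.getRow()));
          }
        }
        batch.clear();
      }
    }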

Re: Scan across multiple columns

Posted by Ryan Rawson <ry...@gmail.com>.
Check out the org.apache.hadoop.hbase.filter package.  The ColumnValueFilter
might be of help specifically.

The other solution is to do it client side.

-ryan

On Thu, Apr 9, 2009 at 2:45 PM, Vincent Poon (vinpoon) <vi...@cisco.com>wrote:

> Say I want to scan down a table that looks like this:
>
>            Col A      Col B
> row1        x             x
> row2                       x
> row3        x             x
>
> Normally a scanner would return all three rows, but what's the best way
> to scan so that only row1 and row3 are returned?  i.e. only the rows
> with data in both columns.
>
> Thanks,
> Vincent
>
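
For reference, the filter-based approach Ryan points at, expressed against
the current client API, where SingleColumnValueFilter with
setFilterIfMissing(true) plays the role of the old ColumnValueFilter; the
table name "demo", family "fam", and qualifiers "a"/"b" stand in for
Vincent's Col A and Col B.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.CompareOperator;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.BinaryComparator;
    import org.apache.hadoop.hbase.filter.FilterList;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BothColumnsScan {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("demo"))) {

          byte[] fam = Bytes.toBytes("fam");

          // Require Col A to be present: any value compares >= the empty byte
          // array, and rows missing the column entirely are filtered out.
          SingleColumnValueFilter hasA = new SingleColumnValueFilter(
              fam, Bytes.toBytes("a"),
              CompareOperator.GREATER_OR_EQUAL, new BinaryComparator(new byte[0]));
          hasA.setFilterIfMissing(true);

          // Same requirement for Col B.
          SingleColumnValueFilter hasB = new SingleColumnValueFilter(
              fam, Bytes.toBytes("b"),
              CompareOperator.GREATER_OR_EQUAL, new BinaryComparator(new byte[0]));
          hasB.setFilterIfMissing(true);

          // A row is returned only if both filters pass, i.e. row1 and row3.
          Scan scan = new Scan();
          scan.setFilter(new FilterList(FilterList.Operator.MUST_PASS_ALL, hasA, hasB));

          try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result r : scanner) {
              System.out.println(Bytes.toString(r.getRow()));
            }
          }
        }
      }
    }

The filter still evaluates every row on the region servers, so it saves
network IO rather than reads, which is Ryan's point about ultimately needing
a secondary index for high-volume use.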