Posted to user@hbase.apache.org by Wayne <wa...@gmail.com> on 2011/08/10 11:39:20 UTC

Row+Col Range Read/Scan

As we load more and more data into HBase we are seeing the "millions of
columns" to be a challenge for us. We have some very wide rows and we are
taking 12-15 seconds to read those rows. Since HBase does not sort columns
and thereby can not support a scan of columns we are really seeing some
serious limitations to how we can model data in hbase. We always need to
read the entire row thus taking a 15 sec hit.

Is/has there been any talk about building in some support for sorted columns
and the ability to read/scan across columns? Millions of columns are
challenging if you can only read a single column/list of columns or the
entire thing. How does bigtable support this? It seems that hbase is limited
as a column based data store unless it can support this. Our columns are
truly dynamic so we do not even necessarily know what they are to request
them by name in a list. We want to be able to read/scan them just like for
rows.

We would love the ability to support the following read method (through
Thrift). We can of course do this on our own from the entire row but it
requires reading the 2 million col row into memory first.

getRowWithColumnRange(tableName, row, startColumn, stopColumn)

The above would be even better if it could be set up like a scanner where we
could stop at any point. Basically instead of scanning rows we would scan
columns for a given row. This would be the best way to support an offset,
limit pattern.

colScanID = colScannerOpenWithStop(tableName, row, startColumn, stopColumn)
colScannerGetList(colScanID, 1000)
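
To make the requested semantics concrete, here is a minimal sketch of a column range read, modeled on an in-memory sorted map rather than on HBase or Thrift. The class name and the TreeMap stand-in for a row are assumptions for illustration only; just the method name and its start/stop parameters come from the proposal above (tableName is dropped because the model holds a single row).

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory model only: a TreeMap stands in for one wide row whose columns
// are kept in sorted order. Nothing here is HBase or Thrift API.
final class ColumnRangeSketch {

    // Hypothetical analogue of getRowWithColumnRange(tableName, row, startColumn, stopColumn).
    // Returns the columns in [startColumn, stopColumn) as a view, without copying the row.
    static NavigableMap<String, String> getRowWithColumnRange(
            NavigableMap<String, String> row, String startColumn, String stopColumn) {
        return row.subMap(startColumn, true, stopColumn, false);
    }

    public static void main(String[] args) {
        NavigableMap<String, String> row = new TreeMap<>();
        for (int i = 0; i < 10; i++) {
            row.put(String.format("col%03d", i), "v" + i);
        }
        // prints [col003, col004, col005]
        System.out.println(getRowWithColumnRange(row, "col003", "col006").keySet());
    }
}
```

The stop column is treated as exclusive here, mirroring how a row scan with a stop row usually behaves; that choice is also an assumption.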

Of course once these changes occurred people would be pushing the size of
rows even more. We have seen somewhere around 20+ million columns cause OOM
errors. One row per region should be the theoretical limit to the row size,
but there is more work needed I am sure to ensure that this is true.

Thanks.

Re: Row+Col Range Read/Scan

Posted by Wayne <wa...@gmail.com>.

I think you are right in that Thrift is all we see and it is very limited.
Comments in-line.

On Wed, Aug 10, 2011 at 8:33 PM, Stack <st...@duboce.net> wrote:

> On Wed, Aug 10, 2011 at 2:39 AM, Wayne <wa...@gmail.com> wrote:
> > As we load more and more data into HBase we are seeing the "millions of
> > columns" to be a challenge for us. We have some very wide rows and we are
> > taking 12-15 seconds to read those rows.
>
> How many columns when it's taking this long, Wayne?
>

~2 million columns take 15 seconds.

>
>
> > Since HBase does not sort columns
>
> They are sorted.
>

They are not sorted (that we see). Columns come back in the order they were
saved to the row, or by some other logic I am not sure of, but columns are
not sorted like rows. We have always expected/wanted columns to be sorted on
retrieval but they are not. Googling this, it seems consistent with comments
out there.

>
> > and thereby can not support a scan of columns
>
> How do you mean?  You only want a subset of the columns?  Can you add
> a filter or add some subset of the columns to the Scan specification?
>

Yes we only want a subset of columns. Thrift has no filters...that I know
of? We can ask for a specific column or a list of columns, but since we do
not know up front what the columns even are it does not help us.

>
> You can also read a piece of the row only if that is all you are
> interested in (though you are on other side of thrift, right, and this
> facility may not be exposed -- I have not checked)
>

Not exposed...that I know of in Thrift...this would be great to be able to
do. We would love to "chunk" the row back and thereby start getting data
faster. Waiting 15 sec for anything is a real problem for us.

>
> > Is/has there been any talk about building in some support for sorted
> > columns and the ability to read/scan across columns? Millions of columns
> > are challenging if you can only read a single column/list of columns or
> > the entire thing.
>
> When you say read/scan across columns, can you say more what you'd
> like?  You'd like to read N columns at a time?
>

We would like most of all to read N columns. It is the Offset, Limit
problem. Give me the first 100, give me the first 100 starting at 100 etc.
That is what we are trying to support. We can easily get something like this
to work by scanning the columns and stopping once we get enough or even
chunking back parts of the row and stopping once we have enough. Right now
we read 2 million values in 15 seconds and return 100 of them to the end
user. We prefer to read 100 from hbase and return 100 to the user in 50 ms.
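
As a rough illustration of the paging pattern described here, the sketch below pages through one wide row by resuming after the last column already returned, so each page reads only `limit` columns. It is an in-memory model that assumes the columns are held in sorted order; the class and method names are hypothetical and are not HBase or Thrift calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory model only: pages through the sorted columns of a single row.
final class ColumnPagingSketch {

    // Returns up to `limit` column names strictly after `lastSeen`.
    // Pass null as `lastSeen` to start at the beginning of the row.
    static List<String> nextPage(NavigableMap<String, String> row, String lastSeen, int limit) {
        NavigableMap<String, String> rest =
                (lastSeen == null) ? row : row.tailMap(lastSeen, false);
        List<String> page = new ArrayList<>();
        for (String column : rest.keySet()) {
            if (page.size() == limit) {
                break; // stop as soon as the page is full
            }
            page.add(column);
        }
        return page;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> row = new TreeMap<>();
        for (int i = 0; i < 250; i++) {
            row.put(String.format("c%04d", i), "v" + i);
        }
        String last = null;
        while (true) {
            List<String> page = nextPage(row, last, 100);
            if (page.isEmpty()) {
                break;
            }
            // prints pages of 100, 100 and 50 columns
            System.out.println(page.size() + " columns starting at " + page.get(0));
            last = page.get(page.size() - 1);
        }
    }
}
```

Resuming from the last column seen, rather than from a numeric offset, means a later page does not have to walk past all the earlier columns first.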

>
> > How does bigtable support this? It seems that hbase is limited
> > as a column based data store unless it can support this. Our columns are
> > truly dynamic so we do not even necessarily know what they are to request
> > them by name in a list. We want to be able to read/scan them just like
> > for rows.
> >
>
> In java you'd do
>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setBatch(int)
>
>
> > We would love the ability to support the following read method (through
> > Thrift). We can of course do this on our own from the entire row but it
> > requires reading the 2 million col row into memory first.
> >
>
> How big are the cells?  How big is the 2M row?  You don't know the
> names but do they fit a pattern that you could filter on?  (Though
> again, filters are not exposed in thrift, though that looks like it's
> getting fixed)
>

Not sure about filters, will have to look into them more closely as they are
not exposed in Thrift yet. The cells are small. Simple doubles or
varchar(50) type stuff. Our row keys and column keys are actually much
bigger than the values. Not sure about the row size, but it is pretty big.
The thing is we don't want to read the whole thing back but instead would
prefer to read parts of the row in a way we can iterate through pages to
support again the offset, limit pattern.

>
> > getRowWithColumnRange(tableName, row, startColumn, stopColumn)
> >
> > The above would be even better if it could be set up like a scanner
> > where we could stop at any point. Basically instead of scanning rows we
> > would scan columns for a given row. This would be the best way to support
> > an offset, limit pattern.
> >
> > colScanID = colScannerOpenWithStop(tableName, row, startColumn, stopColumn)
> > colScannerGetList(colScanID, 1000)
> >
> > Of course once these changes occurred people would be pushing the size of
> > rows even more. We have seen somewhere around 20+ million columns cause
> > OOM errors. One row per region should be the theoretical limit to the row
> > size, but there is more work needed I am sure to ensure that this is true.
> >
>
> The above look useful.  Stick them into an issue Wayne.
>

Ok

>
> St.Ack
> P.S. I'm still working (slowly) on the recover tool you asked for in
> your last mail.
>

Thanks. I am hopeful something will be there before we need it!!

Re: Row+Col Range Read/Scan

Posted by Stack <st...@duboce.net>.

On Wed, Aug 10, 2011 at 2:39 AM, Wayne <wa...@gmail.com> wrote:
> As we load more and more data into HBase we are seeing the "millions of
> columns" to be a challenge for us. We have some very wide rows and we are
> taking 12-15 seconds to read those rows.

How many columns when it's taking this long, Wayne?


> Since HBase does not sort columns

They are sorted.

> and thereby can not support a scan of columns

How do you mean?  You only want a subset of the columns?  Can you add
a filter or add some subset of the columns to the Scan specification?

You can also read a piece of the row only if that is all you are
interested in (though you are on other side of thrift, right, and this
facility may not be exposed -- I have not checked)

> Is/has there been any talk about building in some support for sorted columns
> and the ability to read/scan across columns? Millions of columns are
> challenging if you can only read a single column/list of columns or the
> entire thing.

When you say read/scan across columns, can you say more what you'd
like?  You'd like to read N columns at a time?

> How does bigtable support this? It seems that hbase is limited
> as a column based data store unless it can support this. Our columns are
> truly dynamic so we do not even necessarily know what they are to request
> them by name in a list. We want to be able to read/scan them just like for
> rows.
>

In java you'd do
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setBatch(int)
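
For readers who have not used batching, the sketch below is a rough model of the idea: a wide row is handed back in chunks of at most `batch` columns instead of all at once, so the caller can stop early. It is a plain in-memory iterator under the assumption that columns are sorted; the class name is hypothetical and this is not the HBase client code behind Scan.setBatch(int).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableMap;
import java.util.NoSuchElementException;
import java.util.TreeMap;

// In-memory model only: yields the columns of one row in fixed-size chunks.
final class BatchedRowSketch implements Iterator<List<String>> {
    private final Iterator<String> columns;
    private final int batch;

    BatchedRowSketch(NavigableMap<String, String> row, int batch) {
        if (batch <= 0) {
            throw new IllegalArgumentException("batch must be positive");
        }
        this.columns = row.navigableKeySet().iterator();
        this.batch = batch;
    }

    @Override
    public boolean hasNext() {
        return columns.hasNext();
    }

    // Returns the next chunk of at most `batch` column names.
    @Override
    public List<String> next() {
        if (!columns.hasNext()) {
            throw new NoSuchElementException();
        }
        List<String> chunk = new ArrayList<>();
        while (chunk.size() < batch && columns.hasNext()) {
            chunk.add(columns.next());
        }
        return chunk;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> row = new TreeMap<>();
        for (int i = 0; i < 7; i++) {
            row.put("c" + i, "v" + i);
        }
        BatchedRowSketch chunks = new BatchedRowSketch(row, 3);
        while (chunks.hasNext()) {
            System.out.println(chunks.next()); // chunks of 3, 3 and 1 columns
        }
    }
}
```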


> We would love the ability to support the following read method (through
> Thrift). We can of course do this on our own from the entire row but it
> requires reading the 2 million col row into memory first.
>

How big are the cells?  How big is the 2M row?  You don't know the
names but do they fit a pattern that you could filter on?  (Though
again, filters are not exposed in thrift, though that looks like it's
getting fixed)

> getRowWithColumnRange(tableName, row, startColumn, stopColumn)
>
> The above would be even better if it could be set up like a scanner where we
> could stop at any point. Basically instead of scanning rows we would scan
> columns for a given row. This would be the best way to support an offset,
> limit pattern.
>
> colScanID = colScannerOpenWithStop(tableName, row, startColumn, stopColumn)
> colScannerGetList(colScanID, 1000)
>
> Of course once these changes occurred people would be pushing the size of
> rows even more. We have seen somewhere around 20+ million columns cause OOM
> errors. One row per region should be the theoretical limit to the row size,
> but there is more work needed I am sure to ensure that this is true.
>

The above look useful.  Stick them into an issue Wayne.

St.Ack
P.S. I'm still working (slowly) on the recover tool you asked for in
your last mail.