Posted to user@hbase.apache.org by fnord 99 <fn...@googlemail.com> on 2010/11/22 11:01:40 UTC

LazyFetching of Row Results in MapReduce

Hi all,

I recently filled an HBase table with many millions of columns in each row
(!). The problem is that I now always get a heap space error from the JVM,
followed by a shutdown of every regionserver in which the error occurs.
Since the error isn't thrown in any of my own classes, I think the problem
is the following:

* a row is always read completely into memory upon access (at least all
column families that I'm interested in)
* the Result object holds the complete family-qualifier-value pairs in a
KeyValue[]
* this is sometimes too much for the memory each map task gets, so a heap
space error is thrown (see the sketch below)
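
To illustrate the pattern I mean, here is a minimal sketch using the
standard org.apache.hadoop.hbase.mapreduce.TableMapper API; the table
contents and output types are made up, the mapper just counts qualifiers:

    import java.io.IOException;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    public class WideRowMapper extends TableMapper<Text, LongWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row,
                           Context context)
                throws IOException, InterruptedException {
            // 'row' already holds every KeyValue the scan selected for
            // this row -- for a row with millions of columns this array
            // alone can blow the map task's heap before map() even runs
            for (KeyValue kv : row.raw()) {
                context.write(new Text(kv.getQualifier()),
                              new LongWritable(1));
            }
        }
    }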

My question is now: is there any lazy-fetching technique for the individual
key-values within one row? In my opinion there should be, but I couldn't
find anything in the source code or wiki that hints at that.

Any ideas on how to work around this problem? I had the idea of redesigning
the table schema to store more data in the row key and less data in the
column families, which would make the tables "thinner" and "longer". That
would work in the current setup; however, it wouldn't solve the original
problem...

Thanks already in advance for any input on that,

fnord999

Re: LazyFetching of Row Results in MapReduce

Posted by Friso van Vollenhoven <fv...@xebia.com>.
Hi,

We used to have similar problems. We use a data model with wide rows (up
to hundreds of MBs). As a solution, we just spread the records across a
number of (subsequent) rows by adding a one-byte hash to the row key. This
way, you will have the same effect as lazy loading the KeyValues within a
row by doing scans instead of gets. AFAIK gets are implemented as scans,
so there should be no performance difference there.
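
Roughly like this; this is just a sketch with hypothetical names and a
fixed-length logical key, not our actual code:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedRows {

        static final byte[] FAMILY = Bytes.toBytes("f");

        // append a one-byte hash of the qualifier, so one wide logical
        // row becomes up to 256 narrower physical rows that still sort
        // next to each other
        static byte[] physicalKey(byte[] logicalKey, byte[] qualifier) {
            byte salt = (byte) (Bytes.hashCode(qualifier) & 0xFF);
            return Bytes.add(logicalKey, new byte[] { salt });
        }

        static void write(HTable table, byte[] logicalKey,
                          byte[] qualifier, byte[] value) throws IOException {
            Put put = new Put(physicalKey(logicalKey, qualifier));
            put.add(FAMILY, qualifier, value);
            table.put(put);
        }

        // reading the logical row back is a scan over the (at most) 256
        // adjacent physical rows; each Result only holds one bucket
        static void read(HTable table, byte[] logicalKey) throws IOException {
            byte[] start = Bytes.add(logicalKey, new byte[] { 0x00 });
            byte[] stop = Bytes.add(logicalKey,
                    new byte[] { (byte) 0xFF, 0x00 }); // exclusive
            ResultScanner scanner = table.getScanner(new Scan(start, stop));
            try {
                for (Result bucket : scanner) {
                    // process one bucket at a time instead of the whole row
                }
            } finally {
                scanner.close();
            }
        }
    }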


Friso



On 22 nov 2010, at 18:06, Michael Segel wrote:

> [snip]


RE: LazyFetching of Row Results in MapReduce

Posted by Michael Segel <mi...@hotmail.com>.

Hi,

How much heap space do you set in hbase-env.sh?
How much memory do you have on your box?

You may want to up the heap space for HBase if you can.
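
For example in hbase-env.sh (the number is only an example, size it to
your box):

    # The maximum amount of heap to use, in MB. Default is 1000.
    export HBASE_HEAPSIZE=4000

The heap of the map tasks themselves is separate; that comes from
mapred.child.java.opts in your Hadoop config.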

HTH

-Mike

> Date: Mon, 22 Nov 2010 11:01:40 +0100
> Subject: LazyFetching of Row Results in MapReduce
> From: fnord999@googlemail.com
> To: user@hbase.apache.org
> 
> [snip]

Re: LazyFetching of Row Results in MapReduce

Posted by Lars George <la...@gmail.com>.
Hi fnord,

See https://issues.apache.org/jira/browse/HBASE-1537 and
https://issues.apache.org/jira/browse/HBASE-2673 for details. Not sure
when that went in, though, but you should have it available, no?
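
In short, HBASE-1537 added intra-row scanning: call setBatch() on your
Scan and each Result holds at most that many KeyValues, so a huge row
comes back in chunks. A quick sketch against the 0.90-era client API,
table and family names made up:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchedScan {
        public static void main(String[] args) throws IOException {
            HTable table = new HTable(HBaseConfiguration.create(), "mytable");
            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("f"));
            // at most 10000 KeyValues per Result; a row with millions of
            // columns arrives as a sequence of partial Results instead
            // of one giant KeyValue[]
            scan.setBatch(10000);
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result partial : scanner) {
                    // consecutive Results can belong to the same row;
                    // compare partial.getRow() if row boundaries matter
                }
            } finally {
                scanner.close();
            }
            table.close();
        }
    }

For a MapReduce job you would set the batch on the Scan you pass to
TableMapReduceUtil.initTableMapperJob(); each map() call should then see
only a chunk of the row.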

Lars

On Tue, Nov 23, 2010 at 2:48 PM, fnord 99 <fn...@googlemail.com> wrote:
> [snip]

Re: LazyFetching of Row Results in MapReduce

Posted by fnord 99 <fn...@googlemail.com>.
Hi,

our machines have 24 GB of RAM (for 8 cores) and HBase gets 6 GB. The map
jobs all have 768 MB of memory.

Currently we're using CDH3b3.

We'll definitely implement the idea of distributing each row's columns
across multiple rows, similar to what Friso said.

A comment from somebody who has really wide rows would be interesting,
though.

Thanks,
fnord

2010/11/22 Todd Lipcon <to...@cloudera.com>

> [snip]

Re: LazyFetching of Row Results in MapReduce

Posted by Todd Lipcon <to...@cloudera.com>.
Hi,

Which version are you using?

During the 0.89 development series we got a bunch of new work in trunk
(mostly thanks to Facebook and TrendMicro) for wide rows. Maybe one of the
FB guys can comment, but I believe they have some very wide rows in their
application.

Thanks
-Todd

On Mon, Nov 22, 2010 at 2:01 AM, fnord 99 <fn...@googlemail.com> wrote:

> [snip]



-- 
Todd Lipcon
Software Engineer, Cloudera