Posted to user@hbase.apache.org by Martin Arnandze <ma...@gmail.com> on 2010/08/26 17:07:35 UTC

Out Of Memory on region servers upon bulk import

Hi,
 I'm doing an experiment on an 8-node cluster, each of which has 6GB of RAM allocated to the hbase region server. Basically I'm doing a bulk import that processes large files, but some imports also need to do gets and scans. In the master UI I see that the heap used gets very close to the 6GB limit, though I know hbase is eager for memory and will use as much of the heap as it can. I use block caching. Looking at similar posts I see that tuning the handler count and the memstore upper/lower limits may be key to solving this issue. Nevertheless, I wanted to ask whether there is a way to estimate the extra memory hbase uses that makes it crash, and whether there are other configuration settings I should be looking at to prevent OOME. The job runs correctly for some time, but the region servers eventually crash.

More information about the cluster:

- All nodes have 16GB total memory.
- 7 nodes running a region server (6GB) + datanode (1GB) + task tracker (1GB heap). MapReduce tasks run with at most 756MB each.
- 1 node running the hbase master (2GB heap), namenode (4GB), secondary namenode (4GB) and JobTracker (4GB).
- 3 of the nodes have zookeeper running with 512MB Heap

Many thanks,
   Martin



2010-08-26 07:19:14,859 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening table_import,8ded1642-1c52-444a-bfdc-43521b220714-9223370754627999807UbbWwFDcGatAe8OniLMUXoaVeEdOvSkqiwXfJgUxNlt0aosKXsWevrlra8QDbEvTZelj/jLyux8y\x0AcCBiLeHbqg==,1282792675254
java.lang.OutOfMemoryError: Java heap space
       at org.apache.hadoop.hbase.io.hfile.HFile$BlockIndex.readIndex(HFile.java:1538)
       at org.apache.hadoop.hbase.io.hfile.HFile$Reader.loadFileInfo(HFile.java:806)
       at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:273)
       at org.apache.hadoop.hbase.regionserver.StoreFile.<init>(StoreFile.java:129)
       at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:410)
       at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
       at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1636)
       at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:321)
       at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:1571)
       at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1538)
       at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1458)
       at java.lang.Thread.run(Thread.java:619)

RE: hbase

Posted by "Buttler, David" <bu...@llnl.gov>.
I have found that using hbase to manage zookeeper is more convenient than managing zookeeper myself.  But, then again, I am not running the CDH version of zookeeper.

However, I did modify the scripts to explicitly stop and start zookeeper independently of hbase (since I use zookeeper for solr as well)
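Roughly, that comes down to telling hbase not to manage ZK and driving the quorum with ZK's own scripts (a sketch; the install path and which nodes you run it on are just illustrative):

# conf/hbase-env.sh -- hbase should not start/stop ZK itself
export HBASE_MANAGES_ZK=false

# then on each quorum node, manage ZK with its own scripts
/opt/zookeeper/bin/zkServer.sh start
/opt/zookeeper/bin/zkServer.sh stop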

Dave

-----Original Message-----
From: Witteveen, Tim [mailto:timw@pnl.gov] 
Sent: Thursday, August 26, 2010 10:55 AM
To: user@hbase.apache.org
Subject: RE: hbase

Thanks!  Netstat revealed I was running zookeeper twice.  

I stopped manually starting it, and things are working as expected.  

TimW 

-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com] 
Sent: Thursday, August 26, 2010 9:56 AM
To: user@hbase.apache.org
Subject: Re: hbase

Use netstat to see who is occupying port n

Maybe HQuorumPeer wasn't stopped from previous run ?

On Thu, Aug 26, 2010 at 9:49 AM, Witteveen, Tim <ti...@pnl.gov> wrote:

> I'm going through the overview-summary instructions for setting up and
> running hbase. Right now I'm running hbase in pseudo-distributed mode, and
> looking to go fully-distributed on 25 nodes.
>
> Every time I restart hbase, I get:
> Couldn't start ZK at requested address of "n", instead got: "n+1".
> Aborting. Why? Because clients (eg shell) won't be able to find this ZK
> quorum
>
> If I change the hbase.zookeeper.property.clientPort to the "n+1" from the
> message it starts right up.
>
> Which file do I need to modify to keep this on one port, and what do I need
> to put into it?
>
> Is this something that should be added to the overview-summary page?
>
> Thanks,
> TimW
>

RE: hbase

Posted by "Witteveen, Tim" <ti...@pnl.gov>.
Thanks!  Netstat revealed I was running zookeeper twice.  

I stopped manually starting it, and things are working as expected.  

TimW 

-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com] 
Sent: Thursday, August 26, 2010 9:56 AM
To: user@hbase.apache.org
Subject: Re: hbase

Use netstat to see who is occupying port n

Maybe HQuorumPeer wasn't stopped from previous run ?

On Thu, Aug 26, 2010 at 9:49 AM, Witteveen, Tim <ti...@pnl.gov> wrote:

> I'm going through the overview-summary instructions for setting up and
> running hbase. Right now I'm running hbase in pseudo-distributed mode, and
> looking to go fully-distributed on 25 nodes.
>
> Every time I restart hbase, I get:
> Couldn't start ZK at requested address of "n", instead got: "n+1".
> Aborting. Why? Because clients (eg shell) won't be able to find this ZK
> quorum
>
> If I change the hbase.zookeeper.property.clientPort to the "n+1" from the
> message it starts right up.
>
> Which file do I need to modify to keep this on one port, and what do I need
> to put into it?
>
> Is this something that should be added to the overview-summary page?
>
> Thanks,
> TimW
>

Re: hbase

Posted by Ted Yu <yu...@gmail.com>.
Use netstat to see who is occupying port n
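For example, assuming the default client port 2181 (substitute whatever port you asked for):

netstat -anp | grep 2181      # shows the PID/process bound to the port (Linux)
lsof -i :2181                 # alternative if lsof is available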

Maybe HQuorumPeer wasn't stopped from previous run ?

On Thu, Aug 26, 2010 at 9:49 AM, Witteveen, Tim <ti...@pnl.gov> wrote:

> I'm going through the overview-summary instructions for setting up and
> running hbase. Right now I'm running hbase in pseudo-distributed mode, and
> looking to go fully-distributed on 25 nodes.
>
> Every time I restart hbase, I get:
> Couldn't start ZK at requested address of "n", instead got: "n+1".
> Aborting. Why? Because clients (eg shell) won't be able to find this ZK
> quorum
>
> If I change the hbase.zookeeper.property.clientPort to the "n+1" from the
> message it starts right up.
>
> Which file do I need to modify to keep this on one port, and what do I need
> to put into it?
>
> Is this something that should be added to the overview-summary page?
>
> Thanks,
> TimW
>

hbase

Posted by "Witteveen, Tim" <ti...@pnl.gov>.
I'm going through the overview-summary instructions for setting up and running hbase. Right now I'm running hbase in pseudo-distributed mode, and looking to go fully-distributed on 25 nodes.

Every time I restart hbase, I get:
Couldn't start ZK at requested address of "n", instead got: "n+1". Aborting. Why? Because clients (eg shell) won't be able to find this ZK quorum

If I change the hbase.zookeeper.property.clientPort to the "n+1" from the message it starts right up.  
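(For reference, that property lives in conf/hbase-site.xml; the value below is just the ZK default, not necessarily what you want:)

<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
</property>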

Which file do I need to modify to keep this on one port, and what do I need to put into it? 

Is this something that should be added to the overview-summary page? 

Thanks,
TimW 

Re: Out Of Memory on region servers upon bulk import

Posted by Martin Arnandze <ma...@gmail.com>.
Thanks Todd and Stack for such fast responses,

It's very good to know the expected memory consumption per handler.

Below is my conf. By the way, I'm using version 0.20.5. Once I introduced the memstore limits the OOMEs disappeared, though I'm not sure for how long. I also set setCacheBlocks(false).
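Roughly what that looks like on the scan I hand to the import job (a sketch; the caching value is just an example):

import org.apache.hadoop.hbase.client.Scan;

Scan scan = new Scan();
scan.setCaching(500);         // fetch more rows per RPC during the import
scan.setCacheBlocks(false);   // don't churn the region server block cache from this scan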


<property>
<name>hbase.zookeeper.quorum</name>
<value>zk1,zk2,zk3</value>
</property>

<property>
<name>zookeeper.session.timeout</name>
<value>180000</value>
</property>

<property>
<name>hbase.rootdir</name>
<value>hdfs://hbaseserver:50001/hbase</value>
</property>

<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>

<property>
   <name>hbase.regionserver.handler.count</name>
   <value>25</value>
 </property>

<property>
  <name>hbase.zookeeper.property.maxClientCnxns</name>
  <value>30</value>
</property>

<property>
<name>hbase.regionserver.global.memstore.upperLimit</name>
<value>0.3</value>
</property>

<property>
<name>hbase.regionserver.global.memstore.lowerLimit</name>
<value>0.25</value>
</property>
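With those settings, a back-of-envelope budget for the 6GB heap (assuming the default 0.2 block cache fraction, which roughly matches the ~1196MB max in the log) looks like:

  memstores (upper limit)   0.30 x 6GB  ~ 1.8GB
  block cache               0.20 x 6GB  ~ 1.2GB
  RPC handlers              25 x ~2MB   ~ 50MB
  plus per-storefile block indexes, flush/compaction buffers, etc.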

 

On Aug 27, 2010, at 12:16 AM, Todd Lipcon wrote:

> Hi Martin,
> 
> Can you paste your conf?
> 
> Have you by any chance upped your handler count a lot? Each handler takes up
> an amount of RAM equal to the largest Puts you do. With normal write buffer
> sizes, you're looking at around 2MB per handler, so while it sounds nice to
> bump the handler count up to a really high number, you can get OOMEs like
> you're seeing.
> 
> Thanks
> -Todd
> 
> On Thu, Aug 26, 2010 at 9:31 AM, Martin Arnandze <ma...@gmail.com>wrote:
> 
>> I provide the answers below.
>> Thanks!
>> Martin
>> 
>> On Aug 26, 2010, at 11:45 AM, Stack wrote:
>> 
>>> On Thu, Aug 26, 2010 at 8:07 AM, Martin Arnandze <ma...@gmail.com>
>> wrote:
>>>> Hi,
>>>> I'm doing an experiment on an 8 node cluster, each of which has 6GB of
>> RAM allocated to hbase region server. Basically, doing a bulk import
>> processing large files,
>>> 
>>> 
>>> How large?
>> 
>> about 10 million records each a few Kb.
>> 
>>> 
>>> Unless very large, it should not be OOMEing.
>>> 
>>> but some imports require to do gets and scans as well. In the master
>>> UI I see that the heap used gets very close to the 6GB limit, but I
>>> know hbase is eager for memory and will use the heap as much as
>>> possible.I use block caching. Looking at similar posts I see that
>>> modifying the handler count and memory store upper/ower limits may be
>>> key to solving this issue. Nevertheless I wanted to ask if there is a
>>> way to estimate the extra memory used by hbase that makes it crash and
>>> if there are other configuration settings I should be looking into to
>>> prevent OOME. The job runs correctly for some time but region servers
>>> eventually crash.
>>>> 
>>>> More information about the cluster:
>>>> 
>>>> - All nodes have 16GM total memory.
>>>> - 7 nodes running region server (6GB) +  datanodes (1GB) + task trackers
>> (1GB Heap).  Map reduce jobs running w/ 756MB tops each.
>>> 
>>> Good.  How many MR child tasks can run on each node concurrently?
>> 
>> three mappers and two reducers
>> 
>>> 
>>>> - 1 node running hbase master (2GB Heap allocated), namenode (4GB),
>> Secondary Namenode (4GB), JobTracker (4GB) and Master (2GB).
>>>> - 3 of the nodes have zookeeper running with 512MB Heap
>>>> 
>>>> Many thanks,
>>>>  Martin
>>>> 
>>> 
>>> 
>>> Can we see the lines before the below is thrown?   Also, do a listing
>>> (ls -r) on this region in hdfs and lets see if anything pops out about
>>> files sizes, etc.  You'll need to manually map the below region name
>>> to its encoded name to figure the region but the encoded name should
>>> be earlier in the log.  You'll do something like:
>>> 
>>> bin/hbase fs -lsr /hbase/table_import/REGION_ENCODED_NAME
>> 
>> /usr/lib/hadoop-0.20/bin/hadoop fs -lsr /hbase/table_import/1698505444
>> -rw-r--r--   3 hadoop supergroup       1450 2010-08-25 23:17
>> /hbase/table_import/1698505444/.regioninfo
>> drwxr-xr-x   - hadoop supergroup          0 2010-08-26 11:07
>> /hbase/table_importl/1698505444/fam
>> -rw-r--r--   3 hadoop supergroup    4244491 2010-08-26 11:07
>> /hbase/table_import/1698505444/fam/5785049964186428982
>> -rw-r--r--   3 hadoop supergroup  180147216 2010-08-26 06:09
>> /hbase/table_import/1698505444/fam/705757673046090229
>> 
>> 
>> Previous log:
>> 
>> 010-08-26 07:18:14,691 INFO org.apache.hadoop.hbase.regionserver.HRegion:
>> region
>> table_import,f1cbb42c-b6ae-404d-800c-043da5409441-9223370754623831807WkmpwnRDmveKYzWEfw/tb4GpP9yHDl+/G7OCaZWEgrmGcW+XEF131YDTQwDqZsO93tDicdPcOdRq\x0AU7zDBqoxpA==,1282790086498/1451783432
>> available; sequence id is 1518500874
>> 2010-08-26 07:18:14,691 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN:
>> table_import,d1e50232ac85a0a965e48647de5dc6ce-92233707546248658079F073MJ/gGEEs6mwkLsY/lLH+QvHGVBhBavAz0HSPEEKY+NrjTTzHUJdPtuJ0lXqz2i2Qs2DmFkz\x0A5P2broA7Gg==,1282777778399
>> 2010-08-26 07:18:14,691 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
>> Creating region
>> table_import,d1e50232ac85a0a965e48647de5dc6ce-92233707546248658079F073MJ/gGEEs6mwkLsY/lLH+QvHGVBhBavAz0HSPEEKY+NrjTTzHUJdPtuJ0lXqz2i2Qs2DmFkz\x0A5P2broA7Gg==,1282777778399,
>> encoded=1510556231
>> 2010-08-26 07:18:21,085 DEBUG
>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
>> Total=958.38617MB (1004940736), Free=238.28888MB (249864000), Max=1196.675MB
>> (1254804736), Counts: Blocks=115717, Access=51364517, Hit=231796,
>> Miss=51132721, Evictions=15, Evicted=218920, Ratios: Hit
>> Ratio=0.45127649791538715%, Miss Ratio=99.54872131347656%,
>> Evicted/Run=14594.6669921875
>> 2010-08-26 07:18:27,659 DEBUG org.apache.hadoop.hbase.regionserver.Store:
>> loaded /hbase/table_import/1510556231/fam/2639910770219077750,
>> isReference=false, sequence id=1518500860, length=200014693,
>> majorCompaction=false
>> 2010-08-26 07:18:35,188 INFO org.apache.hadoop.hbase.regionserver.HRegion:
>> region
>> table_import,d1e50232ac85a0a965e48647de5dc6ce-92233707546248658079F073MJ/gGEEs6mwkLsY/lLH+QvHGVBhBavAz0HSPEEKY+NrjTTzHUJdPtuJ0lXqz2i2Qs2DmFkz\x0A5P2broA7Gg==,1282777778399/1510556231
>> available; sequence id is 1518500861
>> 2010-08-26 07:18:35,188 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN:
>> table_import,8ded1642-1c52-444a-bfdc-43521b220714-9223370754627999807UbbWwFDcGatAe8OniLMUXoaVeEdOvSkqiwXfJgUxNlt0aosKXsWevrlra8QDbEvTZelj/jLyux8y\x0AcCBiLeHbqg==,1282792675254
>> 2010-08-26 07:18:35,189 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
>> Creating region
>> table_import,8ded1642-1c52-444a-bfdc-43521b220714-9223370754627999807UbbWwFDcGatAe8OniLMUXoaVeEdOvSkqiwXfJgUxNlt0aosKXsWevrlra8QDbEvTZelj/jLyux8y\x0AcCBiLeHbqg==,1282792675254,
>> encoded=1698505444
>> 2010-08-26 07:19:14,859 ERROR
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>> table_import,8ded1642-1c52-444a-bfdc-43521b220714-9223370754627999807UbbWwFDcGatAe8OniLMUXoaVeEdOvSkqiwXfJgUxNlt0aosKXsWevrlra8QDbEvTZelj/jLyux8y\x0AcCBiLeHbqg==,1282792675254
>> 
>> Note: I had to rename the table name and family above, since its client
>> sensitive data. Hope its OK.
>> Many thanks for your prompt reply!
>>  Martin
>> 
>>> 
>>> Thanks,
>>> St.Ack
>>> 
>>> 
>>>> 
>>>> 
>>>> 2010-08-26 07:19:14,859 ERROR
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>> table_import,8ded1642-1c52-444a-bfdc-43521b220714-9223370754627999807UbbWwFDcGatAe8OniLMUXoaVeEdOvSkqiwXfJgUxNlt0aosKXsWevrlra8QDbEvTZelj/jLyux8y\x0AcCBiLeHbqg==,1282792675254
>>>> java.lang.OutOfMemoryError: Java heap space
>>>>      at
>> org.apache.hadoop.hbase.io.hfile.HFile$BlockIndex.readIndex(HFile.java:1538)
>>>>      at
>> org.apache.hadoop.hbase.io.hfile.HFile$Reader.loadFileInfo(HFile.java:806)
>>>>      at
>> org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:273)
>>>>      at
>> org.apache.hadoop.hbase.regionserver.StoreFile.<init>(StoreFile.java:129)
>>>>      at
>> org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:410)
>>>>      at
>> org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
>>>>      at
>> org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1636)
>>>>      at
>> org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:321)
>>>>      at
>> org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:1571)
>>>>      at
>> org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1538)
>>>>      at
>> org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1458)
>>>>      at java.lang.Thread.run(Thread.java:619)
>>>> 
>> 
>> 
> 
> 
> -- 
> Todd Lipcon
> Software Engineer, Cloudera


Re: Out Of Memory on region servers upon bulk import

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Martin,

Can you paste your conf?

Have you by any chance upped your handler count a lot? Each handler takes up
an amount of RAM equal to the largest Puts you do. With normal write buffer
sizes, you're looking at around 2MB per handler, so while it sounds nice to
bump the handler count up to a really high number, you can get OOMEs like
you're seeing.
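To put rough numbers on it (the second line is purely illustrative):

  25 handlers  x  ~2MB write buffer  ~  50MB of heap pinned by in-flight Puts
 100 handlers  x  12MB write buffer  ~ 1.2GB of heap pinned by in-flight Puts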

Thanks
-Todd

On Thu, Aug 26, 2010 at 9:31 AM, Martin Arnandze <ma...@gmail.com>wrote:

> I provide the answers below.
> Thanks!
>  Martin
>
> On Aug 26, 2010, at 11:45 AM, Stack wrote:
>
> > On Thu, Aug 26, 2010 at 8:07 AM, Martin Arnandze <ma...@gmail.com>
> wrote:
> >> Hi,
> >>  I'm doing an experiment on an 8 node cluster, each of which has 6GB of
> RAM allocated to hbase region server. Basically, doing a bulk import
> processing large files,
> >
> >
> > How large?
>
> about 10 million records each a few Kb.
>
> >
> > Unless very large, it should not be OOMEing.
> >
> > but some imports require to do gets and scans as well. In the master
> > UI I see that the heap used gets very close to the 6GB limit, but I
> > know hbase is eager for memory and will use the heap as much as
> > possible.I use block caching. Looking at similar posts I see that
> > modifying the handler count and memory store upper/ower limits may be
> > key to solving this issue. Nevertheless I wanted to ask if there is a
> > way to estimate the extra memory used by hbase that makes it crash and
> > if there are other configuration settings I should be looking into to
> > prevent OOME. The job runs correctly for some time but region servers
> > eventually crash.
> >>
> >> More information about the cluster:
> >>
> >> - All nodes have 16GM total memory.
> >> - 7 nodes running region server (6GB) +  datanodes (1GB) + task trackers
> (1GB Heap).  Map reduce jobs running w/ 756MB tops each.
> >
> > Good.  How many MR child tasks can run on each node concurrently?
>
> three mappers and two reducers
>
> >
> >> - 1 node running hbase master (2GB Heap allocated), namenode (4GB),
> Secondary Namenode (4GB), JobTracker (4GB) and Master (2GB).
> >> - 3 of the nodes have zookeeper running with 512MB Heap
> >>
> >> Many thanks,
> >>   Martin
> >>
> >
> >
> > Can we see the lines before the below is thrown?   Also, do a listing
> > (ls -r) on this region in hdfs and lets see if anything pops out about
> > files sizes, etc.  You'll need to manually map the below region name
> > to its encoded name to figure the region but the encoded name should
> > be earlier in the log.  You'll do something like:
> >
> > bin/hbase fs -lsr /hbase/table_import/REGION_ENCODED_NAME
>
> /usr/lib/hadoop-0.20/bin/hadoop fs -lsr /hbase/table_import/1698505444
> -rw-r--r--   3 hadoop supergroup       1450 2010-08-25 23:17
> /hbase/table_import/1698505444/.regioninfo
> drwxr-xr-x   - hadoop supergroup          0 2010-08-26 11:07
> /hbase/table_importl/1698505444/fam
> -rw-r--r--   3 hadoop supergroup    4244491 2010-08-26 11:07
> /hbase/table_import/1698505444/fam/5785049964186428982
> -rw-r--r--   3 hadoop supergroup  180147216 2010-08-26 06:09
> /hbase/table_import/1698505444/fam/705757673046090229
>
>
> Previous log:
>
> 010-08-26 07:18:14,691 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> region
> table_import,f1cbb42c-b6ae-404d-800c-043da5409441-9223370754623831807WkmpwnRDmveKYzWEfw/tb4GpP9yHDl+/G7OCaZWEgrmGcW+XEF131YDTQwDqZsO93tDicdPcOdRq\x0AU7zDBqoxpA==,1282790086498/1451783432
> available; sequence id is 1518500874
> 2010-08-26 07:18:14,691 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN:
> table_import,d1e50232ac85a0a965e48647de5dc6ce-92233707546248658079F073MJ/gGEEs6mwkLsY/lLH+QvHGVBhBavAz0HSPEEKY+NrjTTzHUJdPtuJ0lXqz2i2Qs2DmFkz\x0A5P2broA7Gg==,1282777778399
> 2010-08-26 07:18:14,691 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> Creating region
> table_import,d1e50232ac85a0a965e48647de5dc6ce-92233707546248658079F073MJ/gGEEs6mwkLsY/lLH+QvHGVBhBavAz0HSPEEKY+NrjTTzHUJdPtuJ0lXqz2i2Qs2DmFkz\x0A5P2broA7Gg==,1282777778399,
> encoded=1510556231
> 2010-08-26 07:18:21,085 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
> Total=958.38617MB (1004940736), Free=238.28888MB (249864000), Max=1196.675MB
> (1254804736), Counts: Blocks=115717, Access=51364517, Hit=231796,
> Miss=51132721, Evictions=15, Evicted=218920, Ratios: Hit
> Ratio=0.45127649791538715%, Miss Ratio=99.54872131347656%,
> Evicted/Run=14594.6669921875
> 2010-08-26 07:18:27,659 DEBUG org.apache.hadoop.hbase.regionserver.Store:
> loaded /hbase/table_import/1510556231/fam/2639910770219077750,
> isReference=false, sequence id=1518500860, length=200014693,
> majorCompaction=false
> 2010-08-26 07:18:35,188 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> region
> table_import,d1e50232ac85a0a965e48647de5dc6ce-92233707546248658079F073MJ/gGEEs6mwkLsY/lLH+QvHGVBhBavAz0HSPEEKY+NrjTTzHUJdPtuJ0lXqz2i2Qs2DmFkz\x0A5P2broA7Gg==,1282777778399/1510556231
> available; sequence id is 1518500861
> 2010-08-26 07:18:35,188 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN:
> table_import,8ded1642-1c52-444a-bfdc-43521b220714-9223370754627999807UbbWwFDcGatAe8OniLMUXoaVeEdOvSkqiwXfJgUxNlt0aosKXsWevrlra8QDbEvTZelj/jLyux8y\x0AcCBiLeHbqg==,1282792675254
> 2010-08-26 07:18:35,189 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> Creating region
> table_import,8ded1642-1c52-444a-bfdc-43521b220714-9223370754627999807UbbWwFDcGatAe8OniLMUXoaVeEdOvSkqiwXfJgUxNlt0aosKXsWevrlra8QDbEvTZelj/jLyux8y\x0AcCBiLeHbqg==,1282792675254,
> encoded=1698505444
> 2010-08-26 07:19:14,859 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
> table_import,8ded1642-1c52-444a-bfdc-43521b220714-9223370754627999807UbbWwFDcGatAe8OniLMUXoaVeEdOvSkqiwXfJgUxNlt0aosKXsWevrlra8QDbEvTZelj/jLyux8y\x0AcCBiLeHbqg==,1282792675254
>
> Note: I had to rename the table name and family above, since its client
> sensitive data. Hope its OK.
> Many thanks for your prompt reply!
>   Martin
>
> >
> > Thanks,
> > St.Ack
> >
> >
> >>
> >>
> >> 2010-08-26 07:19:14,859 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
> table_import,8ded1642-1c52-444a-bfdc-43521b220714-9223370754627999807UbbWwFDcGatAe8OniLMUXoaVeEdOvSkqiwXfJgUxNlt0aosKXsWevrlra8QDbEvTZelj/jLyux8y\x0AcCBiLeHbqg==,1282792675254
> >> java.lang.OutOfMemoryError: Java heap space
> >>       at
> org.apache.hadoop.hbase.io.hfile.HFile$BlockIndex.readIndex(HFile.java:1538)
> >>       at
> org.apache.hadoop.hbase.io.hfile.HFile$Reader.loadFileInfo(HFile.java:806)
> >>       at
> org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:273)
> >>       at
> org.apache.hadoop.hbase.regionserver.StoreFile.<init>(StoreFile.java:129)
> >>       at
> org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:410)
> >>       at
> org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
> >>       at
> org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1636)
> >>       at
> org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:321)
> >>       at
> org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:1571)
> >>       at
> org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1538)
> >>       at
> org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1458)
> >>       at java.lang.Thread.run(Thread.java:619)
> >>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Out Of Memory on region servers upon bulk import

Posted by Stack <st...@duboce.net>.
On Thu, Aug 26, 2010 at 9:31 AM, Martin Arnandze <ma...@gmail.com> wrote:
> I provide the answers below.
> Thanks!
>  Martin
>
> On Aug 26, 2010, at 11:45 AM, Stack wrote:
>
>> On Thu, Aug 26, 2010 at 8:07 AM, Martin Arnandze <ma...@gmail.com> wrote:
>>> Hi,
>>>  I'm doing an experiment on an 8 node cluster, each of which has 6GB of RAM allocated to hbase region server. Basically, doing a bulk import processing large files,
>>
>>
>> How large?
>
> about 10 million records each a few Kb.
>

So the files are not large?  They are but a few kb each?

St.Ack

Re: Out Of Memory on region servers upon bulk import

Posted by Martin Arnandze <ma...@gmail.com>.
I provide the answers below.
Thanks!
  Martin

On Aug 26, 2010, at 11:45 AM, Stack wrote:

> On Thu, Aug 26, 2010 at 8:07 AM, Martin Arnandze <ma...@gmail.com> wrote:
>> Hi,
>>  I'm doing an experiment on an 8 node cluster, each of which has 6GB of RAM allocated to hbase region server. Basically, doing a bulk import processing large files,
> 
> 
> How large?

about 10 million records each a few Kb.

> 
> Unless very large, it should not be OOMEing.
> 
> but some imports require to do gets and scans as well. In the master
> UI I see that the heap used gets very close to the 6GB limit, but I
> know hbase is eager for memory and will use the heap as much as
> possible.I use block caching. Looking at similar posts I see that
> modifying the handler count and memory store upper/ower limits may be
> key to solving this issue. Nevertheless I wanted to ask if there is a
> way to estimate the extra memory used by hbase that makes it crash and
> if there are other configuration settings I should be looking into to
> prevent OOME. The job runs correctly for some time but region servers
> eventually crash.
>> 
>> More information about the cluster:
>> 
>> - All nodes have 16GM total memory.
>> - 7 nodes running region server (6GB) +  datanodes (1GB) + task trackers (1GB Heap).  Map reduce jobs running w/ 756MB tops each.
> 
> Good.  How many MR child tasks can run on each node concurrently? 

three mappers and two reducers

> 
>> - 1 node running hbase master (2GB Heap allocated), namenode (4GB), Secondary Namenode (4GB), JobTracker (4GB) and Master (2GB).
>> - 3 of the nodes have zookeeper running with 512MB Heap
>> 
>> Many thanks,
>>   Martin
>> 
> 
> 
> Can we see the lines before the below is thrown?   Also, do a listing
> (ls -r) on this region in hdfs and lets see if anything pops out about
> files sizes, etc.  You'll need to manually map the below region name
> to its encoded name to figure the region but the encoded name should
> be earlier in the log.  You'll do something like:
> 
> bin/hbase fs -lsr /hbase/table_import/REGION_ENCODED_NAME

/usr/lib/hadoop-0.20/bin/hadoop fs -lsr /hbase/table_import/1698505444
-rw-r--r--   3 hadoop supergroup       1450 2010-08-25 23:17 /hbase/table_import/1698505444/.regioninfo
drwxr-xr-x   - hadoop supergroup          0 2010-08-26 11:07 /hbase/table_importl/1698505444/fam
-rw-r--r--   3 hadoop supergroup    4244491 2010-08-26 11:07 /hbase/table_import/1698505444/fam/5785049964186428982
-rw-r--r--   3 hadoop supergroup  180147216 2010-08-26 06:09 /hbase/table_import/1698505444/fam/705757673046090229


Previous log:

2010-08-26 07:18:14,691 INFO org.apache.hadoop.hbase.regionserver.HRegion: region table_import,f1cbb42c-b6ae-404d-800c-043da5409441-9223370754623831807WkmpwnRDmveKYzWEfw/tb4GpP9yHDl+/G7OCaZWEgrmGcW+XEF131YDTQwDqZsO93tDicdPcOdRq\x0AU7zDBqoxpA==,1282790086498/1451783432 available; sequence id is 1518500874
2010-08-26 07:18:14,691 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: table_import,d1e50232ac85a0a965e48647de5dc6ce-92233707546248658079F073MJ/gGEEs6mwkLsY/lLH+QvHGVBhBavAz0HSPEEKY+NrjTTzHUJdPtuJ0lXqz2i2Qs2DmFkz\x0A5P2broA7Gg==,1282777778399
2010-08-26 07:18:14,691 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Creating region table_import,d1e50232ac85a0a965e48647de5dc6ce-92233707546248658079F073MJ/gGEEs6mwkLsY/lLH+QvHGVBhBavAz0HSPEEKY+NrjTTzHUJdPtuJ0lXqz2i2Qs2DmFkz\x0A5P2broA7Gg==,1282777778399, encoded=1510556231
2010-08-26 07:18:21,085 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=958.38617MB (1004940736), Free=238.28888MB (249864000), Max=1196.675MB (1254804736), Counts: Blocks=115717, Access=51364517, Hit=231796, Miss=51132721, Evictions=15, Evicted=218920, Ratios: Hit Ratio=0.45127649791538715%, Miss Ratio=99.54872131347656%, Evicted/Run=14594.6669921875
2010-08-26 07:18:27,659 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/table_import/1510556231/fam/2639910770219077750, isReference=false, sequence id=1518500860, length=200014693, majorCompaction=false
2010-08-26 07:18:35,188 INFO org.apache.hadoop.hbase.regionserver.HRegion: region table_import,d1e50232ac85a0a965e48647de5dc6ce-92233707546248658079F073MJ/gGEEs6mwkLsY/lLH+QvHGVBhBavAz0HSPEEKY+NrjTTzHUJdPtuJ0lXqz2i2Qs2DmFkz\x0A5P2broA7Gg==,1282777778399/1510556231 available; sequence id is 1518500861
2010-08-26 07:18:35,188 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: table_import,8ded1642-1c52-444a-bfdc-43521b220714-9223370754627999807UbbWwFDcGatAe8OniLMUXoaVeEdOvSkqiwXfJgUxNlt0aosKXsWevrlra8QDbEvTZelj/jLyux8y\x0AcCBiLeHbqg==,1282792675254
2010-08-26 07:18:35,189 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Creating region table_import,8ded1642-1c52-444a-bfdc-43521b220714-9223370754627999807UbbWwFDcGatAe8OniLMUXoaVeEdOvSkqiwXfJgUxNlt0aosKXsWevrlra8QDbEvTZelj/jLyux8y\x0AcCBiLeHbqg==,1282792675254, encoded=1698505444
2010-08-26 07:19:14,859 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening table_import,8ded1642-1c52-444a-bfdc-43521b220714-9223370754627999807UbbWwFDcGatAe8OniLMUXoaVeEdOvSkqiwXfJgUxNlt0aosKXsWevrlra8QDbEvTZelj/jLyux8y\x0AcCBiLeHbqg==,1282792675254

Note: I had to rename the table name and family above, since it's client-sensitive data. Hope that's OK.
Many thanks for your prompt reply!
  Martin

> 
> Thanks,
> St.Ack
> 
> 
>> 
>> 
>> 2010-08-26 07:19:14,859 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening table_import,8ded1642-1c52-444a-bfdc-43521b220714-9223370754627999807UbbWwFDcGatAe8OniLMUXoaVeEdOvSkqiwXfJgUxNlt0aosKXsWevrlra8QDbEvTZelj/jLyux8y\x0AcCBiLeHbqg==,1282792675254
>> java.lang.OutOfMemoryError: Java heap space
>>       at org.apache.hadoop.hbase.io.hfile.HFile$BlockIndex.readIndex(HFile.java:1538)
>>       at org.apache.hadoop.hbase.io.hfile.HFile$Reader.loadFileInfo(HFile.java:806)
>>       at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:273)
>>       at org.apache.hadoop.hbase.regionserver.StoreFile.<init>(StoreFile.java:129)
>>       at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:410)
>>       at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
>>       at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1636)
>>       at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:321)
>>       at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:1571)
>>       at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1538)
>>       at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1458)
>>       at java.lang.Thread.run(Thread.java:619)
>> 


Re: Out Of Memory on region servers upon bulk import

Posted by Stack <st...@duboce.net>.
On Thu, Aug 26, 2010 at 8:07 AM, Martin Arnandze <ma...@gmail.com> wrote:
> Hi,
>  I'm doing an experiment on an 8 node cluster, each of which has 6GB of RAM allocated to hbase region server. Basically, doing a bulk import processing large files,


How large?

Unless very large, it should not be OOMEing.

but some imports require to do gets and scans as well. In the master
UI I see that the heap used gets very close to the 6GB limit, but I
know hbase is eager for memory and will use the heap as much as
possible.I use block caching. Looking at similar posts I see that
modifying the handler count and memory store upper/ower limits may be
key to solving this issue. Nevertheless I wanted to ask if there is a
way to estimate the extra memory used by hbase that makes it crash and
if there are other configuration settings I should be looking into to
prevent OOME. The job runs correctly for some time but region servers
eventually crash.
>
> More information about the cluster:
>
> - All nodes have 16GM total memory.
> - 7 nodes running region server (6GB) +  datanodes (1GB) + task trackers (1GB Heap).  Map reduce jobs running w/ 756MB tops each.

Good.  How many MR child tasks can run on each node concurrently?

> - 1 node running hbase master (2GB Heap allocated), namenode (4GB), Secondary Namenode (4GB), JobTracker (4GB) and Master (2GB).
> - 3 of the nodes have zookeeper running with 512MB Heap
>
> Many thanks,
>   Martin
>


Can we see the lines before the below is thrown?   Also, do a listing
(ls -r) on this region in hdfs and lets see if anything pops out about
files sizes, etc.  You'll need to manually map the below region name
to its encoded name to figure the region but the encoded name should
be earlier in the log.  You'll do something like:

bin/hbase fs -lsr /hbase/table_import/REGION_ENCODED_NAME

Thanks,
St.Ack


>
>
> 2010-08-26 07:19:14,859 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening table_import,8ded1642-1c52-444a-bfdc-43521b220714-9223370754627999807UbbWwFDcGatAe8OniLMUXoaVeEdOvSkqiwXfJgUxNlt0aosKXsWevrlra8QDbEvTZelj/jLyux8y\x0AcCBiLeHbqg==,1282792675254
> java.lang.OutOfMemoryError: Java heap space
>       at org.apache.hadoop.hbase.io.hfile.HFile$BlockIndex.readIndex(HFile.java:1538)
>       at org.apache.hadoop.hbase.io.hfile.HFile$Reader.loadFileInfo(HFile.java:806)
>       at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:273)
>       at org.apache.hadoop.hbase.regionserver.StoreFile.<init>(StoreFile.java:129)
>       at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:410)
>       at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
>       at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1636)
>       at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:321)
>       at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:1571)
>       at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1538)
>       at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1458)
>       at java.lang.Thread.run(Thread.java:619)
>