Posted to user@hbase.apache.org by "Anusauskas, Laimonas" <LA...@corp.untd.com> on 2013/07/17 18:06:58 UTC

Memory leak in HBase replication ?

Hi,

I am fairly new to HBase. We are trying to set up an OpenTSDB system here and just started setting up production clusters. We have 2 datacenters, on the west and east coasts, and we want to have 2 active-passive HBase clusters with HBase replication between them. Right now each cluster has 4 nodes (1 master, 3 slaves), and we will add more nodes as the load ramps up. Setup went fine and data started getting replicated from one cluster to the other, but as soon as load picked up, regionservers on the slave cluster started running out of heap and getting killed. I increased the heap size on the regionservers from the default 1000M to 2000M, but the result was the same. I also updated HBase from the version that came with Hortonworks (hbase-0.94.6.1.3.0.0-107-security) to hbase-0.94.9 - still the same.

Now the load on the source cluster is still very light. There is one active table - tsdb - and its compressed size is less than 200M. But as soon as I start replication, the usedHeapMB metric on the regionservers in the slave cluster starts going up, then full GC kicks in, and eventually the process is killed because "-XX:OnOutOfMemoryError=kill -9 %p" is set.
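
For reference, the heap and the kill-on-OOM flag are configured in hbase-env.sh on each regionserver. A trimmed sketch, not our literal file:

export HBASE_HEAPSIZE=2000    # heap for the HBase daemons, in MB
# the OOM kill switch goes into the regionserver opts; watch the quoting
# around "kill -9 %p" since it contains a space
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:OnOutOfMemoryError=\"kill -9 %p\""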

I did a heap dump and ran the Eclipse memory analyzer, and here is what it reported:

One instance of "java.util.concurrent.LinkedBlockingQueue" loaded by "<system class loader>" occupies 1,411,643,656 (67.87%) bytes. The instance is referenced by org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server @ 0x7831c37f0 , loaded by "sun.misc.Launcher$AppClassLoader @ 0x783130980". The memory is accumulated in one instance of "java.util.concurrent.LinkedBlockingQueue$Node" loaded by "<system class loader>".

And

502,763 instances of "org.apache.hadoop.hbase.client.Put", loaded by "sun.misc.Launcher$AppClassLoader @ 0x783130980" occupy 244,957,616 (11.78%) bytes.
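
For anyone who wants to poke at the dump themselves: it is a plain HotSpot heap dump opened in Eclipse MAT. Something along these lines produces an equivalent one (<RS_PID> is a placeholder for the regionserver process id, and the file gets large):

jmap -dump:format=b,file=regionserver.hprof <RS_PID>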

There is nothing in the logs until full GC kicks in, at which point all hell breaks loose and things start timing out, etc.

I did a bunch of searching but came up with nothing. I could add more RAM to the nodes and increase the heap size, but I suspect that would only prolong the time until the heap gets full.

Any help would be appreciated.

Limus

RE: Memory leak in HBase replication ?

Posted by "Anusauskas, Laimonas" <LA...@corp.untd.com>.
I don't know how this works well enough to suggest lowering the default setting - maybe 64MB really helps throughput for other setups? At least there could be a note in the HBase requirements about heap sizes and replication.

Ideally there would be throttling of some kind, so that if the target regionserver cannot keep up with replication requests the replication rate is slowed down, and at least the regionserver does not run out of free heap space.

Limus

Re: Memory leak in HBase replication ?

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Yes... your master cluster must have a helluva backlog to replicate :)

Seems to make a good argument to lower the default setting. What do you
think?

J-D


RE: Memory leak in HBase replication ?

Posted by "Anusauskas, Laimonas" <LA...@corp.untd.com>.
Thanks, setting replication.source.size.capacity to 2MB resolved this. I see the heap growing to about 700MB but then going back down, and full GC is only triggered occasionally.

And while the primary cluster still has very little load (< 100 requests/sec), the standby cluster is now pretty loaded at 5K requests/sec, presumably because it has to replicate all the pending changes. So perhaps this is an issue that happens when the standby cluster goes away for a while and then has to catch up.

Really appreciate the help.

Limus


Re: Memory leak in HBase replication ?

Posted by Jean-Daniel Cryans <jd...@apache.org>.
1GB is a pretty small heap and it could be that the default size for logs
to replicate is set too high. The default
for replication.source.size.capacity is 64MB. Can you set it much lower on
your master cluster (on each RS), like 2MB, and see if it makes a
difference?
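
Something like this in hbase-site.xml on each RS of the master cluster should do it (the value is in bytes as far as I remember, so 2MB = 2097152 - double-check against your version, and I think the RSes need a restart to pick it up):

<property>
  <name>replication.source.size.capacity</name>
  <value>2097152</value>
</property>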

The logs and the jstack seem to correlate in that sense.

Thx,

J-D


RE: Memory leak in HBase replication ?

Posted by "Anusauskas, Laimonas" <LA...@corp.untd.com>.
And here is the jstack output. 

http://pastebin.com/JKnQYqRg



RE: Memory leak in HBase replication ?

Posted by "Anusauskas, Laimonas" <LA...@corp.untd.com>.
OK, here is the log from data node 1:

http://pastebin.com/yCYYEG2r

And the .out log containing the GC log:

http://pastebin.com/wzt1fbTA

I started replication around 11:16, and with the 1000M heap it got full pretty fast.
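
(The GC lines in that .out paste come from the usual verbose-GC options in HBASE_REGIONSERVER_OPTS - I believe roughly:

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

but I'm guessing at the exact flags.)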

Limus

Re: Memory leak in HBase replication ?

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Yeah, WARN won't give us anything, and please try to get us a fat log. Post
it on pastebin or such.

Thx,

J-D



RE: Memory leak in HBase replication ?

Posted by "Anusauskas, Laimonas" <LA...@corp.untd.com>.
J-D,

I have the log level at org.apache=WARN and there is only the following in the logs before the GC happens:

2013-07-17 10:56:45,830 ERROR org.apache.hadoop.hbase.regionserver.metrics.SchemaMetrics: Inconsistent configuration. Previous configuration for using table name in metrics: true, new configuration: false
2013-07-17 10:56:47,395 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library is available

I'll try upping the log level to DEBUG to see if that shows anything, and I will run jstack.
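
For the log level I'm planning to just flip the HBase logger in log4j.properties on the slave regionservers, i.e. something like this (assuming that's still the right logger name for 0.94; the current WARN comes from the same kind of entry):

log4j.logger.org.apache.hadoop.hbase=DEBUG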

Thanks,

Limus






Re: Memory leak in HBase replication ?

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Those puts should get cleared right away, so it could mean that they
live in memory... which usually points to very full IPC queues. If you
jstack those region servers, are all the handler threads full? What does
the log show before it starts doing full GCs? Can we see it?
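
A plain thread dump of the regionserver process is enough for that, e.g. (<RS_PID> being whatever pid the HRegionServer runs as):

jstack <RS_PID> > rs.jstack

Then look at the "IPC Server handler" threads: if most of them are busy applying replicated edits rather than sitting idle, that would match the theory.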

Thx,

J-D
