Posted to user@hbase.apache.org by Ameya Kantikar <am...@groupon.com> on 2013/04/18 07:38:17 UTC

Under Heavy Write Load + Replication On : Brings All My Region Servers Dead

I am running HBase 0.94.2 from Cloudera CDH4.2 (10-machine cluster).

Under heavy write load, and when replication is on, all my region servers
are going down.
I checked the Cloudera version I am using, and it has the HBASE-2611 bug
patched, so I'm not sure what's going on. Here is the stack trace:

2013-04-18 01:47:33,423 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Atomically moving relevance-hbase5-snc1.snc1,60020,1366247910200's hlogs to my queue
2013-04-18 01:47:33,424 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper:  The multi list size is: 1
2013-04-18 01:47:33,425 WARN org.apache.hadoop.hbase.replication.ReplicationZookeeper: Got exception in copyQueuesFromRSUsingMulti:
org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:125)
        at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:925)
        at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:901)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:538)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1457)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.copyQueuesFromRSUsingMulti(ReplicationZookeeper.java:705)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:585)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)


Followed by

2013-04-18 01:47:36,043 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server relevance-hbase2-snc1.snc1,60020,1366247745434: Writing replication status


When I turn replication off, everything seems fine. I can reproduce this
bug almost every time I run my write-heavy job.


Here is the complete log:

http://pastebin.com/da0m475T



Any ideas?


Ameya

Re: Under Heavy Write Load + Replication On : Brings All My Region Servers Dead

Posted by Ameya Kantikar <am...@groupon.com>.

Awesome. Thanks Himanshu.


On Wed, Apr 17, 2013 at 10:48 PM, Himanshu Vashishtha <
hvashish@cs.ualberta.ca> wrote:

> Hello Ameya,
>
> Sorry to hear that.
>
> You have two options:
>
> 1) Apply HBase-8099 patch to your version. (
> https://issues.apache.org/jira/browse/HBASE-8099) The patch is simple, so
> should be easy to do, OR,
> 2) Turn off zk.multi feature (see hbase-default.xml). (You can refer to
> CDH4.2.0 docs for that)
>
> This fix (HBase-8099) will be in CDH4.2.1, though.
>
> Please ask list if you have any more questions.
>
> Thanks,
> Himanshu
>

Re: Under Heavy Write Load + Replication On : Brings All My Region Servers Dead

Posted by Himanshu Vashishtha <hv...@cs.ualberta.ca>.

Hello Ameya,

Sorry to hear that.

You have two options:

1) Apply the HBASE-8099 patch to your version
(https://issues.apache.org/jira/browse/HBASE-8099). The patch is simple, so
it should be easy to do, OR
2) Turn off the zk.multi feature (see hbase-default.xml). You can refer to
the CDH4.2.0 docs for that.

The fix (HBASE-8099) will be in CDH4.2.1, though.

Please ask the list if you have any more questions.

Thanks,
Himanshu
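
For option 2, here is a minimal sketch of the override. It assumes that the
"zk.multi feature" refers to the hbase.zookeeper.useMulti property listed in
hbase-default.xml; the thread does not name the property, so check it against
your distribution's documentation. The override would go in hbase-site.xml on
each region server, followed by a restart:

```xml
<!-- hbase-site.xml: sketch only; the property name is an assumption,
     since the thread says only "zk.multi feature" -->
<property>
  <name>hbase.zookeeper.useMulti</name>
  <!-- false = do not use ZooKeeper multi-update for replication queue failover -->
  <value>false</value>
</property>
```

As far as I understand, turning multi off gives up the atomic queue transfer
that the HBASE-2611 fix relies on, so this is probably best treated as a
stopgap until the HBASE-8099 patch is in place.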
