Posted to user@zookeeper.apache.org by Something Something <ma...@gmail.com> on 2012/05/26 03:22:34 UTC

HBase dies after some time

Hello,

I recently installed ZooKeeper & HBase on our dedicated Hadoop cluster on
EC2.  HBase stays active for some time, but after a while it dies with
error messages similar to these:

2012-05-25 12:09:27,514 ERROR org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: master:60000-0x5378489312c0004-0x5378489312c0004 Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAsAddress(ZKUtil.java:620)
        at org.apache.hadoop.hbase.master.ActiveMasterManager.stop(ActiveMasterManager.java:197)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:310)
2012-05-25 12:09:27,514 ERROR org.apache.hadoop.hbase.master.ActiveMasterManager: master:60000-0x5378489312c0004-0x5378489312c0004 Error deleting our own master address node
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAsAddress(ZKUtil.java:620)
        at org.apache.hadoop.hbase.master.ActiveMasterManager.stop(ActiveMasterManager.java:197)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:310)


This kills the HMaster as well as all HRegionServers.  Could it be that my
ZooKeeper setup is incorrect?  Please help.  Thanks.

Re: HBase dies after some time

Posted by Something Something <ma...@gmail.com>.
These are the exceptions I see in the ZooKeeper log:


2012-05-25 13:56:55,523 - ERROR [CommitProcessor:2:NIOServerCnxn@445] - Unexpected Exception:
java.nio.channels.CancelledKeyException
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
        at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
        at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
        at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:367)
        at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
2012-05-25 13:56:55,523 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable to read additional data from client sessionid 0x237858fc7a00003, likely client has closed socket
2012-05-25 13:56:55,523 - ERROR [CommitProcessor:2:NIOServerCnxn@445] - Unexpected Exception:
java.nio.channels.CancelledKeyException
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
        at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
        at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
        at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:367)
        at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
2012-05-25 13:56:55,524 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for client /10.58.102.92:44170 which had sessionid 0x237858fc7a00003
2012-05-25 13:56:55,524 - ERROR [CommitProcessor:2:NIOServerCnxn@445] - Unexpected Exception:
java.nio.channels.CancelledKeyException
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
        at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
        at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
        at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:367)
        at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
2012-05-25 13:56:55,524 - ERROR [CommitProcessor:2:NIOServerCnxn@445] - Unexpected Exception:
java.nio.channels.CancelledKeyException
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
        at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
        at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
        at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:367)
        at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)


On Fri, May 25, 2012 at 6:46 PM, N Keywal <nk...@gmail.com> wrote:

> Hi,
>
> The master has lost its connection to ZooKeeper. In this case it
> stops as the consistency of the cluster cannot be ensured. There are
> both a retry number and a timeout setting to control this, but it's
> not the root cause. The default is 10 tries and 3 minutes, so when it
> happens it means you have a serious issue.
> Note that the region servers will continue to work without the master.
> But maybe they have lost their connection to ZK as well (in this case
> they will stop, for the same reason). You should check the state of
> the network between ZK and the master, or look at the ZK logs to check
> that nobody killed it / killed them.
>
> N.

Re: HBase dies after some time

Posted by N Keywal <nk...@gmail.com>.
Hi,

The master has lost its connection to ZooKeeper. In this case it
stops as the consistency of the cluster cannot be ensured. There are
both a retry number and a timeout setting to control this, but it's
not the root cause. The default is 10 tries and 3 minutes, so when it
happens it means you have a serious issue.
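
A rough sketch of how the timeout above interacts with the ZooKeeper server, in case it helps. The 3 minutes is what the HBase side requests (the zookeeper.session.timeout property), but a ZooKeeper server bounds every session timeout between its minSessionTimeout and maxSessionTimeout, which default to 2 and 20 times tickTime. The function name and the tickTime of 2000 ms (the value in ZooKeeper's sample zoo.cfg) are assumptions for illustration, not values taken from this cluster.

```python
def negotiated_session_timeout(requested_ms, tick_time_ms=2000,
                               min_ms=None, max_ms=None):
    """Return the session timeout a ZooKeeper server would grant.

    Hypothetical helper: it only mirrors the documented clamping rule
    (defaults of 2x and 20x tickTime); it does not talk to a server.
    """
    lower = 2 * tick_time_ms if min_ms is None else min_ms
    upper = 20 * tick_time_ms if max_ms is None else max_ms
    return max(lower, min(requested_ms, upper))


# A 3-minute request against a server with tickTime=2000 (assumed value).
granted = negotiated_session_timeout(180_000)
print(granted)  # 40000
```

Under these assumptions the session would expire after 40 seconds of silence rather than 3 minutes, so a short network stall may be enough to lose it. Check the actual tickTime and maxSessionTimeout in your zoo.cfg before drawing conclusions.
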
Note that the region servers will continue to work without the master.
But maybe they have lost their connection to ZK as well (in this case
they will stop, for the same reason). You should check the state of
the network between ZK and the master, or look at the ZK logs to check
that nobody killed it / killed them.
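
One way to check that the ZooKeeper servers are still alive and reachable from the master host is ZooKeeper's four-letter "ruok" command, which a healthy server answers with "imok". The sketch below is only a minimal illustration: the function name and the host names in the comment are hypothetical, and newer ZooKeeper releases may require the command to be whitelisted on the server.

```python
import socket


def zk_is_ok(host, port=2181, timeout=5.0):
    """Send 'ruok' to a ZooKeeper server; True only if it answers 'imok'.

    Hypothetical helper. Any connection error or timeout is reported
    as False, since either one means the server is not usable from here.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            chunks = []
            while True:
                # The server closes the connection after replying,
                # so read until recv() returns an empty result.
                data = sock.recv(64)
                if not data:
                    break
                chunks.append(data)
            return b"".join(chunks).strip() == b"imok"
    except OSError:
        return False


# Example (hypothetical host names), run from the HBase master machine:
#   for host in ["zk1.example.com", "zk2.example.com", "zk3.example.com"]:
#       print(host, zk_is_ok(host))
```

Running this from the master and from each region server host helps separate a dead ZooKeeper process from a network problem between the machines. An "imok" reply only shows the process is running and reachable, not that the ensemble has a quorum.
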

N.
