You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hbase.apache.org by Marco Gallotta <ma...@gallotta.co.za> on 2012/12/29 12:44:57 UTC

Hbase master failing to start after reaching 95% disk and expanding cluster

Hi there 

I've been running an hbase cluster for several months, and it recently experienced problems as the nodes reached 95% disk capacity. I added an extra node, and now the master keeps crashing with the errors below. I also increased the disk capacity on each individual node after this, and the errors are the same. I tried removing the new node, and that doesn't help.

There are similar errors in the regionserver and zookeeper logs, but the all seem to echo from the master logs.

Anything I can look at to help diagnose what the problem here is?

hbase-root-master-analytics.log:
Sat Dec 29 03:14:22 PST 2012 Starting master on analytics
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 59480
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 59480
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
2012-12-29 03:14:24,601 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
2012-12-29 03:14:24,614 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
2012-12-29 03:14:24,622 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
2012-12-29 03:14:24,631 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
2012-12-29 03:14:24,636 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
2012-12-29 03:14:24,643 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
2012-12-29 03:14:24,651 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
2012-12-29 03:14:24,665 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
2012-12-29 03:14:24,675 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
2012-12-29 03:14:24,698 INFO org.apache.hadoop.ipc.HBaseServer: Starting IPC Server listener on 60000
2012-12-29 03:14:25,322 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
2012-12-29 03:14:28,735 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase

2012-12-29 03:14:32,797 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
2012-12-29 03:14:41,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
2012-12-29 03:14:41,427 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 3 retries
2012-12-29 03:14:41,428 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster

        at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:1740)
        at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:146)
        at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:103)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:76)
        at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1754)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1021)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1049)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:176)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:896)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.createBaseZNodes(ZooKeeperWatcher.java:161)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:154)
        at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:281)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
        at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:1735)
        ... 5 more


-- 
Marco Gallotta | Mountain View, California
Software Engineer, Infrastructure | Loki Studios
fb.me/marco.gallotta | twitter.com/marcog
marco@gallotta.co.za | +1 (650) 417-3313

Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

Re: Hbase master failing to start after reaching 95% disk and expanding cluster

Posted by Marco Gallotta <ma...@gallotta.co.za>.

On Sunday 30 December 2012 at 1:07 AM, Stack wrote:
> On Sat, Dec 29, 2012 at 9:22 AM, Marco Gallotta <marco@gallotta.co.za (mailto:marco@gallotta.co.za)>wrote:
> 
> > I also see a number of these warnings in the zookeeper logs, which look
> > quite telling. Zookeeper is running on the slaves in question, and port
> > 3888 is unblocked in the firewall.
> > 
> > 2012-12-29 07:23:42,492 WARN
> > org.apache.zookeeper.server.quorum.QuorumCnxManager
> > : Cannot open channel to 1 at election address
> > slave1.analytics-internal.lokistu
> > dios.com/10.171.98.247:3888 (http://dios.com/10.171.98.247:3888)
> > java.net.ConnectException: Connection refused
> > 
> 
> 
> 
> Is this server up and running. Connection refused is pretty plain
> speaking. Can you ping this server? Can you telnet to the server on the
> above port? What is in the logs for the server at 98.247?
> 
> It seems like HBase Master is failing because it cannot connect to zk.
> 
> St.Ack 
Got it working, after making several changes and I'm not quite sure which fixed it. The servers were all running, and I was able to telnet into the server on 3888.

-- 
Marco Gallotta | Mountain View, California
Software Engineer, Infrastructure | Loki Studios
fb.me/marco.gallotta | twitter.com/marcog
marco@gallotta.co.za | +1 (650) 417-3313

Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

Re: Hbase master failing to start after reaching 95% disk and expanding cluster

Posted by Stack <st...@duboce.net>.

On Sat, Dec 29, 2012 at 9:22 AM, Marco Gallotta <ma...@gallotta.co.za>wrote:

> I also see a number of these warnings in the zookeeper logs, which look
> quite telling. Zookeeper is running on the slaves in question, and port
> 3888 is unblocked in the firewall.
>
> 2012-12-29 07:23:42,492 WARN
> org.apache.zookeeper.server.quorum.QuorumCnxManager
> : Cannot open channel to 1 at election address
> slave1.analytics-internal.lokistu
> dios.com/10.171.98.247:3888
> java.net.ConnectException: Connection refused
>

Is this server up and running.  Connection refused is pretty plain
speaking.  Can you ping this server? Can you telnet to the server on the
above port?  What is in the logs for the server at 98.247?

It seems like HBase Master is failing because it cannot connect to zk.

St.Ack

Re: Hbase master failing to start after reaching 95% disk and expanding cluster

Posted by Marco Gallotta <ma...@gallotta.co.za>.

I also see a number of these warnings in the zookeeper logs, which look quite telling. Zookeeper is running on the slaves in question, and port 3888 is unblocked in the firewall.

2012-12-29 07:23:42,492 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager
: Cannot open channel to 1 at election address slave1.analytics-internal.lokistu
dios.com/10.171.98.247:3888
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.ja
va:327)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocket
Impl.java:193)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java
:180)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
        at java.net.Socket.connect(Socket.java:546)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(Quorum
CnxManager.java:354)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxM
anager.java:327)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$Worke
rSender.process(FastLeaderElection.java:393)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365)
        at java.lang.Thread.run(Thread.java:679)


-- 
Marco Gallotta | Mountain View, California
Software Engineer, Infrastructure | Loki Studios
fb.me/marco.gallotta | twitter.com/marcog
marco@gallotta.co.za | +1 (650) 417-3313

Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Saturday 29 December 2012 at 1:44 PM, Marco Gallotta wrote:

> Hi there 
> 
> I've been running an hbase cluster for several months, and it recently experienced problems as the nodes reached 95% disk capacity. I added an extra node, and now the master keeps crashing with the errors below. I also increased the disk capacity on each individual node after this, and the errors are the same. I tried removing the new node, and that doesn't help.
> 
> There are similar errors in the regionserver and zookeeper logs, but the all seem to echo from the master logs.
> 
> Anything I can look at to help diagnose what the problem here is?
> 
> hbase-root-master-analytics.log:
> Sat Dec 29 03:14:22 PST 2012 Starting master on analytics
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 59480
> max locked memory       (kbytes, -l) 64
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 8192
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 59480
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
> 2012-12-29 03:14:24,601 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,614 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,622 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,631 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,636 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,643 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,651 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,665 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,675 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,698 INFO org.apache.hadoop.ipc.HBaseServer: Starting IPC Server listener on 60000
> 2012-12-29 03:14:25,322 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
> 2012-12-29 03:14:28,735 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
> 
> 2012-12-29 03:14:32,797 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
> 2012-12-29 03:14:41,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
> 2012-12-29 03:14:41,427 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 3 retries
> 2012-12-29 03:14:41,428 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
> java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster
> 
>         at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:1740)
>         at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:146)
>         at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:103)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:76)
>         at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1754)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1021)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1049)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:176)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:896)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.createBaseZNodes(ZooKeeperWatcher.java:161)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:154)
>         at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:281)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
>         at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:1735)
>         ... 5 more
> 
> 
> -- 
> Marco Gallotta | Mountain View, California
> Software Engineer, Infrastructure | Loki Studios
> fb.me/marco.gallotta (http://fb.me/marco.gallotta) | twitter.com/marcog (http://twitter.com/marcog)
> marco@gallotta.co.za (mailto:marco@gallotta.co.za) | +1 (650) 417-3313
> 
> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
>