Posted to dev@hbase.apache.org by Mikael Sitruk <mi...@gmail.com> on 2011/12/05 14:10:18 UTC

HBase master startup and the "Unable to read additional data from server sessionid 0x0" zk error.

Hi

I would like to share my findings on the "Unable to read additional data
from server sessionid 0x0" zk error, which prevented the HBase Master from
starting.

I have a cluster of 10 RS and a ZK quorum of 3 machines.
I use a script to start the cluster: hdfs, mapreduce, the zk quorum, the
HBase Master and finally the HBase RS.

Using the script, everything started except HBase.

While checking the log I found that a zk exception was thrown during
startup:
2011-12-05 00:05:34,622 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster
        at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:1069)
        at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:142)
        at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:102)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:76)
        at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1083)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:902)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:133)
        at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:223)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:1064)
        ... 5 more

Googling the subject did not give me enough insight into the problem.

I checked zk, and from the shell I got the same kind of exception, so I
reinstalled zk and checked it from the command line, and everything was ok.
I thought it would then be the same with HBase, but no! Again I got the
same behavior (HMaster failed), although this time zk was stable from the
command line (zkCli).

I continued with several experiments, and then I found the sequence of
operations that causes the problem!
If I start the ZK quorum in an order other than the ZK leader (the one
whose myid contains 1) first, followed by the other zk servers, and then
immediately start the HBase master, the HBase master fails to load with the
error above.
I added a 10-second wait to the script between the ZK start and the HBase
start, and that resolved the problem.
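
As an alternative to a fixed 10-second sleep, the start script could poll
each ZK server until a majority reports that it is serving. The sketch below
is only an illustration. It assumes the `srvr` four-letter command is
reachable on the client port (it exists since ZooKeeper 3.3.0, and newer
releases may require it to be whitelisted) and that a server which has not
joined a quorum yet answers without a `Mode:` line. The host names, function
names and the 60-second deadline are made up for the example.

```python
import socket
import time


def four_letter_word(host, port, cmd=b"srvr", timeout=2.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(reply):
    """Return the text after 'Mode:' (e.g. 'leader'), or None if absent."""
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def wait_for_quorum(servers, needed, deadline_s=60.0, poll_s=1.0,
                    probe=four_letter_word):
    """Poll until at least `needed` servers report a mode or time runs out.

    `probe` is injectable so the logic can be exercised without a cluster.
    """
    end = time.monotonic() + deadline_s
    while True:
        serving = 0
        for host, port in servers:
            try:
                if parse_mode(probe(host, port)) is not None:
                    serving += 1
            except OSError:
                pass  # server not listening yet, or the probe timed out
        if serving >= needed:
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(poll_s)


# Hypothetical usage for a 3-node ensemble (a majority is 2):
#   ok = wait_for_quorum([("zk1", 2181), ("zk2", 2181), ("zk3", 2181)], needed=2)
#   if not ok: abort the start script instead of launching the HBase master
```

Compared with a fixed sleep, this returns as soon as the ensemble is ready
and fails loudly if it never becomes ready.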

I suppose the reason is that when another zk server is started before the
leader, the zk quorum begins a consensus round to elect a new leader, which
may take several seconds. During this time the ZK quorum is not available,
and the HBase Master fails to start.

So I have several questions:
1. Is there a way for HBase to detect this situation at startup and wait
10 seconds before trying to reconnect?
2. Suppose HBase is in the middle of some work and a zk failure occurs
(some nodes fail, but n/2+1 zk servers remain) and the election protocol
begins. Will HBase be ok, or will it begin a shutdown sequence? My
understanding is that HBase should be ok as long as a zk quorum is
available; it may just need to reconnect, but it should neither shut down
nor become inaccessible.
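
The n/2+1 condition in question 2 is the usual majority rule, with n/2
rounded down. A small sketch of the arithmetic follows; the function names
are illustrative only.

```python
def majority(ensemble_size):
    """Smallest number of servers that forms a quorum: floor(n/2) + 1."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can be down while a quorum is still possible."""
    return ensemble_size - majority(ensemble_size)


for n in (3, 4, 5):
    print(n, majority(n), tolerated_failures(n))
# prints:
# 3 2 1
# 4 3 1
# 5 3 2
```

For the 3-machine quorum described above, 2 servers must be up, so one
failure can be tolerated. A 4-server ensemble tolerates no more failures
than a 3-server one.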


Regards,
Mikael.S

Re: HBase master startup and the "Unable to read additional data from server sessionid 0x0" zk error.

Posted by Jean-Daniel Cryans <jd...@apache.org>.
1. In 0.92 it should recover right away from those errors.

2. It happened to us, it's fine.
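
To illustrate the kind of recovery mentioned in point 1: a client can treat
ConnectionLoss as transient and retry instead of aborting. The following is
a generic sketch of that pattern, not the code HBase 0.92 actually uses.
`ConnectionLoss`, `retry_on_connection_loss` and the retry parameters are
hypothetical stand-ins.

```python
import time


class ConnectionLoss(Exception):
    """Stand-in for a ZooKeeper ConnectionLoss error (hypothetical type)."""


def retry_on_connection_loss(operation, retries=3, first_delay_s=1.0,
                             sleep=time.sleep):
    """Run `operation`, retrying with exponential backoff on ConnectionLoss.

    Makes at most `retries + 1` attempts and re-raises the last error.
    `sleep` is injectable so the backoff can be checked without waiting.
    """
    delay = first_delay_s
    for attempt in range(retries + 1):
        try:
            return operation()
        except ConnectionLoss:
            if attempt == retries:
                raise  # out of attempts: surface the failure to the caller
            sleep(delay)
            delay *= 2  # back off: 1s, 2s, 4s, ...
```

With the defaults shown, an operation that keeps failing is given up on
after about 7 seconds of waiting (1 + 2 + 4), which is in the same range as
the 10-second pause that worked around the startup failure.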

I might add that you don't need to stop zookeeper when stopping HBase.
Our ZK ensembles have hundreds of days of uptime.

J-D
