Posted to user@hbase.apache.org by eluiggi <ed...@gmail.com> on 2014/11/17 20:21:52 UTC

ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master

Hi,

I have an HBase (0.96.1.1-cdh5.0.2) cluster on AWS managed by Cloudera with
4 region servers and 1 ZooKeeper server. The ZooKeeper server is running on
the same node as the HBase master. The problem I'm facing is that 3 of the 4
region servers are down because they can't connect to ZooKeeper. The only
region server that stays up is the one running on the same node as the
master and ZooKeeper. Below is the relevant section of one of the failing
region server logs.

2014-11-14 15:46:59,871 INFO org.apache.zookeeper.ZooKeeper: Initiating
client connection, connectString=ip-10-146-188-157.ec2.internal:2181
sessionTimeout=60000 watcher=regionserver:60020,
quorum=ip-10-146-188-157.ec2.internal:2181, baseZNode=/hbase
2014-11-14 15:46:59,915 INFO
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Process 
identifier=regionserver:60020 connecting to ZooKeeper
ensemble=ip-10-146-188-157.ec2.internal:2181
2014-11-14 15:46:59,920 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181.
Will not attempt to authenticate using SASL (unknown error)
2014-11-14 15:47:00,649 INFO
org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook
thread: Shutdownhook:regionserver60020
2014-11-14 15:47:59,948 INFO org.apache.zookeeper.ClientCnxn: Client session
timed out, have not heard from server in 60041ms for sessionid 0x0, closing
socket connection and attempting reconnect
2014-11-14 15:48:00,067 WARN
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181,
exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/master
2014-11-14 15:48:00,072 INFO org.apache.hadoop.hbase.util.RetryCounter:
Sleeping 1000ms before retry #0...
2014-11-14 15:48:01,067 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181.
Will not attempt to authenticate using SASL (unknown error)
2014-11-14 15:49:00,123 INFO org.apache.zookeeper.ClientCnxn: Client session
timed out, have not heard from server in 60057ms for sessionid 0x0, closing
socket connection and attempting reconnect
2014-11-14 15:49:00,224 WARN
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181,
exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/master
2014-11-14 15:49:00,224 INFO org.apache.hadoop.hbase.util.RetryCounter:
Sleeping 2000ms before retry #1...
2014-11-14 15:49:01,224 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181.
Will not attempt to authenticate using SASL (unknown error)
2014-11-14 15:50:00,259 INFO org.apache.zookeeper.ClientCnxn: Client session
timed out, have not heard from server in 60035ms for sessionid 0x0, closing
socket connection and attempting reconnect
2014-11-14 15:50:00,360 WARN
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181,
exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/master
2014-11-14 15:50:00,360 INFO org.apache.hadoop.hbase.util.RetryCounter:
Sleeping 4000ms before retry #2...
2014-11-14 15:50:01,360 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181.
Will not attempt to authenticate using SASL (unknown error)
2014-11-14 15:51:00,408 INFO org.apache.zookeeper.ClientCnxn: Client session
timed out, have not heard from server in 60048ms for sessionid 0x0, closing
socket connection and attempting reconnect
2014-11-14 15:51:00,509 WARN
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181,
exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/master
2014-11-14 15:51:00,509 INFO org.apache.hadoop.hbase.util.RetryCounter:
Sleeping 8000ms before retry #3...
2014-11-14 15:51:01,509 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181.
Will not attempt to authenticate using SASL (unknown error)
2014-11-14 15:52:00,559 INFO org.apache.zookeeper.ClientCnxn: Client session
timed out, have not heard from server in 60051ms for sessionid 0x0, closing
socket connection and attempting reconnect
2014-11-14 15:52:00,659 WARN
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181,
exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/master
2014-11-14 15:52:00,660 ERROR
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists
failed after 4 attempts
2014-11-14 15:52:00,661 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil:
regionserver:60020, quorum=ip-10-146-188-157.ec2.internal:2181,
baseZNode=/hbase Unable to set watcher on znode /hbase/master
org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/master
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
    at
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
    at
org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:425)
    at
org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:77)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:671)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:644)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:772)
    at java.lang.Thread.run(Thread.java:744)
2014-11-14 15:52:00,687 ERROR
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020,
quorum=ip-10-146-188-157.ec2.internal:2181, baseZNode=/hbase Received
unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/master
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
    at
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
    at
org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:425)
    at
org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:77)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:671)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:644)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:772)
    at java.lang.Thread.run(Thread.java:744)
2014-11-14 15:52:00,692 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
0.0.0.0,60020,1415998019646: Unexpected exception during initialization,
aborting
org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/master
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
    at
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
    at
org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:425)
    at
org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:77)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:671)
    at    
org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:644)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:772)
    at java.lang.Thread.run(Thread.java:744)
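
The sleeps reported in the log above (1000, 2000, 4000 and 8000 ms before retries #0 to #3) look like a doubling backoff. The following is a minimal sketch of such a schedule, assuming a base sleep of 1000 ms that doubles on every retry. The function name and parameters are illustrative only and are not taken from HBase's RetryCounter.

```python
def backoff_schedule(base_ms=1000, retries=4):
    """Return the sleep in milliseconds before each retry, doubling each time.

    Assumption: base_ms and retries are chosen to match the values printed in
    the log; they are not read from any HBase configuration.
    """
    return [base_ms * (2 ** attempt) for attempt in range(retries)]


def format_schedule(schedule):
    """Render the schedule in the same wording the log uses."""
    return [
        f"Sleeping {sleep_ms}ms before retry #{attempt}..."
        for attempt, sleep_ms in enumerate(schedule)
    ]
```

Calling `format_schedule(backoff_schedule())` reproduces the four "Sleeping ..." messages. In the log, each attempt also waits for the 60-second session timeout, which is why about five minutes pass between the first connection attempt (15:46:59) and the abort (15:52:00).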

The portion of hbase-site.xml dealing with ZooKeeper is:
<property>
  <name>zookeeper.znode.parent</name>
  <value>/hbase</value>
</property>
<property>
  <name>zookeeper.znode.rootserver</name>
  <value>root-region-server</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>ip-10-146-188-157.ec2.internal</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
</property>
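
To double-check which ZooKeeper address a client would derive from these settings, the properties can be read back and combined. This is only a sketch: it assumes the `<property>` elements sit inside the usual `<configuration>` root element (not shown in the message), and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: two of the properties from the message, wrapped in a
# <configuration> root so the text parses as a single XML document.
HBASE_SITE = """<configuration>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>ip-10-146-188-157.ec2.internal</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
</property>
</configuration>"""


def read_properties(xml_text):
    """Return a dict of name -> value for every <property> element."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def connect_string(props):
    """Join every quorum host with the client port, comma separated."""
    port = props.get("hbase.zookeeper.property.clientPort", "2181")
    quorum = props.get("hbase.zookeeper.quorum", "")
    hosts = [h.strip() for h in quorum.split(",") if h.strip()]
    return ",".join(f"{host}:{port}" for host in hosts)
```

For the values above, `connect_string(read_properties(HBASE_SITE))` gives the same `connectString` that appears in the region server log.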

The /etc/hosts for each of the nodes is:
127.0.0.1               localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
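
A quick way to see which cluster hostnames have no entry in a hosts file is to parse it and compare against the quorum list. The sketch below assumes the two lines shown above are the complete file; the function names are hypothetical. A missing entry is not necessarily a fault, since DNS can still resolve the name, and the log above does show the quorum host resolving to 10.146.188.157.

```python
# Assumption: this is the full /etc/hosts content as quoted in the message.
HOSTS_FILE = """127.0.0.1               localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
"""


def hosts_entries(text):
    """Map every hostname or alias in hosts-file text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            mapping.setdefault(name, ip)  # first entry for a name wins
    return mapping


def missing_hosts(text, wanted):
    """Return the hostnames from `wanted` that have no hosts-file entry."""
    known = hosts_entries(text)
    return [host for host in wanted if host not in known]
```

With this file, `missing_hosts(HOSTS_FILE, ["ip-10-146-188-157.ec2.internal"])` returns the quorum host, meaning it is resolved by DNS rather than by /etc/hosts.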


Following some other threads I have removed the limit on the number of
connections, increased the timeout value, and explicitly added the hosts to
/etc/hosts on the region server and master nodes. None of these have helped
so far. 

Any help will be greatly appreciated.



--
View this message in context: http://apache-hbase.679495.n3.nabble.com/ConnectionLossException-KeeperErrorCode-ConnectionLoss-for-hbase-master-tp4066034.html
Sent from the HBase User mailing list archive at Nabble.com.

Re: ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master

Posted by eluiggi <ed...@gmail.com>.
Thanks again for your help.

I restarted the 3-node ZooKeeper cluster and I no longer see the exceptions
in the ZooKeeper logs, only warnings.

zookeeper.log
 

Restarting HBase results in the following:
- 1 RegionServer sharing a node with the HMaster and ZooKeeper is up and
  running with no exceptions.
- 1 RegionServer sharing a node with ZooKeeper throws a reportForDuty
  exception (regionserver.log).
- 2 RegionServers (not sharing a node with ZooKeeper or the master) throw a
  ConnectionLoss exception (regionserver.log).





Re: ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master

Posted by Ted Yu <yu...@gmail.com>.
Looks like the exceptions were omitted. 

Mind sending the exceptions again?

Thanks

On Nov 17, 2014, at 12:36 PM, eluiggi <ed...@gmail.com> wrote:

> The zoo.cfg file is the same for all 3 servers.
> 
> 
> After restarting the zookeeper cluster I see exceptions on all of them like
> the following:

Re: ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master

Posted by eluiggi <ed...@gmail.com>.
The zoo.cfg file is the same for all 3 servers.


After restarting the zookeeper cluster I see exceptions on all of them like
the following:
 




Re: ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master

Posted by Ted Yu <yu...@gmail.com>.
Seems to be a zookeeper setup issue.

Mind pastebinning your config (for the 3 ZooKeeper servers)?

Please also check zookeeper server log.

Cheers

On Mon, Nov 17, 2014 at 11:58 AM, eluiggi <ed...@gmail.com> wrote:

> I have tried that, as it is one of the suggestions from Cloudera Manager.
> However, adding the servers results in none of them being able to talk to
> ZooKeeper (not even the one sharing the same node), and HBase is therefore
> completely down. The master throws an exception related to the one
> thrown by the region servers.

Re: ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master

Posted by eluiggi <ed...@gmail.com>.
I have tried that, as it is one of the suggestions from Cloudera Manager.
However, adding the servers results in none of them being able to talk to
ZooKeeper (not even the one sharing the same node), and HBase is therefore
completely down. The master throws an exception related to the one
thrown by the region servers.

2014-11-17 14:50:20,590 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181.
Will not attempt to authenticate using SASL (unknown error)
2014-11-17 14:50:20,591 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to
ip-10-146-188-157.ec2.internal/10.146.188.157:2181, initiating session
2014-11-17 14:50:20,592 INFO org.apache.zookeeper.ClientCnxn: Unable to read
additional data from server sessionid 0x0, likely server has closed socket,
closing socket connection and attempting reconnect
2014-11-17 14:50:22,576 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server ip-10-164-167-107.ec2.internal/10.164.167.107:2181.
Will not attempt to authenticate using SASL (unknown error)
2014-11-17 14:51:00,726 INFO org.apache.zookeeper.ClientCnxn: Client session
timed out, have not heard from server in 40032ms for sessionid 0x0, closing
socket connection and attempting reconnect
2014-11-17 14:51:00,826 WARN
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
ZooKeeper,
quorum=ip-10-146-194-138.ec2.internal:2181,ip-10-146-188-157.ec2.internal:2181,ip-10-164-167-107.ec2.internal:2181,
exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase
2014-11-17 14:51:00,827 ERROR
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper create
failed after 4 attempts
2014-11-17 14:51:00,828 ERROR
org.apache.hadoop.hbase.master.HMasterCommandLine: Master exiting
java.lang.RuntimeException: Failed construction of Master: class
org.apache.hadoop.hbase.master.HMaster
	at
org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:2775)
	at
org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:184)
	at
org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:134)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at
org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
	at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2789)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
	at
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:489)
	at
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:468)
	at
org.apache.hadoop.hbase.zookeeper.ZKUtil.createWithParents(ZKUtil.java:1233)
	at
org.apache.hadoop.hbase.zookeeper.ZKUtil.createWithParents(ZKUtil.java:1211)
	at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.createBaseZNodes(ZooKeeperWatcher.java:174)
	at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:167)
	at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:472)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at
org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:2770)
	... 5 more

One other test that I made was to connect to ZooKeeper from one of the
region server nodes using zkCli.sh. It looks like the connection is
established, but sockets are closed and reopened constantly as the timeout
limit is reached.
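
A lighter-weight check than a full zkCli.sh session is ZooKeeper's four-letter command `ruok`, which a healthy server answers with `imok` before closing the connection. The sketch below sends it over a plain TCP socket. The function name is hypothetical, and it assumes four-letter commands are enabled on the server (newer ZooKeeper releases restrict them through a whitelist setting).

```python
import socket


def zk_four_letter(host, port=2181, command=b"ruok", timeout=5.0):
    """Send a four-letter command to a ZooKeeper server.

    Returns the reply as text, or None if the connection fails, is refused,
    or times out. Assumption: the server closes the socket after replying,
    so reading until end-of-stream collects the whole answer.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(command)
            chunks = []
            while True:
                data = sock.recv(1024)
                if not data:
                    break
                chunks.append(data)
            return b"".join(chunks).decode("ascii", errors="replace")
    except OSError:
        # Covers refused connections, DNS failures and socket timeouts.
        return None


# Example call, to be run from a failing region server node:
#   zk_four_letter("ip-10-146-188-157.ec2.internal")
```

If this returns `None` from the failing nodes but `imok` from the node hosting ZooKeeper, that would point at something between the hosts (for example a firewall or security group rule) rather than at HBase itself.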

Thanks for the help! 




Re: ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master

Posted by Ted Yu <yu...@gmail.com>.
Any chance that you can use three servers in your ZooKeeper quorum?

Cheers
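
For context on the three-server suggestion: a ZooKeeper ensemble stays available only while a majority of its servers are up. The arithmetic can be sketched as follows; the function names are illustrative.

```python
def majority(ensemble_size):
    """Smallest number of servers that forms a majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can be down while a majority is still reachable."""
    return ensemble_size - majority(ensemble_size)
```

A single server tolerates no failures (`tolerated_failures(1)` is 0), while three servers tolerate one (`tolerated_failures(3)` is 1). A larger ensemble only helps, however, if the clients can reach the servers in the first place.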

On Mon, Nov 17, 2014 at 11:21 AM, eluiggi <ed...@gmail.com> wrote:

> Hi,
>
> I have an hbase (0.96.1.1-cdh5.0.2) cluster on AWS managed by Cloudera with
> 4 region servers and 1 zookeeper server. The zookeeper server is running on
> the same node as the hbase master. The problem I'm facing is that 3/4
> region
> servers are down because they can't connect to the zookeeper. The only
> region server that stays up is the one running on the same node as the
> master and zookeeper. Below is the relevant section of one of the failing
> region server logs.