Posted to user@hbase.apache.org by "ac@hsk.hk" <ac...@hsk.hk> on 2012/11/21 12:16:15 UTC

A region server stopped (timeout after trying to connect local Zookeeper)

Hi,


Please help!!

HBase version: 0.94
ZooKeeper: 3.4.4

One of the region servers stopped very quickly after HBase was started:

### Checked jps after the HBase cluster was started and could find the HRegionServer process (*** there is no ZooKeeper instance running on this server ***)
$ jps
24767 Jps
18418 TaskTracker
24678 HRegionServer
18156 DataNode

### Waited a while and checked jps again; the HRegionServer process was gone
$ jps
18418 TaskTracker
24784 Jps
18156 DataNode


### Here are the settings in hbase-site.xml (hbase.cluster.distributed enabled, 3 ZooKeepers set up, timeout = 60000)
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>

<property>
<name>hbase.ZooKeeper.quorum</name>
<value>m146,m145,m143</value>
</property>

<property>
<name>zookeeper.session.timeout</name>
<value>60000</value>
</property>


### hbase-env.sh also tells HBase not to manage a local ZooKeeper instance
export HBASE_MANAGES_ZK=false


### This server can connect to the 3 ZooKeepers:
./zkCli.sh -server m145,m146,m143  	==>  [zk: m145,m146,m143(CONNECTED) 0]


### Checked the HBase log file and found something odd: it seemed to be trying to connect to a local ZooKeeper
2012-11-21 17:30:33,066 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=60000 watcher=regionserver:60020

2012-11-21 17:31:33,254 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master

2012-11-21 17:31:33,254 INFO org.apache.hadoop.hbase.util.RetryCounter: Sleeping 2000ms before retry #1...
2012-11-21 17:32:33,262 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 60010ms for sessionid 0x0, closing socket connection and attempting reconnect

2012-11-21 17:32:33,362 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master

......

2012-11-21 17:34:33,570 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 3 retries
2012-11-21 17:34:33,571 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020 Unable to set watcher on znode /hbase/master
2012-11-21 17:34:33,573 ERROR org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020 Received unexpected KeeperException, re-throwing exception
2012-11-21 17:34:33,573 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ......
2012-11-21 17:34:33,576 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []

2012-11-21 17:34:36,580 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server m144,60020,1353490232962: Initialization of RS failed.  Hence aborting RS.
java.io.IOException: Received the shutdown message while waiting.
	at org.apache.hadoop.hbase.regionserver.HRegionServer.blockAndCheckIfStopped(HRegionServer.java:623)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:598)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:560)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:669)
	at java.lang.Thread.run(Thread.java:662)
2012-11-21 17:34:36,581 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []


Please help!
QUESTION: Is it a bug, or do I need to check something else?

Thanks
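
A minimal sketch, assuming the "Initiating client connection" log format shown above, of checking whether the connectString a RegionServer logged matches the quorum hosts that were meant to be configured. The regular expression and both function names are illustrative assumptions, not part of HBase or ZooKeeper.

```python
import re

# Assumed pattern, based on the "Initiating client connection" log line above.
CONNECT_RE = re.compile(r"connectString=(\S+)")


def logged_quorum_hosts(log_line):
    """Return the set of host names in the connectString, or None if absent."""
    match = CONNECT_RE.search(log_line)
    if match is None:
        return None
    # Each entry looks like host:port; keep only the host part.
    return {entry.split(":")[0] for entry in match.group(1).split(",")}


def quorum_mismatch(log_line, expected_hosts):
    """True when the logged hosts differ from the hosts we meant to configure."""
    hosts = logged_quorum_hosts(log_line)
    return hosts is not None and hosts != set(expected_hosts)


# Log line copied from the post above.
line = ("2012-11-21 17:30:33,066 INFO org.apache.zookeeper.ZooKeeper: "
        "Initiating client connection, connectString=localhost:2181 "
        "sessionTimeout=60000 watcher=regionserver:60020")
print(quorum_mismatch(line, ["m146", "m145", "m143"]))  # prints True
```

A mismatch like this suggests the quorum setting was never picked up, which points at the configuration file rather than at the network.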

Re: A region server stopped (timeout after trying to connect local Zookeeper)

Posted by "ac@hsk.hk" <ac...@hsk.hk>.

Hi JM, 

Thank you!

It is case sensitive indeed: simply changing to a lowercase 'z' brings back ALL RegionServers (and an uppercase 'Z' could bring them all down too). I spent a few hours on other areas and hadn't realized this 'Z' effect.

Thanks again.
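
A rough sketch of how this kind of mistake could be caught before starting the cluster: scan hbase-site.xml for property names that match an expected name only when letter case is ignored. KNOWN_NAMES is a hypothetical short list taken from the properties mentioned in this thread, and the <configuration> wrapper in the sample is an assumption about the full file.

```python
import xml.etree.ElementTree as ET

# Hypothetical short list of expected names; extend it for your own cluster.
KNOWN_NAMES = [
    "hbase.cluster.distributed",
    "hbase.zookeeper.quorum",
    "zookeeper.session.timeout",
]


def miscased_properties(xml_text, known_names=KNOWN_NAMES):
    """Return (found, expected) pairs where only the letter case differs."""
    by_lower = {name.lower(): name for name in known_names}
    problems = []
    for prop in ET.fromstring(xml_text).iter("property"):
        name = (prop.findtext("name") or "").strip()
        expected = by_lower.get(name.lower())
        if expected is not None and expected != name:
            problems.append((name, expected))
    return problems


SAMPLE = """<configuration>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.ZooKeeper.quorum</name>
<value>m146,m145,m143</value>
</property>
</configuration>"""

print(miscased_properties(SAMPLE))
```

Names that are not in the list are ignored, so the check only reports near misses and stays quiet about custom properties.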
 


Re: A region server stopped (timeout after trying to connect local Zookeeper)

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.

I think the MAIN difference is the uppercase letters in the property
name... It seems that hbase-site.xml is case sensitive (which is normal
in the Java and Unix world).

You might want to retry by putting the uppercase back to see if this
was the issue.

JM
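
A small illustration, in Python rather than HBase's own Java, of why a miscased name can silently fall back to the default. It assumes property names are matched exactly, as this thread concluded. The get_setting helper is hypothetical, and the "localhost" default is taken from the property description quoted in this thread.

```python
def get_setting(settings, name, default):
    """Case-sensitive lookup with a fallback, like a plain map lookup."""
    return settings.get(name, default)


# The name as it was written in the failing hbase-site.xml.
site = {"hbase.ZooKeeper.quorum": "m146,m145,m143"}

# The lookup uses the lowercase name, so the entry above is never found.
print(get_setting(site, "hbase.zookeeper.quorum", "localhost"))  # prints localhost
```

No error is raised for the unknown name, which is why the only visible symptom was the localhost:2181 connectString in the log.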


Re: A region server stopped (timeout after trying to connect local Zookeeper)

Posted by "ac@hsk.hk" <ac...@hsk.hk>.

Hi 

I changed the order of the ZooKeepers in the value of hbase.zookeeper.quorum from "m146,m145,m143" to "m143,m145,m146", set the timeout from 60000 to 70000, and commented out the lzo property. It works now; here is the diff:

1) $ diff hbase-site.xml hbase-site.xml.xxx 
41,44c41,43
< 
< <property> 
< <name>hbase.zookeeper.quorum</name> 
< <value>m143,m145,m146</value> 
---
> <property>
> <name>hbase.ZooKeeper.quorum</name>
> <value>m146,m145,m143</value>
49c48,55
< <value>70000</value>
---
> <value>60000</value>
> </property>
> 
> <!--
> /**
> <property>
> <name>hbase.regionserver.codecs</name>
> <value>lzo,gz</value>
50a57,58
> **/
> -->

The above is the only change made today.


2) hbase log:
2012-11-22 07:26:19,431 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=m145:2181,m143:2181,m146:2181 sessionTimeout=70000 watcher=regionserver:6$


I don't know why but it works now. It seems that hbase somehow could not read in hbase-site.xml correctly.


Thanks
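
When several settings change at once, as in the diff above, it can help to compare the two files property by property instead of line by line. The sketch below is only an illustration under assumptions: the function names are made up, the two sample documents are trimmed to the properties shown in the diff, and the <configuration> wrapper is assumed.

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Map each property name to its value, exactly as written in the file."""
    props = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = (prop.findtext("name") or "").strip()
        props[name] = (prop.findtext("value") or "").strip()
    return props


def compare_configs(old_xml, new_xml):
    """Return (renamed_by_case, changed_values) between two config texts."""
    old, new = load_properties(old_xml), load_properties(new_xml)
    old_by_lower = {name.lower(): name for name in old}
    renamed, changed = [], []
    for name, value in new.items():
        old_name = old_by_lower.get(name.lower())
        if old_name is None:
            continue  # property did not exist in the old file
        if old_name != name:
            renamed.append((old_name, name))
        if old[old_name] != value:
            changed.append((name, old[old_name], value))
    return renamed, changed


OLD = """<configuration>
<property><name>hbase.ZooKeeper.quorum</name><value>m146,m145,m143</value></property>
<property><name>zookeeper.session.timeout</name><value>60000</value></property>
</configuration>"""

NEW = """<configuration>
<property><name>hbase.zookeeper.quorum</name><value>m143,m145,m146</value></property>
<property><name>zookeeper.session.timeout</name><value>70000</value></property>
</configuration>"""

renamed, changed = compare_configs(OLD, NEW)
print(renamed)
print(changed)
```

Listing case-only renames separately from value changes makes it easier to notice that the property name itself was edited along with its value.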




On 22 Nov 2012, at 7:51 AM, Jean-Marc Spaggiari wrote:

> Can you do JPS on your master and look at the logs too?
> 
> Another thing, can you try with hbase.zookeeper.quorum instead of
> hbase.ZooKeeper.quorum?
> 
> 2012/11/21, ac@hsk.hk <ac...@hsk.hk>:
>> Hi,
>> 
>> Here are my HBase configuration and test:
>> 
>> 1) {$HBASE_HOME}hbase/conf/hbase-site.xml
>> <property>
>> <name>hbase.ZooKeeper.quorum</name>
>> <value>m146,m145,m143</value>
>> </property>
>> 
>> <property>
>> <name>zookeeper.session.timeout</name>
>> <value>60000</value>
>> </property>
>> 
>> 
>> 2) {$HBASE_HOME}hbase/conf/hbase-env.sh
>> export HBASE_MANAGES_ZK=false
>> 
>> 
>> 3) I used " {$ZK_HOME}/bin/zkCli.sh -server m145,m146,m143"  to test the
>> connection, it worked
>> [zk: m145,m146,m143(CONNECTED) 0]
>> 
>> 
>> 4) from the logs, I found that the connectString was odd, the RegionServer
>> did not use the setting of "hbase.ZooKeeper.quorum" in conf/hbase-site.xml,
>> it seemed that it always used the default and tried to connect
>> "localhost:2181" in the distributed cluster:
>> 
>> 	2012-11-21 17:21:42,299 INFO org.apache.zookeeper.ZooKeeper: Initiating
>> client connection, connectString=localhost:2181 sessionTimeout=60000
>> watcher=regionserver:60020
>> 	...
>> 	2012-11-21 17:21:42,313 INFO org.apache.zookeeper.ClientCnxn: Opening
>> socket connection to server localhost/127.0.0.1:2181. Will not attempt to
>> authenticate using SASL (Unable to locate a login configura$
>> 	...
>> 	2012-11-21 17:21:42,316 WARN org.apache.zookeeper.ClientCnxn: Session 0x0
>> for server null, unexpected error, closing socket connection and attempting
>> reconnect java.net.ConnectException: Connection refused
>> 	...  (remark: it tried above 3 times, then had FATAL error as follows)
>> 
>> 	2012-11-21 17:21:57,846 ERROR
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020
>> Received unexpected KeeperException, re-throwing exception
>> 	...
>> 	2012-11-21 17:21:57,847 FATAL
>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
>> ...
>> 
>> 
>> 
>> Please help.
>> 
>> Thanks
>> 
>> 
>> 
>> 
>> 
>> On 22 Nov 2012, at 1:22 AM, Jean-Marc Spaggiari wrote:
>> 
>>> Hi,
>>> 
>>> What do you have on your HBase configuration? Are you passing the name
>>> of the Quorum servers?
>>> $ cat conf/hbase-site.xml
>>> ......
>>> </property>
>>>   <property>
>>>     <name>hbase.zookeeper.quorum</name>
>>>     <value>cube,latitude,node3</value>
>>>     <description>Comma separated list of servers in the ZooKeeper
>>> Quorum.
>>>     For example,
>>> "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
>>>     By default this is set to localhost for local and pseudo-distributed
>>> modes
>>>     of operation. For a fully-distributed setup, this should be set to a
>>> full
>>>     list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in
>>> hbase-env.sh
>>>     this is the list of servers which we will start/stop ZooKeeper on.
>>>     </description>
>>>   </property>
>>> .....
>>> 
>>> 2012/11/21, ac@hsk.hk <ac...@hsk.hk>:
>>>> Hi,
>>>> 
>>>> 
>>>> I have the following line in /etc/hosts in all servers, should I keep it
>>>> or
>>>> comment it out or ...?
>>>> 
>>>> 127.0.0.1       localhost
>>>> 
>>>> Please help.
>>>> 
>>>> Thanks
>>>> 
>>>> 
>>>> 
>> 
>>

Re: A region server stopped (timeout after trying to connect local Zookeeper)

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.

Can you run jps on your master and look at its logs too?

Another thing: can you try hbase.zookeeper.quorum (all lowercase) instead of
hbase.ZooKeeper.quorum?
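
A minimal sketch of what that change would look like, assuming the same three quorum hosts (m146, m145, m143) as in your hbase-site.xml. Hadoop-style configuration keys are matched case-sensitively, so a key spelled hbase.ZooKeeper.quorum is most likely ignored and the default (localhost) used instead, which would be consistent with connectString=localhost:2181 in your log:

```xml
<!-- conf/hbase-site.xml (sketch only; host names are copied from the original post) -->
<property>
  <!-- The property name is case-sensitive: "zookeeper" is all lowercase. -->
  <name>hbase.zookeeper.quorum</name>
  <value>m146,m145,m143</value>
</property>
```

The corrected file would need to be in place on every node, and the region server restarted, before the new quorum setting can take effect.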


Re: A region server stopped (timeout after trying to connect local Zookeeper)

Posted by "ac@hsk.hk" <ac...@hsk.hk>.

Hi,

Here are my HBase configuration and test:

1) {$HBASE_HOME}hbase/conf/hbase-site.xml
<property>
<name>hbase.ZooKeeper.quorum</name>
<value>m146,m145,m143</value>
</property>

<property>
<name>zookeeper.session.timeout</name>
<value>60000</value>
</property>


2) {$HBASE_HOME}hbase/conf/hbase-env.sh
export HBASE_MANAGES_ZK=false


3) I used " {$ZK_HOME}/bin/zkCli.sh -server m145,m146,m143"  to test the connection, it worked
[zk: m145,m146,m143(CONNECTED) 0]


4) In the logs, the connectString looked odd: the RegionServer did not use the "hbase.ZooKeeper.quorum" setting from conf/hbase-site.xml. It seemed to always fall back to the default and try to connect to "localhost:2181", even in the distributed cluster:

	2012-11-21 17:21:42,299 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=60000 watcher=regionserver:60020
	...
	2012-11-21 17:21:42,313 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (Unable to locate a login configura$
	...
	2012-11-21 17:21:42,316 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused
	...  (remark: it retried the above 3 times, then hit the FATAL error as follows)
       
	2012-11-21 17:21:57,846 ERROR org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020 Received unexpected KeeperException, re-throwing exception 
	...
	2012-11-21 17:21:57,847 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ...



Please help.
 
Thanks






Re: A region server stopped (timeout after trying to connect local Zookeeper)

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.

Hi,

What do you have in your HBase configuration? Are you passing the names
of the quorum servers?
$ cat conf/hbase-site.xml
......
  </property>
    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>cube,latitude,node3</value>
      <description>Comma separated list of servers in the ZooKeeper Quorum.
      For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
      By default this is set to localhost for local and pseudo-distributed modes
      of operation. For a fully-distributed setup, this should be set to a full
      list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh
      this is the list of servers which we will start/stop ZooKeeper on.
      </description>
    </property>
.....


Re: A region server stopped (timeout after trying to connect local Zookeeper)

Posted by "ac@hsk.hk" <ac...@hsk.hk>.

Hi, 


I have the following line in /etc/hosts on all servers. Should I keep it, comment it out, or ...?

127.0.0.1       localhost

Please help.

Thanks


