Posted to user@hbase.apache.org by Stuti Awasthi <st...@hcl.com> on 2011/09/19 07:15:02 UTC

Unexpected shutdown of Zookeeper

Hi All,

I was running a 2-node cluster with 1 ZooKeeper node and 2 region server nodes. I had also set up cluster replication with another single-node HBase-Hadoop cluster. Replication was successful, and I left the cluster running over the weekend with no data to replicate.

Today I can see that on the master cluster ZooKeeper is dead. The region server running on the slave machine is also dead. The cluster I was replicating to is running fine.

My queries are :

1.       Can ZooKeeper die because there is no replication traffic over the network for a long time?

2.       How do I handle these situations? Would running 3-4 ZooKeeper nodes help?

3.       If I run multiple ZooKeeper nodes, will the cluster keep running normally even if 2-3 ZooKeeper nodes are dead?

4.       In my case, out of 2 region servers, 1 is dead but 1 is still working. If my ZooKeeper node were running, would I be able to access HBase properly?
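
On questions 2 and 3: ZooKeeper serves requests only while a strict majority (a quorum) of the ensemble is alive, so a 3-node ensemble tolerates 1 dead server, a 5-node ensemble tolerates 2, and a single node tolerates none. A minimal sketch of that arithmetic (illustrative only, not HBase or ZooKeeper API code):

```python
def zk_failure_tolerance(ensemble_size: int) -> int:
    """How many servers a ZooKeeper ensemble of the given size can lose
    while still holding a strict majority (quorum) of live servers."""
    quorum = ensemble_size // 2 + 1
    return ensemble_size - quorum

# 1 server tolerates 0 failures; 3 tolerate 1; 4 still only 1; 5 tolerate 2.
# Note that an even-sized ensemble buys nothing over the next-smaller odd one.
```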

Logs :
hbase-root-zookeeper-master.log :

2011-09-19 10:07:55,753 INFO org.apache.zookeeper.server.NIOServerCnxn: Accepted socket connection from /10.33.64.235:44706
2011-09-19 10:07:55,758 INFO org.apache.zookeeper.server.NIOServerCnxn: Client attempting to establish new session at /10.33.64.235:44706
2011-09-19 10:07:55,761 INFO org.apache.zookeeper.server.NIOServerCnxn: Established session 0x13271b6c4f1000c with negotiated timeout 180000 for client /10.33.64.235:44706
2011-09-19 10:10:48,318 WARN org.apache.zookeeper.server.NIOServerCnxn: EndOfStreamException: Unable to read additional data from client sessionid 0x13271b6c4f1000c, likely client has closed socket
2011-09-19 10:10:48,319 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.33.64.235:44706 which had sessionid 0x13271b6c4f1000c
2011-09-19 10:12:57,002 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x13271b6c4f1000c, timeout of 180000ms exceeded
2011-09-19 10:12:57,002 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x13271b6c4f1000c
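
(For context on the 180000 ms figure in the log above: that is the session timeout the client negotiated with ZooKeeper. In HBase it is normally driven from hbase-site.xml; a sketch, assuming the 0.90-era property name:)

```xml
<!-- hbase-site.xml: the session timeout HBase requests from ZooKeeper;
     180000 ms (3 minutes) matches the negotiated value in the log above -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>180000</value>
</property>
```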

hbase-root-regionserver-slave.log:

2011-09-16 16:00:50,354 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server listener on 60020: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0
java.io.IOException: Connection reset by peer
       at sun.nio.ch.FileDispatcher.read0(Native Method)
       at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
       at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
       at sun.nio.ch.IOUtil.read(IOUtil.java:175)
       at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
2011-09-16 16:00:51,058 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication slave%3A60020.1316168146136 at 663246
2011-09-16 16:00:51,064 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: currentNbOperations:5003 and seenEntries:0 and size: 0
2011-09-16 16:00:51,064 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Going to report log #slave%3A60020.1316168146136 for position 663246 in hdfs://master:54310/hbase/.logs/slave,60020,1316168145427/slave%3A60020.1316168146136
2011-09-16 16:00:51,066 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Removing 0 logs in the list: []
2011-09-16 16:00:51,066 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Nothing to replicate, sleeping 1000 times 2
2011-09-16 16:00:53,068 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication slave%3A60020.1316168146136 at 663246
..................................
2011-09-16 17:14:49,440 WARN org.apache.zookeeper.ClientCnxn: Session 0x13271b5395c0007 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection timed out
       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
       at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
       at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2011-09-16 17:14:51,039 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: /hbase/rs/master,60020,1316167798366 znode expired, trying to lock it
2011-09-16 17:14:51,088 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server slave1/172.28.96.239:2181
2011-09-16 17:14:51,089 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to slave1/172.28.96.239:2181, initiating session
2011-09-16 17:14:51,093 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x13271b5395c0007 has expired, closing socket connection
2011-09-16 17:14:51,094 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=slave,60020,1316168145427, load=(requests=0, regions=6, usedHeap=29, maxHeap=996): connection to cluster: 1-0x13271b5395c0007 connection to cluster: 1-0x13271b5395c0007 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
       at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:343)
       at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:261)
       at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
       at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
2011-09-16 17:14:51,094 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: requests=0, regions=6, stores=6, storefiles=5, storefileIndexSize=0, memstoreSize=0, compactionQueueSize=0, flushQueueSize=0, usedHeap=29, maxHeap=996, blockCacheSize=982352, blockCacheFree=208064384, blockCacheCount=2, blockCacheHitCount=31, blockCacheMissCount=2, blockCacheEvictedCount=0, blockCacheHitRatio=93, blockCacheHitCachingRatio=93
2011-09-16 17:14:51,094 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: connection to cluster: 1-0x13271b5395c0007 connection to cluster: 1-0x13271b5395c0007 received expired from ZooKeeper, aborting
2011-09-16 17:14:51,094 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2011-09-16 17:14:51,114 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Source exiting 1
2011-09-16 17:14:52,476 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 2 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 0 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 2 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 9 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 8 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 6 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 4 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 5 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 7 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 6 on 60020: exiting
2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 8 on 60020: exiting
2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 60020: exiting
2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 1 on 60020: exiting
2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 3 on 60020: exiting
2011-09-16 17:14:52,478 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server listener on 60020
2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 4 on 60020: exiting
2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 5 on 60020: exiting
2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder
2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 7 on 60020: exiting
2011-09-16 17:14:52,481 INFO org.mortbay.log: Stopped SelectChannelConnector@0.0.0.0:60030
2011-09-16 17:14:52,585 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: regionserver60020.compactor exiting
2011-09-16 17:14:52,585 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: regionserver60020.cacheFlusher exiting
2011-09-16 17:14:52,586 INFO org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
2011-09-16 17:14:52,586 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker: regionserver60020.majorCompactionChecker exiting
2011-09-16 17:14:52,587 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of backup,,1315992791196.e5ff1d9eb66e1157d0ca8bfaaf493480.
2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: regionserver60020.logSyncer interrupted while waiting for sync requests
2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing backup,,1315992791196.e5ff1d9eb66e1157d0ca8bfaaf493480.: disabling compactions & flushes
2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of testArchiveBackup,,1315915407547.e05ec3159a022f28aa92e1a01ca50fec.
2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of replication,,1316166014290.5937efd76493915556d3641aa9c0b6df.
2011-09-16 17:14:52,589 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: regionserver60020.logSyncer exiting
2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of -ROOT-,,0.70236052
2011-09-16 17:14:52,589 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog writer in hdfs://master:54310/hbase/.logs/slave,60020,1316168145427
2011-09-16 17:14:52,589 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing replication,,1316166014290.5937efd76493915556d3641aa9c0b6df.: disabling compactions & flushes
............................
2011-09-16 17:14:52,602 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2011-09-16 17:14:52,602 INFO org.apache.zookeeper.ZooKeeper: Session: 0x13271b6c4f10003 closed
2011-09-16 17:14:52,605 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2011-09-16 17:14:52,605 INFO org.apache.zookeeper.ZooKeeper: Session: 0x13271b6c4f10005 closed
2011-09-16 17:14:52,605 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Closing source 1 because: Region server is closing
2011-09-16 17:14:52,605 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting
2011-09-16 17:14:53,040 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Not transferring queue since we are shutting down
2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-14,5,main]
2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown hook
2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook thread.
2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.

Please suggest.

Thanks

________________________________
::DISCLAIMER::
-----------------------------------------------------------------------------------------------------------------------

The contents of this e-mail and any attachment(s) are confidential and intended for the named recipient(s) only.
It shall not attach any liability on the originator or HCL or its affiliates. Any views or opinions presented in
this email are solely those of the author and may not necessarily reflect the opinions of HCL or its affiliates.
Any form of reproduction, dissemination, copying, disclosure, modification, distribution and / or publication of
this message without the prior written consent of the author of this e-mail is strictly prohibited. If you have
received this email in error please delete it and notify the sender immediately. Before opening any mail and
attachments please check them for viruses and defect.

-----------------------------------------------------------------------------------------------------------------------

RE: Unexpected shutdown of Zookeeper

Posted by Stuti Awasthi <st...@hcl.com>.
Thanks Lars,
I will also try to test this on my end. I will update if I face further issues.

-----Original Message-----
From: lars hofhansl [mailto:lhofhansl@yahoo.com]
Sent: Tuesday, September 20, 2011 11:05 AM
To: user@hbase.apache.org
Subject: Re: Unexpected shutdown of Zookeeper

I think the fix is mostly good.
Chris is working on a test. This will be in 0.92, but can probably be backported.


-- Lars


----- Original Message -----
From: Stuti Awasthi <st...@hcl.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>
Cc:
Sent: Monday, September 19, 2011 9:25 PM
Subject: RE: Unexpected shutdown of Zookeeper

Hi JD,

Thanks for your response. I was planning to use replication for my production/development servers, but it seems work on this issue is still ongoing. Which release is the fix for this bug planned for? Currently I'm using HBase 0.90.3.

Some of my queries are :
1.       Will running 3-4 ZooKeeper nodes help in case 1-2 ZooKeeper nodes fail? Will the cluster keep running, or will it go down?
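
For anyone with the same question: availability depends on keeping a majority of the ensemble alive, and every server must be listed in every server's zoo.cfg. A sketch of a 3-server ensemble (the zk1-zk3 hostnames are placeholders, not from this cluster):

```
# zoo.cfg sketch for a 3-server ensemble (tolerates 1 dead server).
# The same server.N lines must appear on all three machines, and each
# machine needs a myid file under dataDir containing its own N.
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
```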

Thanks
-Stuti

-----Original Message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
Sent: Monday, September 19, 2011 11:04 PM
To: user@hbase.apache.org
Subject: Re: Unexpected shutdown of Zookeeper

I think this is just:

https://issues.apache.org/jira/browse/HBASE-3130

J-D

On Sun, Sep 18, 2011 at 10:15 PM, Stuti Awasthi <st...@hcl.com> wrote:
> [snip]


Re: Unexpected shutdown of Zookeeper

Posted by lars hofhansl <lh...@yahoo.com>.
I think the fix is mostly good.
Chris is working on a test. This will be in 0.92, but can probably be backported.


-- Lars


----- Original Message -----
From: Stuti Awasthi <st...@hcl.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>
Cc: 
Sent: Monday, September 19, 2011 9:25 PM
Subject: RE: Unexpected shutdown of Zookeeper

Hi JD,

Thanks for your response. I was planning to use replication for my production/development servers but it seems like work is still going on this issue. I want to know that which version release is planned for this bug. Currently Im using Hbase 0.90.3

Some of my queries are :
1.       Will running 3-4 zookeeper node helps in case of failure of 1-2 zookeeper node? Will the cluster keeps on running or it will be down ?

Thanks
-Stuti

-----Original Message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
Sent: Monday, September 19, 2011 11:04 PM
To: user@hbase.apache.org
Subject: Re: Unexpected shutdown of Zookeeper

I think this is just:

https://issues.apache.org/jira/browse/HBASE-3130

J-D

On Sun, Sep 18, 2011 at 10:15 PM, Stuti Awasthi <st...@hcl.com> wrote:
> Hi All,
>
> I was running a 2 node cluster with 1 zookeeper node and 2 region server node. I had also setup cluster replication with another single node Hbase-Hadoop cluster. Replication was successful and I left the cluster running over the weekend with no data for replication.
>
> Today I can see that in  Master cluster Zookeeper is dead. 1 region server which was running on slave machine is also dead. The cluster to which I was replicating is running fine.
>
> My queries are :
>
> 1.       Can zookeeper be dead because there is no replication over the network for long time ?
>
> 2.       How to cater to these situations ? Running 3-4 zookeeper node will help ?
>
> 3.       If I run multiple Zookeeper node, then will the cluster keep on running normally even if 2-3 zookeeper are dead?
>
> 4.       In my case, out of 2 region server, 1 is dead but 1 is still working, if my zookeeper node was running, will I able to access hbase properly.
>
> Logs:
> hbase-root-zookeeper-master.log:
>
> 2011-09-19 10:07:55,753 INFO org.apache.zookeeper.server.NIOServerCnxn: Accepted socket connection from /10.33.64.235:44706
> 2011-09-19 10:07:55,758 INFO org.apache.zookeeper.server.NIOServerCnxn: Client attempting to establish new session at /10.33.64.235:44706
> 2011-09-19 10:07:55,761 INFO org.apache.zookeeper.server.NIOServerCnxn: Established session 0x13271b6c4f1000c with negotiated timeout 180000 for client /10.33.64.235:44706
> 2011-09-19 10:10:48,318 WARN org.apache.zookeeper.server.NIOServerCnxn: EndOfStreamException: Unable to read additional data from client sessionid 0x13271b6c4f1000c, likely client has closed socket
> 2011-09-19 10:10:48,319 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.33.64.235:44706 which had sessionid 0x13271b6c4f1000c
> 2011-09-19 10:12:57,002 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x13271b6c4f1000c, timeout of 180000ms exceeded
> 2011-09-19 10:12:57,002 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x13271b6c4f1000c
>
> hbase-root-regionserver-slave.log:
>
> 2011-09-16 16:00:50,354 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server listener on 60020: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0
> java.io.IOException: Connection reset by peer
>       at sun.nio.ch.FileDispatcher.read0(Native Method)
>       at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>       at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
>       at sun.nio.ch.IOUtil.read(IOUtil.java:175)
>       at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
> 2011-09-16 16:00:51,058 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication slave%3A60020.1316168146136 at 663246
> 2011-09-16 16:00:51,064 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: currentNbOperations:5003 and seenEntries:0 and size: 0
> 2011-09-16 16:00:51,064 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Going to report log #slave%3A60020.1316168146136 for position 663246 in hdfs://master:54310/hbase/.logs/slave,60020,1316168145427/slave%3A60020.1316168146136
> 2011-09-16 16:00:51,066 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Removing 0 logs in the list: []
> 2011-09-16 16:00:51,066 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Nothing to replicate, sleeping 1000 times 2
> 2011-09-16 16:00:53,068 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication slave%3A60020.1316168146136 at 663246
> ..................................
> 2011-09-16 17:14:49,440 WARN org.apache.zookeeper.ClientCnxn: Session 0x13271b5395c0007 for server null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection timed out
>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>       at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
>       at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
> 2011-09-16 17:14:51,039 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: /hbase/rs/master,60020,1316167798366 znode expired, trying to lock it
> 2011-09-16 17:14:51,088 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server slave1/172.28.96.239:2181
> 2011-09-16 17:14:51,089 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to slave1/172.28.96.239:2181, initiating session
> 2011-09-16 17:14:51,093 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x13271b5395c0007 has expired, closing socket connection
> 2011-09-16 17:14:51,094 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=slave,60020,1316168145427, load=(requests=0, regions=6, usedHeap=29, maxHeap=996): connection to cluster: 1-0x13271b5395c0007 connection to cluster: 1-0x13271b5395c0007 received expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>       at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:343)
>       at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:261)
>       at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
>       at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> 2011-09-16 17:14:51,094 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: requests=0, regions=6, stores=6, storefiles=5, storefileIndexSize=0, memstoreSize=0, compactionQueueSize=0, flushQueueSize=0, usedHeap=29, maxHeap=996, blockCacheSize=982352, blockCacheFree=208064384, blockCacheCount=2, blockCacheHitCount=31, blockCacheMissCount=2, blockCacheEvictedCount=0, blockCacheHitRatio=93, blockCacheHitCachingRatio=93
> 2011-09-16 17:14:51,094 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: connection to cluster: 1-0x13271b5395c0007 connection to cluster: 1-0x13271b5395c0007 received expired from ZooKeeper, aborting
> 2011-09-16 17:14:51,094 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2011-09-16 17:14:51,114 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Source exiting 1
> 2011-09-16 17:14:52,476 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 2 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 0 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 2 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 9 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 8 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 6 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 4 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 5 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 7 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 6 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 8 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 1 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 3 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server listener on 60020
> 2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 4 on 60020: exiting
> 2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 5 on 60020: exiting
> 2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder
> 2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 7 on 60020: exiting
> 2011-09-16 17:14:52,481 INFO org.mortbay.log: Stopped SelectChannelConnector@0.0.0.0:60030
> 2011-09-16 17:14:52,585 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: regionserver60020.compactor exiting
> 2011-09-16 17:14:52,585 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: regionserver60020.cacheFlusher exiting
> 2011-09-16 17:14:52,586 INFO org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
> 2011-09-16 17:14:52,586 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker: regionserver60020.majorCompactionChecker exiting
> 2011-09-16 17:14:52,587 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of backup,,1315992791196.e5ff1d9eb66e1157d0ca8bfaaf493480.
> 2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: regionserver60020.logSyncer interrupted while waiting for sync requests
> 2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing backup,,1315992791196.e5ff1d9eb66e1157d0ca8bfaaf493480.: disabling compactions & flushes
> 2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of testArchiveBackup,,1315915407547.e05ec3159a022f28aa92e1a01ca50fec.
> 2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of replication,,1316166014290.5937efd76493915556d3641aa9c0b6df.
> 2011-09-16 17:14:52,589 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: regionserver60020.logSyncer exiting
> 2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of -ROOT-,,0.70236052
> 2011-09-16 17:14:52,589 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog writer in hdfs://master:54310/hbase/.logs/slave,60020,1316168145427
> 2011-09-16 17:14:52,589 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing replication,,1316166014290.5937efd76493915556d3641aa9c0b6df.: disabling compactions & flushes
> ............................
> 2011-09-16 17:14:52,602 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2011-09-16 17:14:52,602 INFO org.apache.zookeeper.ZooKeeper: Session: 0x13271b6c4f10003 closed
> 2011-09-16 17:14:52,605 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2011-09-16 17:14:52,605 INFO org.apache.zookeeper.ZooKeeper: Session: 0x13271b6c4f10005 closed
> 2011-09-16 17:14:52,605 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Closing source 1 because: Region server is closing
> 2011-09-16 17:14:52,605 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting
> 2011-09-16 17:14:53,040 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Not transferring queue since we are shutting down
> 2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-14,5,main]
> 2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown hook
> 2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook thread.
> 2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.
>
> Please suggest.
>
> Thanks
>
> ________________________________
> ::DISCLAIMER::
> ----------------------------------------------------------------------
> -------------------------------------------------
>
> The contents of this e-mail and any attachment(s) are confidential and intended for the named recipient(s) only.
> It shall not attach any liability on the originator or HCL or its
> affiliates. Any views or opinions presented in this email are solely those of the author and may not necessarily reflect the opinions of HCL or its affiliates.
> Any form of reproduction, dissemination, copying, disclosure,
> modification, distribution and / or publication of this message
> without the prior written consent of the author of this e-mail is
> strictly prohibited. If you have received this email in error please delete it and notify the sender immediately. Before opening any mail and attachments please check them for viruses and defect.
>
> ----------------------------------------------------------------------
> -------------------------------------------------
>


RE: Unexpected shutdown of Zookeeper

Posted by Stuti Awasthi <st...@hcl.com>.
Hi JD,

Thanks for your response. I was planning to use replication for my production/development servers, but it seems work on this issue is still ongoing. Could you tell me which release the fix is planned for? I am currently using HBase 0.90.3.

Some of my queries are:
1. Will running 3-4 ZooKeeper nodes help if 1-2 of them fail? Will the cluster keep running, or will it go down?

Thanks
-Stuti
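[The quorum question above comes down to simple majority arithmetic. This is general ZooKeeper behavior, not something specific to this thread's cluster: an ensemble of N servers keeps serving only while a strict majority is up, so it tolerates floor((N - 1) / 2) server failures. A minimal sketch:]

```python
# General ZooKeeper quorum arithmetic (not specific to this cluster):
# an ensemble stays available only while a strict majority of its
# servers is alive, so N servers tolerate floor((N - 1) / 2) failures.

def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can die while the ensemble keeps serving."""
    return (ensemble_size - 1) // 2

if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5, 7):
        print(f"{n} node(s): tolerates {tolerated_failures(n)} failure(s)")
```

[So 3 or 4 nodes both tolerate only a single failure; surviving 2 dead servers needs a 5-node ensemble, and 3 dead needs 7. An even ensemble size adds no fault tolerance over the odd size below it, which is why odd sizes are the usual recommendation.]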

-----Original Message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
Sent: Monday, September 19, 2011 11:04 PM
To: user@hbase.apache.org
Subject: Re: Unexpected shutdown of Zookeeper

I think this is just:

https://issues.apache.org/jira/browse/HBASE-3130

J-D
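[On running more ZooKeeper nodes (question 2 in the original mail): the usual mitigation is to point HBase at an odd-sized ensemble. A hedged sketch of the relevant hbase-site.xml entries, assuming an ensemble of three servers with placeholder hostnames zk1-zk3:]

```xml
<!-- Hypothetical 3-node ensemble; zk1..zk3 are placeholder hostnames. -->
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>
<property>
  <!-- Session timeout in ms; 180000 matches the negotiated timeout in
       the zookeeper log above. -->
  <name>zookeeper.session.timeout</name>
  <value>180000</value>
</property>
```

[Lowering the session timeout makes dead clients expire faster, at the cost of more spurious expirations during long GC pauses.]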


Re: Unexpected shutdown of Zookeeper

Posted by Jean-Daniel Cryans <jd...@apache.org>.
I think this is just:

https://issues.apache.org/jira/browse/HBASE-3130

J-D

> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 4 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 5 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 7 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 6 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 8 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 1 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 3 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server listener on 60020
> 2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 4 on 60020: exiting
> 2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 5 on 60020: exiting
> 2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder
> 2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 7 on 60020: exiting
> 2011-09-16 17:14:52,481 INFO org.mortbay.log: Stopped SelectChannelConnector@0.0.0.0:60030
> 2011-09-16 17:14:52,585 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: regionserver60020.compactor exiting
> 2011-09-16 17:14:52,585 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: regionserver60020.cacheFlusher exiting
> 2011-09-16 17:14:52,586 INFO org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
> 2011-09-16 17:14:52,586 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker: regionserver60020.majorCompactionChecker exiting
> 2011-09-16 17:14:52,587 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of backup,,1315992791196.e5ff1d9eb66e1157d0ca8bfaaf493480.
> 2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: regionserver60020.logSyncer interrupted while waiting for sync requests
> 2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing backup,,1315992791196.e5ff1d9eb66e1157d0ca8bfaaf493480.: disabling compactions & flushes
> 2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of testArchiveBackup,,1315915407547.e05ec3159a022f28aa92e1a01ca50fec.
> 2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of replication,,1316166014290.5937efd76493915556d3641aa9c0b6df.
> 2011-09-16 17:14:52,589 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: regionserver60020.logSyncer exiting
> 2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of -ROOT-,,0.70236052
> 2011-09-16 17:14:52,589 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog writer in hdfs://master:54310/hbase/.logs/slave,60020,1316168145427
> 2011-09-16 17:14:52,589 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing replication,,1316166014290.5937efd76493915556d3641aa9c0b6df.: disabling compactions & flushes
> ............................
> 2011-09-16 17:14:52,602 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2011-09-16 17:14:52,602 INFO org.apache.zookeeper.ZooKeeper: Session: 0x13271b6c4f10003 closed
> 2011-09-16 17:14:52,605 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2011-09-16 17:14:52,605 INFO org.apache.zookeeper.ZooKeeper: Session: 0x13271b6c4f10005 closed
> 2011-09-16 17:14:52,605 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Closing source 1 because: Region server is closing
> 2011-09-16 17:14:52,605 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting
> 2011-09-16 17:14:53,040 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Not transferring queue since we are shutting down
> 2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-14,5,main]
> 2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown hook
> 2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook thread.
> 2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.
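
[Editor's note on questions 2 and 3 above: a single ZooKeeper node is a single point of failure. An ensemble of N nodes stays available as long as a majority is up, so it tolerates floor((N-1)/2) failures: 3 nodes tolerate 1, 5 tolerate 2. A cluster cannot keep running if 2 of 3 ZooKeeper nodes are dead. A minimal sketch of the ensemble entries, assuming hypothetical hostnames zk1, zk2, zk3 and the default peer ports:]

```shell
# Append the ensemble member list to conf/zoo.cfg on EACH ZooKeeper host.
# Hostnames zk1..zk3 are placeholders; 2888 is the peer port, 3888 the
# leader-election port (ZooKeeper defaults).
cat >> conf/zoo.cfg <<'EOF'
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
EOF

# Each host also needs a matching id in its dataDir, e.g. on zk1:
#   echo 1 > /var/zookeeper/myid
# and HBase must point at the whole ensemble via hbase.zookeeper.quorum
# (comma-separated: zk1,zk2,zk3) in hbase-site.xml.
```

This is a configuration sketch, not a drop-in fix for the session expiry shown in the log; long GC pauses or network drops can still expire a client session even against a healthy ensemble.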
>
> Please suggest.
>
> Thanks
>