Posted to user@hbase.apache.org by Eran Kutner <er...@gigya.com> on 2011/03/22 18:39:10 UTC

Region server crashes when using replication

Hi,
I'm trying to use replication between two HBase clusters and I'm
encountering all kinds of crashes and weird behavior.

First, it seems that starting a region server when the peer ZKs are
not available will cause the server to fail to start:

2011-03-22 08:31:56,647 INFO
org.apache.hadoop.hbase.replication.ReplicationZookeeper: Replication
is now started
2011-03-22 08:31:56,668 WARN
org.apache.hadoop.hbase.zookeeper.ZKConfig:
java.net.UnknownHostException: haddop2-zk3
        at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:850)
        at java.net.InetAddress.getAddressFromNameService(InetAddress.java:1201)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1154)
        at java.net.InetAddress.getAllByName(InetAddress.java:1084)
        at java.net.InetAddress.getAllByName(InetAddress.java:1020)
        at java.net.InetAddress.getByName(InetAddress.java:970)
        at org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:206)
        at org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:250)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:113)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:288)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:253)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:182)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:142)
        at org.apache.hadoop.hbase.replication.regionserver.Replication.<init>(Replication.java:75)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1092)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:875)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1472)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:563)
        at java.lang.Thread.run(Thread.java:662)

2011-03-22 08:31:56,669 WARN
org.apache.hadoop.hbase.zookeeper.ZKConfig:
java.net.UnknownHostException: haddop2-zk2
        at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:850)
        at java.net.InetAddress.getAddressFromNameService(InetAddress.java:1201)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1154)
        at java.net.InetAddress.getAllByName(InetAddress.java:1084)
        at java.net.InetAddress.getAllByName(InetAddress.java:1020)
        at java.net.InetAddress.getByName(InetAddress.java:970)
        at org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:206)
        at org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:250)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:113)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:288)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:253)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:182)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:142)
        at org.apache.hadoop.hbase.replication.regionserver.Replication.<init>(Replication.java:75)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1092)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:875)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1472)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:563)
        at java.lang.Thread.run(Thread.java:662)

2011-03-22 08:31:56,669 INFO org.apache.zookeeper.ZooKeeper:
Initiating client connection,
connectString=haddop2-zk3:2181,haddop2-zk2:2181,hadoop2-zk1:2181
sessionTimeout=180000 watcher=connection to cluster: 1
2011-03-22 08:31:56,670 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Failed
initialization
2011-03-22 08:31:56,670 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: Failed init
java.net.UnknownHostException: haddop2-zk3
        at java.net.InetAddress.getAllByName0(InetAddress.java:1158)
        at java.net.InetAddress.getAllByName(InetAddress.java:1084)
        at java.net.InetAddress.getAllByName(InetAddress.java:1020)
        at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:386)
        at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:331)
        at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:377)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.connect(ZKUtil.java:97)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:119)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:288)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:253)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:182)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:142)
        at org.apache.hadoop.hbase.replication.regionserver.Replication.<init>(Replication.java:75)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1092)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:875)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1472)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:563)
        at java.lang.Thread.run(Thread.java:662)
2011-03-22 08:31:56,675 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
server serverName=hadoop1-s05.farm-ny.gigya.com,60020,1300797113247,
load=(requests=0, regions=0, usedHeap=24, maxHeap=987): Unhandled
exception: haddop2-zk3
java.net.UnknownHostException: haddop2-zk3
        at java.net.InetAddress.getAllByName0(InetAddress.java:1158)
        at java.net.InetAddress.getAllByName(InetAddress.java:1084)
        at java.net.InetAddress.getAllByName(InetAddress.java:1020)
        at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:386)
        at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:331)
        at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:377)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.connect(ZKUtil.java:97)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:119)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:288)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:253)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:182)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:142)
        at org.apache.hadoop.hbase.replication.regionserver.Replication.<init>(Replication.java:75)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1092)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:875)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1472)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:563)
        at java.lang.Thread.run(Thread.java:662)
2011-03-22 08:31:56,675 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Unhandled
exception: haddop2-zk3
2011-03-22 08:31:56,675 INFO org.apache.hadoop.ipc.HBaseServer:
Stopping server on 60020
2011-03-22 08:31:56,679 INFO
org.apache.hadoop.hbase.regionserver.StoreFile: Allocating
LruBlockCache with maximum size 197.5m
2011-03-22 08:31:56,683 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server
at: hadoop1-s05.farm-ny.gigya.com,60020,1300797113247
2011-03-22 08:31:56,683 DEBUG
org.apache.hadoop.hbase.catalog.CatalogTracker: Stopping catalog
tracker org.apache.hadoop.hbase.catalog.CatalogTracker@508aeb74
2011-03-22 08:31:56,684 INFO
org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closing
leases
2011-03-22 08:31:56,684 INFO
org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closed
leases
2011-03-22 08:31:56,684 INFO
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Closed zookeeper sessionid=0x22e669588a20058
2011-03-22 08:31:56,692 INFO org.apache.zookeeper.ZooKeeper: Session:
0x22e669588a20058 closed
2011-03-22 08:31:56,692 INFO org.apache.zookeeper.ClientCnxn:
EventThread shut down
2011-03-22 08:31:56,700 INFO org.apache.zookeeper.ZooKeeper: Session:
0x12e669588b8004d closed
2011-03-22 08:31:56,700 INFO org.apache.zookeeper.ClientCnxn:
EventThread shut down
2011-03-22 08:31:56,702 INFO
org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
starting; hbase.shutdown.hook=true;
fsShutdownHook=Thread[Thread-15,5,main]
2011-03-22 08:31:56,702 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown
hook
2011-03-22 08:31:56,702 INFO
org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs
shutdown hook thread.
2011-03-22 08:31:56,804 INFO
org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
finished.


Second, it seems that when I shut down a region server on the peer
cluster, region servers on the source cluster that are connected to it
also shut down:
2011-03-22 09:03:34,541 INFO
org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closing
leases
2011-03-22 09:03:34,541 INFO
org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closed
leases
2011-03-22 09:03:34,644 INFO
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Closed zookeeper sessionid=0x12e669588b80050
2011-03-22 09:03:34,653 INFO org.apache.zookeeper.ZooKeeper: Session:
0x12e669588b80050 closed
2011-03-22 09:03:34,653 INFO org.apache.zookeeper.ClientCnxn:
EventThread shut down
2011-03-22 09:03:34,662 INFO org.apache.zookeeper.ZooKeeper: Session:
0x22e669588a2005d closed
2011-03-22 09:03:34,662 INFO org.apache.zookeeper.ClientCnxn:
EventThread shut down
2011-03-22 09:03:34,664 INFO
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Closing source 1 because: Region server is closing
2011-03-22 09:03:39,377 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Source exiting 1
2011-03-22 09:03:39,431 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Source exiting 1
2011-03-22 09:03:39,431 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020
exiting
2011-03-22 09:03:39,432 INFO
org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
starting; hbase.shutdown.hook=true;
fsShutdownHook=Thread[Thread-15,5,main]
2011-03-22 09:03:39,432 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown
hook
2011-03-22 09:03:39,432 INFO
org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs
shutdown hook thread.
2011-03-22 09:03:39,433 INFO
org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
finished.



Third, sometimes it crashes for no reason I can understand. See the
attached log dump; it includes the entire load process from start to
shutdown of the region server. When I run "stop_replication" everything
is OK; here's what happens after "start_replication":
2011-03-22 09:38:59,199 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Replication is disabled, sleeping 1000 times 10
2011-03-22 09:38:59,333 INFO
org.apache.hadoop.hbase.replication.ReplicationZookeeper: Replication
is now started
2011-03-22 09:39:09,202 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Opening log for replication
hadoop1-s05.farm-ny.gigya.com%3A60020.1300799921876 at 124
2011-03-22 09:39:09,215 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
currentNbOperations:0 and seenEntries:1 and size: 191
2011-03-22 09:39:09,215 INFO
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
Going to report log
#hadoop1-s05.farm-ny.gigya.com%3A60020.1300799921876 for position 315
in hdfs://hadoop1-m1:8020/hbase/.logs/hadoop1-s05.farm-ny.gigya.com,60020,1300799918373/hadoop1-s05.farm-ny.gigya.com%3A60020.1300799921876
2011-03-22 09:39:09,224 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
server serverName=hadoop1-s05.farm-ny.gigya.com,60020,1300799918373,
load=(requests=0, regions=3, usedHeap=41, maxHeap=987): Writing
replication status
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode
= NoNode for /hbase/replication/rs/hadoop1-s05.farm-ny.gigya.com,60020,1300799918373/1/hadoop1-s05.farm-ny.gigya.com%3A60020.1300799921876
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:708)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:751)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:432)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:131)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:332)
2011-03-22 09:39:09,225 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
request=0.0, regions=3, stores=5, storefiles=5, storefileIndexSize=1,
memstoreSize=0, compactionQueueSize=0, flushQueueSize=0, usedHeap=41,
maxHeap=987, blockCacheSize=1702768, blockCacheFree=205390992,
blockCacheCount=3, blockCacheHitCount=15, blockCacheMissCount=3,
blockCacheEvictedCount=0, blockCacheHitRatio=83,
blockCacheHitCachingRatio=83
2011-03-22 09:39:09,225 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Writing
replication status
2011-03-22 09:39:09,225 INFO
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
Removing 0 logs in the list: []
2011-03-22 09:39:09,225 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Nothing to replicate, sleeping 1000 times 10
2011-03-22 09:39:10,996 INFO org.apache.hadoop.ipc.HBaseServer:
Stopping server on 60020
2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: PRI
IPC Server handler 0 on 60020: exiting
2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: PRI
IPC Server handler 2 on 60020: exiting
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 18 on 60020: exiting
2011-03-22 09:39:10,998 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping
infoServer
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 30 on 60020: exiting
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI
IPC Server handler 5 on 60020: exiting
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer:
Stopping IPC Server Responder
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI
IPC Server handler 4 on 60020: exiting
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI
IPC Server handler 6 on 60020: exiting
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 20 on 60020: exiting
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 15 on 60020: exiting
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 9 on 60020: exiting
2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 21 on 60020: exiting
2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 49 on 60020: exiting
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 14 on 60020: exiting
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI
IPC Server handler 8 on 60020: exiting
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 17 on 60020: exiting
2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 44 on 60020: exiting
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 8 on 60020: exiting
2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 42 on 60020: exiting
2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 41 on 60020: exiting
2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 40 on 60020: exiting
2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 39 on 60020: exiting
2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 38 on 60020: exiting
2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 37 on 60020: exiting
2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 36 on 60020: exiting
2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 34 on 60020: exiting
2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 33 on 60020: exiting
2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 32 on 60020: exiting
2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 31 on 60020: exiting
2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 29 on 60020: exiting
2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 28 on 60020: exiting
2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 27 on 60020: exiting
2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 26 on 60020: exiting
2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 25 on 60020: exiting
2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 24 on 60020: exiting
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 23 on 60020: exiting
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 13 on 60020: exiting
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 7 on 60020: exiting
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 16 on 60020: exiting
2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 12 on 60020: exiting
2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 6 on 60020: exiting
2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 11 on 60020: exiting
2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 5 on 60020: exiting
2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 4 on 60020: exiting
2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 0 on 60020: exiting
2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: PRI
IPC Server handler 9 on 60020: exiting
2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: PRI
IPC Server handler 7 on 60020: exiting
2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer:
Stopping IPC Server listener on 60020
2011-03-22 09:39:11,004 INFO org.mortbay.log: Stopped
SelectChannelConnector@0.0.0.0:60030
2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 35 on 60020: exiting
2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 43 on 60020: exiting
2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 45 on 60020: exiting
2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 46 on 60020: exiting
2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 47 on 60020: exiting
2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 48 on 60020: exiting
2011-03-22 09:39:11,010 INFO
org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
2011-03-22 09:39:11,010 DEBUG
org.apache.hadoop.hbase.regionserver.wal.HLog:
regionserver60020.logSyncer interrupted while waiting for sync
requests
2011-03-22 09:39:11,010 INFO
org.apache.hadoop.hbase.regionserver.wal.HLog:
regionserver60020.logSyncer exiting
2011-03-22 09:39:11,010 DEBUG
org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog writer in
hdfs://hadoop1-m1:8020/hbase/.logs/hadoop1-s05.farm-ny.gigya.com,60020,1300799918373
2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 22 on 60020: exiting
2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 10 on 60020: exiting
2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 3 on 60020: exiting
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI
IPC Server handler 3 on 60020: exiting
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 2 on 60020: exiting
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI
IPC Server handler 1 on 60020: exiting
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 19 on 60020: exiting
2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 1 on 60020: exiting
2011-03-22 09:39:11,011 DEBUG
org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
Processing close of -ROOT-,,0.70236052
2011-03-22 09:39:11,011 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
regionserver60020.cacheFlusher exiting
2011-03-22 09:39:11,010 DEBUG
org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
Processing close of
TEST_Msg,MZVWNHOSUZYUOYNQKDIAVSCQEPHXVWVXIMGLGXGSSXQTZQMOZCZDCQAUWFSXARWYEBMBRCJMXPHXBIQNDTYTWRURMMOBFISBBSPYEKWWSNGMJCSOPFUGTDBMGUPFOIHOXGWI\x00,1300193788355.4ca0c6cf6654b8f6fd7e3bbba0b9fc6c.
2011-03-22 09:39:11,010 DEBUG
org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
Processing close of
TEST_UserSettings,,1300103207136.c438541b556672c4f4486416baa371f0.
2011-03-22 09:39:11,009 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread:
regionserver60020.compactor exiting
2011-03-22 09:39:11,012 DEBUG
org.apache.hadoop.hbase.regionserver.HRegion: Closing
TEST_UserSettings,,1300103207136.c438541b556672c4f4486416baa371f0.:
disabling compactions & flushes
2011-03-22 09:39:11,009 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker:
regionserver60020.majorCompactionChecker exiting
2011-03-22 09:39:11,012 DEBUG
org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for
region TEST_UserSettings,,1300103207136.c438541b556672c4f4486416baa371f0.
2011-03-22 09:39:11,012 DEBUG
org.apache.hadoop.hbase.regionserver.Store: closed Settings
2011-03-22 09:39:11,012 DEBUG
org.apache.hadoop.hbase.regionserver.Store: closed default
2011-03-22 09:39:11,012 INFO
org.apache.hadoop.hbase.regionserver.HRegion: Closed
TEST_UserSettings,,1300103207136.c438541b556672c4f4486416baa371f0.
2011-03-22 09:39:11,012 DEBUG
org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
Closed region TEST_UserSettings,,1300103207136.c438541b556672c4f4486416baa371f0.
2011-03-22 09:39:11,012 DEBUG
org.apache.hadoop.hbase.regionserver.HRegion: Closing
TEST_Msg,MZVWNHOSUZYUOYNQKDIAVSCQEPHXVWVXIMGLGXGSSXQTZQMOZCZDCQAUWFSXARWYEBMBRCJMXPHXBIQNDTYTWRURMMOBFISBBSPYEKWWSNGMJCSOPFUGTDBMGUPFOIHOXGWI\x00,1300193788355.4ca0c6cf6654b8f6fd7e3bbba0b9fc6c.:
disabling compactions & flushes
2011-03-22 09:39:11,011 DEBUG
org.apache.hadoop.hbase.regionserver.HRegion: Closing
-ROOT-,,0.70236052: disabling compactions & flushes
2011-03-22 09:39:11,012 DEBUG
org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for
region TEST_Msg,MZVWNHOSUZYUOYNQKDIAVSCQEPHXVWVXIMGLGXGSSXQTZQMOZCZDCQAUWFSXARWYEBMBRCJMXPHXBIQNDTYTWRURMMOBFISBBSPYEKWWSNGMJCSOPFUGTDBMGUPFOIHOXGWI\x00,1300193788355.4ca0c6cf6654b8f6fd7e3bbba0b9fc6c.
2011-03-22 09:39:11,012 DEBUG
org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for
region -ROOT-,,0.70236052
2011-03-22 09:39:11,013 DEBUG
org.apache.hadoop.hbase.regionserver.Store: closed Data
2011-03-22 09:39:11,013 DEBUG
org.apache.hadoop.hbase.regionserver.Store: closed default
2011-03-22 09:39:11,013 DEBUG
org.apache.hadoop.hbase.regionserver.Store: closed info
2011-03-22 09:39:11,013 INFO
org.apache.hadoop.hbase.regionserver.HRegion: Closed
TEST_Msg,MZVWNHOSUZYUOYNQKDIAVSCQEPHXVWVXIMGLGXGSSXQTZQMOZCZDCQAUWFSXARWYEBMBRCJMXPHXBIQNDTYTWRURMMOBFISBBSPYEKWWSNGMJCSOPFUGTDBMGUPFOIHOXGWI\x00,1300193788355.4ca0c6cf6654b8f6fd7e3bbba0b9fc6c.
2011-03-22 09:39:11,013 INFO
org.apache.hadoop.hbase.regionserver.HRegion: Closed
-ROOT-,,0.70236052
2011-03-22 09:39:11,013 DEBUG
org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
Closed region TEST_Msg,MZVWNHOSUZYUOYNQKDIAVSCQEPHXVWVXIMGLGXGSSXQTZQMOZCZDCQAUWFSXARWYEBMBRCJMXPHXBIQNDTYTWRURMMOBFISBBSPYEKWWSNGMJCSOPFUGTDBMGUPFOIHOXGWI\x00,1300193788355.4ca0c6cf6654b8f6fd7e3bbba0b9fc6c.
2011-03-22 09:39:11,013 DEBUG
org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
Closed region -ROOT-,,0.70236052
2011-03-22 09:39:11,066 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server
at: hadoop1-s05.farm-ny.gigya.com,60020,1300799918373
2011-03-22 09:39:11,066 DEBUG
org.apache.hadoop.hbase.catalog.CatalogTracker: Stopping catalog
tracker org.apache.hadoop.hbase.catalog.CatalogTracker@2eb0a3f5
2011-03-22 09:39:11,066 INFO
org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closing
leases
2011-03-22 09:39:11,066 INFO
org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closed
leases
2011-03-22 09:39:11,169 INFO
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Closed zookeeper sessionid=0x22e669588a20061
2011-03-22 09:39:11,181 INFO org.apache.zookeeper.ZooKeeper: Session:
0x22e669588a20061 closed
2011-03-22 09:39:11,181 INFO org.apache.zookeeper.ClientCnxn:
EventThread shut down
2011-03-22 09:39:11,189 INFO org.apache.zookeeper.ZooKeeper: Session:
0x12e669588b80057 closed
2011-03-22 09:39:11,189 INFO org.apache.zookeeper.ClientCnxn:
EventThread shut down
2011-03-22 09:39:11,190 INFO
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Closing source 1 because: Region server is closing
2011-03-22 09:39:12,414 INFO
org.apache.hadoop.hbase.regionserver.Leases:
regionserver60020.leaseChecker closing leases
2011-03-22 09:39:12,415 INFO
org.apache.hadoop.hbase.regionserver.Leases:
regionserver60020.leaseChecker closed leases
2011-03-22 09:39:19,229 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Source exiting 1
2011-03-22 09:39:19,229 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020
exiting
2011-03-22 09:39:19,230 INFO
org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
starting; hbase.shutdown.hook=true;
fsShutdownHook=Thread[Thread-15,5,main]
2011-03-22 09:39:19,230 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown
hook
2011-03-22 09:39:19,230 INFO
org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs
shutdown hook thread.
2011-03-22 09:39:19,231 INFO
org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
finished.



Fourth, and probably worst of all, it seems that when the servers are
crashing this way the master still thinks they are alive, so the region
is not transitioned and is therefore inaccessible. How long should it
normally take the master to detect a dead region server?

Any help on what's going on would be greatly appreciated.

-eran

Re: Region server crashes when using replication

Posted by Eran Kutner <er...@gigya.com>.
I tried that, but still get the same result with the 0.90.2 build. The
worst part is that a region server fails, then another one tries to
take over and also fails, until the entire cluster is down. The fact
that a replication failure like this can cause a cascading failure of
the entire cluster is very troubling. What is the design reason for
shutting down the server on a replication error?

I also confirmed that the region server failures are not detected by
the master even after 10 minutes. I'm not sure how to show this, but I
see the servers go down, and when I run status 'detailed' they are
reported as "live" forever. I have zookeeper.session.timeout
configured to 20000, which should cause them to be detected in 20
seconds.
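
For reference, a rough sketch of how I set that value programmatically
(equivalent to the hbase-site.xml entry; the classes are just the stock
Hadoop/HBase configuration classes, and 20000 is simply the value I use):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class SessionTimeoutCheck {
  public static void main(String[] args) {
    // Loads hbase-default.xml and hbase-site.xml from the classpath.
    Configuration conf = HBaseConfiguration.create();
    // Equivalent of the hbase-site.xml entry; 20000 ms is what I use.
    conf.setInt("zookeeper.session.timeout", 20000);
    System.out.println("zookeeper.session.timeout = "
        + conf.getInt("zookeeper.session.timeout", -1));
  }
}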

-eran



On Tue, Mar 22, 2011 at 21:46, Jean-Daniel Cryans <jd...@apache.org> wrote:
>
> You can apply the patch that I included there and that I also
> committed to the 0.90 branch.
>
> J-D
>
> On Tue, Mar 22, 2011 at 12:37 PM, Eran Kutner <er...@gigya.com> wrote:
> > Actually, it will probably be connection timeout, not connection
> > refused when there is no connection between the two clusters.
> >
> > Is there a workaround I can implement now for HBASE-3664, can I write
> > something in ZK so the server has an old entry to delete and is happy
> > with it?
> >
> > -eran
> >
> >
> >
> >
> > On Tue, Mar 22, 2011 at 21:01, Jean-Daniel Cryans <jd...@apache.org> wrote:
> >> Inline.
> >>
> >> J-D
> >>
> >> On Tue, Mar 22, 2011 at 11:51 AM, Eran Kutner <er...@gigya.com> wrote:
> >>> Thanks, J-D.
> >>> As for the first issue, why does this behavior make sense? What happens when
> >>> the connection between the two clusters fails? Will the region servers of the
> >>> primary fail as well? or at least won't be able to start? Seems very
> >>> radical.
> >>
> >> The DNS entry should remain, so you won't get UnknownHostException but
> >> ConnectionRefused instead. But that's a different issue: HBASE-3130
> >>
> >>>
> >>> Regarding the second issue, I didn't see anything else in the logs, it just
> >>> seemed like it decided to shutdown, but maybe I missed it. I will try to
> >>> reproduce that and let you know if I succeed.
> >>
> >> That'd be nice :)
> >>
> >>>
> >>> Regarding the timeout to detect a failed server, 3 minutes sounds like a
> >>> very long time for a region server to be down. Obviously, during that time
> >>> the data owned by that server is inaccessible. Is there a reason for this
> >>> long timeout? Can it be configured?
> >>>
> >>
> >> We set it that high for people that try to push too much data to
> >> clusters that are too small / badly configured and then end up with
> >> crazy garbage collections. Have fun reading this series of blog posts:
> >> http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
> >>
> >> Please also see the book about this configuration:
> >> http://hbase.apache.org/book.html#recommended_configurations
> >>
> >

Re: Region server crashes when using replication

Posted by Jean-Daniel Cryans <jd...@apache.org>.
(ugh last email bounced 3 TIMES with spam score too high, something
weird is going on so clearing all old text)

That really sounds like a different issue, and more in line with what I
guessed earlier, that some other region server is deleting the znode
under that region server. I would think of a configuration issue...
how did you set up zookeeper for those two clusters? Do they use the
same zk ensemble? If so, different root znode or the same?

Thx for sticking,

J-D

Re: Region server crashes when using replication

Posted by Jean-Daniel Cryans <jd...@apache.org>.
So the message "No log to process" means that nothing was assigned to
replicate at all... I wonder if it's HBASE-3664 that did that... but
you should be able to confirm by looking at the region server log when
it starts; it should be adding its own HLog to the list to replicate.

Regarding the failover, it might be that the log is indeed registered
in zookeeper but somehow isn't in the memory log structure of
ReplicationSourceManager... again the logs should tell.

Feel free to send me directly a big tar.gz with all those logs and I
should be able to figure it out.

Thx!

J-D

On Sun, Mar 27, 2011 at 2:21 AM, Eran Kutner <er...@notspammer.com> wrote:
> Had more time to look into it and verify that indeed data is not replicated
> because the server doesn't see it in the log. So I tried restarting the RS
> and sure enough when the table (which has only one region) transitioned to
> another RS the replication started working (for new data only).
> So I tried with another table, and same thing, replication doesn't work and
> the log says "No log to process" but after restarting the RS and a table
> transition the replication started working for that table too. Is there
> something that gets initialized during a transition that could be missing
> before?
>
> -eran
>
>
>

Re: Region server crashes when using replication

Posted by Eran Kutner <er...@gigya.com>.
Had more time to look into it and verify that indeed data is not
replicated because the server doesn't see it in the log. So I tried
restarting the RS and sure enough when the table (which has only one
region) transitioned to another RS the replication started working
(for new data only).
So I tried with another table, and same thing, replication doesn't
work and the log says "No log to process" but after restarting the RS
and a table transition the replication started working for that table
too. Is there something that gets initialized during a transition that
could be missing before?

-eran



On Fri, Mar 25, 2011 at 21:26, Eran Kutner <er...@gigya.com> wrote:
>
> Thanks, J-D, that solved part of the problem. The servers have
> stopped crashing and the master now properly detects when a RS goes
> down. By the way, since the RS does detect this, it may be a good idea
> to stop the server on this event, as it indicates a significant
> configuration issue.
> However, now the replication just doesn't seem to work. I didn't
> change anything in the configuration, which already managed to push 2
> rows before crashing yesterday. I still see the peer properly
> configured in ZK and the replication is enabled, but nothing is
> happening. All I see in the log of the RS that holds the table I'm
> writing into is:
>
> 2011-03-25 15:16:56,504 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No
> log to process, sleeping 1000 times 10
> 2011-03-25 15:17:07,509 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No
> log to process, sleeping 1000 times 10
> 2011-03-25 15:17:18,515 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No
> log to process, sleeping 1000 times 10
> 2011-03-25 15:17:29,520 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No
> log to process, sleeping 1000 times 10
> 2011-03-25 15:17:40,526 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No
> log to process, sleeping 1000 times 10
>
> Needless to say, nothing gets into the peer cluster.
>
> -eran
>
>
>
>
> On Fri, Mar 25, 2011 at 02:02, Jean-Daniel Cryans <jd...@apache.org> wrote:
> > Ok so this is the same old DNS issue...
> >
> > This is the important message in the log:
> >
> > Master passed us address to use. Was=hadoop1-s02:60020,
> > Now=hadoop1-s02.farm-ny.not-a-spammer.com:60020
> >
> > This means that when the RS tries to resolve itself it gets its
> > hostname, but when the master resolves the RS it gets the FQDN. This
> > is a bug in HBase that we rely on those strings as "true machine
> > identification" but that's how it is at the moment. It happens to be
> > that replication is setup later in the process so it uses the FQDN.
> > The only way you can fix it is to change your DNS settings. Here we
> > resolve everything with their hostnames.
> >
> > Hope that helps and sorry about all the trouble,
> >
> > J-D
> >
> >>> You make it sound like it's a bad thing :)
> >>> But seriously, SpamAssassin is really not the brightest anti-spam software on the planet. You should check out what we're doing; we're actually in the same field as you guys, except our product is B2B.
> >>>
> >>> Thanks for looking into the bug.
> >>>
> >>> -eran
> >

Re: Region server crashes when using replication

Posted by Eran Kutner <er...@gigya.com>.
Thanks again J-D. I will avoid using stop_replication from now on.
As for the shell, JRuby (or even Java for that matter) is not really
our strong suit here, but I'll try to give it a look when I have some
time.

-eran




On Mon, Mar 28, 2011 at 23:43, Jean-Daniel Cryans <jd...@apache.org> wrote:
> Inline.
>
>> Thanks J-D!
>> I disabled replication because at the time, every time I started it
>> the entire cluster would shut itself down.
>> Any reason why the servers will not create the HLog immediately when
>> they receive the start_replication command?
>
> In the current code base replication cannot ask anything to the part
> responsible for the WAL. In any case start/stop replication wasn't
> built to do what you're trying to do, it's just a dirty kill switch.
>
>> Is there a less destructive way to stop and start the replication?
>> Will removing the peer yield better results? (By the way it would be
>> nice if the shell had a "show_peers" command.)
>
> There's an enable/disable command I have yet to implement :)
> Adding/removing the peer should do the trick too. I agree we need to
> list peers (that could be a nice first contribution wink wink).
>
>> 1) it seems that the servers crash if they can't talk to the peer ZK ensemble, which is really a huge problem.
>
> Like we previously discussed, this only happens when the region server
> starts and it's also very easy to fix (just catch the right
> exception).
>
>> 2) I can't be certain when the HLogs will actually start being written unless I restart the entire secondary cluster after reversing the replication direction.
>
> That's when you use start/stop replication, which like I said isn't
> designed to do what you want to do. Adding/removing the peers will
> work correctly in this case.
>

Re: Region server crashes when using replication

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Inline.

> Thanks J-D!
> I disabled replication because at the time, every time I started it
> the entire cluster would shut itself down.
> Any reason why the servers will not create the HLog immediately when
> they receive the start_replication command?

In the current code base replication cannot ask anything to the part
responsible for the WAL. In any case start/stop replication wasn't
built to do what you're trying to do, it's just a dirty kill switch.

> Is there a less destructive way to stop and start the replication?
> Will removing the peer yield better results? (By the way it would be
> nice if the shell had a "show_peers" command.)

There's an enable/disable command I have yet to implement :)
Adding/removing the peer should do the trick too. I agree we need to
list peers (that could be a nice first contribution wink wink).

> 1) it seems that the servers crash if they can't talk to the peer ZK ensemble, which is really a huge problem.

Like we previously discussed, this only happens when the region server
starts and it's also very easy to fix (just catch the right
exception).
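
To make it concrete, here's a rough sketch of the kind of fix I mean;
the class and method names below are illustrative placeholders, not the
actual ReplicationZookeeper code:

import java.io.IOException;
import java.util.List;

// Illustrative sketch only: a connectExistingPeers-style loop that logs and
// skips a peer it cannot reach, instead of letting the exception bubble up
// and abort region server startup.
public class PeerConnectSketch {
  void connectExistingPeers(List<String> peerIds) {
    for (String id : peerIds) {
      try {
        connectToPeer(id); // would throw if the peer quorum cannot be resolved
      } catch (IOException e) { // e.g. an UnknownHostException from the DNS lookup
        // Log and keep going; the peer can be retried later.
        System.err.println("Could not connect to peer " + id + ", skipping: " + e);
      }
    }
  }

  // Placeholder for the real peer-connection logic.
  void connectToPeer(String id) throws IOException {
  }
}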

> 2) I can't be certain when the HLogs will actually start being written unless I restart the entire secondary cluster after reversing the replication direction.

That's when you use start/stop replication, which like I said isn't
designed to do what you want to do. Adding/removing the peers will
work correctly in this case.

Re: Region server crashes when using replication

Posted by Eran Kutner <er...@gigya.com>.
Thanks J-D!
I disabled replication because at the time, every time I started it
the entire cluster would shut itself down.
Any reason why the servers will not create the HLog immediately when
they receive the start_replication command?
Is there a less destructive way to stop and start the replication?
Will removing the peer yield better results? (By the way it would be
nice if the shell had a "show_peers" command.)
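
For now I can probably just read the peers straight out of ZooKeeper
with a small sketch like the one below (this assumes the default znode
layout, with peers under /hbase/replication/peers; the paths would
differ if the znode parents were changed):

import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ListPeersSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the source cluster's ZK ensemble (adjust the connect string).
    ZooKeeper zk = new ZooKeeper("hadoop1-zk3:2181", 30000, new Watcher() {
      public void process(WatchedEvent event) { /* no-op */ }
    });
    // Assumed default layout: each child of /hbase/replication/peers is a
    // peer id whose data is that peer's cluster key.
    String peersNode = "/hbase/replication/peers";
    List<String> peers = zk.getChildren(peersNode, false);
    for (String id : peers) {
      byte[] data = zk.getData(peersNode + "/" + id, false, null);
      System.out.println(id + " -> " + new String(data));
    }
    zk.close();
  }
}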

The way I planned on using the replication is kind of a "manual
bi-directional" operation. The idea is to have each cluster in a
different data center, one primary and one backup. Initially the
replication is configured from primary to backup. If the primary data
center goes down, we will switch traffic to the backup DC and reverse
the direction of the replication so new writes will eventually sync
back to the primary when it comes back online. Right now I see two
problems with this plan: 1) it seems that the servers crash if they
can't talk to the peer ZK ensemble, which is really a huge problem. 2)
I can't be certain when the HLogs will actually start being written
unless I restart the entire secondary cluster after reversing the
replication direction.
Am I right in my understanding of the current state of things?

Really appreciate your help!




-eran




On Mon, Mar 28, 2011 at 20:29, Jean-Daniel Cryans <jd...@apache.org> wrote:
> Ah turns out the issue was way simpler than I thought. One example:
>
> 2011-03-25 13:55:02,103 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Replication is disabled, sleeping 1000 times 10
> 2011-03-25 13:55:07,762 INFO
> org.apache.hadoop.hbase.replication.ReplicationZookeeper: Replication
> is now started
> 2011-03-25 13:55:13,111 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No
> log to process, sleeping 1000 times 10
>
> (BTW, pro tip when debugging an issue with HBase is to go back to its
> first occurrence)
>
> The issue is that the cluster had replication disabled, which is
> REALLY disruptive as it disables every replication feature including
> adding new logs to replicate, meaning that if the server starts with
> replication disabled it won't even add a single log to replicate.
> Here's an example of when a new log was finally added after a long
> time of "No log to process":
>
> 2011-03-24 03:56:20,538 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No
> log to process, sleeping 1000 times 10
> 2011-03-24 03:56:22,848 DEBUG
> org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period
> 3600000ms elapsed
> 2011-03-24 03:56:22,911 INFO
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using
> syncFs -- HDFS-200
> 2011-03-24 03:56:22,974 INFO
> org.apache.hadoop.hbase.regionserver.wal.HLog: Roll
> /hbase/.logs/hadoop1-s04.farm-ny.etc
>
> That was the previous day; on the 25th, when replication was enabled,
> no other log with data in it was rolled, so none was added to replicate.
>
> Bottom line, disabling replication is a kill switch and should only
> be used with that functionality in mind. Starting the cluster with
> replication enabled should make it work right away for you.
>
> Thx!
>
> J-D
>
> On Sun, Mar 27, 2011 at 2:21 AM, Eran Kutner <er...@notspammer.com> wrote:
>> Had more time to look into it and verify that indeed data is not replicated
>> because the server doesn't see it in the log. So I tried restarting the RS
>> and sure enough when the table (which has only one region) transitioned to
>> another RS the replication started working (for new data only).
>> So I tried with another table, and same thing, replication doesn't work and
>> the log says "No log to process" but after restarting the RS and a table
>> transition the replication started working for that table too. Is there
>> something that gets initialized during a transition that could be missing
>> before?
>>
>> -eran
>>
>

Re: Region server crashes when using replication

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Ah turns out the issue was way simpler than I thought. One example:

2011-03-25 13:55:02,103 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Replication is disabled, sleeping 1000 times 10
2011-03-25 13:55:07,762 INFO
org.apache.hadoop.hbase.replication.ReplicationZookeeper: Replication
is now started
2011-03-25 13:55:13,111 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No
log to process, sleeping 1000 times 10

(BTW, pro tip when debugging an issue with HBase is to go back to its
first occurrence)

The issue is that the cluster had replication disabled, which is
REALLY disruptive as it disables every replication feature including
adding new logs to replicate, meaning that if the server starts with
replication disabled it won't even add a single log to replicate.
Here's an example of when a new log was finally added after a long
time of "No log to process":

2011-03-24 03:56:20,538 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No
log to process, sleeping 1000 times 10
2011-03-24 03:56:22,848 DEBUG
org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period
3600000ms elapsed
2011-03-24 03:56:22,911 INFO
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using
syncFs -- HDFS-200
2011-03-24 03:56:22,974 INFO
org.apache.hadoop.hbase.regionserver.wal.HLog: Roll
/hbase/.logs/hadoop1-s04.farm-ny.etc

That was the previous day; on the 25th, when replication was enabled,
no other log with data in it was rolled, so none was added to replicate.

Bottom line, disabling replication is a kill switch and should only
be used with that functionality in mind. Starting the cluster with
replication enabled should make it work right away for you.
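
If you want to double-check the flag before starting, it is just a
znode; a rough sketch like this will print it (assuming the default
layout, with the state kept under /hbase/replication/state; the path
changes if the parent znodes were overridden in hbase-site.xml):

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ReplicationStateCheck {
  public static void main(String[] args) throws Exception {
    // Point this at the source cluster's ZK ensemble.
    ZooKeeper zk = new ZooKeeper("hadoop1-zk3:2181", 30000, new Watcher() {
      public void process(WatchedEvent event) { /* no-op */ }
    });
    // Assumed default znode layout; adjust if zookeeper.znode.parent or the
    // replication znode names were changed.
    byte[] state = zk.getData("/hbase/replication/state", false, null);
    System.out.println("replication state: " + new String(state));
    zk.close();
  }
}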

Thx!

J-D

On Sun, Mar 27, 2011 at 2:21 AM, Eran Kutner <er...@notspammer.com> wrote:
> Had more time to look into it and verify that indeed data is not replicated
> because the server doesn't see it in the log. So I tried restarting the RS
> and sure enough when the table (which has only one region) transitioned to
> another RS the replication started working (for new data only).
> So I tried with another table, and same thing, replication doesn't work and
> the log says "No log to process" but after restarting the RS and a table
> transition the replication started working for that table too. Is there
> something that gets initialized during a transition that could be missing
> before?
>
> -eran
>

Re: Region server crashes when using replication

Posted by Eran Kutner <er...@gigya.com>.
Thanks, J-D, that solved part of the problem. The servers have stopped
crashing and the master now properly detects when a RS goes down. By
the way, since the RS does detect this, it may be a good idea to stop
the server on this event, as it indicates a significant configuration
issue.
However, now the replication just doesn't seem to work. I didn't change
anything in the configuration, which already managed to push 2 rows
before crashing yesterday. I still see the peer properly configured in
ZK and the replication is enabled, but nothing is happening. All I see
in the log of the RS that holds the table I'm writing into is:

2011-03-25 15:16:56,504 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No
log to process, sleeping 1000 times 10
2011-03-25 15:17:07,509 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No
log to process, sleeping 1000 times 10
2011-03-25 15:17:18,515 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No
log to process, sleeping 1000 times 10
2011-03-25 15:17:29,520 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No
log to process, sleeping 1000 times 10
2011-03-25 15:17:40,526 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No
log to process, sleeping 1000 times 10

Needless to say, nothing gets into the peer cluster.

-eran




On Fri, Mar 25, 2011 at 02:02, Jean-Daniel Cryans <jd...@apache.org> wrote:
> Ok so this is the same old DNS issue...
>
> This is the important message in the log:
>
> Master passed us address to use. Was=hadoop1-s02:60020,
> Now=hadoop1-s02.farm-ny.not-a-spammer.com:60020
>
> This means that when the RS tries to resolve itself it gets its
> hostname, but when the master resolves the RS it gets the FQDN. This
> is a bug in HBase that we rely on those strings as "true machine
> identification" but that's how it is at the moment. It happens to be
> that replication is setup later in the process so it uses the FQDN.
> The only way you can fix it is to change your DNS settings. Here we
> resolve everything with their hostnames.
>
> Hope that helps and sorry about all the trouble,
>
> J-D
>
>>> You make it sound like it's a bad thing :)
>>> But seriously, SpamAssassin is really not the brightest anti-spam software on the planet. You should check out what we're doing; we're actually in the same field as you guys, except our product is B2B.
>>>
>>> Thanks for looking into the bug.
>>>
>>> -eran
>

Re: Region server crashes when using replication

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Ok so this is the same old DNS issue...

This is the important message in the log:

Master passed us address to use. Was=hadoop1-s02:60020,
Now=hadoop1-s02.farm-ny.not-a-spammer.com:60020

This means that when the RS tries to resolve itself it gets its
hostname, but when the master resolves the RS it gets the FQDN. This
is a bug in HBase that we rely on those strings as "true machine
identification" but that's how it is at the moment. It happens to be
that replication is setup later in the process so it uses the FQDN.
The only way you can fix it is to change your DNS settings. Here we
resolve everything with their hostnames.
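
A quick way to see the mismatch from a region server box is to compare
what Java resolves for the short name versus the canonical name, e.g.
with a small sketch like this (plain JDK calls, nothing HBase-specific;
which name you get for each call depends on your resolver setup):

import java.net.InetAddress;

public class DnsCheck {
  public static void main(String[] args) throws Exception {
    InetAddress local = InetAddress.getLocalHost();
    // What the RS tends to report for itself (often the short hostname)...
    System.out.println("hostname:  " + local.getHostName());
    // ...versus what a reverse lookup returns (often the FQDN).
    System.out.println("canonical: " + local.getCanonicalHostName());
    // Optionally forward-resolve a name passed on the command line.
    if (args.length > 0) {
      System.out.println(args[0] + " -> "
          + InetAddress.getByName(args[0]).getCanonicalHostName());
    }
  }
}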

Hope that helps and sorry about all the trouble,

J-D

>> You make it sound like it's a bad thing :)
>> But seriously, SpamAssassin is really not the brightest anti-spam software on the planet. You should check out what we're doing; we're actually in the same field as you guys, except our product is B2B.
>>
>> Thanks for looking into the bug.
>>
>> -eran

Re: Region server crashes when using replication

Posted by Eran Kutner <er...@gigya.com>.
Now it doesn't like the email because it was in HTML format... As I
said, not a very smart piece of software.

On Fri, Mar 25, 2011 at 00:07, Eran Kutner <er...@gigya.com> wrote:
>
> You make it sound like it's a bad thing :)
> But seriously, SpamAssassin is really not the brightest anti-spam software on the planet. You should check out what we're doing; we're actually in the same field as you guys, except our product is B2B.
>
> Thanks for looking into the bug.
>
> -eran
>
>
>
> On Thu, Mar 24, 2011 at 22:52, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>
>> Ah yeah that's the issue, the mixup of FQDNs and hostnames. I wonder
>> how it got into that state... but that explains why the environment
>> looked so weird! Let me have a quick look at the code to figure why
>> it's different and hopefully I can get you a patch just in time for
>> 0.90.2
>>
>> J-D
>>
>> > One thing strange I'm noticing is that in the /hbase/rs node the
>> > servers are listed with their host name only while in
>> > /hbase/replication/rs they are listed with their fully qualified DNS
>> > names. Is this intentional?
>> >
>> > Note: I changed the domain name in this email because I think the
>> > mailing list's spam filter doesn't like it. The attached logs still
>> > show the full name.
>> >
>>
>> So you guys are spammers? ;)
>

Re: Region server crashes when using replication

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Ah yeah that's the issue, the mixup of FQDNs and hostnames. I wonder
how it got into that state... but that explains why the environment
looked so weird! Let me have a quick look at the code to figure out why
it's different and hopefully I can get you a patch just in time for
0.90.2

J-D

> One thing strange I'm noticing is that in the /hbase/rs node the
> servers are listed with their host name only while in
> /hbase/replication/rs they are listed with their fully qualified DNS
> names. Is this intentional?
>
> Note: I changed the domain name in this email because I think the
> mailing list's spam filter doesn't like it. The attached logs still
> show the full name.
>

So you guys are spammers? ;)

Re: Region server crashes when using replication

Posted by Eran Kutner <er...@gigya.com>.
Here's what I found. I started with 2 RSs running in the cluster (#1 and #4).
This is how ZK looked at that point:
[zk: hadoop1-zk3:2181(CONNECTED) 25] ls /hbase/rs
[hadoop1-s01,60020,1300952215842, hadoop1-s04,60020,1300881354710]
[zk: hadoop1-zk3:2181(CONNECTED) 26] ls /hbase/replication/rs
[hadoop1-s01.farm-ny.gig.com,60020,1300952215842]

I then started RS #2; it seems that it is looking for the replication
log file, which it can't find (see attached log).
Immediately after that, ZK looks like this:
[zk: hadoop1-zk3:2181(CONNECTED) 27] ls /hbase/replication/rs
[hadoop1-s02.farm-ny.gig.com,60020,1300953027434]

This is how .logs on HDFS looks at this time (the .oldlogs directory is empty):
Found 5 items
drwxr-xr-x   - hbase supergroup          0 2011-03-24 03:36
/hbase/.logs/hadoop1-s01.farm-ny.gig.com,60020,1300952215842
drwxr-xr-x   - hbase supergroup          0 2011-03-24 03:50
/hbase/.logs/hadoop1-s02.farm-ny.gig.com,60020,1300953027434
drwxr-xr-x   - hbase supergroup          0 2011-03-24 03:36
/hbase/.logs/hadoop1-s03.farm-ny.gig.com,60020,1300952197026
drwxr-xr-x   - hbase supergroup          0 2011-03-24 03:56
/hbase/.logs/hadoop1-s04.farm-ny.gig.com,60020,1300881354710
drwxr-xr-x   - hbase supergroup          0 2011-03-23 16:57
/hbase/.logs/hadoop1-s05.farm-ny.gig.com,60020,1300913823878

After some time, while writing this, I see that RS #1 has now crashed
and ZK looks like this (see attached log):
[zk: hadoop1-zk3:2181(CONNECTED) 29] ls /hbase/replication/rs
[hadoop1-s02.farm-ny.gig.com,60020,1300953027434,
hadoop1-s04.farm-ny.gig.com,60020,1300881354710]

One thing strange I'm noticing is that in the /hbase/rs node the
servers are listed with their host name only while in
/hbase/replication/rs they are listed with their fully qualified DNS
names. Is this intentional?

Note: I changed the domain name in this email because I think the
mailing list's spam filter doesn't like it. The attached logs still
show the full name.

-eran




On Thu, Mar 24, 2011 at 00:28, Jean-Daniel Cryans <jd...@apache.org> wrote:
> On Wed, Mar 23, 2011 at 3:22 PM, Eran Kutner <er...@gigya.com> wrote:
>> They are using two separate ensembles, 3 servers in each. I'm trying to
>> create total independence for each cluster.
>
> Can you find out who deleted the znode that the region server failed
> on? If it reported the status a few times before that, it means it
> existed whereas the other bug was that the znode never existed.
>
> J-D
>

Re: Region server crashes when using replication

Posted by Jean-Daniel Cryans <jd...@apache.org>.
On Wed, Mar 23, 2011 at 3:22 PM, Eran Kutner <er...@gigya.com> wrote:
> They are using two separate ensembles, 3 servers in each. I'm trying to
> create total independence for each cluster.

Can you find out who deleted the znode that the region server failed
on? If it reported the status a few times before that, it means it
existed whereas the other bug was that the znode never existed.
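
One way to catch it in the act would be to leave a watch on that znode
and log when it goes away, for example with a rough sketch like this
(plain ZooKeeper client calls; fill in your own quorum and the region
server's znode path under /hbase/replication/rs):

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZnodeDeleteWatch {
  public static void main(String[] args) throws Exception {
    // args[0]: full path of the region server's replication znode, e.g.
    // /hbase/replication/rs/hadoop1-s05.farm-ny.gigya.com,60020,1300799918373
    final String path = args[0];
    ZooKeeper zk = new ZooKeeper("hadoop1-zk3:2181", 30000, new Watcher() {
      public void process(WatchedEvent event) {
        if (event.getType() == EventType.NodeDeleted) {
          System.out.println(System.currentTimeMillis()
              + " znode deleted: " + event.getPath());
        }
      }
    });
    // exists() with watch=true registers the default watcher set above.
    Stat stat = zk.exists(path, true);
    System.out.println(path + (stat == null ? " does not exist" : " exists"));
    Thread.sleep(Long.MAX_VALUE); // stay alive to receive the deletion event
  }
}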

J-D

Re: Region server crashes when using replication

Posted by Jean-Daniel Cryans <jd...@apache.org>.
On Wed, Mar 23, 2011 at 4:54 AM, Eran Kutner <er...@gigya.com> wrote:
> I tried that, but still get the same result with the 0.90.2 build. The worst
> part is that region server fails, then another one tries to take over and
> also fails, until the entire cluster is down. The fact that a replication
> failure such as this can cause a cascading fail on the entire cluster is
> very troubling. What is the design reason for shutting down the server for a
> replication error?

Yeah, looking at the log something is very odd; it's like something
else came in and deleted the znode under the region server, since
previously you can see that it was using it:

2011-03-23 07:38:35,572 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Opening log for replication
hadoop1-s02.farm-ny.gigya.com%3A60020.1300880313950 at 0
2011-03-23 07:38:35,590 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
currentNbOperations:0 and seenEntries:0 and size: 124
2011-03-23 07:38:35,591 INFO
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
Going to report log
#hadoop1-s02.farm-ny.gigya.com%3A60020.1300880313950 for position 124
in hdfs://hadoop1-m1:8020/hbase/.logs/hadoop1-s02.farm-ny.gigya.com,60020,1300880313349/hadoop1-s02.farm-ny.gigya.com%3A60020.1300880313950

So it was there, then something deleted it.

Also, since you ask, here's a little story about replication.

It was originally developed on the 0.89 branch and this is what we
have run in production since September. The 0.90 branch has a totally
reworked master and also a totally reworked interface to ZK. Among
other things, a way to abort the region server more easily was added.
When porting the replication code over to the new ZK interface (replication
relies heavily on it for coordination), a lot of abort calls were
added as a way of telling the developer that something is wrong in the
code and that it should be handled. Then, not much testing was done on
replication and both 0.90.0 and 0.90.1 were released.

This is also why it's not enabled by default and why the
documentation says: "This package is experimental quality software and
is only meant to be a base for future developments."

Meaning, use at your own risk. If you still wanna go down that route,
which I would really appreciate since I've been the only one
maintaining replication since the beginning, then be prepared to get
your hands dirty.
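
(For context, since it's off by default: turning it on means setting
hbase.replication in hbase-site.xml on both clusters, roughly like
this -- a sketch, double-check against the package doc:

  <property>
    <name>hbase.replication</name>
    <value>true</value>
  </property>

plus the peer definition, which you've clearly already got since your
sources are trying to connect.)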

>
> I also confirmed that the region server failures are not detected by the
> master even after 10 minutes. I'm not sure how to show this, but I see the
> servers go down, and when I run status 'detailed' they are reported as "live"
> forever. I have zookeeper.session.timeout set to 20000, which should cause
> the failure to be detected within 20 seconds.

That's very very odd. Is the znode for that region server still in
ZooKeeper? Does the master see anything about a znode getting deleted?
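
A quick way to check, straight from the ZK shell (the quorum address
and the server name below are placeholders taken from your paste --
adjust to whichever server is stuck as "live"):

  ./zkCli.sh -server hadoop1-zk3:2181
  [zk: hadoop1-zk3:2181(CONNECTED) 0] ls /hbase/rs
  [zk: hadoop1-zk3:2181(CONNECTED) 1] stat /hbase/rs/hadoop1-s05,60020,1300799918373

If the ephemeral znode is already gone then the session expired and the
master should have run (or be about to run) its expiration handling; if
it's still listed, something is keeping the session alive.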

Thx,

J-D

Re: Region server crashes when using replication

Posted by Jean-Daniel Cryans <jd...@apache.org>.
You can apply the patch that I included there and that I also
committed to the 0.90 branch.
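
Roughly like this (the patch file name is whatever you grab from the
JIRA attachment, and the build line assumes a maven checkout of the
0.90 branch -- adjust to however you build your deploy artifact):

  cd hbase-0.90
  patch -p0 < HBASE-3664.patch   # or -p1, depending on how it was generated
  mvn package -DskipTests
  # then push the rebuilt jar out to the region servers and restart them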

J-D

On Tue, Mar 22, 2011 at 12:37 PM, Eran Kutner <er...@gigya.com> wrote:
> Actually, it will probably be connection timeout, not connection
> refused when there is no connection between the two clusters.
>
> Is there a workaround I can implement now for HBASE-3664? Can I write
> something in ZK so the server has an old entry to delete and is happy
> with it?
>
> -eran
>
>
>
>
> On Tue, Mar 22, 2011 at 21:01, Jean-Daniel Cryans <jd...@apache.org> wrote:
>> Inline.
>>
>> J-D
>>
>> On Tue, Mar 22, 2011 at 11:51 AM, Eran Kutner <er...@gigya.com> wrote:
>>> Thanks, J-D.
>>> As for the first issue, why does this behavior make sense? What happens when
>>> the connection between the two clusters fails? Will the region servers of the
>>> primary fail as well? Or at least won't be able to start? Seems very
>>> radical.
>>
>> The DNS entry should remain, so you won't get UnknownHostException but
>> ConnectionRefused instead. But that's a different issue: HBASE-3130
>>
>>>
>>> Regarding the second issue, I didn't see anything else in the logs, it just
>>> seemed like it decided to shut down, but maybe I missed it. I will try to
>>> reproduce that and let you know if I succeed.
>>
>> That'd be nice :)
>>
>>>
>>> Regarding the timeout to detect a failed server, 3 minutes sounds like a
>>> very long time for a region server to be down. Obviously, during that time
>>> the data owned by that server is inaccessible. Is there a reason for this
>>> long timeout? Can it be configured?
>>>
>>
>> We set it that high for people that try to push too much data to
>> clusters that are too small / badly configured and then end up with
>> crazy garbage collections. Have fun reading this series of blog posts:
>> http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
>>
>> Please also see the book about this configuration:
>> http://hbase.apache.org/book.html#recommended_configurations
>>
>

Re: Region server crashes when using replication

Posted by Eran Kutner <er...@gigya.com>.
Actually, it will probably be connection timeout, not connection
refused when there is no connection between the two clusters.

Is there a workaround I can implement now for HBASE-3664? Can I write
something in ZK so the server has an old entry to delete and is happy
with it?

-eran




On Tue, Mar 22, 2011 at 21:01, Jean-Daniel Cryans <jd...@apache.org> wrote:
> Inline.
>
> J-D
>
> On Tue, Mar 22, 2011 at 11:51 AM, Eran Kutner <er...@gigya.com> wrote:
>> Thanks, J-D.
>> As for the first issue, why does this behavior make sense? What happens when
>> the connection between the two clusters fails? Will the region servers of the
>> primary fail as well? Or at least won't be able to start? Seems very
>> radical.
>
> The DNS entry should remain, so you won't get UnknownHostException but
> ConnectionRefused instead. But that's a different issue: HBASE-3130
>
>>
>> Regarding the second issue, I didn't see anything else in the logs, it just
>> seemed like it decided to shut down, but maybe I missed it. I will try to
>> reproduce that and let you know if I succeed.
>
> That'd be nice :)
>
>>
>> Regarding the timeout to detect a failed server, 3 minutes sounds like a
>> very long time for a region server to be down. Obviously, during that time
>> the data owned by that server is inaccessible. Is there a reason for this
>> long timeout? Can it be configured?
>>
>
> We set it that high for people that try to push too much data to
> clusters that are too small / badly configured and then end up with
> crazy garbage collections. Have fun reading this series of blog posts:
> http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
>
> Please also see the book about this configuration:
> http://hbase.apache.org/book.html#recommended_configurations
>

Re: Region server crashes when using replication

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Inline.

J-D

On Tue, Mar 22, 2011 at 11:51 AM, Eran Kutner <er...@gigya.com> wrote:
> Thanks, J-D.
> As for the first issue, why does this behavior make sense? What happens when
> the connection between the two clusters fails? Will the region servers of the
> primary fail as well? Or at least won't be able to start? Seems very
> radical.

The DNS entry should remain, so you won't get UnknownHostException but
ConnectionRefused instead. But that's a different issue: HBASE-3130

>
> Regarding the second issue, I didn't see anything else in the logs, it just
> seemed like it decided to shut down, but maybe I missed it. I will try to
> reproduce that and let you know if I succeed.

That'd be nice :)

>
> Regarding the timeout to detect a failed server, 3 minutes sounds like a
> very long time for a region server to be down. Obviously, during that time
> the data owned by that server is inaccessible. Is there a reason for this
> long timeout? Can it be configured?
>

We set it that high for people that try to push too much data to
clusters that are too small / badly configured and then end up with
crazy garbage collections. Have fun reading this series of blog posts:
http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/

Please also see the book about this configuration:
http://hbase.apache.org/book.html#recommended_configurations
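
If you do decide to lower it, the knob is zookeeper.session.timeout in
hbase-site.xml (milliseconds), something like the snippet below --
keeping in mind the caveats from those posts, and that the ZK server
side also caps the negotiated timeout (maxSessionTimeout, if I recall
correctly):

  <property>
    <name>zookeeper.session.timeout</name>
    <value>60000</value>
  </property>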

Re: Region server crashes when using replication

Posted by Eran Kutner <er...@gigya.com>.
Thanks, J-D.

As for the first issue, why does this behavior make sense? What
happens when the connection between the two clusters fails? Will the
region servers of the primary fail as well? Or at least won't be able
to start? Seems very radical.

Regarding the second issue, I didn't see anything else in the logs, it
just seemed like it decided to shut down, but maybe I missed it. I will
try to reproduce that and let you know if I succeed.

Regarding the timeout to detect a failed server, 3 minutes sounds like
a very long time for a region server to be down. Obviously, during
that time the data owned by that server is inaccessible. Is there a
reason for this long timeout? Can it be configured?

-eran



On Tue, Mar 22, 2011 at 20:22, Jean-Daniel Cryans <jd...@apache.org> wrote:
>
> First issue: UnknownHostException is unforgiving, your machines need
> to be able to talk to haddop2-zk3 (is that a typo?)  and it seems that
> at least that one can't. The reason the machine dies is that we
> usually try to "fail fast" in HBase.
>
> Second issue: There's not enough information, all I see is a region
> server shutting down and the reason why is probably before that.
>
> Third issue: https://issues.apache.org/jira/browse/HBASE-3664
>
> Fourth issue: it's now 3 minutes in 0.90 for the timeout to happen.
>
> J-D
>

Re: Region server crashes when using replication

Posted by Jean-Daniel Cryans <jd...@apache.org>.
First issue: UnknownHostException is unforgiving, your machines need
to be able to talk to haddop2-zk3 (is that a typo?)  and it seems that
at least that one can't. The reason the machine dies is that we
usually try to "fail fast" in HBase.

Second issue: There's not enough information, all I see is a region
server shutting down and the reason why is probably before that.

Third issue: https://issues.apache.org/jira/browse/HBASE-3664

Fourth issue: it's now 3 minutes in 0.90 for the timeout to happen.

J-D

On Tue, Mar 22, 2011 at 10:39 AM, Eran Kutner <er...@gigya.com> wrote:
> Hi,
> I'm trying to use replication between two HBase clusters and I'm
> encountering all kinds of crashes and weird behavior.
>
> First, it seems that starting a region server when the peer ZKs are
> not available will cause the server to fail to start:
>
> 2011-03-22 08:31:56,647 INFO
> org.apache.hadoop.hbase.replication.ReplicationZookeeper: Replication
> is now started
> 2011-03-22 08:31:56,668 WARN
> org.apache.hadoop.hbase.zookeeper.ZKConfig:
> java.net.UnknownHostException: haddop2-zk3
>         at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
>         at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:850)
>         at java.net.InetAddress.getAddressFromNameService(InetAddress.java:1201)
>         at java.net.InetAddress.getAllByName0(InetAddress.java:1154)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1084)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1020)
>         at java.net.InetAddress.getByName(InetAddress.java:970)
>         at org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:206)
>         at org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:250)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:113)
>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:288)
>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:253)
>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:182)
>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:142)
>         at org.apache.hadoop.hbase.replication.regionserver.Replication.<init>(Replication.java:75)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1092)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:875)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1472)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:563)
>         at java.lang.Thread.run(Thread.java:662)
>
> 2011-03-22 08:31:56,669 WARN
> org.apache.hadoop.hbase.zookeeper.ZKConfig:
> java.net.UnknownHostException: haddop2-zk2
>         at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
>         at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:850)
>         at java.net.InetAddress.getAddressFromNameService(InetAddress.java:1201)
>         at java.net.InetAddress.getAllByName0(InetAddress.java:1154)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1084)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1020)
>         at java.net.InetAddress.getByName(InetAddress.java:970)
>         at org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:206)
>         at org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:250)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:113)
>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:288)
>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:253)
>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:182)
>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:142)
>         at org.apache.hadoop.hbase.replication.regionserver.Replication.<init>(Replication.java:75)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1092)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:875)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1472)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:563)
>         at java.lang.Thread.run(Thread.java:662)
>
> 2011-03-22 08:31:56,669 INFO org.apache.zookeeper.ZooKeeper:
> Initiating client connection,
> connectString=haddop2-zk3:2181,haddop2-zk2:2181,hadoop2-zk1:2181
> sessionTimeout=180000 watcher=connection to cluster: 1
> 2011-03-22 08:31:56,670 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Failed
> initialization
> 2011-03-22 08:31:56,670 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer: Failed init
> java.net.UnknownHostException: haddop2-zk3
>         at java.net.InetAddress.getAllByName0(InetAddress.java:1158)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1084)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1020)
>         at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:386)
>         at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:331)
>         at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:377)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.connect(ZKUtil.java:97)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:119)
>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:288)
>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:253)
>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:182)
>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:142)
>         at org.apache.hadoop.hbase.replication.regionserver.Replication.<init>(Replication.java:75)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1092)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:875)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1472)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:563)
>         at java.lang.Thread.run(Thread.java:662)
> 2011-03-22 08:31:56,675 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
> server serverName=hadoop1-s05.farm-ny.gigya.com,60020,1300797113247,
> load=(requests=0, regions=0, usedHeap=24, maxHeap=987): Unhandled
> exception: haddop2-zk3
> java.net.UnknownHostException: haddop2-zk3
>         at java.net.InetAddress.getAllByName0(InetAddress.java:1158)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1084)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1020)
>         at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:386)
>         at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:331)
>         at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:377)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.connect(ZKUtil.java:97)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:119)
>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:288)
>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:253)
>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:182)
>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:142)
>         at org.apache.hadoop.hbase.replication.regionserver.Replication.<init>(Replication.java:75)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1092)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:875)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1472)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:563)
>         at java.lang.Thread.run(Thread.java:662)
> 2011-03-22 08:31:56,675 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Unhandled
> exception: haddop2-zk3
> 2011-03-22 08:31:56,675 INFO org.apache.hadoop.ipc.HBaseServer:
> Stopping server on 60020
> 2011-03-22 08:31:56,679 INFO
> org.apache.hadoop.hbase.regionserver.StoreFile: Allocating
> LruBlockCache with maximum size 197.5m
> 2011-03-22 08:31:56,683 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server
> at: hadoop1-s05.farm-ny.gigya.com,60020,1300797113247
> 2011-03-22 08:31:56,683 DEBUG
> org.apache.hadoop.hbase.catalog.CatalogTracker: Stopping catalog
> tracker org.apache.hadoop.hbase.catalog.CatalogTracker@508aeb74
> 2011-03-22 08:31:56,684 INFO
> org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closing
> leases
> 2011-03-22 08:31:56,684 INFO
> org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closed
> leases
> 2011-03-22 08:31:56,684 INFO
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> Closed zookeeper sessionid=0x22e669588a20058
> 2011-03-22 08:31:56,692 INFO org.apache.zookeeper.ZooKeeper: Session:
> 0x22e669588a20058 closed
> 2011-03-22 08:31:56,692 INFO org.apache.zookeeper.ClientCnxn:
> EventThread shut down
> 2011-03-22 08:31:56,700 INFO org.apache.zookeeper.ZooKeeper: Session:
> 0x12e669588b8004d closed
> 2011-03-22 08:31:56,700 INFO org.apache.zookeeper.ClientCnxn:
> EventThread shut down
> 2011-03-22 08:31:56,702 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
> starting; hbase.shutdown.hook=true;
> fsShutdownHook=Thread[Thread-15,5,main]
> 2011-03-22 08:31:56,702 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown
> hook
> 2011-03-22 08:31:56,702 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs
> shutdown hook thread.
> 2011-03-22 08:31:56,804 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
> finished.
>
>
> Second, it seems that when I shut down a region server on the
> peer cluster, region servers on the source cluster connected to it
> also shut down:
> 2011-03-22 09:03:34,541 INFO
> org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closing
> leases
> 2011-03-22 09:03:34,541 INFO
> org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closed
> leases
> 2011-03-22 09:03:34,644 INFO
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> Closed zookeeper sessionid=0x12e669588b80050
> 2011-03-22 09:03:34,653 INFO org.apache.zookeeper.ZooKeeper: Session:
> 0x12e669588b80050 closed
> 2011-03-22 09:03:34,653 INFO org.apache.zookeeper.ClientCnxn:
> EventThread shut down
> 2011-03-22 09:03:34,662 INFO org.apache.zookeeper.ZooKeeper: Session:
> 0x22e669588a2005d closed
> 2011-03-22 09:03:34,662 INFO org.apache.zookeeper.ClientCnxn:
> EventThread shut down
> 2011-03-22 09:03:34,664 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Closing source 1 because: Region server is closing
> 2011-03-22 09:03:39,377 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Source exiting 1
> 2011-03-22 09:03:39,431 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Source exiting 1
> 2011-03-22 09:03:39,431 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020
> exiting
> 2011-03-22 09:03:39,432 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
> starting; hbase.shutdown.hook=true;
> fsShutdownHook=Thread[Thread-15,5,main]
> 2011-03-22 09:03:39,432 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown
> hook
> 2011-03-22 09:03:39,432 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs
> shutdown hook thread.
> 2011-03-22 09:03:39,433 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
> finished.
>
>
>
> Third, sometimes it crashes without any reason I can understand. See
> the attached log dump. It includes the entire load process from start
> to shutdown of the region server. When I issue "stop_replication"
> everything is OK; here's what happens after "start_replication":
> 2011-03-22 09:38:59,199 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Replication is disabled, sleeping 1000 times 10
> 2011-03-22 09:38:59,333 INFO
> org.apache.hadoop.hbase.replication.ReplicationZookeeper: Replication
> is now started
> 2011-03-22 09:39:09,202 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Opening log for replication
> hadoop1-s05.farm-ny.gigya.com%3A60020.1300799921876 at 124
> 2011-03-22 09:39:09,215 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> currentNbOperations:0 and seenEntries:1 and size: 191
> 2011-03-22 09:39:09,215 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
> Going to report log
> #hadoop1-s05.farm-ny.gigya.com%3A60020.1300799921876 for position 315
> in hdfs://hadoop1-m1:8020/hbase/.logs/hadoop1-s05.farm-ny.gigya.com,60020,1300799918373/hadoop1-s05.farm-ny.gigya.com%3A60020.1300799921876
> 2011-03-22 09:39:09,224 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
> server serverName=hadoop1-s05.farm-ny.gigya.com,60020,1300799918373,
> load=(requests=0, regions=3, usedHeap=41, maxHeap=987): Writing
> replication status
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode
> = NoNode for /hbase/replication/rs/hadoop1-s05.farm-ny.gigya.com,60020,1300799918373/1/hadoop1-s05.farm-ny.gigya.com%3A60020.1300799921876
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:708)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:751)
>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:432)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:131)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:332)
> 2011-03-22 09:39:09,225 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> request=0.0, regions=3, stores=5, storefiles=5, storefileIndexSize=1,
> memstoreSize=0, compactionQueueSize=0, flushQueueSize=0, usedHeap=41,
> maxHeap=987, blockCacheSize=1702768, blockCacheFree=205390992,
> blockCacheCount=3, blockCacheHitCount=15, blockCacheMissCount=3,
> blockCacheEvictedCount=0, blockCacheHitRatio=83,
> blockCacheHitCachingRatio=83
> 2011-03-22 09:39:09,225 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Writing
> replication status
> 2011-03-22 09:39:09,225 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
> Removing 0 logs in the list: []
> 2011-03-22 09:39:09,225 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Nothing to replicate, sleeping 1000 times 10
> 2011-03-22 09:39:10,996 INFO org.apache.hadoop.ipc.HBaseServer:
> Stopping server on 60020
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 0 on 60020: exiting
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 2 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 18 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping
> infoServer
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 30 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 5 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer:
> Stopping IPC Server Responder
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 4 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 6 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 20 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 15 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 9 on 60020: exiting
> 2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 21 on 60020: exiting
> 2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 49 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 14 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 8 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 17 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 44 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 8 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 42 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 41 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 40 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 39 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 38 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 37 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 36 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 34 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 33 on 60020: exiting
> 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 32 on 60020: exiting
> 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 31 on 60020: exiting
> 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 29 on 60020: exiting
> 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 28 on 60020: exiting
> 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 27 on 60020: exiting
> 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 26 on 60020: exiting
> 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 25 on 60020: exiting
> 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 24 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 23 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 13 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 7 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 16 on 60020: exiting
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 12 on 60020: exiting
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 6 on 60020: exiting
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 11 on 60020: exiting
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 5 on 60020: exiting
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 4 on 60020: exiting
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 0 on 60020: exiting
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 9 on 60020: exiting
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 7 on 60020: exiting
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer:
> Stopping IPC Server listener on 60020
> 2011-03-22 09:39:11,004 INFO org.mortbay.log: Stopped
> SelectChannelConnector@0.0.0.0:60030
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 35 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 43 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 45 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 46 on 60020: exiting
> 2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 47 on 60020: exiting
> 2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 48 on 60020: exiting
> 2011-03-22 09:39:11,010 INFO
> org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
> 2011-03-22 09:39:11,010 DEBUG
> org.apache.hadoop.hbase.regionserver.wal.HLog:
> regionserver60020.logSyncer interrupted while waiting for sync
> requests
> 2011-03-22 09:39:11,010 INFO
> org.apache.hadoop.hbase.regionserver.wal.HLog:
> regionserver60020.logSyncer exiting
> 2011-03-22 09:39:11,010 DEBUG
> org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog writer in
> hdfs://hadoop1-m1:8020/hbase/.logs/hadoop1-s05.farm-ny.gigya.com,60020,1300799918373
> 2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 22 on 60020: exiting
> 2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 10 on 60020: exiting
> 2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 3 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 3 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 2 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 1 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 19 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 1 on 60020: exiting
> 2011-03-22 09:39:11,011 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
> Processing close of -ROOT-,,0.70236052
> 2011-03-22 09:39:11,011 INFO
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
> regionserver60020.cacheFlusher exiting
> 2011-03-22 09:39:11,010 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
> Processing close of
> TEST_Msg,MZVWNHOSUZYUOYNQKDIAVSCQEPHXVWVXIMGLGXGSSXQTZQMOZCZDCQAUWFSXARWYEBMBRCJMXPHXBIQNDTYTWRURMMOBFISBBSPYEKWWSNGMJCSOPFUGTDBMGUPFOIHOXGWI\x00,1300193788355.4ca0c6cf6654b8f6fd7e3bbba0b9fc6c.
> 2011-03-22 09:39:11,010 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
> Processing close of
> TEST_UserSettings,,1300103207136.c438541b556672c4f4486416baa371f0.
> 2011-03-22 09:39:11,009 INFO
> org.apache.hadoop.hbase.regionserver.CompactSplitThread:
> regionserver60020.compactor exiting
> 2011-03-22 09:39:11,012 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Closing
> TEST_UserSettings,,1300103207136.c438541b556672c4f4486416baa371f0.:
> disabling compactions & flushes
> 2011-03-22 09:39:11,009 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker:
> regionserver60020.majorCompactionChecker exiting
> 2011-03-22 09:39:11,012 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for
> region TEST_UserSettings,,1300103207136.c438541b556672c4f4486416baa371f0.
> 2011-03-22 09:39:11,012 DEBUG
> org.apache.hadoop.hbase.regionserver.Store: closed Settings
> 2011-03-22 09:39:11,012 DEBUG
> org.apache.hadoop.hbase.regionserver.Store: closed default
> 2011-03-22 09:39:11,012 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Closed
> TEST_UserSettings,,1300103207136.c438541b556672c4f4486416baa371f0.
> 2011-03-22 09:39:11,012 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
> Closed region TEST_UserSettings,,1300103207136.c438541b556672c4f4486416baa371f0.
> 2011-03-22 09:39:11,012 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Closing
> TEST_Msg,MZVWNHOSUZYUOYNQKDIAVSCQEPHXVWVXIMGLGXGSSXQTZQMOZCZDCQAUWFSXARWYEBMBRCJMXPHXBIQNDTYTWRURMMOBFISBBSPYEKWWSNGMJCSOPFUGTDBMGUPFOIHOXGWI\x00,1300193788355.4ca0c6cf6654b8f6fd7e3bbba0b9fc6c.:
> disabling compactions & flushes
> 2011-03-22 09:39:11,011 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Closing
> -ROOT-,,0.70236052: disabling compactions & flushes
> 2011-03-22 09:39:11,012 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for
> region TEST_Msg,MZVWNHOSUZYUOYNQKDIAVSCQEPHXVWVXIMGLGXGSSXQTZQMOZCZDCQAUWFSXARWYEBMBRCJMXPHXBIQNDTYTWRURMMOBFISBBSPYEKWWSNGMJCSOPFUGTDBMGUPFOIHOXGWI\x00,1300193788355.4ca0c6cf6654b8f6fd7e3bbba0b9fc6c.
> 2011-03-22 09:39:11,012 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for
> region -ROOT-,,0.70236052
> 2011-03-22 09:39:11,013 DEBUG
> org.apache.hadoop.hbase.regionserver.Store: closed Data
> 2011-03-22 09:39:11,013 DEBUG
> org.apache.hadoop.hbase.regionserver.Store: closed default
> 2011-03-22 09:39:11,013 DEBUG
> org.apache.hadoop.hbase.regionserver.Store: closed info
> 2011-03-22 09:39:11,013 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Closed
> TEST_Msg,MZVWNHOSUZYUOYNQKDIAVSCQEPHXVWVXIMGLGXGSSXQTZQMOZCZDCQAUWFSXARWYEBMBRCJMXPHXBIQNDTYTWRURMMOBFISBBSPYEKWWSNGMJCSOPFUGTDBMGUPFOIHOXGWI\x00,1300193788355.4ca0c6cf6654b8f6fd7e3bbba0b9fc6c.
> 2011-03-22 09:39:11,013 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Closed
> -ROOT-,,0.70236052
> 2011-03-22 09:39:11,013 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
> Closed region TEST_Msg,MZVWNHOSUZYUOYNQKDIAVSCQEPHXVWVXIMGLGXGSSXQTZQMOZCZDCQAUWFSXARWYEBMBRCJMXPHXBIQNDTYTWRURMMOBFISBBSPYEKWWSNGMJCSOPFUGTDBMGUPFOIHOXGWI\x00,1300193788355.4ca0c6cf6654b8f6fd7e3bbba0b9fc6c.
> 2011-03-22 09:39:11,013 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
> Closed region -ROOT-,,0.70236052
> 2011-03-22 09:39:11,066 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server
> at: hadoop1-s05.farm-ny.gigya.com,60020,1300799918373
> 2011-03-22 09:39:11,066 DEBUG
> org.apache.hadoop.hbase.catalog.CatalogTracker: Stopping catalog
> tracker org.apache.hadoop.hbase.catalog.CatalogTracker@2eb0a3f5
> 2011-03-22 09:39:11,066 INFO
> org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closing
> leases
> 2011-03-22 09:39:11,066 INFO
> org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closed
> leases
> 2011-03-22 09:39:11,169 INFO
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> Closed zookeeper sessionid=0x22e669588a20061
> 2011-03-22 09:39:11,181 INFO org.apache.zookeeper.ZooKeeper: Session:
> 0x22e669588a20061 closed
> 2011-03-22 09:39:11,181 INFO org.apache.zookeeper.ClientCnxn:
> EventThread shut down
> 2011-03-22 09:39:11,189 INFO org.apache.zookeeper.ZooKeeper: Session:
> 0x12e669588b80057 closed
> 2011-03-22 09:39:11,189 INFO org.apache.zookeeper.ClientCnxn:
> EventThread shut down
> 2011-03-22 09:39:11,190 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Closing source 1 because: Region server is closing
> 2011-03-22 09:39:12,414 INFO
> org.apache.hadoop.hbase.regionserver.Leases:
> regionserver60020.leaseChecker closing leases
> 2011-03-22 09:39:12,415 INFO
> org.apache.hadoop.hbase.regionserver.Leases:
> regionserver60020.leaseChecker closed leases
> 2011-03-22 09:39:19,229 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Source exiting 1
> 2011-03-22 09:39:19,229 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020
> exiting
> 2011-03-22 09:39:19,230 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
> starting; hbase.shutdown.hook=true;
> fsShutdownHook=Thread[Thread-15,5,main]
> 2011-03-22 09:39:19,230 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown
> hook
> 2011-03-22 09:39:19,230 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs
> shutdown hook thread.
> 2011-03-22 09:39:19,231 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
> finished.
>
>
>
> Fourth, and probably worst of all, it seems that when the servers
> crash this way the master still thinks they are alive, so the regions
> are not transitioned and are therefore inaccessible. How long should it
> normally take the master to detect a dead region server?
>
> Any help on what's going on would be greatly appreciated.
>
> -eran
>