Posted to user@hbase.apache.org by Robert Kent <Ro...@inps.co.uk> on 2014/12/01 12:25:11 UTC

HBase Regionserver randomly dies

Hi,

I have a collection of HBase clusters.  Each cluster runs separate instances of Zookeeper, Hadoop & HBase.  The clusters are either single node or three node setups.

I am having constant stability problems with the HBase Regionserver: it dies randomly every day or every other day.  It normally dies shortly after printing the following:

ERROR [Thread-125066] hdfs.DFSClient (DFSClient.java:closeAllFilesBeingWritten(911)) - Failed to close inode 32621
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579753 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
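
For what it's worth, when this next happens I can run a bare HDFS write from a
standalone client to see whether the same error reproduces outside HBase.
Roughly something like this (the class name and test path are just placeholders;
it assumes fs.defaultFS on the client classpath points at this cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);
    Path probe = new Path("/tmp/hdfs-write-probe"); // throwaway test file
    try (FSDataOutputStream out = fs.create(probe, true)) {
      out.writeBytes("probe\n");
      out.hsync();                                  // force the block out to a DataNode, much like a WAL sync
    }
    fs.delete(probe, false);
    System.out.println("write + hsync succeeded");
  }
}

If that fails with the same "could only be replicated to 0 nodes" message then
HBase is presumably just the messenger.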

Does anyone have any pointers on where I can look to debug this issue?

thanks

Rob



Re: HBase Regionserver randomly dies

Posted by Qiang Tian <ti...@gmail.com>.
Sorry, I did not get a chance to look at the HDFS issue, but a quick search
shows many hits for that HDFS error message, so it looks like a common issue.
You could search for it and check those suggestions (e.g. increase memory), or
ask directly on the HDFS user list :-)
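
If you want to see what the NameNode currently thinks of your DataNodes from a
client, a sketch along these lines should do it (untested; it assumes
fs.defaultFS in the default Configuration is the hdfs:// URI of your cluster).
With only one DataNode, that node being dead, decommissioned or full is enough
to produce the "replicated to 0 nodes" error:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DatanodeReport {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
    for (DatanodeInfo dn : dfs.getDataNodeStats()) {   // the NameNode's live view of each DataNode
      System.out.printf("%s state=%s remaining=%d MB%n",
          dn.getHostName(), dn.getAdminState(), dn.getRemaining() / (1024 * 1024));
    }
  }
}

"hdfs dfsadmin -report" gives you the same information from the command line.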



RE: HBase Regionserver randomly dies

Posted by Robert Kent <Ro...@inps.co.uk>.
Sending without an attachment, as the attachment seems to be delaying or stopping the message from reaching the mailing list.

________________________________________
From: Robert Kent
Sent: 02 December 2014 12:40
To: user@hbase.apache.org
Subject: RE: HBase Regionserver randomly dies

> which hbase release are you using?

Hadoop: 2.5.0
HBase: 0.98.6
Zookeeper: 3.4.5

> your regionserver log is helpful.

I've attached all the logs for the 29th November - the logs are 148M uncompressed.
I've also attached the Zookeeper, Hadoop & HBase configurations.

Is there any other information I can give you to help?

> ps you mentioned "The clusters are either single node..." what about your hdfs nodes?

I have a handful of clusters:

1 node cluster: 1x VM running: HBase Regionserver & Master; Hadoop NameNode, DataNode, JobHistoryServer, NodeManager; Zookeeper
3 node cluster: 3x VMs, all (or nearly all) running: HBase Regionserver & Master; Hadoop NameNode, DataNode, JobHistoryServer, NodeManager; Zookeeper

HBase is running on top of HDFS in both cluster types.

The single node cluster runs everything on itself.
The three node cluster runs virtually everything on every node.
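
For reference, the HDFS settings I understand to matter most for the
"minReplication" error on the single-node cluster are the ones below (standard
Hadoop 2.x property names; the values shown are only what I would expect a
single-DataNode setup to use, not necessarily what is deployed here):

import org.apache.hadoop.conf.Configuration;

public class SingleNodeHdfsSettings {
  // Illustrative only: the keys are the standard Hadoop 2.x property names,
  // the values are what a single-DataNode cluster would normally use.
  public static Configuration singleNodeDefaults() {
    Configuration conf = new Configuration();
    conf.setInt("dfs.replication", 1);               // target replication; must be <= number of live DataNodes
    conf.setInt("dfs.namenode.replication.min", 1);  // the minReplication (=1) quoted in the error
    conf.setLong("dfs.datanode.du.reserved", 0L);    // reserved non-DFS space; set too high, it makes the only DataNode look full
    return conf;
  }
}
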
_______________________________________
From: Qiang Tian [tianq01@gmail.com]
Sent: 02 December 2014 01:40
To: user@hbase.apache.org
Subject: Re: HBase Regionserver randomly dies

which hbase release are you using?

your regionserver log is helpful.

a related case is https://issues.apache.org/jira/browse/HBASE-11902

In 0.98 the RS abort is expected behavior when it hits an HDFS failure.  You
still need to find the root cause of the HDFS failure:

could only be replicated to 0 nodes instead of minReplication (=1).  There are
1 datanode(s) running and no node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.
chooseTarget(BlockManager.java:1492)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.
getAdditionalBlock(FSNamesystem.java:3027)


ps you mentioned "The clusters are either single node..." what about your
hdfs nodes?


On Tue, Dec 2, 2014 at 4:15 AM, Ted Yu <yu...@gmail.com> wrote:

> There could be multiple reasons why the single datanode came to be considered
> dead, e.g. the datanode went under a load which it couldn't handle.
>
> I would recommend adding more datanode(s) so that the client (HBase) can ride
> over a (slow) datanode.
>
> Cheers
>
> On Mon, Dec 1, 2014 at 8:21 AM, Robert Kent <Ro...@inps.co.uk>
> wrote:
>
> > Sorry, those logs were from the Regionserver.
> >
> > The NameNode logs are:
> >
> > 2014-11-29 21:12:59,493 WARN  [IPC Server handler 0 on 8020]
> > blockmanagement.BlockPlacementPolicy
> > (BlockPlacementPolicyDefault.java:chooseTarget(313)) - Failed to place
> > enough replicas, still in need of 1 to reach 1. For more information,
> > please enable DEBUG log level on
> > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> > 2014-11-29 21:12:59,494 INFO  [IPC Server handler 0 on 8020] ipc.Server
> > (Server.java:run(2034)) - IPC Server handler 0 on 8020, call
> > org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from
> > 127.0.0.1:39965 Call#382010 Retry#0
> > java.io.IOException: File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > 2014-11-29 21:12:59,742 WARN  [IPC Server handler 5 on 8020]
> > blockmanagement.BlockPlacementPolicy
> > (BlockPlacementPolicyDefault.java:chooseTarget(313)) - Failed to place
> > enough replicas, still in need of 1 to reach 1. For more information,
> > please enable DEBUG log level on
> > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> > 2014-11-29 21:12:59,742 INFO  [IPC Server handler 5 on 8020] ipc.Server
> > (Server.java:run(2034)) - IPC Server handler 5 on 8020, call
> > org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from
> > 127.0.0.1:39965 Call#382017 Retry#0
> > java.io.IOException: File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> >         at
> >
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
> > [snip]
> >
> > Then after 2014-11-29 21:13:11 there are no further exceptions.
> >
> >
> > After the Regionserver aborted, I did not touch Hadoop.  All I did was
> > restart the Regionserver and everything started working correctly again.
> > ________________________________________
> > From: Robert Kent
> > Sent: 01 December 2014 16:17
> > To: user@hbase.apache.org
> > Subject: RE: HBase Regionserver randomly dies
> >
> > > From: Ted Yu [yuzhihong@gmail.com]
> > > Sent: 01 December 2014 15:31
> > > To: user@hbase.apache.org
> > > Subject: Re: HBase Regionserver randomly dies
> >
> > > Can you check the namenode log around the time the 'Failed to close inode'
> > > error was thrown?
> >
> > > Thanks
> >
> > Here are the errors from the logs:
> >
> > 2014-11-29 21:12:59,277 WARN  [Thread-125058] hdfs.DFSClient
> > (DFSOutputStream.java:run(639)) - DataStreamer Exception
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > could only be replicat
> > ed to 0 nodes instead of minReplication (=1).  There are 1 datanode(s)
> > running and no node(s) are excluded in this operation.
> >         at
> >
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
> >         at
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027)
> > [snip]
> > 2014-11-29 21:12:59,306 WARN  [regionserver60020.logRoller]
> hdfs.DFSClient
> > (DFSOutputStream.java:flushOrSync(2007)) - Error while syncing
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,359 WARN  [regionserver60020.logRoller] wal.FSHLog
> > (FSHLog.java:rollWriter(566)) - pre-sync failed
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,459 INFO  [regionserver60020.logRoller] wal.FSHLog
> > (FSHLog.java:rollWriter(588)) - Rolled WAL
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295456389
> > with entries=23544, filesize=121.9 M; new WAL
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > 2014-11-29 21:12:59,488 ERROR [regionserver60020-WAL.AsyncWriter]
> > wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write,
> request
> > close of hlog
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,489 ERROR [regionserver60020-WAL.AsyncWriter]
> > wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write,
> request
> > close of hlog
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,490 FATAL [regionserver60020-WAL.AsyncSyncer0]
> > wal.FSHLog (FSHLog.java:run(1255)) - Error while AsyncSyncer sync,
> request
> > close of hlog
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,491 FATAL [regionserver60020-WAL.AsyncSyncer1]
> > wal.FSHLog (FSHLog.java:run(1255)) - Error while AsyncSyncer sync,
> request
> > close of hlog
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,506 ERROR [regionserver60020-WAL.AsyncWriter]
> > wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write,
> request
> > close of hlog
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,507 FATAL [regionserver60020-WAL.AsyncSyncer0]
> > wal.FSHLog (FSHLog.java:run(1255)) - Error while AsyncSyncer sync,
> request
> > close of hlog
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,524 WARN  [Thread-125063] hdfs.DFSClient
> > (DFSOutputStream.java:run(639)) - DataStreamer Exception
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,525 WARN  [regionserver60020.logRoller]
> hdfs.DFSClient
> > (DFSOutputStream.java:flushOrSync(2007)) - Error while syncing
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > [snip]
> > 2014-11-29 21:12:59,538 WARN  [regionserver60020.logRoller] wal.FSHLog
> > (FSHLog.java:cleanupCurrentWriter(779)) - Riding over HLog close failure!
> > error count=1
> > 2014-11-29 21:12:59,539 INFO  [regionserver60020.logRoller] wal.FSHLog
> > (FSHLog.java:rollWriter(588)) - Rolled WAL
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > with entries=5, filesize=0; new WAL
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> > [snip]
> > 2014-11-29 21:12:59,736 ERROR [regionserver60020-WAL.AsyncWriter]
> > wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write,
> request
> > close of hlog
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,743 WARN  [Thread-125064] hdfs.DFSClient
> > (DFSOutputStream.java:run(639)) - DataStreamer Exception
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,745 ERROR [regionserver60020.logRoller]
> > wal.ProtobufLogWriter (ProtobufLogWriter.java:writeWALTrailer(157)) - Got
> > IOException while writing trailer
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,745 ERROR [regionserver60020.logRoller] wal.FSHLog
> > (FSHLog.java:cleanupCurrentWriter(776)) - Failed close of HLog writer
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,746 WARN  [regionserver60020.logRoller] wal.FSHLog
> > (FSHLog.java:cleanupCurrentWriter(779)) - Riding over HLog close failure!
> > error count=2
> > 2014-11-29 21:12:59,748 INFO  [regionserver60020.logRoller] wal.FSHLog
> > (FSHLog.java:rollWriter(588)) - Rolled WAL
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> > with entries=5, filesize=0; new WAL
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
> > 2014-11-29 21:12:59,751 ERROR [regionserver60020-WAL.AsyncWriter]
> > wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write,
> request
> > close of hlog
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> >         at
> >
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
> > [snip]
> > 2014-11-29 21:12:59,767 FATAL [regionserver60020.logRoller]
> > regionserver.HRegionServer (HRegionServer.java:abort(1865)) - ABORTING
> > region server extras1.ci.local,60020,1417171049368: Failed log close in
> log
> > roller
> > org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException:
> > #1417295579737
> > [snip]
> > 2014-11-29 21:12:59,768 FATAL [regionserver60020.logRoller]
> > regionserver.HRegionServer (HRegionServer.java:abort(1873)) -
> RegionServer
> > abort: loaded coprocessors are:
> > [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> > 2014-11-29 21:13:00,522 INFO  [regionserver60020.logRoller]
> > regionserver.HRegionServer (HRegionServer.java:stop(1798)) - STOPPED:
> > Failed log close in log roller
> > [snip]
> > 2014-11-29 21:13:02,132 INFO  [regionserver60020] regionserver.Leases
> > (Leases.java:close(147)) - regionserver60020 closing leases
> > 2014-11-29 21:13:02,132 INFO  [regionserver60020] regionserver.Leases
> > (Leases.java:close(150)) - regionserver60020 closed leases
> > 2014-11-29 21:13:02,179 INFO  [regionserver60020]
> > regionserver.ReplicationSource (ReplicationSource.java:terminate(860)) -
> > Closing source Indexer_mhsaudit because: Region server is closing
> > 2014-11-29 21:13:02,179 INFO  [regionserver60020]
> > client.HConnectionManager$HConnectionImplementation
> > (HConnectionManager.java:closeZooKeeperWatcher(1837)) - Closing zookeeper
> > sessionid=0x149f0f13a300434
> > 2014-11-29 21:13:02,180 INFO  [regionserver60020] zookeeper.ZooKeeper
> > (ZooKeeper.java:close(684)) - Session: 0x149f0f13a300434 closed
> > 2014-11-29 21:13:02,272 INFO  [regionserver60020-EventThread]
> > zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
> > 2014-11-29 21:13:02,356 INFO  [regionserver60020] zookeeper.ZooKeeper
> > (ZooKeeper.java:close(684)) - Session: 0x149f0f13a300431 closed
> > 2014-11-29 21:13:02,356 INFO  [regionserver60020-EventThread]
> > zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
> > 2014-11-29 21:13:02,356 INFO  [regionserver60020]
> > regionserver.HRegionServer (HRegionServer.java:run(1058)) - stopping
> server
> > extras1.ci.local,60020,1417171049368; zookeeper connection closed.
> > 2014-11-29 21:13:02,417 INFO  [regionserver60020]
> > regionserver.HRegionServer (HRegionServer.java:run(1061)) -
> > regionserver60020 exiting
> > 2014-11-29 21:13:02,494 ERROR [main]
> regionserver.HRegionServerCommandLine
> > (HRegionServerCommandLine.java:start(70)) - Region server exiting
> > java.lang.RuntimeException: HRegionServer Aborted
> >         at
> >
> org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:66)
> >         at
> >
> org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:85)
> >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >         at
> >
> org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
> >         at
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2467)
> > 2014-11-29 21:13:02,715 INFO  [Thread-10] regionserver.ShutdownHook
> > (ShutdownHook.java:run(111)) - Shutdown hook starting;
> > hbase.shutdown.hook=true;
> >
> fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@511a1546
> > 2014-11-29 21:13:02,716 INFO  [Thread-10] regionserver.ShutdownHook
> > (ShutdownHook.java:run(120)) - Starting fs shutdown hook thread.
> > 2014-11-29 21:13:02,717 ERROR [Thread-125066] hdfs.DFSClient
> > (DFSClient.java:closeAllFilesBeingWritten(911)) - Failed to close inode
> > 32621
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579753
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> >
> >
> > On Mon, Dec 1, 2014 at 4:10 AM, Robert Kent <Ro...@inps.co.uk>
> > wrote:
> > > > Looks like an HDFS issue. Are you sure your HDFS is working fine?
> > >
> > > HDFS appears to be working correctly - HBase will process requests
> > > properly and everything appears to work correctly for hours/days, until
> > > the regionserver randomly falls over.  If there were HDFS issues I would
> > > expect to see these during normal operation, but I don't.
> > >
> >
>

RE: HBase Regionserver randomly dies

Posted by Robert Kent <Ro...@inps.co.uk>.
Let me try that again with an attachment size that is small enough to go through the mailing list.

> org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
> >         at
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2467)
> > 2014-11-29 21:13:02,715 INFO  [Thread-10] regionserver.ShutdownHook
> > (ShutdownHook.java:run(111)) - Shutdown hook starting;
> > hbase.shutdown.hook=true;
> >
> fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@511a1546
> > 2014-11-29 21:13:02,716 INFO  [Thread-10] regionserver.ShutdownHook
> > (ShutdownHook.java:run(120)) - Starting fs shutdown hook thread.
> > 2014-11-29 21:13:02,717 ERROR [Thread-125066] hdfs.DFSClient
> > (DFSClient.java:closeAllFilesBeingWritten(911)) - Failed to close inode
> > 32621
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579753
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> >
> >
> > On Mon, Dec 1, 2014 at 4:10 AM, Robert Kent <Ro...@inps.co.uk>
> > wrote:
> > > > Looks like an HDFS issue. Are you sure your HDFS is working fine?
> > >
> > > HDFS appears to be working correctly - HBase will process requests
> > > properly and everything appears to work correctly for hours/days, until
> > the
> > > regionserver randomly falls over.  If there were HDFS issues I would
> > expect
> > > to see these during normal operation, but I don't.
> > >
> >
>

Re: HBase Regionserver randomly dies

Posted by Qiang Tian <ti...@gmail.com>.
which HBase release are you using?

your regionserver log would be helpful.

a related case is https://issues.apache.org/jira/browse/HBASE-11902

in 0.98 the RS abort is expected behavior when it hits an HDFS failure.  you
still need to find the root cause of the HDFS failure:

could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027)


PS: you mentioned "The clusters are either single node..."; what about your
HDFS nodes?
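
a few commands that show whether the namenode really has a usable datanode at the moment of the failure (standard Hadoop 2.x client tools; /hbase is this cluster's HBase root from the logs, and the data-dir path below is just a placeholder for your dfs.datanode.data.dir):

  hdfs dfsadmin -report                       # how many datanodes are live, and their remaining capacity
  hdfs dfsadmin -safemode get                 # make sure the namenode is not stuck in safe mode
  hdfs fsck /hbase -files -blocks -locations  # block-level health of the HBase root dir
  jps | grep -i datanode                      # on the datanode host: is the process even up?
  df -h /path/to/dfs.datanode.data.dir        # ...and is its data directory full?

"could only be replicated to 0 nodes" with one live, non-excluded datanode usually means that datanode was full, short on reserved space, or too busy to accept the block right then, so the output around 21:12 on Nov 29 is what matters most.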


On Tue, Dec 2, 2014 at 4:15 AM, Ted Yu <yu...@gmail.com> wrote:

> There could be multiple reasons why the single datanode became considered
> as dead.
> e.g. datanode went under load which it couldn't handle.
>
> I would recommend adding more datanode(s) so that client (hbase) can ride
> over (slow) datanode.
>
> Cheers
>
> On Mon, Dec 1, 2014 at 8:21 AM, Robert Kent <Ro...@inps.co.uk>
> wrote:
>
> > Sorry, those logs were from the Regionserver.
> >
> > The NameNode logs are:
> >
> > 2014-11-29 21:12:59,493 WARN  [IPC Server handler 0 on 8020]
> > blockmanagement.BlockPlacementPolicy
> > (BlockPlacementPolicyDefault.java:chooseTarget(313)) - Failed to place
> > enough replicas, still in need of 1 to reach 1. For more information,
> > please enable DEBUG log level on
> > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> > 2014-11-29 21:12:59,494 INFO  [IPC Server handler 0 on 8020] ipc.Server
> > (Server.java:run(2034)) - IPC Server handler 0 on 8020, call
> > org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from
> > 127.0.0.1:39965 Call#382010 Retry#0
> > java.io.IOException: File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > 2014-11-29 21:12:59,742 WARN  [IPC Server handler 5 on 8020]
> > blockmanagement.BlockPlacementPolicy
> > (BlockPlacementPolicyDefault.java:chooseTarget(313)) - Failed to place
> > enough replicas, still in need of 1 to reach 1. For more information,
> > please enable DEBUG log level on
> > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> > 2014-11-29 21:12:59,742 INFO  [IPC Server handler 5 on 8020] ipc.Server
> > (Server.java:run(2034)) - IPC Server handler 5 on 8020, call
> > org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from
> > 127.0.0.1:39965 Call#382017 Retry#0
> > java.io.IOException: File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> >         at
> >
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
> > [snip]
> >
> > Then after 2014-11-29 21:13:11 there are no further exceptions.
> >
> >
> > After having the Regionserver abort, I did not touch Hadoop.  All I did
> > was restart the Regionserver and everything started working correctly
> again.
> > ________________________________________
> > From: Robert Kent
> > Sent: 01 December 2014 16:17
> > To: user@hbase.apache.org
> > Subject: RE: HBase Regionserver randomly dies
> >
> > > From: Ted Yu [yuzhihong@gmail.com]
> > > Sent: 01 December 2014 15:31
> > > To: user@hbase.apache.org
> > > Subject: Re: HBase Regionserver randomly dies
> >
> > > Can you check namenode log around the time 'Failed to close inode'
> error
> > > was thrown ?
> >
> > > Thanks
> >
> > Here are the errors from the logs:
> >
> > 2014-11-29 21:12:59,277 WARN  [Thread-125058] hdfs.DFSClient
> > (DFSOutputStream.java:run(639)) - DataStreamer Exception
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > could only be replicated to 0 nodes instead of minReplication (=1).  There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> >         at
> >
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
> >         at
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027)
> > [snip]
> > 2014-11-29 21:12:59,306 WARN  [regionserver60020.logRoller]
> hdfs.DFSClient
> > (DFSOutputStream.java:flushOrSync(2007)) - Error while syncing
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,359 WARN  [regionserver60020.logRoller] wal.FSHLog
> > (FSHLog.java:rollWriter(566)) - pre-sync failed
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,459 INFO  [regionserver60020.logRoller] wal.FSHLog
> > (FSHLog.java:rollWriter(588)) - Rolled WAL
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295456389
> > with entries=23544, filesize=121.9 M; new WAL
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > 2014-11-29 21:12:59,488 ERROR [regionserver60020-WAL.AsyncWriter]
> > wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write,
> request
> > close of hlog
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,489 ERROR [regionserver60020-WAL.AsyncWriter]
> > wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write,
> request
> > close of hlog
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,490 FATAL [regionserver60020-WAL.AsyncSyncer0]
> > wal.FSHLog (FSHLog.java:run(1255)) - Error while AsyncSyncer sync,
> request
> > close of hlog
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,491 FATAL [regionserver60020-WAL.AsyncSyncer1]
> > wal.FSHLog (FSHLog.java:run(1255)) - Error while AsyncSyncer sync,
> request
> > close of hlog
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,506 ERROR [regionserver60020-WAL.AsyncWriter]
> > wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write,
> request
> > close of hlog
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,507 FATAL [regionserver60020-WAL.AsyncSyncer0]
> > wal.FSHLog (FSHLog.java:run(1255)) - Error while AsyncSyncer sync,
> request
> > close of hlog
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,524 WARN  [Thread-125063] hdfs.DFSClient
> > (DFSOutputStream.java:run(639)) - DataStreamer Exception
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,525 WARN  [regionserver60020.logRoller]
> hdfs.DFSClient
> > (DFSOutputStream.java:flushOrSync(2007)) - Error while syncing
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > [snip]
> > 2014-11-29 21:12:59,538 WARN  [regionserver60020.logRoller] wal.FSHLog
> > (FSHLog.java:cleanupCurrentWriter(779)) - Riding over HLog close failure!
> > error count=1
> > 2014-11-29 21:12:59,539 INFO  [regionserver60020.logRoller] wal.FSHLog
> > (FSHLog.java:rollWriter(588)) - Rolled WAL
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > with entries=5, filesize=0; new WAL
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> > [snip]
> > 2014-11-29 21:12:59,736 ERROR [regionserver60020-WAL.AsyncWriter]
> > wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write,
> request
> > close of hlog
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,743 WARN  [Thread-125064] hdfs.DFSClient
> > (DFSOutputStream.java:run(639)) - DataStreamer Exception
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,745 ERROR [regionserver60020.logRoller]
> > wal.ProtobufLogWriter (ProtobufLogWriter.java:writeWALTrailer(157)) - Got
> > IOException while writing trailer
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,745 ERROR [regionserver60020.logRoller] wal.FSHLog
> > (FSHLog.java:cleanupCurrentWriter(776)) - Failed close of HLog writer
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> > [snip]
> > 2014-11-29 21:12:59,746 WARN  [regionserver60020.logRoller] wal.FSHLog
> > (FSHLog.java:cleanupCurrentWriter(779)) - Riding over HLog close failure!
> > error count=2
> > 2014-11-29 21:12:59,748 INFO  [regionserver60020.logRoller] wal.FSHLog
> > (FSHLog.java:rollWriter(588)) - Rolled WAL
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> > with entries=5, filesize=0; new WAL
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
> > 2014-11-29 21:12:59,751 ERROR [regionserver60020-WAL.AsyncWriter]
> > wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write,
> request
> > close of hlog
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> >         at
> >
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
> > [snip]
> > 2014-11-29 21:12:59,767 FATAL [regionserver60020.logRoller]
> > regionserver.HRegionServer (HRegionServer.java:abort(1865)) - ABORTING
> > region server extras1.ci.local,60020,1417171049368: Failed log close in
> log
> > roller
> > org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException:
> > #1417295579737
> > [snip]
> > 2014-11-29 21:12:59,768 FATAL [regionserver60020.logRoller]
> > regionserver.HRegionServer (HRegionServer.java:abort(1873)) -
> RegionServer
> > abort: loaded coprocessors are:
> > [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> > 2014-11-29 21:13:00,522 INFO  [regionserver60020.logRoller]
> > regionserver.HRegionServer (HRegionServer.java:stop(1798)) - STOPPED:
> > Failed log close in log roller
> > [snip]
> > 2014-11-29 21:13:02,132 INFO  [regionserver60020] regionserver.Leases
> > (Leases.java:close(147)) - regionserver60020 closing leases
> > 2014-11-29 21:13:02,132 INFO  [regionserver60020] regionserver.Leases
> > (Leases.java:close(150)) - regionserver60020 closed leases
> > 2014-11-29 21:13:02,179 INFO  [regionserver60020]
> > regionserver.ReplicationSource (ReplicationSource.java:terminate(860)) -
> > Closing source Indexer_mhsaudit because: Region server is closing
> > 2014-11-29 21:13:02,179 INFO  [regionserver60020]
> > client.HConnectionManager$HConnectionImplementation
> > (HConnectionManager.java:closeZooKeeperWatcher(1837)) - Closing zookeeper
> > sessionid=0x149f0f13a300434
> > 2014-11-29 21:13:02,180 INFO  [regionserver60020] zookeeper.ZooKeeper
> > (ZooKeeper.java:close(684)) - Session: 0x149f0f13a300434 closed
> > 2014-11-29 21:13:02,272 INFO  [regionserver60020-EventThread]
> > zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
> > 2014-11-29 21:13:02,356 INFO  [regionserver60020] zookeeper.ZooKeeper
> > (ZooKeeper.java:close(684)) - Session: 0x149f0f13a300431 closed
> > 2014-11-29 21:13:02,356 INFO  [regionserver60020-EventThread]
> > zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
> > 2014-11-29 21:13:02,356 INFO  [regionserver60020]
> > regionserver.HRegionServer (HRegionServer.java:run(1058)) - stopping
> server
> > extras1.ci.local,60020,1417171049368; zookeeper connection closed.
> > 2014-11-29 21:13:02,417 INFO  [regionserver60020]
> > regionserver.HRegionServer (HRegionServer.java:run(1061)) -
> > regionserver60020 exiting
> > 2014-11-29 21:13:02,494 ERROR [main]
> regionserver.HRegionServerCommandLine
> > (HRegionServerCommandLine.java:start(70)) - Region server exiting
> > java.lang.RuntimeException: HRegionServer Aborted
> >         at
> >
> org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:66)
> >         at
> >
> org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:85)
> >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >         at
> >
> org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
> >         at
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2467)
> > 2014-11-29 21:13:02,715 INFO  [Thread-10] regionserver.ShutdownHook
> > (ShutdownHook.java:run(111)) - Shutdown hook starting;
> > hbase.shutdown.hook=true;
> >
> fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@511a1546
> > 2014-11-29 21:13:02,716 INFO  [Thread-10] regionserver.ShutdownHook
> > (ShutdownHook.java:run(120)) - Starting fs shutdown hook thread.
> > 2014-11-29 21:13:02,717 ERROR [Thread-125066] hdfs.DFSClient
> > (DFSClient.java:closeAllFilesBeingWritten(911)) - Failed to close inode
> > 32621
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> >
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579753
> > could only be replicated to 0 nodes instead of minReplication (=1).
> There
> > are 1 datanode(s) running and no node(s) are excluded in this operation.
> >
> >
> > On Mon, Dec 1, 2014 at 4:10 AM, Robert Kent <Ro...@inps.co.uk>
> > wrote:
> > > > Looks like an HDFS issue. Are you sure your HDFS is working fine?
> > >
> > > HDFS appears to be working correctly - HBase will process requests
> > > properly and everything appears to work correctly for hours/days, until
> > the
> > > regionserver randomly falls over.  If there were HDFS issues I would
> > expect
> > > to see these during normal operation, but I don't.
> > >
> >
>

Re: HBase Regionserver randomly dies

Posted by Ted Yu <yu...@gmail.com>.
There could be multiple reasons why the single datanode came to be considered
dead, e.g. the datanode came under load it couldn't handle.

I would recommend adding more datanode(s) so that the client (hbase) can ride
over a slow datanode.
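
As a rough sketch of the follow-up once the extra datanodes have registered (stock HDFS property and command; 3 is just the conventional value, tune to taste):

  # in hdfs-site.xml, raise dfs.replication (e.g. to 3) so a WAL block has
  # somewhere else to go when one datanode stalls, then bump the files that
  # are already in HDFS and wait for the re-replication to finish:
  hdfs dfs -setrep -w 3 /hbase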

Cheers

On Mon, Dec 1, 2014 at 8:21 AM, Robert Kent <Ro...@inps.co.uk> wrote:

> Sorry, those logs were from the Regionserver.
>
> The NameNode logs are:
>
> 2014-11-29 21:12:59,493 WARN  [IPC Server handler 0 on 8020]
> blockmanagement.BlockPlacementPolicy
> (BlockPlacementPolicyDefault.java:chooseTarget(313)) - Failed to place
> enough replicas, still in need of 1 to reach 1. For more information,
> please enable DEBUG log level on
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> 2014-11-29 21:12:59,494 INFO  [IPC Server handler 0 on 8020] ipc.Server
> (Server.java:run(2034)) - IPC Server handler 0 on 8020, call
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from
> 127.0.0.1:39965 Call#382010 Retry#0
> java.io.IOException: File
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> could only be replicated to 0 nodes instead of minReplication (=1).  There
> are 1 datanode(s) running and no node(s) are excluded in this operation.
> 2014-11-29 21:12:59,742 WARN  [IPC Server handler 5 on 8020]
> blockmanagement.BlockPlacementPolicy
> (BlockPlacementPolicyDefault.java:chooseTarget(313)) - Failed to place
> enough replicas, still in need of 1 to reach 1. For more information,
> please enable DEBUG log level on
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> 2014-11-29 21:12:59,742 INFO  [IPC Server handler 5 on 8020] ipc.Server
> (Server.java:run(2034)) - IPC Server handler 5 on 8020, call
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from
> 127.0.0.1:39965 Call#382017 Retry#0
> java.io.IOException: File
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
> could only be replicated to 0 nodes instead of minReplication (=1).  There
> are 1 datanode(s) running and no node(s) are excluded in this operation.
>         at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
> [snip]
>
> Then after 2014-11-29 21:13:11 there are no further exceptions.
>
>
> After having the Regionserver abort, I did not touch Hadoop.  All I did
> was restart the Regionserver and everything started working correctly again.
> ________________________________________
> From: Robert Kent
> Sent: 01 December 2014 16:17
> To: user@hbase.apache.org
> Subject: RE: HBase Regionserver randomly dies
>
> > From: Ted Yu [yuzhihong@gmail.com]
> > Sent: 01 December 2014 15:31
> > To: user@hbase.apache.org
> > Subject: Re: HBase Regionserver randomly dies
>
> > Can you check namenode log around the time 'Failed to close inode' error
> > was thrown ?
>
> > Thanks
>
> Here are the errors from the logs:
>
> 2014-11-29 21:12:59,277 WARN  [Thread-125058] hdfs.DFSClient
> (DFSOutputStream.java:run(639)) - DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> could only be replicated to 0 nodes instead of minReplication (=1).  There
> are 1 datanode(s) running and no node(s) are excluded in this operation.
>         at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027)
> [snip]
> 2014-11-29 21:12:59,306 WARN  [regionserver60020.logRoller] hdfs.DFSClient
> (DFSOutputStream.java:flushOrSync(2007)) - Error while syncing
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> could only be replicated to 0 nodes instead of minReplication (=1).  There
> are 1 datanode(s) running and no node(s) are excluded in this operation.
> [snip]
> 2014-11-29 21:12:59,359 WARN  [regionserver60020.logRoller] wal.FSHLog
> (FSHLog.java:rollWriter(566)) - pre-sync failed
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> could only be replicated to 0 nodes instead of minReplication (=1).  There
> are 1 datanode(s) running and no node(s) are excluded in this operation.
> [snip]
> 2014-11-29 21:12:59,459 INFO  [regionserver60020.logRoller] wal.FSHLog
> (FSHLog.java:rollWriter(588)) - Rolled WAL
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295456389
> with entries=23544, filesize=121.9 M; new WAL
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> 2014-11-29 21:12:59,488 ERROR [regionserver60020-WAL.AsyncWriter]
> wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request
> close of hlog
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> could only be replicated to 0 nodes instead of minReplication (=1).  There
> are 1 datanode(s) running and no node(s) are excluded in this operation.
> [snip]
> 2014-11-29 21:12:59,489 ERROR [regionserver60020-WAL.AsyncWriter]
> wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request
> close of hlog
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> could only be replicated to 0 nodes instead of minReplication (=1).  There
> are 1 datanode(s) running and no node(s) are excluded in this operation.
> [snip]
> 2014-11-29 21:12:59,490 FATAL [regionserver60020-WAL.AsyncSyncer0]
> wal.FSHLog (FSHLog.java:run(1255)) - Error while AsyncSyncer sync, request
> close of hlog
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> could only be replicated to 0 nodes instead of minReplication (=1).  There
> are 1 datanode(s) running and no node(s) are excluded in this operation.
> [snip]
> 2014-11-29 21:12:59,491 FATAL [regionserver60020-WAL.AsyncSyncer1]
> wal.FSHLog (FSHLog.java:run(1255)) - Error while AsyncSyncer sync, request
> close of hlog
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> could only be replicated to 0 nodes instead of minReplication (=1).  There
> are 1 datanode(s) running and no node(s) are excluded in this operation.
> [snip]
> 2014-11-29 21:12:59,506 ERROR [regionserver60020-WAL.AsyncWriter]
> wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request
> close of hlog
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> could only be replicated to 0 nodes instead of minReplication (=1).  There
> are 1 datanode(s) running and no node(s) are excluded in this operation.
> [snip]
> 2014-11-29 21:12:59,507 FATAL [regionserver60020-WAL.AsyncSyncer0]
> wal.FSHLog (FSHLog.java:run(1255)) - Error while AsyncSyncer sync, request
> close of hlog
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> could only be replicated to 0 nodes instead of minReplication (=1).  There
> are 1 datanode(s) running and no node(s) are excluded in this operation.
> [snip]
> 2014-11-29 21:12:59,524 WARN  [Thread-125063] hdfs.DFSClient
> (DFSOutputStream.java:run(639)) - DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> could only be replicated to 0 nodes instead of minReplication (=1).  There
> are 1 datanode(s) running and no node(s) are excluded in this operation.
> [snip]
> 2014-11-29 21:12:59,525 WARN  [regionserver60020.logRoller] hdfs.DFSClient
> (DFSOutputStream.java:flushOrSync(2007)) - Error while syncing
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> could only be replicated to 0 nodes instead of minReplication (=1).  There
> are 1 datanode(s) running and no node(s) are excluded in this operation.
> [snip]
> [snip]
> 2014-11-29 21:12:59,538 WARN  [regionserver60020.logRoller] wal.FSHLog
> (FSHLog.java:cleanupCurrentWriter(779)) - Riding over HLog close failure!
> error count=1
> 2014-11-29 21:12:59,539 INFO  [regionserver60020.logRoller] wal.FSHLog
> (FSHLog.java:rollWriter(588)) - Rolled WAL
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> with entries=5, filesize=0; new WAL
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> [snip]
> 2014-11-29 21:12:59,736 ERROR [regionserver60020-WAL.AsyncWriter]
> wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request
> close of hlog
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> could only be replicated to 0 nodes instead of minReplication (=1).  There
> are 1 datanode(s) running and no node(s) are excluded in this operation.
> [snip]
> 2014-11-29 21:12:59,743 WARN  [Thread-125064] hdfs.DFSClient
> (DFSOutputStream.java:run(639)) - DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
> could only be replicated to 0 nodes instead of minReplication (=1).  There
> are 1 datanode(s) running and no node(s) are excluded in this operation.
> [snip]
> 2014-11-29 21:12:59,745 ERROR [regionserver60020.logRoller]
> wal.ProtobufLogWriter (ProtobufLogWriter.java:writeWALTrailer(157)) - Got
> IOException while writing trailer
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> could only be replicated to 0 nodes instead of minReplication (=1).  There
> are 1 datanode(s) running and no node(s) are excluded in this operation.
> [snip]
> 2014-11-29 21:12:59,745 ERROR [regionserver60020.logRoller] wal.FSHLog
> (FSHLog.java:cleanupCurrentWriter(776)) - Failed close of HLog writer
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> could only be replicated to 0 nodes instead of minReplication (=1).  There
> are 1 datanode(s) running and no node(s) are excluded in this operation.
> [snip]
> 2014-11-29 21:12:59,746 WARN  [regionserver60020.logRoller] wal.FSHLog
> (FSHLog.java:cleanupCurrentWriter(779)) - Riding over HLog close failure!
> error count=2
> 2014-11-29 21:12:59,748 INFO  [regionserver60020.logRoller] wal.FSHLog
> (FSHLog.java:rollWriter(588)) - Rolled WAL
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> with entries=5, filesize=0; new WAL
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
> 2014-11-29 21:12:59,751 ERROR [regionserver60020-WAL.AsyncWriter]
> wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request
> close of hlog
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
> could only be replicated to 0 nodes instead of minReplication (=1).  There
> are 1 datanode(s) running and no node(s) are excluded in this operation.
>         at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
> [snip]
> 2014-11-29 21:12:59,767 FATAL [regionserver60020.logRoller]
> regionserver.HRegionServer (HRegionServer.java:abort(1865)) - ABORTING
> region server extras1.ci.local,60020,1417171049368: Failed log close in log
> roller
> org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException:
> #1417295579737
> [snip]
> 2014-11-29 21:12:59,768 FATAL [regionserver60020.logRoller]
> regionserver.HRegionServer (HRegionServer.java:abort(1873)) - RegionServer
> abort: loaded coprocessors are:
> [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> 2014-11-29 21:13:00,522 INFO  [regionserver60020.logRoller]
> regionserver.HRegionServer (HRegionServer.java:stop(1798)) - STOPPED:
> Failed log close in log roller
> [snip]
> 2014-11-29 21:13:02,132 INFO  [regionserver60020] regionserver.Leases
> (Leases.java:close(147)) - regionserver60020 closing leases
> 2014-11-29 21:13:02,132 INFO  [regionserver60020] regionserver.Leases
> (Leases.java:close(150)) - regionserver60020 closed leases
> 2014-11-29 21:13:02,179 INFO  [regionserver60020]
> regionserver.ReplicationSource (ReplicationSource.java:terminate(860)) -
> Closing source Indexer_mhsaudit because: Region server is closing
> 2014-11-29 21:13:02,179 INFO  [regionserver60020]
> client.HConnectionManager$HConnectionImplementation
> (HConnectionManager.java:closeZooKeeperWatcher(1837)) - Closing zookeeper
> sessionid=0x149f0f13a300434
> 2014-11-29 21:13:02,180 INFO  [regionserver60020] zookeeper.ZooKeeper
> (ZooKeeper.java:close(684)) - Session: 0x149f0f13a300434 closed
> 2014-11-29 21:13:02,272 INFO  [regionserver60020-EventThread]
> zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
> 2014-11-29 21:13:02,356 INFO  [regionserver60020] zookeeper.ZooKeeper
> (ZooKeeper.java:close(684)) - Session: 0x149f0f13a300431 closed
> 2014-11-29 21:13:02,356 INFO  [regionserver60020-EventThread]
> zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
> 2014-11-29 21:13:02,356 INFO  [regionserver60020]
> regionserver.HRegionServer (HRegionServer.java:run(1058)) - stopping server
> extras1.ci.local,60020,1417171049368; zookeeper connection closed.
> 2014-11-29 21:13:02,417 INFO  [regionserver60020]
> regionserver.HRegionServer (HRegionServer.java:run(1061)) -
> regionserver60020 exiting
> 2014-11-29 21:13:02,494 ERROR [main] regionserver.HRegionServerCommandLine
> (HRegionServerCommandLine.java:start(70)) - Region server exiting
> java.lang.RuntimeException: HRegionServer Aborted
>         at
> org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:66)
>         at
> org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:85)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at
> org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
>         at
> org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2467)
> 2014-11-29 21:13:02,715 INFO  [Thread-10] regionserver.ShutdownHook
> (ShutdownHook.java:run(111)) - Shutdown hook starting;
> hbase.shutdown.hook=true;
> fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@511a1546
> 2014-11-29 21:13:02,716 INFO  [Thread-10] regionserver.ShutdownHook
> (ShutdownHook.java:run(120)) - Starting fs shutdown hook thread.
> 2014-11-29 21:13:02,717 ERROR [Thread-125066] hdfs.DFSClient
> (DFSClient.java:closeAllFilesBeingWritten(911)) - Failed to close inode
> 32621
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579753
> could only be replicated to 0 nodes instead of minReplication (=1).  There
> are 1 datanode(s) running and no node(s) are excluded in this operation.
>
>
> On Mon, Dec 1, 2014 at 4:10 AM, Robert Kent <Ro...@inps.co.uk>
> wrote:
> > > Looks like an HDFS issue. Are you sure your HDFS is working fine?
> >
> > HDFS appears to be working correctly - HBase will process requests
> > properly and everything appears to work correctly for hours/days, until
> the
> > regionserver randomly falls over.  If there were HDFS issues I would
> expect
> > to see these during normal operation, but I don't.
> >
>

RE: HBase Regionserver randomly dies

Posted by Robert Kent <Ro...@inps.co.uk>.
Sorry, those logs were from the Regionserver.

The NameNode logs are:

2014-11-29 21:12:59,493 WARN  [IPC Server handler 0 on 8020] blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(313)) - Failed to place enough replicas, still in need of 1 to reach 1. For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
2014-11-29 21:12:59,494 INFO  [IPC Server handler 0 on 8020] ipc.Server (Server.java:run(2034)) - IPC Server handler 0 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:39965 Call#382010 Retry#0
java.io.IOException: File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
2014-11-29 21:12:59,742 WARN  [IPC Server handler 5 on 8020] blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(313)) - Failed to place enough replicas, still in need of 1 to reach 1. For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
2014-11-29 21:12:59,742 INFO  [IPC Server handler 5 on 8020] ipc.Server (Server.java:run(2034)) - IPC Server handler 5 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:39965 Call#382017 Retry#0
java.io.IOException: File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
[snip]

Then after 2014-11-29 21:13:11 there are no further exceptions.


After the Regionserver aborted, I did not touch Hadoop.  All I did was restart the Regionserver and everything started working correctly again.
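
As the namenode warning above suggests, DEBUG logging on BlockPlacementPolicy should show why it rejected the one live datanode (out of space, too busy, etc.). A sketch of turning it on, assuming the stock log4j.properties and the default namenode web port of 50070 on this single-node host:

  # permanent: add to the namenode's conf/log4j.properties and restart
  #   log4j.logger.org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy=DEBUG
  # or flip it at runtime without a restart:
  hadoop daemonlog -setlevel extras1.ci.local:50070 org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy DEBUG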
________________________________________
From: Robert Kent
Sent: 01 December 2014 16:17
To: user@hbase.apache.org
Subject: RE: HBase Regionserver randomly dies

> From: Ted Yu [yuzhihong@gmail.com]
> Sent: 01 December 2014 15:31
> To: user@hbase.apache.org
> Subject: Re: HBase Regionserver randomly dies

> Can you check namenode log around the time 'Failed to close inode' error
> was thrown ?

> Thanks

Here are the errors from the logs:

2014-11-29 21:12:59,277 WARN  [Thread-125058] hdfs.DFSClient (DFSOutputStream.java:run(639)) - DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027)
[snip]
2014-11-29 21:12:59,306 WARN  [regionserver60020.logRoller] hdfs.DFSClient (DFSOutputStream.java:flushOrSync(2007)) - Error while syncing
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,359 WARN  [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:rollWriter(566)) - pre-sync failed
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,459 INFO  [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:rollWriter(588)) - Rolled WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295456389 with entries=23544, filesize=121.9 M; new WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
2014-11-29 21:12:59,488 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,489 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,490 FATAL [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog (FSHLog.java:run(1255)) - Error while AsyncSyncer sync, request close of hlog
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,491 FATAL [regionserver60020-WAL.AsyncSyncer1] wal.FSHLog (FSHLog.java:run(1255)) - Error while AsyncSyncer sync, request close of hlog
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,506 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,507 FATAL [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog (FSHLog.java:run(1255)) - Error while AsyncSyncer sync, request close of hlog
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,524 WARN  [Thread-125063] hdfs.DFSClient (DFSOutputStream.java:run(639)) - DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,525 WARN  [regionserver60020.logRoller] hdfs.DFSClient (DFSOutputStream.java:flushOrSync(2007)) - Error while syncing
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
[snip]
2014-11-29 21:12:59,538 WARN  [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:cleanupCurrentWriter(779)) - Riding over HLog close failure! error count=1
2014-11-29 21:12:59,539 INFO  [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:rollWriter(588)) - Rolled WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 with entries=5, filesize=0; new WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
[snip]
2014-11-29 21:12:59,736 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,743 WARN  [Thread-125064] hdfs.DFSClient (DFSOutputStream.java:run(639)) - DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,745 ERROR [regionserver60020.logRoller] wal.ProtobufLogWriter (ProtobufLogWriter.java:writeWALTrailer(157)) - Got IOException while writing trailer
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,745 ERROR [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:cleanupCurrentWriter(776)) - Failed close of HLog writer
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,746 WARN  [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:cleanupCurrentWriter(779)) - Riding over HLog close failure! error count=2
2014-11-29 21:12:59,748 INFO  [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:rollWriter(588)) - Rolled WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 with entries=5, filesize=0; new WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
2014-11-29 21:12:59,751 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
[snip]
2014-11-29 21:12:59,767 FATAL [regionserver60020.logRoller] regionserver.HRegionServer (HRegionServer.java:abort(1865)) - ABORTING region server extras1.ci.local,60020,1417171049368: Failed log close in log roller
org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: #1417295579737
[snip]
2014-11-29 21:12:59,768 FATAL [regionserver60020.logRoller] regionserver.HRegionServer (HRegionServer.java:abort(1873)) - RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
2014-11-29 21:13:00,522 INFO  [regionserver60020.logRoller] regionserver.HRegionServer (HRegionServer.java:stop(1798)) - STOPPED: Failed log close in log roller
[snip]
2014-11-29 21:13:02,132 INFO  [regionserver60020] regionserver.Leases (Leases.java:close(147)) - regionserver60020 closing leases
2014-11-29 21:13:02,132 INFO  [regionserver60020] regionserver.Leases (Leases.java:close(150)) - regionserver60020 closed leases
2014-11-29 21:13:02,179 INFO  [regionserver60020] regionserver.ReplicationSource (ReplicationSource.java:terminate(860)) - Closing source Indexer_mhsaudit because: Region server is closing
2014-11-29 21:13:02,179 INFO  [regionserver60020] client.HConnectionManager$HConnectionImplementation (HConnectionManager.java:closeZooKeeperWatcher(1837)) - Closing zookeeper sessionid=0x149f0f13a300434
2014-11-29 21:13:02,180 INFO  [regionserver60020] zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x149f0f13a300434 closed
2014-11-29 21:13:02,272 INFO  [regionserver60020-EventThread] zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
2014-11-29 21:13:02,356 INFO  [regionserver60020] zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x149f0f13a300431 closed
2014-11-29 21:13:02,356 INFO  [regionserver60020-EventThread] zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
2014-11-29 21:13:02,356 INFO  [regionserver60020] regionserver.HRegionServer (HRegionServer.java:run(1058)) - stopping server extras1.ci.local,60020,1417171049368; zookeeper connection closed.
2014-11-29 21:13:02,417 INFO  [regionserver60020] regionserver.HRegionServer (HRegionServer.java:run(1061)) - regionserver60020 exiting
2014-11-29 21:13:02,494 ERROR [main] regionserver.HRegionServerCommandLine (HRegionServerCommandLine.java:start(70)) - Region server exiting
java.lang.RuntimeException: HRegionServer Aborted
        at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:66)
        at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:85)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2467)
2014-11-29 21:13:02,715 INFO  [Thread-10] regionserver.ShutdownHook (ShutdownHook.java:run(111)) - Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@511a1546
2014-11-29 21:13:02,716 INFO  [Thread-10] regionserver.ShutdownHook (ShutdownHook.java:run(120)) - Starting fs shutdown hook thread.
2014-11-29 21:13:02,717 ERROR [Thread-125066] hdfs.DFSClient (DFSClient.java:closeAllFilesBeingWritten(911)) - Failed to close inode 32621
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579753 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.


On Mon, Dec 1, 2014 at 4:10 AM, Robert Kent <Ro...@inps.co.uk> wrote:
> > Looks like an HDFS issue. Are you sure your HDFS is working fine?
>
> HDFS appears to be working correctly - HBase will process requests
> properly and everything appears to work correctly for hours/days, until the
> regionserver randomly falls over.  If there were HDFS issues I would expect
> to see these during normal operation, but I don't.
>

RE: HBase Regionserver randomly dies

Posted by Robert Kent <Ro...@inps.co.uk>.
> From: Ted Yu [yuzhihong@gmail.com]
> Sent: 01 December 2014 15:31
> To: user@hbase.apache.org
> Subject: Re: HBase Regionserver randomly dies

> Can you check namenode log around the time 'Failed to close inode' error
> was thrown ?

> Thanks

Here are the errors from the logs:

2014-11-29 21:12:59,277 WARN  [Thread-125058] hdfs.DFSClient (DFSOutputStream.java:run(639)) - DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027)
[snip]
2014-11-29 21:12:59,306 WARN  [regionserver60020.logRoller] hdfs.DFSClient (DFSOutputStream.java:flushOrSync(2007)) - Error while syncing
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,359 WARN  [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:rollWriter(566)) - pre-sync failed
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,459 INFO  [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:rollWriter(588)) - Rolled WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295456389 with entries=23544, filesize=121.9 M; new WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
2014-11-29 21:12:59,488 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,489 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,490 FATAL [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog (FSHLog.java:run(1255)) - Error while AsyncSyncer sync, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,491 FATAL [regionserver60020-WAL.AsyncSyncer1] wal.FSHLog (FSHLog.java:run(1255)) - Error while AsyncSyncer sync, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,506 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,507 FATAL [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog (FSHLog.java:run(1255)) - Error while AsyncSyncer sync, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,524 WARN  [Thread-125063] hdfs.DFSClient (DFSOutputStream.java:run(639)) - DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,525 WARN  [regionserver60020.logRoller] hdfs.DFSClient (DFSOutputStream.java:flushOrSync(2007)) - Error while syncing
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
[snip]
2014-11-29 21:12:59,538 WARN  [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:cleanupCurrentWriter(779)) - Riding over HLog close failure! error count=1
2014-11-29 21:12:59,539 INFO  [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:rollWriter(588)) - Rolled WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 with entries=5, filesize=0; new WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
[snip]
2014-11-29 21:12:59,736 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,743 WARN  [Thread-125064] hdfs.DFSClient (DFSOutputStream.java:run(639)) - DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,745 ERROR [regionserver60020.logRoller] wal.ProtobufLogWriter (ProtobufLogWriter.java:writeWALTrailer(157)) - Got IOException while writing trailer
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,745 ERROR [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:cleanupCurrentWriter(776)) - Failed close of HLog writer
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,746 WARN  [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:cleanupCurrentWriter(779)) - Riding over HLog close failure! error count=2
2014-11-29 21:12:59,748 INFO  [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:rollWriter(588)) - Rolled WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 with entries=5, filesize=0; new WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
2014-11-29 21:12:59,751 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
[snip]
2014-11-29 21:12:59,767 FATAL [regionserver60020.logRoller] regionserver.HRegionServer (HRegionServer.java:abort(1865)) - ABORTING region server extras1.ci.local,60020,1417171049368: Failed log close in log roller
org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: #1417295579737
[snip]
2014-11-29 21:12:59,768 FATAL [regionserver60020.logRoller] regionserver.HRegionServer (HRegionServer.java:abort(1873)) - RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
2014-11-29 21:13:00,522 INFO  [regionserver60020.logRoller] regionserver.HRegionServer (HRegionServer.java:stop(1798)) - STOPPED: Failed log close in log roller
[snip]
2014-11-29 21:13:02,132 INFO  [regionserver60020] regionserver.Leases (Leases.java:close(147)) - regionserver60020 closing leases
2014-11-29 21:13:02,132 INFO  [regionserver60020] regionserver.Leases (Leases.java:close(150)) - regionserver60020 closed leases
2014-11-29 21:13:02,179 INFO  [regionserver60020] regionserver.ReplicationSource (ReplicationSource.java:terminate(860)) - Closing source Indexer_mhsaudit because: Region server is closing
2014-11-29 21:13:02,179 INFO  [regionserver60020] client.HConnectionManager$HConnectionImplementation (HConnectionManager.java:closeZooKeeperWatcher(1837)) - Closing zookeeper sessionid=0x149f0f13a300434
2014-11-29 21:13:02,180 INFO  [regionserver60020] zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x149f0f13a300434 closed
2014-11-29 21:13:02,272 INFO  [regionserver60020-EventThread] zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
2014-11-29 21:13:02,356 INFO  [regionserver60020] zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x149f0f13a300431 closed
2014-11-29 21:13:02,356 INFO  [regionserver60020-EventThread] zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
2014-11-29 21:13:02,356 INFO  [regionserver60020] regionserver.HRegionServer (HRegionServer.java:run(1058)) - stopping server extras1.ci.local,60020,1417171049368; zookeeper connection closed.
2014-11-29 21:13:02,417 INFO  [regionserver60020] regionserver.HRegionServer (HRegionServer.java:run(1061)) - regionserver60020 exiting
2014-11-29 21:13:02,494 ERROR [main] regionserver.HRegionServerCommandLine (HRegionServerCommandLine.java:start(70)) - Region server exiting
java.lang.RuntimeException: HRegionServer Aborted
        at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:66)
        at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:85)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2467)
2014-11-29 21:13:02,715 INFO  [Thread-10] regionserver.ShutdownHook (ShutdownHook.java:run(111)) - Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@511a1546
2014-11-29 21:13:02,716 INFO  [Thread-10] regionserver.ShutdownHook (ShutdownHook.java:run(120)) - Starting fs shutdown hook thread.
2014-11-29 21:13:02,717 ERROR [Thread-125066] hdfs.DFSClient (DFSClient.java:closeAllFilesBeingWritten(911)) - Failed to close inode 32621
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579753 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
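
That error is the NameNode reporting that it could not find any DataNode able to accept a new block for the WAL file; with a single DataNode the usual suspects are that node running out of usable disk space (dfs.datanode.du.reserved counts against what it can offer), running out of transfer threads, or having stopped heartbeating to the NameNode. A minimal sketch of how to dump the NameNode's view of that DataNode via the Hadoop 2.5 client API (the class name and output format are made up for illustration; it assumes the cluster's core-site.xml/hdfs-site.xml are on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DatanodeReport {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IllegalStateException("fs.defaultFS does not point at HDFS: " + fs.getUri());
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;

    // Roughly the per-DataNode figures `hdfs dfsadmin -report` prints:
    // capacity, remaining space and active transfer threads, as seen by the NameNode.
    for (DatanodeInfo dn : dfs.getDataNodeStats()) {
      System.out.printf("%s remaining=%d MB of %d MB, xceivers=%d%n",
          dn.getHostName(),
          dn.getRemaining() / (1024 * 1024),
          dn.getCapacity() / (1024 * 1024),
          dn.getXceiverCount());
    }
  }
}

If "remaining" on that one DataNode is at or near zero around 21:12, that alone would explain the failed block allocation, the failed WAL roll and the abort that follows.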


On Mon, Dec 1, 2014 at 4:10 AM, Robert Kent <Ro...@inps.co.uk> wrote:
> > Looks like an HDFS issue. Are you sure your HDFS is working fine?
>
> HDFS appears to be working correctly - HBase will process requests
> properly and everything appears to work correctly for hours/days, until the
> regionserver randomly falls over.  If there were HDFS issues I would expect
> to see these during normal operation, but I don't.
>

Re: HBase Regionserver randomly dies

Posted by Ted Yu <yu...@gmail.com>.
Can you check the namenode log around the time the 'Failed to close inode' error
was thrown?

Thanks

On Mon, Dec 1, 2014 at 4:10 AM, Robert Kent <Ro...@inps.co.uk> wrote:

> > Looks like an HDFS issue. Are you sure your HDFS is working fine?
>
> HDFS appears to be working correctly - HBase will process requests
> properly and everything appears to work correctly for hours/days, until the
> regionserver randomly falls over.  If there were HDFS issues I would expect
> to see these during normal operation, but I don't.
>

RE: HBase Regionserver randomly dies

Posted by Robert Kent <Ro...@inps.co.uk>.
> Looks like an HDFS issue. Are you sure your HDFS is working fine?

HDFS appears to be working correctly - HBase will process requests properly and everything appears to work correctly for hours/days, until the regionserver randomly falls over.  If there were HDFS issues I would expect to see these during normal operation, but I don't.
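
One way to test that assumption would be to leave a small standalone write probe running next to the region server: if plain HDFS creates start failing in the same window, HBase is just the first victim rather than the cause. A rough sketch (the class name and the /tmp probe path are hypothetical; it assumes the Hadoop 2.5 client jars and the cluster's core-site.xml/hdfs-site.xml are on the classpath):

import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    byte[] payload = new byte[64 * 1024];

    while (true) {
      Path probe = new Path("/tmp/hdfs-write-probe-" + System.currentTimeMillis());
      try {
        // create/write/close with replication 1 exercises the same block
        // allocation and pipeline setup that a WAL roll needs.
        FSDataOutputStream out = fs.create(probe, (short) 1);
        out.write(payload);
        out.close();
        fs.delete(probe, false);
      } catch (Exception e) {
        System.err.println(new Date() + " probe write failed: " + e);
      }
      Thread.sleep(60000L); // once a minute
    }
  }
}

If the probe starts failing at the same moment the region server aborts, the DataNode log and disk usage on that host would be the next place to look rather than HBase itself.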

Re: HBase Regionserver randomly dies

Posted by Bharath Vissapragada <bh...@cloudera.com>.
Looks like an HDFS issue. Are you sure your HDFS is working fine?