Posted to user@hbase.apache.org by Oussama Jilal <ji...@gmail.com> on 2016/07/28 09:59:14 UTC

Master stopping automatically after a few seconds

Hello,

We are facing a serious issue with our production system. We are on a
Windows Azure infrastructure and yesterday, unexpectedly, all of our VMs
restarted... this caused our HBase cluster (1 Master + 2 Region Servers) to
crash...

Now whenever I try to start our HBase cluster, it goes down again after a
few seconds. I checked the HBase logs and this is what I found in the
master logs:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2016-07-28 09:33:10,052 WARN  [main-EventThread]
coordination.SplitLogManagerCoordination: Error splitting
/hbase/splitWAL/WALs%2Fdatanode-2%2C16020%2C1466263181091-splitting%2Fdatanode-2%252C16020%252C1466263181
091.default.1469654596681
2016-07-28 09:33:10,052 WARN  [MASTER_SERVER_OPERATIONS-NameNode:16000-3]
master.SplitLogManager: error while splitting logs in
[hdfs://namenode/hbase/WALs/datanode-2,16020,1466263181091-splitting]
installed = 1
but only 0 done
2016-07-28 09:33:10,053 ERROR [MASTER_SERVER_OPERATIONS-NameNode:16000-3]
executor.EventHandler: Caught throwable while processing event
M_SERVER_SHUTDOWN
java.io.IOException: failed log splitting for
datanode-2,16020,1466263181091, will retry
        at
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.resubmit(ServerShutdownHandler.java:357)
        at
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:220)
        at
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error or interrupted while splitting logs
in [hdfs://namenode/hbase/WALs/datanode-2,16020,1466263181091-splitting]
Task = installed = 1 done = 0 error = 1
        at
org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:290)
        at
org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:391)
        at
org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:364)
        at
org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:286)
        at
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:213)
        ... 4 more
2016-07-28 09:33:10,055 FATAL [MASTER_SERVER_OPERATIONS-NameNode:16000-3]
master.HMaster: Master server abort: loaded coprocessors are: []
2016-07-28 09:33:10,055 FATAL [MASTER_SERVER_OPERATIONS-NameNode:16000-3]
master.HMaster: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: failed log splitting for
datanode-2,16020,1466263181091, will retry
        at
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.resubmit(ServerShutdownHandler.java:357)
        at
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:220)
        at
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error or interrupted while splitting logs
in [hdfs://namenode/hbase/WALs/datanode-2,16020,1466263181091-splitting]
Task = installed = 1 done = 0 error = 1
        at
org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:290)
        at
org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:391)
        at
org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:364)
        at
org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:286)
        at
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:213)
        ... 4 more
2016-07-28 09:33:10,055 INFO  [MASTER_SERVER_OPERATIONS-NameNode:16000-3]
regionserver.HRegionServer: STOPPED: Caught throwable while processing
event M_SERVER_SHUTDOWN
2016-07-28 09:33:10,055 ERROR [MASTER_SERVER_OPERATIONS-NameNode:16000-2]
executor.EventHandler: Caught throwable while processing event
M_SERVER_SHUTDOWN
java.io.IOException: Server is stopped
        at
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:194)
        at
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

And this is from one of the Region Server logs:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2016-07-28 09:33:10,033 WARN  [RS_LOG_REPLAY_OPS-datanode-2:16020-0]
regionserver.SplitLogWorker: log splitting of
WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681
failed, returning error
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block:
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121
file=/hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681
        at
org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:882)
        at
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:563)
        at
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:793)
        at
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
        at java.io.DataInputStream.read(DataInputStream.java:100)
        at
org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:298)
        at
org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:267)
        at
org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:839)
        at
org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:763)
        at
org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
        at
org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:235)
        at
org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:104)
        at
org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
        at
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Now the file
"/hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681"
does exist on the DFS, and "hdfs fsck /" reports "The filesystem under path
'/' is HEALTHY". But when I try to access the file, bad things happen:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
16/07/28 09:57:30 WARN hdfs.DFSClient: Found Checksum error for
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from
DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
at 5120
16/07/28 09:57:30 WARN hdfs.DFSClient: Found Checksum error for
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]
at 5120
16/07/28 09:57:30 INFO hdfs.DFSClient: Could not obtain
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from any node:
java.io.IOException: No live nodes contain block
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 after checking
nodes = [DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK],
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]],
ignoredNodes = null No live nodes contain current block Block locations:
DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]
Dead nodes:  DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK].
Will get new block locations from namenode and retry...
16/07/28 09:57:30 WARN hdfs.DFSClient: DFS chooseDataNode: got # 1
IOException, will wait for 2671.5631767728437 msec.
16/07/28 09:57:33 WARN hdfs.DFSClient: Found Checksum error for
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from
DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
at 5120
16/07/28 09:57:33 WARN hdfs.DFSClient: Found Checksum error for
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]
at 5120
16/07/28 09:57:33 INFO hdfs.DFSClient: Could not obtain
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from any node:
java.io.IOException: No live nodes contain block
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 after checking
nodes = [DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK],
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]],
ignoredNodes = null No live nodes contain current block Block locations:
DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]
Dead nodes:  DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK].
Will get new block locations from namenode and retry...
16/07/28 09:57:33 WARN hdfs.DFSClient: DFS chooseDataNode: got # 2
IOException, will wait for 8977.03196466135 msec.
16/07/28 09:57:42 WARN hdfs.DFSClient: Found Checksum error for
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from
DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
at 5120
16/07/28 09:57:42 WARN hdfs.DFSClient: Found Checksum error for
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]
at 5120
16/07/28 09:57:42 INFO hdfs.DFSClient: Could not obtain
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from any node:
java.io.IOException: No live nodes contain block
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 after checking
nodes = [DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK],
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]],
ignoredNodes = null No live nodes contain current block Block locations:
DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]
Dead nodes:  DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK].
Will get new block locations from namenode and retry...
16/07/28 09:57:42 WARN hdfs.DFSClient: DFS chooseDataNode: got # 3
IOException, will wait for 13504.266287461855 msec.
cat: Checksum error:
/hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681
at 5120 exp: 123018583 got: 48747596
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
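
A side note for anyone debugging something similar: "hdfs fsck /" only
checks block metadata at the NameNode (replication, missing blocks); it
does not read the block data itself, so it can report HEALTHY even when
every replica of a block fails its on-disk checksum, as happened here. Two
standard commands that narrow this down, using the same WAL path as in the
logs above, are roughly:

# list the file's blocks and which datanodes hold the replicas
hdfs fsck /hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681 -files -blocks -locations

# force a full read of the file; a corrupt replica surfaces as a checksum error
hadoop fs -cat /hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681 > /dev/null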

Any help recovering our HBase cluster would be appreciated...

Re: Master stopping automatically after a few seconds

Posted by Oussama Jilal <ji...@gmail.com>.
In case anyone else faces the same issue... I managed to fix it by
running:

hadoop fs -get -ignoreCrc
hdfs://namenode/hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681
datanode-2,16020,1466263181091.default.1469654596681

then:

hadoop fs -rm
/hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681

and then:

hadoop fs -put datanode-2,16020,1466263181091.default.1469654596681
/hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681

And finally starting HBase again... which basically means pulling the
problematic file out of HDFS (ignoring the checksum) and then putting it
back, so that HDFS regenerates its checksums! Phew! (It might be risky...
but it worked...)
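
Spelled out as one sequence (same commands as above; "wal-backup" is just
an arbitrary name for the local copy, and the long WAL path is put in a
shell variable for readability), the workaround looks roughly like this:

WAL=/hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681

# 1. copy the corrupt WAL to local disk, skipping checksum verification
hadoop fs -get -ignoreCrc "$WAL" wal-backup

# 2. delete the corrupt copy (and its bad checksum metadata) from HDFS
hadoop fs -rm "$WAL"

# 3. write the local copy back; HDFS computes fresh checksums on the put
hadoop fs -put wal-backup "$WAL"

# Note: any bytes that were actually corrupted in the failing chunk are kept
# as-is, which is why this is risky; the put only makes the file readable again.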

Best Regards.


Re: File Checksum Error

Posted by Oussama Jilal <ji...@gmail.com>.
In case anyone else faces the same issue... I managed to fix it by
running:

hadoop fs -get -ignoreCrc
hdfs://namenode/hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681
datanode-2,16020,1466263181091.default.1469654596681

then:

hadoop fs -rm
/hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681

and then:

hadoop fs -put datanode-2,16020,1466263181091.default.1469654596681
/hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681

And finally starting HBase again... which basically means pulling the
problematic file out of HDFS (ignoring the checksum) and then putting it
back, so that HDFS regenerates its checksums! Phew! (It might be risky...
but it worked...)
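
A possible sanity check before restarting HBase is to read the re-uploaded
file back once with checksum verification enabled (a plain -get without
-ignoreCrc) and to fsck just that path; "/tmp/wal-verify" below is an
arbitrary throwaway local path:

# a normal get verifies the freshly written checksums while copying
hadoop fs -get /hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681 /tmp/wal-verify && rm /tmp/wal-verify

# confirm the file's blocks are reported healthy and replicated
hdfs fsck /hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681 -files -blocks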

Best Regards.


File Checksum Error

Posted by Oussama Jilal <ji...@gmail.com>.
Hello,

This is originally an HBase issue, but please allow me to forward it to the
hadoop mailing list since the error is in the checksum of a file in HDFS...

We are facing a serious issue with our production system. We are on a
Windows Azure infrastructure and yesterday, unexpectedly, all of our VMs
restarted... this caused our HBase cluster (1 Master + 2 Region Servers) to
crash...

Now whenever I try to start our HBase cluster, it goes down again after a
few seconds. I checked the HBase logs and this is what I found in the
master logs:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2016-07-28 09:33:10,052 WARN  [main-EventThread]
coordination.SplitLogManagerCoordination: Error splitting
/hbase/splitWAL/WALs%2Fdatanode-2%2C16020%2C1466263181091-splitting%2Fdatanode-2%252C16020%252C1466263181
091.default.1469654596681
2016-07-28 09:33:10,052 WARN  [MASTER_SERVER_OPERATIONS-NameNode:16000-3]
master.SplitLogManager: error while splitting logs in
[hdfs://namenode/hbase/WALs/datanode-2,16020,1466263181091-splitting]
installed = 1
but only 0 done
2016-07-28 09:33:10,053 ERROR [MASTER_SERVER_OPERATIONS-NameNode:16000-3]
executor.EventHandler: Caught throwable while processing event
M_SERVER_SHUTDOWN
java.io.IOException: failed log splitting for
datanode-2,16020,1466263181091, will retry
        at
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.resubmit(ServerShutdownHandler.java:357)
        at
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:220)
        at
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error or interrupted while splitting logs
in [hdfs://namenode/hbase/WALs/datanode-2,16020,1466263181091-splitting]
Task = installed = 1 done = 0 error = 1
        at
org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:290)
        at
org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:391)
        at
org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:364)
        at
org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:286)
        at
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:213)
        ... 4 more
2016-07-28 09:33:10,055 FATAL [MASTER_SERVER_OPERATIONS-NameNode:16000-3]
master.HMaster: Master server abort: loaded coprocessors are: []
2016-07-28 09:33:10,055 FATAL [MASTER_SERVER_OPERATIONS-NameNode:16000-3]
master.HMaster: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: failed log splitting for
datanode-2,16020,1466263181091, will retry
        at
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.resubmit(ServerShutdownHandler.java:357)
        at
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:220)
        at
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error or interrupted while splitting logs
in [hdfs://namenode/hbase/WALs/datanode-2,16020,1466263181091-splitting]
Task = installed = 1 done = 0 error = 1
        at
org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:290)
        at
org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:391)
        at
org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:364)
        at
org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:286)
        at
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:213)
        ... 4 more
2016-07-28 09:33:10,055 INFO  [MASTER_SERVER_OPERATIONS-NameNode:16000-3]
regionserver.HRegionServer: STOPPED: Caught throwable while processing
event M_SERVER_SHUTDOWN
2016-07-28 09:33:10,055 ERROR [MASTER_SERVER_OPERATIONS-NameNode:16000-2]
executor.EventHandler: Caught throwable while processing event
M_SERVER_SHUTDOWN
java.io.IOException: Server is stopped
        at
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:194)
        at
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

And this is from one of the Region Server logs:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2016-07-28 09:33:10,033 WARN  [RS_LOG_REPLAY_OPS-datanode-2:16020-0]
regionserver.SplitLogWorker: log splitting of
WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681
failed, returning error
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block:
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121
file=/hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681
        at
org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:882)
        at
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:563)
        at
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:793)
        at
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
        at java.io.DataInputStream.read(DataInputStream.java:100)
        at
org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:298)
        at
org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:267)
        at
org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:839)
        at
org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:763)
        at
org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
        at
org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:235)
        at
org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:104)
        at
org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
        at
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Now the file
"/hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681"
does exist on the DFS, and "hdfs fsck /" reports "The filesystem under path
'/' is HEALTHY". But when I try to access the file, bad things happen:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
16/07/28 09:57:30 WARN hdfs.DFSClient: Found Checksum error for
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from
DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
at 5120
16/07/28 09:57:30 WARN hdfs.DFSClient: Found Checksum error for
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]
at 5120
16/07/28 09:57:30 INFO hdfs.DFSClient: Could not obtain
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from any node:
java.io.IOException: No live nodes contain block
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 after checking
nodes = [DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK],
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]],
ignoredNodes = null No live nodes contain current block Block locations:
DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]
Dead nodes:  DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK].
Will get new block locations from namenode and retry...
16/07/28 09:57:30 WARN hdfs.DFSClient: DFS chooseDataNode: got # 1
IOException, will wait for 2671.5631767728437 msec.
16/07/28 09:57:33 WARN hdfs.DFSClient: Found Checksum error for
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from
DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
at 5120
16/07/28 09:57:33 WARN hdfs.DFSClient: Found Checksum error for
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]
at 5120
16/07/28 09:57:33 INFO hdfs.DFSClient: Could not obtain
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from any node:
java.io.IOException: No live nodes contain block
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 after checking
nodes = [DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK],
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]],
ignoredNodes = null No live nodes contain current block Block locations:
DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]
Dead nodes:  DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK].
Will get new block locations from namenode and retry...
16/07/28 09:57:33 WARN hdfs.DFSClient: DFS chooseDataNode: got # 2
IOException, will wait for 8977.03196466135 msec.
16/07/28 09:57:42 WARN hdfs.DFSClient: Found Checksum error for
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from
DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
at 5120
16/07/28 09:57:42 WARN hdfs.DFSClient: Found Checksum error for
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]
at 5120
16/07/28 09:57:42 INFO hdfs.DFSClient: Could not obtain
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from any node:
java.io.IOException: No live nodes contain block
BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 after checking
nodes = [DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK],
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]],
ignoredNodes = null No live nodes contain current block Block locations:
DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]
Dead nodes:  DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK].
Will get new block locations from namenode and retry...
16/07/28 09:57:42 WARN hdfs.DFSClient: DFS chooseDataNode: got # 3
IOException, will wait for 13504.266287461855 msec.
cat: Checksum error:
/hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681
at 5120 exp: 123018583 got: 48747596
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Any help recovering our HBase cluster would be appreciated...