Posted to dev@hbase.apache.org by "Matteo Bertozzi (JIRA)" <ji...@apache.org> on 2015/12/21 17:15:46 UTC

[jira] [Created] (HBASE-15019) Replication stuck when HDFS is restarted

Matteo Bertozzi created HBASE-15019:
---------------------------------------

             Summary: Replication stuck when HDFS is restarted
                 Key: HBASE-15019
                 URL: https://issues.apache.org/jira/browse/HBASE-15019
             Project: HBase
          Issue Type: Bug
          Components: Replication, wal
    Affects Versions: 0.98.16.1, 1.0.3, 1.1.2, 2.0.0, 1.2.0
            Reporter: Matteo Bertozzi
            Assignee: Matteo Bertozzi


The RS is working normally and writing to the WAL.
HDFS is killed and restarted, and the RS tries to do a roll.
The close fails, but the roll succeeds (because HDFS is up again) and everything keeps working.
{noformat}
2015-12-11 21:52:28,058 ERROR org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter: Got IOException while writing trailer
java.io.IOException: All datanodes 10.51.30.152:50010 are bad. Aborting...
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1147)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:945)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:496)
2015-12-11 21:52:28,059 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Failed close of HLog writer
java.io.IOException: All datanodes 10.51.30.152:50010 are bad. Aborting...
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1147)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:945)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:496)
2015-12-11 21:52:28,059 WARN org.apache.hadoop.hbase.regionserver.wal.FSHLog: Riding over HLog close failure! error count=1
{noformat}
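
For context, a rough sketch of the roll path described above; the method shape and the closeErrorCount/closeErrorsTolerated names are approximations for illustration, not the actual FSHLog source:
{code:java}
// Illustrative sketch only (names approximate, not the actual FSHLog source):
// how a roll can "ride over" a failed close of the previous writer.
private Writer rollWriter(final Writer oldWriter, final Writer nextWriter,
    final int closeErrorsTolerated) throws IOException {
  // nextWriter was already created successfully because HDFS is back up.
  try {
    if (oldWriter != null) {
      oldWriter.close();                 // fails here: "All datanodes ... are bad. Aborting..."
    }
  } catch (IOException ioe) {
    int errors = this.closeErrorCount.incrementAndGet();
    if (errors > closeErrorsTolerated) {
      throw ioe;                         // too many close failures: fail the roll
    }
    LOG.warn("Riding over HLog close failure! error count=" + errors);
  }
  return nextWriter;                     // writes continue on the new WAL, but the lease
                                         // on the old, unclosed file is never recovered
}
{code}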

The problem is on the replication side: the log we rolled but were not able to close
is waiting for a lease recovery.
{noformat}
2015-12-11 21:16:31,909 ERROR org.apache.hadoop.hbase.regionserver.wal.HLogFactory: Can't open after 267 attempts and 301124ms 
{noformat}
The WALFactory notifies us about that, but there is nothing on the RS side that performs the WAL recovery.
{noformat}
2015-12-11 21:11:30,921 WARN org.apache.hadoop.hbase.regionserver.wal.HLogFactory: Lease should have recovered. This is not expected. Will retry
java.io.IOException: Cannot obtain block length for LocatedBlock{BP-1547065147-10.51.30.152-1446756937665:blk_1073801614_61243; getBlockSize()=83; corrupt=false; offset=0; locs=[10.51.30.154:50010, 10.51.30.152:50010, 10.51.30.155:50010]}
  at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:358)
  at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:300)
  at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:237)
  at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:230)
  at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1448)
  at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:301)
  at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:297)
  at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
  at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:297)
  at org.apache.hadoop.fs.FilterFileSystem.open(FilterFileSystem.java:161)
  at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
  at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:116)
  at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:89)
  at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:77)
  at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:68)
  at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:508)
  at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:321)
{noformat}
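
The retry in the stack trace above roughly has this shape (a simplified sketch; openReaderWithRetries, createReader and LOG are illustrative placeholders, not the exact HLogFactory/ReplicationSource code). Nothing in the loop ever asks HDFS to recover the lease, so replication stays stuck on that WAL:
{code:java}
// Simplified sketch (illustrative names, not the exact HLogFactory/ReplicationSource code):
// the reader open is retried indefinitely, but no lease recovery is ever attempted here.
private Reader openReaderWithRetries(final FileSystem fs, final Path walPath,
    final long retryIntervalMs) throws InterruptedException {
  while (true) {
    try {
      return createReader(fs.open(walPath));  // throws "Cannot obtain block length for LocatedBlock..."
    } catch (IOException ioe) {
      LOG.warn("Lease should have recovered. This is not expected. Will retry", ioe);
      Thread.sleep(retryIntervalMs);           // just wait and retry; the WAL stays unreadable
    }
  }
}
{code}
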
The only way to trigger a WAL recovery is to restart the RS, which forces the master to trigger the lease recovery as part of the WAL split. 

Since we know that the RS is still running, should we try to recover the lease on the RS side?
Or is it better/safer to trigger an abort on the RS, so that only the master does lease recovery?
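
For what it's worth, a minimal sketch of the first option, assuming we could re-use the existing FSHDFSUtils.recoverFileLease() helper (the same lease-recovery loop the master uses during WAL splitting) from the replication source's reader-open path; the WalLeaseRecovery class and method name below are made up for illustration:
{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.util.FSHDFSUtils;

/**
 * Hypothetical helper: when the replication source keeps failing with
 * "Cannot obtain block length", ask HDFS to recover the lease on the WAL
 * so its last block gets finalized and the reader can finally be opened.
 */
public final class WalLeaseRecovery {
  private WalLeaseRecovery() {}

  public static void recoverLeaseOnWal(FileSystem fs, Path wal, Configuration conf)
      throws IOException {
    // FSHDFSUtils.recoverFileLease() loops on DistributedFileSystem.recoverLease()
    // until the file is closed, which is what the master does on WAL split.
    // fs is expected to be the HDFS (DistributedFileSystem) instance backing the WAL;
    // the null argument skips the optional cancellation/progress callback.
    FSHDFSUtils.recoverFileLease(fs, wal, conf, null);
  }
}
{code}
The replication source could call something like this after a few failed attempts to open the reader; the open question above remains whether it is safe for the RS to recover the lease on a file that its own (broken) writer stream may still hold, or whether aborting and letting the master do it is the safer route.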


