Posted to dev@hbase.apache.org by "Mingjie Lai (JIRA)" <ji...@apache.org> on 2015/06/15 21:02:01 UTC

[jira] [Resolved] (HBASE-13782) RS stuck after FATAL ``FSHLog: Could not append.''

     [ https://issues.apache.org/jira/browse/HBASE-13782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mingjie Lai resolved HBASE-13782.
---------------------------------
    Resolution: Cannot Reproduce

> RS stuck after FATAL ``FSHLog: Could not append.''
> --------------------------------------------------
>
>                 Key: HBASE-13782
>                 URL: https://issues.apache.org/jira/browse/HBASE-13782
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 1.0.1
>         Environment: hbase version: 1.0.0-cdh5.4.0
> hadoop version: 2.6.0-cdh5.4.0 
>            Reporter: Mingjie Lai
>            Priority: Critical
>         Attachments: hbase-rs.log, hbase-site.xml
>
>
> hbase version: 1.0.0-cdh5.4.0
> hadoop version: 2.6.0-cdh5.4.0 
> Environment: a 40-node Hadoop cluster, shared between a 10-node HBase cluster and a 30-node YARN cluster.
> We noticed that one RS stopped serving any client requests at 2015-05-26 01:05:33, while all other RSs were fine. I checked the RS log and found several FATAL entries from org.apache.hadoop.hbase.regionserver.wal.FSHLog when it tried to append() and sync():
> {code}
> 2015-05-26 01:05:33,700 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Could not append. Requesting close of wal
> java.io.IOException: Bad connect ack with firstBadLink as 10.28.1.17:50010
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
> 2015-05-26 01:05:33,700 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Could not append. Requesting close of wal
> java.io.IOException: Bad connect ack with firstBadLink as 10.28.1.17:50010
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
> 2015-05-26 01:05:33,700 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Could not append. Requesting close of wal
> java.io.IOException: Bad connect ack with firstBadLink as 10.28.1.17:50010
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
> 2015-05-26 01:05:33,700 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Archiving hdfs://nameservice1/hbase/WALs/hbase08.company.com,60020,1431985722474/hbase08.company.com%2C60020%2C1431985722474.default.1432602140966 to hdfs://nameservice1/hbase/oldWALs/hbase08.company.com%2C60020%2C1431985722474.default.1432602140966
> 2015-05-26 01:05:33,701 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of wal 
> {code}
> Since the HDFS cluster is shared with a YARN cluster, some IO-heavy jobs were running at the time and exhausted the xceiver limit (dfs.datanode.max.transfer.threads) on some of the DNs at exactly that moment. I think that is why the RS got ``java.io.IOException: Bad connect ack with firstBadLink''.
> The problem is that the RS was stuck and unresponsive from then on. flushQueueLength grew to its ceiling and stayed there. The only subsequent log entries came from the periodicFlusher:
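> For reference, the per-DataNode xceiver ceiling that was exhausted here is controlled by dfs.datanode.max.transfer.threads (formerly dfs.datanode.max.xcievers) in hdfs-site.xml, and raising it on the shared DNs is one mitigation. A minimal snippet, with an illustrative value (the Hadoop 2.x default is 4096):
> {code}
> <property>
>   <name>dfs.datanode.max.transfer.threads</name>
>   <!-- illustrative value; the default is 4096 -->
>   <value>8192</value>
> </property>
> {code}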
> {code}
> 2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,3992+80bb1,1432526964367.c4906e519c1f8206a284c66a8eda2159. after a delay of 11000
> 2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,0814+0416,1432541066864.cf42d5ab47e051d69e516971e82e84be. after a delay of 7874
> 2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,2022+7a571,1432528246524.299c1d4bb28fda2a4d9f248c6c22153c. after a delay of 22740
> 2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,2635+b9b677,1432540367215.749efc885317a2679e2ea39bb0845fbe. after a delay of 3162
> 2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,0401+985e,1432527151473.eb97576381fce10a9616efd471103920. after a delay of 9142
> {code}
> It looks like there is an RS-level deadlock triggered by the handling of the FATAL append exception. In the end, I had to restart the RS service to rescue the regions from the stuck RS. The relevant catch block in FSHLog looks like this:
> {code}
>       } catch (Exception e) {
>         LOG.fatal("Could not append. Requesting close of wal", e);
>         requestLogRoll();
>         throw e;
>       }
>       numEntries.incrementAndGet();
>     }
> {code}
> Maybe the RS should just abort itself (suicide) after the FATAL exception, since it can no longer append the WAL to HDFS?
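> A rough sketch of what that could look like, assuming the catch block has a handle to the owning region server as an Abortable (the {{server}} field here is illustrative, not the actual FSHLog code):
> {code}
>       } catch (Exception e) {
>         LOG.fatal("Could not append. Requesting close of wal", e);
>         requestLogRoll();
>         // Hypothetical: if the WAL can no longer be written, take the
>         // whole RS down so the master reassigns its regions, instead of
>         // leaving a live-but-stuck process behind.
>         server.abort("WAL append failed; aborting RS", e);
>         throw e;
>       }
> {code}
> Aborting would hand the regions back to the master for reassignment, which is effectively what the manual restart did here.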



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)