Posted to dev@hbase.apache.org by "Andrew Kyle Purtell (Jira)" <ji...@apache.org> on 2021/01/11 21:57:00 UTC

[jira] [Resolved] (HBASE-12830) Unreadable HLogs stuck in replication queues

     [ https://issues.apache.org/jira/browse/HBASE-12830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Kyle Purtell resolved HBASE-12830.
-----------------------------------------
    Resolution: Duplicate

> Unreadable HLogs stuck in replication queues
> --------------------------------------------
>
>                 Key: HBASE-12830
>                 URL: https://issues.apache.org/jira/browse/HBASE-12830
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.9
>            Reporter: Andrew Kyle Purtell
>            Priority: Major
>
> We had an incident where underlying infrastructure issues caused HDFS namenodes to go down at different times, leading to periods of HDFS service outage and recovery as namenodes failed over. These clusters had replication enabled. Some RegionServers rolled logs during the partial HDFS availability. Namespace entries for these HLogs were created, but the first block being written was lost or unable to complete, leaving dead 0-length HLogs in the queues of active replication sources. When this happens the queue becomes stuck on the dead 0-length HLog, reporting an EOFException whenever the source wakes up and tries to (re)open the current file, like so:
> {noformat}
> 2015-01-08 18:50:40,956 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1-<hostname removed>,60020,1418764167084 Got: 
> java.io.EOFException
> 	at java.io.DataInputStream.readFully(DataInputStream.java:197)
> 	at java.io.DataInputStream.readFully(DataInputStream.java:169)
> 	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1845)
> 	at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1810)
> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1759)
> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1773)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:175)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:184)
> 	at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:70)
> 	at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:128)
> 	at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:91)
> 	at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:79)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:68)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:506)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:309)
> {noformat}
> This exception originates where SequenceFile tries to read the 4-byte version header from position 0.
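> A minimal illustration of the underlying behavior (plain Java, not HBase code): DataInputStream.readFully() throws EOFException when the stream ends before the buffer is filled, which is what SequenceFile.Reader.init() hits when it tries to read the version header of a 0-length file:
> {code}
> import java.io.ByteArrayInputStream;
> import java.io.DataInputStream;
> import java.io.EOFException;
>
> public class ZeroLengthHeaderExample {
>   public static void main(String[] args) throws Exception {
>     // An empty stream stands in for the dead 0-length HLog.
>     DataInputStream in = new DataInputStream(new ByteArrayInputStream(new byte[0]));
>     byte[] versionBlock = new byte[4];
>     try {
>       in.readFully(versionBlock); // nothing to read at position 0
>     } catch (EOFException e) {
>       System.out.println("EOFException, same as the replication source sees: " + e);
>     }
>   }
> }
> {code}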
> In ReplicationSource#run we have an active loop:
> {code}
>     // Loop until we close down
>     while (isActive()) {
>     ...
>     }
> {code}
> Within this loop we iterate over paths in the replication queue. For each path, we attempt to open it:
> {code}
>       // Open a reader on it
>       if (!openReader(sleepMultiplier)) {
>         // Reset the sleep multiplier, else it'd be reused for the next file
>         sleepMultiplier = 1;
>         continue;
>       }
> {code}
> When we have a zero-length file, openReader returns true but this.reader is left null (see the catch of the outer try block), so we fall past that conditional to:
> {code}
>       // If we got a null reader but didn't continue, then sleep and continue
>       if (this.reader == null) {
>         if (sleepForRetries("Unable to open a reader", sleepMultiplier)) {
>           sleepMultiplier++;
>         }
>         continue;
>       }
> {code}
> We will keep trying to open the current file for a long time. The queue stays stuck until sleepMultiplier == maxRetriesMultiplier (conf "replication.source.maxretriesmultiplier", default 10), with a base sleep time of "replication.source.sleepforretries" (default 1000) ms; only then do we call ReplicationSource#processEndOfFile().
> By default we will spin on opening the dead 0-length HLog for (1000*1) + (1000*2) + ... + (1000*10) = 55,000 ms (55 seconds) before processing the file out of the queue. In HBASE-11964 we recommend increasing "replication.source.maxretriesmultiplier" to 300. With the updated configuration we will wait (1000*1) + (1000*2) + ... + (1000*300) = 45,150,000 ms, roughly 12.5 hours, before processing the file out of the queue.
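> A quick back-of-the-envelope check of those totals, assuming the source sleeps sleepForRetries * sleepMultiplier ms on each failed attempt and the multiplier grows by one per attempt up to maxRetriesMultiplier:
> {code}
> public class ReplicationBackoffEstimate {
>   // Sums the per-attempt sleeps described above; not HBase code.
>   static long totalBackoffMs(long sleepForRetriesMs, int maxRetriesMultiplier) {
>     long total = 0;
>     for (int multiplier = 1; multiplier <= maxRetriesMultiplier; multiplier++) {
>       total += sleepForRetriesMs * multiplier;
>     }
>     return total;
>   }
>
>   public static void main(String[] args) {
>     System.out.println(totalBackoffMs(1000, 10));  // 55000 ms, about 55 seconds
>     System.out.println(totalBackoffMs(1000, 300)); // 45150000 ms, about 12.5 hours
>   }
> }
> {code}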
> There should be some way to break out of this very long wait when a 0-length or corrupt file is blocking the queue. Perhaps we can check whether there are later files in the queue and not wait for the full maxRetriesMultiplier in that case (see the sketch at the end of this description); a different threshold could be set for that. Or: right now we can't short-circuit this manually by removing the 0-length log files, but we could enable that. Currently a FileNotFoundException triggers an attempt to find where the log has gone. If it can't be found, we fall through to
> {code}
>           // TODO What happens the log is missing in both places?
> {code}
> and openReader() still returns true as above; the sleep multiplier is increased and we continue to spin on the file with increasing wait times.
> (And having that TODO unimplemented is also not good.)
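> A rough sketch of the "skip ahead if there are later files" idea. This is purely illustrative: the class and method names (StuckLogHeuristic, shouldSkipCurrentLog, queuedLogsBehindCurrent) are hypothetical, and the real fix would live inside ReplicationSource#openReader()/run():
> {code}
> import java.io.IOException;
>
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class StuckLogHeuristic {
>   /**
>    * Returns true if the source should give up on the current log and process
>    * it out of the queue early: the file is 0-length and newer logs are
>    * already queued behind it.
>    */
>   static boolean shouldSkipCurrentLog(FileSystem fs, Path currentLog,
>       int queuedLogsBehindCurrent) throws IOException {
>     FileStatus status = fs.getFileStatus(currentLog);
>     return status.getLen() == 0 && queuedLogsBehindCurrent > 0;
>   }
> }
> {code}
> When this returns true, the run loop could call processEndOfFile() immediately instead of cycling through sleepForRetries() until sleepMultiplier reaches maxRetriesMultiplier.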



--
This message was sent by Atlassian Jira
(v8.3.4#803005)