Posted to dev@hbase.apache.org by "Viraj Jasani (Jira)" <ji...@apache.org> on 2021/01/29 12:34:00 UTC

[jira] [Resolved] (HBASE-25536) Remove 0 length wal file from logQueue if it belongs to old sources.

     [ https://issues.apache.org/jira/browse/HBASE-25536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Jasani resolved HBASE-25536.
----------------------------------
    Fix Version/s: 2.4.2
                   2.3.5
                   2.5.0
                   2.2.7
                   1.7.0
                   3.0.0-alpha-1
     Hadoop Flags: Reviewed
       Resolution: Fixed

Thanks for the contribution [~shahrs87], and for the reviews [~wchevreuil] [~gjacoby] [~bharathv].

> Remove 0 length wal file from logQueue if it belongs to old sources.
> --------------------------------------------------------------------
>
>                 Key: HBASE-25536
>                 URL: https://issues.apache.org/jira/browse/HBASE-25536
>             Project: HBase
>          Issue Type: Improvement
>          Components: Replication
>    Affects Versions: 1.6.0
>            Reporter: Rushabh Shah
>            Assignee: Rushabh Shah
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 1.7.0, 2.2.7, 2.5.0, 2.3.5, 2.4.2
>
>
> In our production clusters, we found a case where the RS does not remove a 0-length file from the replication queue (the in-memory queue, not the ZK replication queue) when the logQueue size is 1.
>  Stack trace below:
> {noformat}
> 2021-01-28 14:44:18,434 ERROR [,60020,1609950703085] regionserver.ReplicationSourceWALReaderThread - Failed to read stream of replication entries
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException: java.io.EOFException: hdfs://hbase/oldWALs/<rs-name>%2C60020%2C1606126266791.1606852981112 not a SequenceFile
> 	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:147)
> Caused by: java.io.EOFException: hdfs://hbase/oldWALs/<rs-name>%2C60020%2C1606126266791.1606852981112 not a SequenceFile
> 	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934)
> 	at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893)
> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1842)
> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1856)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
> 	at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424)
> 	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:338)
> 	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:304)
> 	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:295)
> 	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:198)
> 	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:108)
> 	... 1 more
> {noformat}
> The WAL in question has length 0 (verified via the hadoop ls command) and belongs to a recovered source. There is just 1 log file in the queue (verified via heap dump).
>  We have logic to remove a 0-length log file from the queue when we encounter an EOFException and logQueue#size is greater than 1. Code snippet below.
> {code:java|title=ReplicationSourceWALReader.java|borderStyle=solid}
>   // if we get an EOF due to a zero-length log, and there are other logs in queue
>   // (highly likely we've closed the current log), we've hit the max retries, and autorecovery is
>   // enabled, then dump the log
>   private void handleEofException(IOException e) {
>     if ((e instanceof EOFException || e.getCause() instanceof EOFException) &&
>        logQueue.size() > 1 && this.eofAutoRecovery) {
>       try {
>         if (fs.getFileStatus(logQueue.peek()).getLen() == 0) {
>           LOG.warn("Forcing removal of 0 length log in queue: " + logQueue.peek());
>           logQueue.remove();
>           currentPosition = 0;
>         }
>       } catch (IOException ioe) {
>         LOG.warn("Couldn't get file length information about log " + logQueue.peek());
>       }
>     }
>   }
> {code}
> This size check is valid for active sources, where the queue must hold at least one WAL file (the current one). For recovered sources, where we don't add the current WAL file to the queue, we can skip the logQueue#size check.
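
The change proposed in the description could look roughly like the sketch below. This is a self-contained illustration, not the committed patch: the class EofHandlingSketch, the `recovered` flag, the `fileLengths` map (standing in for fs.getFileStatus(path).getLen()), and the boolean return value are all assumptions made so the example runs without HBase on the classpath.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Illustrative stand-in only; this is NOT the real ReplicationSourceWALReader.
class EofHandlingSketch {
  private final Queue<String> logQueue = new ArrayDeque<>();
  // Assumption: a map replaces the fs.getFileStatus(path).getLen() lookup.
  private final Map<String, Long> fileLengths = new HashMap<>();
  // Assumption: hypothetical flag, true when the source is a recovered source.
  private final boolean recovered;
  private final boolean eofAutoRecovery;
  private long currentPosition;

  EofHandlingSketch(boolean recovered, boolean eofAutoRecovery) {
    this.recovered = recovered;
    this.eofAutoRecovery = eofAutoRecovery;
  }

  void enqueue(String path, long length) {
    logQueue.add(path);
    fileLengths.put(path, length);
  }

  int queueSize() {
    return logQueue.size();
  }

  long position() {
    return currentPosition;
  }

  // Returns true if the 0-length head of the queue was dropped.
  boolean handleEofException(IOException e) {
    boolean isEof = e instanceof EOFException || e.getCause() instanceof EOFException;
    // Active sources keep the current WAL in the queue, so they need size > 1.
    // Recovered sources never hold a current WAL, so the size check is skipped.
    boolean sizeCheckPasses = recovered || logQueue.size() > 1;
    if (!isEof || !sizeCheckPasses || !eofAutoRecovery || logQueue.isEmpty()) {
      return false;
    }
    String head = logQueue.peek();
    if (fileLengths.getOrDefault(head, -1L) == 0L) {
      logQueue.remove();
      currentPosition = 0;
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    EofHandlingSketch recoveredSource = new EofHandlingSketch(true, true);
    recoveredSource.enqueue("recovered-empty-wal", 0L);
    boolean dropped = recoveredSource.handleEofException(new EOFException("not a SequenceFile"));
    System.out.println("recovered source dropped head: " + dropped);

    EofHandlingSketch activeSource = new EofHandlingSketch(false, true);
    activeSource.enqueue("active-current-wal", 0L);
    dropped = activeSource.handleEofException(new EOFException("not a SequenceFile"));
    System.out.println("active source dropped head: " + dropped);
  }
}
```

With a single 0-length file queued, the recovered source drops it, while the active source keeps it because that file may be the WAL currently being written.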



--
This message was sent by Atlassian Jira
(v8.3.4#803005)