Posted to dev@hbase.apache.org by "Rushabh Shah (Jira)" <ji...@apache.org> on 2021/01/28 15:14:00 UTC

[jira] [Created] (HBASE-25536) Remove 0 length wal file from queue if it belongs to old sources.

Rushabh Shah created HBASE-25536:
------------------------------------

             Summary: Remove 0 length wal file from queue if it belongs to old sources.
                 Key: HBASE-25536
                 URL: https://issues.apache.org/jira/browse/HBASE-25536
             Project: HBase
          Issue Type: Improvement
          Components: Replication
    Affects Versions: 1.6.0
            Reporter: Rushabh Shah
            Assignee: Rushabh Shah
             Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.3.5, 2.4.2


In our production clusters, we found a case where the RS does not remove a 0-length WAL file from the replication queue (the in-memory queue, not the ZK replication queue) when the logQueue size is 1.
 Stack trace below:
{noformat}
2021-01-28 14:44:18,434 ERROR [,60020,1609950703085] regionserver.ReplicationSourceWALReaderThread - Failed to read stream of replication entries
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException: java.io.EOFException: hdfs://hbase/oldWALs/<rs-name>%2C60020%2C1606126266791.1606852981112 not a SequenceFile
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:147)
Caused by: java.io.EOFException: hdfs://hbase/oldWALs/<rs-name>%2C60020%2C1606126266791.1606852981112 not a SequenceFile
	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934)
	at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1842)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1856)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
	at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313)
	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:338)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:304)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:295)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:198)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:108)
	... 1 more
{noformat}
The WAL in question has length 0 (verified via the hadoop ls command) and belongs to a recovered source. There is just 1 log file in the queue (verified via heap dump).

We have logic to remove a 0-length log file from the queue when we encounter an EOFException and logQueue#size is greater than 1. Code snippet below.
{code:java|title=ReplicationSourceWALReader.java|borderStyle=solid}
  // if we get an EOF due to a zero-length log, and there are other logs in queue
  // (highly likely we've closed the current log), we've hit the max retries, and autorecovery is
  // enabled, then dump the log
  private void handleEofException(IOException e) {
    if ((e instanceof EOFException || e.getCause() instanceof EOFException) &&
       logQueue.size() > 1 && this.eofAutoRecovery) {
      try {
        if (fs.getFileStatus(logQueue.peek()).getLen() == 0) {
          LOG.warn("Forcing removal of 0 length log in queue: " + logQueue.peek());
          logQueue.remove();
          currentPosition = 0;
        }
      } catch (IOException ioe) {
        LOG.warn("Couldn't get file length information about log " + logQueue.peek());
      }
    }
  }
{code}
This size check is valid for active sources, where the queue must always hold at least one WAL file (the current WAL file). For recovered sources, where the current WAL file is never added to the queue, we can skip the logQueue#size check.
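
A minimal, self-contained sketch of the proposed condition, not the actual patch. The class name, the {{recovered}} flag, and the {{fileLength}} lookup (standing in for {{fs.getFileStatus(path).getLen()}}) are assumptions made for illustration; the queue holds plain strings instead of HDFS paths so the example runs without Hadoop.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.ToLongFunction;

// Hypothetical stand-in for the reader's EOF handling; not HBase code.
class ZeroLengthWalSketch {

  /**
   * Removes the head of the queue if it is a zero-length log and removal is allowed.
   * Returns true if the head was removed.
   */
  static boolean handleEofException(IOException e, Queue<String> logQueue,
      boolean eofAutoRecovery, boolean recovered, ToLongFunction<String> fileLength) {
    boolean isEof = e instanceof EOFException || e.getCause() instanceof EOFException;
    // Active source: the last entry may be the WAL still being written, so keep it.
    // Recovered source: the current WAL is never queued, so the size check is skipped.
    boolean sizeCheckPasses = recovered || logQueue.size() > 1;
    if (isEof && sizeCheckPasses && eofAutoRecovery && !logQueue.isEmpty()) {
      if (fileLength.applyAsLong(logQueue.peek()) == 0) {
        logQueue.remove();
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Every file is reported as zero-length in this demo.
    ToLongFunction<String> alwaysEmpty = path -> 0L;

    Queue<String> recoveredQueue = new ArrayDeque<>();
    recoveredQueue.add("oldWALs/wal-1");
    System.out.println("recovered source, 1 log: removed = " + handleEofException(
        new EOFException("not a SequenceFile"), recoveredQueue, true, true, alwaysEmpty));

    Queue<String> activeQueue = new ArrayDeque<>();
    activeQueue.add("WALs/wal-1");
    System.out.println("active source, 1 log: removed = " + handleEofException(
        new EOFException("not a SequenceFile"), activeQueue, true, false, alwaysEmpty));
  }
}
```

With this condition, a recovered source holding a single zero-length log drops it, while an active source in the same state keeps its only entry, matching the existing behavior for active sources.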



--
This message was sent by Atlassian Jira
(v8.3.4#803005)