You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Anoop Sam John (Jira)" <ji...@apache.org> on 2021/05/19 07:38:01 UTC
[jira] [Updated] (HBASE-25898) RS getting aborted due to NPE in Replication WALEntryStream

     [ https://issues.apache.org/jira/browse/HBASE-25898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anoop Sam John updated HBASE-25898:
-----------------------------------
    Description: 
Below sequence of events happened in a customer cluster
An empty WAL file got roll req.
The close of file failed at HDFS side but as there  file had all edits synced, we continue.
New WAL file is created and old rolled.
This old WAL file got archived to oldWAL 
{code}
2021-05-13 13:38:46.000	Riding over failed WAL close of hdfs://xxx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678, cause="Unexpected EOF while trying to read response from server", errors=1; THIS FILE WAS NOT CLOSED BUT ALL EDITS SYNCED SO SHOULD BE OK
2021-05-13 13:38:46.000	Rolled WAL /xx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678 with entries=0, filesize=90 B; new WAL /xx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620913126549
2021-05-13 13:38:46.000	 Archiving hdfs://xxx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678 to hdfs://xxx/oldWALs/xxxt%2C16020%2C1620828102351.1620910673678
2021-05-13 13:38:46.000	Log hdfs://xxx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678 was moved to hdfs://xxx/oldWALs/xxx%2C16020%2C1620828102351.1620910673678
{code}
As there was move of file, the WALEntryStream got IOE and we will recreate the stream .
{code}
ReplicationSourceWALReader#run
while (isReaderRunning()) {
	try {
	  entryStream =
		new WALEntryStream(logQueue, conf, currentPosition, source.getWALFileLengthProvider(),
		  source.getServerWALsBelongTo(), source.getSourceMetrics(), walGroupId);
	  while (isReaderRunning()) { 
	  ...
	  ...
	  } catch (IOException e) { // stream related
	  if (handleEofException(e, batch)) {
		sleepMultiplier = 1;
	  } else {
		LOG.warn("Failed to read stream of replication entries", e);
		if (sleepMultiplier < maxRetriesMultiplier) {
		  sleepMultiplier++;
		}
		Threads.sleep(sleepForRetries * sleepMultiplier);
	  }
}
{code}
eofAutoRecovery is turned off anyways.  So it will go to outer while loop and create new WALEntryStream object
Then we do readWALEntries
{code}
protected WALEntryBatch readWALEntries(WALEntryStream entryStream,
      WALEntryBatch batch) throws IOException, InterruptedException {
    Path currentPath = entryStream.getCurrentPath();
    if (!entryStream.hasNext()) {
{code}
Here the currentPath will be still null. 
WALEntryStream#hasNext -> tryAdvanceEntry -> checkReader -> openNextLog
{code}
private boolean openNextLog() throws IOException {
    PriorityBlockingQueue<Path> queue = logQueue.getQueue(walGroupId);
    Path nextPath = queue.peek();
    if (nextPath != null) {
      openReader(nextPath);
	  
private void openReader(Path path) throws IOException {
    try {
      // Detect if this is a new file, if so get a new reader else
      // reset the current reader so that we see the new data
      if (reader == null || !getCurrentPath().equals(path)) {
        closeReader();
        reader = WALFactory.createReader(fs, path, conf);
        seek();
        setCurrentPath(path);
      } else {
        resetReader();
      }
    } catch (FileNotFoundException fnfe) {
      handleFileNotFound(path, fnfe);
    }  catch (RemoteException re) {
      IOException ioe = re.unwrapRemoteException(FileNotFoundException.class);
      if (!(ioe instanceof FileNotFoundException)) {
        throw ioe;
      }
      handleFileNotFound(path, (FileNotFoundException)ioe);
    } catch (LeaseNotRecoveredException lnre) {
      // HBASE-15019 the WAL was not closed due to some hiccup.
      LOG.warn("Try to recover the WAL lease " + currentPath, lnre);
      recoverLease(conf, currentPath);
      reader = null;
    } catch (NullPointerException npe) {
      // Workaround for race condition in HDFS-4380
      // which throws a NPE if we open a file before any data node has the most recent block
      // Just sleep and retry. Will require re-reading compressed WALs for compressionContext.
      LOG.warn("Got NPE opening reader, will retry.");
      reader = null;
    }
  }
{code}
Here the call to WALFactory.createReader is not able to complete because of issue from HDFS. (Ya its on the same WAL file for which the close had an issue) We have a retry mechanism there in createReader () for 5 mns. But eventually it throws LeaseNotRecoveredException.  ya we try handle it.
But the problem here is in that call we pass the state variable currentPath which is still null here!
This will throw NPE 
{code}
java.lang.NullPointerException
	at org.apache.hadoop.fs.FileSystem.fixRelativePart(FileSystem.java:2635)
	at org.apache.hadoop.hdfs.DistributedFileSystem.fixRelativePart(DistributedFileSystem.java:3087)
	at org.apache.hadoop.hdfs.DistributedFileSystem.recoverLease(DistributedFileSystem.java:297)
	at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverLease(FSHDFSUtils.java:283)
	at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverDFSFileLease(FSHDFSUtils.java:216)
	at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverFileLease(FSHDFSUtils.java:163)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.recoverLease(WALEntryStream.java:387)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:370)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:343)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:362)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:303)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:294)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:175)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:101)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:192)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:138)
{code}
This will kill the RS!
When next RS takes this Replication Q source from this failed RS, even that also got NPE.

Here in the call recoverLease() we should have been passing the method param 'path'..  All other calls from this method do that way but seems this was miss!

  was:
Below sequence of events happened in a customer cluster
An empty WAL file got roll req.
The close of file failed at HDFS side but as there  file had all edits synced, we continue.
New WAL file is created and old rolled.
This old WAL file got archived to oldWAL 
{code}
2021-05-13 13:38:46.000	Riding over failed WAL close of hdfs://xxx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678, cause="Unexpected EOF while trying to read response from server", errors=1; THIS FILE WAS NOT CLOSED BUT ALL EDITS SYNCED SO SHOULD BE OK
2021-05-13 13:38:46.000	Rolled WAL /xx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678 with entries=0, filesize=90 B; new WAL /xx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620913126549
2021-05-13 13:38:46.000	 Archiving hdfs://xxx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678 to hdfs://xxx/oldWALs/xxxt%2C16020%2C1620828102351.1620910673678
2021-05-13 13:38:46.000	Log hdfs://xxx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678 was moved to hdfs://xxx/oldWALs/xxx%2C16020%2C1620828102351.1620910673678
{code}
As there was move of file, the WALEntryStream got IOE and we will recreate the stream .
{code}
ReplicationSourceWALReader#run
while (isReaderRunning()) {
	try {
	  entryStream =
		new WALEntryStream(logQueue, conf, currentPosition, source.getWALFileLengthProvider(),
		  source.getServerWALsBelongTo(), source.getSourceMetrics(), walGroupId);
	  while (isReaderRunning()) { 
	  ...
	  ...
	  } catch (IOException e) { // stream related
	  if (handleEofException(e, batch)) {
		sleepMultiplier = 1;
	  } else {
		LOG.warn("Failed to read stream of replication entries", e);
		if (sleepMultiplier < maxRetriesMultiplier) {
		  sleepMultiplier++;
		}
		Threads.sleep(sleepForRetries * sleepMultiplier);
	  }
}
{code}
eofAutoRecovery is turned off anyways.  So it will go to outer while loop and create new WALEntryStream object
Then we do readWALEntries
{code}
protected WALEntryBatch readWALEntries(WALEntryStream entryStream,
      WALEntryBatch batch) throws IOException, InterruptedException {
    Path currentPath = entryStream.getCurrentPath();
    if (!entryStream.hasNext()) {
{code}
Here the currentPath will be still null. 
WALEntryStream#hasNext -> tryAdvanceEntry -> checkReader -> openNextLog
{code}
private boolean openNextLog() throws IOException {
    PriorityBlockingQueue<Path> queue = logQueue.getQueue(walGroupId);
    Path nextPath = queue.peek();
    if (nextPath != null) {
      openReader(nextPath);
	  
private void openReader(Path path) throws IOException {
    try {
      // Detect if this is a new file, if so get a new reader else
      // reset the current reader so that we see the new data
      if (reader == null || !getCurrentPath().equals(path)) {
        closeReader();
        reader = WALFactory.createReader(fs, path, conf);
        seek();
        setCurrentPath(path);
      } else {
        resetReader();
      }
    } catch (FileNotFoundException fnfe) {
      handleFileNotFound(path, fnfe);
    }  catch (RemoteException re) {
      IOException ioe = re.unwrapRemoteException(FileNotFoundException.class);
      if (!(ioe instanceof FileNotFoundException)) {
        throw ioe;
      }
      handleFileNotFound(path, (FileNotFoundException)ioe);
    } catch (LeaseNotRecoveredException lnre) {
      // HBASE-15019 the WAL was not closed due to some hiccup.
      LOG.warn("Try to recover the WAL lease " + currentPath, lnre);
      recoverLease(conf, currentPath);
      reader = null;
    } catch (NullPointerException npe) {
      // Workaround for race condition in HDFS-4380
      // which throws a NPE if we open a file before any data node has the most recent block
      // Just sleep and retry. Will require re-reading compressed WALs for compressionContext.
      LOG.warn("Got NPE opening reader, will retry.");
      reader = null;
    }
  }
{code}
Here the call to WALFactory.createReader is not able to complete because of issue from HDFS.  We have retry mechanism there for 5 mns. But eventually it throws LeaseNotRecoveredException.  ya we try handle it.
But the problem here is in that call we pass the state variable currentPath which is still null here!
This will throw NPE 
{code}
java.lang.NullPointerException
	at org.apache.hadoop.fs.FileSystem.fixRelativePart(FileSystem.java:2635)
	at org.apache.hadoop.hdfs.DistributedFileSystem.fixRelativePart(DistributedFileSystem.java:3087)
	at org.apache.hadoop.hdfs.DistributedFileSystem.recoverLease(DistributedFileSystem.java:297)
	at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverLease(FSHDFSUtils.java:283)
	at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverDFSFileLease(FSHDFSUtils.java:216)
	at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverFileLease(FSHDFSUtils.java:163)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.recoverLease(WALEntryStream.java:387)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:370)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:343)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:362)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:303)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:294)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:175)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:101)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:192)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:138)
{code}
This will kill the RS!
When next RS takes this Replication Q source from this failed RS, even that also got NPE.

Here in the call recoverLease() we should have been passing the method param 'path'..  All other calls from this method do that way but seems this was miss!


> RS getting aborted due to NPE in Replication WALEntryStream
> -----------------------------------------------------------
>
>                 Key: HBASE-25898
>                 URL: https://issues.apache.org/jira/browse/HBASE-25898
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Anoop Sam John
>            Assignee: Anoop Sam John
>            Priority: Critical
>
> Below sequence of events happened in a customer cluster
> An empty WAL file got roll req.
> The close of file failed at HDFS side but as there  file had all edits synced, we continue.
> New WAL file is created and old rolled.
> This old WAL file got archived to oldWAL 
> {code}
> 2021-05-13 13:38:46.000	Riding over failed WAL close of hdfs://xxx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678, cause="Unexpected EOF while trying to read response from server", errors=1; THIS FILE WAS NOT CLOSED BUT ALL EDITS SYNCED SO SHOULD BE OK
> 2021-05-13 13:38:46.000	Rolled WAL /xx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678 with entries=0, filesize=90 B; new WAL /xx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620913126549
> 2021-05-13 13:38:46.000	 Archiving hdfs://xxx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678 to hdfs://xxx/oldWALs/xxxt%2C16020%2C1620828102351.1620910673678
> 2021-05-13 13:38:46.000	Log hdfs://xxx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678 was moved to hdfs://xxx/oldWALs/xxx%2C16020%2C1620828102351.1620910673678
> {code}
> As there was move of file, the WALEntryStream got IOE and we will recreate the stream .
> {code}
> ReplicationSourceWALReader#run
> while (isReaderRunning()) {
> 	try {
> 	  entryStream =
> 		new WALEntryStream(logQueue, conf, currentPosition, source.getWALFileLengthProvider(),
> 		  source.getServerWALsBelongTo(), source.getSourceMetrics(), walGroupId);
> 	  while (isReaderRunning()) { 
> 	  ...
> 	  ...
> 	  } catch (IOException e) { // stream related
> 	  if (handleEofException(e, batch)) {
> 		sleepMultiplier = 1;
> 	  } else {
> 		LOG.warn("Failed to read stream of replication entries", e);
> 		if (sleepMultiplier < maxRetriesMultiplier) {
> 		  sleepMultiplier++;
> 		}
> 		Threads.sleep(sleepForRetries * sleepMultiplier);
> 	  }
> }
> {code}
> eofAutoRecovery is turned off anyways.  So it will go to outer while loop and create new WALEntryStream object
> Then we do readWALEntries
> {code}
> protected WALEntryBatch readWALEntries(WALEntryStream entryStream,
>       WALEntryBatch batch) throws IOException, InterruptedException {
>     Path currentPath = entryStream.getCurrentPath();
>     if (!entryStream.hasNext()) {
> {code}
> Here the currentPath will be still null. 
> WALEntryStream#hasNext -> tryAdvanceEntry -> checkReader -> openNextLog
> {code}
> private boolean openNextLog() throws IOException {
>     PriorityBlockingQueue<Path> queue = logQueue.getQueue(walGroupId);
>     Path nextPath = queue.peek();
>     if (nextPath != null) {
>       openReader(nextPath);
> 	  
> private void openReader(Path path) throws IOException {
>     try {
>       // Detect if this is a new file, if so get a new reader else
>       // reset the current reader so that we see the new data
>       if (reader == null || !getCurrentPath().equals(path)) {
>         closeReader();
>         reader = WALFactory.createReader(fs, path, conf);
>         seek();
>         setCurrentPath(path);
>       } else {
>         resetReader();
>       }
>     } catch (FileNotFoundException fnfe) {
>       handleFileNotFound(path, fnfe);
>     }  catch (RemoteException re) {
>       IOException ioe = re.unwrapRemoteException(FileNotFoundException.class);
>       if (!(ioe instanceof FileNotFoundException)) {
>         throw ioe;
>       }
>       handleFileNotFound(path, (FileNotFoundException)ioe);
>     } catch (LeaseNotRecoveredException lnre) {
>       // HBASE-15019 the WAL was not closed due to some hiccup.
>       LOG.warn("Try to recover the WAL lease " + currentPath, lnre);
>       recoverLease(conf, currentPath);
>       reader = null;
>     } catch (NullPointerException npe) {
>       // Workaround for race condition in HDFS-4380
>       // which throws a NPE if we open a file before any data node has the most recent block
>       // Just sleep and retry. Will require re-reading compressed WALs for compressionContext.
>       LOG.warn("Got NPE opening reader, will retry.");
>       reader = null;
>     }
>   }
> {code}
> Here the call to WALFactory.createReader is not able to complete because of issue from HDFS. (Ya its on the same WAL file for which the close had an issue) We have a retry mechanism there in createReader () for 5 mns. But eventually it throws LeaseNotRecoveredException.  ya we try handle it.
> But the problem here is in that call we pass the state variable currentPath which is still null here!
> This will throw NPE 
> {code}
> java.lang.NullPointerException
> 	at org.apache.hadoop.fs.FileSystem.fixRelativePart(FileSystem.java:2635)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.fixRelativePart(DistributedFileSystem.java:3087)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.recoverLease(DistributedFileSystem.java:297)
> 	at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverLease(FSHDFSUtils.java:283)
> 	at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverDFSFileLease(FSHDFSUtils.java:216)
> 	at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverFileLease(FSHDFSUtils.java:163)
> 	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.recoverLease(WALEntryStream.java:387)
> 	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:370)
> 	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:343)
> 	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:362)
> 	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:303)
> 	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:294)
> 	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:175)
> 	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:101)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:192)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:138)
> {code}
> This will kill the RS!
> When next RS takes this Replication Q source from this failed RS, even that also got NPE.
> Here in the call recoverLease() we should have been passing the method param 'path'..  All other calls from this method do that way but seems this was miss!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)