Posted to hdfs-dev@hadoop.apache.org by "Felix N (Jira)" <ji...@apache.org> on 2024/04/22 08:11:00 UTC
[jira] [Created] (HDFS-17488) DN can fail IBRs with NPE when a volume is removed
Felix N created HDFS-17488:
------------------------------
Summary: DN can fail IBRs with NPE when a volume is removed
Key: HDFS-17488
URL: https://issues.apache.org/jira/browse/HDFS-17488
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs
Reporter: Felix N
Assignee: Felix N
Error logs
{code:java}
2024-04-22 15:46:33,422 [BP-1842952724-10.22.68.249-1713771988830 heartbeating to localhost/127.0.0.1:64977] ERROR datanode.DataNode (BPServiceActor.java:run(922)) - Exception in BPOfferService for Block pool BP-1842952724-10.22.68.249-1713771988830 (Datanode Uuid 1659ffaf-1a80-4a8e-a542-643f6bd97ed4) service to localhost/127.0.0.1:64977
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReceivedAndDeleted(DatanodeProtocolClientSideTranslatorPB.java:246)
    at org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager.sendIBRs(IncrementalBlockReportManager.java:218)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:749)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:920)
    at java.lang.Thread.run(Thread.java:748) {code}
The root cause is in BPOfferService#notifyNamenodeBlock, which can be called on a block belonging to a volume that has already been removed. Because the volume is gone, dn.getFSDataset().getStorage(storageUuid) returns null:
{code:java}
private void notifyNamenodeBlock(ExtendedBlock block, BlockStatus status,
    String delHint, String storageUuid, boolean isOnTransientStorage) {
  checkBlock(block);
  final ReceivedDeletedBlockInfo info = new ReceivedDeletedBlockInfo(
      block.getLocalBlock(), status, delHint);
  final DatanodeStorage storage = dn.getFSDataset().getStorage(storageUuid);
  // storage == null here because it's already removed earlier.
  for (BPServiceActor actor : bpServices) {
    actor.getIbrManager().notifyNamenodeBlock(info, storage,
        isOnTransientStorage);
  }
} {code}
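so IBRs with a null storage are now pending. The NPE then surfaces when IncrementalBlockReportManager#sendIBRs passes them to blockReceivedAndDeleted, as in the stack trace above.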
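To make the delayed failure easier to see, here is a minimal, self-contained sketch of the pattern. It is not HDFS code: ToyDataset, ToyIbrManager and PendingIbr are hypothetical stand-ins, and the only behaviour modelled is that queuing accepts a null storage while sending dereferences it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy model; none of these classes are the real HDFS ones. */
final class NullStorageIbrSketch {

  /** Stand-in for the dataset's storageUuid -> storage lookup. */
  static final class ToyDataset {
    private final Map<String, String> storages = new HashMap<>();

    void addVolume(String uuid) { storages.put(uuid, "storage-" + uuid); }

    void removeVolume(String uuid) { storages.remove(uuid); }

    /** Returns null once the volume has been removed. */
    String getStorage(String uuid) { return storages.get(uuid); }
  }

  /** One queued report entry; the storage may be null. */
  static final class PendingIbr {
    final long blockId;
    final String storage;

    PendingIbr(long blockId, String storage) {
      this.blockId = blockId;
      this.storage = storage;
    }
  }

  /** Stand-in for the pending-IBR queue. */
  static final class ToyIbrManager {
    private final List<PendingIbr> pending = new ArrayList<>();

    /** Queuing does not validate the storage (assumption of this sketch). */
    void notifyBlock(long blockId, String storage) {
      pending.add(new PendingIbr(blockId, storage));
    }

    int pendingCount() { return pending.size(); }

    /** Sending dereferences the storage, so a null entry only fails here. */
    int sendAll() {
      int sent = 0;
      for (PendingIbr ibr : pending) {
        ibr.storage.length(); // throws NullPointerException for a null storage
        sent++;
      }
      pending.clear();
      return sent;
    }
  }

  /** Returns true if sending fails with an NPE after the volume was removed. */
  static boolean sendFailsAfterVolumeRemoval() {
    ToyDataset dataset = new ToyDataset();
    dataset.addVolume("vol-1");
    dataset.removeVolume("vol-1"); // the volume is removed first

    ToyIbrManager ibrs = new ToyIbrManager();
    // The lookup returns null, and the null is queued without complaint.
    ibrs.notifyBlock(1001L, dataset.getStorage("vol-1"));

    try {
      ibrs.sendAll();
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("NPE at send time: " + sendFailsAfterVolumeRemoval());
  }
}
```

In this toy model the notification succeeds and the exception only appears on the sending path, which matches the shape of the stack trace: the failing frame is in the send call, not in the code that queued the entry.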
The reason notifyNamenodeBlock can be triggered for such blocks lies further up the call chain, in DirectoryScanner#reconcile:
{code:java}
public void reconcile() throws IOException {
  LOG.debug("reconcile start DirectoryScanning");
  scan();
  // If a volume is removed here after scan() already finished running,
  // diffs is stale and checkAndUpdate will run on a removed volume
  // HDFS-14476: run checkAndUpdate with batch to avoid holding the lock too
  // long
  int loopCount = 0;
  synchronized (diffs) {
    for (final Map.Entry<String, ScanInfo> entry : diffs.getEntries()) {
      dataset.checkAndUpdate(entry.getKey(), entry.getValue());
      ...
} {code}
Inside checkAndUpdate, memBlockInfo is null because all in-memory block metadata for the volume is dropped during volume removal, while diskFile still exists. DataNode#notifyNamenodeDeletedBlock (and, further down the line, notifyNamenodeBlock) is then called on this block.
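The stale-snapshot window can be sketched in isolation as follows. This is a toy model under stated assumptions, not the DirectoryScanner or FsDatasetImpl logic: ToyDataset, its scan() and its checkAndUpdate() are hypothetical, the toy scan simply snapshots every on-disk block instead of computing real diffs, and the condition in checkAndUpdate follows the description in this report.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical toy model of a scan snapshot going stale before it is used. */
final class StaleScanDiffSketch {

  static final class ToyDataset {
    private final Set<String> volumes = new HashSet<>();
    private final Map<Long, String> memBlocks = new HashMap<>();  // in-memory metadata
    private final Map<Long, String> diskBlocks = new HashMap<>(); // block files on disk
    final List<Long> notifiedWithNullStorage = new ArrayList<>();

    void addBlock(String volume, long blockId) {
      volumes.add(volume);
      memBlocks.put(blockId, volume);
      diskBlocks.put(blockId, volume);
    }

    /** Drops the volume and its in-memory metadata; files stay on disk. */
    void removeVolume(String volume) {
      volumes.remove(volume);
      memBlocks.values().removeIf(volume::equals);
    }

    /** Toy scan: a point-in-time copy of (blockId, volume) pairs found on disk. */
    List<Map.Entry<Long, String>> scan() {
      List<Map.Entry<Long, String>> snapshot = new ArrayList<>();
      for (Map.Entry<Long, String> e : diskBlocks.entrySet()) {
        snapshot.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue()));
      }
      return snapshot;
    }

    /** Toy check: records blocks that would be reported with a null storage. */
    void checkAndUpdate(long blockId, String volume) {
      boolean memBlockInfoIsNull = !memBlocks.containsKey(blockId);
      boolean diskFileExists = diskBlocks.containsKey(blockId);
      boolean storageIsNull = !volumes.contains(volume);
      if (memBlockInfoIsNull && diskFileExists && storageIsNull) {
        notifiedWithNullStorage.add(blockId);
      }
    }
  }

  /** Scans, runs the given action in the window after the scan, then reconciles. */
  static int reconcile(ToyDataset dataset, Runnable afterScan) {
    List<Map.Entry<Long, String>> diffs = dataset.scan();
    afterScan.run(); // a volume removal here leaves diffs stale
    for (Map.Entry<Long, String> entry : diffs) {
      dataset.checkAndUpdate(entry.getKey(), entry.getValue());
    }
    return dataset.notifiedWithNullStorage.size();
  }

  /** Returns how many blocks end up notified with a null storage. */
  static int run(boolean removeVolumeAfterScan) {
    ToyDataset dataset = new ToyDataset();
    dataset.addBlock("vol-1", 1L);
    dataset.addBlock("vol-1", 2L);
    dataset.addBlock("vol-2", 3L);
    return reconcile(dataset, () -> {
      if (removeVolumeAfterScan) {
        dataset.removeVolume("vol-1");
      }
    });
  }

  public static void main(String[] args) {
    System.out.println("without removal: " + run(false));
    System.out.println("removal after scan: " + run(true));
  }
}
```

With no removal, the toy reconcile flags nothing. When vol-1 is removed between the scan and the update loop, both of its blocks are processed from the stale snapshot and flagged, while the block on vol-2 is unaffected.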