Posted to hdfs-issues@hadoop.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/07/13 03:42:00 UTC
[jira] [Work logged] (HDFS-16657) Changing pool-level lock to volume-level lock for invalidation of blocks
[ https://issues.apache.org/jira/browse/HDFS-16657?focusedWorklogId=790241&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-790241 ]
ASF GitHub Bot logged work on HDFS-16657:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 13/Jul/22 03:41
Start Date: 13/Jul/22 03:41
Worklog Time Spent: 10m
Work Description: yuanboliu opened a new pull request, #4558:
URL: https://github.com/apache/hadoop/pull/4558
The key code is:
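// Excerpt from the per-block loop of block invalidation; runs under the pool-level lock
try {
  File blockFile = new File(info.getBlockURI());
  // Record an error and skip the replica if its block file has no parent directory
  if (blockFile != null && blockFile.getParentFile() == null) {
    errors.add("Failed to delete replica " + invalidBlks[i]
        + ". Parent not found for block file: " + blockFile);
    continue;
  }
} catch (IllegalArgumentException e) {
  // new File(URI) throws this when the URI is not a local "file:" URI
  LOG.warn("Parent directory check failed; replica " + info
      + " is not backed by a local file");
}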
The DataNode (DN) resolves the parent path of each block file while holding the pool-level lock, so disk I/O happens inside that lock. When a disk is very busy with high I/O wait, all pending threads block on the pool-level lock and heartbeat latency rises. We propose changing the pool-level lock to a volume-level lock for block invalidation.
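A minimal sketch of the proposed direction, assuming a simplified model of the DN: the sketch keeps one lock per volume instead of one lock for the whole block pool. A thread stalled on a busy disk then blocks only callers that need the same volume. All names here (VolumeLockSketch, Block, lockFor, invalidate, otherVolumeProceedsWhileOneIsBusy) are hypothetical and are not Hadoop's FsDatasetImpl API. A comment stands in for the per-block disk checks.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Sketch only: VolumeLockSketch, Block, lockFor and invalidate are hypothetical
// names, not Hadoop's FsDatasetImpl or lock-manager API.
public class VolumeLockSketch {

  /** Hypothetical stand-in for a replica: a block id plus the volume holding it. */
  public static final class Block {
    final long id;
    final String volume;

    public Block(long id, String volume) {
      this.id = id;
      this.volume = volume;
    }
  }

  // One lock per volume, created on first use. A pool-level design would
  // instead guard every block of the pool with a single lock.
  private final ConcurrentHashMap<String, ReentrantLock> volumeLocks = new ConcurrentHashMap<>();

  public ReentrantLock lockFor(String volume) {
    return volumeLocks.computeIfAbsent(volume, v -> new ReentrantLock());
  }

  /** Invalidates blocks while holding only the lock of each block's volume. */
  public List<Long> invalidate(List<Block> blocks) {
    // Group by volume so each volume lock is acquired once per call.
    Map<String, List<Block>> byVolume = new HashMap<>();
    for (Block b : blocks) {
      byVolume.computeIfAbsent(b.volume, v -> new ArrayList<>()).add(b);
    }
    List<Long> invalidated = new ArrayList<>();
    for (Map.Entry<String, List<Block>> entry : byVolume.entrySet()) {
      ReentrantLock lock = lockFor(entry.getKey());
      lock.lock();
      try {
        for (Block b : entry.getValue()) {
          // Per-block checks that may hit the disk (e.g. resolving the block
          // file's path) would run here, stalling only this volume's callers.
          invalidated.add(b.id);
        }
      } finally {
        lock.unlock();
      }
    }
    return invalidated;
  }

  /** Demo: while another thread holds volume "A", invalidation on "B" still completes. */
  public static boolean otherVolumeProceedsWhileOneIsBusy() {
    VolumeLockSketch dataset = new VolumeLockSketch();
    CountDownLatch held = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);
    Thread busy = new Thread(() -> {
      ReentrantLock a = dataset.lockFor("A");
      a.lock(); // simulates a thread stuck on slow I/O for volume A
      try {
        held.countDown();
        release.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        a.unlock();
      }
    });
    busy.start();
    try {
      held.await();
      List<Long> done = dataset.invalidate(List.of(new Block(1, "B"), new Block(2, "B")));
      boolean aStillHeld = dataset.lockFor("A").isLocked();
      return aStillHeld && done.equals(List.of(1L, 2L));
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      release.countDown();
      try { busy.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
  }

  public static void main(String[] args) {
    System.out.println("volume B proceeds while A is busy: " + otherVolumeProceedsWhileOneIsBusy());
  }
}
```

With a single pool-level lock, the call on volume "B" in the demo would wait until the thread holding "A" released it. Grouping blocks by volume first keeps lock acquisitions to one per volume per call. The cost is one lock object per volume.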
Issue Time Tracking
-------------------
Worklog Id: (was: 790241)
Remaining Estimate: 0h
Time Spent: 10m
> Changing pool-level lock to volume-level lock for invalidation of blocks
> ------------------------------------------------------------------------
>
> Key: HDFS-16657
> URL: https://issues.apache.org/jira/browse/HDFS-16657
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Yuanbo Liu
> Priority: Major
> Attachments: image-2022-07-13-10-25-37-383.png, image-2022-07-13-10-27-01-386.png, image-2022-07-13-10-27-44-258.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Recently we have seen DataNode heartbeats become slow in a very busy cluster. Here is the chart:
> !image-2022-07-13-10-25-37-383.png|width=665,height=245!
>
> After taking a jstack of the DN, we found that the DN heartbeat was stuck in block invalidation:
> !image-2022-07-13-10-27-01-386.png|width=658,height=308!
> !image-2022-07-13-10-27-44-258.png|width=502,height=325!
> The key code is:
> {code:java}
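> // Excerpt from the per-block loop of block invalidation; runs under the pool-level lock
> try {
>   File blockFile = new File(info.getBlockURI());
>   // Record an error and skip the replica if its block file has no parent directory
>   if (blockFile != null && blockFile.getParentFile() == null) {
>     errors.add("Failed to delete replica " + invalidBlks[i]
>         + ". Parent not found for block file: " + blockFile);
>     continue;
>   }
> } catch (IllegalArgumentException e) {
>   // new File(URI) throws this when the URI is not a local "file:" URI
>   LOG.warn("Parent directory check failed; replica " + info
>       + " is not backed by a local file");
> }
> {code}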
> The DataNode (DN) resolves the parent path of each block file while holding the pool-level lock, so disk I/O happens inside that lock. When a disk is very busy with high I/O wait, all pending threads block on the pool-level lock and heartbeat latency rises. We propose changing the pool-level lock to a volume-level lock for block invalidation.
> cc: [~hexiaoqiao] [~Aiphag0]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)