Posted to hdfs-issues@hadoop.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/07/13 03:42:00 UTC

[jira] [Work logged] (HDFS-16657) Changing pool-level lock to volume-level lock for invalidation of blocks

     [ https://issues.apache.org/jira/browse/HDFS-16657?focusedWorklogId=790241&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-790241 ]

ASF GitHub Bot logged work on HDFS-16657:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 13/Jul/22 03:41
            Start Date: 13/Jul/22 03:41
    Worklog Time Spent: 10m 
      Work Description: yuanboliu opened a new pull request, #4558:
URL: https://github.com/apache/hadoop/pull/4558

   The key code is:
   
   try {
     File blockFile = new File(info.getBlockURI());
     if (blockFile != null && blockFile.getParentFile() == null) {
       errors.add("Failed to delete replica " + invalidBlks[i]
           +  ". Parent not found for block file: " + blockFile);
       continue;
     }
   } catch(IllegalArgumentException e) {
     LOG.warn("Parent directory check failed; replica " + info
         + " is not backed by a local file");
   } 
   The DN tries to locate the parent path of the block file, so disk I/O happens while the pool-level lock is held. When the disk is very busy and I/O wait is high, all pending threads are blocked on the pool-level lock and heartbeat time becomes high. We propose changing the pool-level lock to a volume-level lock for block invalidation.
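
As a rough illustration of the proposed lock granularity (not the code in this pull request), the sketch below keeps one lock per volume, so a slow disk check on one volume does not hold up work on another. The class name VolumeLockSketch, its method names, and the volume IDs "disk1" and "disk2" are hypothetical and are not taken from Hadoop.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch: one lock per volume instead of one lock per block pool. */
final class VolumeLockSketch {
  // One lock per volume, created lazily. Keys are hypothetical volume IDs.
  private final ConcurrentHashMap<String, ReentrantLock> volumeLocks =
      new ConcurrentHashMap<>();

  private ReentrantLock lockFor(String volumeId) {
    return volumeLocks.computeIfAbsent(volumeId, id -> new ReentrantLock());
  }

  /** Runs the given check while holding only the lock of the given volume. */
  void invalidateOnVolume(String volumeId, Runnable parentDirCheck) {
    ReentrantLock lock = lockFor(volumeId);
    lock.lock();
    try {
      // May stall on a busy disk, but only callers of this volume wait.
      parentDirCheck.run();
    } finally {
      lock.unlock();
    }
  }

  /** Returns true if some thread currently holds the lock of the given volume. */
  boolean isVolumeLocked(String volumeId) {
    return lockFor(volumeId).isLocked();
  }

  public static void main(String[] args) {
    VolumeLockSketch sketch = new VolumeLockSketch();
    // While the check for "disk1" runs, only the "disk1" lock is held.
    sketch.invalidateOnVolume("disk1", () ->
        System.out.println("disk1 locked: " + sketch.isVolumeLocked("disk1")
            + ", disk2 locked: " + sketch.isVolumeLocked("disk2")));
  }
}
```

With a single pool-level lock, the same check would serialize every caller in the block pool behind the slowest disk; with per-volume locks, only callers that target the busy volume wait.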




Issue Time Tracking
-------------------

            Worklog Id:     (was: 790241)
    Remaining Estimate: 0h
            Time Spent: 10m

> Changing pool-level lock to volume-level lock for invalidation of blocks
> ------------------------------------------------------------------------
>
>                 Key: HDFS-16657
>                 URL: https://issues.apache.org/jira/browse/HDFS-16657
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Yuanbo Liu
>            Priority: Major
>         Attachments: image-2022-07-13-10-25-37-383.png, image-2022-07-13-10-27-01-386.png, image-2022-07-13-10-27-44-258.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Recently we saw that DN heartbeats became slow in a very busy cluster. Here is the chart:
> !image-2022-07-13-10-25-37-383.png|width=665,height=245!
>  
> After taking a jstack of the DN, we found that the DN heartbeat was stuck in block invalidation:
> !image-2022-07-13-10-27-01-386.png|width=658,height=308!
> !image-2022-07-13-10-27-44-258.png|width=502,height=325!
> The key code is:
> {code:java}
> try {
>   File blockFile = new File(info.getBlockURI());
>   if (blockFile != null && blockFile.getParentFile() == null) {
>     errors.add("Failed to delete replica " + invalidBlks[i]
>         +  ". Parent not found for block file: " + blockFile);
>     continue;
>   }
> } catch(IllegalArgumentException e) {
>   LOG.warn("Parent directory check failed; replica " + info
>       + " is not backed by a local file");
> }
> {code}
> The DN tries to locate the parent path of the block file, so disk I/O happens while the pool-level lock is held. When the disk is very busy and I/O wait is high, all pending threads are blocked on the pool-level lock and heartbeat time becomes high. We propose changing the pool-level lock to a volume-level lock for block invalidation.
> cc: [~hexiaoqiao] [~Aiphag0] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
