Posted to issues@ozone.apache.org by "Mladjan Gadzic (Jira)" <ji...@apache.org> on 2023/07/15 11:59:00 UTC

[jira] [Updated] (HDDS-9022) DiskChecker incorrectly reporting errors

     [ https://issues.apache.org/jira/browse/HDDS-9022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mladjan Gadzic updated HDDS-9022:
---------------------------------
    Description: 
During a load test of an AWS-based Ozone cluster, we see datanodes shutting down. We believe this is caused by the new DiskChecker incorrectly reporting errors.
It happens because a StorageVolume subclass runs the check method without initializing the diskCheckDir field.
Thanks [~xBis] for the research and notes on this!
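The bare file name in the FileNotFoundException further down is consistent with this: when the parent File is null, java.io.File resolves the child relative to the process working directory instead of the volume. A minimal standalone sketch (illustrative only, not Ozone code):

```java
import java.io.File;

public class NullParentDemo {
    public static void main(String[] args) {
        // java.io.File(File parent, String child) treats a null parent as
        // "no parent": the resulting path is relative to the working
        // directory, with no volume directory in front of it.
        File f = new File((File) null, "disk-check-test");
        System.out.println(f.getPath());      // prints "disk-check-test"
        System.out.println(f.isAbsolute());   // prints "false"
    }
}
```

This is why the failing check reports only "disk-check-<uuid>" with no directory component.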

Diff adding the diagnostic log messages:
{code:java}
diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
index b267b1d47..1cb0d0085 100644
--- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
+++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
@@ -37,6 +37,8 @@
  * where the disk is mounted.
  */
 public final class DiskCheckUtil {
+  private static final Logger LOG =
+      LoggerFactory.getLogger(DiskCheckUtil.class);
   private DiskCheckUtil() { }
 
   // For testing purposes, an alternate check implementation can be provided
@@ -63,6 +65,20 @@ public static boolean checkPermissions(File storageDir) {
 
   public static boolean checkReadWrite(File storageDir, File testFileDir,
       int numBytesToWrite) {
+    if (storageDir == null) {
+      LOG.info("###storageDir is null. Printing stack trace: {}",
+          Arrays.toString(new NullPointerException().getStackTrace()));
+    } else {
+      LOG.info("###storageDir path={}", storageDir.getPath());
+    }
+
+    if (testFileDir == null) {
+      LOG.info("###testFileDir is null. Printing stack trace: {}",
+          Arrays.toString(new NullPointerException().getStackTrace()));
+    } else {
+      LOG.info("###testFileDir path={}", testFileDir.getPath());
+    }
+
     return impl.checkReadWrite(storageDir, testFileDir, numBytesToWrite);
   }
  {code}
 

Stack trace (note the lines marked with "###"):
{code}
2023-07-15 10:07:38,006 [Periodic HDDS volume checker] INFO org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker: Scheduling a check for /hadoop/ozone/data/disk1/datanode/data/hdds
2023-07-15 10:07:38,007 [Periodic HDDS volume checker] INFO org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker: Scheduled health check for volume /hadoop/ozone/data/disk1/datanode/data/hdds
2023-07-15 10:07:38,007 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###storageDir path=/hadoop/ozone/data/disk1/datanode/data/hdds
2023-07-15 10:07:38,007 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###testFileDir path=/hadoop/ozone/data/disk1/datanode/data/hdds/CID-2eb5a782-379b-46c7-8bd1-8b19043c1a6e/tmp/disk-check
2023-07-15 10:07:38,010 [Periodic HDDS volume checker] INFO org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker: Scheduling a check for /hadoop/ozone/data/disk1/datanode
2023-07-15 10:07:38,010 [Periodic HDDS volume checker] INFO org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker: Scheduled health check for volume /hadoop/ozone/data/disk1/datanode
2023-07-15 10:07:38,010 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###storageDir path=/hadoop/ozone/data/disk1/datanode
2023-07-15 10:07:38,011 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###testFileDir is null. Printing stack trace: [org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil.checkReadWrite(DiskCheckUtil.java:76), org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:629), org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:68), org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker.lambda$schedule$0(ThrottledAsyncChecker.java:143), com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131), com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75), com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82), java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149), java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624), java.lang.Thread.run(Thread.java:750)]
2023-07-15 10:07:38,011 [DataNode DiskChecker thread 0] ERROR org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: Volume /hadoop/ozone/data/disk1/datanode failed health check. Could not find file disk-check-c8f5b40b-0cf6-420e-bbc3-9fb59def11eb for volume check.
java.io.FileNotFoundException: disk-check-c8f5b40b-0cf6-420e-bbc3-9fb59def11eb (No such file or directory)
        at java.io.FileOutputStream.open0(Native Method)
        at java.io.FileOutputStream.open(FileOutputStream.java:270)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
        at org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil$DiskChecksImpl.checkReadWrite(DiskCheckUtil.java:153)
        at org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil.checkReadWrite(DiskCheckUtil.java:82)
        at org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:629)
        at org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:68)
        at org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker.lambda$schedule$0(ThrottledAsyncChecker.java:143)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{code}
 

The bug could not be reproduced on a Docker cluster, only on a real cluster. Steps to reproduce:
 # Add or modify the following properties:
{code}
OZONE-SITE.XML_hdds.datanode.periodic.disk.check.interval.minutes=1
OZONE-SITE.XML_hdds.datanode.disk.check.min.gap=0s {code}

 # Wait for the volume to fail a health check
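The OZONE-SITE.XML_ prefix above is the docker-config naming convention; on a real cluster the equivalent entries would go into ozone-site.xml directly:

{code:xml}
<property>
  <name>hdds.datanode.periodic.disk.check.interval.minutes</name>
  <value>1</value>
</property>
<property>
  <name>hdds.datanode.disk.check.min.gap</name>
  <value>0s</value>
</property>
{code}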

See the attached log file (dn1.log) for more context.

Suggested workaround diff:
{code:java}
diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
index 95d1b2c2d..634b15a8e 100644
--- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
+++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
@@ -624,6 +624,10 @@ public synchronized VolumeCheckResult check(@Nullable Boolean unused)
       return VolumeCheckResult.HEALTHY;
     }
 
+    if (diskCheckDir == null) {
+      diskCheckDir = storageDir;
+    }
+
     // Since IO errors may be intermittent, volume remains healthy until the
     // threshold of failures is crossed.
     boolean diskChecksPassed = DiskCheckUtil.checkReadWrite(storageDir,
 {code}
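The guard in the diff above reads as a simple fallback: if no dedicated check directory was ever set, use the storage directory itself, so checkReadWrite never receives a null testFileDir. The logic in isolation (hypothetical names, sketch only):

```java
import java.io.File;

public class DiskCheckDirFallback {
    /** Falls back to the storage directory when no check dir was set. */
    static File resolveCheckDir(File diskCheckDir, File storageDir) {
        return diskCheckDir != null ? diskCheckDir : storageDir;
    }

    public static void main(String[] args) {
        File storage = new File("/hadoop/ozone/data/disk1/datanode");
        // Uninitialized check dir: fall back to the volume's storage dir.
        System.out.println(resolveCheckDir(null, storage));
        // A properly initialized check dir is used as-is.
        System.out.println(resolveCheckDir(new File("/tmp/disk-check"), storage));
    }
}
```

A proper fix would presumably initialize diskCheckDir for every StorageVolume subclass instead of patching it at check time; the fallback above only stops the false negatives.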


> DiskChecker incorrectly reporting errors
> ----------------------------------------
>
>                 Key: HDDS-9022
>                 URL: https://issues.apache.org/jira/browse/HDDS-9022
>             Project: Apache Ozone
>          Issue Type: Bug
>    Affects Versions: 1.3.0
>            Reporter: Mladjan Gadzic
>            Priority: Major
>         Attachments: dn1.log
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org