Posted to issues@ozone.apache.org by "Ethan Rose (Jira)" <ji...@apache.org> on 2022/10/19 22:31:00 UTC

[jira] [Updated] (HDDS-7100) Container scanner incorrectly marks containers unhealthy when DN is shutdown

     [ https://issues.apache.org/jira/browse/HDDS-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Rose updated HDDS-7100:
-----------------------------
    Summary: Container scanner incorrectly marks containers unhealthy when DN is shutdown  (was: Scrubber incorrectly marks containers unhealthy when DN is shutdown)

> Container scanner incorrectly marks containers unhealthy when DN is shutdown
> ----------------------------------------------------------------------------
>
>                 Key: HDDS-7100
>                 URL: https://issues.apache.org/jira/browse/HDDS-7100
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: Ozone Datanode
>            Reporter: Stephen O'Donnell
>            Assignee: Szabolcs Gál
>            Priority: Critical
>
> When the DN is shut down, the ContainerDataScanner.shutdown() method is called:
> {code}
>   public synchronized void shutdown() {
>     this.stopping = true;
>     this.canceler.cancel(
>         String.format(NAME_FORMAT, volume) + " is shutting down");
>     this.interrupt();
>     try {
>       this.join();
>     } catch (InterruptedException ex) {
>       LOG.warn("Unexpected exception while stopping data scanner for volume "
>           + volume, ex);
>       Thread.currentThread().interrupt();
>     }
>   }
> {code}
> This interrupts the scanner thread (via this.interrupt()). The code to scan a container looks like:
> {code}
>   public boolean fullCheck(DataTransferThrottler throttler, Canceler canceler) {
>     boolean valid;
>     try {
>       valid = fastCheck();
>       if (valid) {
>         scanData(throttler, canceler);
>       }
>     } catch (IOException e) {
>       handleCorruption(e);
>       valid = false;
>     }
>     return valid;
>   }
> {code}
> The interrupt causes some method further down the stack to throw an exception, which is then caught by the IOException handler. Right now, the handler assumes any exception is due to the container being unhealthy, and marks the container as such.
> Adding some debug code, we can see the real exception when this occurs is "java.nio.channels.ClosedByInterruptException":
> {code}
> datanode_1  | 2022-08-05 12:08:51,676 [ContainerDataScanner(/data/hdds/hdds)] INFO keyvalue.KeyValueContainerCheck: IO exception in checker
> datanode_1  | java.nio.channels.ClosedByInterruptException
> datanode_1  | 	at java.base/java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:199)
> datanode_1  | 	at java.base/sun.nio.ch.FileChannelImpl.endBlocking(FileChannelImpl.java:162)
> datanode_1  | 	at java.base/sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:366)
> datanode_1  | 	at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.verifyChecksum(KeyValueContainerCheck.java:295)
> datanode_1  | 	at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.scanData(KeyValueContainerCheck.java:272)
> datanode_1  | 	at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.fullCheck(KeyValueContainerCheck.java:128)
> datanode_1  | 	at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.scanData(KeyValueContainer.java:849)
> datanode_1  | 	at org.apache.hadoop.ozone.container.ozoneimpl.ContainerDataScanner.runIteration(ContainerDataScanner.java:106)
> datanode_1  | 	at org.apache.hadoop.ozone.container.ozoneimpl.ContainerDataScanner.run(ContainerDataScanner.java:81)
> {code}
> I am not sure whether other types of exception could be raised, so simply catching ClosedByInterruptException is probably not a good solution. I feel we should raise specific container integrity exceptions if the container should be marked unhealthy, and the catch-all IOException handler probably should not be used.
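
One possible shape for the approach suggested in the description, as a minimal sketch and not the actual Ozone change. Everything named here is hypothetical: the ContainerIntegrityException type, the ScanResult enum, the ScanResultSketch class, and the simplified fullCheck/scanData signatures (the real methods take a throttler and a canceler). The sketch assumes that only a dedicated integrity exception should mark a container unhealthy, and that any other IOException, including ClosedByInterruptException, leaves the result inconclusive. It relies on the documented behaviour that a blocking operation on an interruptible channel throws ClosedByInterruptException when the calling thread's interrupt status is already set.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class ScanResultSketch {

  /** Hypothetical type: thrown only when the data is proven to be bad. */
  static class ContainerIntegrityException extends IOException {
    ContainerIntegrityException(String message) {
      super(message);
    }
  }

  /** Hypothetical three-way result replacing the boolean "valid" flag. */
  enum ScanResult { HEALTHY, UNHEALTHY, ABORTED }

  /** Stand-in for scanData(): a one-byte "checksum" comparison. */
  static void scanData(Path file, byte expected) throws IOException {
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
      ByteBuffer buffer = ByteBuffer.allocate(1);
      // read() throws ClosedByInterruptException if the thread is interrupted.
      if (channel.read(buffer) != 1 || buffer.get(0) != expected) {
        throw new ContainerIntegrityException("checksum mismatch in " + file);
      }
    }
  }

  /** Only a proven integrity failure yields UNHEALTHY. */
  static ScanResult fullCheck(Path file, byte expected) {
    try {
      scanData(file, expected);
      return ScanResult.HEALTHY;
    } catch (ContainerIntegrityException e) {
      return ScanResult.UNHEALTHY;
    } catch (IOException e) {
      // Assumption: interrupts and other I/O failures are inconclusive,
      // so the container state is left unchanged.
      return ScanResult.ABORTED;
    }
  }

  /** Runs a good scan, a corrupt scan, and a scan on an interrupted thread. */
  static List<ScanResult> demo() {
    try {
      Path file = Files.createTempFile("container", ".bin");
      try {
        Files.write(file, new byte[] {42});
        ScanResult good = fullCheck(file, (byte) 42);
        ScanResult corrupt = fullCheck(file, (byte) 7);
        Thread.currentThread().interrupt(); // simulate shutdown()'s interrupt
        ScanResult interrupted = fullCheck(file, (byte) 42);
        return List.of(good, corrupt, interrupted);
      } finally {
        Thread.interrupted(); // clear the flag; demo only
        Files.deleteIfExists(file);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints [HEALTHY, UNHEALTHY, ABORTED]
  }
}
```

With the boolean-returning fullCheck quoted above, the third scan would have been reported as invalid. In this sketch the interrupted scan is reported as ABORTED, so a caller could skip the unhealthy marking during shutdown. Which other IOExceptions should count as integrity failures remains an open question in the description and is not settled here.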



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
