Posted to issues@ozone.apache.org by "Ethan Rose (Jira)" <ji...@apache.org> on 2022/09/01 02:16:00 UTC

[jira] [Comment Edited] (HDDS-7103) Ratis log storage directories unchecked causing unhandled exception on datanode restart

    [ https://issues.apache.org/jira/browse/HDDS-7103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598707#comment-17598707 ] 

Ethan Rose edited comment on HDDS-7103 at 9/1/22 2:15 AM:
----------------------------------------------------------

[~NeilJoshi] it looks like the datanode currently shuts down if this exception is thrown. We could probably change the code to catch the exception, skip loading the group, and queue a close pipeline action to send to SCM once the datanode has registered. It looks like RATIS-1677 [will be reverted|https://github.com/apache/ratis/pull/718#issuecomment-1231215723] on the Ratis 2.4.0 release branch, so this won't be needed for Ozone 1.3.0. We will also need to check the format options Ozone is using, based on [this comment|https://issues.apache.org/jira/browse/RATIS-1694?focusedCommentId=17598402&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17598402].
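
A minimal sketch of the catch, skip, and queue idea described above. Everything here is hypothetical: GroupLoadSketch, GroupLoader, and pendingCloseActions are illustrative names only, not Ozone or Ratis APIs, and the real change would have to hook into the existing datanode startup and SCM reporting code.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these types exist in Ozone or Ratis.
class GroupLoadSketch {

  /** Stand-in for whatever loads a single Ratis group from disk. */
  interface GroupLoader {
    void load(String groupId) throws IOException;
  }

  // Groups that failed to load; a close pipeline action would be sent to SCM
  // for each of these once the datanode has registered.
  private final List<String> pendingCloseActions = new ArrayList<>();

  /** Loads each group, skipping (and remembering) the ones that throw. */
  List<String> loadGroups(List<String> groupIds, GroupLoader loader) {
    List<String> loaded = new ArrayList<>();
    for (String groupId : groupIds) {
      try {
        loader.load(groupId);
        loaded.add(groupId);
      } catch (IOException e) {
        // Do not let one bad group take down the datanode: skip it and
        // queue it so SCM can be asked to close the pipeline later.
        pendingCloseActions.add(groupId);
      }
    }
    return loaded;
  }

  /** Returns a copy of the groups queued for a close pipeline action. */
  List<String> pendingCloseActions() {
    return new ArrayList<>(pendingCloseActions);
  }
}
```

The point of the sketch is only the control flow: a failure for one group is recorded rather than propagated, so the remaining groups still load.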


was (Author: erose):
[~NeilJoshi] it looks like what will currently happen if the exception is thrown is that the datanode will shut down. We could probably change the code to catch this exception, skip loading the group, and queue a close pipeline action to send to SCM once the datanode has registered. It looks like RATIS-1677 got reverted on the Ratis 2.4.0 release branch so this won't be needed for Ozone 1.3.0. We will also need to check the format options Ozone is using based on [this comment|https://issues.apache.org/jira/browse/RATIS-1694?focusedCommentId=17598402&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17598402].

> Ratis log storage directories unchecked causing unhandled exception on datanode restart
> ---------------------------------------------------------------------------------------
>
>                 Key: HDDS-7103
>                 URL: https://issues.apache.org/jira/browse/HDDS-7103
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Neil Joshi
>            Priority: Major
>
> When the Ratis log storage directories are configured on multiple disks and corruption causes the same directory to be found on each disk, Ratis throws an unhandled exception.  The unhandled exception prevents the datanode from creating pipelines.  The datanode remains up, and the user can only detect the failure through the datanode logs.
> The error can be seen on an Ozone cluster with the configuration property _*dfs.container.ratis.datanode.storage.dir*_ set to two volume locations, i.e. _dn1,dn2_, with the same directories present on both disks.  On datanode start, the error is logged when bringing up the XceiverServerRatis.
> Snippet of logged error:
> {code:java}
> ozone-datanode-1  | 2022-08-03 22:05:54 INFO  XceiverServerRatis:481 - Starting XceiverServerRatis feb90744-e0e7-4b2e-8d57-02213ce29693
> ozone-datanode-1  | 2022-08-03 22:05:54 WARN  EndpointStateMachine:236 - Unable to communicate to SCM server at scm:9861 for past 0 seconds.
> ozone-datanode-1  | java.io.IOException: More than one directories found for 01a173a0-6bd2-478a-8598-05df3a6f318a: [/mydata/dn1/01a173a0-6bd2-478a-8598-05df3a6f318a, /mydata/dn2/01a173a0-6bd2-478a-8598-05df3a6f318a]
> ozone-datanode-1  |     at org.apache.ratis.server.impl.ServerState.chooseStorageDir(ServerState.java:177)
> ozone-datanode-1  |     at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:113)
> ozone-datanode-1  |     at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:201){code}
> This jira is filed to track and resolve the issue.  The issue was identified and discussed in a previous PR for the hdds volume diskchecker, PR #2158, https://github.com/apache/ozone/pull/2158#issuecomment-836580999.
> The idea from the PR was to omit the affected directories and continue.  This could be done in either of two ways:
> i.) with a checker that runs before the XceiverServerRatis starts; if such a checker already exists in the current Ozone, the question is how to configure it to resolve this issue.
> ii.) modify the Ratis code to remove the affected directories and continue instead of throwing an unhandled IOException, see https://github.com/apache/ratis/blob/040bc52e19a5e36f5710ccd4fc1981e862e691e8/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L107-L117.
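
As a rough illustration of option i.), a pre-start check could scan the configured storage locations and report any directory name that appears under more than one of them. This is only a sketch under stated assumptions: DuplicateGroupDirChecker and findDuplicates are hypothetical names, not existing Ozone or Ratis code, and it assumes group directories sit directly under each configured storage location, as in the logged paths above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical pre-start check; not part of Ozone or Ratis.
class DuplicateGroupDirChecker {

  /**
   * Returns every subdirectory name found under more than one storage root,
   * mapped to the full paths where it was found.
   */
  static Map<String, List<Path>> findDuplicates(List<Path> storageRoots)
      throws IOException {
    Map<String, List<Path>> byName = new TreeMap<>();
    for (Path root : storageRoots) {
      if (!Files.isDirectory(root)) {
        continue; // missing or non-directory roots are ignored in this sketch
      }
      try (DirectoryStream<Path> children = Files.newDirectoryStream(root)) {
        for (Path child : children) {
          if (Files.isDirectory(child)) {
            byName.computeIfAbsent(child.getFileName().toString(),
                k -> new ArrayList<>()).add(child);
          }
        }
      }
    }
    // Keep only names that occur under two or more roots.
    byName.values().removeIf(paths -> paths.size() < 2);
    return byName;
  }
}
```

What to do with the reported duplicates (omit them, log them, or fail fast) is the open design question in this jira; the sketch only detects them.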


