Posted to issues@ozone.apache.org by "Ethan Rose (Jira)" <ji...@apache.org> on 2022/08/10 23:30:00 UTC

[jira] [Commented] (HDDS-7103) Ratis log storage directories unchecked causing unhandled exception on datanode restart

    [ https://issues.apache.org/jira/browse/HDDS-7103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578181#comment-17578181 ] 

Ethan Rose commented on HDDS-7103:
----------------------------------

bq. When there was an existing group directory and it failed, Ratis should just throw an exception but not try to start with a new directory.
[~szetszwo] I don't quite follow this part. Directories do not fail; disks fail. When a disk fails, Ratis may not be able to see the directory on it, so it cannot know that the directory corresponds to an existing group. Consider the following example:
# /disk1 and /disk2 are configured as Ratis storage directories. Each maps to a different disk's mount point.
# Ratis creates group1 in directory /disk1/group1.
# /disk1 becomes intermittently flaky, such that on restart none of its contents can be read. Since Ratis cannot read anything under the /disk1 path, it creates a new directory for the group at /disk2/group1.
# On a later restart, /disk1 comes back, so there are now duplicate directories for group1. This still causes the exception in this Jira.
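
The sequence above can be sketched as follows. This is only an illustration under stated assumptions, not Ratis code: the class GroupDirScan and the method findGroupDirs are hypothetical names, the failed disk is simulated by renaming its directory away, and the real check is the one in ServerState.chooseStorageDir shown in the stack trace of this Jira.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Ratis.
final class GroupDirScan {

  // Returns the group directory under every storage root that currently has one.
  static List<Path> findGroupDirs(List<Path> storageRoots, String groupId) {
    List<Path> found = new ArrayList<>();
    for (Path root : storageRoots) {
      Path candidate = root.resolve(groupId);
      // A root on a failed disk looks the same as a root without the group.
      if (Files.isDirectory(candidate)) {
        found.add(candidate);
      }
    }
    return found;
  }

  public static void main(String[] args) throws IOException {
    // Step 1: two storage roots, standing in for two mount points.
    Path base = Files.createTempDirectory("ratis-demo");
    Path disk1 = Files.createDirectories(base.resolve("disk1"));
    Path disk2 = Files.createDirectories(base.resolve("disk2"));
    List<Path> roots = List.of(disk1, disk2);

    // Step 2: group1 is created on disk1.
    Files.createDirectories(disk1.resolve("group1"));

    // Step 3: disk1 becomes unreadable (simulated by renaming it away),
    // so the scan finds nothing and a new directory is created on disk2.
    Path offline = base.resolve("disk1.offline");
    Files.move(disk1, offline);
    System.out.println("disk1 offline, found: " + findGroupDirs(roots, "group1").size());
    Files.createDirectories(disk2.resolve("group1"));

    // Step 4: disk1 comes back, and the group now has two directories.
    Files.move(offline, disk1);
    System.out.println("disk1 back, found: " + findGroupDirs(roots, "group1").size());
  }
}
```

In this sketch the scan cannot tell "the group never existed on this disk" apart from "the disk is currently unreadable", which is why failing on an existing group directory does not prevent the duplicate from being created in step 3.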


> Ratis log storage directories unchecked causing unhandled exception on datanode restart
> ---------------------------------------------------------------------------------------
>
>                 Key: HDDS-7103
>                 URL: https://issues.apache.org/jira/browse/HDDS-7103
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Neil Joshi
>            Priority: Major
>
> When the Ratis storage logs are configured to be on multiple disks and a corruption causes the same directory to be found on each disk, Ratis throws an unhandled exception.  The unhandled exception prevents the datanode from creating pipelines.  The datanode remains up, and the user can detect the failure only through the datanode logs.
> The error can be seen on an Ozone cluster with the configuration property _*dfs.container.ratis.datanode.storage.dir*_ set to two volume locations, e.g. _dn1,dn2_, with the same directories present on both disks.  On datanode start, the error is logged when bringing up the XceiverServerRatis.
> Snippet of logged error:
> {code:java}
> ozone-datanode-1  | 2022-08-03 22:05:54 INFO  XceiverServerRatis:481 - Starting XceiverServerRatis feb90744-e0e7-4b2e-8d57-02213ce29693
> ozone-datanode-1  | 2022-08-03 22:05:54 WARN  EndpointStateMachine:236 - Unable to communicate to SCM server at scm:9861 for past 0 seconds.
> ozone-datanode-1  | java.io.IOException: More than one directories found for 01a173a0-6bd2-478a-8598-05df3a6f318a: [/mydata/dn1/01a173a0-6bd2-478a-8598-05df3a6f318a, /mydata/dn2/01a173a0-6bd2-478a-8598-05df3a6f318a]
> ozone-datanode-1  |     at org.apache.ratis.server.impl.ServerState.chooseStorageDir(ServerState.java:177)
> ozone-datanode-1  |     at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:113)
> ozone-datanode-1  |     at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:201){code}
> This jira is filed to track the issue and to resolve it.  The issue was identified and discussed in a previous PR for the hdds volume diskchecker, PR #2158, https://github.com/apache/ozone/pull/2158#issuecomment-836580999.
> The idea from the PR was to omit the directories with the problem and continue.  This could be done in either of two ways:
> i.) with a checker that runs prior to the XceiverServerRatis; if such a checker is already in the current Ozone, the question is how to configure it to resolve this issue.
> ii.) modify the Ratis code to remove the affected directories and continue instead of throwing an unhandled IOException, see https://github.com/apache/ratis/blob/040bc52e19a5e36f5710ccd4fc1981e862e691e8/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L107-L117.



