Posted to issues@ozone.apache.org by "Bharat Viswanadham (Jira)" <ji...@apache.org> on 2021/08/10 06:43:00 UTC

[jira] [Resolved] (HDDS-5546) OM Service ID change causes OM startup failure

     [ https://issues.apache.org/jira/browse/HDDS-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bharat Viswanadham resolved HDDS-5546.
--------------------------------------
    Fix Version/s: 1.2.0
       Resolution: Fixed

> OM Service ID change causes OM startup failure
> ----------------------------------------------
>
>                 Key: HDDS-5546
>                 URL: https://issues.apache.org/jira/browse/HDDS-5546
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Bharat Viswanadham
>            Assignee: Bharat Viswanadham
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.2.0
>
>
> In OM HA, the raftGroupID is generated from the OM service ID.
> So, if the OM service ID changes, OM startup fails with the error below:
> {code:java}
> 2021-08-05 12:20:03,043 ERROR org.apache.hadoop.ozone.om.OzoneManagerStarter: OM start failed with exception
> java.io.IOException: java.lang.IllegalStateException: ILLEGAL TRANSITION: In OzoneManagerStateMachine:om1:group-8A65FD498CB6, RUNNING -> STARTING
>         at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
>         at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
>         at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:71)
>         at org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:354)
>         at org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:371)
>         at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.start(OzoneManagerRatisServer.java:390)
>         at org.apache.hadoop.ozone.om.OzoneManager.start(OzoneManager.java:1109)
>         at org.apache.hadoop.ozone.om.OzoneManagerStarter$OMStarterHelper.start(OzoneManagerStarter.java:126)
>         at org.apache.hadoop.ozone.om.OzoneManagerStarter.startOm(OzoneManagerStarter.java:79)
>         at org.apache.hadoop.ozone.om.OzoneManagerStarter.call(OzoneManagerStarter.java:67)
>         at org.apache.hadoop.ozone.om.OzoneManagerStarter.call(OzoneManagerStarter.java:38)
>         at picocli.CommandLine.executeUserObject(CommandLine.java:1933)
>         at picocli.CommandLine.access$1100(CommandLine.java:145)
>         at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2332)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2326)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2291)
>         at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:2152)
>         at picocli.CommandLine.parseWithHandlers(CommandLine.java:2530)
>         at picocli.CommandLine.parseWithHandler(CommandLine.java:2465)
>         at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:96)
>         at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:87)
>         at org.apache.hadoop.ozone.om.OzoneManagerStarter.main(OzoneManagerStarter.java:51)
> Caused by: java.lang.IllegalStateException: ILLEGAL TRANSITION: In OzoneManagerStateMachine:om1:group-8A65FD498CB6, RUNNING -> STARTING
>         at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:60)
>         at org.apache.ratis.util.LifeCycle$State.validate(LifeCycle.java:121)
>         at org.apache.ratis.util.LifeCycle.transition(LifeCycle.java:164)
>         at org.apache.ratis.util.LifeCycle.startAndTransition(LifeCycle.java:268)
>         at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.initialize(OzoneManagerStateMachine.java:127)
>         at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:120)
>         at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:193)
>         at org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$4(RaftServerProxy.java:266)
>         at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
> The reason is that a new Ratis group directory is created, and the single StateMachine instance is shared between the old and new groups. The error is confusing to end users because it does not make clear that the failure was caused by the change in the OM service ID.
> This Jira adds safeguard code that gives a clear message explaining why OM startup failed. I will raise another Jira to stop using the OM service ID in the Ratis group ID.
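A minimal sketch of the kind of safeguard described above, in plain Java with no Ratis dependency. It assumes (a) the Raft group ID is a name-based UUID derived from the service ID with `UUID.nameUUIDFromBytes`, so one service ID always maps to one group ID and a different service ID maps to a different one, and (b) the OM Ratis storage directory holds one sub-directory per group, named after the group UUID. The class and method names (`OmGroupIdCheck`, `expectedGroupId`, `checkGroupDirectories`) are hypothetical; this is not the actual HDDS-5546 patch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.stream.Stream;

public class OmGroupIdCheck {

  // Assumption: the group ID is a name-based (type 3) UUID derived from the
  // OM service ID, so the same service ID always yields the same group ID.
  static UUID expectedGroupId(String omServiceId) {
    return UUID.nameUUIDFromBytes(omServiceId.getBytes(StandardCharsets.UTF_8));
  }

  // Only directories whose names parse as UUIDs are treated as group dirs.
  static boolean looksLikeUuid(String name) {
    try {
      UUID.fromString(name);
      return true;
    } catch (IllegalArgumentException e) {
      return false;
    }
  }

  // Hypothetical safeguard: if the storage dir already contains a group
  // directory other than the one derived from the current service ID, the
  // service ID has most likely changed since the last run. Fail fast with a
  // message that says so, instead of letting the shared state machine hit
  // an ILLEGAL TRANSITION later.
  static void checkGroupDirectories(Path ratisStorageDir, String omServiceId)
      throws IOException {
    String expected = expectedGroupId(omServiceId).toString();
    List<String> others = new ArrayList<>();
    try (Stream<Path> entries = Files.list(ratisStorageDir)) {
      entries.filter(Files::isDirectory)
          .map(p -> p.getFileName().toString())
          .filter(OmGroupIdCheck::looksLikeUuid)
          .filter(name -> !name.equals(expected))
          .forEach(others::add);
    }
    if (!others.isEmpty()) {
      throw new IOException("Ratis storage dir " + ratisStorageDir
          + " contains group directories " + others
          + " that do not match group " + expected
          + " derived from OM service ID '" + omServiceId
          + "'. Was the OM service ID changed?");
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("om-ratis");
    // Simulate a previous run that used service ID "omservice1".
    Files.createDirectory(dir.resolve(expectedGroupId("omservice1").toString()));
    checkGroupDirectories(dir, "omservice1"); // same service ID: no error
    try {
      checkGroupDirectories(dir, "omservice2"); // service ID changed
    } catch (IOException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Failing before the Ratis server starts keeps the diagnosis in one place: the operator sees the service ID mismatch directly rather than a state-machine lifecycle error several frames deep in the Ratis stack.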



--
This message was sent by Atlassian Jira
(v8.3.4#803005)