Posted to issues@ozone.apache.org by "Marton Elek (Jira)" <ji...@apache.org> on 2020/06/29 13:16:00 UTC

[jira] [Commented] (HDDS-830) Datanode should not start XceiverServerRatis before getting version information from SCM

    [ https://issues.apache.org/jira/browse/HDDS-830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147763#comment-17147763 ] 

Marton Elek commented on HDDS-830:
----------------------------------

Discussed with [~nanda] today.

 1. It seems that we no longer need the complexity of EndpointStateMachine for the Datanode. We can remove it.
 2. Nanda suggested doing this on the SCM-HA branch (together with related changes).
 3. We didn't see this problem in prod. It seems to be fixed with a workaround, and removing the EndpointStateMachine will be the long-term fix.

Moving this out of 0.7.0.

> Datanode should not start XceiverServerRatis before getting version information from SCM
> ----------------------------------------------------------------------------------------
>
>                 Key: HDDS-830
>                 URL: https://issues.apache.org/jira/browse/HDDS-830
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Datanode
>    Affects Versions: 0.3.0
>            Reporter: Nanda kumar
>            Priority: Major
>              Labels: TriagePending
>
> If a datanode restarts quickly, before SCM detects the restart, it rejoins the Ratis ring (the existing pipeline). Since SCM did not detect the restart, the pipeline is not closed. There is then a time gap between the datanode starting and it receiving the version information from SCM. During this gap, the SCM ID in the datanode is not set (null). If a client tries to use the pipeline during that time, the container state machine throws {{java.lang.NullPointerException: scmId cannot be null}}. This causes {{RaftLogWorker}} to terminate, which crashes the datanode.
> {code}
> 2018-11-12 19:45:31,811 ERROR storage.RaftLogWorker (ExitUtils.java:terminate(86)) - Terminating with exit status 1: 407fd181-2ff7-4651-9a47-a0927ede4c51-RaftLogWorker failed.
> java.io.IOException: java.lang.NullPointerException: scmId cannot be null
>   at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
>   at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
>   at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:83)
>   at org.apache.ratis.server.storage.RaftLogWorker$StateMachineDataPolicy.getFromFuture(RaftLogWorker.java:76)
>   at org.apache.ratis.server.storage.RaftLogWorker$WriteLog.execute(RaftLogWorker.java:344)
>   at org.apache.ratis.server.storage.RaftLogWorker.run(RaftLogWorker.java:216)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException: scmId cannot be null
>   at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:204)
>   at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.create(KeyValueContainer.java:106)
>   at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handleCreateContainer(KeyValueHandler.java:242)
>   at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:165)
>   at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.createContainer(HddsDispatcher.java:206)
>   at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:124)
>   at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.dispatchCommand(ContainerStateMachine.java:274)
>   at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.runCommand(ContainerStateMachine.java:280)
>   at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$handleWriteChunk$1(ContainerStateMachine.java:301)
>   at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   ... 1 more
> {code}
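
For illustration only, the ordering the issue title asks for (no Ratis server until the SCM version information has arrived) could be sketched roughly as follows. This is not Ozone code: DeferredRatisStart, onVersionResponse, isRatisStarted and createContainer are hypothetical names, and the "server start" is reduced to a flag. The point is only that the start is chained on the SCM ID becoming available, and that a request arriving too early is rejected instead of reaching a null check that terminates the process.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch; none of these names are real Ozone or Ratis classes. */
class DeferredRatisStart {
    // Completed once the (assumed) version handshake with SCM returns.
    private final CompletableFuture<String> scmId = new CompletableFuture<>();
    // Stand-in for "XceiverServerRatis has been started".
    private final AtomicBoolean ratisStarted = new AtomicBoolean(false);

    DeferredRatisStart() {
        // Chain the server start on the SCM ID: it cannot run before the ID is known.
        scmId.thenAccept(id -> ratisStarted.set(true));
    }

    /** Called when the version response arrives from SCM. */
    void onVersionResponse(String id) {
        if (id == null) {
            throw new IllegalArgumentException("scmId cannot be null");
        }
        scmId.complete(id);
    }

    boolean isRatisStarted() {
        return ratisStarted.get();
    }

    /** Stand-in for container creation; rejects early requests rather than crashing. */
    String createContainer(long containerId) {
        if (!ratisStarted.get()) {
            throw new IllegalStateException("datanode not ready: SCM version not received");
        }
        // Safe: the future is already completed once ratisStarted is true.
        return scmId.join() + "/" + containerId;
    }
}
```

With this shape, a client request that arrives during the gap described above fails with a retriable error on that one request, and the datanode process itself keeps running.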


