Posted to issues@ozone.apache.org by "Ethan Rose (Jira)" <ji...@apache.org> on 2021/10/20 20:37:12 UTC

[jira] [Updated] (HDDS-660) StatusRuntimeException : DataNode going dead

     [ https://issues.apache.org/jira/browse/HDDS-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Rose updated HDDS-660:
----------------------------
    Target Version/s: 1.3.0  (was: 1.2.0)

I am managing the 1.2.0 release and we currently have more than 600 issues targeted for 1.2.0. I am moving the target field to 1.3.0.

If you are actively working on this Jira and believe it should be targeted for the 1.2.0 release, please reach out to me via Apache email or Slack.

> StatusRuntimeException : DataNode going dead
> --------------------------------------------
>
>                 Key: HDDS-660
>                 URL: https://issues.apache.org/jira/browse/HDDS-660
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Filesystem
>    Affects Versions: 0.3.0
>            Reporter: Soumitra Sulav
>            Priority: Major
>
> Issue 1: HDFS operations fail with *INTERNAL_ERROR* when one of the datanodes is down, because it cannot replicate to the minimum number of datanodes. _The ERROR log could be more specific._
> Issue 2: The DataNode process is running but is in a dead state according to SCM. There are also exceptions in the DataNode logs: *StatusRuntimeException: INTERNAL: group-4D3A6FFFBFE2 not found.* Is there a way to fix filesystem corruption, or an fsck-like utility as in HDFS?
> +Steps followed to encounter the above issue:+
> I had a clean Ozone cluster setup and tried starting HDP services with o3 as defaultFS.
> YARN failed to start, and the logs and UI show that one of the datanodes is going to the DEAD state.
> The HDFS CLI commands on the Ozone filesystem give the following exception:
> {code:java}
> [root@hcatest-1 ~]# ozone fs -put ozone-site.xml /
> 2018-10-15 09:33:20,385 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2018-10-15 09:33:21,774 ERROR io.ChunkGroupOutputStream: Try to allocate more blocks for write failed, already allocated 0 blocks for this write.
> put: Allocate block failed, error:INTERNAL_ERROR
> {code}
> Error logs on SCM:
> {code:java}
> 2018-10-15 10:16:54,303 WARN org.apache.hadoop.hdds.scm.block.BlockManagerImpl: Unable to allocate container: {}
> org.apache.hadoop.hdds.scm.exceptions.SCMException
> at org.apache.hadoop.hdds.scm.pipelines.PipelineSelector.getReplicationPipeline(PipelineSelector.java:268)
> at org.apache.hadoop.hdds.scm.container.ContainerStateManager.allocateContainer(ContainerStateManager.java:270)
> at org.apache.hadoop.hdds.scm.container.SCMContainerManager.allocateContainer(SCMContainerManager.java:312)
> at org.apache.hadoop.hdds.scm.block.BlockManagerImpl.preAllocateContainers(BlockManagerImpl.java:165)
> at org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:279)
> at org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:143)
> at org.apache.hadoop.ozone.protocolPB.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:74)
> at org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:6255)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> 2018-10-15 10:16:54,303 ERROR org.apache.hadoop.hdds.scm.block.BlockManagerImpl: Unable to allocate a block for the size: 268435456, type: RATIS, factor: THREE
> {code}
> DataNode error logs:
> {code:java}
> 2018-10-15 10:33:13,522 INFO org.apache.ratis.server.impl.LeaderElection: 0e4e7c9b-84a9-48a3-b44d-d906231e77b2 got exception when requesting votes: {}
> java.util.concurrent.ExecutionException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL: 3cf6e2da-4fdb-4198-a24d-5c34ca02fe4d: group-4D3A6FFFBFE2 not found.
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at org.apache.ratis.server.impl.LeaderElection.waitForResults(LeaderElection.java:214)
> at org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:146)
> at org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:102)
> Caused by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL: 3cf6e2da-4fdb-4198-a24d-5c34ca02fe4d: group-4D3A6FFFBFE2 not found.
> at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:222)
> at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:203)
> at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:132)
> at org.apache.ratis.proto.grpc.RaftServerProtocolServiceGrpc$RaftServerProtocolServiceBlockingStub.requestVote(RaftServerProtocolServiceGrpc.java:265)
> at org.apache.ratis.grpc.server.GrpcServerProtocolClient.requestVote(GrpcServerProtocolClient.java:61)
> at org.apache.ratis.grpc.server.GrpcService.requestVote(GrpcService.java:150)
> at org.apache.ratis.server.impl.LeaderElection.lambda$submitRequests$0(LeaderElection.java:188)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2018-10-15 10:33:13,523 INFO org.apache.ratis.server.impl.LeaderElection: 0e4e7c9b-84a9-48a3-b44d-d906231e77b2: Election REJECTED; received 0 response(s) [] and 2 exception(s); 0e4e7c9b-84a9-48a3-b44d-d906231e77b2:t140, leader=null, voted=0e4e7c9b-84a9-48a3-b44d-d906231e77b2, raftlog=[(t:1, i:1)], conf=0: [76b2ad5f-1a40-4a28-9fc1-b91437fe1398:172.22.119.190:9858, 0e4e7c9b-84a9-48a3-b44d-d906231e77b2:172.22.119.189:9858, 3cf6e2da-4fdb-4198-a24d-5c34ca02fe4d:172.22.119.19:9858], old=null
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)