Posted to issues@ozone.apache.org by "Arpit Agarwal (Jira)" <ji...@apache.org> on 2020/06/01 01:21:00 UTC

[jira] [Updated] (HDDS-660) StatusRuntimeException : DataNode going dead

     [ https://issues.apache.org/jira/browse/HDDS-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arpit Agarwal updated HDDS-660:
-------------------------------
    Target Version/s: 0.7.0

> StatusRuntimeException : DataNode going dead
> --------------------------------------------
>
>                 Key: HDDS-660
>                 URL: https://issues.apache.org/jira/browse/HDDS-660
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Filesystem
>    Affects Versions: 0.3.0
>            Reporter: Soumitra Sulav
>            Priority: Major
>
> Issue 1: HDFS operations fail with *INTERNAL_ERROR* when one of the datanodes is down, because the data cannot be replicated to the minimum number of datanodes. _The ERROR log could be more specific._
> Issue 2: The datanode process is running, but SCM reports it as dead. The DataNode logs also contain the exception *StatusRuntimeException: INTERNAL: group-4D3A6FFFBFE2 not found.* Is there a way to repair filesystem corruption, or an fsck utility like the one in HDFS?
> +Steps followed to encounter the above issue:+
> I had a clean Ozone cluster and tried starting HDP services with o3 as the defaultFS.
> YARN failed to start, and the logs and UI show that one of the datanodes is going to the DEAD state.
> HDFS CLI commands run against the Ozone filesystem give the exception below:
> {code:java}
> [root@hcatest-1 ~]# ozone fs -put ozone-site.xml /
> 2018-10-15 09:33:20,385 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2018-10-15 09:33:21,774 ERROR io.ChunkGroupOutputStream: Try to allocate more blocks for write failed, already allocated 0 blocks for this write.
> put: Allocate block failed, error:INTERNAL_ERROR
> {code}
> Error logs on SCM :
> {code:java}
> 2018-10-15 10:16:54,303 WARN org.apache.hadoop.hdds.scm.block.BlockManagerImpl: Unable to allocate container: {}
> org.apache.hadoop.hdds.scm.exceptions.SCMException
> at org.apache.hadoop.hdds.scm.pipelines.PipelineSelector.getReplicationPipeline(PipelineSelector.java:268)
> at org.apache.hadoop.hdds.scm.container.ContainerStateManager.allocateContainer(ContainerStateManager.java:270)
> at org.apache.hadoop.hdds.scm.container.SCMContainerManager.allocateContainer(SCMContainerManager.java:312)
> at org.apache.hadoop.hdds.scm.block.BlockManagerImpl.preAllocateContainers(BlockManagerImpl.java:165)
> at org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:279)
> at org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:143)
> at org.apache.hadoop.ozone.protocolPB.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:74)
> at org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:6255)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> 2018-10-15 10:16:54,303 ERROR org.apache.hadoop.hdds.scm.block.BlockManagerImpl: Unable to allocate a block for the size: 268435456, type: RATIS, factor: THREE
> {code}
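Note on Issue 1: the SCM log above says only that a block of type RATIS, factor THREE could not be allocated. As a minimal sketch of the more specific message the report asks for, the snippet below compares the number of healthy datanodes with the replication factor. The class, enum, and method names are hypothetical and are not part of the Ozone or SCM code; the three-node cluster with one dead node is an assumption taken from the description.

```java
import java.util.Arrays;
import java.util.List;

public class ReplicationCheck {
    // Hypothetical node states, for illustration only (not the SCM's own types).
    enum NodeState { HEALTHY, STALE, DEAD }

    /**
     * Returns a descriptive message when there are fewer healthy datanodes
     * than the replication factor requires, or null when allocation could proceed.
     */
    static String explainAllocationFailure(List<NodeState> nodes, int replicationFactor) {
        long healthy = nodes.stream().filter(s -> s == NodeState.HEALTHY).count();
        if (healthy >= replicationFactor) {
            return null;
        }
        return String.format(
            "Cannot allocate block: replication factor %d needs %d healthy datanodes, found %d of %d",
            replicationFactor, replicationFactor, healthy, nodes.size());
    }

    public static void main(String[] args) {
        // Assumed scenario: three datanodes, one of them DEAD, factor THREE.
        List<NodeState> cluster =
            Arrays.asList(NodeState.HEALTHY, NodeState.HEALTHY, NodeState.DEAD);
        System.out.println(explainAllocationFailure(cluster, 3));
    }
}
```

With the assumed cluster, the message names both the required and the available node counts, which is the kind of detail missing from the generic INTERNAL_ERROR returned to the client.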
> DataNode error logs :
> {code:java}
> 2018-10-15 10:33:13,522 INFO org.apache.ratis.server.impl.LeaderElection: 0e4e7c9b-84a9-48a3-b44d-d906231e77b2 got exception when requesting votes: {}
> java.util.concurrent.ExecutionException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL: 3cf6e2da-4fdb-4198-a24d-5c34ca02fe4d: group-4D3A6FFFBFE2 not found.
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at org.apache.ratis.server.impl.LeaderElection.waitForResults(LeaderElection.java:214)
> at org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:146)
> at org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:102)
> Caused by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL: 3cf6e2da-4fdb-4198-a24d-5c34ca02fe4d: group-4D3A6FFFBFE2 not found.
> at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:222)
> at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:203)
> at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:132)
> at org.apache.ratis.proto.grpc.RaftServerProtocolServiceGrpc$RaftServerProtocolServiceBlockingStub.requestVote(RaftServerProtocolServiceGrpc.java:265)
> at org.apache.ratis.grpc.server.GrpcServerProtocolClient.requestVote(GrpcServerProtocolClient.java:61)
> at org.apache.ratis.grpc.server.GrpcService.requestVote(GrpcService.java:150)
> at org.apache.ratis.server.impl.LeaderElection.lambda$submitRequests$0(LeaderElection.java:188)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2018-10-15 10:33:13,523 INFO org.apache.ratis.server.impl.LeaderElection: 0e4e7c9b-84a9-48a3-b44d-d906231e77b2: Election REJECTED; received 0 response(s) [] and 2 exception(s); 0e4e7c9b-84a9-48a3-b44d-d906231e77b2:t140, leader=null, voted=0e4e7c9b-84a9-48a3-b44d-d906231e77b2, raftlog=[(t:1, i:1)], conf=0: [76b2ad5f-1a40-4a28-9fc1-b91437fe1398:172.22.119.190:9858, 0e4e7c9b-84a9-48a3-b44d-d906231e77b2:172.22.119.189:9858, 3cf6e2da-4fdb-4198-a24d-5c34ca02fe4d:172.22.119.19:9858], old=null
> {code}
>  


