You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by "Attila Doroszlai (Jira)" <ji...@apache.org> on 2024/01/05 08:48:00 UTC

[jira] [Updated] (HDDS-8982) Prevent infinite loop in getContainer which causes log flooded by WritableRatisContainerProvider if pipeline's nodes are not found

     [ https://issues.apache.org/jira/browse/HDDS-8982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Attila Doroszlai updated HDDS-8982:
-----------------------------------
    Fix Version/s: 1.5.0
                       (was: 1.4.0)

> Prevent infinite loop in getContainer which causes log flooded by WritableRatisContainerProvider if pipeline's nodes are not found
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDDS-8982
>                 URL: https://issues.apache.org/jira/browse/HDDS-8982
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>            Reporter: Attila Doroszlai
>            Assignee: Attila Doroszlai
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.5.0
>
>
> HDDS-8981 revealed an infinite loop.
> Steps (based on broken test case):
> # Stop all datanodes
> # Restart SCM
> # Force SCM exit safe mode
> # Allocate block
> WritableRatisContainerProvider is stuck in an infinite loop:
> {code}
> [IPC Server handler 0 on default port 15001] WARN  node.SCMNodeManager (SCMNodeManager.java:minHealthyVolumeNum(1114)) - Cannot generate NodeStat, datanode b1b8f5aa-dc08-49ca-95ee-cd36b9a3d13c not found.
> [IPC Server handler 0 on default port 15001] WARN  node.SCMNodeManager (SCMNodeManager.java:minHealthyVolumeNum(1114)) - Cannot generate NodeStat, datanode d37619cb-6c2f-4ca2-895f-33a0c8ee91be not found.
> [IPC Server handler 0 on default port 15001] WARN  node.SCMNodeManager (SCMNodeManager.java:minHealthyVolumeNum(1114)) - Cannot generate NodeStat, datanode 65bcd90d-da66-4520-87f5-dc49e35f7dc8 not found.
> [IPC Server handler 0 on default port 15001] WARN  container.ContainerManagerImpl (ContainerManagerImpl.java:getMatchingContainer(342)) - Container allocation failed on pipeline=Pipeline[ Id: b7134617-f12a-460c-9ef7-36bb6f2810c6, Nodes: b1b8f5aa-dc08-49ca-95ee-cd36b9a3d13c(localhost/127.0.0.1)d37619cb-6c2f-4ca2-895f-33a0c8ee91be(localhost/127.0.0.1)65bcd90d-da66-4520-87f5-dc49e35f7dc8(localhost/127.0.0.1), ReplicationConfig: RATIS/THREE, State:OPEN, leaderId:65bcd90d-da66-4520-87f5-dc49e35f7dc8, CreationTimestamp2023-07-06T11:51:52.213+02:00[Europe/Vienna]]
> java.lang.IllegalArgumentException
> 	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:129)
> 	at org.apache.hadoop.hdds.scm.node.SCMNodeManager.minHealthyVolumeNum(SCMNodeManager.java:1118)
> 	at org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.minHealthyVolumeNum(PipelineManagerImpl.java:648)
> 	at org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.getOpenContainerCountPerPipeline(ContainerManagerImpl.java:349)
> 	at org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.getMatchingContainer(ContainerManagerImpl.java:329)
> 	at org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.selectContainer(WritableRatisContainerProvider.java:204)
> 	at org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.getContainer(WritableRatisContainerProvider.java:101)
> 	at org.apache.hadoop.hdds.scm.pipeline.WritableContainerFactory.getContainer(WritableContainerFactory.java:70)
> 	at org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:164)
> 	at org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:194)
> 	at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:192)
> 	at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.processMessage(ScmBlockLocationProtocolServerSideTranslatorPB.java:142)
> 	at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
> 	at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.send(ScmBlockLocationProtocolServerSideTranslatorPB.java:113)
> 	at org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:14238)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
> 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1094)
> 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1017)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3048)
> {code}
> Pipeline is open, datanodes are stopped, thus they do not register after SCM restart.  
> {{minHealthyVolumeNum}} throws exception, {{getMatchingContainer}} catches it and returns {{null}}.  The loop in {{WritableRatisContainerProvider}} exits only after such pipelines are closed (4,5 minutes, 57768149 iterations in this case):
> {code:title=disabled per-attempt log, log number of attempts on exit from loop}
> 2023-07-06 12:48:05,378 [IPC Server handler 3 on default port 15001] INFO  ha.SequenceIdGenerator (SequenceIdGenerator.java:getNextId(140)) - Allocate a batch for delTxnId, change lastId from 0 to 1000.
> ...
> 2023-07-06 12:52:33,880 [BackgroundPipelineScrubberThread] INFO  pipeline.PipelineManagerImpl (PipelineManagerImpl.java:closePipeline(510)) - Pipeline Pipeline[ Id: 1fcef9a6-655d-485d-ac83-20ec2bcc9225, Nodes: 9ec03363-f1aa-410a-aa90-5cac6833e36c(localhost/127.0.0.1), ReplicationConfig: RATIS/ONE, State:OPEN, leaderId:9ec03363-f1aa-410a-aa90-5cac6833e36c, CreationTimestamp2023-07-06T12:47:33.874+02:00[Europe/Vienna]] moved to CLOSED state
> 2023-07-06 12:52:33,880 [IPC Server handler 0 on default port 15001] INFO  pipeline.WritableRatisContainerProvider (WritableRatisContainerProvider.java:getContainer(158)) - Could not find available pipeline of repConfig: RATIS/THREE even after retrying
> 2023-07-06 12:52:33,880 [IPC Server handler 0 on default port 15001] ERROR pipeline.WritableRatisContainerProvider (WritableRatisContainerProvider.java:getContainer(177)) - Give up after 57768149 attempts, unable to allocate block size=4194304, RATIS/THREE
> 2023-07-06 12:52:33,880 [IPC Server handler 0 on default port 15001] ERROR block.BlockManagerImpl (BlockManagerImpl.java:allocateBlock(172)) - Unable to allocate a block for the size: 4194304, replicationConfig: RATIS/THREE
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org