Posted to issues@ozone.apache.org by "Krishna Kumar Asawa (Jira)" <ji...@apache.org> on 2023/08/25 07:37:00 UTC

[jira] [Assigned] (HDDS-9184) Re-replication of UNHEALTHY replica takes a long time in Ratis rack-scatter environment

     [ https://issues.apache.org/jira/browse/HDDS-9184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Krishna Kumar Asawa reassigned HDDS-9184:
-----------------------------------------

    Assignee: Siddhant Sangwan

> Re-replication of UNHEALTHY replica takes a long time in Ratis rack-scatter environment
> ---------------------------------------------------------------------------------------
>
>                 Key: HDDS-9184
>                 URL: https://issues.apache.org/jira/browse/HDDS-9184
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>            Reporter: Jyotirmoy Sinha
>            Assignee: Siddhant Sangwan
>            Priority: Major
>
> Steps:
>  # Create a volume/bucket/key (Ratis) in a rack-scatter environment
>  # Restart one datanode hosting a replica of the above key's container
>  # Simulate an UNHEALTHY replica on the restarted datanode
> Observation: re-replication and removal of the UNHEALTHY replica take about 1 hour.
> dn-container.log -
> {code:java}
> 2023-08-10 09:23:48,422 | INFO  | ID=11011 | Index=0 | BCSID=132 | State=CLOSED |  
> 2023-08-10 09:28:14,105 | ERROR | ID=11010 | Index=0 | BCSID=129 | State=UNHEALTHY | INCONSISTENT_CHUNK_LENGTH for file /hadoop-ozone/datanode/data632176/hdds/CID-d7f3b02c-172b-41ad-8944-c3ff90bb7382/current/containerDir21/11010/chunks/111677748019215006.block. Message: Inconsistent read for chunk=111677748019215006_chunk_1 expected length=4194304 actual length=0 for block conID: 11010 locID: 111677748019215006 bcsId: 129 |  
> 2023-08-10 09:29:11,631 | INFO  | ID=12001 | Index=0 | BCSID=140 | State=CLOSED |  
> 2023-08-10 10:21:11,361 | INFO  | ID=10003 | Index=0 | BCSID=227 | State=CLOSED |  
> 2023-08-10 10:31:09,934 | INFO  | ID=11010 | Index=0 | BCSID=129 | State=DELETED | Container force deleted |  
> 2023-08-10 10:36:15,827 | INFO  | ID=11010 | Index=0 | BCSID=129 | State=CLOSED |  {code}
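The INCONSISTENT_CHUNK_LENGTH error above reduces to a simple on-disk size check: the chunk file's actual length (0 bytes) does not match the expected length recorded for the chunk (4194304 bytes), so the container is marked UNHEALTHY. A minimal sketch of that check (illustrative only, not Ozone's actual chunk-verification code; the class and method names here are hypothetical):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch of the consistency check behind the
// INCONSISTENT_CHUNK_LENGTH error in the log above: a chunk whose on-disk
// size differs from the expected length is flagged, and the datanode marks
// the container UNHEALTHY.
public class ChunkLengthCheck {
    // Returns true only when the chunk file's size matches the expected length.
    static boolean isConsistent(Path chunkFile, long expectedLength) throws IOException {
        long actual = Files.size(chunkFile);
        return actual == expectedLength;
    }

    public static void main(String[] args) throws IOException {
        // Simulate the failure in the log: expected 4194304 bytes, actual 0
        // (a freshly created temp file is empty, like the truncated chunk).
        Path chunk = Files.createTempFile("chunk", ".block");
        System.out.println(isConsistent(chunk, 4194304L)); // false -> UNHEALTHY
        Files.delete(chunk);
    }
}
```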
> ozone-scm.log -
> {code:java}
> 2023-08-10 09:29:09,572 INFO [Under Replicated Processor]-org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter: No satisfied datanode to meet the constraints. Metadatadata size required: 0 Data size required: 5368709120, scope /rack_h1wbu, excluded nodes [c396dcdf-3253-4827-995e-e339ec9527e1(quasar-ssbwak-7.quasar-ssbwak.root.hwx.site/172.27.53.195), 00ecbf81-ef0b-4827-b6aa-e818a42ac522(quasar-ssbwak-5.quasar-ssbwak.root.hwx.site/172.27.105.129), 616e9d5b-537a-42cf-83f1-8eb267cf119d(quasar-ssbwak-2.quasar-ssbwak.root.hwx.site/172.27.85.1)]
> 2023-08-10 09:29:09,572 WARN [Under Replicated Processor]-org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter: Placement policy could not choose the enough nodes from available racks. Chosen nodes size from Unique Racks: 0, but required nodes to choose from Unique Racks: 1 do not match. Available racks count: 1, Excluded nodes count: 1, UsedNodes count: 2
> 2023-08-10 09:29:09,572 ERROR [Under Replicated Processor]-org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor: Error processing Health result of class: class org.apache.hadoop.hdds.scm.container.replication.ContainerHealthResult$UnderReplicatedHealthResult for container ContainerInfo{id=#11010, state=CLOSED, stateEnterTime=2023-08-10T09:12:16.522733Z, pipelineID=PipelineID=e58f6cad-1824-4701-a26a-1d400e9be503, owner=om124}
> org.apache.hadoop.hdds.scm.exceptions.SCMException: Placement Policy: class org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter did not return any nodes. Number of required Nodes 1, Datasize Required: 5368709120
> 	at org.apache.hadoop.hdds.scm.container.replication.ReplicationManagerUtil.getTargetDatanodes(ReplicationManagerUtil.java:93)
> 	at org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.getTargets(RatisUnderReplicationHandler.java:284)
> 	at org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.processAndSendCommands(RatisUnderReplicationHandler.java:118)
> 	at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:773)
> 	at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:60)
> 	at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:29)
> 	at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processContainer(UnhealthyReplicationProcessor.java:156)
> 	at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processAll(UnhealthyReplicationProcessor.java:116)
> 	at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.run(UnhealthyReplicationProcessor.java:165)
> 	at java.base/java.lang.Thread.run(Thread.java:834)
> 2023-08-10 09:29:09,573 INFO [Under Replicated Processor]-org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter: Chosen nodes: [616e9d5b-537a-42cf-83f1-8eb267cf119d(quasar-ssbwak-2.quasar-ssbwak.root.hwx.site/172.27.85.1)]. isPolicySatisfied: true. {code}
> In the SCM logs, re-replication of the UNHEALTHY replica keeps failing with the above error. It resolved only after about 1 hour -
> {code:java}
> 2023-08-10 10:28:07,158 INFO [Under Replicated Processor]-org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter: No satisfied datanode to meet the constraints. Metadatadata size required: 0 Data size required: 5368709120, scope /rack_6us5i, excluded nodes [c396dcdf-3253-4827-995e-e339ec9527e1(quasar-ssbwak-7.quasar-ssbwak.root.hwx.site/172.27.53.195), 00ecbf81-ef0b-4827-b6aa-e818a42ac522(quasar-ssbwak-5.quasar-ssbwak.root.hwx.site/172.27.105.129), 616e9d5b-537a-42cf-83f1-8eb267cf119d(quasar-ssbwak-2.quasar-ssbwak.root.hwx.site/172.27.85.1)]
> 2023-08-10 10:28:07,158 WARN [Under Replicated Processor]-org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter: Placement policy could not choose the enough nodes from available racks. Chosen nodes size from Unique Racks: 0, but required nodes to choose from Unique Racks: 1 do not match. Available racks count: 1, Excluded nodes count: 1, UsedNodes count: 2
> 2023-08-10 10:28:07,158 ERROR [Under Replicated Processor]-org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor: Error processing Health result of class: class org.apache.hadoop.hdds.scm.container.replication.ContainerHealthResult$UnderReplicatedHealthResult for container ContainerInfo{id=#11010, state=CLOSED, stateEnterTime=2023-08-10T09:12:16.522Z, pipelineID=PipelineID=e58f6cad-1824-4701-a26a-1d400e9be503, owner=om124}
> org.apache.hadoop.hdds.scm.exceptions.SCMException: Placement Policy: class org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter did not return any nodes. Number of required Nodes 1, Datasize Required: 5368709120
> 	at org.apache.hadoop.hdds.scm.container.replication.ReplicationManagerUtil.getTargetDatanodes(ReplicationManagerUtil.java:93)
> 	at org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.getTargets(RatisUnderReplicationHandler.java:284)
> 	at org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.processAndSendCommands(RatisUnderReplicationHandler.java:118)
> 	at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:773)
> 	at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:60)
> 	at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:29)
> 	at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processContainer(UnhealthyReplicationProcessor.java:156)
> 	at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processAll(UnhealthyReplicationProcessor.java:116)
> 	at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.run(UnhealthyReplicationProcessor.java:165)
> 	at java.base/java.lang.Thread.run(Thread.java:834)
> 2023-08-10 10:28:07,159 INFO [Under Replicated Processor]-org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor: Processed 0 containers with health state counts {}, failed processing 1, deferred due to load 0
> 2023-08-10 10:28:11,717 INFO [Socket Reader #1 for port 9860]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for recon/quasar-ssbwak-7.quasar-ssbwak.root.hwx.site@ROOT.HWX.SITE (auth:KERBEROS)
> 2023-08-10 10:28:11,723 INFO [Socket Reader #1 for port 9860]-SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for recon/quasar-ssbwak-7.quasar-ssbwak.root.hwx.site@ROOT.HWX.SITE (auth:KERBEROS) for protocol=interface org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocol
> 2023-08-10 10:28:12,932 INFO [EventQueue-DeadNodeForDeadNodeHandler]-org.apache.hadoop.hdds.scm.node.DeadNodeHandler: A dead datanode is detected. 616e9d5b-537a-42cf-83f1-8eb267cf119d(quasar-ssbwak-2.quasar-ssbwak.root.hwx.site/172.27.85.1)
> 2023-08-10 10:28:12,934 INFO [EventQueue-DeadNodeForDeadNodeHandler]-org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Send pipeline:PipelineID=942c5e9b-c5d6-44e6-ab4e-2b4c4a12b344 close command to datanode 616e9d5b-537a-42cf-83f1-8eb267cf119d
> 2023-08-10 10:28:12,944 INFO [EventQueue-DeadNodeForDeadNodeHandler]-org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl: Pipeline Pipeline[ Id: 942c5e9b-c5d6-44e6-ab4e-2b4c4a12b344, Nodes: 616e9d5b-537a-42cf-83f1-8eb267cf119d(quasar-ssbwak-2.quasar-ssbwak.root.hwx.site/172.27.85.1), ReplicationConfig: RATIS/ONE, State:CLOSED, leaderId:616e9d5b-537a-42cf-83f1-8eb267cf119d, CreationTimestamp2023-08-10T10:10:36.905504Z[UTC]] removed.
> 2023-08-10 10:28:12,951 INFO [EventQueue-DeadNodeForDeadNodeHandler]-org.apache.hadoop.hdds.scm.node.DeadNodeHandler: Clearing command queue of size 1 for DN 616e9d5b-537a-42cf-83f1-8eb267cf119d(quasar-ssbwak-2.quasar-ssbwak.root.hwx.site/172.27.85.1)
> 2023-08-10 10:28:12,951 INFO [EventQueue-DeadNodeForDeadNodeHandler]-org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl: Removed a node: /rack_6us5i/616e9d5b-537a-42cf-83f1-8eb267cf119d
> 2023-08-10 10:28:16,637 WARN [IPC Server handler 9 on 9863]-org.apache.hadoop.hdds.scm.node.SCMNodeManager: Cannot find node for address 172.27.76.5
> 2023-08-10 10:28:16,817 INFO [Socket Reader #1 for port 9860]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for om/quasar-ssbwak-3.quasar-ssbwak.root.hwx.site@ROOT.HWX.SITE (auth:KERBEROS)
> 2023-08-10 10:28:16,822 INFO [Socket Reader #1 for port 9860]-SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for om/quasar-ssbwak-3.quasar-ssbwak.root.hwx.site@ROOT.HWX.SITE (auth:KERBEROS) for protocol=interface org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocol
> 2023-08-10 10:28:16,845 WARN [IPC Server handler 11 on 9863]-org.apache.hadoop.hdds.scm.node.SCMNodeManager: Cannot find node for address 172.27.76.5
> 2023-08-10 10:28:37,160 INFO [Under Replicated Processor]-org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter: Chosen nodes: [d6b31a38-2e8e-4ba9-aed9-cfc5b41456aa(quasar-ssbwak-3.quasar-ssbwak.root.hwx.site/172.27.131.7)]. isPolicySatisfied: true.
> 2023-08-10 10:28:37,160 INFO [Under Replicated Processor]-org.apache.hadoop.hdds.scm.container.replication.ReplicationManager: Sending command [replicateContainerCommand: containerId=11010, replicaIndex=0, targetNode=d6b31a38-2e8e-4ba9-aed9-cfc5b41456aa(quasar-ssbwak-3.quasar-ssbwak.root.hwx.site/172.27.131.7), priority=NORMAL] for container ContainerInfo{id=#11010, state=CLOSED, stateEnterTime=2023-08-10T09:12:16.522Z, pipelineID=PipelineID=e58f6cad-1824-4701-a26a-1d400e9be503, owner=om124} to c396dcdf-3253-4827-995e-e339ec9527e1(quasar-ssbwak-7.quasar-ssbwak.root.hwx.site/172.27.53.195) with datanode deadline 1691663887160 and scm deadline 1691663917160
> 2023-08-10 10:28:37,160 INFO [Under Replicated Processor]-org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor: Processed 1 containers with health state counts {UNDER_REPLICATED=1}, failed processing 0, deferred due to load 0 {code}
>  
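The placement failure in the SCM logs can be modeled with a small sketch. All replica holders (healthy and UNHEALTHY) are passed as excluded nodes, and rack-scatter also rules out racks already used, so with every remaining node excluded there is no valid target and an SCMException is thrown; placement only succeeds once the dead node leaves the topology. This is a deliberately simplified illustration; the names and data structures below are hypothetical and do not reflect Ozone's actual SCMContainerPlacementRackScatter implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model of the rack-scatter target selection seen in the logs:
// a node qualifies only if it neither holds a replica (excluded) nor sits
// on a rack that already hosts one (usedRacks).
public class RackScatterSketch {
    static List<String> chooseTargets(Map<String, String> nodeToRack,
                                      Set<String> excluded,
                                      Set<String> usedRacks,
                                      int required) {
        List<String> chosen = new ArrayList<>();
        for (Map.Entry<String, String> e : nodeToRack.entrySet()) {
            if (chosen.size() == required) break;
            if (excluded.contains(e.getKey())) continue;    // already holds a replica
            if (usedRacks.contains(e.getValue())) continue; // rack already used
            chosen.add(e.getKey());
        }
        return chosen; // empty result -> SCMException in the real code path
    }

    public static void main(String[] args) {
        // Three nodes, three racks, and all three nodes hold replicas
        // (one UNHEALTHY): every node is excluded, so no target exists.
        Map<String, String> topology = new LinkedHashMap<>();
        topology.put("dn1", "/rack_a");
        topology.put("dn2", "/rack_b");
        topology.put("dn3", "/rack_c");
        Set<String> excluded = new HashSet<>(Arrays.asList("dn1", "dn2", "dn3"));
        System.out.println(chooseTargets(topology, excluded, new HashSet<>(), 1)); // []

        // Once the dead node dn3 leaves the topology and a fresh node dn4
        // appears on its rack, a target is found, mirroring the recovery at
        // 10:28:37 in the SCM log after DeadNodeHandler removed the node.
        topology.remove("dn3");
        topology.put("dn4", "/rack_c");
        excluded.remove("dn3");
        System.out.println(chooseTargets(topology, excluded, new HashSet<>(), 1)); // [dn4]
    }
}
```

The sketch suggests why the issue self-heals only after an hour: nothing in the placement loop itself changes, the input topology does, once the stale node is finally declared dead and removed.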



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
