Posted to issues@ozone.apache.org by "Sammi Chen (Jira)" <ji...@apache.org> on 2020/07/13 06:54:00 UTC

[jira] [Resolved] (HDDS-3920) Too many redundant replications due to failure to get node's ancestor in ReplicationManager

     [ https://issues.apache.org/jira/browse/HDDS-3920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sammi Chen resolved HDDS-3920.
------------------------------
    Resolution: Fixed

> Too many redundant replications due to failure to get node's ancestor in ReplicationManager
> ---------------------------------------------------------------------------------------
>
>                 Key: HDDS-3920
>                 URL: https://issues.apache.org/jira/browse/HDDS-3920
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>            Reporter: Sammi Chen
>            Assignee: Sammi Chen
>            Priority: Blocker
>              Labels: pull-request-available
>         Attachments: over-replicated-container-list.txt
>
>
> In our production cluster, we turned on the network topology configuration. During the under-replication and over-replication checks, ReplicationManager fails to get a node's ancestor because the datanode object it uses does not have its parent correctly set. As a result, ReplicationManager concludes that the container's replicas do not meet the requirement of spanning more than one rack, treats the container as under-replicated even though it already has many replicas, and sends commands to datanodes to replicate the container again and again.
> 2020-07-03 16:26:45,200 [ReplicationMonitor] INFO org.apache.hadoop.hdds.scm.container.ReplicationManager: Container #105228 is over replicated. Expected replica count is 3, but found 31.
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG org.apache.hadoop.hdds.scm.container.ReplicationManager: Handling underreplicated container: 210413
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG org.apache.hadoop.hdds.scm.container.ReplicationManager: deletionInFlight of container {}#210413
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG org.apache.hadoop.hdds.scm.container.ReplicationManager: replicationInFlight of container {}#210413
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.180.20.43
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG org.apache.hadoop.hdds.scm.container.ReplicationManager: source of container {}#210413
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.180.5.41
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.179.142.251
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.180.8.85
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.179.142.250
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.180.8.35
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.180.8.67
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.179.142.135
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.179.144.104
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.180.20.58
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.179.142.198
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.180.20.222
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN org.apache.hadoop.hdds.scm.container.ReplicationManager: Process container #210413 error:
> java.lang.IllegalArgumentException
>         at com.google.common.base.Preconditions.checkArgument(Preconditions.java:128)
>         at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodes(SCMContainerPlacementRackAware.java:101)
>         at org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:568)
>         at org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:331)
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN org.apache.hadoop.hdds.scm.net.NetUtils: Fail to get ancestor generation 1 of node :f8d9ccf6-20c6-4dfa-8a49-012f43a1b27e{ip: 9.179.142.251, host: host251, networkLocation: /rack3, certSerialId: null}
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN org.apache.hadoop.hdds.scm.net.NetUtils: Fail to get ancestor generation 1 of node :826dda09-1259-4c5c-9a80-56b985665dc4{ip: 9.180.6.157, host: host-9-180-6-157, networkLocation: /rack10, certSerialId: null}
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN org.apache.hadoop.hdds.scm.net.NetUtils: Fail to get ancestor generation 1 of node :b85962f2-6647-463b-9944-3c9b24e4e313{ip: 9.180.19.148, host: host-9-180-19-148, networkLocation: /rack3, certSerialId: null}
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN org.apache.hadoop.hdds.scm.net.NetUtils: Fail to get ancestor generation 1 of node :039cb21e-4e2e-47e2-bf3e-b025319ee856{ip: 9.179.142.158, host: host158, networkLocation: /rack1, certSerialId: null}
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN org.apache.hadoop.hdds.scm.net.InnerNodeImpl: Ancestor not found, node: /rack1/33b49c34-caa2-4b4f-894e-dce7db4f97b9, generation to exclude: 1, generation to return: 1
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN org.apache.hadoop.hdds.scm.net.InnerNodeImpl: Ancestor not found, node: /rack3/b1e555d4-7114-4b80-b425-93086b0f2036, generation to exclude: 1, generation to return: 1
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN org.apache.hadoop.hdds.scm.net.InnerNodeImpl: Ancestor not found, node: /rack1/55148789-0cdb-4631-a3b3-c1da774523aa, generation to exclude: 1, generation to return: 1
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN org.apache.hadoop.hdds.scm.net.InnerNodeImpl: Ancestor not found, node: /rack3/32e8d855-b702-438d-b829-ac43dc567afc, generation to exclude: 1, generation to return: 1
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN org.apache.hadoop.hdds.scm.net.InnerNodeImpl: Ancestor not found, node: /rack2/2e1b2fdd-f8fb-4252-bfc1-31d5339681be, generation to exclude: 1, generation to return: 1
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN org.apache.hadoop.hdds.scm.net.InnerNodeImpl: Ancestor not found, node: /rack3/db854037-4846-4093-89de-e492e0f14239, generation to exclude: 1, generation to return: 1
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN org.apache.hadoop.hdds.scm.net.InnerNodeImpl: Ancestor not found, node: /rack3/f8d9ccf6-20c6-4dfa-8a49-012f43a1b27e, generation to exclude: 1, generation to return: 1
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN org.apache.hadoop.hdds.scm.net.InnerNodeImpl: Ancestor not found, node: /rack10/826dda09-1259-4c5c-9a80-56b985665dc4, generation to exclude: 1, generation to return: 1
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN org.apache.hadoop.hdds.scm.net.InnerNodeImpl: Ancestor not found, node: /rack3/b85962f2-6647-463b-9944-3c9b24e4e313, generation to exclude: 1, generation to return: 1
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN org.apache.hadoop.hdds.scm.net.InnerNodeImpl: Ancestor not found, node: /rack1/039cb21e-4e2e-47e2-bf3e-b025319ee856, generation to exclude: 1, generation to return: 1
> 2020-07-03 10:48:00,161 [ReplicationMonitor] INFO org.apache.hadoop.hdds.scm.container.ReplicationManager: Container: #210419. The container is mis-replicated as it is on 1 racks but should be on 2 racks.
> 2020-07-03 10:48:00,161 [ReplicationMonitor] INFO org.apache.hadoop.hdds.scm.container.ReplicationManager: Sending replicate container command for container #210419 to datanode 5cb315e9-7326-4592-8dd6-21f4342b09c1{ip: 9.180.8.85, host: host-9-180-8-85, networkLocation: /rack10, certSerialId: null}
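
The failure mode described above can be illustrated with a minimal sketch. It is not the Ozone code: SimpleNode, countRacks, and the node names are hypothetical stand-ins, and the sketch assumes the rack check simply collects each replica's generation-1 ancestor into a set. Under that assumption, datanode objects whose parent was never set all map to the same "missing" ancestor, so replicas that really sit on several racks are counted as being on one rack.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RackCountSketch {

    /** Hypothetical stand-in for a topology node; not the Ozone Node interface. */
    static final class SimpleNode {
        final String name;
        // null when this object was never attached to the topology tree
        final SimpleNode parent;

        SimpleNode(String name, SimpleNode parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    /**
     * Counts distinct racks by taking each replica's parent (generation 1).
     * Assumption: a missing parent is not filtered out, so every detached
     * replica contributes the same null entry to the set.
     */
    static int countRacks(List<SimpleNode> replicas) {
        Set<SimpleNode> racks = new HashSet<>();
        for (SimpleNode replica : replicas) {
            racks.add(replica.parent);
        }
        return racks.size();
    }

    public static void main(String[] args) {
        SimpleNode rack1 = new SimpleNode("/rack1", null);
        SimpleNode rack3 = new SimpleNode("/rack3", null);

        // Replicas attached to the topology: two distinct racks are seen.
        List<SimpleNode> attached = Arrays.asList(
            new SimpleNode("dn-a", rack1),
            new SimpleNode("dn-b", rack3),
            new SimpleNode("dn-c", rack3));

        // The same replicas as detached objects: the parent is missing.
        List<SimpleNode> detached = Arrays.asList(
            new SimpleNode("dn-a", null),
            new SimpleNode("dn-b", null),
            new SimpleNode("dn-c", null));

        System.out.println("attached racks: " + countRacks(attached));
        System.out.println("detached racks: " + countRacks(detached));
    }
}
```

With the detached objects the count collapses to one rack, which is consistent with the "is on 1 racks but should be on 2 racks" message in the log, so a check of this shape would keep scheduling replications no matter how many replicas already exist.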



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
