You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-issues@hadoop.apache.org by "Wei-Chiu Chuang (JIRA)" <ji...@apache.org> on 2018/08/09 16:34:00 UTC

[jira] [Commented] (HADOOP-15317) Improve NetworkTopology chooseRandom's loop

    [ https://issues.apache.org/jira/browse/HADOOP-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16575101#comment-16575101 ] 

Wei-Chiu Chuang commented on HADOOP-15317:
------------------------------------------

Shouldn't this be a HDFS jira? :)

> Improve NetworkTopology chooseRandom's loop
> -------------------------------------------
>
>                 Key: HADOOP-15317
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15317
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Xiao Chen
>            Assignee: Xiao Chen
>            Priority: Major
>             Fix For: 2.10.0, 2.8.4, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
>         Attachments: HADOOP-15317.01.patch, HADOOP-15317.02.patch, HADOOP-15317.03.patch, HADOOP-15317.04.patch, HADOOP-15317.05.patch, HADOOP-15317.06.patch, Screen Shot 2018-03-28 at 7.23.32 PM.png
>
>
> Recently we found a postmortem case where the ANN seems to be in an infinite loop. From the logs it seems it just went through a rolling restart, and DNs are getting registered.
> Later the NN become unresponsive, and from the stacktrace it's inside a do-while loop inside {{NetworkTopology#chooseRandom}} - part of what's done in HDFS-10320.
> Going through the code and logs I'm not able to come up with any theory (thought about incorrect locking, or the Node object being modified outside of NetworkTopology, both seem impossible) why this is happening, but we should eliminate this loop.
> stacktrace:
> {noformat}
>  Stack:
> java.util.HashMap.hash(HashMap.java:338)
> java.util.HashMap.containsKey(HashMap.java:595)
> java.util.HashSet.contains(HashSet.java:203)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org