You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2021/10/28 05:26:19 UTC

[GitHub] [ozone] errose28 commented on pull request #2757: HDDS-5820. Intermittent failure in TestPipelinePlacementPolicy#testPickLowestLoadAnchor

errose28 commented on pull request #2757:
URL: https://github.com/apache/ozone/pull/2757#issuecomment-953513749


   Thanks for the write up @sky76093016 and great job identifying the cause. This is a tricky bug in some very confusing code. Your analysis was very helpful in building my understanding of what is happening here. Let me know what you think of this summary:
   
   The code referenced here is all within `PipelinePlacementPolicy`. The test in question calls `chooseDatanodes`, which calls `filterViableNodes`. `filterViableNodes` returns the nodes sorted based on pipeline load. Given this sorted list, `filterViableNodes` then eventually calls `chooseNode` whether or not topology information is present (PipelinePlacementPolicy lines 240-249). The comments on these lines indicate two cases:
   
   **case 1**: If there is no topology information, choose a random node.
   **case 2**: If we have topology information, choose the anchor node as the one with lowest pipeline load (this would be the first node in the list returned by `filterViableNodes`), and choose the other two around it based on topology. 
   
   Pre HDDS-4710, `chooseNode` was always returning the first element in the list it was passed. This failed **case 1**, because it would always choose the same node in the absence of topology information. However, it passed **case 2**, because the list passed to `chooseNode` was already sorted by pipeline usage, so returning the first element gave correct behavior.
   
   HDDS-4710 intended to fix **case 1** and leave **case 2** functioning correctly. In the PR description, it says
   > The other usages of chooseNode in class PipelinePlacementPolicy won't affect the topology & rack-awareness based node-choose policy
   
   Unfortunately the PR did not actually do this, because the same `chooseNode` method is eventually called for both cases, so it fixed **case 1** and broke **case 2** by choosing a random node when it was supposed to choose the node with lowest load.
   
   This test calculates the maximum pipelines that should be able to be created given the nodes and pipeline limit per node. Most of the time, the random selection actually picks optimal pipeline placements so the test passes. However, sometimes the random selection makes a suboptimal choice as you showed in your write up. This is why we see the intermittent failure. If we were choosing the anchor based on pipeline load, we would be able to place all the pipelines every time.
   
   The current fix in the PR works by reverting **case 2** to its original function. Since **case 2** is the only one used in this test, it now passes like before. **case 1** might be satisfied by this fix, because when the nodes are equal, passing the results of the shuffle into the sort on the following lines *might* result in a different order every time, depending on the sort implementation. I think a better fix is to just do what the comments in `PipelinePlacementPolicy` line 240 say. We can leave `chooseNode` unaltered so when `SCMCommonPlacementPolicy#getResultSet` calls it, it will return a random node as expected. In the overridden `PipelinePlacementPolicy#getResultSet`, we cannot call `chooseNode`. Instead, we can make a new method that does almost the same thing, but returns the first node in the list instead of a random one. Now **case 1** and **case 2** should be satisfied. Let me know what you think of this solution.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org