Posted to yarn-issues@hadoop.apache.org by "Wilfred Spiegelenburg (JIRA)" <ji...@apache.org> on 2019/03/12 12:23:00 UTC

[jira] [Comment Edited] (YARN-9278) Shuffle nodes when selecting to be preempted nodes

    [ https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780101#comment-16780101 ] 

Wilfred Spiegelenburg edited comment on YARN-9278 at 3/12/19 12:22 PM:
-----------------------------------------------------------------------

Two things:
* I still think limiting the number of nodes is something we need to approach with care.
* Randomising a 10,000-entry list each time we pre-empt will also become expensive (see the sketch below).
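
To put a rough number on the shuffle cost, here is a standalone toy comparison. Plain Integers stand in for the scheduler nodes and the 40-node batch size is just a guessed rack size, so this only sketches the O(n) vs O(1) shape and is not FairScheduler code:
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ShuffleVsCutIn {
  public static void main(String[] args) {
    int size = 10_000;
    int batchSize = 40; // hypothetical rack size
    List<Integer> nodes = new ArrayList<>(size);
    for (int i = 0; i < size; i++) {
      nodes.add(i);
    }

    // Shuffling touches all 10,000 entries on every preemption round: O(n).
    long start = System.nanoTime();
    Collections.shuffle(nodes);
    System.out.println("shuffle took " + (System.nanoTime() - start) + " ns");

    // Picking a random batch-aligned cut-in point is a single call: O(1).
    start = System.nanoTime();
    int cutIn = new Random().nextInt(size / batchSize) * batchSize;
    System.out.println("cut-in at " + cutIn + " took " + (System.nanoTime() - start) + " ns");
  }
}
{code}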
 
I was thinking more of something like this:
{code:java}
  int preEmptionBatchSize = conf.getPreEmptionBatchSize();
  List<FSSchedulerNode> potentialNodes = scheduler.getNodeTracker().getNodesByResourceName(rr.getResourceName());
  int size = potentialNodes.size();
  int stop = 0;
  int current = 0;
  // find a start point somewhere in the list if it is long
  if (size > preEmptionBatchSize) {
    Random rand = new Random();
    current = rand.nextInt(size / preEmptionBatchSize) * preEmptionBatchSize;
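    // remember the cut-in point: the do-while below stops when it wraps back here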
    stop = current;
  }
  do {
    FSSchedulerNode mine = potentialNodes.get(current);
    // Identify the containers
    ....
    current++;
    // wrap around at the end of the list
    if (current >= size) {
      current = 0;
    }
  } while (current != stop);
{code}

Pre-emption runs in a loop, and we could be considering different applications one after the other. Continually shuffling that node list is not good from a performance perspective. A simple random cut-in like the one above gives the same kind of behaviour.
We could then still limit the number of "batches" we process. With some more smarts, the stop condition could be based on having processed, for example, 10 * the batch size in nodes (a batch could be deemed equivalent to the number of nodes in a rack):
{code:java}
  stop = ((10 * preEmptionBatchSize) > size) ? current : ((10 * preEmptionBatchSize) + current) % size;
{code}

That gives a lot of flexibility and still decent performance in a large cluster.
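
For completeness, here is what the whole loop could look like with the batch-limited stop condition folded in. This is only a sketch of the iteration order: the node type is a generic placeholder and the container-selection step is stubbed out, so none of the names here are actual FairScheduler code:
{code:java}
import java.util.List;
import java.util.Random;

public class PreemptionCutInSketch {
  /**
   * Visits at most maxBatches * batchSize nodes, starting at a random
   * batch-aligned offset and wrapping around the end of the list.
   */
  static <N> void visitNodes(List<N> potentialNodes, int batchSize, int maxBatches) {
    int size = potentialNodes.size();
    if (size == 0) {
      return;
    }
    int current = 0;
    if (size > batchSize) {
      // random batch-aligned cut-in point
      current = new Random().nextInt(size / batchSize) * batchSize;
    }
    int limit = maxBatches * batchSize;
    // stop where we started if the limit covers the whole list,
    // otherwise after 'limit' nodes (modulo the wrap-around)
    int stop = (limit > size) ? current : (limit + current) % size;
    do {
      N node = potentialNodes.get(current);
      // identify the containers on 'node' here
      current++;
      // wrap around at the end of the list
      if (current >= size) {
        current = 0;
      }
    } while (current != stop);
  }
}
{code}
With a batch size of 40 (roughly a rack) and 10 batches, a round over a 10,000-node cluster touches 400 nodes instead of all 10,000, and the random cut-in spreads those 400 across the cluster over successive rounds.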



> Shuffle nodes when selecting to be preempted nodes
> --------------------------------------------------
>
>                 Key: YARN-9278
>                 URL: https://issues.apache.org/jira/browse/YARN-9278
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: fairscheduler
>            Reporter: Zhaohui Xin
>            Assignee: Zhaohui Xin
>            Priority: Major
>         Attachments: YARN-9278.001.patch
>
>
> We should *shuffle* the nodes to avoid some nodes being preempted frequently. 
> Also, we should *limit* the number of nodes to make preemption more efficient.
> Just like this:
> {code:java}
> // we should not iterate all nodes, that will be very slow
> long maxTryNodeNum = context.getPreemptionConfig().getToBePreemptedNodeMaxNumOnce();
> if (potentialNodes.size() > maxTryNodeNum) {
>   Collections.shuffle(potentialNodes);
>   List<FSSchedulerNode> newPotentialNodes = new ArrayList<FSSchedulerNode>();
>   for (int i = 0; i < maxTryNodeNum; i++) {
>     newPotentialNodes.add(potentialNodes.get(i));
>   }
>   potentialNodes = newPotentialNodes;
> }
> {code}
>  


