Posted to yarn-issues@hadoop.apache.org by "Konstantinos Karanasos (JIRA)" <ji...@apache.org> on 2016/05/03 02:07:12 UTC
[jira] [Commented] (YARN-2888) Corrective mechanisms for rebalancing NM container queues

    [ https://issues.apache.org/jira/browse/YARN-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267779#comment-15267779 ] 

Konstantinos Karanasos commented on YARN-2888:
----------------------------------------------

Thanks for the patch, [~asuresh]. Please find some comments below.
Once these are fixed, I will take another look in case I missed something in this first pass.

In {{YarnConfiguration}}:
* I would use "max-queue-length" everywhere, rather than "queue-limit". It is more informative, and we might eventually have "max-queue-wait-time", so it will be easier to differentiate.
* MEAN_SIGMA -> MEAN_STDEV
* As above, DIST_SCHEDULING_QUEUE_LIMIT_MIN -> DIST_SCHEDULING_MIN_QUEUE_LENGTH. Similar for MAX.
* Do we want to make the min and max queue lengths specific to distributed scheduling? Maybe we can keep them general, in which case we could rename the parameters to something like nm-queuing.max-queue-length.
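
To make the renaming above concrete, here is a rough sketch of how the keys could look if we keep them general. The class name, the constant names and the key strings are all hypothetical (they are not taken from the patch); only the "max-queue-length" wording and the "nm-queuing" grouping come from the comments above.

```java
// Sketch only: every name and key string below is an assumption made for
// illustration, not the contents of YarnConfiguration.
final class NmQueuingConfigKeysSketch {
  // Assumed prefix under which the NM-side queuing parameters are grouped.
  static final String NM_QUEUING_PREFIX = "yarn.nodemanager.nm-queuing.";

  // "max-queue-length" instead of "queue-limit"; MIN follows the same pattern.
  static final String NM_QUEUING_MIN_QUEUE_LENGTH =
      NM_QUEUING_PREFIX + "min-queue-length";
  static final String NM_QUEUING_MAX_QUEUE_LENGTH =
      NM_QUEUING_PREFIX + "max-queue-length";

  // A future wait-time limit would then be easy to tell apart from the above.
  static final String NM_QUEUING_MAX_QUEUE_WAIT_TIME =
      NM_QUEUING_PREFIX + "max-queue-wait-time";

  private NmQueuingConfigKeysSketch() {
  }
}
```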

Rename ContainerQueuingLimit* to NMQueuingLimit*?

In {{Context.java}}:
* Remove line break from import.
* Why do we need to change the return type of getContainerManager() to ContainerManager (instead of ContainerManagementProtocol)? Same goes for the {{NodeManager}}.

In {{NodeStatusUpdaterImpl}}:
* There seem to be changes in the copyright, which are not needed (due to formatting).
* Remove line breaks from imports.
* There is reformatting in various places of code that you are not touching in this patch. We might want to revert those changes, because they make it hard to follow the actual changes in the patch.
* Line 863, QueuingLimits -> QueuingLimit, queueing -> queuing.

In {{ContainerManager}}:
* updateQueuingLimits -> updateQueuingLimit

In {{QueuingContainerManagerImpl}}:
* Maybe move the setMaxQueueLength(-1) and the setMaxWaitTime(-1) inside the newInstance() call?
* I don't think updateQueuingLimits needs to be synchronized.
* In updateQueuingLimits, we probably want to update the queue wait time too. Would it be better to set the queuingLimit directly instead of setting each parameter?
* Line 499, is maxQueueLength ever -1? If dist scheduling is not enabled, we do not update the limits. Also, I think we should call pruneOpportunisticContainers() only if queue length is greater than 0.
* Maybe pruneOpportunisticContainerQueue() -> pruneQueuedOpportunisticContainers() or shedQueuedOpportunisticContainers()?
* In pruneOpportunisticContainerQueue(), let's use more descriptive variable names than counter and iterator.
* In pruneOpportunisticContainerQueue(), let's use the same logic/code as in the stopContainerInternal().
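
To illustrate the pruning comments above, here is a minimal, self-contained sketch. It is not the code from the patch: the class, the String container ids and the choice to shed the most recently queued containers first are all assumptions. It also takes one reading of the "greater than 0" comment, namely that pruning is skipped until a positive limit has been set.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch only: a hypothetical stand-in for the NM queue of opportunistic
// containers, with container ids modelled as plain strings.
final class OpportunisticQueueSketch {
  private final Deque<String> queuedContainerIds = new ArrayDeque<>();
  // Assumption: a non-positive value means no limit has been received yet.
  private int maxQueueLength = -1;

  void enqueue(String containerId) {
    queuedContainerIds.addLast(containerId);
  }

  void updateQueuingLimit(int newMaxQueueLength) {
    maxQueueLength = newMaxQueueLength;
  }

  int queueLength() {
    return queuedContainerIds.size();
  }

  /** Removes queued containers beyond the limit and returns their ids. */
  List<String> shedQueuedOpportunisticContainers() {
    List<String> shedContainerIds = new ArrayList<>();
    if (maxQueueLength <= 0) {
      return shedContainerIds; // no limit set, nothing to prune
    }
    int numContainersToShed = queuedContainerIds.size() - maxQueueLength;
    for (int shedSoFar = 0; shedSoFar < numContainersToShed; shedSoFar++) {
      // Assumption: shed from the tail, i.e. the most recently queued first.
      shedContainerIds.add(queuedContainerIds.removeLast());
    }
    return shedContainerIds;
  }
}
```

In the real class, each shed container would presumably go through the same path as stopContainerInternal(), rather than just being dropped from the queue.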

In {{DistributedSchedulingService}}:
* Remove line break from import.

In {{QueueLimitCalculator}}:
* Remove line breaks from imports.
* I think we can get rid of the median_sigma. Having mean_sigma should be sufficient. Moreover, standard deviation should not depend on whether we are using mean or median (but this will not be a problem if we remove the median).
* The calculation of the mean and stdev should be done over all nodes and not just the top k.
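
As a rough illustration of the last two points, the sketch below computes the mean and the standard deviation over the queue lengths of all nodes, and the stdev does not depend on any median. All names are hypothetical. The way the limit is derived (mean plus a number of standard deviations, clamped to the min and max queue lengths) is an assumption about what the calculator does, not something taken from the patch.

```java
// Sketch only: class, method and parameter names are hypothetical.
final class QueueLengthStatsSketch {

  /** Mean queue length over all nodes, not only the top k. */
  static double mean(int[] queueLengths) {
    if (queueLengths.length == 0) {
      return 0.0;
    }
    long sum = 0;
    for (int length : queueLengths) {
      sum += length;
    }
    return (double) sum / queueLengths.length;
  }

  /** Population standard deviation of the queue lengths around the mean. */
  static double stdev(int[] queueLengths) {
    if (queueLengths.length == 0) {
      return 0.0;
    }
    double mean = mean(queueLengths);
    double sumOfSquaredDiffs = 0.0;
    for (int length : queueLengths) {
      double diff = length - mean;
      sumOfSquaredDiffs += diff * diff;
    }
    return Math.sqrt(sumOfSquaredDiffs / queueLengths.length);
  }

  /** Assumed rule: mean + numStdevs * stdev, clamped to [minLength, maxLength]. */
  static int maxQueueLength(int[] queueLengths, double numStdevs,
      int minLength, int maxLength) {
    double threshold = mean(queueLengths) + numStdevs * stdev(queueLengths);
    int rounded = (int) Math.round(threshold);
    return Math.max(minLength, Math.min(maxLength, rounded));
  }
}
```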

> Corrective mechanisms for rebalancing NM container queues
> ---------------------------------------------------------
>
>                 Key: YARN-2888
>                 URL: https://issues.apache.org/jira/browse/YARN-2888
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager, resourcemanager
>            Reporter: Konstantinos Karanasos
>            Assignee: Arun Suresh
>         Attachments: YARN-2888-yarn-2877.001.patch, YARN-2888-yarn-2877.002.patch, YARN-2888.003.patch, YARN-2888.004.patch
>
>
> Bad queuing decisions by the LocalRMs (e.g., due to the distributed nature of the scheduling decisions or due to having a stale image of the system) may lead to an imbalance in the waiting times of the NM container queues. This can in turn have an impact on job execution times and cluster utilization.
> To this end, we introduce corrective mechanisms that may remove (whenever needed) container requests from overloaded queues, adding them to less-loaded ones.


