Posted to yarn-issues@hadoop.apache.org by "Carlo Curino (JIRA)" <ji...@apache.org> on 2015/03/29 23:54:53 UTC

[jira] [Commented] (YARN-2670) Adding feedback capability to capacity scheduler from external systems

    [ https://issues.apache.org/jira/browse/YARN-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385978#comment-14385978 ] 

Carlo Curino commented on YARN-2670:
------------------------------------

I really welcome this line of thinking. 

We did some work in this space and demonstrated that, under certain experimental conditions, a feedback loop on instantaneous cluster conditions, when coupled with proper extensions of the scheduler, can lead to substantial performance improvements (the Limplock paper, http://dl.acm.org/citation.cfm?id=2523627, discusses related ideas).

This is particularly relevant because YARN does not track all resources (e.g., there is no disk or network bookkeeping/policing). It is also needed to account for load produced by other services that run on the box but are not managed by YARN, e.g., HDFS / HBase.

I look forward to hearing more about Astro and how you are attacking this. Do you have any document or initial patch for this?

> Adding feedback capability to capacity scheduler from external systems
> ----------------------------------------------------------------------
>
>                 Key: YARN-2670
>                 URL: https://issues.apache.org/jira/browse/YARN-2670
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Mayank Bansal
>            Assignee: Mayank Bansal
>
> The sheer growth in data volume and Hadoop cluster size makes it a significant challenge to diagnose and locate problems in a production-level cluster environment efficiently and within a short period of time. Often, distributed monitoring systems are not capable of detecting a problem well in advance when a large-scale Hadoop cluster starts to deteriorate in performance or becomes unavailable. Thus, incoming workloads, scheduled between the time when the cluster starts to deteriorate and the time when the problem is identified, suffer from longer execution times. As a result, both reliability and throughput of the cluster drop significantly. We address this problem by proposing a system called Astro, which consists of a predictive model and an extension to the Capacity scheduler. The predictive model in Astro takes into account a rich set of cluster behavioral information that is collected by monitoring processes and models it using machine learning algorithms to predict future behavior of the cluster. The Astro predictive model detects anomalies in the cluster and also identifies a ranked set of metrics that have contributed the most towards the problem. The Astro scheduler uses the prediction outcome and the list of metrics to decide whether it needs to move and reduce workloads from the problematic cluster nodes or to prevent additional workload allocations to them, in order to improve both throughput and reliability of the cluster.
> This JIRA is only for adding feedback capabilities to the Capacity Scheduler, so that it can take feedback from external systems.
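
As a rough illustration of the feedback idea in the description above (not Astro's design and not a YARN API), the sketch below shows how a scheduler could skip nodes that an external predictive model has flagged. The names NodeFeedback, anomaly_score, top_metrics, schedulable_nodes, and the 0.8 threshold are all hypothetical, chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class NodeFeedback:
    """Hypothetical feedback record pushed by an external predictive model."""
    node_id: str
    anomaly_score: float  # assumed scale: 0.0 (healthy) to 1.0 (likely failing)
    top_metrics: list = field(default_factory=list)  # ranked contributing metrics


def schedulable_nodes(nodes, feedback, threshold=0.8):
    """Return the nodes that stay eligible for new allocations.

    A node is excluded when its anomaly score reaches the threshold.
    Nodes with no feedback at all are treated as healthy.
    """
    flagged = {f.node_id for f in feedback if f.anomaly_score >= threshold}
    return [n for n in nodes if n not in flagged]


nodes = ["node-1", "node-2", "node-3"]
feedback = [
    NodeFeedback("node-2", 0.93, ["disk_io_wait", "net_retransmits"]),
    NodeFeedback("node-3", 0.10),
]
print(schedulable_nodes(nodes, feedback))
```

This covers only the "prevent additional workload allocations" half of the description. Moving or reducing work that is already running on a flagged node would need a separate mechanism, such as preemption, which is not sketched here.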



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)