Posted to yarn-issues@hadoop.apache.org by "Li Lu (JIRA)" <ji...@apache.org> on 2016/04/06 02:49:25 UTC

[jira] [Updated] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics

     [ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Lu updated YARN-3816:
------------------------
    Attachment: YARN-3816-YARN-2928-v5.patch

OK, I've done a major revision of the existing patch. Some key changes:
- Refactored the patch so that it applies to the latest branch. 
- Had an offline discussion with [~vinodkv]. For now we focus on real-time aggregation for single data metrics, and the latest patch reflects this. Specifically, this aggregation addresses the case where all containers post their metrics to the same collector, and we aggregate them to get the total metric for the whole application. The aggregation is currently done by maintaining an aggregation table in the collector and periodically aggregating the table.
- Provided an extensible interface to support more aggregation operations in the future. Most binary commutative and associative operations (like average) can fit in this model.
- Extended TimelineMetrics according to the suggestions from [~sjlee0]. However, instead of using "counters" and "gauges" to categorize all metrics, I used the type of the real-time aggregation operation as the metadata of the metric. My hope is that this way we don't limit timeline metrics to the Hadoop scope.
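
As a rough illustration of the points above (not code from the patch), here is a minimal Java sketch of a collector-side aggregation table. All names (AggregationOp, AppAggregationTable, update, aggregate) are hypothetical. It assumes each container reports the latest value of a metric, that the metric carries its aggregation operation as metadata, and that some timer calls aggregate() periodically.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongBinaryOperator;

// Hypothetical: the aggregation operation attached to a metric as metadata.
// New operations are added by adding a constant with a binary function.
enum AggregationOp {
  SUM((a, b) -> a + b),
  MAX(Math::max);

  private final LongBinaryOperator fn;

  AggregationOp(LongBinaryOperator fn) {
    this.fn = fn;
  }

  long apply(long a, long b) {
    return fn.applyAsLong(a, b);
  }
}

// Hypothetical per-application table kept by the collector.
final class AppAggregationTable {
  // metricId -> (containerId -> latest value reported by that container)
  private final Map<String, Map<String, Long>> latest = new ConcurrentHashMap<>();
  // metricId -> aggregation operation declared for that metric
  private final Map<String, AggregationOp> ops = new ConcurrentHashMap<>();

  // Called whenever a container posts a metric value to the collector.
  void update(String containerId, String metricId, AggregationOp op, long value) {
    ops.put(metricId, op);
    latest.computeIfAbsent(metricId, k -> new ConcurrentHashMap<>())
        .put(containerId, value);
  }

  // Called periodically: folds the per-container values of each metric
  // into one application-level value using the metric's operation.
  Map<String, Long> aggregate() {
    Map<String, Long> result = new HashMap<>();
    for (Map.Entry<String, Map<String, Long>> e : latest.entrySet()) {
      AggregationOp op = ops.get(e.getKey());
      Long acc = null;
      for (long v : e.getValue().values()) {
        acc = (acc == null) ? v : op.apply(acc, v);
      }
      if (acc != null) {
        result.put(e.getKey(), acc);
      }
    }
    return result;
  }
}
```

In this sketch a later report from the same container overwrites its earlier value, so each periodic pass sees one value per container. Because the operations are commutative and associative, the order in which container values are folded does not matter.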

Some future work:
- Decide the reader API for the aggregated entities. From a web UI point of view, it would be cool to integrate this data with applications, i.e., when a user requests timeline data for one application, we can return the aggregated data as well.
- My goal is to make the aggregation process eventually consistent. However, there may be some concurrency-related issues in this patch. Please feel free to point them out if there are any.
- More unit tests. 
- Support taking averages in aggregations. With the current code framework I think this should be a quick change, but it's of low priority so not in the first draft. (new JIRAs are welcome if anyone has the bandwidth.)
- Decide configs for the aggregation period. 
- Fault tolerance, not there yet... (new JIRAs are welcome if anyone has the bandwidth.)
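
On the averaging item: one common way to express an average as a pairwise merge is to carry a (sum, count) pair, so that the merge step stays commutative and associative. The following Java sketch is only an assumption about how that could look; AvgState and its methods are hypothetical names and are not part of the patch.

```java
// Hypothetical partial state for an average: merging two states adds
// sums and counts, so the merge order and grouping do not matter.
final class AvgState {
  final double sum;
  final long count;

  AvgState(double sum, long count) {
    this.sum = sum;
    this.count = count;
  }

  // State for a single reported value.
  static AvgState of(double value) {
    return new AvgState(value, 1);
  }

  // Commutative and associative merge of two partial states.
  AvgState merge(AvgState other) {
    return new AvgState(sum + other.sum, count + other.count);
  }

  // Final average; returns 0.0 for an empty state.
  double average() {
    return count == 0 ? 0.0 : sum / count;
  }
}
```

The average is computed only once, at read time, from the merged state.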

> [Aggregation] App-level aggregation and accumulation for YARN system metrics
> ----------------------------------------------------------------------------
>
>                 Key: YARN-3816
>                 URL: https://issues.apache.org/jira/browse/YARN-3816
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Li Lu
>              Labels: yarn-2928-1st-milestone
>         Attachments: Application Level Aggregation of Timeline Data.pdf, YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, YARN-3816-YARN-2928-v3.patch, YARN-3816-YARN-2928-v4.patch, YARN-3816-YARN-2928-v5.patch, YARN-3816-feature-YARN-2928.v4.1.patch, YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end users with aggregated states for each application, including: resource (CPU, memory) consumption across all containers, number of containers launched/completed/failed, etc. We need this for apps while they are running as well as when they are done.
> - Also, framework-specific metrics, e.g. HDFS_BYTES_READ, should be aggregated to show details of states at the framework level.
> - Aggregation at other levels (Flow/User/Queue) can be more efficient if based on application-level aggregations rather than raw entity-level data, since far fewer rows need to be scanned (non-aggregated entities such as events and configurations are filtered out).


