Posted to dev@reef.apache.org by "Chris Douglas (JIRA)" <ji...@apache.org> on 2017/02/09 20:37:41 UTC

[jira] [Commented] (REEF-1732) Build Metrics System

    [ https://issues.apache.org/jira/browse/REEF-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860155#comment-15860155 ] 

Chris Douglas commented on REEF-1732:
-------------------------------------

> +1 on simplifications. Dhruv Mahajan, Chris Douglas: The original design contained some learnings from the Hadoop metrics system. Anything in particular we should look out for?

Nothing comes to mind. Hadoop metrics2 is a pub/sub model that's pretty familiar.

Task counters in MapReduce work slightly differently and have different goals. Specifically, only counters from work that contributes to the output are included, so every execution of a (deterministic) job on the same data should produce identical job counters, regardless of failures or speculative execution. Metrics track utilization, GC time, and other runtime statistics (i.e., what resources did this application consume?). Task counters, in contrast, let users quickly and approximately validate job correctness (i.e., what work did this application accomplish?).

I'm not sure if this distinction is useful to REEF applications, but it leads to very different read semantics. This matters particularly when handling non-deterministic execution with failures: task counters from a re-executed task may actually have multiple versions of "correct" values, since multiple versions of the same task affected the output.
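
To make the counter semantics concrete, here is a minimal sketch, not Hadoop's actual implementation. The `TaskAttempt` type, the `committed` flag, and the counter names are hypothetical and exist only for illustration. It assumes deterministic tasks and shows that summing counters over committed attempts only gives the same job totals with or without failures and speculative duplicates.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TaskAttempt:
    """One execution attempt of a task (hypothetical type for illustration)."""
    task_id: str
    attempt: int
    committed: bool  # True only for the attempt whose output was kept
    counters: Counter = field(default_factory=Counter)


def job_counters(attempts):
    """Sum counters over committed attempts only.

    Failed attempts and losing speculative attempts are ignored, so the
    totals depend only on the work that actually reached the output.
    """
    total = Counter()
    for a in attempts:
        if a.committed:
            total.update(a.counters)
    return total


# Run A: every task succeeds on its first attempt.
run_a = [
    TaskAttempt("map_0", 0, True, Counter(records_in=100, records_out=40)),
    TaskAttempt("map_1", 0, True, Counter(records_in=120, records_out=55)),
]

# Run B: map_0 fails partway through and is retried; map_1 has a
# speculative duplicate whose output is discarded.
run_b = [
    TaskAttempt("map_0", 0, False, Counter(records_in=63)),
    TaskAttempt("map_0", 1, True, Counter(records_in=100, records_out=40)),
    TaskAttempt("map_1", 0, True, Counter(records_in=120, records_out=55)),
    TaskAttempt("map_1", 1, False, Counter(records_in=120, records_out=55)),
]

# Both runs report the same job-level counters.
print(job_counters(run_a) == job_counters(run_b))
```

A runtime metric such as total CPU time would instead sum over all four attempts in run B. If a task is non-deterministic, its attempts can legitimately disagree, and a single committed value per task is no longer enough.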

> Build Metrics System
> --------------------
>
>                 Key: REEF-1732
>                 URL: https://issues.apache.org/jira/browse/REEF-1732
>             Project: REEF
>          Issue Type: New Feature
>          Components: IMRU, REEF
>            Reporter: Julia
>            Assignee: Julia
>         Attachments: IMRU Metrics System.docx
>
>
> IMRU Metrics provides metrics data to the system so that it can be shown to the user for monitoring or diagnosis. The goal is to build an end-to-end (E2E) flow with simple, basic metrics data. We can then add more data later.
> * IMetricsProvider – there are multiple sources of metrics data:
>   1. Task metrics – These are specific to IMRU tasks, such as the current iteration and progress. Each task can send its state back to the driver and let the driver aggregate it. Alternatively, since UpdateTask knows the current iteration and progress, we can start by getting task status from the update task alone. Task metrics can be provided by a task function such as IUpdateFunction and sent to the driver by the task host as a TaskMessage with the heartbeat.
>   2. Driver metrics – For the IMRU driver, these can be system state such as WaitingForEvaluator or TasksRunning, the current retry number, etc. These driver states are maintained inside the IMRU driver.
>   3. IMRUDriver will implement IMetricsProvider and supply metrics data.
> * IMetricsSink – The metrics data will be output somewhere so that a monitoring tool can consume it. An interface, IMetricsSink, will be defined for sinking metrics data. An implementation of the interface can store the data in remote storage. Multiple sinks can be injected.
> * MetricsManager – It schedules a timer to get metrics from the IMetricsProviders and outputs the metrics data through the IMetricsSinks.
> The attached file shows a diagram of the design.
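
The provider/sink/manager split described in the issue can be sketched roughly as follows. This is an illustration under assumptions, not the REEF API: the class and method names (`MetricsProvider.get_metrics`, `MetricsSink.sink`, `collect_once`, `ListSink`) and the metric keys are hypothetical, and the in-memory sink stands in for the remote storage the issue mentions. Only the three roles and the timer-driven pull come from the description.

```python
import threading
from abc import ABC, abstractmethod


class MetricsProvider(ABC):
    """Source of metrics data (role of IMetricsProvider in the issue)."""

    @abstractmethod
    def get_metrics(self) -> dict:
        ...


class MetricsSink(ABC):
    """Destination for metrics data (role of IMetricsSink in the issue)."""

    @abstractmethod
    def sink(self, metrics: dict) -> None:
        ...


class DriverMetrics(MetricsProvider):
    """Hypothetical driver-side provider exposing state and retry count."""

    def __init__(self):
        self.state = "WaitingForEvaluator"
        self.retry = 0

    def get_metrics(self):
        return {"driver.state": self.state, "driver.retry": self.retry}


class ListSink(MetricsSink):
    """In-memory sink standing in for remote storage."""

    def __init__(self):
        self.records = []

    def sink(self, metrics):
        self.records.append(dict(metrics))


class MetricsManager:
    """Periodically pulls from all providers and pushes to all sinks."""

    def __init__(self, providers, sinks, interval_seconds=1.0):
        self._providers = list(providers)
        self._sinks = list(sinks)
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = None

    def collect_once(self):
        # Merge one snapshot from every provider, then fan it out.
        snapshot = {}
        for provider in self._providers:
            snapshot.update(provider.get_metrics())
        for sink in self._sinks:
            sink.sink(snapshot)
        return snapshot

    def _run(self):
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop.wait(self._interval):
            self.collect_once()

    def start(self):
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()


driver = DriverMetrics()
sink = ListSink()
manager = MetricsManager([driver], [sink])

manager.collect_once()          # snapshot while waiting for evaluators
driver.state = "TasksRunning"
manager.collect_once()          # snapshot after tasks have started
print(sink.records)
```

The example calls `collect_once` directly so the result does not depend on timing; `start()` and `stop()` run the same collection on a background timer. Task metrics arriving with heartbeats would be one more provider registered with the manager.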



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)