Posted to issues@flink.apache.org by "Zhu Zhu (Jira)" <ji...@apache.org> on 2019/09/25 10:08:00 UTC

[jira] [Updated] (FLINK-14206) Make fullRestart metric to count fine grained restarts as well

     [ https://issues.apache.org/jira/browse/FLINK-14206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhu Zhu updated FLINK-14206:
----------------------------
    Description: 
With fine-grained recovery introduced in 1.9.0, the {{fullRestart}} metric only counts how many times the entire graph has been restarted; it does not include fine-grained failure restarts.

As many users rely on this metric for failure detection, monitoring, and alerting, I propose making it count fine-grained failure restarts as well.

The concrete proposal is:
1. Add a counter {{numberOfRestartCounter}} in ExecutionGraph to count all restarts. The counter is not to be registered to metric groups.
2. Let {{fullRestart}} query the value of this counter instead of {{ExecutionGraph#globalModVersion}}.
3. Increment {{numberOfRestartCounter}} in {{ExecutionGraph#failGlobal}}.
4. Increment {{numberOfRestartCounter}} in {{ExecutionGraph#notifyExecutionChange}} where the failover strategy is notified, or alternatively in {{AdaptedRestartPipelinedRegionStrategyNG}} so that only failovers that actually happened are counted.


  was:
With fine grained recovery introduced in 1.9.0, the {{fullRestart}} metric only counts how many times the entire graph has been restarted, not including the number of fine grained failure restarts.

As many users leverage this metric for failure detecting monitoring and alerting, I'd propose to make it also count fine grained failure restarts.

The concrete proposal is:
1. Add a counter  {{numberOfRestartCounter}} in ExecutionGraph to count all restarts. The counter is not to be registered to metric groups.
2. Let {{fullRestart}} query the value of the counter, instead of {{ExecutionGraph#globalModVersion}}
3. increment {{numberOfRestartCounter}} in {{ExecutionGraph#failGlobal}}
4. increment {{numberOfRestartCounter}} in {{ExecutionGraph#notifyExecutionChange}} where notifying the failover strategy, or maybe in {{AdaptedRestartPipelinedRegionStrategyNG}} to only count those failover really happens



> Make fullRestart metric to count fine grained restarts as well
> --------------------------------------------------------------
>
>                 Key: FLINK-14206
>                 URL: https://issues.apache.org/jira/browse/FLINK-14206
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0
>            Reporter: Zhu Zhu
>            Priority: Major
>             Fix For: 1.9.1
>
>
> With fine-grained recovery introduced in 1.9.0, the {{fullRestart}} metric only counts how many times the entire graph has been restarted; it does not include fine-grained failure restarts.
> As many users rely on this metric for failure detection, monitoring, and alerting, I propose making it count fine-grained failure restarts as well.
> The concrete proposal is:
> 1. Add a counter {{numberOfRestartCounter}} in ExecutionGraph to count all restarts. The counter is not to be registered to metric groups.
> 2. Let {{fullRestart}} query the value of this counter instead of {{ExecutionGraph#globalModVersion}}.
> 3. Increment {{numberOfRestartCounter}} in {{ExecutionGraph#failGlobal}}.
> 4. Increment {{numberOfRestartCounter}} in {{ExecutionGraph#notifyExecutionChange}} where the failover strategy is notified, or alternatively in {{AdaptedRestartPipelinedRegionStrategyNG}} so that only failovers that actually happened are counted.
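
The four steps of the proposal could be sketched roughly as follows. This is a minimal, self-contained illustration and not Flink code: the class {{RestartCountingSketch}} is a hypothetical stand-in for ExecutionGraph, {{onRegionFailover}} is a hypothetical stand-in for whichever fine-grained hook is chosen in step 4, and a plain {{LongSupplier}} stands in for the gauge that backs {{fullRestart}}. Only the names {{numberOfRestartCounter}}, {{globalModVersion}}, and {{failGlobal}} come from the issue text.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for ExecutionGraph; only restart counting is modeled. */
class RestartCountingSketch {

    // Step 1: an internal counter of all restarts. It is deliberately not
    // registered with any metric group; it is only read through the gauge.
    private final AtomicLong numberOfRestartCounter = new AtomicLong();

    // Previous data source of the metric: bumped on global failures only.
    private long globalModVersion;

    // Step 3: a global failure restarts the whole graph and is counted.
    void failGlobal() {
        globalModVersion++;
        numberOfRestartCounter.incrementAndGet();
    }

    // Step 4 (assumed hook name): a fine-grained region failover is counted
    // too, but it does not change globalModVersion.
    void onRegionFailover() {
        numberOfRestartCounter.incrementAndGet();
    }

    long getNumberOfRestarts() {
        return numberOfRestartCounter.get();
    }

    long getGlobalModVersion() {
        return globalModVersion;
    }

    // Step 2: the metric reads the counter instead of globalModVersion.
    LongSupplier fullRestartGauge() {
        return this::getNumberOfRestarts;
    }
}
```

In this sketch, after one call to {{failGlobal()}} and two calls to {{onRegionFailover()}}, the gauge reads 3 while {{globalModVersion}} is still 1, which is the gap the proposal aims to close.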



--
This message was sent by Atlassian Jira
(v8.3.4#803005)