Posted to issues@flink.apache.org by "Fei Feng (Jira)" <ji...@apache.org> on 2023/03/16 05:38:00 UTC
[jira] [Created] (FLINK-31482) support count jobmanager-failed failover times
Fei Feng created FLINK-31482:
--------------------------------
Summary: support count jobmanager-failed failover times
Key: FLINK-31482
URL: https://issues.apache.org/jira/browse/FLINK-31482
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination, Runtime / Metrics
Affects Versions: 1.16.1
Reporter: Fei Feng
We have a metric, `numRestarts`, that indicates how many times a job has failed over, but we have no metric that indicates how many times the job has been recovered from HA (high availability).
There are two problems:
1. When a JobManager process crashes, the metric system gives us no way of knowing that the JobManager crashed and that the job was recovered.
2. When a new JobManager becomes leader, `numRestarts` starts again from zero, which sometimes misleads our users. Most users consider a failover caused by a JM failure and a failover caused by a job failure to be the same; the effect, at least, is the same.
I suggest that we:
1. Add a new metric that indicates how many times the job has been recovered from HA.
2. Have the `numRestarts` metric also count recoveries from HA.
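
To illustrate the two suggestions, here is a minimal sketch in plain Java. It does not use any Flink API. `HaStore`, `JobCounters`, the `recoveredFromHa` flag and the `numHaRecoveries` name are all hypothetical and exist only for this example; the in-memory map stands in for whatever durable HA storage a real implementation would use. The idea is that the counters are persisted, so a new leader can load them instead of starting from zero, and a recovery from HA increments both counters.

```java
import java.util.HashMap;
import java.util.Map;

public class FailoverCounters {

    /** Hypothetical stand-in for durable HA storage; a real store would outlive the JobManager process. */
    public static final class HaStore {
        private final Map<String, Long> values = new HashMap<>();

        public long get(String key) {
            return values.getOrDefault(key, 0L);
        }

        public void put(String key, long value) {
            values.put(key, value);
        }
    }

    /** Hypothetical counters owned by one JobManager leader session for one job. */
    public static final class JobCounters {
        private final HaStore store;
        private final String jobId;
        private long numRestarts;
        private long numHaRecoveries;

        public JobCounters(HaStore store, String jobId, boolean recoveredFromHa) {
            this.store = store;
            this.jobId = jobId;
            // Load the previous values instead of starting from zero (problem 2).
            this.numRestarts = store.get(jobId + ".numRestarts");
            this.numHaRecoveries = store.get(jobId + ".numHaRecoveries");
            if (recoveredFromHa) {
                numHaRecoveries++; // suggestion 1: dedicated HA recovery counter
                numRestarts++;     // suggestion 2: numRestarts also counts HA recoveries
                persist();
            }
        }

        /** Called when the job restarts without a JobManager failure. */
        public void onJobRestart() {
            numRestarts++;
            persist();
        }

        public long getNumRestarts() {
            return numRestarts;
        }

        public long getNumHaRecoveries() {
            return numHaRecoveries;
        }

        private void persist() {
            store.put(jobId + ".numRestarts", numRestarts);
            store.put(jobId + ".numHaRecoveries", numHaRecoveries);
        }
    }

    public static void main(String[] args) {
        HaStore store = new HaStore();

        // First leader: the job restarts twice because of job failures.
        JobCounters first = new JobCounters(store, "job-1", false);
        first.onJobRestart();
        first.onJobRestart();

        // The JobManager crashes; a new leader recovers the job from HA.
        JobCounters second = new JobCounters(store, "job-1", true);
        System.out.println("numRestarts=" + second.getNumRestarts()
                + " numHaRecoveries=" + second.getNumHaRecoveries());
    }
}
```

In this sketch, two job restarts followed by one HA recovery leave `numRestarts` at 3 and the hypothetical `numHaRecoveries` at 1, so a user watching `numRestarts` sees every failover regardless of its cause, while the second counter still makes JobManager failures distinguishable.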
--
This message was sent by Atlassian Jira
(v8.20.10#820010)