Posted to issues@flink.apache.org by "Fei Feng (Jira)" <ji...@apache.org> on 2023/03/16 05:38:00 UTC

[jira] [Created] (FLINK-31482) support count jobmanager-failed failover times

Fei Feng created FLINK-31482:
--------------------------------

             Summary: support count jobmanager-failed failover times
                 Key: FLINK-31482
                 URL: https://issues.apache.org/jira/browse/FLINK-31482
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination, Runtime / Metrics
    Affects Versions: 1.16.1
            Reporter: Fei Feng


We have a metric, `numRestarts`, that indicates how many times a job has failed over, but we have no metric that indicates how many times the job has been recovered from HA (high availability).

There are two problems:

1. When a JobManager process crashes, the metric system gives us no way of knowing that the JobManager crashed and that the job was recovered.

2. When a new JobManager becomes leader, `numRestarts` starts again from zero, which sometimes misleads our users. Most users consider a failover caused by a JM failure and a failover caused by a job failure to be the same thing; the effect, at least, is the same.
 
I suggest that we:
1. add a new metric that indicates how many times the job has been recovered from HA
2. have the `numRestarts` metric also count recoveries from HA
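
To make the two suggestions concrete, here is a minimal sketch in plain Java. It is not Flink code: `InMemoryHaStore`, `FailoverCounters`, and the metric name `numHaRecoveries` are hypothetical names invented for illustration, and the in-memory map only stands in for whatever durable HA storage would actually hold the values. The idea is that the counters are persisted outside the JobManager process, so a newly elected leader continues from the stored values instead of starting from zero.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for durable HA storage; not a Flink API. */
final class InMemoryHaStore {
    private final Map<String, Long> values = new HashMap<>();

    boolean contains(String key) { return values.containsKey(key); }
    long get(String key) { return values.getOrDefault(key, 0L); }
    void put(String key, long value) { values.put(key, value); }
}

/** Hypothetical counters that survive a JobManager failover. */
final class FailoverCounters {
    private static final String RESTARTS = "numRestarts";
    private static final String HA_RECOVERIES = "numHaRecoveries"; // assumed name

    private final InMemoryHaStore store;

    /** Called when a JobManager gains leadership for the job. */
    FailoverCounters(InMemoryHaStore store) {
        this.store = store;
        if (store.contains(RESTARTS)) {
            // A previous leader already ran this job, so this is an HA recovery.
            store.put(HA_RECOVERIES, store.get(HA_RECOVERIES) + 1); // suggestion 1
            store.put(RESTARTS, store.get(RESTARTS) + 1);           // suggestion 2
        } else {
            // First run of the job: initialise both counters.
            store.put(RESTARTS, 0L);
            store.put(HA_RECOVERIES, 0L);
        }
    }

    /** Called on an ordinary job restart within the same JobManager. */
    void onJobRestart() { store.put(RESTARTS, store.get(RESTARTS) + 1); }

    long numRestarts() { return store.get(RESTARTS); }
    long numHaRecoveries() { return store.get(HA_RECOVERIES); }
}
```

In this sketch, constructing a second `FailoverCounters` over the same store models a new JobManager becoming leader: `numRestarts` continues from the persisted value and `numHaRecoveries` goes up by one.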
 
 


