Posted to issues@flink.apache.org by "Daren Wong (Jira)" <ji...@apache.org> on 2022/07/05 16:52:00 UTC

[jira] [Created] (FLINK-28411) OperatorCoordinator exception may fail Session Cluster

Daren Wong created FLINK-28411:
----------------------------------

             Summary: OperatorCoordinator exception may fail Session Cluster
                 Key: FLINK-28411
                 URL: https://issues.apache.org/jira/browse/FLINK-28411
             Project: Flink
          Issue Type: Bug
          Components: Connectors / Common
            Reporter: Daren Wong
             Fix For: 1.15.2


Part of the Scheduler's startScheduling procedure involves starting every OperatorCoordinatorHolder. When one of the OperatorCoordinators fails to start, the exception is forwarded up the stack, triggering a JobMaster failover. However, JobMaster failover only works if HA is enabled[1]. If HA is not enabled, the fatal error handler simply exits the JM process, killing the entire cluster. This is problematic for a session cluster, where multiple jobs may be running. It also does not play well with external tooling that does not expect a job failure to cause a full cluster failure.

 

It would be preferable if failure to start an OperatorCoordinator did not take down the entire cluster, but instead failed that particular job. 
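
To illustrate the preferred behaviour, here is a minimal, self-contained sketch. It does not use Flink's actual classes: Coordinator, Job and startScheduling below are hypothetical stand-ins, and the real scheduler and fatal error handling are considerably more involved. The only point is that a start failure is caught and recorded against the owning job rather than rethrown to a process-wide handler.

```java
import java.util.Arrays;
import java.util.List;

public class CoordinatorStartSketch {

    /** Hypothetical stand-in for an operator coordinator; not Flink's interface. */
    interface Coordinator {
        void start() throws Exception;
    }

    /** Hypothetical per-job state; a real scheduler tracks far more than this. */
    static final class Job {
        final String name;
        final List<Coordinator> coordinators;
        Throwable failureCause; // stays null while the job is healthy

        Job(String name, List<Coordinator> coordinators) {
            this.name = name;
            this.coordinators = coordinators;
        }

        boolean isFailed() {
            return failureCause != null;
        }
    }

    /**
     * Starts every coordinator of one job. A start failure marks only this job
     * as failed and stops starting its remaining coordinators; the exception is
     * not propagated, so other jobs in the same process are unaffected.
     */
    static void startScheduling(Job job) {
        for (Coordinator coordinator : job.coordinators) {
            try {
                coordinator.start();
            } catch (Exception e) {
                job.failureCause = e; // fail this job only
                return;
            }
        }
    }

    public static void main(String[] args) {
        Coordinator healthy = () -> { };
        Coordinator broken = () -> {
            throw new IllegalStateException("coordinator failed to start");
        };

        Job jobA = new Job("job-a", Arrays.asList(healthy, broken));
        Job jobB = new Job("job-b", Arrays.asList(healthy));

        // Both jobs share one process, as they would in a session cluster.
        startScheduling(jobA);
        startScheduling(jobB);

        System.out.println(jobA.name + " failed: " + jobA.isFailed());
        System.out.println(jobB.name + " failed: " + jobB.isFailed());
    }
}
```

In this sketch job-a ends up failed while job-b keeps running, which is the isolation requested above. How the failure should be reported to the client and whether the job should be restarted are left open here.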

 

This issue is similar to https://issues.apache.org/jira/browse/FLINK-24303, which fixed the same problem for the SourceCoordinator specifically.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)