Posted to issues@flink.apache.org by "Yun Gao (Jira)" <ji...@apache.org> on 2022/04/13 06:28:07 UTC

[jira] [Updated] (FLINK-24386) JobMaster should guard against exceptions from OperatorCoordinator

     [ https://issues.apache.org/jira/browse/FLINK-24386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yun Gao updated FLINK-24386:
----------------------------
    Fix Version/s: 1.16.0

> JobMaster should guard against exceptions from OperatorCoordinator
> ------------------------------------------------------------------
>
>                 Key: FLINK-24386
>                 URL: https://issues.apache.org/jira/browse/FLINK-24386
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.0, 1.13.2
>            Reporter: David Morávek
>            Priority: Major
>             Fix For: 1.15.0, 1.16.0
>
>
> Original report from [~sewen]:
> When the scheduler processes the call that triggers a _globalFailover_ and something goes wrong in that processing, the _JobManager_ gets stuck. Concretely, I have an _OperatorCoordinator_ that throws an exception in _subtaskFailed()_, which is called as part of processing the failover.
> While this is a bug in that coordinator, the whole thing seems a bit dangerous to me. If there is a bug in any part of the failover logic, we have no safety net: no "hard crash" that lets the process be restarted. We only see a log line (below) and everything becomes unresponsive.
> {code:java}
> ERROR org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor [] - Caught exception while executing runnable in main thread.
> {code}
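> As a hedged illustration of the failure mode described above (a simplified stand-in interface, not the real org.apache.flink.runtime.operators.coordination.OperatorCoordinator, whose surface is larger; all names here are illustrative only):
> {code:java}
> /** Simplified stand-in for the coordinator callback invoked during a global failover. */
> interface SubtaskFailureListener {
>     void subtaskFailed(int subtask, Throwable reason);
> }
>
> /**
>  * A buggy coordinator: the exception thrown here escapes the JobMaster's
>  * main-thread action. Today the RPC actor only logs it (the ERROR line above),
>  * so the failover never completes and the JobManager appears stuck.
>  */
> class BuggyCoordinator implements SubtaskFailureListener {
>     @Override
>     public void subtaskFailed(int subtask, Throwable reason) {
>         throw new IllegalStateException("bug while handling failure of subtask " + subtask, reason);
>     }
> }
> {code}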
> Shouldn't we have some safety nets in place here?
>  * I am wondering if the place where that line is logged should actually invoke the fatal error handler. If an exception propagates out of a main-thread action, we need to call off all bets and assume things have become inconsistent (see the sketch after this list).
>  * At the very least, the failover procedure itself should be guarded. If an error happens while processing the global failover, then we need to treat this as beyond redemption and declare a fatal error.
> The fatal error would give us a log line and the user a container restart, hopefully fixing things (unless it was a deterministic error).
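> A minimal sketch of that first safety net, assuming hypothetical names (FatalErrorHandler and GuardedMainThreadExecutor here are stand-ins, not specific Flink classes): every action handed to the main thread is wrapped so that an escaping Throwable is escalated to the fatal error handler instead of only being logged.
> {code:java}
> import java.util.concurrent.Executor;
>
> /** Hypothetical handler; in practice it would terminate the process so it can be restarted. */
> interface FatalErrorHandler {
>     void onFatalError(Throwable cause);
> }
>
> /** Wraps a main-thread executor so that an escaping exception "calls off all bets". */
> final class GuardedMainThreadExecutor implements Executor {
>     private final Executor delegate;
>     private final FatalErrorHandler fatalErrorHandler;
>
>     GuardedMainThreadExecutor(Executor delegate, FatalErrorHandler fatalErrorHandler) {
>         this.delegate = delegate;
>         this.fatalErrorHandler = fatalErrorHandler;
>     }
>
>     @Override
>     public void execute(Runnable action) {
>         delegate.execute(() -> {
>             try {
>                 action.run();
>             } catch (Throwable t) {
>                 // Instead of only logging ("Caught exception while executing runnable
>                 // in main thread"), assume state may be inconsistent and escalate.
>                 fatalErrorHandler.onFatalError(t);
>             }
>         });
>     }
> }
> {code}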
> [~dmvk] notes:
>  * OperatorCoordinator is part of the public API (it is part of the JobGraph).
>  ** Can be provided by implementing CoordinatedOperatorFactory
>  ** This actually gives the issue higher priority than I initially thought.
>  * We should guard against flaws in user code:
>  ** There are two types of interfaces
>  *** (CRITICAL) Public API for JobGraph construction / submission
>  *** Semi-public interfaces such as custom HA services; these are for power users, so I wouldn't be as concerned there.
>  ** We already do a good job guarding against failures on the TM side.
>  ** Considering the critical parts on the JM side, there are two places where user code can "hook" in:
>  *** OperatorCoordinator
>  *** InitializeOnMaster, FinalizeOnMaster (batch sinks only, legacy from the Hadoop world)
> --- 
> We should audit all the calls to OperatorCoordinator and handle failures accordingly. We want to avoid unnecessary JVM terminations as much as possible (though sometimes that is the only option).
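> One possible shape for such guarded calls, as a hedged sketch only (the interface is simplified and the names GuardedCoordinator and globalFailureHandler are hypothetical, not the actual Flink classes): every call into the user-provided coordinator is wrapped, and an exception fails the job rather than escaping into the JobMaster's main thread or terminating the JVM.
> {code:java}
> import java.util.function.Consumer;
>
> /** Simplified stand-in for the coordinator callbacks the JobMaster invokes. */
> interface CoordinatorCallbacks {
>     void start() throws Exception;
>     void subtaskFailed(int subtask, Throwable reason);
> }
>
> /** Delegates to the user-provided coordinator and contains its failures. */
> final class GuardedCoordinator implements CoordinatorCallbacks {
>     private final CoordinatorCallbacks userCoordinator;
>     private final Consumer<Throwable> globalFailureHandler; // e.g. triggers a job failover
>
>     GuardedCoordinator(CoordinatorCallbacks userCoordinator,
>                        Consumer<Throwable> globalFailureHandler) {
>         this.userCoordinator = userCoordinator;
>         this.globalFailureHandler = globalFailureHandler;
>     }
>
>     @Override
>     public void start() {
>         guard(() -> userCoordinator.start());
>     }
>
>     @Override
>     public void subtaskFailed(int subtask, Throwable reason) {
>         guard(() -> userCoordinator.subtaskFailed(subtask, reason));
>     }
>
>     private void guard(ThrowingRunnable call) {
>         try {
>             call.run();
>         } catch (Throwable t) {
>             // Fail the job instead of letting the exception escape; a fatal
>             // error (JVM exit) remains the last resort for unrecoverable cases.
>             globalFailureHandler.accept(t);
>         }
>     }
>
>     @FunctionalInterface
>     interface ThrowingRunnable {
>         void run() throws Exception;
>     }
> }
> {code}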



--
This message was sent by Atlassian Jira
(v8.20.1#820001)