Posted to issues@flink.apache.org by "Stephan Ewen (Jira)" <ji...@apache.org> on 2020/05/18 18:19:00 UTC
[jira] [Closed] (FLINK-17781) OperatorCoordinator Context must support calls from thread other than JobMaster Main Thread

     [ https://issues.apache.org/jira/browse/FLINK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Ewen closed FLINK-17781.
--------------------------------

> OperatorCoordinator Context must support calls from thread other than JobMaster Main Thread
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-17781
>                 URL: https://issues.apache.org/jira/browse/FLINK-17781
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>            Reporter: Stephan Ewen
>            Assignee: Stephan Ewen
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.11.0
>
>
> Currently, calls on the Context in the OperatorCoordinator go directly and synchronously to the ExecutionGraph.
> There are two critical problems:
>   - It is common that the code in the OperatorCoordinator runs in a separate thread (for example, because it executes blocking operations). Calling the scheduler from another thread causes the Scheduler to crash (Assertion Error, violation of single threaded property)
>   - Calls on the ExecutionGraph are removed as part of removing the legacy scheduler. Certain calls do not work any more.
> +Problem Level 1:+
> The solution would be to pass in the scheduler and a main thread executor to interact with it.
> However, to do that the scheduler needs to be created before the OperatorCoordinators are created. One could do that by creating the Coordinators lazily after the Scheduler.
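The hand-off described under Problem Level 1 can be sketched as follows. This is only an illustration under stated assumptions, not Flink's implementation: `SingleThreadedScheduler`, `HandOffContext`, and their methods are hypothetical stand-ins, and a plain single-thread `ExecutorService` plays the role of the main thread executor.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class MainThreadHandOffDemo {

    /** Hypothetical stand-in for a scheduler that may only be called from one thread. */
    static final class SingleThreadedScheduler {
        private final Thread mainThread;
        private int failureCount; // only accessed from mainThread

        SingleThreadedScheduler(Thread mainThread) {
            this.mainThread = mainThread;
        }

        void handleFailure(Throwable cause) {
            assertRunningInMainThread();
            failureCount++;
        }

        int getFailureCount() {
            assertRunningInMainThread();
            return failureCount;
        }

        private void assertRunningInMainThread() {
            if (Thread.currentThread() != mainThread) {
                throw new IllegalStateException("Not running in the main thread");
            }
        }
    }

    /** Hypothetical context that forwards every call to the main thread executor. */
    static final class HandOffContext {
        private final SingleThreadedScheduler scheduler;
        private final Executor mainThreadExecutor;

        HandOffContext(SingleThreadedScheduler scheduler, Executor mainThreadExecutor) {
            this.scheduler = scheduler;
            this.mainThreadExecutor = mainThreadExecutor;
        }

        /** Safe to call from any thread; the scheduler is only touched on the main thread. */
        CompletableFuture<Void> failJob(Throwable cause) {
            return CompletableFuture.runAsync(
                    () -> scheduler.handleFailure(cause), mainThreadExecutor);
        }
    }

    /** Calls the context from a foreign thread and returns the scheduler's failure count. */
    static int run() {
        ExecutorService mainThreadExecutor = Executors.newSingleThreadExecutor();
        try {
            // Look up the executor's single worker thread so the scheduler can check against it.
            Thread mainThread = CompletableFuture
                    .supplyAsync(Thread::currentThread, mainThreadExecutor)
                    .join();
            SingleThreadedScheduler scheduler = new SingleThreadedScheduler(mainThread);
            HandOffContext context = new HandOffContext(scheduler, mainThreadExecutor);

            // Simulates coordinator code running in its own thread (here: a pool thread).
            CompletableFuture
                    .runAsync(() -> context.failJob(new RuntimeException("boom")).join())
                    .join();

            // Reading the counter also has to happen on the main thread.
            return CompletableFuture
                    .supplyAsync(scheduler::getFailureCount, mainThreadExecutor)
                    .join();
        } finally {
            mainThreadExecutor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("failures handled on main thread: " + run());
    }
}
```

Calling `scheduler.handleFailure(...)` directly from the foreign thread would trip the thread check, which mirrors the crash described in the first problem above.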
> +Problem Level 2:+
> The Scheduler restores the savepoints as part of the scheduler creation, when the ExecutionGraph and the CheckpointCoordinator are created early in the constructor.
> (Side note: That design is tricky in itself, because it means state is restored before the scheduler is even properly constructed.)
> That means the OperatorCoordinator (or a placeholder component) needs to exist to accept the restored state.
> That brings us to a cyclic dependency:
>   - OperatorCoordinator (context) needs Scheduler and MainThreadExecutor
>   - Scheduler and MainThreadExecutor need constructed ExecutionGraph
>   - ExecutionGraph needs CheckpointCoordinator
>   - CheckpointCoordinator needs OperatorCoordinator
> +Breaking the Cycle+
> The only way we can do this is with a form of lazy initialization:
>   - We eagerly create the OperatorCoordinators so they exist for state restore
>   - We provide an uninitialized context to them
>   - When the Scheduler is started (after leadership is granted) we initialize the context with the (then readily constructed) Scheduler and MainThreadExecutor
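The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the fix: `Scheduler`, `LazyInitializedContext`, `Coordinator`, and all method names are hypothetical, and a direct executor (`Runnable::run`) stands in for the MainThreadExecutor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.Executor;

final class LazyContextDemo {

    /** Hypothetical minimal view of the scheduler. */
    interface Scheduler {
        void failJob(Throwable cause);
    }

    /** Context that exists from the start but reaches the scheduler only after initialization. */
    static final class LazyInitializedContext {
        private volatile Executor mainThreadExecutor;
        private volatile Scheduler scheduler; // written last: non-null implies fully initialized

        void lazyInitialize(Scheduler scheduler, Executor mainThreadExecutor) {
            Objects.requireNonNull(scheduler);
            Objects.requireNonNull(mainThreadExecutor);
            this.mainThreadExecutor = mainThreadExecutor;
            this.scheduler = scheduler;
        }

        void failJob(Throwable cause) {
            Scheduler s = scheduler;
            if (s == null) {
                throw new IllegalStateException("Context was not yet initialized");
            }
            mainThreadExecutor.execute(() -> s.failJob(cause));
        }
    }

    /** Hypothetical coordinator that can restore state without touching the scheduler. */
    static final class Coordinator {
        private final LazyInitializedContext context;
        private byte[] restoredState = new byte[0];

        Coordinator(LazyInitializedContext context) {
            this.context = context;
        }

        void restoreState(byte[] state) {
            this.restoredState = state.clone();
        }

        int restoredBytes() {
            return restoredState.length;
        }

        void reportFailure(String message) {
            context.failJob(new RuntimeException(message));
        }
    }

    static List<String> run() {
        List<String> events = new ArrayList<>();

        // Step 1 and 2: the coordinator is created eagerly with an uninitialized context.
        LazyInitializedContext context = new LazyInitializedContext();
        Coordinator coordinator = new Coordinator(context);

        // State restore works before any scheduler exists.
        coordinator.restoreState(new byte[] {1, 2, 3});
        events.add("restored " + coordinator.restoredBytes() + " bytes");

        // Calls that need the scheduler are rejected until the context is initialized.
        try {
            coordinator.reportFailure("too early");
        } catch (IllegalStateException e) {
            events.add("rejected: " + e.getMessage());
        }

        // Step 3: once the scheduler is started, hand it over together with the executor.
        Scheduler scheduler = cause -> events.add("job failed: " + cause.getMessage());
        Executor directExecutor = Runnable::run;
        context.lazyInitialize(scheduler, directExecutor);
        coordinator.reportFailure("boom");

        return events;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Rejecting early calls with an exception is one possible choice for this sketch; buffering them until initialization would be another, and the issue text does not say which one is intended.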
> +Longer-term Solution+
> The longer-term solution would require a major change in the Scheduler and CheckpointCoordinator setup. Something like this:
>   - Scheduler (and ExecutionGraph) are constructed first
>   - JobMaster waits for leadership
>   - Upon leader grant, Operator Coordinators are constructed and can reference the Scheduler and FencedMainThreadExecutor
>   - CheckpointCoordinator is constructed and references ExecutionGraph and OperatorCoordinators
>   - Savepoint or latest checkpoint is restored
> The implementation of the current solution should couple its parts as loosely as possible, to make it easy to implement the above approach later.


