Posted to issues@flink.apache.org by "Stephan Ewen (Jira)" <ji...@apache.org> on 2020/05/14 08:47:00 UTC

[jira] [Assigned] (FLINK-16357) Extend Checkpoint Coordinator to differentiate between "regional restore" and "full restore".

     [ https://issues.apache.org/jira/browse/FLINK-16357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Ewen reassigned FLINK-16357:
------------------------------------

    Assignee: Stephan Ewen

> Extend Checkpoint Coordinator to differentiate between "regional restore" and "full restore".
> ---------------------------------------------------------------------------------------------
>
>                 Key: FLINK-16357
>                 URL: https://issues.apache.org/jira/browse/FLINK-16357
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Checkpointing
>            Reporter: Stephan Ewen
>            Assignee: Stephan Ewen
>            Priority: Major
>             Fix For: 1.11.0
>
>
> The {{ExecutionGraph}} has the notions of "global failure" (failing the entire execution graph) and "regional failure" (recovering a region connected by transient pipelined data exchanges).
> The latter is the common failover path; the former is a safety net for unexpected failures or inconsistencies (a full reset of the {{ExecutionGraph}} recovers from most inconsistencies).
> The {{OperatorCoordinators}} should only be reset to a checkpoint in the "global failure" case. In the "regional failure" case, they are only notified of which tasks are reset; they keep their internal state and adjust it for the failed tasks.
> To implement this, the {{ExecutionGraph}} needs to forward whether a restore results from a "regional failure" or from a "global failure".
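
The distinction described above can be sketched roughly as follows. This is a minimal illustration only, not Flink's actual code: RestoreMode, Coordinator, RecordingCoordinator and restore(...) are hypothetical names invented for this sketch, and the coordinator state is reduced to a single long value. It only shows the intended branching: a global restore resets the coordinator to the checkpointed state, while a regional restore just notifies it of the reset subtasks and leaves its state alone.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these types are Flink classes.
final class RestoreModeSketch {

    /** Assumed flag that the ExecutionGraph would forward with a restore. */
    enum RestoreMode { REGIONAL, GLOBAL }

    /** Simplified stand-in for an operator coordinator. */
    interface Coordinator {
        void resetToCheckpoint(long checkpointedValue);
        void subtaskReset(int subtaskIndex);
    }

    /** Coordinator that records calls so the branching can be observed. */
    static final class RecordingCoordinator implements Coordinator {
        long state = 0L;
        final List<String> calls = new ArrayList<>();

        @Override
        public void resetToCheckpoint(long checkpointedValue) {
            // Global case: internal state is replaced by the checkpointed state.
            state = checkpointedValue;
            calls.add("resetToCheckpoint(" + checkpointedValue + ")");
        }

        @Override
        public void subtaskReset(int subtaskIndex) {
            // Regional case: internal state is kept; only the notification is recorded.
            calls.add("subtaskReset(" + subtaskIndex + ")");
        }
    }

    /** Dispatches a restore to the coordinator according to the forwarded mode. */
    static void restore(
            RestoreMode mode,
            List<Integer> resetSubtasks,
            long checkpointedValue,
            Coordinator coordinator) {
        if (mode == RestoreMode.GLOBAL) {
            coordinator.resetToCheckpoint(checkpointedValue);
        } else {
            for (int subtask : resetSubtasks) {
                coordinator.subtaskReset(subtask);
            }
        }
    }
}
```

In this sketch, calling restore(RestoreMode.REGIONAL, ...) leaves the coordinator's state field untouched, while restore(RestoreMode.GLOBAL, ...) overwrites it with the checkpointed value.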



--
This message was sent by Atlassian Jira
(v8.3.4#803005)