Posted to issues@flink.apache.org by StephanEwen <gi...@git.apache.org> on 2018/01/09 18:01:14 UTC

[GitHub] flink issue #4828: [FLINK-4816] [checkpoints] Executions failed from "DEPLOY...

Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/4828
  
    I think this approach is not yet sufficient. A failure in DEPLOY can happen for various reasons; a failed checkpoint restore is only one of them.
    
    This also couples execution graph state to the checkpoint coordinator (via the last restored checkpoint ID), which breaks the design and the separation of responsibilities.
    
    A proper solution here is probably a bit more comprehensive and needs a bit more thinking, probably a bigger design document. My first thought would be to report a proper RestoreException from the TaskManager, keep a history of the exceptions that triggered recovery, use that history to evaluate fallback, etc.
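
    To make the idea concrete, here is a rough sketch of what such a failure history could look like. It is only an illustration under assumptions: the class names, the `RestoreException` fields, and the "N consecutive restore failures" fallback rule are all hypothetical and are not existing Flink APIs.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

public class RecoveryHistorySketch {

    /** Hypothetical exception a TaskManager would report when restoring state fails. */
    public static class RestoreException extends Exception {
        public final long checkpointId;

        public RestoreException(long checkpointId, String message) {
            super(message);
            this.checkpointId = checkpointId;
        }
    }

    /** Bounded history of the failures that triggered recovery (hypothetical helper). */
    public static class RecoveryHistory {
        private final Deque<Throwable> failures = new ArrayDeque<>();
        private final int capacity;

        public RecoveryHistory(int capacity) {
            if (capacity <= 0) {
                throw new IllegalArgumentException("capacity must be positive");
            }
            this.capacity = capacity;
        }

        /** Records a failure, dropping the oldest entry once the history is full. */
        public void record(Throwable failure) {
            if (failures.size() == capacity) {
                failures.pollFirst();
            }
            failures.addLast(failure);
        }

        /** Counts the most recent failures, newest first, that are restore failures of the given checkpoint. */
        public int consecutiveRestoreFailures(long checkpointId) {
            int count = 0;
            Iterator<Throwable> newestFirst = failures.descendingIterator();
            while (newestFirst.hasNext()) {
                Throwable t = newestFirst.next();
                if (t instanceof RestoreException
                        && ((RestoreException) t).checkpointId == checkpointId) {
                    count++;
                } else {
                    break; // any other failure cause ends the streak
                }
            }
            return count;
        }

        /** Assumed fallback rule: give up on a checkpoint after `threshold` consecutive restore failures. */
        public boolean shouldFallBack(long checkpointId, int threshold) {
            return consecutiveRestoreFailures(checkpointId) >= threshold;
        }
    }
}
```

    With something like this, the decision to fall back to an earlier checkpoint would depend on the reported failure cause rather than on the execution state at the time of the failure, so an unrelated DEPLOY failure would not count against the checkpoint.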


---