Posted to issues@flink.apache.org by "Stephan Ewen (JIRA)" <ji...@apache.org> on 2017/05/22 18:39:04 UTC

[jira] [Created] (FLINK-6666) RestartStrategy should differentiate between types of recovery (global / local / resource missing)

Stephan Ewen created FLINK-6666:
-----------------------------------

             Summary: RestartStrategy should differentiate between types of recovery (global / local / resource missing)
                 Key: FLINK-6666
                 URL: https://issues.apache.org/jira/browse/FLINK-6666
             Project: Flink
          Issue Type: Sub-task
          Components: Distributed Coordination
    Affects Versions: 1.3.0
            Reporter: Stephan Ewen


Currently, the {{RestartStrategy}} has a single method that is called when a failure requires an ExecutionGraph restart.

With the new addition of incremental recovery, it is desirable to distinguish between the types of failover that can happen.

I suggest extending the {{RestartStrategy}} to support three cases/methods:

  - {{restartGlobal()}} for a full restart recovery
  - {{restartLocal()}} for a recovery coordinated by the {{FailoverStrategy}}
  - {{restartOnMissingResources()}} if the failure cause was missing slots
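
As a rough illustration of the three methods listed above, here is a minimal, self-contained sketch. Only the method names come from this issue. The interface name {{ThreeWayRestartStrategy}}, the {{Runnable}} parameter, and the {{RecordingRestartStrategy}} class are hypothetical and are not taken from the Flink code base; the actual signatures would need to fit the existing {{RestartStrategy}} API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the interface name and the Runnable parameter are
// assumptions made for illustration, not Flink's actual API.
interface ThreeWayRestartStrategy {

    /** Called for a full restart recovery of the whole job. */
    void restartGlobal(Runnable restartAction);

    /** Called for a recovery coordinated by the FailoverStrategy. */
    void restartLocal(Runnable restartAction);

    /** Called if the failure cause was missing slots. */
    void restartOnMissingResources(Runnable restartAction);
}

/** Toy implementation: records which case was hit, then restarts immediately. */
final class RecordingRestartStrategy implements ThreeWayRestartStrategy {

    // Log of the recovery types seen so far, in call order.
    final List<String> calls = new ArrayList<>();

    @Override
    public void restartGlobal(Runnable restartAction) {
        calls.add("global");
        restartAction.run();
    }

    @Override
    public void restartLocal(Runnable restartAction) {
        calls.add("local");
        restartAction.run();
    }

    @Override
    public void restartOnMissingResources(Runnable restartAction) {
        calls.add("missing-resources");
        restartAction.run();
    }
}
```

Separate entry points let an implementation apply a different policy per recovery type without having to inspect the failure cause itself.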

The last case is interesting, in my opinion, because it is commonly desirable that regular failover has no delay, while failover on missing resources has a short delay (1s or so) to avoid very fast cycles of restart attempts. In standalone mode, there can easily be 100,000 restarts within a second when no resources are available and restarts happen without delay.
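
The delay policy described above could look roughly like the following sketch: no delay for regular failover, and a configurable short delay when slots are missing. The names {{RecoveryKind}} and {{DelayedRestarter}} are hypothetical, and scheduling through a plain {{ScheduledExecutorService}} is an assumption for illustration, not how Flink schedules restarts.

```java
import java.time.Duration;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical names; not part of Flink.
enum RecoveryKind { GLOBAL, LOCAL, MISSING_RESOURCES }

final class DelayedRestarter {

    private final Duration missingResourcesDelay;
    private final ScheduledExecutorService executor;

    DelayedRestarter(Duration missingResourcesDelay, ScheduledExecutorService executor) {
        this.missingResourcesDelay = missingResourcesDelay;
        this.executor = executor;
    }

    /** Regular failover restarts at once; missing resources wait a short time. */
    Duration delayFor(RecoveryKind kind) {
        return kind == RecoveryKind.MISSING_RESOURCES ? missingResourcesDelay : Duration.ZERO;
    }

    /** Schedules the restart action after the delay that applies to this kind. */
    ScheduledFuture<?> restart(RecoveryKind kind, Runnable restartAction) {
        return executor.schedule(restartAction, delayFor(kind).toMillis(), TimeUnit.MILLISECONDS);
    }
}
```

With a delay of one second for {{MISSING_RESOURCES}}, a job that cannot get slots retries about once per second instead of spinning through restart attempts as fast as the scheduler allows.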




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)