You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Elias Levy (JIRA)" <ji...@apache.org> on 2017/09/20 04:00:00 UTC

[jira] [Created] (FLINK-7646) Restart failed jobs with configurable parallelism range

Elias Levy created FLINK-7646:
---------------------------------

             Summary: Restart failed jobs with configurable parallelism range
                 Key: FLINK-7646
                 URL: https://issues.apache.org/jira/browse/FLINK-7646
             Project: Flink
          Issue Type: Improvement
          Components: DataStream API
    Affects Versions: 1.3.2
            Reporter: Elias Levy


Currently, if a TaskManager fails the whole job is terminated and then, depending on the restart policy, may be attempted to be restarted.  If the failed TaskManager has not been replaced, and there are no spare task slots in the cluster, the job will fail to be restarted.

There are situations where restoring or adding a new TaskManager may take a while  For instance, in AWS an Auto Scaling Group can only be used to manage a group of instances in a single availability zone.  If you have a cluster of TaskManagers that spans an AZ, managed by one ASG per AZ, and an AZ goes dark, the other ASGs won't scale automatically to make up for the lost TaskManagers.  To resolve the situation the healthy ASGs will need to be modified manually or by systems external to AWS.

With that in mind, it would be useful if you could specify a range for the parallelism parameter.  Under normal circumstances the job would execute with the maximum parallelism of the range.  But if TaskManagers were lost and not replaced after some time, the job would accept being execute with some lower parallelism within the range.

I understand that this may not be feasible with checkpoints, as savepoints are supposed to be the mechanism used to change parallelism of a stateful job.  Therefore, this proposal may need to wait until the implementation of the periodic savepoint feature (FLINK-4511).

This feature would aid the availability of Flink jobs.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)