Posted to user@flink.apache.org by Stanislav Borissov <sk...@gmail.com> on 2020/12/14 16:23:55 UTC

Fine-grained task recovery

Hi,

I'm running a simple, "embarrassingly parallel" ETL-type job. I noticed that
a failure in one subtask causes the entire job to restart. Even with the
region failover strategy, all subtasks of this task and connected ones
would fail. Is there any way to limit restarting to only the single subtask
that failed, so all other subtasks can stay alive and keep working?
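
For concreteness, a minimal sketch of the kind of job described above, assuming a
Kinesis source followed by a purely record-local map; the stream name, region,
and sink are hypothetical placeholders, not the actual job:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

public class EmbarrassinglyParallelEtl {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder Kinesis consumer configuration.
        Properties consumerConfig = new Properties();
        consumerConfig.setProperty(ConsumerConfigConstants.AWS_REGION, "us-east-1");

        DataStream<String> input = env.addSource(
                new FlinkKinesisConsumer<>("input-stream", new SimpleStringSchema(), consumerConfig));

        // Record-local transformation only -- no keyBy or other shuffle, so the
        // job is "embarrassingly parallel" across subtasks.
        input.map(record -> record.toUpperCase())
             .print();  // stand-in for the real sink

        env.execute("embarrassingly-parallel-etl");
    }
}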

For context, I use Flink 1.11 in AWS Kinesis Data Analytics, so some configuration
is not controlled by me (see
https://docs.aws.amazon.com/kinesisanalytics/latest/java/reference-flink-settings.title.html).

Thanks

Re: Fine-grained task recovery

Posted by Robert Metzger <rm...@apache.org>.
If a TaskManager fails, the intermediate data stored on it is lost and needs to
be recomputed. In batch execution mode the blocking intermediate results live on
the producing TaskManagers, so losing a machine also invalidates the input of its
consumers; even with batch mode configured, more tasks than the one that failed
might therefore need a restart.
To mitigate that, the Flink developers need to implement support for external
shuffle services, which would keep shuffle data outside the TaskManagers.

On Wed, Dec 16, 2020 at 9:10 AM Robert Metzger <rm...@apache.org> wrote:

> With the region failover strategy, all connected subtasks will fail.
>
> If you are using the DataSet API with
> env.getConfig().setExecutionMode(ExecutionMode.BATCH), you should get the
> desired behavior.
>
> On Mon, Dec 14, 2020 at 5:24 PM Stanislav Borissov <sk...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I'm running a simple, "embarrassingly parallel" ETL-type job. I noticed
>> that a failure in one subtask causes the entire job to restart. Even with
>> the region failover strategy, all subtasks of this task and connected ones
>> would fail. Is there any way to limit restarting to only the single subtask
>> that failed, so all other subtasks can stay alive and keep working?
>>
>> For context, I use Flink 1.11 in AWS Kinesis Data Analytics, so some
>> configuration is not controlled by me (see
>> https://docs.aws.amazon.com/kinesisanalytics/latest/java/reference-flink-settings.title.html).
>>
>> Thanks
>>
>

Re: Fine-grained task recovery

Posted by Robert Metzger <rm...@apache.org>.
With the region failover strategy, all connected subtasks will fail.

If you are using the DataSet API with
env.getConfig().setExecutionMode(ExecutionMode.BATCH), you should get the
desired behavior.
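
A minimal sketch of that suggestion (input and output paths are placeholders):
with ExecutionMode.BATCH, data exchanges between DataSet operators become
blocking instead of pipelined, which is what allows a failure to be contained
to a smaller region of the job.

import org.apache.flink.api.common.ExecutionMode;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class BatchModeEtl {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Make all data exchanges blocking/batch rather than pipelined.
        env.getConfig().setExecutionMode(ExecutionMode.BATCH);

        DataSet<String> input = env.readTextFile("s3://my-bucket/input");  // placeholder path

        input.map(line -> line.toUpperCase())
             .writeAsText("s3://my-bucket/output");  // placeholder path

        env.execute("etl-batch-mode");
    }
}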

On Mon, Dec 14, 2020 at 5:24 PM Stanislav Borissov <sk...@gmail.com>
wrote:

> Hi,
>
> I'm running a simple, "embarrassingly parallel" ETL-type job. I noticed
> that a failure in one subtask causes the entire job to restart. Even with
> the region failover strategy, all subtasks of this task and connected ones
> would fail. Is there any way to limit restarting to only the single subtask
> that failed, so all other subtasks can stay alive and keep working?
>
> For context, I use Flink 1.11 in AWS Kinesis Data Analytics, so some
> configuration is not controlled by me (see
> https://docs.aws.amazon.com/kinesisanalytics/latest/java/reference-flink-settings.title.html).
>
> Thanks
>