You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Ovidiu-Cristian MARCU <ov...@inria.fr> on 2016/02/22 18:00:11 UTC

Batch Processing Fault Tolerance (DataSet API)

Hi

In case of failure of a node what does it mean 'Fault tolerance for programs in the DataSet API works by retrying failed executions’ [1] ?
-work already done by the rest of the nodes is not lost, only work of the lost node is recomputed, job execution will continue
or
-entire job execution is retried

[1] https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/fault_tolerance.html <https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/fault_tolerance.html>

Best,
Ovidiu

Re: Batch Processing Fault Tolerance (DataSet API)

Posted by Till Rohrmann <tr...@apache.org>.

At the moment, the system can only deal with lost slots (nodes) if either
there are some excess slots which have not been used before or if the died
node is restarted. The latter is the case for yarn applications, for
example. There the application master will restart containers which have
died.

In the future it is also conceivable to adjust the parallelism of the job
to the current number of available slots. However, this is not yet
implemented. For the streaming part, we're currently working in this
direction. It is subsumed by the topic of dynamic scaling in/out.

Cheers,
Till

On Mon, Feb 22, 2016 at 6:33 PM, Ovidiu-Cristian MARCU <
ovidiu-cristian.marcu@inria.fr> wrote:

> Thank you, Till!
>
> The current (in progress) implementation is considering also the problem
> related to losing the task's slots of the failed node(s), something related
> to [2] ?
>
> [2] https://issues.apache.org/jira/browse/FLINK-3047
>
> Best,
> Ovidiu
>
> On 22 Feb 2016, at 18:13, Till Rohrmann <tr...@apache.org> wrote:
>
> Hi Ovidiu,
>
> at the moment Flink's batch fault tolerance restarts the whole job in case
> of a failure. However, parts of the logic to do partial backtracking such
> as intermediate result partitions and the backtracking algorithm are
> already implemented or exist as a PR [1]. So we hope to complete the
> partial backtracking soon.
>
> [1] https://github.com/apache/flink/pull/640
>
> Cheers,
> Till
>
> On Mon, Feb 22, 2016 at 6:00 PM, Ovidiu-Cristian MARCU <
> ovidiu-cristian.marcu@inria.fr> wrote:
>
>> Hi
>>
>> In case of failure of a node what does it mean 'Fault tolerance for
>> programs in the *DataSet API* works by retrying failed executions’ [1] ?
>> -work already done by the rest of the nodes is not lost, only work of the
>> lost node is recomputed, job execution will continue
>> or
>> -entire job execution is retried
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/fault_tolerance.html
>>
>> Best,
>> Ovidiu
>>
>
>
>

Re: Batch Processing Fault Tolerance (DataSet API)

Posted by Ovidiu-Cristian MARCU <ov...@inria.fr>.

Thank you, Till!

The current (in progress) implementation is considering also the problem related to losing the task's slots of the failed node(s), something related to [2] ?

[2] https://issues.apache.org/jira/browse/FLINK-3047

Best,
Ovidiu

> On 22 Feb 2016, at 18:13, Till Rohrmann <tr...@apache.org> wrote:
> 
> Hi Ovidiu,
> 
> at the moment Flink's batch fault tolerance restarts the whole job in case of a failure. However, parts of the logic to do partial backtracking such as intermediate result partitions and the backtracking algorithm are already implemented or exist as a PR [1]. So we hope to complete the partial backtracking soon.
> 
> [1] https://github.com/apache/flink/pull/640 <https://github.com/apache/flink/pull/640>
> 
> Cheers,
> Till
> 
> On Mon, Feb 22, 2016 at 6:00 PM, Ovidiu-Cristian MARCU <ovidiu-cristian.marcu@inria.fr <ma...@inria.fr>> wrote:
> Hi
> 
> In case of failure of a node what does it mean 'Fault tolerance for programs in the DataSet API works by retrying failed executions’ [1] ?
> -work already done by the rest of the nodes is not lost, only work of the lost node is recomputed, job execution will continue
> or
> -entire job execution is retried
> 
> [1] https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/fault_tolerance.html <https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/fault_tolerance.html>
> 
> Best,
> Ovidiu 
>

Re: Batch Processing Fault Tolerance (DataSet API)

Posted by Till Rohrmann <tr...@apache.org>.

Hi Ovidiu,

at the moment Flink's batch fault tolerance restarts the whole job in case
of a failure. However, parts of the logic to do partial backtracking such
as intermediate result partitions and the backtracking algorithm are
already implemented or exist as a PR [1]. So we hope to complete the
partial backtracking soon.

[1] https://github.com/apache/flink/pull/640

Cheers,
Till

On Mon, Feb 22, 2016 at 6:00 PM, Ovidiu-Cristian MARCU <
ovidiu-cristian.marcu@inria.fr> wrote:

> Hi
>
> In case of failure of a node what does it mean 'Fault tolerance for
> programs in the *DataSet API* works by retrying failed executions’ [1] ?
> -work already done by the rest of the nodes is not lost, only work of the
> lost node is recomputed, job execution will continue
> or
> -entire job execution is retried
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/fault_tolerance.html
>
> Best,
> Ovidiu
>