Posted to dev@flink.apache.org by Márton Balassi <ba...@gmail.com> on 2014/11/04 09:14:01 UTC

Coarse-grained FT implementation

Stephan,

Could you please summarize how the new coarse grained FT works? [1]

I'm sure that we'll be facing this question a lot. :)

Thanks,

Marton

[1]
https://git-wip-us.apache.org/repos/asf?p=incubator-flink.git;a=commit;h=dd687bc6729d9539e05db9761e22a2aadc707341

Re: Coarse-grained FT implementation

Posted by Stephan Ewen <se...@apache.org>.
Hey everyone!

Sorry for the late answer to this question.

The short answer is: Our fault tolerance is very comparable to Spark's RDD
lineage. We internally build the computation graph of the operators (we
call it JobGraph / ExecutionGraph) which we use both for execution and
re-execution in case of failures. The subgraph rooted at each operator can
be thought of as the lineage of the result computed by that operator.
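
To make the "subgraph rooted at an operator" idea concrete, here is a
minimal sketch. It is not Flink code: the class LineageSketch, the method
lineageOf and the operator names are made up for illustration, and the
graph is just a map from each operator to the operators it reads from.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Flink.
public class LineageSketch {

    /**
     * Returns the given operator plus every operator it transitively
     * reads from, i.e. the subgraph that would have to be re-executed
     * to recompute the operator's result from the base data.
     */
    public static Set<String> lineageOf(String operator,
                                        Map<String, List<String>> inputs) {
        Set<String> lineage = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(operator);
        while (!pending.isEmpty()) {
            String current = pending.pop();
            // add() returns false if the operator was already visited
            if (lineage.add(current)) {
                for (String input : inputs.getOrDefault(current, List.of())) {
                    pending.push(input);
                }
            }
        }
        return lineage;
    }

    public static void main(String[] args) {
        // operator -> operators whose results it consumes (assumed example)
        Map<String, List<String>> inputs = Map.of(
                "map", List.of("sourceA"),
                "join", List.of("map", "sourceB"),
                "reduce", List.of("join"));
        System.out.println(lineageOf("join", inputs));
    }
}
```

In this toy graph the lineage of "join" contains join, map, sourceA and
sourceB, but not reduce, which only consumes the join result.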


The longer answer (with a few more details):

 - An operator is a data source, a function (map/join/reduce/...) or a
built-in operation (aggregate, iteration controller, ...).

 - The JobGraph is a compact version of the program that describes which
operators produce which intermediate results and which ones consume them.

 - The ExecutionGraph is the parallelized version of that graph; it
contains an ExecutionVertex for each parallel instance of an operator. The
ExecutionVertex tracks the state of that parallel instance of the operator.

 - The ExecutionVertex can have multiple ExecutionAttempts. If everything
works fine, there is only one attempt, but attempts can be canceled and new
attempts can be deployed. An execution attempt may trigger other
ExecutionAttempts, if predecessors need to be recomputed.
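
The relationship between a vertex and its attempts can be sketched roughly
as follows. This is a simplified model under my own assumptions, not the
actual Flink classes: Vertex, Attempt, State, execute() and fail() are
hypothetical names, execution is simulated synchronously, and a lost
result is represented by a plain boolean flag.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these are not Flink's real classes.
public class AttemptSketch {

    public enum State { CREATED, RUNNING, FINISHED, FAILED }

    /** One try at running a parallel operator instance. */
    public static final class Attempt {
        public final int number;
        public State state = State.CREATED;

        Attempt(int number) {
            this.number = number;
        }
    }

    /** One parallel operator instance together with all of its attempts. */
    public static final class Vertex {
        public final String name;
        public final List<Vertex> predecessors;
        public final List<Attempt> attempts = new ArrayList<>();
        public boolean resultAvailable = false;

        public Vertex(String name, List<Vertex> predecessors) {
            this.name = name;
            this.predecessors = predecessors;
        }

        /**
         * Starts a new attempt. Predecessors whose results are no longer
         * available get a new attempt first, so one attempt can trigger
         * others further up the graph.
         */
        public void execute() {
            for (Vertex p : predecessors) {
                if (!p.resultAvailable) {
                    p.execute();
                }
            }
            Attempt attempt = new Attempt(attempts.size() + 1);
            attempts.add(attempt);
            attempt.state = State.RUNNING;
            // The real work would happen here; the sketch finishes at once.
            attempt.state = State.FINISHED;
            resultAvailable = true;
        }

        /** Marks the latest attempt as failed and discards its result. */
        public void fail() {
            if (!attempts.isEmpty()) {
                attempts.get(attempts.size() - 1).state = State.FAILED;
            }
            resultAvailable = false;
        }
    }

    public static void main(String[] args) {
        Vertex source = new Vertex("source", List.of());
        Vertex map = new Vertex("map", List.of(source));
        Vertex reduce = new Vertex("reduce", List.of(map));

        reduce.execute();   // first run: one attempt per vertex
        map.fail();         // the map result is lost ...
        reduce.fail();      // ... and so is the reduce result
        reduce.execute();   // new reduce attempt re-triggers map

        for (Vertex v : List.of(source, map, reduce)) {
            System.out.println(v.name + ": " + v.attempts.size() + " attempt(s)");
        }
    }
}
```

In this run the source keeps its single attempt because its result is
still available, while map and reduce each end up with two attempts. A
restart of the whole job corresponds to the case where no result is
available and every vertex gets a new attempt.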

Stephan




On Fri, Nov 7, 2014 at 7:35 PM, Henry Saputra <he...@gmail.com>
wrote:

> HI Kostas,
>
> Thanks for the reply, yep you were right it is in current master already.
> But as Marton has mentioned before, I believe there was no
> documentation on how it suppose to work and the git commit comment
> does not have much details on the impl details.
>
> Some questions from the meetup on how to deal with fault in workflow
> process execution, and mostly comparing to Spark RDD lineage
> recomputation.
>
> - Henry
>
> On Fri, Nov 7, 2014 at 10:15 AM, Kostas Tzoumas <kt...@apache.org>
> wrote:
> > Hi Henry,
> >
> > Afaik this is already in the current master, see
> ExecutionGraph.restart().
> >
> > The goal is now to make fault tolerance more fine grained by restarting
> > from checkpointed intermediate data sets, not from the base data.
> >
> > Kostas
> >
> > On Fri, Nov 7, 2014 at 6:49 PM, Henry Saputra <he...@gmail.com>
> > wrote:
> >
> >> Stephan,
> >>
> >> Could you share your thoughts and design/ plan to implement this new
> >> coarse grained fault tolerant?
> >>
> >> From last talk in Palo Alto seemed some interests about it.
> >>
> >> - Henry
> >>
> >> On Tue, Nov 4, 2014 at 12:14 AM, Márton Balassi
> >> <ba...@gmail.com> wrote:
> >> > Stephan,
> >> >
> >> > Could you please summarize how the new coarse grained FT works? [1]
> >> >
> >> > I'm sure that we'll be facing this question a lot. :)
> >> >
> >> > Thanks,
> >> >
> >> > Marton
> >> >
> >> > [1]
> >> >
> >>
> https://git-wip-us.apache.org/repos/asf?p=incubator-flink.git;a=commit;h=dd687bc6729d9539e05db9761e22a2aadc707341
> >>
>

Re: Coarse-grained FT implementation

Posted by Henry Saputra <he...@gmail.com>.
Hi Kostas,

Thanks for the reply. Yep, you were right, it is already in the current
master. But as Marton mentioned before, I believe there is no
documentation on how it is supposed to work, and the git commit comment
does not say much about the implementation details.

There were some questions at the meetup about how faults are handled
during workflow execution, mostly in comparison to Spark's RDD lineage
recomputation.

- Henry

On Fri, Nov 7, 2014 at 10:15 AM, Kostas Tzoumas <kt...@apache.org> wrote:
> Hi Henry,
>
> Afaik this is already in the current master, see ExecutionGraph.restart().
>
> The goal is now to make fault tolerance more fine grained by restarting
> from checkpointed intermediate data sets, not from the base data.
>
> Kostas
>
> On Fri, Nov 7, 2014 at 6:49 PM, Henry Saputra <he...@gmail.com>
> wrote:
>
>> Stephan,
>>
>> Could you share your thoughts and design/ plan to implement this new
>> coarse grained fault tolerant?
>>
>> From last talk in Palo Alto seemed some interests about it.
>>
>> - Henry
>>
>> On Tue, Nov 4, 2014 at 12:14 AM, Márton Balassi
>> <ba...@gmail.com> wrote:
>> > Stephan,
>> >
>> > Could you please summarize how the new coarse grained FT works? [1]
>> >
>> > I'm sure that we'll be facing this question a lot. :)
>> >
>> > Thanks,
>> >
>> > Marton
>> >
>> > [1]
>> >
>> https://git-wip-us.apache.org/repos/asf?p=incubator-flink.git;a=commit;h=dd687bc6729d9539e05db9761e22a2aadc707341
>>

Re: Coarse-grained FT implementation

Posted by Kostas Tzoumas <kt...@apache.org>.
Hi Henry,

Afaik this is already in the current master, see ExecutionGraph.restart().

The goal now is to make fault tolerance more fine-grained by restarting
from checkpointed intermediate data sets, not from the base data.

Kostas

On Fri, Nov 7, 2014 at 6:49 PM, Henry Saputra <he...@gmail.com>
wrote:

> Stephan,
>
> Could you share your thoughts and design/ plan to implement this new
> coarse grained fault tolerant?
>
> From last talk in Palo Alto seemed some interests about it.
>
> - Henry
>
> On Tue, Nov 4, 2014 at 12:14 AM, Márton Balassi
> <ba...@gmail.com> wrote:
> > Stephan,
> >
> > Could you please summarize how the new coarse grained FT works? [1]
> >
> > I'm sure that we'll be facing this question a lot. :)
> >
> > Thanks,
> >
> > Marton
> >
> > [1]
> >
> https://git-wip-us.apache.org/repos/asf?p=incubator-flink.git;a=commit;h=dd687bc6729d9539e05db9761e22a2aadc707341
>

Re: Coarse-grained FT implementation

Posted by Henry Saputra <he...@gmail.com>.
Stephan,

Could you share your thoughts and design/plan for implementing this new
coarse-grained fault tolerance?

From the last talk in Palo Alto, there seemed to be some interest in it.

- Henry

On Tue, Nov 4, 2014 at 12:14 AM, Márton Balassi
<ba...@gmail.com> wrote:
> Stephan,
>
> Could you please summarize how the new coarse grained FT works? [1]
>
> I'm sure that we'll be facing this question a lot. :)
>
> Thanks,
>
> Marton
>
> [1]
> https://git-wip-us.apache.org/repos/asf?p=incubator-flink.git;a=commit;h=dd687bc6729d9539e05db9761e22a2aadc707341

Re: Coarse-grained FT implementation

Posted by Ufuk Celebi <uc...@apache.org>.
Hey Marton! In my understanding it works by simply restarting the failed
job completely.

– Ufuk

On Tuesday, November 4, 2014, Márton Balassi <ba...@gmail.com>
wrote:

> Stephan,
>
> Could you please summarize how the new coarse grained FT works? [1]
>
> I'm sure that we'll be facing this question a lot. :)
>
> Thanks,
>
> Marton
>
> [1]
>
> https://git-wip-us.apache.org/repos/asf?p=incubator-flink.git;a=commit;h=dd687bc6729d9539e05db9761e22a2aadc707341
>