Posted to dev@flink.apache.org by Piotr Nowojski <pi...@ververica.com> on 2019/10/07 07:57:09 UTC

Re: [DISCUSS] FLIP-76: Unaligned checkpoints

Hi Arvid,

Thanks for coming up with this FLIP. I think it addresses the issues raised in the previous mailing list discussion [2]. 

For the record: +1 from my side to implement this.

Piotrek

> On 30 Sep 2019, at 14:31, Arvid Heise <ar...@ververica.com> wrote:
> 
> Hi Devs,
> 
> I would like to start the formal discussion about FLIP-76 [1], which
> improves the checkpoint latency in systems under backpressure, where a
> checkpoint can take hours to complete in the worst case. I recommend the
> thread "checkpointing under backpressure" [2] to get a good idea why users
> are not satisfied with the current behavior. The key points:
> 
>   - Since the checkpoint barrier flows much more slowly through the
>   back-pressured channels, the other channels and their upstream operators
>   are effectively blocked during checkpointing.
>   - The checkpoint barrier takes a long time to reach the sinks, causing
>   long checkpointing times. A longer checkpointing time in turn means that
>   the checkpoint will be fairly outdated once done. Since a heavily utilized
>   pipeline is inherently more fragile, we may run into a vicious cycle of
>   late checkpoints, crash, recovery to a rather outdated checkpoint, more
>   back pressure, and even later checkpoints, which would result in little to
>   no progress in the application.
> 
> The FLIP proposes "unaligned checkpoints", which improve the current state
> such that
> 
>   - Upstream processes can continue to produce data, even if some operator
>   still waits on a checkpoint barrier on a specific input channel.
>   - Checkpointing times are heavily reduced across the execution graph,
>   even for operators with a single input channel.
>   - End-users will see more progress even in unstable environments as more
>   up-to-date checkpoints will avoid too many recomputations.
>   - Rescaling becomes faster.
> 
> The key idea is to allow checkpoint barriers to be forwarded to downstream
> tasks before the synchronous part of the checkpointing has been conducted
> (see Fig. 1). To that end, we need to store in-flight data as part of the
> checkpoint, as described in greater detail in this FLIP.
> 
> Although the basic idea was already sketched in [2], we would like to get
> broader feedback in this dedicated mail thread.
> 
> Best,
> 
> Arvid
> 
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-76%3A+Unaligned+Checkpoints
> [2]
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Checkpointing-under-backpressure-td31616.html
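
The key idea quoted above can be illustrated with a toy model. The sketch below is not Flink code and not the FLIP's design: the names BARRIER, aligned_checkpoint, unaligned_checkpoint and restore are hypothetical, the operator state is just a running sum, and only input buffers are modelled (output buffers are ignored). It assumes the barrier is already somewhere in every input channel's buffer. An aligned checkpoint has to process every record queued ahead of the barrier before it can snapshot, while an unaligned checkpoint snapshots right away and stores those queued records as in-flight data.

```python
BARRIER = "barrier"  # hypothetical marker standing in for a checkpoint barrier


def aligned_checkpoint(state, channels):
    """Process all records queued ahead of the barrier, then snapshot the state."""
    processed = 0
    for records in channels.values():
        for record in records[:records.index(BARRIER)]:
            state += record  # toy operator: a running sum
            processed += 1
    return {"state": state, "in_flight": {}, "processed_first": processed}


def unaligned_checkpoint(state, channels):
    """Snapshot immediately and keep the queued records as in-flight data."""
    in_flight = {
        name: list(records[:records.index(BARRIER)])
        for name, records in channels.items()
    }
    return {"state": state, "in_flight": in_flight, "processed_first": 0}


def restore(snapshot):
    """Rebuild the state by replaying any in-flight records on top of it."""
    state = snapshot["state"]
    for records in snapshot["in_flight"].values():
        for record in records:
            state += record
    return state


# One fast channel (barrier at the head) and one back-pressured channel
# with three records still queued ahead of the barrier.
channels = {"fast": [BARRIER, 7], "slow": [1, 2, 3, BARRIER, 4]}

aligned = aligned_checkpoint(10, channels)
unaligned = unaligned_checkpoint(10, channels)

print(aligned["processed_first"], unaligned["processed_first"])
print(restore(aligned) == restore(unaligned))
```

In this model both snapshots restore to the same state, but the unaligned one does not have to wait for the back-pressured channel to drain, at the cost of a larger checkpoint that carries the buffered records.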

Re: [DISCUSS] FLIP-76: Unaligned checkpoints

Posted by Congxian Qiu <qc...@gmail.com>.

Thanks for the FLIP, Arvid.

This is a good improvement for checkpointing under backpressure. Currently,
if a job is under backpressure, it can hardly complete a checkpoint, so +1
from my side.

Best,
Congxian


On Thu, 10 Oct 2019 at 11:02 AM, zhijiang <wa...@aliyun.com.invalid> wrote:

> Thanks for writing up this FLIP, Arvid!
>
> Many users are expecting this feature, so +1 from my side as well.
>
> Best,
> Zhijiang

Re: [DISCUSS] FLIP-76: Unaligned checkpoints

Posted by zhijiang <wa...@aliyun.com.INVALID>.

Thanks for writing up this FLIP, Arvid! 

Many users are expecting this feature, so +1 from my side as well.

Best,
Zhijiang