Posted to user@flink.apache.org by Mohammad Hosseinian <mo...@id1.de> on 2019/07/30 15:50:56 UTC

Checkpoints very slow with high backpressure

Hi,

I'm still facing the same issue under 1.8. Our pipeline uses end-to-end
exactly-once semantics, which means the consuming application cannot read the
messages until they are committed. So in case of an outage, the whole
runtime delay is passed on to the next stream-processing application and
creates an even larger delay in our processing pipeline. Is there any way to
force the checkpoint to complete even under backpressure?
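For context, this is a minimal sketch of the kind of setup described above, assuming the universal Kafka connector and the Java DataStream API; the topic names, addresses and the pass-through map are made-up placeholders, not our actual job:

import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper;

public class ExactlyOnceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The checkpoint interval is effectively the minimum end-to-end latency:
        // Kafka transactions are only committed when a checkpoint completes.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        Properties kafka = new Properties();
        kafka.setProperty("bootstrap.servers", "kafka:9092");   // placeholder address
        kafka.setProperty("group.id", "pipeline-stage");        // placeholder group id
        // Each stage only sees records whose transaction has been committed.
        kafka.setProperty("isolation.level", "read_committed");
        // Keep the producer transaction timeout below the broker's
        // transaction.max.timeout.ms (15 minutes by default).
        kafka.setProperty("transaction.timeout.ms", "900000");

        env.addSource(new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), kafka))
            .map(new MapFunction<String, String>() {
                @Override
                public String map(String value) {
                    return value; // stand-in for the actual processing
                }
            })
            .addSink(new FlinkKafkaProducer<>(
                "output-topic",
                new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
                kafka,
                FlinkKafkaProducer.Semantic.EXACTLY_ONCE));

        env.execute("exactly-once pipeline sketch");
    }
}

With this kind of setup, any delay in completing a checkpoint directly delays the commit, and therefore delays when the next application in the chain can read the data.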

Thank you in advance.

Regards,

--

Mohammad Hosseinian
Software Developer
Information Design One AG

Phone +49-69-244502-0
Fax +49-69-244502-10
Web www.id1.de


Information Design One AG, Baseler Strasse 10, 60329 Frankfurt am Main, Germany
Registration: Amtsgericht Frankfurt am Main, HRB 52596
Executive Board: Robert Peters, Benjamin Walther, Supervisory Board: Christian Hecht

Re: Checkpoints very slow with high backpressure

Posted by Piotr Nowojski <pi...@ververica.com>.
Hi,

For Flink 1.8 (and 1.9), the only thing you can do is try to limit the amount of data buffered between the nodes (check the Flink network configuration [1] for the number of buffers and/or the buffer pool sizes). This can reduce the maximal throughput (but only if the network transfer is a significant cost, for example if your records are extremely quick to process), but it will speed up checkpointing under back pressure.
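For reference, the settings in [1] that bound how much data can pile up in the network stack look roughly like this in flink-conf.yaml (a sketch for Flink 1.8; the values are only illustrative, and the exact keys, defaults and safe lower bounds are documented in the linked page):

# Fewer / smaller network buffers mean less in-flight data that has to be
# drained before a checkpoint barrier can pass through an operator.
taskmanager.network.memory.buffers-per-channel: 2
taskmanager.network.memory.floating-buffers-per-gate: 2
taskmanager.memory.segment-size: 16kb

Smaller buffer pools shorten the queue a barrier has to wait behind, at the price of throughput when the job is network-bound.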

There are some plans to address this and maybe there will be some improvement in Flink 1.10. 

If your job is completely stalled because of an outage, then I don’t think that you can do much right now, since even with a single buffered record the checkpoints will not progress. We might try to address this, but that’s further down the road.

Piotrek

[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html
