Posted to user@flink.apache.org by Biao Liu <mm...@gmail.com> on 2019/09/02 15:01:30 UTC

Re: checkpoint failure suddenly even state size is into 10 mb around

Hi Sushant,

Your screenshot shows that the checkpoint expired, which means it did not
finish within the configured timeout.
My guess is that heavy back pressure is blocking both the data and the
checkpoint barriers, but I can't tell what is causing that back pressure.
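
If the checkpoints only need a bit more headroom, the checkpoint timeout and
related settings can be raised on the CheckpointConfig. Here is a minimal
sketch; the interval, timeout, and pause values are placeholders, not
recommendations:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60 s.
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig conf = env.getCheckpointConfig();
        conf.setCheckpointTimeout(10 * 60_000L);     // expire checkpoints that take longer than 10 min
        conf.setMinPauseBetweenCheckpoints(30_000L); // let the job drain in-flight data between checkpoints
        conf.setMaxConcurrentCheckpoints(1);         // avoid overlapping checkpoints

        // ... build the rest of the job here and call env.execute(...)
    }
}

Raising the timeout only buys time, of course; the source of the back
pressure still needs to be found.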

If this happens again, take a closer look at the tasks that are causing the
heavy back pressure.
The task manager logs, GC logs, and tools like jstack might help.
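
Since your job calls an external system through AsyncIO, that operator is a
common place for back pressure to start: once its in-flight capacity is full,
it blocks the Kafka source upstream. The sketch below only shows where the
timeout and capacity parameters sit in AsyncDataStream.unorderedWait; the
ExternalLookup function and the values are hypothetical:

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncCapacitySketch {

    // Hypothetical async lookup; the real call to the external system goes here.
    static class ExternalLookup extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String input, ResultFuture<String> resultFuture) {
            CompletableFuture
                .supplyAsync(() -> input + "-enriched")  // placeholder for the external call
                .thenAccept(result -> resultFuture.complete(Collections.singleton(result)));
        }

        @Override
        public void timeout(String input, ResultFuture<String> resultFuture) {
            // Complete timed-out requests explicitly so the operator does not get stuck.
            resultFuture.complete(Collections.emptyList());
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for the Kafka source.
        DataStream<String> source = env.fromElements("a", "b", "c");

        // timeout = 30 s per request, capacity = 100 in-flight requests.
        // When the external system slows down, the operator blocks as soon as
        // 'capacity' requests are in flight, and that back pressure propagates
        // upstream to the source.
        DataStream<String> enriched = AsyncDataStream.unorderedWait(
                source, new ExternalLookup(), 30, TimeUnit.SECONDS, 100);

        enriched.print();
        env.execute("async capacity sketch");
    }
}

A slow sink to Cassandra or Elasticsearch can produce the same effect, so it
is worth checking which operator the back pressure starts at.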

Thanks,
Biao /'bɪ.aʊ/



On Fri, 23 Aug 2019 at 15:27, Sushant Sawant <su...@gmail.com>
wrote:

> Hi all,
> I'm facing two issues which I believe are related:
> 1. Kafka source shows high back pressure.
> 2. Sudden checkpoint failure for entire day until restart.
>
> My job does the following:
> a. Read from Kafka
> b. Asyncio to external system
> c. Dumping in Cassandra, Elasticsearch
>
> Checkpointing uses the file system backend.
> This Flink job has proven itself under high load,
> around 5000 events/sec throughput.
> But recently we scaled down the parallelism since there wasn't any load in
> production, and that is when these issues started.
>
> Please find the status shown by the Flink dashboard.
> This GitHub folder contains images from when there was high back pressure
> and checkpoint failures:
>
> https://github.com/sushantbprise/flink-dashboard/tree/master/failed-checkpointing
> and after the restart, the "everything is fine" images are in this folder:
>
> https://github.com/sushantbprise/flink-dashboard/tree/master/working-checkpointing
>
> --
> Could anyone point me in the right direction as to what might have gone
> wrong / how to troubleshoot this?
>
>
> Thanks & Regards,
> Sushant Sawant
>