Posted to user@flink.apache.org by Sushant Sawant <su...@gmail.com> on 2019/09/06 09:34:50 UTC

Re: checkpoint failure suddenly even state size less than 1 mb

Hi Yun,
Have captured the heap dump which includes thread stack.
There is a lock held by a thread in the Elasticsearch sink operator.
Screenshot from JProfiler:
https://github.com/sushantbprise/flink-dashboard/tree/master/thread-blocked

And I see many threads in a waiting state.

I found this link, which describes a problem similar to mine.
http://mail-archives.apache.org/mod_mbox/flink-user/201710.mbox/%3cFDD0A701-D6FE-433F-B343-19FD24CB328A@data-artisans.com%3e

How could I overcome this condition?
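
One thing I am considering (just a sketch, and I am not sure it addresses the root
cause): flushing the Elasticsearch sink in smaller batches with bulk-flush backoff,
so a slow or failing bulk request does not block the sink operator for long. The
host and index names below are placeholders, not our real setup; this uses the
flink-connector-elasticsearch6 builder API.

import java.util.Collections;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkBase;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSink;
import org.apache.http.HttpHost;
import org.elasticsearch.client.Requests;

public class EsSinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> stream = env.fromElements("a", "b");  // stand-in for the real pipeline

        ElasticsearchSink.Builder<String> builder = new ElasticsearchSink.Builder<>(
                Collections.singletonList(new HttpHost("es-host", 9200, "http")),  // placeholder host
                (String element, RuntimeContext ctx, RequestIndexer indexer) ->
                        indexer.add(Requests.indexRequest()
                                .index("my-index")  // placeholder index
                                .type("_doc")
                                .source(Collections.singletonMap("data", element))));

        builder.setBulkFlushMaxActions(500);   // flush smaller batches more often
        builder.setBulkFlushInterval(1000L);   // ...and at least once per second
        builder.setBulkFlushBackoff(true);     // retry failed bulk requests with backoff
        builder.setBulkFlushBackoffType(ElasticsearchSinkBase.FlushBackoffType.EXPONENTIAL);
        builder.setBulkFlushBackoffRetries(3);
        builder.setBulkFlushBackoffDelay(100L);

        stream.addSink(builder.build());
        env.execute("es-sink-sketch");
    }
}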


Thanks & Regards,
Sushant Sawant

On Fri, 30 Aug 2019, 12:48 Yun Tang, <my...@live.com> wrote:

> Hi Sushant
>
> What confuses me is why the source task cannot complete its checkpoint within 3
> minutes [1]. No sub-task has ever completed the checkpoint, which means even the
> source task cannot complete it, although a source task does not need to buffer
> any data. From what I see, it might be blocked on acquiring the checkpoint lock,
> which is held by the stream task's main thread while it processes elements [2].
> Could you use jstack to capture your Java process' threads so we can see what
> happened when the checkpoint failed?
>
> [1]
> https://github.com/sushantbprise/flink-dashboard/blob/master/failed-checkpointing/state2.png
> [2]
> https://github.com/apache/flink/blob/ccc7eb431477059b32fb924104c17af953620c74/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java#L758
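>
> As a side note, if the 3 minutes comes from the checkpoint timeout, you could also
> give checkpoints more time while you investigate. A rough sketch of the relevant
> knobs (the values are only examples, please adjust them to your job):
>
> import org.apache.flink.streaming.api.CheckpointingMode;
> import org.apache.flink.streaming.api.environment.CheckpointConfig;
> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>
> public class CheckpointConfigSketch {
>     public static void main(String[] args) {
>         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>         env.enableCheckpointing(60_000L);             // checkpoint every minute (example value)
>
>         CheckpointConfig conf = env.getCheckpointConfig();
>         conf.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
>         conf.setCheckpointTimeout(10 * 60_000L);      // allow up to 10 minutes before a checkpoint expires
>         conf.setMinPauseBetweenCheckpoints(30_000L);  // let the job drain between attempts
>         conf.setMaxConcurrentCheckpoints(1);          // do not pile up checkpoints while one is stuck
>     }
> }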
>
> Best
> Yun Tang
> ------------------------------
> *From:* Sushant Sawant <su...@gmail.com>
> *Sent:* Tuesday, August 27, 2019 15:01
> *To:* user <us...@flink.apache.org>
> *Subject:* Re: checkpoint failure suddenly even state size less than 1 mb
>
> Hi team,
> Does anyone have help/suggestions? We have now stopped all input in Kafka, so there
> is no processing and no sink activity, but checkpointing is still failing.
> Is it the case that once a checkpoint fails, it keeps failing forever until the job
> is restarted?
>
> Help appreciated.
>
> Thanks & Regards,
> Sushant Sawant
>
> On 23 Aug 2019 12:56 p.m., "Sushant Sawant" <su...@gmail.com>
> wrote:
>
> Hi all,
> I'm facing two issues which I believe are related:
> 1. The Kafka source shows high back pressure.
> 2. Sudden checkpoint failures for an entire day, until restart.
>
> My job does the following:
> a. Read from Kafka
> b. Async I/O to an external system (see the sketch below)
> c. Dump into Cassandra and Elasticsearch
>
> Checkpointing uses the file system.
> This Flink job is proven under high load,
> around 5000/sec throughput.
> But recently we scaled down the parallelism, since there wasn't any load in
> production, and these issues started.
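>
> For context, step (b) above is wired roughly like the sketch below (simplified; the
> class name, timeout and capacity values are illustrative, not our production code),
> with an explicit timeout and a bounded capacity on the async operator:
>
> import java.util.Collections;
> import java.util.concurrent.CompletableFuture;
>
> import org.apache.flink.streaming.api.functions.async.ResultFuture;
> import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
>
> public class ExternalLookup extends RichAsyncFunction<String, String> {
>
>     @Override
>     public void asyncInvoke(String input, ResultFuture<String> resultFuture) {
>         // placeholder for the real non-blocking client call
>         CompletableFuture
>                 .supplyAsync(() -> callExternalSystem(input))
>                 .thenAccept(result -> resultFuture.complete(Collections.singleton(result)));
>     }
>
>     @Override
>     public void timeout(String input, ResultFuture<String> resultFuture) {
>         // complete with a fallback instead of blocking forever on a slow call
>         resultFuture.complete(Collections.singleton(input + ":timeout"));
>     }
>
>     private String callExternalSystem(String input) {
>         return input;  // stand-in for the real external lookup
>     }
> }
>
> // wiring (timeout and capacity are example values):
> // DataStream<String> enriched = AsyncDataStream.unorderedWait(
> //         kafkaStream, new ExternalLookup(), 5, java.util.concurrent.TimeUnit.SECONDS, 100);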
>
> Please find the status shown by the Flink dashboard.
> This GitHub folder contains images from when there was high back pressure and
> checkpoint failures:
>
> https://github.com/sushantbprise/flink-dashboard/tree/master/failed-checkpointing
> and, after the restart, the "everything is fine" images in this folder:
>
> https://github.com/sushantbprise/flink-dashboard/tree/master/working-checkpointing
>
> --
> Could anyone point me in the direction of what might have gone wrong, or how to
> troubleshoot this?
>
>
> Thanks & Regards,
> Sushant Sawant
>
>
>