Posted to issues@flink.apache.org by "Yun Tang (Jira)" <ji...@apache.org> on 2022/06/14 02:25:00 UTC
[jira] [Closed] (FLINK-28030) Checkpoint always hangs when running some jobs
[ https://issues.apache.org/jira/browse/FLINK-28030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yun Tang closed FLINK-28030.
----------------------------
Resolution: Duplicate
> Checkpoint always hangs when running some jobs
> ----------------------------------------------
>
> Key: FLINK-28030
> URL: https://issues.apache.org/jira/browse/FLINK-28030
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.14.3
> Reporter: Pauli Gandhi
> Priority: Major
>
> We have noticed that a Flink job hangs and eventually times out after 2 hours, every time at the first checkpoint, after it completes 15/23 acknowledgments (65%). There is no CPU activity, yet a number of tasks report 100% back pressure. The problem is peculiar to this job and slight modifications of it. We have created many Flink jobs in the past and never encountered this issue.
> Here are the things we tried to narrow down the problem:
> * The job runs fine if checkpointing is disabled.
> * Increasing the number of task managers and parallelism to 2 seems to help the job complete. However, it stalled again when we sent a larger data set.
> * Increasing taskmanager memory from 4 GB to 16 GB and CPU from 1 to 4 didn't help.
> * Restarting the job manager sometimes helps, but at other times it does not.
> * Breaking up the job into smaller parts helps the job to finish.
> * We analyzed the thread dump, and all threads appear to be in either a sleeping or a waiting state.
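
The thread-dump check in the last item can be reproduced with a short script. The following is a minimal sketch, assuming the dump is in the standard `jstack` text format, where each thread header is followed by a `java.lang.Thread.State:` line. The function name `tally_thread_states` and the sample excerpt are hypothetical; they are not taken from the attached dump.

```python
import re
from collections import Counter

# Matches the state line that jstack prints under each thread header, e.g.
#    java.lang.Thread.State: TIMED_WAITING (sleeping)
STATE_LINE = re.compile(
    r"^\s*java\.lang\.Thread\.State:\s+(\w+)(?:\s+\(([^)]*)\))?"
)


def tally_thread_states(dump_text):
    """Count threads per (state, detail) pair in a jstack-style dump."""
    counts = Counter()
    for line in dump_text.splitlines():
        match = STATE_LINE.match(line)
        if match:
            state = match.group(1)
            detail = match.group(2) or ""  # RUNNABLE has no "(...)" detail
            counts[(state, detail)] += 1
    return counts


# Hypothetical excerpt: thread names are invented for illustration only.
SAMPLE_DUMP = '''
"example-task-1" #71 prio=5 os_prio=0 tid=0x0000 nid=0x1 waiting on condition
   java.lang.Thread.State: WAITING (parking)
"example-task-2" #72 prio=5 os_prio=0 tid=0x0000 nid=0x2 waiting on condition
   java.lang.Thread.State: TIMED_WAITING (sleeping)
"example-io-1" #73 prio=5 os_prio=0 tid=0x0000 nid=0x3 runnable
   java.lang.Thread.State: RUNNABLE
'''

if __name__ == "__main__":
    for (state, detail), n in sorted(tally_thread_states(SAMPLE_DUMP).items()):
        print(f"{n:4d}  {state} {detail}".rstrip())
```

If every count falls under WAITING or TIMED_WAITING and nothing is RUNNABLE or BLOCKED, that is consistent with the observation above that no thread is doing work while the checkpoint is stalled.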
> Here are the environment details:
> * Flink version 1.14.3
> * Running on Kubernetes
> * Using RocksDB state backend.
> * Checkpoint storage is S3 storage using the Presto library
> * Exactly-once semantics with unaligned checkpoints enabled.
> * Checkpoint timeout 2 hours
> * Maximum concurrent checkpoints is 1
> * Taskmanager CPU: 4, Slots: 1, Process Size: 12 GB
> * Using Kafka for input and output
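
For anyone trying to reproduce the setup, the environment list above maps roughly onto Flink configuration options. The sketch below renders them as `flink-conf.yaml` lines. It is an approximation under stated assumptions: the option keys are the ones documented for Flink 1.14, the bucket path is a placeholder, and the reporter's actual configuration is not shown in the issue. CPU allocation and the Kafka connectors are set elsewhere and are not covered.

```python
# Values come from the environment list in the issue, except where noted.
FLINK_SETTINGS = {
    "state.backend": "rocksdb",
    # Hypothetical path; the s3p:// scheme selects the Presto S3 filesystem.
    "state.checkpoints.dir": "s3p://example-bucket/checkpoints",
    "execution.checkpointing.mode": "EXACTLY_ONCE",
    "execution.checkpointing.unaligned": "true",
    "execution.checkpointing.timeout": "2 h",
    "execution.checkpointing.max-concurrent-checkpoints": "1",
    "taskmanager.numberOfTaskSlots": "1",
    "taskmanager.memory.process.size": "12g",
}


def render_flink_conf(settings):
    """Render a settings dict as 'key: value' lines for flink-conf.yaml."""
    return "\n".join(f"{key}: {value}" for key, value in settings.items())


if __name__ == "__main__":
    print(render_flink_conf(FLINK_SETTINGS))
```

The checkpoint interval is deliberately left out because the issue does not state it.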
> I have attached the task manager logs, thread dump, and screenshots of the job graph and stalled checkpoint.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)