Posted to issues@flink.apache.org by "Pauli Gandhi (Jira)" <ji...@apache.org> on 2022/06/13 22:44:00 UTC
[jira] [Updated] (FLINK-28032) Flink checkpointing hangs and times out with some jobs
[ https://issues.apache.org/jira/browse/FLINK-28032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pauli Gandhi updated FLINK-28032:
---------------------------------
Summary: Flink checkpointing hangs and times out with some jobs (was: Flink checkpointing hangs with some jobs)
> Flink checkpointing hangs and times out with some jobs
> ------------------------------------------------------
>
> Key: FLINK-28032
> URL: https://issues.apache.org/jira/browse/FLINK-28032
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.14.3
> Environment: Here are the environment details (a configuration sketch follows the list):
> * Flink version 1.14.3
> * Running Kubernetes
> * Using RocksDB state backend.
> * Checkpoint storage is S3 storage using the Presto library
> * Exactly Once Semantics with unaligned checkpoints enabled.
> * Checkpoint timeout 2 hours
> * Maximum concurrent checkpoints is 1
> * Taskmanager CPU: 4, Slots: 1, Process Size: 12 GB
> * Using Kafka for input and output
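> For reference, a minimal sketch of the checkpoint configuration described above. The 60-second interval, class name, sample pipeline, and S3 path are illustrative placeholders, not values from this report; it assumes the flink-statebackend-rocksdb and flink-s3-fs-presto dependencies are on the classpath:
> {code:java}
> import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
> import org.apache.flink.streaming.api.CheckpointingMode;
> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>
> public class CheckpointConfigSketch {
>     public static void main(String[] args) throws Exception {
>         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>
>         // Exactly-once checkpointing; the 60 s interval is an assumed placeholder
>         env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);
>
>         // Unaligned checkpoints, as enabled in the affected job
>         env.getCheckpointConfig().enableUnalignedCheckpoints();
>
>         // 2-hour checkpoint timeout, matching the environment details
>         env.getCheckpointConfig().setCheckpointTimeout(2 * 60 * 60 * 1000L);
>
>         // At most one concurrent checkpoint
>         env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
>
>         // RocksDB state backend; checkpoint storage on S3 via the Presto
>         // filesystem (s3p:// scheme); the bucket/path is a placeholder
>         env.setStateBackend(new EmbeddedRocksDBStateBackend());
>         env.getCheckpointConfig().setCheckpointStorage("s3p://example-bucket/checkpoints");
>
>         // Trivial pipeline so the job graph is non-empty and the program runs
>         env.fromElements(1, 2, 3).print();
>         env.execute("checkpoint-config-sketch");
>     }
> }
> {code}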
> Reporter: Pauli Gandhi
> Priority: Major
> Attachments: checkpoint snapshot.png, jobgraph.png, taskmanager_10.112.55.143_6122-969889_log, taskmanager_10.112.55.143_6122-969889_thread_dump
>
>
> We have noticed that the job hangs and eventually times out after 2 hours, every time at the first checkpoint, after it completes 15/23 (65%) acknowledgments. There is no CPU or record-processing activity, yet a number of tasks report 100% back pressure. The issue is specific to this job and slight variations of it. We have created many Flink jobs in the past and never encountered this issue.
> Here are the things we tried to narrow down the problem:
> * The job runs fine if checkpointing is disabled.
> * Increasing the number of task managers and the parallelism to 2 seemed to help the job complete. However, it stalled again when we sent a larger data set.
> * Increasing task manager memory from 4 GB to 16 GB and CPU from 1 to 4 did not help.
> * Restarting the job manager sometimes helps, but not always.
> * Breaking the job up into smaller parts allows it to finish.
> * Analyzing the thread dump showed that all threads are in either a sleeping or waiting state.
> I have attached the task manager logs, a thread dump, and screenshots of the job graph and the stalled checkpoint.
> Your help in resolving this issue is greatly appreciated.