Posted to issues@flink.apache.org by "Yun Tang (Jira)" <ji...@apache.org> on 2022/06/14 02:25:00 UTC

[jira] [Closed] (FLINK-28030) Checkpoint always hangs when running some jobs

     [ https://issues.apache.org/jira/browse/FLINK-28030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yun Tang closed FLINK-28030.
----------------------------
    Resolution: Duplicate

> Checkpoint always hangs when running some jobs
> ----------------------------------------------
>
>                 Key: FLINK-28030
>                 URL: https://issues.apache.org/jira/browse/FLINK-28030
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.3
>            Reporter: Pauli Gandhi
>            Priority: Major
>
> We have noticed that a Flink job hangs at its first checkpoint every time, after completing 15/23 acknowledgments (65%), and eventually times out after 2 hours.  There is no CPU activity, yet a number of tasks report 100% back pressure.  The problem is peculiar to this job and to slight modifications of it.  We have created many Flink jobs in the past and never encountered this issue.
> Here are the things we tried to narrow down the problem:
>  * The job runs fine if checkpointing is disabled.
>  * Increasing the number of task managers and the parallelism to 2 seemed to help the job complete.  However, it stalled again when we sent a larger data set.
>  * Increasing taskmanager memory from 4 GB to 16 GB and CPU from 1 to 4 didn't help.
>  * Restarting the job manager sometimes helps, but not always.
>  * Breaking up the job into smaller parts helps the job to finish.
>  * We analyzed the thread dump, and all threads appear to be in either a sleeping or a waiting state.
> Here are the environment details:
>  * Flink version 1.14.3
>  * Running on Kubernetes
>  * Using RocksDB state backend.
>  * Checkpoint storage is S3 storage using the Presto library
>  * Exactly Once Semantics with unaligned checkpoints enabled.
>  * Checkpoint timeout 2 hours
>  * Maximum concurrent checkpoints is 1
>  * Taskmanager CPU: 4, Slots: 1, Process Size: 12 GB
>  * Using Kafka for input and output
> I have attached the task manager logs, thread dump, and screenshots of the job graph and stalled checkpoint.
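
A stalled checkpoint like the one described above (15/23 acknowledgments) can be narrowed down by listing which tasks have not yet acknowledged. The following is only a minimal sketch, not part of the original report: it assumes a payload shaped like the response of Flink's REST endpoint /jobs/:jobid/checkpoints/details/:checkpointid, with the fields num_subtasks and num_acknowledged_subtasks per task. The vertex names and per-task counts in the sample are hypothetical, and the field names should be checked against the REST API documentation for the Flink version in use.

```python
from typing import Dict, List, Tuple


def pending_acknowledgments(detail: Dict) -> List[Tuple[str, int, int]]:
    """Return (vertex_id, acknowledged, total) for each task that has not fully acknowledged."""
    pending = []
    for vertex_id, stats in detail.get("tasks", {}).items():
        acked = stats["num_acknowledged_subtasks"]
        total = stats["num_subtasks"]
        if acked < total:
            pending.append((vertex_id, acked, total))
    # Least-complete tasks first.
    return sorted(pending, key=lambda row: row[1] / row[2])


# Hypothetical sample, shaped like a checkpoint-details response.
# Only the 15/23 totals come from the report; everything else is illustrative.
sample = {
    "id": 1,
    "status": "IN_PROGRESS",
    "num_subtasks": 23,
    "num_acknowledged_subtasks": 15,
    "tasks": {
        "vertex-a": {"num_subtasks": 10, "num_acknowledged_subtasks": 10},
        "vertex-b": {"num_subtasks": 8, "num_acknowledged_subtasks": 5},
        "vertex-c": {"num_subtasks": 5, "num_acknowledged_subtasks": 0},
    },
}

overall = sample["num_acknowledged_subtasks"] / sample["num_subtasks"]
print(f"{overall:.0%} acknowledged")
for vertex_id, acked, total in pending_acknowledgments(sample):
    print(f"{vertex_id}: {acked}/{total}")
```

Tasks at the top of this list are the ones whose subtasks have not acknowledged at all, so they are a reasonable first place to look in the thread dump and back pressure view.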


