Posted to issues@flink.apache.org by "Pauli Gandhi (Jira)" <ji...@apache.org> on 2022/06/13 22:56:00 UTC

[jira] [Updated] (FLINK-28032) Checkpointing hangs and times out with some jobs

     [ https://issues.apache.org/jira/browse/FLINK-28032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pauli Gandhi updated FLINK-28032:
---------------------------------
    Summary: Checkpointing hangs and times out with some jobs  (was: Flink checkpointing hangs and times out with some jobs)

> Checkpointing hangs and times out with some jobs
> ------------------------------------------------
>
>                 Key: FLINK-28032
>                 URL: https://issues.apache.org/jira/browse/FLINK-28032
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.3
>         Environment: Here are the environment details (see the configuration sketch after the metadata below):
>  * Flink version 1.14.3
>  * Running on Kubernetes
>  * Using the RocksDB state backend
>  * Checkpoint storage is S3, using the Presto library
>  * Exactly-once semantics with unaligned checkpoints enabled
>  * Checkpoint timeout: 2 hours
>  * Maximum concurrent checkpoints: 1
>  * TaskManager CPU: 4, Slots: 1, Process Size: 12 GB
>  * Using Kafka for input and output
>            Reporter: Pauli Gandhi
>            Priority: Major
>         Attachments: checkpoint snapshot.png, jobgraph.png, taskmanager_10.112.55.143_6122-969889_log, taskmanager_10.112.55.143_6122-969889_thread_dump
>
>
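> For reference, here is a minimal sketch of a job configured as described under Environment above (the checkpoint interval, the class name, and the bucket path are illustrative assumptions; only the settings listed in this report are taken from it):
>
>     import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
>     import org.apache.flink.streaming.api.CheckpointingMode;
>     import org.apache.flink.streaming.api.environment.CheckpointConfig;
>     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>
>     public class CheckpointSetup {
>         public static void main(String[] args) throws Exception {
>             StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>
>             // Exactly-once checkpointing; the 1-minute interval is a placeholder
>             env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);
>
>             CheckpointConfig cfg = env.getCheckpointConfig();
>             cfg.enableUnalignedCheckpoints();              // unaligned checkpoints enabled
>             cfg.setCheckpointTimeout(2 * 60 * 60 * 1000L); // 2-hour checkpoint timeout
>             cfg.setMaxConcurrentCheckpoints(1);            // at most one concurrent checkpoint
>
>             // RocksDB state backend; checkpoints written to S3 via flink-s3-fs-presto
>             env.setStateBackend(new EmbeddedRocksDBStateBackend());
>             cfg.setCheckpointStorage("s3p://my-bucket/checkpoints"); // hypothetical bucket
>
>             // ... Kafka source/sink wiring and env.execute() would follow here
>         }
>     }
>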
> We have noticed that the job hangs and eventually times out after 2 hours, every time at the first checkpoint, after it completes 15/23 (65%) acknowledgments.  There is no CPU or record-processing activity, yet a number of tasks report 100% back pressure.  The problem is specific to this job and to slight modifications of it.  We have created many Flink jobs in the past and never encountered this issue.
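> While the checkpoint is stalled, the per-task acknowledgment counts can also be read from the JobManager's REST API (a minimal sketch using Java 11's built-in HTTP client; the REST address and the job id argument are placeholders for the deployment):
>
>     import java.net.URI;
>     import java.net.http.HttpClient;
>     import java.net.http.HttpRequest;
>     import java.net.http.HttpResponse;
>
>     public class CheckpointStats {
>         public static void main(String[] args) throws Exception {
>             // Flink's checkpoint statistics endpoint: /jobs/:jobid/checkpoints
>             String url = "http://localhost:8081/jobs/" + args[0] + "/checkpoints";
>             HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
>             HttpResponse<String> response = HttpClient.newHttpClient()
>                     .send(request, HttpResponse.BodyHandlers.ofString());
>             // Each entry in the response's checkpoint history carries num_subtasks and
>             // num_acknowledged_subtasks, which is where the 15/23 above shows up.
>             System.out.println(response.body());
>         }
>     }
>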
> Here are the things we tried to narrow down the problem:
>  * The job runs fine if checkpointing is disabled.
>  * Increasing the number of task managers and the parallelism to 2 seemed to help the job complete.  However, it stalled again when we sent a larger data set.
>  * Increasing TaskManager memory from 4 GB to 16 GB and CPU from 1 to 4 didn't help.
>  * Restarting the JobManager sometimes helps, but at other times it does not.
>  * Breaking up the job into smaller parts helps it finish.
>  * Analyzing the thread dump shows all threads in either a sleeping or a waiting state.
> I have attached the TaskManager logs (including debug logs for checkpointing), a thread dump, and screenshots of the job graph and the stalled checkpoint.
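> As an aside, the thread states visible in the attached dump can also be summarized from inside a JVM with the standard ThreadMXBean API (a minimal sketch; the attached dump itself was taken externally):
>
>     import java.lang.management.ManagementFactory;
>     import java.lang.management.ThreadInfo;
>     import java.lang.management.ThreadMXBean;
>
>     public class ThreadStateSummary {
>         public static void main(String[] args) {
>             ThreadMXBean mx = ManagementFactory.getThreadMXBean();
>             // List every live thread in this JVM with its current state; in our dump,
>             // all task threads showed up as sleeping or waiting.
>             for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
>                 System.out.printf("%-60s %s%n", info.getThreadName(), info.getThreadState());
>             }
>         }
>     }
>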
> Your help in resolving this issue is greatly appreciated.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)