Posted to issues@flink.apache.org by "Pauli Gandhi (Jira)" <ji...@apache.org> on 2022/06/13 22:42:00 UTC

[jira] [Created] (FLINK-28032) Flink checkpointing hangs with some jobs

Pauli Gandhi created FLINK-28032:
------------------------------------

             Summary: Flink checkpointing hangs with some jobs
                 Key: FLINK-28032
                 URL: https://issues.apache.org/jira/browse/FLINK-28032
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.14.3
         Environment: Here are the environment details
 * Flink version 1.14.3
 * Running on Kubernetes
 * Using the RocksDB state backend
 * Checkpoint storage is S3, using the Presto library
 * Exactly-once semantics with unaligned checkpoints enabled
 * Checkpoint timeout 2 hours
 * Maximum concurrent checkpoints is 1
 * Taskmanager CPU: 4, Slots: 1, Process Size: 12 GB
 * Using Kafka for input and output
            Reporter: Pauli Gandhi
         Attachments: checkpoint snapshot.png, jobgraph.png, taskmanager_10.112.55.143_6122-969889_log, taskmanager_10.112.55.143_6122-969889_thread_dump

We have noticed that the Flink job hangs and eventually times out after 2 hours, every time at the first checkpoint, after it completes 15/23 (65%) acknowledgments.  There is no CPU or record-processing activity, yet a number of tasks report 100% back pressure.  The problem is specific to this job and slight variations of it.  We have created many Flink jobs in the past and never encountered this issue.

Here are the things we tried to narrow down the problem:
 * The job runs fine if checkpointing is disabled.
 * Increasing the number of task managers and the parallelism to 2 seems to help the job complete; however, it stalled again when we sent a larger data set.
 * Increasing the taskmanager memory from 4 GB to 16 GB and the CPU from 1 to 4 did not help.
 * Restarting the job manager sometimes helps, but not always.
 * Breaking the job up into smaller parts lets it finish.
 * We analyzed the thread dump, and all threads appear to be in a sleeping or waiting state.

I have attached the task manager logs, the thread dump, and screenshots of the job graph and the stalled checkpoint.

Your help in resolving this issue is greatly appreciated.
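For reference, the settings listed in the environment section correspond roughly to the following flink-conf.yaml fragment (a sketch using Flink 1.14 configuration keys; the checkpoint interval is not stated in this report, and the S3 bucket path is a placeholder):

```yaml
# Checkpointing (values taken from the environment section above)
execution.checkpointing.mode: EXACTLY_ONCE
execution.checkpointing.unaligned: true
execution.checkpointing.timeout: 2h
execution.checkpointing.max-concurrent-checkpoints: 1

# State backend and checkpoint storage
# (s3p:// is the scheme registered by the flink-s3-fs-presto plugin;
#  the bucket path below is a placeholder, not from the report)
state.backend: rocksdb
state.checkpoints.dir: s3p://<bucket>/checkpoints

# Taskmanager resources, as described above
taskmanager.numberOfTaskSlots: 1
taskmanager.memory.process.size: 12g
kubernetes.taskmanager.cpu: 4
```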



--
This message was sent by Atlassian Jira
(v8.20.7#820007)