Posted to issues@flink.apache.org by "Pauli Gandhi (Jira)" <ji...@apache.org> on 2022/06/13 22:33:00 UTC

[jira] [Created] (FLINK-28030) Checkpoint always hangs when running some jobs

Pauli Gandhi created FLINK-28030:
------------------------------------

             Summary: Checkpoint always hangs when running some jobs
                 Key: FLINK-28030
                 URL: https://issues.apache.org/jira/browse/FLINK-28030
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.14.3
            Reporter: Pauli Gandhi


We have noticed that a Flink job hangs at its first checkpoint every time it runs, after completing 15/23 acknowledgments (65%), and eventually times out after 2 hours.  There is no CPU activity, yet a number of tasks report 100% back pressure.  The problem is specific to this job and slight variations of it.  We have created many Flink jobs in the past and never encountered this issue.

Here are the things we tried to narrow down the problem:
 * The job runs fine if checkpointing is disabled.
 * Increasing the number of task managers and parallelism to 2 seemed to help the job complete.  However, it stalled again when we sent a larger data set.
 * Increasing task manager memory from 4 GB to 16 GB and CPU from 1 to 4 didn't help.
 * Restarting the job manager sometimes helps, but not always.
 * Breaking up the job into smaller parts helps the job to finish.
 * We analyzed the thread dump, and all threads appear to be in either a sleeping or a waiting state.
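
For anyone repeating the thread dump check from the last item, a minimal sketch of tallying thread states might look like the following. It assumes the dump is in the usual jstack text format, where each Java thread has a "java.lang.Thread.State:" line. The class and method names are hypothetical and not part of Flink.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Flink: counts threads per state in a jstack-style dump.
class ThreadStateTally {
    // Matches lines such as "   java.lang.Thread.State: WAITING (parking)".
    private static final Pattern STATE_LINE =
            Pattern.compile("^\\s*java\\.lang\\.Thread\\.State:\\s+(\\w+)");

    /** Returns a map from thread state name (e.g. WAITING) to the number of threads in it. */
    static Map<String, Integer> tally(String dump) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : dump.split("\\R")) {
            Matcher m = STATE_LINE.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: ThreadStateTally <thread-dump-file>");
            return;
        }
        String dump = new String(Files.readAllBytes(Paths.get(args[0])));
        // Print one "STATE: count" line per state found in the dump.
        tally(dump).forEach((state, n) -> System.out.println(state + ": " + n));
    }
}
```

If the tally shows no RUNNABLE task threads, that would be consistent with the observation that every thread is sleeping or waiting.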

Here are the environment details:
 * Flink version 1.14.3
 * Running on Kubernetes
 * Using RocksDB state backend.
 * Checkpoint storage is S3, using the Presto library
 * Exactly-once semantics with unaligned checkpoints enabled.
 * Checkpoint timeout 2 hours
 * Maximum concurrent checkpoints is 1
 * Taskmanager CPU: 4, Slots: 1, Process Size: 12 GB
 * Using Kafka for input and output
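
To make the setup easier to reproduce, the settings above can be sketched as Flink configuration entries. The mapping from each item to an option key is an assumption, since the report does not say whether these were set in flink-conf.yaml or in code. The bucket path is hypothetical, as are the class and method names. The checkpoint interval and the CPU setting are left out, because the interval is not stated and CPU is normally set on the Kubernetes deployment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: renders the reported settings as flink-conf.yaml style lines.
class ReportedCheckpointSettings {
    /** Reported settings, keyed by the Flink option assumed to correspond to each. */
    static Map<String, String> settings() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("state.backend", "rocksdb");
        // Hypothetical location; the s3p scheme selects the Presto-based S3 filesystem.
        conf.put("state.checkpoints.dir", "s3p://example-bucket/checkpoints");
        conf.put("execution.checkpointing.mode", "EXACTLY_ONCE");
        conf.put("execution.checkpointing.unaligned", "true");
        conf.put("execution.checkpointing.timeout", "2 h");
        conf.put("execution.checkpointing.max-concurrent-checkpoints", "1");
        conf.put("taskmanager.numberOfTaskSlots", "1");
        conf.put("taskmanager.memory.process.size", "12g");
        return conf;
    }

    /** Formats each entry as "key: value" on its own line, in insertion order. */
    static String toFlinkConfYaml(Map<String, String> conf) {
        StringBuilder sb = new StringBuilder();
        conf.forEach((k, v) -> sb.append(k).append(": ").append(v).append('\n'));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toFlinkConfYaml(settings()));
    }
}
```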

I have attached the task manager logs, thread dump, and screenshots of the job graph and stalled checkpoint.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)