Posted to user@flink.apache.org by Ethan T Yang <iv...@gmail.com> on 2023/05/31 04:44:28 UTC

Flink checkpoint timeout

Hello all,

We recently started to experience random checkpoint timeouts. Here is some background information:
1. We are on Flink 1.13.1.
2. We have been running this type of streaming job for years. When a checkpoint succeeds, it only takes a few seconds. About a week ago, we started to see random checkpoint timeouts. When a checkpoint times out, it feels like it is stuck somewhere and cannot move forward. After the timeout, the job was able to continue from the previous checkpoint and move forward.
3. Our jobs have fairly high parallelism, from 50 into the hundreds. On the checkpoint page, we saw that one of the subtasks was not acknowledging, which eventually led to the timeout.
4. The Flink job is running on AWS EKS. The job is relatively simple: it reads from AWS Kinesis, does some transformation, and writes Parquet files to AWS S3.

My goal is to get some suggestions on where to start troubleshooting. Below is the TaskManager log from around the time of the checkpoint timeout:

==================================================
2023-05-25 14:47:30,248 INFO  org.apache.parquet.hadoop.InternalParquetRecordWriter        [] - Flushing mem columnStore to file. allocated memory: 122742926
2023-05-25 14:47:32,904 INFO  org.apache.parquet.hadoop.InternalParquetRecordWriter        [] - mem size 134322047 > 134217728: flushing 343857 records to disk.
2023-05-25 14:47:32,979 INFO  org.apache.parquet.hadoop.InternalParquetRecordWriter        [] - Flushing mem columnStore to file. allocated memory: 122816246
2023-05-25 14:47:34,530 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Attempting to cancel task Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (31/50)#0 (c2905cda734172afa6675014ca1271a0).
2023-05-25 14:47:34,530 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (31/50)#0 (c2905cda734172afa6675014ca1271a0) switched from RUNNING to CANCELING.
2023-05-25 14:47:34,530 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Triggering cancellation of task code Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (31/50)#0 (c2905cda734172afa6675014ca1271a0).
2023-05-25 14:47:34,531 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 30 ...
2023-05-25 14:47:34,532 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Shutting down the shard consumer threads of subtask 30 ...
2023-05-25 14:47:34,531 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 30 ...
java.lang.InterruptedException: sleep interrupted
       at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302]
       at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
       at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
       at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
       at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114) [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_302]
       at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_302]
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_302]
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_302]
       at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
2023-05-25 14:47:34,535 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Shutting down the shard consumer threads of subtask 30 ...
2023-05-25 14:47:34,545 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Attempting to cancel task Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (30/50)#0 (0ad93341f291a2aa84be39556b1362e6).
2023-05-25 14:47:34,545 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (30/50)#0 (0ad93341f291a2aa84be39556b1362e6) switched from RUNNING to CANCELING.
2023-05-25 14:47:34,545 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Triggering cancellation of task code Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (30/50)#0 (0ad93341f291a2aa84be39556b1362e6).
2023-05-25 14:47:34,546 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 29 ...
2023-05-25 14:47:34,550 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Attempting to cancel task Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (29/50)#0 (cb16394cd68fe4775987de4d4e6dc6bc).
2023-05-25 14:47:34,550 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (29/50)#0 (cb16394cd68fe4775987de4d4e6dc6bc) switched from RUNNING to CANCELING.
2023-05-25 14:47:34,550 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Triggering cancellation of task code Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (29/50)#0 (cb16394cd68fe4775987de4d4e6dc6bc).
2023-05-25 14:47:34,550 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 29 ...
java.lang.InterruptedException: sleep interrupted
       at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302]
       at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
       at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
       at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
       at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114) [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_302]
       at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_302]
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_302]
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_302]
       at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
2023-05-25 14:47:34,551 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Shutting down the shard consumer threads of subtask 29 ...
2023-05-25 14:47:34,552 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Shutting down the shard consumer threads of subtask 29 ...
2023-05-25 14:47:34,552 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 28 ...
2023-05-25 14:47:34,552 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Shutting down the shard consumer threads of subtask 28 ...
2023-05-25 14:47:34,552 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Attempting to cancel task Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (28/50)#0 (4e1286825df33628e63ea9e4ef6d17ed).
2023-05-25 14:47:34,552 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (28/50)#0 (4e1286825df33628e63ea9e4ef6d17ed) switched from RUNNING to CANCELING.
2023-05-25 14:47:34,552 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Triggering cancellation of task code Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (28/50)#0 (4e1286825df33628e63ea9e4ef6d17ed).
2023-05-25 14:47:34,552 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 28 ...
java.lang.InterruptedException: sleep interrupted
       at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302]
       at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
       at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
       at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
       at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114) [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_302]
       at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_302]
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_302]
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_302]
       at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
2023-05-25 14:47:34,559 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 27 ...
2023-05-25 14:47:34,559 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Attempting to cancel task Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (34/50)#0 (211379391d2c2a39dd2d11b50439a306).
2023-05-25 14:47:34,559 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (34/50)#0 (211379391d2c2a39dd2d11b50439a306) switched from RUNNING to CANCELING.
2023-05-25 14:47:34,559 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Triggering cancellation of task code Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (34/50)#0 (211379391d2c2a39dd2d11b50439a306).
2023-05-25 14:47:34,559 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 27 ...
java.lang.InterruptedException: sleep interrupted
       at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302]
       at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
       at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
       at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
       at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114) [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_302]
       at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_302]
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_302]
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_302]
       at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
2023-05-25 14:47:34,559 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 33 ...
2023-05-25 14:47:34,560 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Shutting down the shard consumer threads of subtask 28 ...
2023-05-25 14:47:34,561 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Shutting down the shard consumer threads of subtask 27 ...
2023-05-25 14:47:34,561 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Attempting to cancel task Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (33/50)#0 (baf2652040493f67373c3877b825a1d1).
2023-05-25 14:47:34,561 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (33/50)#0 (baf2652040493f67373c3877b825a1d1) switched from RUNNING to CANCELING.
2023-05-25 14:47:34,561 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Triggering cancellation of task code Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (33/50)#0 (baf2652040493f67373c3877b825a1d1).
2023-05-25 14:47:34,561 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 33 ...
java.lang.InterruptedException: sleep interrupted
       at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302]
       at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
       at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
       at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
       at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114) [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_302]
       at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_302]
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_302]
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_302]
       at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
2023-05-25 14:47:34,562 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 32 ...
2023-05-25 14:47:34,564 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Shutting down the shard consumer threads of subtask 32 ...
2023-05-25 14:47:34,564 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Shutting down the shard consumer threads of subtask 33 ...
2023-05-25 14:47:34,564 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 32 ...
===================================

The JobManager log only has this:

===================================
org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
    at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:98)
    at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:67)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1934)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1906)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:96)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:1990)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
==============================

Thanks in advance
Ivan
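
For readers following along: the JobManager stack trace above suggests how a checkpoint timeout turns into a job restart. The expired checkpoint is aborted by the CheckpointCoordinator, the CheckpointFailureManager counts the failure, and once the count exceeds the tolerable threshold the job is failed and restarted from the last completed checkpoint. That would also explain the "Attempting to cancel task" lines and the InterruptedException traces in the TaskManager log, which look like a consequence of the failover rather than its cause. The sketch below is a simplified Python model of that counting logic, not Flink's actual code; the class name, the function names, and the sample history are assumptions chosen for illustration.

```python
from typing import Iterable, Optional


class CheckpointFailureCounter:
    """Simplified model of a 'tolerable checkpoint failures' policy.

    Assumption: failures are counted consecutively and a completed
    checkpoint resets the count. This is an illustration, not Flink code.
    """

    def __init__(self, tolerable_failures: int = 0) -> None:
        self.tolerable_failures = tolerable_failures
        self.consecutive_failures = 0

    def on_checkpoint_success(self) -> None:
        # A completed checkpoint clears the streak of failures.
        self.consecutive_failures = 0

    def on_checkpoint_failure(self) -> bool:
        """Record a failed (for example timed-out) checkpoint.

        Returns True when the threshold is exceeded, which is the point
        at which the job would be failed and restarted.
        """
        self.consecutive_failures += 1
        return self.consecutive_failures > self.tolerable_failures


def first_fatal_checkpoint(
    outcomes: Iterable[bool], tolerable_failures: int = 0
) -> Optional[int]:
    """Return the 0-based position of the first outcome that exceeds
    the threshold, or None if the threshold is never exceeded."""
    counter = CheckpointFailureCounter(tolerable_failures)
    for position, succeeded in enumerate(outcomes):
        if succeeded:
            counter.on_checkpoint_success()
        elif counter.on_checkpoint_failure():
            return position
    return None


# True = checkpoint completed, False = checkpoint timed out (hypothetical data).
history = [True, True, False, True, False, False]
print(first_fatal_checkpoint(history, tolerable_failures=0))  # prints 2
print(first_fatal_checkpoint(history, tolerable_failures=1))  # prints 5
```

In a real job the threshold is set through CheckpointConfig.setTolerableCheckpointFailureNumber or the execution.checkpointing.tolerable-failed-checkpoints option. Raising it only keeps the job from restarting on each timeout; the subtask that fails to acknowledge still needs to be found.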

Re: Flink checkpoint timeout

Posted by Hangxiang Yu <ma...@gmail.com>.
Hi, Ethan.
Thanks for sharing.
You could check which subtask is stuck in the History Detail of the
Checkpoint UI, as in the example shown in
https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/ops/monitoring/checkpoint_monitoring/#history-tab
.
Then you could use a tool such as jstack to check where the blocked
thread is stuck in the corresponding TM (using the TM ID shown in the
Checkpoint History Detail).
In addition, as Shammon mentioned, you could also check the resource usage
of the relevant TMs, such as whether GC is normal, etc.
Hope this helps.
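
As a complement to the Checkpoint UI, the same per-subtask information can usually be pulled from the JobManager REST API (GET /jobs/<jobid>/checkpoints/details/<checkpointid>/subtasks/<vertexid>). The Python sketch below assumes the response has a "subtasks" list whose entries carry an "index" and a "status" field, with "completed" marking acknowledged subtasks. The sample payload and the function names are hypothetical, so the field names should be checked against the REST documentation for the Flink version in use. It also assumes the task name in the TaskManager log is 1-based while the subtask index is 0-based, which matches "(31/50)" appearing next to "subtask 30" in the log above.

```python
from typing import Any, Dict, List


def unacknowledged_subtasks(details: Dict[str, Any]) -> List[int]:
    """Return the 0-based indexes of subtasks that have not acknowledged."""
    return sorted(
        subtask["index"]
        for subtask in details.get("subtasks", [])
        if subtask.get("status") != "completed"
    )


def log_label(index: int, parallelism: int) -> str:
    """Format a 0-based subtask index the way task names show it in logs."""
    return f"({index + 1}/{parallelism})"


# Hypothetical, trimmed-down payload for a checkpoint that is still pending.
sample_details = {
    "id": 1001,
    "num_subtasks": 4,
    "num_acknowledged_subtasks": 3,
    "subtasks": [
        {"index": 0, "status": "completed", "end_to_end_duration": 2100},
        {"index": 1, "status": "completed", "end_to_end_duration": 1900},
        {"index": 2, "status": "pending_or_failed"},
        {"index": 3, "status": "completed", "end_to_end_duration": 2300},
    ],
}

stuck = unacknowledged_subtasks(sample_details)
print(stuck)  # prints [2]
print([log_label(i, sample_details["num_subtasks"]) for i in stuck])  # prints ['(3/4)']
```

With the index in hand, the TaskManager hosting that subtask can be looked up in the UI and a thread dump taken there with jstack.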

On Mon, Jun 5, 2023 at 8:42 AM Shammon FY <zj...@gmail.com> wrote:

> Hi Ethan,
>
> When the specified subtask does not have a successful checkpoint, I think
> you can check the resource usage of the TaskManager through metrics, as
> well as whether there is backpressure in the subtask.
>
> Best,
> Shammon FY
>
> On Sun, Jun 4, 2023 at 3:18 AM Ethan T Yang <iv...@gmail.com> wrote:
>
>> Hi Hangxiang,
>>
>> I am not sure which operator it is stuck on. I don't think it's random. I
>> believe that when it gets stuck, it's for the same reason, which I am
>> trying to find out. You can see the chained operator has a parallelism of
>> 654. When it is stuck, there are always 653/654 acknowledgements, with one
>> missing. The other unchained subtasks are sinks to different outputs, and
>> there are no issues there. Since the job reads from Kinesis, writes data to
>> AWS S3, and also checkpoints to S3, I suspect one of the API calls may be
>> failing and causing the hang (just a guess).
>> We have been running this job for years. Only recently did we start to see
>> this issue.
>> I appreciate you looking into it. If you have any ideas, I can provide
>> more information. Thanks - Ivan
>>
>> [image: Screenshot 2023-06-03 at 12.03.45 PM.png]
>>
>> On Jun 1, 2023, at 8:43 PM, Hangxiang Yu <ma...@gmail.com> wrote:
>>
>> Hi, Ivan. Could you provide more information about it?
>> 1. Which operator subtask is stuck? Or is it random?
>> 2. Could you share the stack or flame graph of the stuck subtask?
>>
>> On Wed, May 31, 2023 at 12:45 PM Ethan T Yang <iv...@gmail.com>
>> wrote:
>>
>>> Hello all,
>>>
>>> We recently start to experience Checkpoint timeout randomly. Here are
>>> some background information
>>> 1. We are on Flink 1.13.1
>>> 2. We have been running these type of streaming jobs for years. When
>>> checkpoint succeeds, it only take a few seconds. After a week ago, we start
>>> to see random checkpoint time outs. When it timeout, feels like it stuck
>>> somewhere, couldn’t more forward. After timeout, the job was able to
>>> continue from the previous checkpoint and move forward.
>>> 3. Our job has quite many parallelisms. 50 ~ 100s. Looking at the
>>> checkpoint page. We saw 1 of the subtasks are not acknowledging, which
>>> eventually lead to the timeout.
>>> 4. The Flink job is running on AWS EKS, the nature of job is relatively
>>> simple, read from AWS kinesis and do some transformation and write parquet
>>> files to AWS s3.
>>>
>>> My goal is to seek some suggestions of where to start trouble shooting.
>>> Below is TaskManager log around the time when checkpoint timeout
>>>
>>> ==================================================
>>> 2023-05-25 14:47:30,248 INFO
>>> org.apache.parquet.hadoop.InternalParquetRecordWriter        [] -
>>> Flushing mem columnStore to file. allocated memory: 122742926
>>> 2023-05-25 14:47:32,904 INFO
>>> org.apache.parquet.hadoop.InternalParquetRecordWriter        [] - mem
>>> size 134322047 > 134217728: flushing 343857 records to disk.
>>> 2023-05-25 14:47:32,979 INFO
>>> org.apache.parquet.hadoop.InternalParquetRecordWriter        [] -
>>> Flushing mem columnStore to file. allocated memory: 122816246
>>> 2023-05-25 14:47:34,530 INFO  org.apache.flink.runtime.taskmanager.Task
>>>                   [] - Attempting to cancel task Source:
>>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map de
>>> compressed to events -> flat events -> (Sink: Parquet Sink, Filter ->
>>> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink)
>>> (31/50)#0 (c2905cda734172afa6675014ca1271a0).
>>> 2023-05-25 14:47:34,530 INFO  org.apache.flink.runtime.taskmanager.Task
>>>                   [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod
>>> -> map batch to events -> (map decompressed to events -> fl
>>> at events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink),
>>> Process -> Map -> Sink: Parquet PB Raw Sink) (31/50)#0
>>> (c2905cda734172afa6675014ca1271a0) switched from RUNNING to CANCELING.
>>> 2023-05-25 14:47:34,530 INFO  org.apache.flink.runtime.taskmanager.Task
>>>                   [] - Triggering cancellation of task code Source:
>>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events
>>>  -> (map decompressed to events -> flat events -> (Sink: Parquet Sink,
>>> Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB
>>> Raw Sink) (31/50)#0 (c2905cda734172afa6675014ca1271a0).
>>> 2023-05-25 14:47:34,531 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>>> subtask 30 ...
>>> 2023-05-25 14:47:34,532 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>>> [] - Shutting down the shard consumer threads of subtask 30 ...
>>> 2023-05-25 14:47:34,531 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>>> subtask 30 ...
>>> java.lang.InterruptedException: sleep interrupted
>>>        at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302]
>>>        at
>>> org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216)
>>> ~[flink-connector-kinesis_2.11-1.13.1.j
>>> ar:1.13.1]
>>>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124)
>>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102)
>>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>>        at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114)
>>> [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>>        at
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>> [?:1.8.0_302]
>>>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> [?:1.8.0_302]
>>>        at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>> [?:1.8.0_302]
>>>        at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>> [?:1.8.0_302]
>>>        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
>>> 2023-05-25 14:47:34,535 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>>> [] - Shutting down the shard consumer threads of subtask 30 ...
>>> 2023-05-25 14:47:34,545 INFO  org.apache.flink.runtime.taskmanager.Task
>>>                   [] - Attempting to cancel task Source:
>>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map de
>>> compressed to events -> flat events -> (Sink: Parquet Sink, Filter ->
>>> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink)
>>> (30/50)#0 (0ad93341f291a2aa84be39556b1362e6).
>>> 2023-05-25 14:47:34,545 INFO  org.apache.flink.runtime.taskmanager.Task
>>>                   [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod
>>> -> map batch to events -> (map decompressed to events -> fl
>>> at events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink),
>>> Process -> Map -> Sink: Parquet PB Raw Sink) (30/50)#0
>>> (0ad93341f291a2aa84be39556b1362e6) switched from RUNNING to CANCELING.
>>> 2023-05-25 14:47:34,545 INFO  org.apache.flink.runtime.taskmanager.Task
>>>                   [] - Triggering cancellation of task code Source:
>>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events
>>>  -> (map decompressed to events -> flat events -> (Sink: Parquet Sink,
>>> Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB
>>> Raw Sink) (30/50)#0 (0ad93341f291a2aa84be39556b1362e6).
>>> 2023-05-25 14:47:34,546 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>>> subtask 29 ...
>>> 2023-05-25 14:47:34,550 INFO  org.apache.flink.runtime.taskmanager.Task
>>>                   [] - Attempting to cancel task Source:
>>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map de
>>> compressed to events -> flat events -> (Sink: Parquet Sink, Filter ->
>>> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink)
>>> (29/50)#0 (cb16394cd68fe4775987de4d4e6dc6bc).
>>> 2023-05-25 14:47:34,550 INFO  org.apache.flink.runtime.taskmanager.Task
>>>                   [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod
>>> -> map batch to events -> (map decompressed to events -> fl
>>> at events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink),
>>> Process -> Map -> Sink: Parquet PB Raw Sink) (29/50)#0
>>> (cb16394cd68fe4775987de4d4e6dc6bc) switched from RUNNING to CANCELING.
>>> 2023-05-25 14:47:34,550 INFO  org.apache.flink.runtime.taskmanager.Task
>>>                   [] - Triggering cancellation of task code Source:
>>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events
>>>  -> (map decompressed to events -> flat events -> (Sink: Parquet Sink,
>>> Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB
>>> Raw Sink) (29/50)#0 (cb16394cd68fe4775987de4d4e6dc6bc).
>>> 2023-05-25 14:47:34,550 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>>> subtask 29 ...
>>> java.lang.InterruptedException: sleep interrupted
>>>        at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302]
>>>        at
>>> org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216)
>>> ~[flink-connector-kinesis_2.11-1.13.1.j
>>> ar:1.13.1]
>>>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124)
>>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102)
>>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>>        at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114)
>>> [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>>        at
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>> [?:1.8.0_302]
>>>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> [?:1.8.0_302]
>>>        at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>> [?:1.8.0_302]
>>>        at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>> [?:1.8.0_302]
>>>        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
>>> 2023-05-25 14:47:34,551 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>>> [] - Shutting down the shard consumer threads of subtask 29 ...
>>> 2023-05-25 14:47:34,552 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>>> [] - Shutting down the shard consumer threads of subtask 29 ...
>>> 2023-05-25 14:47:34,552 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>>> subtask 28 ...
>>> 2023-05-25 14:47:34,552 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>>> [] - Shutting down the shard consumer threads of subtask 28 ...
>>> 2023-05-25 14:47:34,552 INFO  org.apache.flink.runtime.taskmanager.Task
>>>                   [] - Attempting to cancel task Source:
>>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map
>>> decompressed to events -> flat events -> (Sink: Parquet Sink, Filter ->
>>> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink)
>>> (28/50)#0 (4e1286825df33628e63ea9e4ef6d17ed).
>>> 2023-05-25 14:47:34,552 INFO  org.apache.flink.runtime.taskmanager.Task
>>>                   [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod
>>> -> map batch to events -> (map decompressed to events -> flat events ->
>>> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink),
>>> Process -> Map -> Sink: Parquet PB Raw Sink) (28/50)#0
>>> (4e1286825df33628e63ea9e4ef6d17ed) switched from RUNNING to CANCELING.
>>> 2023-05-25 14:47:34,552 INFO  org.apache.flink.runtime.taskmanager.Task
>>>                   [] - Triggering cancellation of task code Source:
>>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events
>>>  -> (map decompressed to events -> flat events -> (Sink: Parquet Sink,
>>> Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB
>>> Raw Sink) (28/50)#0 (4e1286825df33628e63ea9e4ef6d17ed).
>>> 2023-05-25 14:47:34,552 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>>> subtask 28 ...
>>> java.lang.InterruptedException: sleep interrupted
>>>        at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302]
>>>        at
>>> org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216)
>>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124)
>>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102)
>>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>>        at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114)
>>> [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>>        at
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>> [?:1.8.0_302]
>>>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> [?:1.8.0_302]
>>>        at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>> [?:1.8.0_302]
>>>        at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>> [?:1.8.0_302]
>>>        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
>>> 2023-05-25 14:47:34,559 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>>> subtask 27 ...
>>> 2023-05-25 14:47:34,559 INFO  org.apache.flink.runtime.taskmanager.Task
>>>                   [] - Attempting to cancel task Source:
>>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map
>>> decompressed to events -> flat events -> (Sink: Parquet Sink, Filter ->
>>> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink)
>>> (34/50)#0 (211379391d2c2a39dd2d11b50439a306).
>>> 2023-05-25 14:47:34,559 INFO  org.apache.flink.runtime.taskmanager.Task
>>>                   [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod
>>> -> map batch to events -> (map decompressed to events -> flat events ->
>>> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink),
>>> Process -> Map -> Sink: Parquet PB Raw Sink) (34/50)#0
>>> (211379391d2c2a39dd2d11b50439a306) switched from RUNNING to CANCELING.
>>> 2023-05-25 14:47:34,559 INFO  org.apache.flink.runtime.taskmanager.Task
>>>                   [] - Triggering cancellation of task code Source:
>>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events
>>>  -> (map decompressed to events -> flat events -> (Sink: Parquet Sink,
>>> Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB
>>> Raw Sink) (34/50)#0 (211379391d2c2a39dd2d11b50439a306).
>>> 2023-05-25 14:47:34,559 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>>> subtask 27 ...
>>> java.lang.InterruptedException: sleep interrupted
>>>        at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302]
>>>        at
>>> org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216)
>>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124)
>>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102)
>>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>>        at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114)
>>> [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>>        at
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>> [?:1.8.0_302]
>>>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> [?:1.8.0_302]
>>>        at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>> [?:1.8.0_302]
>>>        at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>> [?:1.8.0_302]
>>>        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
>>> 2023-05-25 14:47:34,559 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>>> subtask 33 ...
>>> 2023-05-25 14:47:34,560 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>>> [] - Shutting down the shard consumer threads of subtask 28 ...
>>> 2023-05-25 14:47:34,561 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>>> [] - Shutting down the shard consumer threads of subtask 27 ...
>>> 2023-05-25 14:47:34,561 INFO  org.apache.flink.runtime.taskmanager.Task
>>>                   [] - Attempting to cancel task Source:
>>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map
>>> decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink:
>>> Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink)
>>> (33/50)#0 (baf2652040493f67373c3877b825a1d1).
>>> 2023-05-25 14:47:34,561 INFO  org.apache.flink.runtime.taskmanager.Task
>>>                   [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod
>>> -> map batch to events -> (map decompressed to events -> flat events ->
>>>  (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process ->
>>> Map -> Sink: Parquet PB Raw Sink) (33/50)#0
>>> (baf2652040493f67373c3877b825a1d1) switched from RUNNING to CANCELING.
>>> 2023-05-25 14:47:34,561 INFO  org.apache.flink.runtime.taskmanager.Task
>>>                   [] - Triggering cancellation of task code Source:
>>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map
>>> decompressed to events -> flat events -> (Sink: Parquet Sink, Filter ->
>>> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink)
>>> (33/50)#0 (baf2652040493f67373c3877b825a1d1).
>>> 2023-05-25 14:47:34,561 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>>> subtask 33 ...
>>> java.lang.InterruptedException: sleep interrupted
>>>        at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302]
>>>        at
>>> org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216)
>>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124)
>>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102)
>>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>>        at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114)
>>> [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>>        at
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>> [?:1.8.0_302]
>>>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> [?:1.8.0_302]
>>>        at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>> [?:1.8.0_302]
>>>        at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>> [?:1.8.0_302]
>>>        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
>>> 2023-05-25 14:47:34,562 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>>> subtask 32 ...
>>> 2023-05-25 14:47:34,564 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>>> [] - Shutting down the shard consumer threads of subtask 32 ...
>>> 2023-05-25 14:47:34,564 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>>> [] - Shutting down the shard consumer threads of subtask 33 ...
>>> 2023-05-25 14:47:34,564 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>>> subtask 32 ...
>>> ===================================
>>>
>>> Job manager log only has this
>>>
>>> ===================================
>>> org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
>>>     at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:98)
>>>     at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:67)
>>>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1934)
>>>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1906)
>>>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:96)
>>>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:1990)
>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>     at java.lang.Thread.run(Thread.java:748)
>>> ==============================
>>>
>>> Thanks in advance
>>> Ivan
>>>
>>
>>
>> --
>> Best,
>> Hangxiang.
>>
>>
>>

-- 
Best,
Hangxiang.

Re: Flink checkpoint timeout

Posted by Shammon FY <zj...@gmail.com>.
Hi Ethan,

Since one specific subtask fails to acknowledge the checkpoint, I think
you can check the resource usage of its TaskManager through metrics, as
well as whether that subtask is under backpressure.
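
As a rough illustration of the backpressure check, here is a minimal Python
sketch. It assumes the JSON shape returned by Flink's REST endpoint
/jobs/<job-id>/vertices/<vertex-id>/backpressure (a "subtasks" list with
"subtask", "backpressure-level" and "ratio" fields); please verify the field
names against the REST API docs for your Flink version. The sample payload and
the helper name backpressured_subtasks are made up for the example and are not
taken from the job in this thread.

```python
import json

# Hypothetical sample payload, shaped like the assumed response of
# /jobs/<job-id>/vertices/<vertex-id>/backpressure. Values are invented.
SAMPLE_RESPONSE = """
{
  "status": "ok",
  "backpressure-level": "high",
  "subtasks": [
    {"subtask": 0, "backpressure-level": "ok", "ratio": 0.02},
    {"subtask": 1, "backpressure-level": "high", "ratio": 0.93},
    {"subtask": 2, "backpressure-level": "low", "ratio": 0.31}
  ]
}
"""


def backpressured_subtasks(payload, min_ratio=0.1):
    """Return (subtask index, level, ratio) for subtasks whose backpressure
    ratio is at or above min_ratio, sorted with the worst subtask first."""
    data = json.loads(payload)
    hits = [
        (s["subtask"], s["backpressure-level"], s["ratio"])
        for s in data.get("subtasks", [])
        if s["ratio"] >= min_ratio
    ]
    return sorted(hits, key=lambda h: h[2], reverse=True)


if __name__ == "__main__":
    for index, level, ratio in backpressured_subtasks(SAMPLE_RESPONSE):
        print(f"subtask {index}: {level} ({ratio:.0%})")
```

If the subtask that never acknowledges the checkpoint also shows up at the top
of this list, that would point toward backpressure rather than a hung external
call, but this is only one signal to compare with the TaskManager metrics.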

Best,
Shammon FY

On Sun, Jun 4, 2023 at 3:18 AM Ethan T Yang <iv...@gmail.com> wrote:

> Hi Hangxiang,
>
> I am not sure which operator it gets stuck on, but I don’t think it's random. I
> believe it gets stuck for the same reason each time, which is what I am trying
> to find out. You can see the chained operator has a parallelism of 654. When it
> gets stuck, 653 of the 654 subtasks have acknowledged and one is always missing.
> The other, unchained subtasks are sinks to different outputs, and they show no
> issues. Since the job reads from Kinesis, writes data to AWS S3, and also
> checkpoints to S3, I suspect one of the API calls may be failing and causing
> the hang (just a guess). We have been running this job for years, and only
> recently did we start to see this issue.
> I appreciate you looking into it. If you have any ideas, I can provide
> more information. Thanks - Ivan
>
> [image: Screenshot 2023-06-03 at 12.03.45 PM.png]
>
> On Jun 1, 2023, at 8:43 PM, Hangxiang Yu <ma...@gmail.com> wrote:
>
> Hi, Ivan. Could you provide more information about it?
> 1. Which operator subtask is stuck, or is it random?
> 2. Could you share the stack or flame graph of the stuck subtask?
>
> On Wed, May 31, 2023 at 12:45 PM Ethan T Yang <iv...@gmail.com>
> wrote:
>
>> Hello all,
>>
>> We recently started to experience random checkpoint timeouts. Here is
>> some background information:
>> 1. We are on Flink 1.13.1.
>> 2. We have been running this type of streaming job for years. When a
>> checkpoint succeeds, it only takes a few seconds. About a week ago, we started
>> to see random checkpoint timeouts. When one times out, it feels like it is stuck
>> somewhere and cannot move forward. After the timeout, the job was able to
>> continue from the previous checkpoint and move forward.
>> 3. Our jobs run with fairly high parallelism, 50 ~ 100s. On the
>> checkpoint page, we saw that one of the subtasks was not acknowledging, which
>> eventually led to the timeout.
>> 4. The Flink job runs on AWS EKS. The job is relatively simple: it reads
>> from AWS Kinesis, does some transformation, and writes Parquet
>> files to AWS S3.
>>
>> My goal is to get some suggestions on where to start troubleshooting.
>> Below is the TaskManager log from around the time of the checkpoint timeout:
>>
>> ==================================================
>> 2023-05-25 14:47:30,248 INFO
>> org.apache.parquet.hadoop.InternalParquetRecordWriter        [] -
>> Flushing mem columnStore to file. allocated memory: 122742926
>> 2023-05-25 14:47:32,904 INFO
>> org.apache.parquet.hadoop.InternalParquetRecordWriter        [] - mem
>> size 134322047 > 134217728: flushing 343857 records to disk.
>> 2023-05-25 14:47:32,979 INFO
>> org.apache.parquet.hadoop.InternalParquetRecordWriter        [] -
>> Flushing mem columnStore to file. allocated memory: 122816246
>> 2023-05-25 14:47:34,530 INFO  org.apache.flink.runtime.taskmanager.Task
>>                   [] - Attempting to cancel task Source:
>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map
>> decompressed to events -> flat events -> (Sink: Parquet Sink, Filter ->
>> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink)
>> (31/50)#0 (c2905cda734172afa6675014ca1271a0).
>> 2023-05-25 14:47:34,530 INFO  org.apache.flink.runtime.taskmanager.Task
>>                   [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod
>> -> map batch to events -> (map decompressed to events -> flat events ->
>> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink),
>> Process -> Map -> Sink: Parquet PB Raw Sink) (31/50)#0
>> (c2905cda734172afa6675014ca1271a0) switched from RUNNING to CANCELING.
>> 2023-05-25 14:47:34,530 INFO  org.apache.flink.runtime.taskmanager.Task
>>                   [] - Triggering cancellation of task code Source:
>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events
>>  -> (map decompressed to events -> flat events -> (Sink: Parquet Sink,
>> Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB
>> Raw Sink) (31/50)#0 (c2905cda734172afa6675014ca1271a0).
>> 2023-05-25 14:47:34,531 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>> subtask 30 ...
>> 2023-05-25 14:47:34,532 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>> [] - Shutting down the shard consumer threads of subtask 30 ...
>> 2023-05-25 14:47:34,531 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>> subtask 30 ...
>> java.lang.InterruptedException: sleep interrupted
>>        at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302]
>>        at
>> org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216)
>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124)
>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102)
>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>        at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114)
>> [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>        at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> [?:1.8.0_302]
>>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> [?:1.8.0_302]
>>        at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>> [?:1.8.0_302]
>>        at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>> [?:1.8.0_302]
>>        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
>> 2023-05-25 14:47:34,535 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>> [] - Shutting down the shard consumer threads of subtask 30 ...
>> 2023-05-25 14:47:34,545 INFO  org.apache.flink.runtime.taskmanager.Task
>>                   [] - Attempting to cancel task Source:
>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map
>> decompressed to events -> flat events -> (Sink: Parquet Sink, Filter ->
>> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink)
>> (30/50)#0 (0ad93341f291a2aa84be39556b1362e6).
>> 2023-05-25 14:47:34,545 INFO  org.apache.flink.runtime.taskmanager.Task
>>                   [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod
>> -> map batch to events -> (map decompressed to events -> flat events ->
>> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink),
>> Process -> Map -> Sink: Parquet PB Raw Sink) (30/50)#0
>> (0ad93341f291a2aa84be39556b1362e6) switched from RUNNING to CANCELING.
>> 2023-05-25 14:47:34,545 INFO  org.apache.flink.runtime.taskmanager.Task
>>                   [] - Triggering cancellation of task code Source:
>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events
>>  -> (map decompressed to events -> flat events -> (Sink: Parquet Sink,
>> Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB
>> Raw Sink) (30/50)#0 (0ad93341f291a2aa84be39556b1362e6).
>> 2023-05-25 14:47:34,546 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>> subtask 29 ...
>> 2023-05-25 14:47:34,550 INFO  org.apache.flink.runtime.taskmanager.Task
>>                   [] - Attempting to cancel task Source:
>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map
>> decompressed to events -> flat events -> (Sink: Parquet Sink, Filter ->
>> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink)
>> (29/50)#0 (cb16394cd68fe4775987de4d4e6dc6bc).
>> 2023-05-25 14:47:34,550 INFO  org.apache.flink.runtime.taskmanager.Task
>>                   [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod
>> -> map batch to events -> (map decompressed to events -> flat events ->
>> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink),
>> Process -> Map -> Sink: Parquet PB Raw Sink) (29/50)#0
>> (cb16394cd68fe4775987de4d4e6dc6bc) switched from RUNNING to CANCELING.
>> 2023-05-25 14:47:34,550 INFO  org.apache.flink.runtime.taskmanager.Task
>>                   [] - Triggering cancellation of task code Source:
>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events
>>  -> (map decompressed to events -> flat events -> (Sink: Parquet Sink,
>> Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB
>> Raw Sink) (29/50)#0 (cb16394cd68fe4775987de4d4e6dc6bc).
>> 2023-05-25 14:47:34,550 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>> subtask 29 ...
>> java.lang.InterruptedException: sleep interrupted
>>        at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302]
>>        at
>> org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216)
>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124)
>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102)
>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>        at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114)
>> [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>        at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> [?:1.8.0_302]
>>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> [?:1.8.0_302]
>>        at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>> [?:1.8.0_302]
>>        at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>> [?:1.8.0_302]
>>        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
>> 2023-05-25 14:47:34,551 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>> [] - Shutting down the shard consumer threads of subtask 29 ...
>> 2023-05-25 14:47:34,552 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>> [] - Shutting down the shard consumer threads of subtask 29 ...
>> 2023-05-25 14:47:34,552 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>> subtask 28 ...
>> 2023-05-25 14:47:34,552 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>> [] - Shutting down the shard consumer threads of subtask 28 ...
>> 2023-05-25 14:47:34,552 INFO  org.apache.flink.runtime.taskmanager.Task
>>                   [] - Attempting to cancel task Source:
>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map
>> decompressed to events -> flat events -> (Sink: Parquet Sink, Filter ->
>> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink)
>> (28/50)#0 (4e1286825df33628e63ea9e4ef6d17ed).
>> 2023-05-25 14:47:34,552 INFO  org.apache.flink.runtime.taskmanager.Task
>>                   [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod
>> -> map batch to events -> (map decompressed to events -> flat events ->
>> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink),
>> Process -> Map -> Sink: Parquet PB Raw Sink) (28/50)#0
>> (4e1286825df33628e63ea9e4ef6d17ed) switched from RUNNING to CANCELING.
>> 2023-05-25 14:47:34,552 INFO  org.apache.flink.runtime.taskmanager.Task
>>                   [] - Triggering cancellation of task code Source:
>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events
>>  -> (map decompressed to events -> flat events -> (Sink: Parquet Sink,
>> Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB
>> Raw Sink) (28/50)#0 (4e1286825df33628e63ea9e4ef6d17ed).
>> 2023-05-25 14:47:34,552 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>> subtask 28 ...
>> java.lang.InterruptedException: sleep interrupted
>>        at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302]
>>        at
>> org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216)
>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124)
>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102)
>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>        at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114)
>> [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>        at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> [?:1.8.0_302]
>>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> [?:1.8.0_302]
>>        at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>> [?:1.8.0_302]
>>        at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>> [?:1.8.0_302]
>>        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
>> 2023-05-25 14:47:34,559 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>> subtask 27 ...
>> 2023-05-25 14:47:34,559 INFO  org.apache.flink.runtime.taskmanager.Task
>>                   [] - Attempting to cancel task Source:
>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map
>> decompressed to events -> flat events -> (Sink: Parquet Sink, Filter ->
>> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink)
>> (34/50)#0 (211379391d2c2a39dd2d11b50439a306).
>> 2023-05-25 14:47:34,559 INFO  org.apache.flink.runtime.taskmanager.Task
>>                   [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod
>> -> map batch to events -> (map decompressed to events -> flat events ->
>> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink),
>> Process -> Map -> Sink: Parquet PB Raw Sink) (34/50)#0
>> (211379391d2c2a39dd2d11b50439a306) switched from RUNNING to CANCELING.
>> 2023-05-25 14:47:34,559 INFO  org.apache.flink.runtime.taskmanager.Task
>>                   [] - Triggering cancellation of task code Source:
>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events
>>  -> (map decompressed to events -> flat events -> (Sink: Parquet Sink,
>> Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB
>> Raw Sink) (34/50)#0 (211379391d2c2a39dd2d11b50439a306).
>> 2023-05-25 14:47:34,559 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>> subtask 27 ...
>> java.lang.InterruptedException: sleep interrupted
>>        at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302]
>>        at
>> org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216)
>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124)
>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102)
>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>        at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114)
>> [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>        at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> [?:1.8.0_302]
>>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> [?:1.8.0_302]
>>        at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>> [?:1.8.0_302]
>>        at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>> [?:1.8.0_302]
>>        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
>> 2023-05-25 14:47:34,559 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>> subtask 33 ...
>> 2023-05-25 14:47:34,560 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>> [] - Shutting down the shard consumer threads of subtask 28 ...
>> 2023-05-25 14:47:34,561 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>> [] - Shutting down the shard consumer threads of subtask 27 ...
>> 2023-05-25 14:47:34,561 INFO  org.apache.flink.runtime.taskmanager.Task
>>                   [] - Attempting to cancel task Source:
>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map
>> decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted
>> Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (33/50)#0
>> (baf2652040493f67373c3877b825a1d1).
>> 2023-05-25 14:47:34,561 INFO  org.apache.flink.runtime.taskmanager.Task
>>                   [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod
>> -> map batch to events -> (map decompressed to events -> flat events ->
>>  (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process ->
>> Map -> Sink: Parquet PB Raw Sink) (33/50)#0
>> (baf2652040493f67373c3877b825a1d1) switched from RUNNING to CANCELING.
>> 2023-05-25 14:47:34,561 INFO  org.apache.flink.runtime.taskmanager.Task
>>                   [] - Triggering cancellation of task code Source:
>> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map dec
>> ompressed to events -> flat events -> (Sink: Parquet Sink, Filter ->
>> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink)
>> (33/50)#0 (baf2652040493f67373c3877b825a1d1).
>> 2023-05-25 14:47:34,561 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>> subtask 33 ...
>> java.lang.InterruptedException: sleep interrupted
>>        at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302]
>>        at
>> org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216)
>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124)
>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102)
>> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>        at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114)
>> [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>>        at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> [?:1.8.0_302]
>>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> [?:1.8.0_302]
>>        at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>> [?:1.8.0_302]
>>        at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>> [?:1.8.0_302]
>>        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
>> 2023-05-25 14:47:34,562 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>> subtask 32 ...
>> 2023-05-25 14:47:34,564 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>> [] - Shutting down the shard consumer threads of subtask 32 ...
>> 2023-05-25 14:47:34,564 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>> [] - Shutting down the shard consumer threads of subtask 33 ...
>> 2023-05-25 14:47:34,564 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
>> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
>> subtask 32 ...
>> ===================================
>>
>> Job manager log only has this
>>
>> ===================================
>> org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint
>> tolerable failure threshold.
>>     at org.apache.flink.runtime.checkpoint.CheckpointFailureManager
>> .handleCheckpointException(CheckpointFailureManager.java:98)
>>     at org.apache.flink.runtime.checkpoint.CheckpointFailureManager
>> .handleJobLevelCheckpointException(CheckpointFailureManager.java:67)
>>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>> .abortPendingCheckpoint(CheckpointCoordinator.java:1934)
>>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>> .abortPendingCheckpoint(CheckpointCoordinator.java:1906)
>>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>> .access$600(CheckpointCoordinator.java:96)
>>     at org.apache.flink.runtime.checkpoint.
>> CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:
>> 1990)
>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors
>> .java:511)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>     at java.util.concurrent.
>> ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(
>> ScheduledThreadPoolExecutor.java:180)
>>     at java.util.concurrent.
>> ScheduledThreadPoolExecutor$ScheduledFutureTask.run(
>> ScheduledThreadPoolExecutor.java:293)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(
>> ThreadPoolExecutor.java:1149)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(
>> ThreadPoolExecutor.java:624)
>>     at java.lang.Thread.run(Thread.java:748)
>> ==============================
>>
>> Thanks in advance
>> Ivan
>>
>
>
> --
> Best,
> Hangxiang.
>
>
>

Re: Flink checkpoint timeout

Posted by Ethan T Yang <iv...@gmail.com>.
Hi Hangxiang,

I am not sure which operator it is stuck on. I don't think it's random; I believe that whenever it gets stuck, it is for the same reason, which I am trying to find. You can see that the chained operator has a parallelism of 654. When it gets stuck, 653 of the 654 subtasks always acknowledge and one is missing. The other, unchained subtasks are sinks to different outputs, and they show no issues. Since the job reads from Kinesis, writes data to AWS S3, and also checkpoints to S3, I suspect that one of those API calls may be failing and causing the hang (just a guess). We have been running this job for years and only recently started to see this issue.
I appreciate you looking into it. If you have any ideas, I can provide more information. Thanks - Ivan
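
One way to pin down which subtask is the missing acknowledgement is to read the per-subtask checkpoint statistics that the checkpoint page is built from. The sketch below is only an illustration: it assumes the JSON comes from the JobManager REST endpoint /jobs/<jobid>/checkpoints/details/<checkpointid>/subtasks/<vertexid>, and that each entry in "subtasks" carries an "index" and a "status" field where "completed" means acknowledged. Those field names and the sample payload are assumptions to verify against the REST API documentation for Flink 1.13; the function name is hypothetical.

```python
import json


def unacknowledged_subtasks(details):
    """Return the indices of subtasks whose status is not "completed".

    `details` is the parsed JSON body of the (assumed) per-vertex
    checkpoint statistics response.
    """
    return sorted(
        s["index"]
        for s in details.get("subtasks", [])
        if s.get("status") != "completed"
    )


# Made-up payload for illustration only; a real response has more fields.
sample = json.loads("""
{
  "id": 1234,
  "status": "IN_PROGRESS",
  "subtasks": [
    {"index": 0, "status": "completed"},
    {"index": 1, "status": "pending_or_failed"},
    {"index": 2, "status": "completed"}
  ]
}
""")

print(unacknowledged_subtasks(sample))
```

Running this against a checkpoint that is still in progress shortly before the timeout would give the subtask index to look for in the TaskManager logs and thread dumps.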



> On Jun 1, 2023, at 8:43 PM, Hangxiang Yu <ma...@gmail.com> wrote:
> 
> HI, Ivan.
> Could you provide more information about it:
> 1. Which operator subtask is stuck ? or is it random ?
> 2. Could you share the stack or flame graph of the stuck subtask ?
> 
> 
> 
> 
> -- 
> Best,
> Hangxiang.


Re: Flink checkpoint timeout

Posted by Hangxiang Yu <ma...@gmail.com>.
Hi, Ivan. Could you provide more information about it:
1. Which operator subtask is stuck? Or is it random?
2. Could you share the stack or flame graph of the stuck subtask?
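
For the stack in item 2, a thread dump of the TaskManager that hosts the stuck subtask (taken with jstack, for example) can be filtered down to the one task thread. The sketch below is a minimal illustration, not a Flink tool: it assumes the dump is in the usual jstack text layout, where each thread starts with a quoted name line, and that the task thread is named after the task and subtask, e.g. ending in "(31/50)#0" as in the log lines above. The function name and the sample dump are hypothetical.

```python
def find_thread_stacks(dump_text, needle):
    """Return the thread blocks of a jstack dump whose header contains `needle`."""
    blocks, current = [], []
    for line in dump_text.splitlines():
        if line.startswith('"'):      # a quoted thread name starts a new block
            if current:
                blocks.append(current)
            current = [line]
        elif current and line.strip():
            current.append(line)      # state line or stack frame of the block
        elif current:                 # a blank line ends the block
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return ["\n".join(b) for b in blocks if needle in b[0]]


# Made-up dump for illustration only.
sample_dump = '''2023-05-25 14:47:00
Full thread dump OpenJDK 64-Bit Server VM:

"Source: demo -> Sink: Parquet Sink (31/50)#0" #88 prio=5 tid=0x1 nid=0x2 runnable
   java.lang.Thread.State: RUNNABLE
\tat java.net.SocketInputStream.socketRead0(Native Method)

"Source: demo -> Sink: Parquet Sink (30/50)#0" #87 prio=5 tid=0x3 nid=0x4 waiting
   java.lang.Thread.State: WAITING
\tat java.lang.Object.wait(Native Method)
'''

for block in find_thread_stacks(sample_dump, "(31/50)"):
    print(block)
```

Taking two or three dumps a few seconds apart while the checkpoint is hanging, and comparing the filtered blocks, should show whether the subtask is blocked in the same call each time.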

On Wed, May 31, 2023 at 12:45 PM Ethan T Yang <iv...@gmail.com> wrote:

> Hello all,
>
> We recently start to experience Checkpoint timeout randomly. Here are some
> background information
> 1. We are on Flink 1.13.1
> 2. We have been running these type of streaming jobs for years. When
> checkpoint succeeds, it only take a few seconds. After a week ago, we start
> to see random checkpoint time outs. When it timeout, feels like it stuck
> somewhere, couldn’t more forward. After timeout, the job was able to
> continue from the previous checkpoint and move forward.
> 3. Our job has quite many parallelisms. 50 ~ 100s. Looking at the
> checkpoint page. We saw 1 of the subtasks are not acknowledging, which
> eventually lead to the timeout.
> 4. The Flink job is running on AWS EKS, the nature of job is relatively
> simple, read from AWS kinesis and do some transformation and write parquet
> files to AWS s3.
>
> My goal is to seek some suggestions of where to start trouble shooting.
> Below is TaskManager log around the time when checkpoint timeout
>
> ==================================================
>
> 2023-05-25 14:47:30,248 INFO
> org.apache.parquet.hadoop.InternalParquetRecordWriter        [] -
> Flushing mem columnStore to file. allocated memory: 122742926
>
> 2023-05-25 14:47:32,904 INFO
> org.apache.parquet.hadoop.InternalParquetRecordWriter        [] - mem
> size 134322047 > 134217728: flushing 343857 records to disk.
>
> 2023-05-25 14:47:32,979 INFO
> org.apache.parquet.hadoop.InternalParquetRecordWriter        [] -
> Flushing mem columnStore to file. allocated memory: 122816246
>
> 2023-05-25 14:47:34,530 INFO  org.apache.flink.runtime.taskmanager.Task
>                   [] - Attempting to cancel task Source:
> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map de
>
> compressed to events -> flat events -> (Sink: Parquet Sink, Filter ->
> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink)
> (31/50)#0 (c2905cda734172afa6675014ca1271a0).
>
> 2023-05-25 14:47:34,530 INFO  org.apache.flink.runtime.taskmanager.Task
>                   [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod ->
> map batch to events -> (map decompressed to events -> fl
>
> at events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink),
> Process -> Map -> Sink: Parquet PB Raw Sink) (31/50)#0
> (c2905cda734172afa6675014ca1271a0) switched from RUNNING to CANCELING.
>
> 2023-05-25 14:47:34,530 INFO  org.apache.flink.runtime.taskmanager.Task
>                   [] - Triggering cancellation of task code Source:
> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events
>
>  -> (map decompressed to events -> flat events -> (Sink: Parquet Sink,
> Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB
> Raw Sink) (31/50)#0 (c2905cda734172afa6675014ca1271a0).
>
> 2023-05-25 14:47:34,531 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
> subtask 30 ...
>
> 2023-05-25 14:47:34,532 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
> [] - Shutting down the shard consumer threads of subtask 30 ...
>
> 2023-05-25 14:47:34,531 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
> subtask 30 ...
>
> java.lang.InterruptedException: sleep interrupted
>
>        at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302]
>
>        at
> org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216)
> ~[flink-connector-kinesis_2.11-1.13.1.j
>
> ar:1.13.1]
>
>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124)
> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>
>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102)
> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>
>        at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114)
> [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>
>        at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> [?:1.8.0_302]
>
>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [?:1.8.0_302]
>
>        at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [?:1.8.0_302]
>
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [?:1.8.0_302]
>
>        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
>
> 2023-05-25 14:47:34,535 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
> [] - Shutting down the shard consumer threads of subtask 30 ...
>
> 2023-05-25 14:47:34,545 INFO  org.apache.flink.runtime.taskmanager.Task
>                   [] - Attempting to cancel task Source:
> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map
> decompressed to events -> flat events -> (Sink: Parquet Sink, Filter ->
> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink)
> (30/50)#0 (0ad93341f291a2aa84be39556b1362e6).
>
> 2023-05-25 14:47:34,545 INFO  org.apache.flink.runtime.taskmanager.Task
>                   [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod ->
> map batch to events -> (map decompressed to events -> flat events ->
> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink),
> Process -> Map -> Sink: Parquet PB Raw Sink) (30/50)#0
> (0ad93341f291a2aa84be39556b1362e6) switched from RUNNING to CANCELING.
>
> 2023-05-25 14:47:34,545 INFO  org.apache.flink.runtime.taskmanager.Task
>                   [] - Triggering cancellation of task code Source:
> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events
> -> (map decompressed to events -> flat events -> (Sink: Parquet Sink,
> Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB
> Raw Sink) (30/50)#0 (0ad93341f291a2aa84be39556b1362e6).
>
> 2023-05-25 14:47:34,546 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
> subtask 29 ...
>
> 2023-05-25 14:47:34,550 INFO  org.apache.flink.runtime.taskmanager.Task
>                   [] - Attempting to cancel task Source:
> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map
> decompressed to events -> flat events -> (Sink: Parquet Sink, Filter ->
> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink)
> (29/50)#0 (cb16394cd68fe4775987de4d4e6dc6bc).
>
> 2023-05-25 14:47:34,550 INFO  org.apache.flink.runtime.taskmanager.Task
>                   [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod ->
> map batch to events -> (map decompressed to events -> flat events ->
> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink),
> Process -> Map -> Sink: Parquet PB Raw Sink) (29/50)#0
> (cb16394cd68fe4775987de4d4e6dc6bc) switched from RUNNING to CANCELING.
>
> 2023-05-25 14:47:34,550 INFO  org.apache.flink.runtime.taskmanager.Task
>                   [] - Triggering cancellation of task code Source:
> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events
> -> (map decompressed to events -> flat events -> (Sink: Parquet Sink,
> Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB
> Raw Sink) (29/50)#0 (cb16394cd68fe4775987de4d4e6dc6bc).
>
> 2023-05-25 14:47:34,550 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
> subtask 29 ...
>
> java.lang.InterruptedException: sleep interrupted
>
>        at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302]
>
>        at
> org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216)
> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>
>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124)
> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>
>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102)
> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>
>        at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114)
> [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>
>        at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> [?:1.8.0_302]
>
>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [?:1.8.0_302]
>
>        at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [?:1.8.0_302]
>
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [?:1.8.0_302]
>
>        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
>
> 2023-05-25 14:47:34,551 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
> [] - Shutting down the shard consumer threads of subtask 29 ...
>
> 2023-05-25 14:47:34,552 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
> [] - Shutting down the shard consumer threads of subtask 29 ...
>
> 2023-05-25 14:47:34,552 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
> subtask 28 ...
>
> 2023-05-25 14:47:34,552 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
> [] - Shutting down the shard consumer threads of subtask 28 ...
>
> 2023-05-25 14:47:34,552 INFO  org.apache.flink.runtime.taskmanager.Task
>                   [] - Attempting to cancel task Source:
> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map
> decompressed to events -> flat events -> (Sink: Parquet Sink, Filter ->
> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink)
> (28/50)#0 (4e1286825df33628e63ea9e4ef6d17ed).
>
> 2023-05-25 14:47:34,552 INFO  org.apache.flink.runtime.taskmanager.Task
>                   [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod ->
> map batch to events -> (map decompressed to events -> flat events ->
> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink),
> Process -> Map -> Sink: Parquet PB Raw Sink) (28/50)#0
> (4e1286825df33628e63ea9e4ef6d17ed) switched from RUNNING to CANCELING.
>
> 2023-05-25 14:47:34,552 INFO  org.apache.flink.runtime.taskmanager.Task
>                   [] - Triggering cancellation of task code Source:
> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events
> -> (map decompressed to events -> flat events -> (Sink: Parquet Sink,
> Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB
> Raw Sink) (28/50)#0 (4e1286825df33628e63ea9e4ef6d17ed).
>
> 2023-05-25 14:47:34,552 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
> subtask 28 ...
>
> java.lang.InterruptedException: sleep interrupted
>
>        at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302]
>
>        at
> org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216)
> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>
>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124)
> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>
>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102)
> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>
>        at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114)
> [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>
>        at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> [?:1.8.0_302]
>
>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [?:1.8.0_302]
>
>        at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [?:1.8.0_302]
>
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [?:1.8.0_302]
>
>        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
>
> 2023-05-25 14:47:34,559 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
> subtask 27 ...
>
> 2023-05-25 14:47:34,559 INFO  org.apache.flink.runtime.taskmanager.Task
>                   [] - Attempting to cancel task Source:
> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map
> decompressed to events -> flat events -> (Sink: Parquet Sink, Filter ->
> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink)
> (34/50)#0 (211379391d2c2a39dd2d11b50439a306).
>
> 2023-05-25 14:47:34,559 INFO  org.apache.flink.runtime.taskmanager.Task
>                   [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod ->
> map batch to events -> (map decompressed to events -> flat events ->
> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink),
> Process -> Map -> Sink: Parquet PB Raw Sink) (34/50)#0
> (211379391d2c2a39dd2d11b50439a306) switched from RUNNING to CANCELING.
>
> 2023-05-25 14:47:34,559 INFO  org.apache.flink.runtime.taskmanager.Task
>                   [] - Triggering cancellation of task code Source:
> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events
> -> (map decompressed to events -> flat events -> (Sink: Parquet Sink,
> Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB
> Raw Sink) (34/50)#0 (211379391d2c2a39dd2d11b50439a306).
>
> 2023-05-25 14:47:34,559 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
> subtask 27 ...
>
> java.lang.InterruptedException: sleep interrupted
>
>        at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302]
>
>        at
> org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216)
> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>
>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124)
> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>
>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102)
> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>
>        at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114)
> [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>
>        at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> [?:1.8.0_302]
>
>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [?:1.8.0_302]
>
>        at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [?:1.8.0_302]
>
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [?:1.8.0_302]
>
>        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
>
> 2023-05-25 14:47:34,559 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
> subtask 33 ...
>
> 2023-05-25 14:47:34,560 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
> [] - Shutting down the shard consumer threads of subtask 28 ...
>
> 2023-05-25 14:47:34,561 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
> [] - Shutting down the shard consumer threads of subtask 27 ...
>
> 2023-05-25 14:47:34,561 INFO  org.apache.flink.runtime.taskmanager.Task
>                   [] - Attempting to cancel task Source:
> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map
> decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted
> Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (33/50)#0
> (baf2652040493f67373c3877b825a1d1).
>
> 2023-05-25 14:47:34,561 INFO  org.apache.flink.runtime.taskmanager.Task
>                   [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod ->
> map batch to events -> (map decompressed to events -> flat events ->
> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process ->
> Map -> Sink: Parquet PB Raw Sink) (33/50)#0
> (baf2652040493f67373c3877b825a1d1) switched from RUNNING to CANCELING.
>
> 2023-05-25 14:47:34,561 INFO  org.apache.flink.runtime.taskmanager.Task
>                   [] - Triggering cancellation of task code Source:
> psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map
> decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink:
> Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink)
> (33/50)#0 (baf2652040493f67373c3877b825a1d1).
>
> 2023-05-25 14:47:34,561 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
> subtask 33 ...
>
> java.lang.InterruptedException: sleep interrupted
>
>        at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302]
>
>        at
> org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216)
> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>
>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124)
> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>
>        at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102)
> ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>
>        at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114)
> [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1]
>
>        at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> [?:1.8.0_302]
>
>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [?:1.8.0_302]
>
>        at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [?:1.8.0_302]
>
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [?:1.8.0_302]
>
>        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
>
> 2023-05-25 14:47:34,562 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
> subtask 32 ...
>
> 2023-05-25 14:47:34,564 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
> [] - Shutting down the shard consumer threads of subtask 32 ...
>
> 2023-05-25 14:47:34,564 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
> [] - Shutting down the shard consumer threads of subtask 33 ...
>
> 2023-05-25 14:47:34,564 INFO  org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher
> [] - Starting shutdown of shard consumer threads and AWS SDK resources of
> subtask 32 ...
>
> ===================================
>
>
> The JobManager log only has this:
>
>
> ===================================
> org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
>     at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:98)
>     at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:67)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1934)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1906)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:96)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:1990)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> ==============================
>
> Thanks in advance
> Ivan
>


-- 
Best,
Hangxiang.