Posted to issues@flink.apache.org by "Anton Kalashnikov (Jira)" <ji...@apache.org> on 2022/03/03 17:22:00 UTC

[jira] [Commented] (FLINK-25958) OOME Checkpoints & Savepoints were shown as COMPLETE in Flink UI

    [ https://issues.apache.org/jira/browse/FLINK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500914#comment-17500914 ] 

Anton Kalashnikov commented on FLINK-25958:
-------------------------------------------

Merged to master between commits:

f57a5379ff9c108627d3c511414e7ea71a1a2a2f and fbfdb0e468356fe71826eb6b185ecda9bc8b1de3

Since the behavior was changed slightly and this bug is not critical, we decided that it does not make sense to backport the fix to older versions.

> OOME Checkpoints & Savepoints were shown as COMPLETE in Flink UI
> ----------------------------------------------------------------
>
>                 Key: FLINK-25958
>                 URL: https://issues.apache.org/jira/browse/FLINK-25958
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.15.0, 1.12.7, 1.13.5, 1.14.3
>         Environment: Ververica Platform 2.6.2
> Flink 1.13.5
>            Reporter: Victor Xu
>            Assignee: Anton Kalashnikov
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.0
>
>         Attachments: JIRA-1.jpg
>
>
> Flink job was running but the checkpoints & savepoints were failing all the time due to OOM Exception. However, the Flink UI showed COMPLETE for those checkpoints & savepoints.
> For example (checkpoint 39 & 40):
> {noformat}
> 2022-01-27 02:41:39,969 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 39 (type=CHECKPOINT) @ 1643251299952 for job ab2217e5ce144087bbddf6bd6c3668eb.
> 2022-01-27 02:43:19,678 WARN  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Error while processing AcknowledgeCheckpoint message
> org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete the pending checkpoint 39. Failure reason: Failure to finalize checkpoint.
>         at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1227) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1072) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
>         at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: java.lang.IllegalArgumentException: Self-suppression not permitted
>         at java.lang.Throwable.addSuppressed(Throwable.java:1054) ~[?:?]
>         at org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:627) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         at com.ververica.platform.flink.ha.kubernetes.KubernetesHaCheckpointStore.serializeCheckpoint(KubernetesHaCheckpointStore.java:204) ~[vvp-flink-ha-kubernetes-flink113-1.4-20211013.091138-2.jar:?]
>         at com.ververica.platform.flink.ha.kubernetes.KubernetesHaCheckpointStore.addCheckpoint(KubernetesHaCheckpointStore.java:83) ~[vvp-flink-ha-kubernetes-flink113-1.4-20211013.091138-2.jar:?]
>         at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1209) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         ... 9 more
> Caused by: java.lang.OutOfMemoryError: Java heap space
> 2022-01-27 03:41:39,970 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 40 (type=CHECKPOINT) @ 1643254899952 for job ab2217e5ce144087bbddf6bd6c3668eb.
> 2022-01-27 03:43:22,326 WARN  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Error while processing AcknowledgeCheckpoint message
> org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete the pending checkpoint 40. Failure reason: Failure to finalize checkpoint.
>         at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1227) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1072) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
>         at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: java.lang.IllegalArgumentException: Self-suppression not permitted
>         at java.lang.Throwable.addSuppressed(Throwable.java:1054) ~[?:?]
>         at org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:627) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         at com.ververica.platform.flink.ha.kubernetes.KubernetesHaCheckpointStore.serializeCheckpoint(KubernetesHaCheckpointStore.java:204) ~[vvp-flink-ha-kubernetes-flink113-1.4-20211013.091138-2.jar:?]
>         at com.ververica.platform.flink.ha.kubernetes.KubernetesHaCheckpointStore.addCheckpoint(KubernetesHaCheckpointStore.java:83) ~[vvp-flink-ha-kubernetes-flink113-1.4-20211013.091138-2.jar:?]
>         at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1209) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         ... 9 more
> Caused by: java.lang.OutOfMemoryError: Java heap space{noformat}
> Please find attached a screenshot of the Flink UI (both 39 & 40 were COMPLETE).
>  
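
For readers puzzled by the "Self-suppression not permitted" frame in the traces above: the following is a minimal sketch of how a try-with-resources block can produce that IllegalArgumentException when close() rethrows the very same exception instance that the body threw. The class SelfSuppressionDemo and the resource FailingResource are hypothetical names made up for illustration, and an IllegalStateException stands in for the OutOfMemoryError. This only illustrates the JDK mechanism; it does not claim to be the exact code path inside InstantiationUtil.serializeObject.

```java
public class SelfSuppressionDemo {

    /** Hypothetical resource whose close() rethrows the failure recorded by write(). */
    static final class FailingResource implements AutoCloseable {
        private RuntimeException failure;

        void write() {
            // Stand-in for the OutOfMemoryError seen in the log above.
            failure = new IllegalStateException("simulated failure");
            throw failure;
        }

        @Override
        public void close() {
            if (failure != null) {
                // Same instance as the one thrown from write().
                throw failure;
            }
        }
    }

    /** Runs the failing try-with-resources and returns whatever escapes it. */
    static Throwable reproduce() {
        try (FailingResource resource = new FailingResource()) {
            resource.write();
            return null;
        } catch (Throwable t) {
            // try-with-resources calls primary.addSuppressed(fromClose);
            // when both are the same object, addSuppressed itself throws
            // IllegalArgumentException with the original as its cause.
            return t;
        }
    }

    public static void main(String[] args) {
        Throwable t = reproduce();
        System.out.println(t);
        System.out.println("cause: " + t.getCause());
    }
}
```

The escaping exception is the IllegalArgumentException rather than the original failure, which matches the shape of the "Caused by" chain in the logs: the original error is still reachable, but only as the cause of the self-suppression error.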



--
This message was sent by Atlassian Jira
(v8.20.1#820001)