Posted to issues@flink.apache.org by "Chesnay Schepler (Jira)" <ji...@apache.org> on 2024/04/19 15:04:00 UTC

[jira] [Closed] (FLINK-35159) CreatingExecutionGraph can leak CheckpointCoordinator and cause JM crash

     [ https://issues.apache.org/jira/browse/FLINK-35159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chesnay Schepler closed FLINK-35159.
------------------------------------
    Resolution: Fixed

> CreatingExecutionGraph can leak CheckpointCoordinator and cause JM crash
> ------------------------------------------------------------------------
>
>                 Key: FLINK-35159
>                 URL: https://issues.apache.org/jira/browse/FLINK-35159
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.18.0
>            Reporter: Chesnay Schepler
>            Assignee: Chesnay Schepler
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.18.2, 1.20.0, 1.19.1
>
>
> When a task manager dies while the JM is generating an ExecutionGraph in the background, {{CreatingExecutionGraph#handleExecutionGraphCreation}} can transition back into WaitingForResources if that TM hosted one of the slots that we planned to use in {{tryToAssignSlots}}.
> At this point the ExecutionGraph was already transitioned to running, which implicitly kicks off periodic checkpointing by the CheckpointCoordinator, without the operator coordinator holders being initialized yet (as this happens after we assigned slots).
> This effectively leaks that CheckpointCoordinator, including the timer thread that will continue to try triggering checkpoints, which will naturally fail to trigger.
> This can cause a JM crash because it results in {{OperatorCoordinatorHolder#abortCurrentTriggering}} being called, which fails with an NPE since the {{mainThreadExecutor}} was not initialized yet.
> {code}
> java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: java.lang.NullPointerException
> 	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:707)
> 	at java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
> 	at java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
> 	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
> 	at java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
> 	at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910)
> 	at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
> 	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> 	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> 	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> 	at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.util.concurrent.CompletionException: java.lang.NullPointerException
> 	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
> 	at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
> 	at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932)
> 	at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
> 	... 7 more
> Caused by: java.lang.NullPointerException
> 	at org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:388)
> 	at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
> 	at java.base/java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085)
> 	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:985)
> 	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:961)
> 	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:693)
> 	at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
> 	... 8 more
> {code}
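
The failure mode described above can be reproduced in miniature. The following is only a sketch with hypothetical stand-in classes (Holder, Coordinator, runLeakedCoordinator are invented for illustration and are not Flink code): a timer starts firing as soon as the "graph" is marked running, the holder's executor is never initialized because slot assignment is abandoned, and the abort path hits the null field. Unlike the real JM, the sketch catches the exception so it can be inspected.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class LeakedCoordinatorSketch {

    /** Hypothetical stand-in for an operator coordinator holder. */
    static final class Holder {
        // Stays null until initialize() runs; in this sketch that would
        // only happen after slots have been assigned.
        private volatile Executor mainThreadExecutor;

        void initialize(Executor executor) {
            this.mainThreadExecutor = executor;
        }

        void abortCurrentTriggering() {
            // Throws NullPointerException if initialize() was never called.
            mainThreadExecutor.execute(() -> { });
        }
    }

    /** Hypothetical stand-in for a checkpoint coordinator with its own timer thread. */
    static final class Coordinator {
        private final ScheduledExecutorService timer =
                Executors.newSingleThreadScheduledExecutor();
        private final Holder holder;
        private final AtomicReference<Throwable> firstFailure = new AtomicReference<>();
        private final CountDownLatch firstTick = new CountDownLatch(1);

        Coordinator(Holder holder) {
            this.holder = holder;
        }

        void startPeriodicTriggering() {
            timer.scheduleAtFixedRate(() -> {
                try {
                    // The trigger itself cannot succeed (nothing is deployed),
                    // so the coordinator goes straight to its abort path.
                    holder.abortCurrentTriggering();
                } catch (Throwable t) {
                    // The sketch records the failure instead of crashing.
                    firstFailure.compareAndSet(null, t);
                } finally {
                    firstTick.countDown();
                }
            }, 0, 10, TimeUnit.MILLISECONDS);
        }

        void shutdown() {
            timer.shutdownNow();
        }
    }

    /** Returns the first failure seen on the timer thread, or null if none was observed. */
    public static Throwable runLeakedCoordinator() {
        Holder holder = new Holder();
        Coordinator coordinator = new Coordinator(holder);
        try {
            // "Graph transitions to running": periodic triggering starts.
            coordinator.startPeriodicTriggering();
            // Slot assignment is abandoned here, so holder.initialize(...) is
            // never reached, yet the timer thread keeps firing.
            coordinator.firstTick.await(5, TimeUnit.SECONDS);
            return coordinator.firstFailure.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return e;
        } finally {
            // Stop the timer so the sketch itself does not leak a thread.
            coordinator.shutdown();
        }
    }

    public static void main(String[] args) {
        Throwable failure = runLeakedCoordinator();
        System.out.println("Failure seen on the timer thread: " + failure);
    }
}
```

The explicit shutdown in the finally block is there only to keep the sketch tidy; it is not meant to represent the fix that was applied for this issue.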



--
This message was sent by Atlassian Jira
(v8.20.10#820010)