Posted to user@flink.apache.org by Puneet Duggal <pu...@gmail.com> on 2021/09/24 13:19:29 UTC

Job Manager went down on cancelling job with savepoint

Hi,

While cancelling a job with a savepoint, the job itself was cancelled successfully, but the JobManager went down immediately afterwards. I am not able to deduce anything from the stack trace below. Any help is appreciated.

2021-09-24 11:50:44,182 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Stopping checkpoint coordinator for job 1f764a51996d206b28721aa4a1420bea.
2021-09-24 11:50:44,182 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Shutting down
2021-09-24 11:50:44,240 INFO  org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore [] - Removing /flink/default_ns/checkpoints/1f764a51996d206b28721aa4a1420bea from ZooKeeper
2021-09-24 11:50:44,243 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter [] - Shutting down.
2021-09-24 11:50:44,243 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter [] - Removing /checkpoint-counter/1f764a51996d206b28721aa4a1420bea from ZooKeeper
2021-09-24 11:50:44,249 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Job 1f764a51996d206b28721aa4a1420bea reached globally terminal state CANCELED.
2021-09-24 11:50:44,249 ERROR org.apache.flink.runtime.util.FatalExitExceptionHandler      [] - FATAL: Thread 'cluster-io-thread-16' produced an uncaught exception. Stopping the process...
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@54a5137c rejected from java.util.concurrent.ScheduledThreadPoolExecutor@37ee0790[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 4513]
	at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) ~[?:1.8.0_232]
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) ~[?:1.8.0_232]
	at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:326) ~[?:1.8.0_232]
	at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533) ~[?:1.8.0_232]
	at java.util.concurrent.ScheduledThreadPoolExecutor.execute(ScheduledThreadPoolExecutor.java:622) ~[?:1.8.0_232]
	at java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:668) ~[?:1.8.0_232]
	at org.apache.flink.runtime.concurrent.ScheduledExecutorServiceAdapter.execute(ScheduledExecutorServiceAdapter.java:64) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.scheduleTriggerRequest(CheckpointCoordinator.java:1290) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
	at org.apache.flink.runtime.checkpoint.CheckpointsCleaner.lambda$cleanCheckpoint$0(CheckpointsCleaner.java:66) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_232]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_232]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]

Regards,
Puneet
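
For readers following the trace above: the top frames show a task being handed to a `ScheduledThreadPoolExecutor` that is already in the `Terminated` state, which the JDK's default `AbortPolicy` answers with a `RejectedExecutionException`. The sketch below reproduces only that JDK behaviour in isolation. It is not Flink code; the class and method names are hypothetical, and it makes no claim about why the executor was already shut down in this cluster.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;

// Hypothetical demo class, not part of Flink.
public class RejectedAfterShutdown {

    /** Returns true if a task submitted to an already shut-down executor is rejected. */
    static boolean isRejectedAfterShutdown() {
        // Same executor family as in the trace: a delegated ScheduledThreadPoolExecutor.
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

        // Shut the executor down before anything is submitted to it.
        timer.shutdownNow();

        try {
            // A late submission, analogous to a cleanup callback arriving after shutdown.
            timer.execute(() -> { });
            return false;
        } catch (RejectedExecutionException e) {
            // The default AbortPolicy throws instead of silently dropping the task.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("rejected after shutdown: " + isRejectedAfterShutdown());
    }
}
```

In the trace, the exception goes uncaught on `cluster-io-thread-16`, and `FatalExitExceptionHandler` then stops the process, which is why a rejection of this kind ends with the JobManager going down.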

Re: Job Manager went down on cancelling job with savepoint

Posted by Guowei Ma <gu...@gmail.com>.

Hi, Puneet

Could you share whether you are using Flink's session mode or application
mode?
The log shows `StandaloneDispatcher`, but that dispatcher is used in both
session and application mode, so it does not tell us which one you are running.
If you are using application mode, this behaviour might be in line with expectations.

Best,
Guowei

