Posted to user@flink.apache.org by Xiaolong Wang <xi...@smartnews.com> on 2020/07/23 02:01:53 UTC

Flink failed to resume from checkpoint stored on S3

Dear community,
    One of my Flink jobs failed yesterday, and when I tried to resume from
the latest checkpoint, the following exception occurred:


```
Log Type: jobmanager.err

Log Upload Time: Wed Jul 22 09:04:24 +0000 2020

Log Length: 506

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/mnt/yarn/usercache/ec2-user/appcache/application_1591011685424_1054/filecache/10/slf4j-log4j12-1.7.15.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

Log Type: jobmanager.log

Log Upload Time: Wed Jul 22 09:04:24 +0000 2020

Log Length: 65177

Showing 4096 bytes of 65177 total.

SchedulerBase.<init>(SchedulerBase.java:215)
at
org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:120)
at
org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:105)
at
org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:278)
at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:266)
at
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98)
at
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40)
at
org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:146)
... 10 more
2020-07-22 09:04:22,766 ERROR
org.apache.flink.runtime.rest.handler.job.JobExecutionResultHandler  -
Unhandled exception.
java.lang.RuntimeException:
org.apache.flink.runtime.client.JobExecutionException: Could not set up
JobManager
at
org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
at
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not
set up JobManager
at
org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:152)
at
org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:84)
at
org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$6(Dispatcher.java:379)
at
org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
... 7 more
Caused by: java.io.FileNotFoundException: Cannot find meta data file
'_metadata' in directory
's3://xxxx/flink/checkpoint_dir/65786c3307a10e79a52b4de478cfe996/chk-7853'.
Please try to load the checkpoint/savepoint directly from the metadata file
instead of the directory.
at
org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpointPointer(AbstractFsCheckpointStorage.java:258)
at
org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpoint(AbstractFsCheckpointStorage.java:110)
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1152)
at
org.apache.flink.runtime.scheduler.SchedulerBase.tryRestoreExecutionGraphFromSavepoint(SchedulerBase.java:306)
at
org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:239)
at
org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:215)
at
org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:120)
at
org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:105)
at
org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:278)
at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:266)
at
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98)
at
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40)
at
org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:146)
... 10 more
2020-07-22 09:04:22,771 INFO  org.apache.flink.runtime.blob.BlobServer
                 - Stopped BLOB server at 0.0.0.0:34683

Log Type: jobmanager.out

Log Upload Time: Wed Jul 22 09:04:24 +0000 2020

Log Length: 0
```

My job was running and taking a checkpoint every 5 minutes in *at-least-once*
mode. Before the job failed, a checkpoint failure had occurred.

I searched the web and found the following related issues:

https://issues.apache.org/jira/browse/FLINK-10855
https://issues.apache.org/jira/browse/FLINK-10856
https://issues.apache.org/jira/browse/FLINK-10894

But they are all marked as resolved. Has anyone encountered this problem? And
how can I make sure that a checkpoint is recorded without losing its
`_metadata` file?

Thanks, looking forward to your replies.

Yours, Roland.

Re: Flink failed to resume from checkpoint stored on S3

Posted by Congxian Qiu <qc...@gmail.com>.
Hi Xiaolong,
   From the log, it seems there is no `_metadata` file in the checkpoint
directory
s3://xxxx/flink/checkpoint_dir/65786c3307a10e79a52b4de478cfe996/chk-7853.
Have you configured retained checkpoints [1]? If this is not configured,
the checkpoint will be deleted when the job stops.

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/state/checkpoints.html#retained-checkpoints
Best,
Congxian
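
For anyone hitting the same error: the restore fails because the `chk-N`
directory has no `_metadata` file, so before resubmitting it can help to pick
the newest checkpoint directory that actually contains one. The sketch below
is only an illustration in Python, not something from the Flink project.
`latest_complete_checkpoint` is a hypothetical helper, and it assumes you have
already listed the object keys under the checkpoint directory by some other
means; the keys in the example are made up.

```python
import re
from typing import Iterable, Optional

# Matches "<prefix>/chk-<N>/_metadata" and captures the checkpoint
# directory and its numeric id N.
_METADATA_KEY = re.compile(r"^(?P<dir>(?:.*/)?chk-(?P<n>\d+))/_metadata$")


def latest_complete_checkpoint(keys: Iterable[str]) -> Optional[str]:
    """Return the chk-N directory with the highest N that has a _metadata key.

    Hypothetical helper: `keys` is any listing of object keys (or file
    paths) under the job's checkpoint directory. Returns None when no
    directory contains _metadata.
    """
    best_n = -1
    best_dir = None
    for key in keys:
        match = _METADATA_KEY.match(key)
        if match is None:
            continue  # state file, shared/ or taskowned/ entry, etc.
        n = int(match.group("n"))  # compare numerically, not as strings
        if n > best_n:
            best_n, best_dir = n, match.group("dir")
    return best_dir


# Made-up listing: chk-7853 has a state file but no _metadata.
example_keys = [
    "checkpoint_dir/job/chk-7852/_metadata",
    "checkpoint_dir/job/chk-7852/state-file",
    "checkpoint_dir/job/chk-7853/state-file",
]
print(latest_complete_checkpoint(example_keys))
```

The directory it returns, prefixed with the bucket, is the path you would then
pass to `flink run -s <path>` instead of the incomplete `chk-7853`.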

