Posted to issues@flink.apache.org by "Chesnay Schepler (JIRA)" <ji...@apache.org> on 2018/11/14 12:32:00 UTC
[jira] [Reopened] (FLINK-10856) Harden resume from externalized checkpoint E2E test
[ https://issues.apache.org/jira/browse/FLINK-10856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chesnay Schepler reopened FLINK-10856:
--------------------------------------
The latest fix introduced a new instability: the test can attempt to resume from an incomplete checkpoint (i.e. one that has no {{_metadata}} file).
Before resuming, the test has to double-check that the selected checkpoint was actually completed.
https://travis-ci.org/zentol/flink/jobs/454943554
{code}
2018-11-14 12:11:45,071 ERROR org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Failed to submit job 9e7876f96c583177d183ce18b52bbd18.
java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
	at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
	at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176)
	at org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:1058)
	at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:308)
	at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
	... 7 more
Caused by: java.io.FileNotFoundException: Cannot find meta data file '_metadata' in directory 'file:/home/travis/build/zentol/flink/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-14313753102/externalized-chckpt-e2e-backend-dir/dfcac75e697394900e5088f962130d57/chk-10'. Please try to load the checkpoint/savepoint directly from the metadata file instead of the directory.
	at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpointPointer(AbstractFsCheckpointStorage.java:256)
	at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpoint(AbstractFsCheckpointStorage.java:109)
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1100)
	at org.apache.flink.runtime.jobmaster.JobMaster.tryRestoreExecutionGraphFromSavepoint(JobMaster.java:1234)
	at org.apache.flink.runtime.jobmaster.JobMaster.createAndRestoreExecutionGraph(JobMaster.java:1158)
	at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:296)
	at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:157)
	... 10 more
{code}
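For illustration only, here is a rough sketch of the check described above. It is written in Python and is not the actual test script or the actual fix. The function name {{latest_completed_checkpoint}} is hypothetical, and the layout (one {{chk-<counter>}} directory per checkpoint, each holding a {{_metadata}} file once complete) is an assumption inferred from the path in the log. The sketch picks the checkpoint with the highest counter among those that contain {{_metadata}}, comparing counters numerically so that {{chk-10}} sorts after {{chk-9}}.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: checkpoint directories are named chk-<counter>, as in the log path.
CHK_DIR = re.compile(r"^chk-(\d+)$")


def latest_completed_checkpoint(job_checkpoint_dir: Path) -> Optional[Path]:
    """Return the chk-<n> directory with the highest counter that contains
    a _metadata file, or None if no completed checkpoint exists."""
    completed = []
    for child in job_checkpoint_dir.iterdir():
        match = CHK_DIR.match(child.name)
        # Skip anything that is not a chk-<n> directory, and skip
        # checkpoints without _metadata (treated here as incomplete).
        if match and child.is_dir() and (child / "_metadata").is_file():
            completed.append((int(match.group(1)), child))
    if not completed:
        return None
    # Compare counters as integers, not as strings.
    return max(completed, key=lambda pair: pair[0])[1]
```

With directories {{chk-2}} and {{chk-9}} complete and {{chk-10}} still missing its {{_metadata}} file, this returns {{chk-9}} instead of failing on {{chk-10}} as in the log above.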
> Harden resume from externalized checkpoint E2E test
> ---------------------------------------------------
>
> Key: FLINK-10856
> URL: https://issues.apache.org/jira/browse/FLINK-10856
> Project: Flink
> Issue Type: Bug
> Components: E2E Tests, State Backends, Checkpointing
> Affects Versions: 1.5.5, 1.6.2, 1.7.0
> Reporter: Till Rohrmann
> Assignee: Till Rohrmann
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.5.6, 1.6.3, 1.7.0
>
>
> The resume from externalized checkpoints E2E test can fail due to FLINK-10855. We should harden the test script so that it does not expect a single checkpoint directory to be present but instead takes the checkpoint with the highest checkpoint counter.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)