Posted to issues@flink.apache.org by "Chesnay Schepler (JIRA)" <ji...@apache.org> on 2018/11/14 12:32:00 UTC

[jira] [Reopened] (FLINK-10856) Harden resume from externalized checkpoint E2E test

     [ https://issues.apache.org/jira/browse/FLINK-10856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chesnay Schepler reopened FLINK-10856:
--------------------------------------

The latest fix introduced a new instability: the test can attempt to resume from an incomplete checkpoint (i.e. one that has no {{_metadata}} file).

We have to double-check that the checkpoint was actually completed.

https://travis-ci.org/zentol/flink/jobs/454943554

{code}
2018-11-14 12:11:45,071 ERROR org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Failed to submit job 9e7876f96c583177d183ce18b52bbd18.
java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
    at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
    at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176)
    at org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:1058)
    at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:308)
    at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
    ... 7 more
Caused by: java.io.FileNotFoundException: Cannot find meta data file '_metadata' in directory 'file:/home/travis/build/zentol/flink/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-14313753102/externalized-chckpt-e2e-backend-dir/dfcac75e697394900e5088f962130d57/chk-10'. Please try to load the checkpoint/savepoint directly from the metadata file instead of the directory.
    at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpointPointer(AbstractFsCheckpointStorage.java:256)
    at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpoint(AbstractFsCheckpointStorage.java:109)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1100)
    at org.apache.flink.runtime.jobmaster.JobMaster.tryRestoreExecutionGraphFromSavepoint(JobMaster.java:1234)
    at org.apache.flink.runtime.jobmaster.JobMaster.createAndRestoreExecutionGraph(JobMaster.java:1158)
    at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:296)
    at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:157)
    ... 10 more
{code}
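
For illustration only, here is a minimal sketch of the kind of check meant above: among the chk-<n> directories of a job, pick the one with the highest counter that also contains a _metadata file. This is not the actual fix to the test script. The function name latest_completed_checkpoint is hypothetical, and the directory layout (chk-<n> subdirectories holding a _metadata file once complete) is assumed from the log above.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming scheme, taken from the path in the log (e.g. ".../chk-10").
CHK_DIR = re.compile(r"^chk-(\d+)$")


def latest_completed_checkpoint(job_checkpoint_dir: Path) -> Optional[Path]:
    """Return the chk-<n> directory with the highest n that has a _metadata file.

    Hypothetical helper: directories without _metadata are treated as
    incomplete and skipped. Returns None if no completed checkpoint exists.
    """
    candidates = []
    for entry in job_checkpoint_dir.iterdir():
        match = CHK_DIR.match(entry.name)
        if match and entry.is_dir() and (entry / "_metadata").is_file():
            # Compare counters numerically so that chk-10 ranks above chk-9.
            candidates.append((int(match.group(1)), entry))
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]
```

With a layout like the one in the failing run, where chk-10 exists but has no _metadata file yet, this sketch would fall back to the newest earlier checkpoint that does have one instead of handing the incomplete directory to the job.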

> Harden resume from externalized checkpoint E2E test
> ---------------------------------------------------
>
>                 Key: FLINK-10856
>                 URL: https://issues.apache.org/jira/browse/FLINK-10856
>             Project: Flink
>          Issue Type: Bug
>          Components: E2E Tests, State Backends, Checkpointing
>    Affects Versions: 1.5.5, 1.6.2, 1.7.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.5.6, 1.6.3, 1.7.0
>
>
> The resume from externalized checkpoints E2E test can fail due to FLINK-10855. We should harden the test script so that it does not expect a single checkpoint directory to be present, but instead takes the checkpoint with the highest checkpoint counter.


