Posted to issues@flink.apache.org by "Thomas Weise (Jira)" <ji...@apache.org> on 2022/12/01 19:19:00 UTC

[jira] [Commented] (FLINK-30266) Recovery reconciliation loop fails if no checkpoint has been created yet

    [ https://issues.apache.org/jira/browse/FLINK-30266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17642133#comment-17642133 ] 

Thomas Weise commented on FLINK-30266:
--------------------------------------

I believe this was discussed before, and the reason we decided not to allow it was that we cannot safely determine why the HA metadata is missing. It could be missing because there was never a successful checkpoint, or because it was removed by mistake. As long as we can ensure that we don't accidentally reset a job with prior state to empty state, I would also prefer the solution that does not involve manual intervention.
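
To make the condition above concrete, here is a minimal sketch of the decision being discussed. It is not the operator's actual code: the class, the enum, and in particular the `priorStateMayExist` flag are hypothetical names introduced for illustration. How the operator could reliably compute that flag is exactly the open question.

```java
// Hypothetical sketch only; none of these names come from the Flink Kubernetes Operator.
final class LastStateRestoreSketch {

    enum Action {
        RESTORE_FROM_HA_METADATA,
        START_WITH_EMPTY_STATE,
        REQUIRE_MANUAL_RESTORE
    }

    /**
     * @param haMetadataAvailable whether HA metadata was found for the job
     * @param priorStateMayExist  assumption: true unless it is certain that the job
     *                            never completed a checkpoint
     */
    static Action decide(boolean haMetadataAvailable, boolean priorStateMayExist) {
        if (haMetadataAvailable) {
            // Normal last-state path: restore from what HA recorded.
            return Action.RESTORE_FROM_HA_METADATA;
        }
        if (priorStateMayExist) {
            // Metadata is missing but state may have existed (for example, it was
            // deleted by mistake). Starting empty here could silently drop state.
            return Action.REQUIRE_MANUAL_RESTORE;
        }
        // Metadata is missing and no checkpoint was ever created: empty start is safe.
        return Action.START_WITH_EMPTY_STATE;
    }

    public static void main(String[] args) {
        System.out.println(decide(true, true));
        System.out.println(decide(false, true));
        System.out.println(decide(false, false));
    }
}
```

The sketch defaults to the conservative outcome: unless the absence of prior state is certain, it asks for manual intervention rather than starting empty.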

> Recovery reconciliation loop fails if no checkpoint has been created yet
> ------------------------------------------------------------------------
>
>                 Key: FLINK-30266
>                 URL: https://issues.apache.org/jira/browse/FLINK-30266
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.3.0
>            Reporter: Maximilian Michels
>            Assignee: Gyula Fora
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: kubernetes-operator-1.3.0
>
>
> When the upgradeMode is LAST-STATE, the operator fails to reconcile a failed application unless at least one checkpoint has already been created. The expected behavior would be that the job starts with empty state.
> {noformat}
> 2022-12-01 10:58:35,596 o.a.f.k.o.l.AuditUtils         [INFO ] [app] >>> Status | Error   | UPGRADING       | {"type":"org.apache.flink.kubernetes.operator.exception.DeploymentFailedException","message":"HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.","additionalMetadata":{"reason":"RestoreFailed"},"throwableList":[]} {noformat}
> {noformat}
> 2022-12-01 10:44:49,480 i.j.o.p.e.ReconciliationDispatcher [ERROR] [app] Error during event processing ExecutionScope{ resource id: ResourceID{name='app', namespace='namespace'}, version: 216933301} failed.
> org.apache.flink.kubernetes.operator.exception.ReconciliationException: java.lang.RuntimeException: This indicates a bug...
> 	at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:133)
> 	at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:54)
> 	at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:136)
> 	at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:94)
> 	at org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:80)
> 	at io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:93)
> 	at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:130)
> 	at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:110)
> 	at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:81)
> 	at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:54)
> 	at io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:406)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> 	at java.base/java.lang.Thread.run(Unknown Source)
> Caused by: java.lang.RuntimeException: This indicates a bug...
> 	at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:180)
> 	at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:61)
> 	at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.restoreJob(AbstractJobReconciler.java:212)
> 	at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.reconcileSpecChange(AbstractJobReconciler.java:144)
> 	at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:167)
> 	at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:64)
> 	at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:123)
> 	... 13 more {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)