Posted to issues@flink.apache.org by "liad shachoach (Jira)" <ji...@apache.org> on 2022/10/18 10:23:00 UTC
[jira] [Resolved] (FLINK-29620) Flink deployment stuck in UPGRADING state when changing configuration
[ https://issues.apache.org/jira/browse/FLINK-29620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liad shachoach resolved FLINK-29620.
------------------------------------
Fix Version/s: 1.15.0
Resolution: Fixed
> Flink deployment stuck in UPGRADING state when changing configuration
> ---------------------------------------------------------------------
>
> Key: FLINK-29620
> URL: https://issues.apache.org/jira/browse/FLINK-29620
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: 1.14.2
> Environment: AWS EKS v1.21
> Operator version: 1.1.0
> Reporter: liad shachoach
> Priority: Major
> Fix For: 1.15.0
>
>
> When I update the configuration of a Flink deployment, I observe one of two scenarios:
> Success:
> This happens when the job has not started yet, i.e. if I change the configuration quickly enough:
> {code:java}
> 2022-10-13 06:50:54,336 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][load-streaming/validator-process-124] Upgrading/Restarting running job, suspending first...
> 2022-10-13 06:50:54,343 o.a.f.k.o.r.d.ApplicationReconciler [INFO ][load-streaming/validator-process-124] Job is not running but HA metadata is available for last state restore, ready for upgrade
> 2022-10-13 06:50:54,353 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Deleting JobManager deployment while preserving HA metadata.
> 2022-10-13 06:50:58,415 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (5s)
> 2022-10-13 06:51:03,451 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (10s)
> 2022-10-13 06:51:06,469 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Cluster shutdown completed.
> 2022-10-13 06:51:06,470 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] End of reconciliation
> 2022-10-13 06:51:06,493 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] Starting reconciliation
> 2022-10-13 06:51:06,494 o.a.f.k.o.c.FlinkConfigManager [INFO ][load-streaming/validator-process-124] Generating new config
> {code}
> In this scenario I see that the job manager and task manager pods are terminated and then recreated.
>
>
> Failure:
> This happens when I let the job start (by waiting more than 30-60 seconds) and then change the configuration:
> {code:java}
> 2022-10-13 06:53:06,637 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][load-streaming/validator-process-124] Upgrading/Restarting running job, suspending first...
> 2022-10-13 06:53:06,637 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][load-streaming/validator-process-124] Job is in running state, ready for upgrade with SAVEPOINT
> 2022-10-13 06:53:06,659 o.a.f.k.o.s.FlinkService [INFO ][load-streaming/validator-process-124] Suspending job with savepoint.
> 2022-10-13 06:53:07,042 o.a.f.k.o.s.FlinkService [INFO ][load-streaming/validator-process-124] Job successfully suspended with savepoint s3://cu-flink-load-checkpoints-us-east-1/validator-process-124/savepoints/savepoint-000000-947975b509b2.
> 2022-10-13 06:53:11,111 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (5s)
> 2022-10-13 06:53:16,176 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (10s)
> 2022-10-13 06:53:21,238 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (15s)
> 2022-10-13 06:53:26,293 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (20s)
> 2022-10-13 06:53:31,355 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (25s)
> 2022-10-13 06:53:36,412 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (30s)
> 2022-10-13 06:53:41,512 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (35s)
> 2022-10-13 06:53:46,568 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (40s)
> 2022-10-13 06:53:51,625 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (45s)
> 2022-10-13 06:53:56,740 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (50s)
> 2022-10-13 06:54:01,811 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (55s)
> 2022-10-13 06:54:06,866 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (60s)
> 2022-10-13 06:54:07,866 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Cluster shutdown completed.
> 2022-10-13 06:54:07,866 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] End of reconciliation
> 2022-10-13 06:54:07,894 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] Starting reconciliation
> 2022-10-13 06:54:07,894 o.a.f.k.o.o.d.ApplicationObserver [WARN ][load-streaming/validator-process-124] Running deployment generation 3 doesn't match upgrade target generation 4.
> 2022-10-13 06:54:07,895 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][load-streaming/validator-process-124] Detected spec change, starting reconciliation.
> 2022-10-13 06:54:07,941 o.a.f.k.o.s.FlinkService [INFO ][load-streaming/validator-process-124] Deploying application cluster
> 2022-10-13 06:54:07,947 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Job graph in ConfigMap validator-process-124-dispatcher-leader is deleted
> 2022-10-13 06:54:08,029 o.a.f.c.d.a.c.ApplicationClusterDeployer [INFO ][load-streaming/validator-process-124] Submitting application in 'Application Mode'.
> 2022-10-13 06:54:08,031 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO ][load-streaming/validator-process-124] The derived from fraction jvm overhead memory (102.400mb (107374184 bytes)) is less than its min value 192.000mb (201326592 bytes), min value will be used instead
> 2022-10-13 06:54:08,087 o.a.f.k.o.r.ReconciliationUtils [WARN ][load-streaming/validator-process-124] Attempt count: 0, last attempt: false
> 2022-10-13 06:54:08,111 i.j.o.p.e.ReconciliationDispatcher [ERROR][load-streaming/validator-process-124] Error during event processing ExecutionScope{ resource id: ResourceID{name='validator-process-124', namespace='load-streaming'}, version: 1116792084} failed.
> org.apache.flink.kubernetes.operator.exception.ReconciliationException: org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster validator-process-124 already exists.
> at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:119)
> at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:54)
> at io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:201)
> at io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:153)
> at org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:83)
> at io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:152)
> at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:135)
> at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:115)
> at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:86)
> at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:59)
> at io.javaoperatorsdk.operator.processing.event.EventProcessor$ControllerExecution.run(EventProcessor.java:390)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.base/java.lang.Thread.run(Unknown Source)
> Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster validator-process-124 already exists.
> at org.apache.flink.kubernetes.KubernetesClusterDescriptor.deployApplicationCluster(KubernetesClusterDescriptor.java:181)
> at org.apache.flink.client.deployment.application.cli.ApplicationClusterDeployer.run(ApplicationClusterDeployer.java:67)
> at org.apache.flink.kubernetes.operator.service.FlinkService.submitApplicationCluster(FlinkService.java:200)
> at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:155)
> at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:52)
> at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.restoreJob(AbstractJobReconciler.java:188)
> at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.reconcileSpecChange(AbstractJobReconciler.java:122)
> at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:145)
> at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:55)
> at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:115)
> ... 13 more {code}
> In this scenario I see that the job manager pod is restarted (not recreated), the task manager pods are not updated, and the Flink config maps are not updated.
> The Flink deployment state changes to UPGRADING and the above exception is repeated.
> Error reported in the Flink deployment: org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster validator-process-124 already exists.
> Job Manager Deployment Status: MISSING
>
> Flink deployment spec:
>
> {code:java}
> flinkVersion: v1_14
> job:
>   allowNonRestoredState: true
>   args: ...
>   entryClass: ...
>   jarURI: ...
>   parallelism: x
>   savepointTriggerNonce: 0
>   state: running
>   upgradeMode: savepoint
> jobManager:
>   podTemplate:
>     apiVersion: v1
>     kind: Pod
>     metadata:
>       annotations:
>         configmap.reloader.stakater.com/reload: flink-config-validator-process-124,pod-template-validator-process-124
>     spec:
>       affinity:
>         nodeAffinity:
>           requiredDuringSchedulingIgnoredDuringExecution:
>             nodeSelectorTerms:
>             - matchExpressions:
>               - key: nodeType
>                 operator: In
>                 values:
>                 - someValue
>       containers:
>       - name: flink-main-container
>         resources:
>           limits:
>             cpu: "1"
>             memory: 1.6Gi
>           requests:
>             cpu: "0.2"
>             memory: 1Gi
>       tolerations:
>       - effect: NoSchedule
>         key: someValue
>         value: "true"
>   replicas: 1
> podTemplate:
>   apiVersion: v1
>   kind: Pod
>   metadata:
>     annotations:
>       configmap.reloader.stakater.com/reload: flink-config-validator-process-124,pod-template-validator-process-124
>       prometheus.io/path: /metrics
>       prometheus.io/port: "9260"
>       prometheus.io/scrape: "true"
>     labels:
>       app.kubernetes.io/instance: flink-validator-process-124
>       app.kubernetes.io/managed-by: Helm
>       app.kubernetes.io/name: apache-flink
>       app.kubernetes.io/version: test
>       helm.sh/chart: apache-flink-1.0.0
>   spec:
>     containers: []
>     imagePullSecrets: []
> serviceAccount: validator-process-124
> taskManager:
>   podTemplate:
>     apiVersion: v1
>     kind: Pod
>     metadata:
>       annotations:
>         configmap.reloader.stakater.com/reload: flink-config-validator-process-124,pod-template-validator-process-124
>     spec:
>       affinity:
>         nodeAffinity:
>           requiredDuringSchedulingIgnoredDuringExecution:
>             nodeSelectorTerms:
>             - matchExpressions:
>               - key: nodeType
>                 operator: In
>                 values:
>                 - someValue
>       containers:
>       - name: flink-main-container
>         resources:
>           limits:
>             cpu: "1"
>             memory: 3.6Gi
>           requests:
>             cpu: "0.2"
>             memory: 3Gi
>       tolerations:
>       - effect: NoSchedule
>         key: someValue
>         value: "true"
> {code}
>
> Please let me know if more details are required.
>
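Editorial note on the failure log above: the "Waiting for cluster shutdown..." messages run up to (60s), "Cluster shutdown completed." follows one second later, and the next deploy then fails with "already exists". One possible reading (an inference from the log, not something the report confirms) is that the wait ended because a time limit was reached, not because the old cluster was confirmed gone. The following is a minimal Python sketch of a wait loop that keeps those two outcomes apart. It is not the operator's code; wait_for_shutdown, cluster_exists and the timeout parameters are hypothetical names chosen for illustration.

```python
import time
from typing import Callable


def wait_for_shutdown(
    cluster_exists: Callable[[], bool],
    timeout_s: float,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the cluster is gone or the timeout expires.

    Returns True only if cluster_exists() reported False, i.e. the
    shutdown was actually observed. Returns False on timeout, so the
    caller can avoid redeploying on top of resources that still exist.
    All names here are hypothetical and only illustrate the idea.
    """
    deadline = clock() + timeout_s
    while cluster_exists():
        if clock() >= deadline:
            # Timed out: the cluster is still there, do not report success.
            return False
        sleep(poll_s)
    return True
```

A caller would redeploy only when the function returns True, and otherwise retry the wait or surface an error, which would avoid attempting a deployment while the previous one is still present.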
--
This message was sent by Atlassian Jira
(v8.20.10#820010)