Posted to issues@flink.apache.org by "liad shachoach (Jira)" <ji...@apache.org> on 2022/10/18 10:23:00 UTC

[jira] [Resolved] (FLINK-29620) Flink deployment stuck in UPGRADING state when changing configuration

     [ https://issues.apache.org/jira/browse/FLINK-29620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liad shachoach resolved FLINK-29620.
------------------------------------
    Fix Version/s: 1.15.0
       Resolution: Fixed

> Flink deployment stuck in UPGRADING state when changing configuration
> ---------------------------------------------------------------------
>
>                 Key: FLINK-29620
>                 URL: https://issues.apache.org/jira/browse/FLINK-29620
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: 1.14.2
>         Environment: AWS EKS v1.21
> Operator version: 1.1.0
>            Reporter: liad shachoach
>            Priority: Major
>             Fix For: 1.15.0
>
>
> When I update the configuration of a Flink deployment, I observe one of two scenarios:
> Success:
> This happens when the job has not yet started, i.e. if I change the configuration quickly enough:
> {code:java}
> 2022-10-13 06:50:54,336 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][load-streaming/validator-process-124] Upgrading/Restarting running job, suspending first...
> 2022-10-13 06:50:54,343 o.a.f.k.o.r.d.ApplicationReconciler [INFO ][load-streaming/validator-process-124] Job is not running but HA metadata is available for last state restore, ready for upgrade
> 2022-10-13 06:50:54,353 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Deleting JobManager deployment while preserving HA metadata.
> 2022-10-13 06:50:58,415 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (5s)
> 2022-10-13 06:51:03,451 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (10s)
> 2022-10-13 06:51:06,469 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Cluster shutdown completed.
> 2022-10-13 06:51:06,470 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] End of reconciliation
> 2022-10-13 06:51:06,493 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] Starting reconciliation
> 2022-10-13 06:51:06,494 o.a.f.k.o.c.FlinkConfigManager [INFO ][load-streaming/validator-process-124] Generating new config
>  {code}
> In this scenario I see that the job manager and task manager pods are terminated and then recreated.
>
> Failure:
> This happens when I let the job start (wait more than 30-60 seconds) before changing the configuration:
> {code:java}
> 2022-10-13 06:53:06,637 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][load-streaming/validator-process-124] Upgrading/Restarting running job, suspending first...
> 2022-10-13 06:53:06,637 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][load-streaming/validator-process-124] Job is in running state, ready for upgrade with SAVEPOINT
> 2022-10-13 06:53:06,659 o.a.f.k.o.s.FlinkService       [INFO ][load-streaming/validator-process-124] Suspending job with savepoint.
> 2022-10-13 06:53:07,042 o.a.f.k.o.s.FlinkService       [INFO ][load-streaming/validator-process-124] Job successfully suspended with savepoint s3://cu-flink-load-checkpoints-us-east-1/validator-process-124/savepoints/savepoint-000000-947975b509b2.
> 2022-10-13 06:53:11,111 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (5s)
> 2022-10-13 06:53:16,176 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (10s)
> 2022-10-13 06:53:21,238 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (15s)
> 2022-10-13 06:53:26,293 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (20s)
> 2022-10-13 06:53:31,355 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (25s)
> 2022-10-13 06:53:36,412 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (30s)
> 2022-10-13 06:53:41,512 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (35s)
> 2022-10-13 06:53:46,568 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (40s)
> 2022-10-13 06:53:51,625 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (45s)
> 2022-10-13 06:53:56,740 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (50s)
> 2022-10-13 06:54:01,811 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (55s)
> 2022-10-13 06:54:06,866 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (60s)
> 2022-10-13 06:54:07,866 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Cluster shutdown completed.
> 2022-10-13 06:54:07,866 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] End of reconciliation
> 2022-10-13 06:54:07,894 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] Starting reconciliation
> 2022-10-13 06:54:07,894 o.a.f.k.o.o.d.ApplicationObserver [WARN ][load-streaming/validator-process-124] Running deployment generation 3 doesn't match upgrade target generation 4.
> 2022-10-13 06:54:07,895 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][load-streaming/validator-process-124] Detected spec change, starting reconciliation.
> 2022-10-13 06:54:07,941 o.a.f.k.o.s.FlinkService       [INFO ][load-streaming/validator-process-124] Deploying application cluster
> 2022-10-13 06:54:07,947 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Job graph in ConfigMap validator-process-124-dispatcher-leader is deleted
> 2022-10-13 06:54:08,029 o.a.f.c.d.a.c.ApplicationClusterDeployer [INFO ][load-streaming/validator-process-124] Submitting application in 'Application Mode'.
> 2022-10-13 06:54:08,031 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO ][load-streaming/validator-process-124] The derived from fraction jvm overhead memory (102.400mb (107374184 bytes)) is less than its min value 192.000mb (201326592 bytes), min value will be used instead
> 2022-10-13 06:54:08,087 o.a.f.k.o.r.ReconciliationUtils [WARN ][load-streaming/validator-process-124] Attempt count: 0, last attempt: false
> 2022-10-13 06:54:08,111 i.j.o.p.e.ReconciliationDispatcher [ERROR][load-streaming/validator-process-124] Error during event processing ExecutionScope{ resource id: ResourceID{name='validator-process-124', namespace='load-streaming'}, version: 1116792084} failed.
> org.apache.flink.kubernetes.operator.exception.ReconciliationException: org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster validator-process-124 already exists.
>     at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:119)
>     at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:54)
>     at io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:201)
>     at io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:153)
>     at org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:83)
>     at io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:152)
>     at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:135)
>     at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:115)
>     at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:86)
>     at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:59)
>     at io.javaoperatorsdk.operator.processing.event.EventProcessor$ControllerExecution.run(EventProcessor.java:390)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.base/java.lang.Thread.run(Unknown Source)
> Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster validator-process-124 already exists.
>     at org.apache.flink.kubernetes.KubernetesClusterDescriptor.deployApplicationCluster(KubernetesClusterDescriptor.java:181)
>     at org.apache.flink.client.deployment.application.cli.ApplicationClusterDeployer.run(ApplicationClusterDeployer.java:67)
>     at org.apache.flink.kubernetes.operator.service.FlinkService.submitApplicationCluster(FlinkService.java:200)
>     at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:155)
>     at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:52)
>     at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.restoreJob(AbstractJobReconciler.java:188)
>     at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.reconcileSpecChange(AbstractJobReconciler.java:122)
>     at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:145)
>     at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:55)
>     at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:115)
>     ... 13 more {code}
> In this scenario I see that the job manager pod is restarted (not recreated), while the task manager pods and the Flink ConfigMaps are not updated.
> The Flink deployment state changes to UPGRADING and the above exception repeats. The error reported in the FlinkDeployment status is:
> org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster validator-process-124 already exists.
> Job Manager Deployment Status: MISSING
>  
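> For what it's worth, the stuck state is easy to spot in the operator log: the generation-mismatch warning is followed by the repeated "already exists" error. A minimal sketch of how I filter for that signature (the sample lines below are abbreviated copies of the log above, not live cluster output):

```shell
# Hedged sketch, not from the report itself: detect the stuck-upgrade
# signature in captured operator log text. The $log value here is an
# abbreviated sample of the lines quoted above.
log='Running deployment generation 3 does not match upgrade target generation 4.
ClusterDeploymentException: The Flink cluster validator-process-124 already exists.'

if printf '%s\n' "$log" | grep -q 'already exists'; then
  echo "stuck-upgrade"    # prints "stuck-upgrade" for the failing case
else
  echo "clean-upgrade"
fi
```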
> Flink deployment spec:
>  
> {code:java}
> flinkVersion: v1_14 
> job:
>     allowNonRestoredState: true
>     args: ...
>     entryClass: ...
>     jarURI: ...
>     parallelism: x
>     savepointTriggerNonce: 0
>     state: running
>     upgradeMode: savepoint
> jobManager:
>     podTemplate:
>       apiVersion: v1
>       kind: Pod
>       metadata:
>         annotations:
>           configmap.reloader.stakater.com/reload: flink-config-validator-process-124,pod-template-validator-process-124
>       spec:
>         affinity:
>           nodeAffinity:
>             requiredDuringSchedulingIgnoredDuringExecution:
>               nodeSelectorTerms:
>               - matchExpressions:
>                 - key: nodeType
>                   operator: In
>                   values:
>                   - someValue
>         containers:
>         - name: flink-main-container
>           resources:
>             limits:
>               cpu: "1"
>               memory: 1.6Gi
>             requests:
>               cpu: "0.2"
>               memory: 1Gi
>         tolerations:
>         - effect: NoSchedule
>           key: someValue
>           value: "true"
>     replicas: 1
> podTemplate:
>     apiVersion: v1
>     kind: Pod
>     metadata:
>       annotations:
>         configmap.reloader.stakater.com/reload: flink-config-validator-process-124,pod-template-validator-process-124
>         prometheus.io/path: /metrics
>         prometheus.io/port: "9260"
>         prometheus.io/scrape: "true"
>       labels:
>         app.kubernetes.io/instance: flink-validator-process-124
>         app.kubernetes.io/managed-by: Helm
>         app.kubernetes.io/name: apache-flink
>         app.kubernetes.io/version: test
>         helm.sh/chart: apache-flink-1.0.0
>     spec:
>       containers: []
>       imagePullSecrets: []
> serviceAccount: validator-process-124
> taskManager:
>     podTemplate:
>       apiVersion: v1
>       kind: Pod
>       metadata:
>         annotations:
>           configmap.reloader.stakater.com/reload: flink-config-validator-process-124,pod-template-validator-process-124
>       spec:
>         affinity:
>           nodeAffinity:
>             requiredDuringSchedulingIgnoredDuringExecution:
>               nodeSelectorTerms:
>               - matchExpressions:
>                 - key: nodeType
>                   operator: In
>                   values:
>                   - someValue
>         containers:
>         - name: flink-main-container
>           resources:
>             limits:
>               cpu: "1"
>               memory: 3.6Gi
>             requests:
>               cpu: "0.2"
>               memory: 3Gi
>         tolerations:
>         - effect: NoSchedule
>           key: someValue
>           value: "true"{code}
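> The change that triggers the upgrade is simply an edit to the spec above; for example (the key and value here are hypothetical, the actual fields I changed are not listed):

```yaml
# Hypothetical illustration only: any changed spec field forces the operator
# to run the upgrade path, e.g. bumping a flinkConfiguration entry.
spec:
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
```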
>  
> Please let me know if more details are required.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)