Posted to issues@flink.apache.org by "liad shachoach (Jira)" <ji...@apache.org> on 2022/10/13 10:32:00 UTC

[jira] [Updated] (FLINK-29620) Flink deployment stuck in UPGRADING state when changing configuration

     [ https://issues.apache.org/jira/browse/FLINK-29620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liad shachoach updated FLINK-29620:
-----------------------------------
    Description: 
When I update the configuration of a Flink deployment, I observe one of two scenarios:

Success:

This happens when the job has not started yet (if I change the configuration quickly enough):
{code:java}
2022-10-13 06:50:54,336 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][load-streaming/validator-process-124] Upgrading/Restarting running job, suspending first...
2022-10-13 06:50:54,343 o.a.f.k.o.r.d.ApplicationReconciler [INFO ][load-streaming/validator-process-124] Job is not running but HA metadata is available for last state restore, ready for upgrade
2022-10-13 06:50:54,353 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Deleting JobManager deployment while preserving HA metadata.
2022-10-13 06:50:58,415 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (5s)
2022-10-13 06:51:03,451 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (10s)
2022-10-13 06:51:06,469 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Cluster shutdown completed.
2022-10-13 06:51:06,470 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] End of reconciliation
2022-10-13 06:51:06,493 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] Starting reconciliation
2022-10-13 06:51:06,494 o.a.f.k.o.c.FlinkConfigManager [INFO ][load-streaming/validator-process-124] Generating new config
 {code}
In this scenario I see that the job manager and task manager pods are terminated and then recreated.

 

 

Failure:

This happens when I let the job start (waiting more than 30-60 seconds) and then change the configuration:
{code:java}
2022-10-13 06:53:06,637 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][load-streaming/validator-process-124] Upgrading/Restarting running job, suspending first...
2022-10-13 06:53:06,637 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][load-streaming/validator-process-124] Job is in running state, ready for upgrade with SAVEPOINT
2022-10-13 06:53:06,659 o.a.f.k.o.s.FlinkService       [INFO ][load-streaming/validator-process-124] Suspending job with savepoint.
2022-10-13 06:53:07,042 o.a.f.k.o.s.FlinkService       [INFO ][load-streaming/validator-process-124] Job successfully suspended with savepoint s3://cu-flink-load-checkpoints-us-east-1/validator-process-124/savepoints/savepoint-000000-947975b509b2.
2022-10-13 06:53:11,111 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (5s)
2022-10-13 06:53:16,176 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (10s)
2022-10-13 06:53:21,238 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (15s)
2022-10-13 06:53:26,293 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (20s)
2022-10-13 06:53:31,355 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (25s)
2022-10-13 06:53:36,412 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (30s)
2022-10-13 06:53:41,512 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (35s)
2022-10-13 06:53:46,568 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (40s)
2022-10-13 06:53:51,625 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (45s)
2022-10-13 06:53:56,740 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (50s)
2022-10-13 06:54:01,811 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (55s)
2022-10-13 06:54:06,866 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (60s)
2022-10-13 06:54:07,866 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Cluster shutdown completed.
2022-10-13 06:54:07,866 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] End of reconciliation
2022-10-13 06:54:07,894 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] Starting reconciliation
2022-10-13 06:54:07,894 o.a.f.k.o.o.d.ApplicationObserver [WARN ][load-streaming/validator-process-124] Running deployment generation 3 doesn't match upgrade target generation 4.
2022-10-13 06:54:07,895 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][load-streaming/validator-process-124] Detected spec change, starting reconciliation.
2022-10-13 06:54:07,941 o.a.f.k.o.s.FlinkService       [INFO ][load-streaming/validator-process-124] Deploying application cluster
2022-10-13 06:54:07,947 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Job graph in ConfigMap validator-process-124-dispatcher-leader is deleted
2022-10-13 06:54:08,029 o.a.f.c.d.a.c.ApplicationClusterDeployer [INFO ][load-streaming/validator-process-124] Submitting application in 'Application Mode'.
2022-10-13 06:54:08,031 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO ][load-streaming/validator-process-124] The derived from fraction jvm overhead memory (102.400mb (107374184 bytes)) is less than its min value 192.000mb (201326592 bytes), min value will be used instead
2022-10-13 06:54:08,087 o.a.f.k.o.r.ReconciliationUtils [WARN ][load-streaming/validator-process-124] Attempt count: 0, last attempt: false
2022-10-13 06:54:08,111 i.j.o.p.e.ReconciliationDispatcher [ERROR][load-streaming/validator-process-124] Error during event processing ExecutionScope{ resource id: ResourceID{name='validator-process-124', namespace='load-streaming'}, version: 1116792084} failed.
org.apache.flink.kubernetes.operator.exception.ReconciliationException: org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster validator-process-124 already exists.
    at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:119)
    at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:54)
    at io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:201)
    at io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:153)
    at org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:83)
    at io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:152)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:135)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:115)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:86)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:59)
    at io.javaoperatorsdk.operator.processing.event.EventProcessor$ControllerExecution.run(EventProcessor.java:390)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster validator-process-124 already exists.
    at org.apache.flink.kubernetes.KubernetesClusterDescriptor.deployApplicationCluster(KubernetesClusterDescriptor.java:181)
    at org.apache.flink.client.deployment.application.cli.ApplicationClusterDeployer.run(ApplicationClusterDeployer.java:67)
    at org.apache.flink.kubernetes.operator.service.FlinkService.submitApplicationCluster(FlinkService.java:200)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:155)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:52)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.restoreJob(AbstractJobReconciler.java:188)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.reconcileSpecChange(AbstractJobReconciler.java:122)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:145)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:55)
    at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:115)
    ... 13 more {code}
In this scenario I see that the job manager pod is restarted (not recreated), while the task manager pods and the Flink config maps are not updated.

The Flink deployment state changes to UPGRADING and the above exception is repeated.

error in flink deployment: org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster validator-process-124 already exists.
Job Manager Deployment Status:  MISSING
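
To compare the two scenarios from saved operator logs, a small script along the following lines may help. This is only a sketch: it assumes the log layout shown in the excerpts above, and the names `summarize` and `FAILURE_SAMPLE` are illustrative and not part of the operator. It reports the longest shutdown wait that was logged, whether shutdown completion was logged, and whether the "already exists" error appeared.

```python
import re

# Matches the operator log layout seen in this report (an assumption):
# "<date> <time>,<ms> <logger> [<LEVEL>][<namespace>/<name>] <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<logger>\S+)\s+\[(?P<level>[A-Z]+)\s*\]\[(?P<resource>[^\]]+)\] "
    r"(?P<message>.*)$"
)
WAIT_RE = re.compile(r"Waiting for cluster shutdown\.\.\. \((\d+)s\)")


def summarize(lines):
    """Summarize shutdown-related events found in operator log lines."""
    summary = {
        "longest_wait_s": 0,
        "shutdown_completed": False,
        "already_exists": False,
    }
    for line in lines:
        # The exception text appears on lines without a timestamp prefix.
        if "already exists" in line:
            summary["already_exists"] = True
        match = LINE_RE.match(line)
        if match is None:
            continue  # stack-trace lines do not match the log layout
        message = match.group("message")
        wait = WAIT_RE.search(message)
        if wait:
            summary["longest_wait_s"] = max(
                summary["longest_wait_s"], int(wait.group(1))
            )
        elif message == "Cluster shutdown completed.":
            summary["shutdown_completed"] = True
    return summary


# A few lines copied from the failure log above.
FAILURE_SAMPLE = [
    "2022-10-13 06:54:01,811 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (55s)",
    "2022-10-13 06:54:06,866 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (60s)",
    "2022-10-13 06:54:07,866 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Cluster shutdown completed.",
    "org.apache.flink.kubernetes.operator.exception.ReconciliationException: org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster validator-process-124 already exists.",
]

if __name__ == "__main__":
    print(summarize(FAILURE_SAMPLE))
```

Run against the full logs, the success case reports a longest wait of 10s, while the failure case reports 60s followed one second later by the completion message. The script only surfaces that difference; it does not establish the cause.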

 

Flink deployment spec:

 
{code:java}
flinkVersion: v1_14 
job:
    allowNonRestoredState: true
    args: ...
    entryClass: ...
    jarURI: ...
    parallelism: x
    savepointTriggerNonce: 0
    state: running
    upgradeMode: savepoint
jobManager:
    podTemplate:
      apiVersion: v1
      kind: Pod
      metadata:
        annotations:
          configmap.reloader.stakater.com/reload: flink-config-validator-process-124,pod-template-validator-process-124
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: nodeType
                  operator: In
                  values:
                  - someValue
        containers:
        - name: flink-main-container
          resources:
            limits:
              cpu: "1"
              memory: 1.6Gi
            requests:
              cpu: "0.2"
              memory: 1Gi
        tolerations:
        - effect: NoSchedule
          key: someValue
          value: "true"
    replicas: 1
podTemplate:
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        configmap.reloader.stakater.com/reload: flink-config-validator-process-124,pod-template-validator-process-124
        prometheus.io/path: /metrics
        prometheus.io/port: "9260"
        prometheus.io/scrape: "true"
      labels:
        app.kubernetes.io/instance: flink-validator-process-124
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: apache-flink
        app.kubernetes.io/version: test
        helm.sh/chart: apache-flink-1.0.0
    spec:
      containers: []
      imagePullSecrets: []
serviceAccount: validator-process-124
taskManager:
    podTemplate:
      apiVersion: v1
      kind: Pod
      metadata:
        annotations:
          configmap.reloader.stakater.com/reload: flink-config-validator-process-124,pod-template-validator-process-124
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: nodeType
                  operator: In
                  values:
                  - someValue
        containers:
        - name: flink-main-container
          resources:
            limits:
              cpu: "1"
              memory: 3.6Gi
            requests:
              cpu: "0.2"
              memory: 3Gi
        tolerations:
        - effect: NoSchedule
          key: someValue
          value: "true"{code}
 

Please let me know if more details are required.

 


> Flink deployment stuck in UPGRADING state when changing configuration
> ---------------------------------------------------------------------
>
>                 Key: FLINK-29620
>                 URL: https://issues.apache.org/jira/browse/FLINK-29620
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: 1.14.2
>         Environment: AWS EKS v1.21
> Operator version: 1.1.0
>            Reporter: liad shachoach
>            Priority: Major
>
> When I update the configuration of a flink deployment I observe one of two scenarios:
> Success:
> This happens when the job has not started - if I change the configuration quick enough:
> {code:java}
> 2022-10-13 06:50:54,336 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][load-streaming/validator-process-124] Upgrading/Restarting running job, suspending first...
> 2022-10-13 06:50:54,343 o.a.f.k.o.r.d.ApplicationReconciler [INFO ][load-streaming/validator-process-124] Job is not running but HA metadata is available for last state restore, ready for upgrade
> 2022-10-13 06:50:54,353 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Deleting JobManager deployment while preserving HA metadata.
> 2022-10-13 06:50:58,415 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (5s)
> 2022-10-13 06:51:03,451 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (10s)
> 2022-10-13 06:51:06,469 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Cluster shutdown completed.
> 2022-10-13 06:51:06,470 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] End of reconciliation
> 2022-10-13 06:51:06,493 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] Starting reconciliation
> 2022-10-13 06:51:06,494 o.a.f.k.o.c.FlinkConfigManager [INFO ][load-streaming/validator-process-124] Generating new config
>  {code}
> In this scenario I see that the job manager and task manager pods are terminated and then recreated.
>  
>  
> Failure:
> This happens when I let the job start (wait more than 30-60 seconds) and change the configuration:
> {code:java}
> 2022-10-13 06:53:06,637 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][load-streaming/validator-process-124] Upgrading/Restarting running job, suspending first...
> 2022-10-13 06:53:06,637 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][load-streaming/validator-process-124] Job is in running state, ready for upgrade with SAVEPOINT
> 2022-10-13 06:53:06,659 o.a.f.k.o.s.FlinkService       [INFO ][load-streaming/validator-process-124] Suspending job with savepoint.
> 2022-10-13 06:53:07,042 o.a.f.k.o.s.FlinkService       [INFO ][load-streaming/validator-process-124] Job successfully suspended with savepoint s3://cu-flink-load-checkpoints-us-east-1/validator-process-124/savepoints/savepoint-000000-947975b509b2.
> 2022-10-13 06:53:11,111 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (5s)
> 2022-10-13 06:53:16,176 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (10s)
> 2022-10-13 06:53:21,238 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (15s)
> 2022-10-13 06:53:26,293 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (20s)
> 2022-10-13 06:53:31,355 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (25s)
> 2022-10-13 06:53:36,412 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (30s)
> 2022-10-13 06:53:41,512 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (35s)
> 2022-10-13 06:53:46,568 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (40s)
> 2022-10-13 06:53:51,625 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (45s)
> 2022-10-13 06:53:56,740 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (50s)
> 2022-10-13 06:54:01,811 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (55s)
> 2022-10-13 06:54:06,866 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (60s)
> 2022-10-13 06:54:07,866 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Cluster shutdown completed.
> 2022-10-13 06:54:07,866 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] End of reconciliation
> 2022-10-13 06:54:07,894 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] Starting reconciliation
> 2022-10-13 06:54:07,894 o.a.f.k.o.o.d.ApplicationObserver [WARN ][load-streaming/validator-process-124] Running deployment generation 3 doesn't match upgrade target generation 4.
> 2022-10-13 06:54:07,895 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][load-streaming/validator-process-124] Detected spec change, starting reconciliation.
> 2022-10-13 06:54:07,941 o.a.f.k.o.s.FlinkService       [INFO ][load-streaming/validator-process-124] Deploying application cluster
> 2022-10-13 06:54:07,947 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Job graph in ConfigMap validator-process-124-dispatcher-leader is deleted
> 2022-10-13 06:54:08,029 o.a.f.c.d.a.c.ApplicationClusterDeployer [INFO ][load-streaming/validator-process-124] Submitting application in 'Application Mode'.
> 2022-10-13 06:54:08,031 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO ][load-streaming/validator-process-124] The derived from fraction jvm overhead memory (102.400mb (107374184 bytes)) is less than its min value 192.000mb (201326592 bytes), min value will be used instead
> 2022-10-13 06:54:08,087 o.a.f.k.o.r.ReconciliationUtils [WARN ][load-streaming/validator-process-124] Attempt count: 0, last attempt: false
> 2022-10-13 06:54:08,111 i.j.o.p.e.ReconciliationDispatcher [ERROR][load-streaming/validator-process-124] Error during event processing ExecutionScope{ resource id: ResourceID{name='validator-process-124', namespace='load-streaming'}, version: 1116792084} failed.
> org.apache.flink.kubernetes.operator.exception.ReconciliationException: org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster validator-process-124 already exists.
>     at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:119)
>     at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:54)
>     at io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:201)
>     at io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:153)
>     at org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:83)
>     at io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:152)
>     at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:135)
>     at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:115)
>     at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:86)
>     at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:59)
>     at io.javaoperatorsdk.operator.processing.event.EventProcessor$ControllerExecution.run(EventProcessor.java:390)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.base/java.lang.Thread.run(Unknown Source)
> Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster validator-process-124 already exists.
>     at org.apache.flink.kubernetes.KubernetesClusterDescriptor.deployApplicationCluster(KubernetesClusterDescriptor.java:181)
>     at org.apache.flink.client.deployment.application.cli.ApplicationClusterDeployer.run(ApplicationClusterDeployer.java:67)
>     at org.apache.flink.kubernetes.operator.service.FlinkService.submitApplicationCluster(FlinkService.java:200)
>     at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:155)
>     at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:52)
>     at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.restoreJob(AbstractJobReconciler.java:188)
>     at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.reconcileSpecChange(AbstractJobReconciler.java:122)
>     at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:145)
>     at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:55)
>     at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:115)
>     ... 13 more {code}
> In this scenario the JobManager pod is restarted (not recreated), and neither the TaskManager pods nor the Flink ConfigMaps are updated.
> The FlinkDeployment state changes to UPGRADING and the above exception is repeated.
> Error in the FlinkDeployment: org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster validator-process-124 already exists.
> Job Manager Deployment Status:  MISSING
>  
> Flink deployment spec:
>  
> {code:java}
> flinkVersion: v1_14 
> job:
>     allowNonRestoredState: true
>     args: ...
>     entryClass: ...
>     jarURI: ...
>     parallelism: x
>     savepointTriggerNonce: 0
>     state: running
>     upgradeMode: savepoint
> jobManager:
>     podTemplate:
>       apiVersion: v1
>       kind: Pod
>       metadata:
>         annotations:
>           configmap.reloader.stakater.com/reload: flink-config-validator-process-124,pod-template-validator-process-124
>       spec:
>         affinity:
>           nodeAffinity:
>             requiredDuringSchedulingIgnoredDuringExecution:
>               nodeSelectorTerms:
>               - matchExpressions:
>                 - key: nodeType
>                   operator: In
>                   values:
>                   - someValue
>         containers:
>         - name: flink-main-container
>           resources:
>             limits:
>               cpu: "1"
>               memory: 1.6Gi
>             requests:
>               cpu: "0.2"
>               memory: 1Gi
>         tolerations:
>         - effect: NoSchedule
>           key: someValue
>           value: "true"
>     replicas: 1
> podTemplate:
>     apiVersion: v1
>     kind: Pod
>     metadata:
>       annotations:
>         configmap.reloader.stakater.com/reload: flink-config-validator-process-124,pod-template-validator-process-124
>         prometheus.io/path: /metrics
>         prometheus.io/port: "9260"
>         prometheus.io/scrape: "true"
>       labels:
>         app.kubernetes.io/instance: flink-validator-process-124
>         app.kubernetes.io/managed-by: Helm
>         app.kubernetes.io/name: apache-flink
>         app.kubernetes.io/version: test
>         helm.sh/chart: apache-flink-1.0.0
>     spec:
>       containers: []
>       imagePullSecrets: []
> serviceAccount: validator-process-124
> taskManager:
>     podTemplate:
>       apiVersion: v1
>       kind: Pod
>       metadata:
>         annotations:
>           configmap.reloader.stakater.com/reload: flink-config-validator-process-124,pod-template-validator-process-124
>       spec:
>         affinity:
>           nodeAffinity:
>             requiredDuringSchedulingIgnoredDuringExecution:
>               nodeSelectorTerms:
>               - matchExpressions:
>                 - key: nodeType
>                   operator: In
>                   values:
>                   - someValue
>         containers:
>         - name: flink-main-container
>           resources:
>             limits:
>               cpu: "1"
>               memory: 3.6Gi
>             requests:
>               cpu: "0.2"
>               memory: 3Gi
>         tolerations:
>         - effect: NoSchedule
>           key: someValue
>           value: "true"{code}
>  
> Please let me know if more details are required.
>  
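
To check whether the "already exists" failure keeps repeating for a given deployment, the operator log can be summarized with a small script. The following is only a rough sketch: it assumes the log line layout shown in the description above (timestamp, abbreviated logger, [LEVEL][namespace/name], message), and the names LOG_LINE, ALREADY_EXISTS and summarize are hypothetical helpers, not part of Flink or the operator.

```python
import re
from collections import Counter

# Assumed layout, taken from the log excerpt in the report:
# "<date> <time> <logger> [LEVEL][namespace/name] message"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<logger>\S+)\s+"
    r"\[(?P<level>[A-Z]+)\s*\]\[(?P<resource>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)

# Message text of the ClusterDeploymentException seen in the stack trace.
ALREADY_EXISTS = re.compile(r"The Flink cluster (\S+) already exists")


def summarize(lines):
    """Return (ERROR entries per namespace/name, cluster ids reported as existing)."""
    errors = Counter()
    existing = set()
    for raw in lines:
        # Drop a leading mail quote marker ("> ") if the log was copied from an email.
        line = raw.lstrip("> ").rstrip()
        entry = LOG_LINE.match(line)
        if entry and entry.group("level") == "ERROR":
            errors[entry.group("resource")] += 1
        hit = ALREADY_EXISTS.search(line)
        if hit:
            existing.add(hit.group(1))
    return errors, existing


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python summarize_log.py < operator.log
    errors, existing = summarize(sys.stdin)
    for resource, count in errors.most_common():
        print(f"{resource}: {count} ERROR entries")
    for cluster_id in sorted(existing):
        print(f"cluster reported as already existing: {cluster_id}")
```

An ERROR count that keeps growing for the same namespace/name while the cluster id stays in the second set would match the stuck UPGRADING state described above. The script only reads logs and does not inspect the Kubernetes resources themselves.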


