Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/01/05 16:25:28 UTC

[GitHub] [airflow] JeremieDoctrine opened a new issue #20690: Unable to adopt pod after scheduler restarts

JeremieDoctrine opened a new issue #20690:
URL: https://github.com/apache/airflow/issues/20690


   ### Apache Airflow version
   
   2.2.2
   
   ### What happened
   
   We are running Airflow 2.2.2 with the Kubernetes executor and `multiNamespaceMode: true`.
   
   When the scheduler restarts, it tries to adopt the running worker pods and fails. Below is the error:
   
   ```
   [2022-01-05 16:09:48,531] {kubernetes_executor.py:730} INFO - Attempting to adopt pod task_name
   [2022-01-05 16:09:48,552] {kubernetes_executor.py:741} INFO - Failed to adopt pod task_name. Reason: (422)
   Reason: Unprocessable Entity
   HTTP response headers: HTTPHeaderDict({'Audit-Id': 'abd29dca-ee1b-4ab4-990b-cc7a9bb2fa1d', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '10aaeb70-dcb7-4f10-a829-3cb9c0af2c8d', 'X-Kubernetes-Pf-Prioritylevel-Uid': '5ff16932-da39-41e3-b9da-590e6c736d0f', 'Date': 'Wed, 05 Jan 2022 16:09:48 GMT', 'Transfer-Encoding': 'chunked'})
   HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Pod \"task_name\" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds` or `spec.tolerations` (only additions to existing tolerations)\n  core.PodSpec
   ```
   
   ### What you expected to happen
   
   The scheduler should adopt the existing worker pods after a restart instead of failing with HTTP 422.
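   The 422 above is Kubernetes refusing a pod update that appears to touch immutable `spec` fields. Adoption only needs to re-label the pod with the new scheduler's worker id; the error suggests the update that was sent did not match the stored pod spec exactly. A minimal sketch of the metadata-only patch idea (the function name `build_adoption_patch` is illustrative, not Airflow's actual API):
   
   ```python
   # Hypothetical sketch: adopting a worker pod should only require changing
   # metadata.labels, never the (mostly immutable) pod spec.
   def build_adoption_patch(scheduler_job_id: str) -> dict:
       """Build a metadata-only strategic-merge patch for adopting a worker pod.
   
       Sending only metadata.labels avoids the 422 'Forbidden: pod updates may
       not change fields other than ...' error, which can occur when the whole
       pod object is serialized and sent back but no longer matches what the
       API server stored (e.g. a duplicated env var collapsed on round-trip).
       """
       return {"metadata": {"labels": {"airflow-worker": scheduler_job_id}}}
   
   
   # With the official kubernetes Python client this would be applied roughly as:
   #   from kubernetes import client, config
   #   config.load_incluster_config()
   #   client.CoreV1Api().patch_namespaced_pod(
   #       name="task_name", namespace="airflow",
   #       body=build_adoption_patch("42"))
   ```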
   
   ### How to reproduce
   
   _No response_
   
   ### Operating System
   
   Debian GNU/Linux 10 (buster)
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Other 3rd-party Helm chart
   
   ### Deployment details
   
   ```
   executor: KubernetesExecutor
   
   config:
     core:
       sql_alchemy_conn: 'postgresql+psycopg2://${db_user}:${db_password}@${db_host}/${database_name}'
       max_active_tasks_per_dag: 50
       parallelism: 50
       donot_pickle: False
       enable_xcom_pickling: True
       default_timezone: "Europe/Paris"
       load_default_connections: False
       max_queued_run_per_dag: 16
       dags_folder: /opt/airflow/dags/repo/dags
       plugins_folder: /opt/airflow/dags/repo/plugins
     scheduler:
       catchup_by_default: False
     kubernetes:
       enable_tcp_keepalive: True
       delete_worker_pods_on_failure: True
     operators:
       default_queue: masters
     logging:
       remote_logging: "true"
       remote_base_log_folder: "s3://${log_folder}"
       remote_log_conn_id: s3_log_conn
       encrypt_s3_logs: False
     webserver:
       authenticate: True
       auth_backend: airflow.contrib.auth.backends.google_auth
       base_url: "https://${host}"
       enable_proxy_fix: True
       rbac: false
     api:
       auth_backend: airflow.api.auth.backend.basic_auth
   
   images:
     airflow:
       tag: ${tag}
       repository: ***REDACTED***.dkr.ecr.eu-central-1.amazonaws.com/${repository}
     gitSync:
       repository: k8s.gcr.io/git-sync/git-sync
       tag: v3.3.5
   
   redis:
     enabled: False
   postgresql:
     enabled: False
   
   data:
     metadataConnection:
       user: ${db_user}
       pass: ${db_password}
       protocol: postgresql
       host: ${db_host}
       port: 5432
       db: ${database_name}
       sslmode: disable
     resultBackendConnection:
       user: ${db_user}
       pass: ${db_password}
       protocol: postgresql
       host: ${db_host}
       port: 5432
       db: ${database_name}
       sslmode: disable
   
   fernetKey: '***REDACTED***'
   
   pgbouncer:
     enabled: true
     maxClientConn: 1000
     resources:
       limits:
         cpu: 0.5
         memory: 128Mi
       requests:
         cpu: 0.4
         memory: 128Mi
     podDisruptionBudget:
       enabled: true
       config:
         maxUnavailable: 0
   
   extraEnv: |
     - name: PYTHONPATH
       value: /opt/airflow/dags/repo
     - name: SLACK_API_TOKEN
       valueFrom:
         secretKeyRef:
           name: airflow-connections-secrets
           key: slack_api_token
     - name: SLACK_WEBHOOK_URI
       valueFrom:
         secretKeyRef:
           name: airflow-connections-secrets
           key: slack_webhook_uri
     - name: ROLLBAR_WEBHOOK_TOKEN
       valueFrom:
         secretKeyRef:
           name: airflow-connections-secrets
           key: rollbar_webhook_token
     - name: AIRFLOW_CONN_S3_LOG_CONN
       valueFrom:
         secretKeyRef:
           name: airflow-connections-secrets
           key: s3_log_conn
     - name: AIRFLOW_CONN_S3_CONN
       valueFrom:
         secretKeyRef:
           name: airflow-connections-secrets
           key: s3_conn
     - name: AIRFLOW__GOOGLE__CLIENT_ID
       valueFrom:
         secretKeyRef:
           name: airflow-connections-secrets
           key: google_auth_client_id
     - name: AIRFLOW__GOOGLE__CLIENT_SECRET
       valueFrom:
         secretKeyRef:
           name: airflow-connections-secrets
           key: google_auth_client_secret
     - name: AIRFLOW__CORE__FERNET_KEY
       valueFrom:
         secretKeyRef:
           name: airflow-connections-secrets
           key: fernet_key
   
   webserver:
     webserverConfig: |
       from flask_appbuilder.security.manager import AUTH_OAUTH
       import os
   
       AUTH_TYPE = AUTH_OAUTH
   
       # Uncomment to setup Full admin role name
       # AUTH_ROLE_ADMIN = 'Admin'
   
       # Uncomment to setup Public role name, no authentication needed
       # AUTH_ROLE_PUBLIC = 'Public'
   
       # Will allow user self registration
       AUTH_USER_REGISTRATION = True
   
       # The default user self registration role
       AUTH_USER_REGISTRATION_ROLE = "Admin"
       print("#" * 5)
       print("USING GOOGLE AUTH")
       OAUTH_PROVIDERS = [
           {
               "name": "google",
               "whitelist": ["***REDACTED***"],
               "icon": "fa-google",
               "token_key": "access_token",
               "remote_app": {
                   "client_id": os.environ["AIRFLOW__GOOGLE__CLIENT_ID"],
                   "client_secret": os.environ["AIRFLOW__GOOGLE__CLIENT_SECRET"],
                   "api_base_url": "https://www.googleapis.com/oauth2/v2/",
                   "client_kwargs": {"scope": "email profile"},
                   "request_token_url": None,
                   "access_token_url": "https://oauth2.googleapis.com/token",
                   "authorize_url": "https://accounts.google.com/o/oauth2/auth",
               },
           }
       ]
     resources:
       limits:
         cpu: 0.5
         memory: "2G"
       requests:
         cpu: 0.5
         memory: "2G"
   
   ingress:
     enabled: true
     web:
       annotations: {
         "alb.ingress.kubernetes.io/actions.redirect": "***REDACTED***",
         "alb.ingress.kubernetes.io/certificate-arn": "***REDACTED***",
         "alb.ingress.kubernetes.io/listen-ports": "[{\"HTTP\": 80}, {\"HTTPS\":443}]",
         "alb.ingress.kubernetes.io/scheme": "internal",
         "alb.ingress.kubernetes.io/target-type": "ip",
         "kubernetes.io/ingress.class": "alb",
       }
       path: "/*"
       pathType: "ImplementationSpecific"
       host: "${host}"
       # hosts: ["${host}"]
       # ingressClassName: "alb"
       tls:
         enabled: false
         secretName: ""
       precedingPaths: []
       succeedingPaths: []
   
   workers:
     resources:
       limits:
         cpu: 0.5
         memory: "1024M"
       requests:
         cpu: 0.25
         memory: "128Mi"
   
   scheduler:
     replicas: 1
     podDisruptionBudget:
       enabled: true
       config:
         maxUnavailable: 0
   
   multiNamespaceMode: true
   
   logs:
     persistence:
       enabled: true
       storageClassName: efs-sc
   
   
   airflowPodAnnotations: {'cluster-autoscaler.kubernetes.io/safe-to-evict': 'false','kubernetes.io/psp': 'eks.privileged'}
   
   dags:
     persistence:
       # Enable persistent volume for storing dags
       enabled: false
       # Volume size for dags
       size: 1Gi
       # access mode of the persistent volume
       accessMode: ReadWriteMany
     gitSync:
       enabled: true
       repo: ***REDACTED***
       branch: master
       subPath: ""
       rev: HEAD
       depth: 1
       sshKeySecret: ***REDACTED***
       wait: 60
       uid: 65533
       containerName: git-sync
       knownHosts: ***REDACTED***
   ```
   
   ### Anything else
   
   This looks a bit similar to https://github.com/apache/airflow/issues/20203, except that we are running on the same deployment.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] JeremieDoctrine closed issue #20690: Unable to adopt pod after scheduler restarts

JeremieDoctrine closed issue #20690:
URL: https://github.com/apache/airflow/issues/20690


   





[GitHub] [airflow] boring-cyborg[bot] commented on issue #20690: Unable to adopt pod after scheduler restarts

boring-cyborg[bot] commented on issue #20690:
URL: https://github.com/apache/airflow/issues/20690#issuecomment-1005875746


   Thanks for opening your first issue here! Be sure to follow the issue template!
   





[GitHub] [airflow] jedcunningham commented on issue #20690: Unable to adopt pod after scheduler restarts

jedcunningham commented on issue #20690:
URL: https://github.com/apache/airflow/issues/20690#issuecomment-1006075072


   This is likely the same issue #19949 is trying to fix.





[GitHub] [airflow] JeremieDoctrine commented on issue #20690: Unable to adopt pod after scheduler restarts

JeremieDoctrine commented on issue #20690:
URL: https://github.com/apache/airflow/issues/20690#issuecomment-1050714005


   Nice @bitsofdave :-) that gave me a useful pointer. The Fernet key was set twice on our pod; after removing the duplicate, the scheduler can now adopt pods 🙇





[GitHub] [airflow] bitsofdave commented on issue #20690: Unable to adopt pod after scheduler restarts

bitsofdave commented on issue #20690:
URL: https://github.com/apache/airflow/issues/20690#issuecomment-1050487594


   In my case, this was caused by the [user-community Airflow Helm chart setting an env var twice in the pod template](https://github.com/airflow-helm/charts/issues/416). I outlined the issue here: https://github.com/airflow-helm/charts/issues/532.
   
   As a workaround, editing the `airflow-pod-template` ConfigMap and removing the second instance of `CONNECTION_CHECK_MAX_COUNT`, where it is set to 0, allows pod adoption to work again.
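   To spot this kind of problem, one can scan the rendered pod template for env vars defined more than once. A small sketch (the helper name `duplicated_env_vars` is hypothetical, not part of Airflow or the Helm chart):
   
   ```python
   # Hypothetical helper to find env vars defined more than once per container
   # in a pod spec dict -- duplicates like this are what made the scheduler's
   # adoption update look like a spec change and fail with HTTP 422.
   from collections import Counter
   
   def duplicated_env_vars(pod_spec: dict) -> dict:
       """Return {container_name: [env var names that appear more than once]}."""
       dupes = {}
       for container in pod_spec.get("containers", []):
           names = [env["name"] for env in container.get("env", [])]
           repeated = [name for name, count in Counter(names).items() if count > 1]
           if repeated:
               dupes[container["name"]] = repeated
       return dupes
   
   
   # Example mirroring the reported problem: CONNECTION_CHECK_MAX_COUNT set twice.
   spec = {
       "containers": [
           {
               "name": "base",
               "env": [
                   {"name": "CONNECTION_CHECK_MAX_COUNT", "value": "20"},
                   {"name": "AIRFLOW__CORE__FERNET_KEY", "value": "..."},
                   {"name": "CONNECTION_CHECK_MAX_COUNT", "value": "0"},
               ],
           }
       ]
   }
   print(duplicated_env_vars(spec))  # {'base': ['CONNECTION_CHECK_MAX_COUNT']}
   ```
   
   Feeding it the JSON of the pod template (e.g. from `kubectl get configmap airflow-pod-template -o yaml`, converted to a dict) would flag the duplicated entry to remove.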

