Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/01/05 16:25:28 UTC
[GitHub] [airflow] JeremieDoctrine opened a new issue #20690: Unable to adopt pod after scheduler restarts
JeremieDoctrine opened a new issue #20690:
URL: https://github.com/apache/airflow/issues/20690
### Apache Airflow version
2.2.2
### What happened
We are running Airflow 2.2.2 with the Kubernetes executor and `multiNamespaceMode: true`.
When the scheduler restarts, it tries to adopt the running pods and fails with the error below:
```
[2022-01-05 16:09:48,531] {kubernetes_executor.py:730} INFO - Attempting to adopt pod task_name
[2022-01-05 16:09:48,552] {kubernetes_executor.py:741} INFO - Failed to adopt pod task_name. Reason: (422)
Reason: Unprocessable Entity
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'abd29dca-ee1b-4ab4-990b-cc7a9bb2fa1d', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '10aaeb70-dcb7-4f10-a829-3cb9c0af2c8d', 'X-Kubernetes-Pf-Prioritylevel-Uid': '5ff16932-da39-41e3-b9da-590e6c736d0f', 'Date': 'Wed, 05 Jan 2022 16:09:48 GMT', 'Transfer-Encoding': 'chunked'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Pod \"task_name\" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds` or `spec.tolerations` (only additions to existing tolerations)\n core.PodSpec
```
### What you expected to happen
The scheduler should adopt the running pods after a restart.
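For context on where this 422 comes from: on restart, the adoption path in `kubernetes_executor.py` (the code emitting the log lines above) rewrites each worker pod's `airflow-worker` label to the new scheduler's job id and then PATCHes the *entire* serialized pod back to the API server. Kubernetes rejects the PATCH if the round-tripped spec differs from the stored one in any field other than the few mutable ones, which is exactly what happens when e.g. a duplicated env var gets deduplicated on serialization. A simplified sketch of the label rewrite, reduced to a pure function over a serialized pod dict so it runs without a cluster (the real code also handles resource versions and uses the kubernetes client to send the patch):

```python
def adopt_pod(pod: dict, scheduler_job_id: str) -> dict:
    """Relabel a serialized pod so it belongs to the new scheduler.

    Sketch only: the executor then PATCHes the whole serialized pod back,
    and the API server compares the submitted spec against the stored one.
    """
    # Copy the pod and its metadata/labels so the input is not mutated.
    adopted = {**pod, "metadata": {**pod.get("metadata", {})}}
    labels = dict(adopted["metadata"].get("labels", {}))
    labels["airflow-worker"] = scheduler_job_id  # the adoption marker label
    adopted["metadata"]["labels"] = labels
    return adopted
```

Note that only `metadata.labels` is meant to change here; any *other* difference the server sees in the submitted object is what triggers the "Forbidden: pod updates may not change fields" error above.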
### How to reproduce
_No response_
### Operating System
Debian GNU/Linux 10 (buster)
### Versions of Apache Airflow Providers
_No response_
### Deployment
Other 3rd-party Helm chart
### Deployment details
```
executor: KubernetesExecutor
config:
  core:
    sql_alchemy_conn: 'postgresql+psycopg2://${db_user}:${db_password}@${db_host}/${database_name}'
    max_active_tasks_per_dag: 50
    parallelism: 50
    donot_pickle: False
    enable_xcom_pickling: True
    default_timezone: "Europe/Paris"
    load_default_connections: False
    max_queued_run_per_dag: 16
    dags_folder: /opt/airflow/dags/repo/dags
    plugins_folder: /opt/airflow/dags/repo/plugins
  scheduler:
    catchup_by_default: False
  kubernetes:
    enable_tcp_keepalive: True
    delete_worker_pods_on_failure: True
  operators:
    default_queue: masters
  logging:
    remote_logging: "true"
    remote_base_log_folder: "s3://${log_folder}"
    remote_log_conn_id: s3_log_conn
    encrypt_s3_logs: False
  webserver:
    authenticate: True
    auth_backend: airflow.contrib.auth.backends.google_auth
    base_url: "https://${host}"
    enable_proxy_fix: True
    rbac: false
  api:
    auth_backend: airflow.api.auth.backend.basic_auth
images:
  airflow:
    tag: ${tag}
    repository: ***REDACTED***.dkr.ecr.eu-central-1.amazonaws.com/${repository}
  gitSync:
    repository: k8s.gcr.io/git-sync/git-sync
    tag: v3.3.5
redis:
  enabled: False
postgresql:
  enabled: False
data:
  metadataConnection:
    user: ${db_user}
    pass: ${db_password}
    protocol: postgresql
    host: ${db_host}
    port: 5432
    db: ${database_name}
    sslmode: disable
  resultBackendConnection:
    user: ${db_user}
    pass: ${db_password}
    protocol: postgresql
    host: ${db_host}
    port: 5432
    db: ${database_name}
    sslmode: disable
fernetKey: '***REDACTED***'
pgbouncer:
  enabled: true
  maxClientConn: 1000
  resources:
    limits:
      cpu: 0.5
      memory: 128Mi
    requests:
      cpu: 0.4
      memory: 128Mi
  podDisruptionBudget:
    enabled: true
    config:
      maxUnavailable: 0
extraEnv: |
  - name: PYTHONPATH
    value: /opt/airflow/dags/repo
  - name: SLACK_API_TOKEN
    valueFrom:
      secretKeyRef:
        name: airflow-connections-secrets
        key: slack_api_token
  - name: SLACK_WEBHOOK_URI
    valueFrom:
      secretKeyRef:
        name: airflow-connections-secrets
        key: slack_webhook_uri
  - name: ROLLBAR_WEBHOOK_TOKEN
    valueFrom:
      secretKeyRef:
        name: airflow-connections-secrets
        key: rollbar_webhook_token
  - name: AIRFLOW_CONN_S3_LOG_CONN
    valueFrom:
      secretKeyRef:
        name: airflow-connections-secrets
        key: s3_log_conn
  - name: AIRFLOW_CONN_S3_CONN
    valueFrom:
      secretKeyRef:
        name: airflow-connections-secrets
        key: s3_conn
  - name: AIRFLOW__GOOGLE__CLIENT_ID
    valueFrom:
      secretKeyRef:
        name: airflow-connections-secrets
        key: google_auth_client_id
  - name: AIRFLOW__GOOGLE__CLIENT_SECRET
    valueFrom:
      secretKeyRef:
        name: airflow-connections-secrets
        key: google_auth_client_secret
  - name: AIRFLOW__CORE__FERNET_KEY
    valueFrom:
      secretKeyRef:
        name: airflow-connections-secrets
        key: fernet_key
webserver:
  webserverConfig: |
    from flask_appbuilder.security.manager import AUTH_OAUTH
    import os
    AUTH_TYPE = AUTH_OAUTH
    # Uncomment to setup Full admin role name
    # AUTH_ROLE_ADMIN = 'Admin'
    # Uncomment to setup Public role name, no authentication needed
    # AUTH_ROLE_PUBLIC = 'Public'
    # Will allow user self registration
    AUTH_USER_REGISTRATION = True
    # The default user self registration role
    AUTH_USER_REGISTRATION_ROLE = "Admin"
    print("#" * 5)
    print("USING GOOGLE AUTH")
    OAUTH_PROVIDERS = [
        {
            "name": "google",
            "whitelist": ["***REDACTED***"],
            "icon": "fa-google",
            "token_key": "access_token",
            "remote_app": {
                "client_id": os.environ["AIRFLOW__GOOGLE__CLIENT_ID"],
                "client_secret": os.environ["AIRFLOW__GOOGLE__CLIENT_SECRET"],
                "api_base_url": "https://www.googleapis.com/oauth2/v2/",
                "client_kwargs": {"scope": "email profile"},
                "request_token_url": None,
                "access_token_url": "https://oauth2.googleapis.com/token",
                "authorize_url": "https://accounts.google.com/o/oauth2/auth",
            },
        }
    ]
  resources:
    limits:
      cpu: 0.5
      memory: "2G"
    requests:
      cpu: 0.5
      memory: "2G"
ingress:
  enabled: true
  web:
    annotations: {
      "alb.ingress.kubernetes.io/actions.redirect": "***REDACTED***",
      "alb.ingress.kubernetes.io/certificate-arn": "***REDACTED***",
      "alb.ingress.kubernetes.io/listen-ports": "[{\"HTTP\": 80}, {\"HTTPS\":443}]",
      "alb.ingress.kubernetes.io/scheme": "internal",
      "alb.ingress.kubernetes.io/target-type": "ip",
      "kubernetes.io/ingress.class": "alb",
    }
    path: "/*"
    pathType: "ImplementationSpecific"
    host: "${host}"
    # hosts: ["${host}"]
    # ingressClassName: "alb"
    tls:
      enabled: false
      secretName: ""
    precedingPaths: []
    succeedingPaths: []
workers:
  resources:
    limits:
      cpu: 0.5
      memory: "1024M"
    requests:
      cpu: 0.25
      memory: "128Mi"
scheduler:
  replicas: 1
  podDisruptionBudget:
    enabled: true
    config:
      maxUnavailable: 0
multiNamespaceMode: true
logs:
  persistence:
    enabled: true
    storageClassName: efs-sc
airflowPodAnnotations: {'cluster-autoscaler.kubernetes.io/safe-to-evict': 'false', 'kubernetes.io/psp': 'eks.privileged'}
dags:
  persistence:
    # Enable persistent volume for storing dags
    enabled: false
    # Volume size for dags
    size: 1Gi
    # access mode of the persistent volume
    accessMode: ReadWriteMany
  gitSync:
    enabled: true
    repo: ***REDACTED***
    branch: master
    subPath: ""
    rev: HEAD
    depth: 1
    sshKeySecret: ***REDACTED***
    wait: 60
    uid: 65533
    containerName: git-sync
    knownHosts: ***REDACTED***
```
### Anything else
This looks a bit similar to https://github.com/apache/airflow/issues/20203, except that we are running on the same deployment.
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] JeremieDoctrine closed issue #20690: Unable to adopt pod after scheduler restarts
JeremieDoctrine closed issue #20690:
URL: https://github.com/apache/airflow/issues/20690
[GitHub] [airflow] boring-cyborg[bot] commented on issue #20690: Unable to adopt pod after scheduler restarts
boring-cyborg[bot] commented on issue #20690:
URL: https://github.com/apache/airflow/issues/20690#issuecomment-1005875746
Thanks for opening your first issue here! Be sure to follow the issue template!
[GitHub] [airflow] jedcunningham commented on issue #20690: Unable to adopt pod after scheduler restarts
jedcunningham commented on issue #20690:
URL: https://github.com/apache/airflow/issues/20690#issuecomment-1006075072
This is likely the same issue #19949 is trying to fix.
[GitHub] [airflow] JeremieDoctrine commented on issue #20690: Unable to adopt pod after scheduler restarts
JeremieDoctrine commented on issue #20690:
URL: https://github.com/apache/airflow/issues/20690#issuecomment-1050714005
Nice @bitsofdave :-) That pointed me in the right direction: the Fernet key was set twice on our pod. With the duplicate removed, the scheduler can now adopt pods 🙇
[GitHub] [airflow] bitsofdave commented on issue #20690: Unable to adopt pod after scheduler restarts
bitsofdave commented on issue #20690:
URL: https://github.com/apache/airflow/issues/20690#issuecomment-1050487594
In my case, this was caused by the [user community Airflow Helm chart setting an env var twice in the pod template](https://github.com/airflow-helm/charts/issues/416). I outlined the issue here: https://github.com/airflow-helm/charts/issues/532.
As a workaround, editing the `airflow-pod-template` ConfigMap and removing the second instance of `CONNECTION_CHECK_MAX_COUNT` (where it is set to 0) allows pod adoption to work again.
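A quick way to check whether a rendered pod template has this problem is to scan it for duplicated env var names. A small sketch, not part of Airflow, assuming the template has already been loaded into a dict (e.g. via `yaml.safe_load` of the ConfigMap's pod template):

```python
from collections import Counter


def find_duplicate_env_vars(pod_template: dict) -> dict:
    """Map container name -> env var names that appear more than once."""
    spec = pod_template.get("spec", {})
    duplicates = {}
    # Check regular containers and init containers alike.
    for container in spec.get("containers", []) + spec.get("initContainers", []):
        names = [env["name"] for env in container.get("env", [])]
        repeated = sorted(n for n, count in Counter(names).items() if count > 1)
        if repeated:
            duplicates[container["name"]] = repeated
    return duplicates
```

Any name this reports (like `CONNECTION_CHECK_MAX_COUNT` here, or a twice-set Fernet key) is a candidate for the duplicate that makes the adoption PATCH fail.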