Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/03/16 11:29:06 UTC

[GitHub] [airflow] hussainsaify opened a new issue #22307: Airflow webserver stops responding and then restarted by liveness probes

hussainsaify opened a new issue #22307:
URL: https://github.com/apache/airflow/issues/22307


   ### Apache Airflow version
   
   2.2.4 (latest released)
   
   ### What happened
   
   Hi Team,
   
   After upgrading to Airflow 2.2.3, we started seeing an issue where the webserver gets stuck and is eventually restarted after failing its liveness probes. The issue is transient; we see it 2-3 times per week. We also see a sharp CPU spike at the time the issue occurs.
   
   Cloud - AWS EKS 1.21
   Helm Chart - 1.4.0
   Current airflow version- 2.2.4
   
   Please let me know in case you need more details 
   
   Thanks,
   Hussain 
   
   
   ### What you think should happen instead
   
   The webserver should continue running and responding to requests.
   
   ### How to reproduce
   
   The issue is transient, occurs 2-3 times per week.
   
   ### Operating System
   
   Debian GNU/Linux 10 (buster)
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-amazon==3.0.0
   apache-airflow-providers-apache-hive==2.2.0
   apache-airflow-providers-celery==2.1.0
   apache-airflow-providers-cncf-kubernetes==3.0.2
   apache-airflow-providers-docker==2.4.1
   apache-airflow-providers-elasticsearch==2.2.0
   apache-airflow-providers-ftp==2.0.1
   apache-airflow-providers-google==6.4.0
   apache-airflow-providers-grpc==2.0.1
   apache-airflow-providers-hashicorp==2.1.1
   apache-airflow-providers-http==2.0.3
   apache-airflow-providers-imap==2.2.0
   apache-airflow-providers-microsoft-azure==3.6.0
   apache-airflow-providers-microsoft-mssql==2.1.0
   apache-airflow-providers-mysql==2.2.0
   apache-airflow-providers-odbc==2.0.1
   apache-airflow-providers-postgres==3.0.0
   apache-airflow-providers-redis==2.0.1
   apache-airflow-providers-sendgrid==2.0.1
   apache-airflow-providers-sftp==2.4.1
   apache-airflow-providers-slack==4.2.0
   apache-airflow-providers-sqlite==2.1.0
   apache-airflow-providers-ssh==2.4.0
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   Please find below helm values.
   
       defaultAirflowTag: "2.2.4"
   
       airflowVersion: "2.2.4"
   
       labels:
         spotinst.io/restrict-scale-down: "true"
       nodeSelector:
         node-class: worker
   
       fernetKeySecretName: airflow-fernet-key
   
       webserverSecretKeySecretName: airflow-webserver-secret-key
   
       ingress:
         enabled: true
         web:
           host: "airflow.************.com"
           annotations:
             kubernetes.io/ingress.class: "nginx"
           hosts:
             - name: "airflow.**************.com"
               tls:
                 enabled: true
                 secretName: "tls-wildcard"
   
         # Configs for the Ingress of the flower Service
         flower:
   
           path: "/flower"
   
           pathType: "ImplementationSpecific"
   
           host: "airflow.************.com"
           annotations:
             kubernetes.io/ingress.class: "nginx"
   
           hosts:
             - name: "airflow.***********.com"
               tls:
                 enabled: true
                 secretName: "tls-wildcard"
   
   
       airflowPodAnnotations:
         ad.datadoghq.com/airflow-web.check_names: '["airflow"]'
         ad.datadoghq.com/airflow-web.init_configs: '[{}]'
         ad.datadoghq.com/airflow-web.instances: |
           [
             {
               "url": "http://%%host%%:8080"
             }
           ]
       executor: "KubernetesExecutor"
   
   
       extraEnv: |
         - name: AIRFLOW__CORE__FERNET_KEY
           valueFrom:
             secretKeyRef:
               name: airflow-fernet-key
               key: fernet-key
         - name: AZURE_TENANT_ID
           valueFrom:
             secretKeyRef:
               name: airflow-azuread-creds
               key: tenant_id
         - name: AZURE_CLIENT_SECRET
           valueFrom:
             secretKeyRef:
               name: airflow-azuread-creds
               key: client_secret
         - name: AZURE_CLIENT_ID
           valueFrom:
             secretKeyRef:
               name: airflow-azuread-creds
               key: client_id
         - name: AIRFLOW__METRICS__STATSD_HOST
           valueFrom:
             fieldRef:
               fieldPath: status.hostIP
       env:
         - name: AIRFLOW_CONN_AWS_LOG
           value: "aws://"
         - name: AIRFLOW__CORE__REMOTE_LOGGING
           value: "True"
         - name: AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER
           value: "s3://***********/airflow-etl/logs"
         - name: AIRFLOW__CORE__REMOTE_LOG_CONN_ID
           value: "aws_development"
         - name: AIRFLOW__CORE__ENCRYPT_S3_LOGS
           value: "True"
         - name: AIRFLOW__CORE__PARALLELISM
           value: "64"
         - name: AIRFLOW__CORE__DAG_CONCURRENCY
           value: "32"
         - name: AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG
           value: "32"
         - name: AIRFLOW__CORE__SQL_ALCHEMY_POOL_SIZE
           value: "10"
         - name: AIRFLOW__CORE__SQL_ALCHEMY_MAX_OVERFLOW
           value: "30"
         - name: AIRFLOW__SMTP__SMTP_HOST
           value: "smtp.mail-relay.svc"
         - name: AIRFLOW__SMTP__SMTP_MAIL_FROM
           value: "***********"
         - name: AIRFLOW__METRICS__STATSD_ON
           value: "True"
         - name: AIRFLOW__METRICS__STATSD_PORT
           value: "8125"
         - name: AIRFLOW__METRICS__STATSD_PREFIX
           value: "airflow"
         - name: AIRFLOW__WEBSERVER__AUTHENTICATE
           value: "True"
         - name: AIRFLOW__WEBSERVER__EXPOSE_CONFIG
           value: "True"
         - name: AIRFLOW__WEBSERVER__RBAC
           value: "True"
         - name: AIRFLOW__WEBSERVER__ENABLE_PROXY_FIX
           value: "True"
         - name: AIRFLOW__LOGGING__FAB_LOGGING_LEVEL
           value: "WARN"
         - name: AIRFLOW__LOGGING__LOGGING_LEVEL
           value: "DEBUG"
         - name: AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT
           value: "240.0"
         - name: AIRFLOW__CELERY__FLOWER_URL_PREFIX
           value: "/flower"
         - name: AIRFLOW__API__AUTH_BACKEND
           value: "airflow.api.auth.backend.basic_auth"
         - name: AIRFLOW__CORE__HOSTNAME_CALLABLE
           value: "socket:gethostname"
         - name: AIRFLOW__WEBSERVER__WORKER_REFRESH_BATCH_SIZE
           value: "0"
         - name: AIRFLOW__WEBSERVER__WORKER_REFRESH_INTERVAL
           value: "0"
   
       data:
         metadataSecretName: airflow-metadata-connection
   
       # Airflow scheduler settings
       scheduler:
         replicas: 2
         serviceAccount:
           create: false
           name: airflow
         # Scheduler pod disruption budget
         podDisruptionBudget:
           enabled: true
           config:
             maxUnavailable: 1
   
         resources:
           limits:
             cpu: "4000m"
             memory: "8Gi"
           requests:
             cpu: "1000m"
             memory: "2Gi"
   
         nodeSelector:
           node-class: worker
         affinity:
           podAntiAffinity:
             preferredDuringSchedulingIgnoredDuringExecution:
             - podAffinityTerm:
                 labelSelector:
                   matchLabels:
                     component: scheduler
                 topologyKey: kubernetes.io/hostname
               weight: 100
   
       # Airflow webserver settings
       webserver:
   
         replicas: 1
   
         serviceAccount:
           create: false
           name: airflow
   
         resources:
           limits:
             cpu: "4000m"
             memory: "8Gi"
           requests:
             cpu: "1500m"
             memory: "2Gi"
   
         webserverConfig: |
           import os
           from airflow.configuration import conf
           from flask_appbuilder.security.manager import AUTH_OAUTH
           from flask import Flask
           from flask_appbuilder import SQLA, AppBuilder
           SQLALCHEMY_DATABASE_URI = conf.get("core", "SQL_ALCHEMY_CONN")
           basedir = os.path.abspath(os.path.dirname(__file__))
           AUTH_USER_REGISTRATION_ROLE = "Viewer"
           AUTH_TYPE = AUTH_OAUTH
           AUTH_ROLES_SYNC_AT_LOGIN = True
           AUTH_USER_REGISTRATION = True
           AZURE_TENANT_ID = os.environ.get("AZURE_TENANT_ID")
           API_BASE = f"https://login.microsoftonline.com/{AZURE_TENANT_ID}/oauth2"
           ACCESS_TOKEN_URL = f"{API_BASE}/token"
           AUTHORIZE_URL = f"{API_BASE}/authorize"
           AUTH_ROLES_MAPPING = {
                                     ***********
           }
           OAUTH_PROVIDERS = [
              {
               "name": "azure",
               "icon": "fa-windows",
               "token_key": "access_token",
               "remote_app": {
                   "client_id": os.environ.get("AZURE_CLIENT_ID"),
                   "client_secret":  os.environ.get("AZURE_CLIENT_SECRET"),
                   "api_base_url": API_BASE,
                   "client_kwargs": {
                       "resource": os.environ.get("AZURE_CLIENT_ID"),
                       "scope": "User.read name preferred_username email profile upn https://graph.windows.net/.default openid"
                   },
                   "request_token_url": None,
                   "access_token_url": ACCESS_TOKEN_URL,
                   "authorize_url": AUTHORIZE_URL,
               },
           }
           ]
       # Airflow Triggerer Config
       triggerer:
         enabled: true
   
         # Create ServiceAccount
         serviceAccount:
           create: false
           name: airflow
   
       workers:
         serviceAccount:
           create: false
           name: airflow
   
         resources:
           limits:
             cpu: "1000m"
             memory: "2Gi"
           requests:
             cpu: "400m"
             memory: "1000Mi"
   
       # Flower settings
       flower:
         enabled: false
   
   
       # Statsd settings
       statsd:
         enabled: false
   
   
       # Configuration for the redis provisioned by the chart
       redis:
         enabled: false
   
       postgresql:
         enabled: false
   
       # Git sync
       dags:
         gitSync:
           enabled: true
           repo: git@github.com:***********
           branch: development/airflow
           subPath: "dags"
           sshKeySecret: "airflow-git"
   
           knownHosts: |
             github.com ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAq2A7hRGmdnm9tUDbO9IDSwBK6TbQa+PXYPCPy6rbTrTtw7PHkccKrpp0yVhp5HdEIcKr6pLlVDBfOLX9QUsyCOV0wzfjIJNlGEYsdlLJizHhbn2mUjvSAHQqZETYP81eFzLQNnPHt4EVVUh7VfDESU84KezmD5QlWpXLmvU31/yMf+Se8xhHTvKSCZIFImWwoG6mbUoWf9nzpIoaSjB+weqqUUmpaaasXVal72J+UX2B+2RPW3RcT0eOzQgqlJL3RKrTJvdsjE3JEAvGq3lGHSZXy28G3skua2SmVi/w4yCE6gbODqnTWlg7+wC604ydGXA8VJiS5ap43JXiUFFAaQ==
           wait: 120
           resources:
             limits:
               cpu: "100m"
               memory: "400Mi"
             requests:
               cpu: "50m"
               memory: "200Mi"
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on issue #22307: Airflow webserver stops responding and then restarted by liveness probes

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #22307:
URL: https://github.com/apache/airflow/issues/22307#issuecomment-1069024463


   Thanks for opening your first issue here! Be sure to follow the issue template!
   





[GitHub] [airflow] potiuk commented on issue #22307: Airflow webserver stops responding and then restarted by liveness probes

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #22307:
URL: https://github.com/apache/airflow/issues/22307#issuecomment-1069049135


   You need to provide the webserver logs, and likely the Kubernetes logs, from around the time the webserver is killed (and possibly analyse them). Ideally, you should try to analyse them first and see whether you can identify the cause yourself. You should also check how the log files differ from the "normal" situation.
   
   There is no way we can act on this without that information. Converting it into a discussion until more information is available.
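
One way to narrow the logs down to the window around a restart is sketched below. This is only an illustration, not part of Airflow: the helper name `lines_around`, the five-minute default window, and the assumed line prefix `[YYYY-MM-DD HH:MM:SS` are all hypothetical and would need adjusting to the actual log format.

```python
import re
from datetime import datetime, timedelta

# Assumed line prefix: "[2022-03-16 11:29:06,123] ..." (adjust to your log format).
TIMESTAMP_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def lines_around(lines, restart_time, window=timedelta(minutes=5)):
    """Return the log lines stamped within `window` of `restart_time`.

    Lines without a recognisable timestamp (for example traceback
    continuation lines) are kept only if the preceding stamped line was kept.
    """
    selected = []
    keep = False
    for line in lines:
        match = TIMESTAMP_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            keep = abs(stamp - restart_time) <= window
        if keep:
            selected.append(line)
    return selected


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real deployment.
    sample = [
        "[2022-03-16 11:00:00,000] {example.py:1} INFO - well before the restart",
        "[2022-03-16 11:27:00,000] {example.py:2} ERROR - shortly before the restart",
        "Traceback (most recent call last):",
        "[2022-03-16 12:00:00,000] {example.py:3} INFO - well after the restart",
    ]
    for line in lines_around(sample, datetime(2022, 3, 16, 11, 29, 6)):
        print(line)
```

The input could come from `kubectl logs <pod> --previous`, which returns the output of the container instance that was running before the restart. Running the same filter over a period without a restart gives a baseline for comparison with the "normal" situation.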

