Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/08/03 04:43:22 UTC

[GitHub] [airflow] iantbutler01 opened a new issue #10122: KubernetesPodOperator specifically takes a very long time to update to failed state after task fails.

iantbutler01 opened a new issue #10122:
URL: https://github.com/apache/airflow/issues/10122


   <!--
   
   Welcome to Apache Airflow!  For a smooth issue process, try to answer the following questions.
   Don't worry if they're not all applicable; just try to include what you can :-)
   
   If you need to include code snippets or logs, please put them in fenced code
   blocks.  If they're super-long, please use the details tag like
   <details><summary>super-long log</summary> lots of stuff </details>
   
   Please delete these comment blocks before submitting the issue.
   
   -->
   
   <!--
   
   IMPORTANT!!!
   
   PLEASE CHECK "SIMILAR TO X EXISTING ISSUES" OPTION IF VISIBLE
   NEXT TO "SUBMIT NEW ISSUE" BUTTON!!!
   
   PLEASE CHECK IF THIS ISSUE HAS BEEN REPORTED PREVIOUSLY USING SEARCH!!!
   
   Please complete the next sections or the issue will be closed.
   These questions are the first things we need to know to understand the context.
   
   -->
   
   **Apache Airflow version**:
   1.10.10
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl version`):
   v1.16.8-eks-e16311

   **Environment**:
   <details>
   KUBERNETES_SERVICE_PORT_HTTPS=443
   AIRFLOW__SMTP__SMTP_PORT=25
   AIRFLOW__KUBERNETES__NAMESPACE=airflow
   AIRFLOW__SMTP__SMTP_PASSWORD=*snip*
   AIRFLOW__SMTP__SMTP_USER=*snip*
   KUBERNETES_SERVICE_PORT=443
   BOILING_LAND_WEB_PORT_8080_TCP_PORT=8080
   REDIS_PASSWORD=*snip*
   AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABALES__AIRFLOW__CORE__REMOTE_BASE_LOGS_FOLDER=*snip*
   BOILING_LAND_WEB_SERVICE_PORT=8080
   HOSTNAME=boiling-land-scheduler-7bcb794c75-gjzjx
   PYTHON_VERSION=3.7.7
   LANGUAGE=C.UTF-8
   POSTGRES_PASSWORD=*snip*
   PIP_VERSION=19.0.2
   AIRFLOW__KUBERNETES__DELETE_WORKER_PODS_ON_FAILURE=False
   AIRFLOW__WEBSERVER__BASE_URL=http://localhost:8080
   AIRFLOW__SCHEDULER__CHILD_PROCESS_LOG_DIRECTORY=/opt/airflow/logs/scheduler
   AIRFLOW__CORE__DAGS_FOLDER=/opt/airflow/dags
   AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABALES__AIRFLOW__CORE__REMOTE_LOG_CONN_ID=s3_connection
   BOILING_LAND_WEB_SERVICE_PORT_WEB=8080
   AIRFLOW__CORE__DONOT_PICKLE=false
   BOILING_LAND_WEB_PORT=tcp://172.20.191.242:8080
   PWD=/opt/airflow
   AIRFLOW_VERSION=1.10.10
   AIRFLOW__SMTP__SMTP_MAIL_FROM=*snip*
   AWS_ROLE_ARN=*snip*
   AIRFLOW__CORE__LOAD_EXAMPLES=False
   TZ=Etc/UTC
   AIRFLOW__KUBERNETES__GIT_REPO=git@gitlab.com:whize/airflow-dags.git
   AIRFLOW__KUBERNETES__GIT_DAGS_FOLDER_MOUNT_POINT=/opt/airflow/dags
   HOME=/home/airflow
   AIRFLOW__KUBERNETES__ENV_FROM_CONFIGMAP_REF=boiling-land-env
   LANG=C.UTF-8
   KUBERNETES_PORT_443_TCP=tcp://172.20.0.1:443
   AIRFLOW_HOME=/opt/airflow
   DATABASE_USER=postgres
   AIRFLOW__KUBERNETES__GIT_SSH_KEY_SECRET_NAME=airflow-kube-pods-git
   DATABASE_PORT=5432
   GPG_KEY=*snip*
   AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABALES__AIRFLOW__CORE__REMOTE_LOGGING=True
   AIRFLOW__CORE__EXECUTOR=KubernetesExecutor
   AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://*snip*
   AIRFLOW__KUBERNETES__RUN_AS_USER=50000
   AIRFLOW__CORE__BASE_LOG_FOLDER=/opt/airflow/logs
   AIRFLOW__CORE__DAG_PROCESSOR_MANAGER_LOG_LOCATION=/opt/airflow/logs/dag_processor_manager/dag_processor_manager.log
   AIRFLOW__CORE__ENABLE_XCOM_PICKLING=false
   TERM=xterm
   AIRFLOW__SCHEDULER__MAX_THREADS=8
   AIRFLOW__KUBERNETES__WORKER_PODS_CREATION_BATCH_SIZE=5
   AIRFLOW_CONN_S3_CONNECTION=aws://
   
   AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY=apache/airflow
   DATABASE_DB=airflow
   AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG=1.10.10-python3.7
   BOILING_LAND_WEB_PORT_8080_TCP_PROTO=tcp
   AIRFLOW__KUBERNETES__IN_CLUSTER=True
   DATABASE_PASSWORD=*snip*
   AIRFLOW_GID=50000
   SHLVL=1
   AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
   KUBERNETES_PORT_443_TCP_PROTO=tcp
   BOILING_LAND_WEB_SERVICE_HOST=172.20.191.242
   LC_MESSAGES=C.UTF-8
   PYTHON_PIP_VERSION=20.0.2
   KUBERNETES_PORT_443_TCP_ADDR=172.20.0.1
   DATABASE_HOST=*snip*
   AIRFLOW__CORE__REMOTE_LOG_CONN_ID=s3_connection
   AIRFLOW__EMAIL__EMAIL_BACKEND=airflow.utils.email.send_email_smtp
   LC_CTYPE=C.UTF-8
   AIRFLOW__SMTP__SMTP_STARTTLS=False
   AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME=boiling-land
   PYTHON_GET_PIP_SHA256=*snip*
   AIRFLOW__CORE__SQL_ALCHEMY_CONN=*snip*
   KUBERNETES_SERVICE_HOST=172.20.0.1
   LC_ALL=C.UTF-8
   AIRFLOW__CORE__REMOTE_LOGGING=True
   KUBERNETES_PORT=tcp://172.20.0.1:443
   KUBERNETES_PORT_443_TCP_PORT=443
   AIRFLOW_KUBERNETES_ENVIRONMENT_VARIABLES_KUBE_CLIENT_REQUEST_TIMEOUT_SEC=50
   AIRFLOW__KUBERNETES__GIT_BRANCH=master
   PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/d59197a3c169cef378a22428a3fa99d33e080a5d/get-pip.py
   AIRFLOW__KUBERNETES__DELETE_WORKER_PODS=False
   PATH=/home/airflow/.local/bin:/home/airflow/.local/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
   AIRFLOW__KUBERNETES__DAGS_VOLUME_SUBPATH=repo/
   PYTHON_BASE_IMAGE=python:3.7-slim-buster
   AIRFLOW_UID=50000
   AIRFLOW__CORE__FERNET_KEY=*snip*
   DEBIAN_FRONTEND=noninteractive
   BOILING_LAND_WEB_PORT_8080_TCP=tcp://172.20.191.242:8080
   AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABALES__AIRFLOW__CORE__FERNET_KEY=*snip*
   AIRFLOW__SMTP__SMTP_SSL=False
   BOILING_LAND_WEB_PORT_8080_TCP_ADDR=172.20.191.242
   AIRFLOW__SMTP__SMTP_HOST=email-smtp.us-east-1.amazonaws.com
   _=/usr/bin/env
   </details>
   
   - **Cloud provider or hardware configuration**: AWS EKS
   
   - **OS** (e.g. from /etc/os-release):
   NAME="Amazon Linux"
   VERSION="2"
   
   - **Kernel** (e.g. `uname -a`): Linux<AWS_INTERNAL_HOSTNAME>.x86_64 #1 SMP Thu May 7 18:48:23 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
   
   
   - **Install tools**:
   - **Others**:
   
   **What happened**:
   Using the KubernetesExecutor, a worker pod is launched to run a task that uses the KubernetesPodOperator. The task fails due to an issue in the task definition, such as an invalid option. The pod does not exit immediately; it takes about 40 minutes to exit after the failure and report its state in the UI. Interestingly, the execution time is correctly listed as < 1 second.
   <!-- (please include exact error messages if you can) -->
   
   The task logs on the launcher pod also say the task is being marked as failed, yet the state in the UI only changes after about 40 minutes:
   
   <details>
   
   ```
   [2020-08-03 04:28:32,844] {taskinstance.py:1202} INFO - Marking task as FAILED.dag_id=arxiv_crawler_pipeline, task_id=launch_crawl_pod, execution_date=20200803T042640, start_date=20200803T042652, end_date=20200803T042832
   ```
   </details>
   
   The scheduler logs on the launcher pod say nothing about the failure though:
   <details>
   
   ```
   [2020-08-03 04:26:51,543] {__init__.py:51} INFO - Using executor LocalExecutor
   [2020-08-03 04:26:51,544] {dagbag.py:396} INFO - Filling up the DagBag from /opt/airflow/dags/crawlers/arxiv/arxiv_crawl_pipeline.py
   /home/airflow/.local/lib/python3.7/site-packages/airflow/contrib/operators/kubernetes_pod_operator.py:159: PendingDeprecationWarning: Invalid arguments were passed to KubernetesPodOperator (task_id: launch_crawl_pod). Support for passing such arguments will be dropped in Airflow 2.0. Invalid arguments were:
   *args: ()
   **kwargs: {'reattach_on_restart': True, 'log_events_on_failure': True}
     super(KubernetesPodOperator, self).__init__(*args, resources=None, **kwargs)
   /home/airflow/.local/lib/python3.7/site-packages/airflow/sensors/base_sensor_operator.py:71: PendingDeprecationWarning: Invalid arguments were passed to HttpSensor (task_id: wait_for_finish). Support for passing such arguments will be dropped in Airflow 2.0. Invalid arguments were:
   *args: ()
   **kwargs: {'result_check': <function check_http_response at 0x7fdc378bf320>}
     super(BaseSensorOperator, self).__init__(*args, **kwargs)
   Running %s on host %s <TaskInstance: arxiv_crawler_pipeline.launch_crawl_pod 2020-08-03T04:26:40.850022+00:00 [queued]> arxivcrawlerpipelinelaunchcrawlpod-4c7e99ae14704b2b8fa0d64db508
   ```
   
   </details>
   
   **What you expected to happen**:
   The pod should exit immediately and record the failed task state in the metadata database, which should then be reflected in the UI in a much more timely fashion.
   <!-- What do you think went wrong? -->
   No idea; I've been looking at this for about 12 hours now, and this report is my "I can't figure it out" moment.

   **How to reproduce it**:
   
   Set up an Airflow cluster on Kubernetes with the KubernetesExecutor, then create a DAG with a KubernetesPodOperator task that fails, either during the attempt to launch it or inside the pod that the task creates.
   
   
   **How often does this problem occur?**
   Every single time.

   **Any relevant logs to include?**
   The logs don't give any real insight into why there is such a dramatic lag between the failure and the metadata update.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] iantbutler01 edited a comment on issue #10122: KubernetesPodOperator specifically takes a very long time to update to failed state after task fails.

Posted by GitBox <gi...@apache.org>.
iantbutler01 edited a comment on issue #10122:
URL: https://github.com/apache/airflow/issues/10122#issuecomment-668560331


   I can confirm that this is the cause. There is currently no option to pass a timeout to SMTP, but disabling email sending cleared it right up: the failure was correctly reported within a 5-minute window instead of 40+ minutes. I'm going to look into adding both a sensible default timeout and a config option.





[GitHub] [airflow] iantbutler01 commented on issue #10122: KubernetesPodOperator specifically takes a very long time to update to failed state after task fails.

Posted by GitBox <gi...@apache.org>.
iantbutler01 commented on issue #10122:
URL: https://github.com/apache/airflow/issues/10122#issuecomment-668273631


   KubernetesPodOperator is a red herring. I believe I know the issue, and it should be addressed, as it will happen on every failure that attempts to send an email.

   ```python
   s = smtplib.SMTP_SSL(SMTP_HOST, SMTP_PORT) if SMTP_SSL else smtplib.SMTP(SMTP_HOST, SMTP_PORT)
   ```

   This call does not set a timeout, so it falls back to `socket._GLOBAL_DEFAULT_TIMEOUT`, which `socket.connect` treats as None. This means it will hang indefinitely if it is unable to make the SMTP connection.

   I am testing the fix on my end right now; assuming all is good, I think a PR should be made to set a default timeout of, say, 60 seconds. I am happy to make and test that PR if it's wanted.
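The default-timeout behaviour described above can be checked with the standard library alone; this is a minimal sketch (the 60-second value mirrors the default suggested in the comment, not anything Airflow ships):

```python
import smtplib
import socket

# With no explicit timeout, smtplib stores socket._GLOBAL_DEFAULT_TIMEOUT,
# which socket.create_connection treats as "block forever" on connect().
unbounded = smtplib.SMTP()  # no host given, so nothing connects yet
print(unbounded.timeout is socket._GLOBAL_DEFAULT_TIMEOUT)  # True

# Passing timeout= bounds connect() and every later socket read/write,
# so an unreachable SMTP server raises socket.timeout instead of hanging.
bounded = smtplib.SMTP(timeout=60)
print(bounded.timeout)  # 60
```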





[GitHub] [airflow] boring-cyborg[bot] commented on issue #10122: KubernetesPodOperator specifically takes a very long time to update to failed state after task fails.

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #10122:
URL: https://github.com/apache/airflow/issues/10122#issuecomment-667799621


   Thanks for opening your first issue here! Be sure to follow the issue template!
   





[GitHub] [airflow] potiuk commented on issue #10122: SMTP connection in email utils has no default timeout which causes the connection to hang indefinitely if it can't reach the SMTP server.

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #10122:
URL: https://github.com/apache/airflow/issues/10122#issuecomment-668576549


   Nice! Looking forward to the PR fixing it :)
   








[GitHub] [airflow] eladkal closed issue #10122: SMTP connection in email utils has no default timeout which causes the connection to hang indefinitely if it can't reach the SMTP server.

Posted by GitBox <gi...@apache.org>.
eladkal closed issue #10122:
URL: https://github.com/apache/airflow/issues/10122


   





[GitHub] [airflow] eladkal commented on issue #10122: SMTP connection in email utils has no default timeout which causes the connection to hang indefinitely if it can't reach the SMTP server.

Posted by GitBox <gi...@apache.org>.
eladkal commented on issue #10122:
URL: https://github.com/apache/airflow/issues/10122#issuecomment-786177390


   A default timeout was added in https://github.com/apache/airflow/pull/12801
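For readers on Airflow versions that include that PR, the timeout appears to be exposed as an `[smtp]` config option. The option name below is assumed from the Airflow 2 configuration reference and should be verified against the linked PR and your installed version:

```
[smtp]
# Bounds the SMTP connect and each subsequent socket operation, in seconds
smtp_timeout = 30
```

Equivalently, in the env-var style used in the environment dump above: `AIRFLOW__SMTP__SMTP_TIMEOUT=30`.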

