Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/08/03 04:43:22 UTC
[GitHub] [airflow] iantbutler01 opened a new issue #10122: KubernetesPodOperator specifically takes a very long time to update to failed state after task fails.
iantbutler01 opened a new issue #10122:
URL: https://github.com/apache/airflow/issues/10122
**Apache Airflow version**:
1.10.10
**Kubernetes version (if you are using kubernetes)** (use `kubectl version`):
v1.16.8-eks-e16311
**Environment**:
<details>
KUBERNETES_SERVICE_PORT_HTTPS=443
AIRFLOW__SMTP__SMTP_PORT=25
AIRFLOW__KUBERNETES__NAMESPACE=airflow
AIRFLOW__SMTP__SMTP_PASSWORD=*snip*
AIRFLOW__SMTP__SMTP_USER=*snip*
KUBERNETES_SERVICE_PORT=443
BOILING_LAND_WEB_PORT_8080_TCP_PORT=8080
REDIS_PASSWORD=fjODRhL3FL6n0y4cA
AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABALES__AIRFLOW__CORE__REMOTE_BASE_LOGS_FOLDER=*snip*
BOILING_LAND_WEB_SERVICE_PORT=8080
HOSTNAME=boiling-land-scheduler-7bcb794c75-gjzjx
PYTHON_VERSION=3.7.7
LANGUAGE=C.UTF-8
POSTGRES_PASSWORD=*snip*
PIP_VERSION=19.0.2
AIRFLOW__KUBERNETES__DELETE_WORKER_PODS_ON_FAILURE=False
AIRFLOW__WEBSERVER__BASE_URL=http://localhost:8080
AIRFLOW__SCHEDULER__CHILD_PROCESS_LOG_DIRECTORY=/opt/airflow/logs/scheduler
AIRFLOW__CORE__DAGS_FOLDER=/opt/airflow/dags
AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABALES__AIRFLOW__CORE__REMOTE_LOG_CONN_ID=s3_connection
BOILING_LAND_WEB_SERVICE_PORT_WEB=8080
AIRFLOW__CORE__DONOT_PICKLE=false
BOILING_LAND_WEB_PORT=tcp://172.20.191.242:8080
PWD=/opt/airflow
AIRFLOW_VERSION=1.10.10
AIRFLOW__SMTP__SMTP_MAIL_FROM=*snip*
AWS_ROLE_ARN=*snip*
AIRFLOW__CORE__LOAD_EXAMPLES=False
TZ=Etc/UTC
AIRFLOW__KUBERNETES__GIT_REPO=git@gitlab.com:whize/airflow-dags.git
AIRFLOW__KUBERNETES__GIT_DAGS_FOLDER_MOUNT_POINT=/opt/airflow/dags
HOME=/home/airflow
AIRFLOW__KUBERNETES__ENV_FROM_CONFIGMAP_REF=boiling-land-env
LANG=C.UTF-8
KUBERNETES_PORT_443_TCP=tcp://172.20.0.1:443
AIRFLOW_HOME=/opt/airflow
DATABASE_USER=postgres
AIRFLOW__KUBERNETES__GIT_SSH_KEY_SECRET_NAME=airflow-kube-pods-git
DATABASE_PORT=5432
GPG_KEY=*snip*
AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABALES__AIRFLOW__CORE__REMOTE_LOGGING=True
AIRFLOW__CORE__EXECUTOR=KubernetesExecutor
AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://*snip*
AIRFLOW__KUBERNETES__RUN_AS_USER=50000
AIRFLOW__CORE__BASE_LOG_FOLDER=/opt/airflow/logs
AIRFLOW__CORE__DAG_PROCESSOR_MANAGER_LOG_LOCATION=/opt/airflow/logs/dag_processor_manager/dag_processor_manager.log
AIRFLOW__CORE__ENABLE_XCOM_PICKLING=false
TERM=xterm
AIRFLOW__SCHEDULER__MAX_THREADS=8
AIRFLOW__KUBERNETES__WORKER_PODS_CREATION_BATCH_SIZE=5
AIRFLOW_CONN_S3_CONNECTION=aws://
AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY=apache/airflow
DATABASE_DB=airflow
AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG=1.10.10-python3.7
BOILING_LAND_WEB_PORT_8080_TCP_PROTO=tcp
AIRFLOW__KUBERNETES__IN_CLUSTER=True
DATABASE_PASSWORD=*snip*
AIRFLOW_GID=50000
SHLVL=1
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
KUBERNETES_PORT_443_TCP_PROTO=tcp
BOILING_LAND_WEB_SERVICE_HOST=172.20.191.242
LC_MESSAGES=C.UTF-8
PYTHON_PIP_VERSION=20.0.2
KUBERNETES_PORT_443_TCP_ADDR=172.20.0.1
DATABASE_HOST=*snip*
AIRFLOW__CORE__REMOTE_LOG_CONN_ID=s3_connection
AIRFLOW__EMAIL__EMAIL_BACKEND=airflow.utils.email.send_email_smtp
LC_CTYPE=C.UTF-8
AIRFLOW__SMTP__SMTP_STARTTLS=False
AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME=boiling-land
PYTHON_GET_PIP_SHA256=*snip*
AIRFLOW__CORE__SQL_ALCHEMY_CONN=*snip*
KUBERNETES_SERVICE_HOST=172.20.0.1
LC_ALL=C.UTF-8
AIRFLOW__CORE__REMOTE_LOGGING=True
KUBERNETES_PORT=tcp://172.20.0.1:443
KUBERNETES_PORT_443_TCP_PORT=443
AIRFLOW_KUBERNETES_ENVIRONMENT_VARIABLES_KUBE_CLIENT_REQUEST_TIMEOUT_SEC=50
AIRFLOW__KUBERNETES__GIT_BRANCH=master
PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/d59197a3c169cef378a22428a3fa99d33e080a5d/get-pip.py
AIRFLOW__KUBERNETES__DELETE_WORKER_PODS=False
PATH=/home/airflow/.local/bin:/home/airflow/.local/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
AIRFLOW__KUBERNETES__DAGS_VOLUME_SUBPATH=repo/
PYTHON_BASE_IMAGE=python:3.7-slim-buster
AIRFLOW_UID=50000
AIRFLOW__CORE__FERNET_KEY=*snip*
DEBIAN_FRONTEND=noninteractive
BOILING_LAND_WEB_PORT_8080_TCP=tcp://172.20.191.242:8080
AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABALES__AIRFLOW__CORE__FERNET_KEY=*snip*
AIRFLOW__SMTP__SMTP_SSL=False
BOILING_LAND_WEB_PORT_8080_TCP_ADDR=172.20.191.242
AIRFLOW__SMTP__SMTP_HOST=email-smtp.us-east-1.amazonaws.com
_=/usr/bin/env
</details>
- **Cloud provider or hardware configuration**: AWS EKS
- **OS** (e.g. from /etc/os-release):
NAME="Amazon Linux"
VERSION="2"
- **Kernel** (e.g. `uname -a`): Linux <AWS_INTERNAL_HOSTNAME>.x86_64 #1 SMP Thu May 7 18:48:23 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- **Install tools**:
- **Others**:
**What happened**:
Using the KubernetesExecutor, a pod is launched to run a task that uses the KubernetesPodOperator. The task fails due to an issue in the task definition, such as an invalid option. The pod does not exit immediately: it takes about 40 minutes to exit after the failure and report its state in the UI. Interestingly, the execution time is correctly listed as < 1 second.
The task logs on the launcher pod also say the job is being marked as failed, and after about 40 minutes the state does change:
<details>
```
[2020-08-03 04:28:32,844] {taskinstance.py:1202} INFO - Marking task as FAILED.dag_id=arxiv_crawler_pipeline, task_id=launch_crawl_pod, execution_date=20200803T042640, start_date=20200803T042652, end_date=20200803T042832
```
</details>
The scheduler logs on the launcher pod say nothing about the failure though:
<details>
```
[2020-08-03 04:26:51,543] {__init__.py:51} INFO - Using executor LocalExecutor
[2020-08-03 04:26:51,544] {dagbag.py:396} INFO - Filling up the DagBag from /opt/airflow/dags/crawlers/arxiv/arxiv_crawl_pipeline.py
/home/airflow/.local/lib/python3.7/site-packages/airflow/contrib/operators/kubernetes_pod_operator.py:159: PendingDeprecationWarning: Invalid arguments were passed to KubernetesPodOperator (task_id: launch_crawl_pod). Support for passing such arguments will be dropped in Airflow 2.0. Invalid arguments were:
*args: ()
**kwargs: {'reattach_on_restart': True, 'log_events_on_failure': True}
super(KubernetesPodOperator, self).__init__(*args, resources=None, **kwargs)
/home/airflow/.local/lib/python3.7/site-packages/airflow/sensors/base_sensor_operator.py:71: PendingDeprecationWarning: Invalid arguments were passed to HttpSensor (task_id: wait_for_finish). Support for passing such arguments will be dropped in Airflow 2.0. Invalid arguments were:
*args: ()
**kwargs: {'result_check': <function check_http_response at 0x7fdc378bf320>}
super(BaseSensorOperator, self).__init__(*args, **kwargs)
Running %s on host %s <TaskInstance: arxiv_crawler_pipeline.launch_crawl_pod 2020-08-03T04:26:40.850022+00:00 [queued]> arxivcrawlerpipelinelaunchcrawlpod-4c7e99ae14704b2b8fa0d64db508
```
</details>
**What you expected to happen**:
The pod should exit immediately and report the failed task state to the metadata database, which should then be reflected in the job UI in a much more timely fashion.
No idea; I've been looking at this for about 12 hours now, and this report is my "I can't figure it out" moment.
**How to reproduce it**:
Set up an Airflow cluster on a Kubernetes cluster with the KubernetesExecutor, then create a job that attempts to launch a KubernetesPodOperator task that will fail, either during the launch attempt or inside the pod created by the task.
How often does this problem occur? Once? Every time etc?
Every single time.
Any relevant logs to include? Put them here inside a details tag:
The logs don't really give any insight into why there is such a dramatic lag between the failure and the metadata update.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] iantbutler01 edited a comment on issue #10122: KubernetesPodOperator specifically takes a very long time to update to failed state after task fails.
iantbutler01 edited a comment on issue #10122:
URL: https://github.com/apache/airflow/issues/10122#issuecomment-668560331
I can confirm that this is the cause. There is currently no option to pass a timeout to SMTP, but disabling email sending cleared it right up: the failure was correctly reported within a 5-minute window instead of 40+ minutes. I'm going to look into adding both a sensible default timeout and a config option.
[GitHub] [airflow] iantbutler01 commented on issue #10122: KubernetesPodOperator specifically takes a very long time to update to failed state after task fails.
iantbutler01 commented on issue #10122:
URL: https://github.com/apache/airflow/issues/10122#issuecomment-668273631
KubernetesPodOperator is a red herring. I believe I know the issue, and it should be addressed, as it will happen on every failure that attempts to send an email.
`s = smtplib.SMTP_SSL(SMTP_HOST, SMTP_PORT) if SMTP_SSL else smtplib.SMTP(SMTP_HOST, SMTP_PORT)`
does not set a timeout, so it falls back to socket._GLOBAL_DEFAULT_TIMEOUT, which socket.connect treats as None. This means the call will hang indefinitely if it is unable to make the SMTP connection.
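That fallback is easy to verify against the standard library itself — a minimal sketch, nothing Airflow-specific:

```python
import inspect
import smtplib
import socket

# smtplib.SMTP's `timeout` parameter defaults to the sentinel
# socket._GLOBAL_DEFAULT_TIMEOUT, which socket.create_connection()
# translates into "use socket.getdefaulttimeout()".
sig = inspect.signature(smtplib.SMTP.__init__)
print(sig.parameters["timeout"].default is socket._GLOBAL_DEFAULT_TIMEOUT)

# The process-wide default timeout is None unless the application has
# called socket.setdefaulttimeout() -- i.e. block forever on connect.
print(socket.getdefaulttimeout())
```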
I am testing the fix on my end right now; assuming all goes well, I think a PR should be made to set a default timeout of, say, 60 seconds. I am happy to make and test that PR if it's wanted.
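A sketch of what such a fix could look like — note that `SMTP_HOST` here is a deliberately non-routable placeholder address to simulate an unreachable server, and `SMTP_TIMEOUT` is a hypothetical setting name, not Airflow's actual configuration:

```python
import smtplib
import time

SMTP_HOST = "10.255.255.1"  # placeholder: non-routable, simulates an unreachable server
SMTP_PORT = 25
SMTP_SSL = False
SMTP_TIMEOUT = 5            # hypothetical configurable default, in seconds

def get_smtp_connection():
    # Passing an explicit timeout bounds connect() and every later socket
    # operation, so an unreachable SMTP server fails fast instead of
    # hanging the failure-email path indefinitely.
    if SMTP_SSL:
        return smtplib.SMTP_SSL(SMTP_HOST, SMTP_PORT, timeout=SMTP_TIMEOUT)
    return smtplib.SMTP(SMTP_HOST, SMTP_PORT, timeout=SMTP_TIMEOUT)

start = time.monotonic()
try:
    get_smtp_connection()
except OSError as exc:  # socket.timeout is an OSError subclass
    print(f"gave up after {time.monotonic() - start:.1f}s: {exc!r}")
```

With the stock call, the same scenario blocks until the operating system gives up on the TCP connection attempt, which is consistent with the multi-minute lag per email reported above.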
[GitHub] [airflow] boring-cyborg[bot] commented on issue #10122: KubernetesPodOperator specifically takes a very long time to update to failed state after task fails.
boring-cyborg[bot] commented on issue #10122:
URL: https://github.com/apache/airflow/issues/10122#issuecomment-667799621
Thanks for opening your first issue here! Be sure to follow the issue template!
[GitHub] [airflow] potiuk commented on issue #10122: SMTP connection in email utils has no default timeout which causes the connection to hang indefinitely if it can't reach the SMTP server.
potiuk commented on issue #10122:
URL: https://github.com/apache/airflow/issues/10122#issuecomment-668576549
Nice! Looking forward to the PR fixing it :)
[GitHub] [airflow] eladkal closed issue #10122: SMTP connection in email utils has no default timeout which causes the connection to hang indefinitely if it can't reach the SMTP server.
eladkal closed issue #10122:
URL: https://github.com/apache/airflow/issues/10122
[GitHub] [airflow] eladkal commented on issue #10122: SMTP connection in email utils has no default timeout which causes the connection to hang indefinitely if it can't reach the SMTP server.
eladkal commented on issue #10122:
URL: https://github.com/apache/airflow/issues/10122#issuecomment-786177390
A default timeout was added in https://github.com/apache/airflow/pull/12801.