Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/08/25 07:28:41 UTC
[GitHub] [airflow] FloChehab opened a new issue #10541: KubernetesPodOperator stuck in `up_for_retry` state after scheduler restart.
FloChehab opened a new issue #10541:
URL: https://github.com/apache/airflow/issues/10541
**Apache Airflow version**: 1.10.12 rc4
**Kubernetes version (if you are using kubernetes)** (use `kubectl version`): v1.16.11-gke.5
**Environment**:
- **Cloud provider or hardware configuration**: /
- **OS** (e.g. from /etc/os-release): /
- **Kernel** (e.g. `uname -a`): /
- **Install tools**: /
- **Others**: `apache/airflow@sha256:6de1374274f26836c98bbe9f8c065215491f8f5bd48bedc155765dec9b883144`
**What happened**:
This issue is a follow-up to the discussion on https://github.com/apache/airflow/pull/10230#issuecomment-679274286.
Consider this DAG:
```python
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
from airflow.models import DAG
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'Airflow',
    'start_date': days_ago(2),
    'retries': 3
}

with DAG(
    dag_id='bug_kuberntes_pod_operator',
    default_args=default_args,
    schedule_interval=None
) as dag:
    k = KubernetesPodOperator(
        namespace='airflow',
        image="ubuntu:16.04",
        cmds=["bash", "-cx"],
        arguments=["sleep 100"],
        name="airflow-test-pod",
        task_id="task",
        get_logs=True,
        is_delete_operator_pod=True,
    )
```
If you:
1. Trigger the dag,
2. Wait for the task to be up and running on kubernetes,
3. Kill everything related to airflow (except the task running on kubernetes),
4. Wait for the task to complete on Kubernetes,
5. Restart airflow.
Then the task is marked as `up_for_retry` and stays stuck in that state until another scheduler restart.
**What you expected to happen**:
The task should be marked as success after the first scheduler restart, or at least not remain stuck in the `up_for_retry` state.
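Conceptually, the expectation is that on restart the scheduler reconciles the orphaned task instance against the final pod state instead of blindly treating it as failed. A minimal illustrative sketch of that decision logic (the function and state names here are hypothetical, not Airflow's actual internals):

```python
# Hypothetical reconciliation logic: decide what state an orphaned task
# instance should take after a scheduler restart, based on the terminal
# phase of its Kubernetes pod. Illustrative only -- these names are not
# Airflow's real API.

def reconcile_orphaned_task(pod_phase, tries, max_retries):
    """Map a pod's phase to the task state the scheduler should record."""
    if pod_phase == "Succeeded":
        return "success"        # pod finished fine: no retry needed
    if pod_phase == "Failed" and tries < max_retries:
        return "up_for_retry"   # genuine failure: a retry is correct
    if pod_phase == "Failed":
        return "failed"         # retries exhausted
    return "queued"             # pod still running/pending: re-adopt it
```

In the scenario above the pod ends in `Succeeded`, so the expected outcome is `success`; the bug is that the task lands in `up_for_retry` and is never re-examined.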
**How to reproduce it**:
* Use the dag above,
* Tested with both LocalExecutor and CeleryExecutor (with KEDA), using the Helm chart from master with no major changes except setting the timezone to Europe/Paris.
**Anything else we need to know**:
* The issue seems to appear every time,
* Scheduler logs can be found here: https://github.com/apache/airflow/pull/10230#issuecomment-679304807 & https://github.com/apache/airflow/pull/10230#issuecomment-679314891
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
FloChehab commented on issue #10541:
URL: https://github.com/apache/airflow/issues/10541#issuecomment-785719910
Hello @kaxil,
Just tested this morning, and everything looks fine on 2.0.
I am closing the issue now.
kaxil commented on issue #10541:
URL: https://github.com/apache/airflow/issues/10541#issuecomment-785360274
Does this still occur with Airflow 2.0?
FloChehab commented on issue #10541:
URL: https://github.com/apache/airflow/issues/10541#issuecomment-683281891
Hi @luozhaoyu, I am not sure your issue is the same: based on the log you provided, your scheduler simply cannot talk to the Kubernetes API in your cluster (it looks more like a configuration/permissions issue on your side).
FloChehab closed issue #10541:
URL: https://github.com/apache/airflow/issues/10541
luozhaoyu edited a comment on issue #10541:
URL: https://github.com/apache/airflow/issues/10541#issuecomment-683225572
I also encountered the same issue using:
1. manifest generated from helm chart master branch
2. KubernetesPodOperator
3. using both minikube and a real k8s cluster
4. docker image 1.10.12-python3.8
```
airflow@airflow-scheduler-54797f7ddb-5bsb7:/opt/airflow$ airflow run my_example start1 2020-08-24T09:00:00+00:00 -sd /tmp/my_example.py
[2020-08-29 02:51:24,996] {settings.py:233} DEBUG - Setting up DB connection pool (PID 22402)
[2020-08-29 02:51:24,996] {settings.py:273} DEBUG - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=22402
[2020-08-29 02:51:25,162] {sentry.py:179} DEBUG - Could not configure Sentry: No module named 'blinker', using DummySentry instead.
[2020-08-29 02:51:25,228] {__init__.py:45} DEBUG - Cannot import due to doesn't look like a module path
[2020-08-29 02:51:25,467] {cli_action_loggers.py:42} DEBUG - Adding <function default_action_log at 0x7f112d7b3430> to pre execution callback
[2020-08-29 02:51:25,861] {cli_action_loggers.py:68} DEBUG - Calling callbacks: [<function default_action_log at 0x7f112d7b3430>]
[2020-08-29 02:51:25,887] {settings.py:233} DEBUG - Setting up DB connection pool (PID 22402)
[2020-08-29 02:51:25,887] {settings.py:241} DEBUG - settings.configure_orm(): Using NullPool
/home/airflow/.local/lib/python3.8/site-packages/airflow/kubernetes/pod_generator.py:39: DeprecationWarning: This module is deprecated. Please use `airflow.kubernetes.pod`.
from airflow.contrib.kubernetes.pod import _extract_volume_mounts
[2020-08-29 02:51:26,196] {__init__.py:50} INFO - Using executor KubernetesExecutor
[2020-08-29 02:51:26,200] {dagbag.py:417} INFO - Filling up the DagBag from /tmp/my_example.py
[2020-08-29 02:51:26,201] {dagbag.py:245} DEBUG - Importing /tmp/my_example.py
[2020-08-29 02:51:26,210] {dagbag.py:384} DEBUG - Loaded DAG <DAG: my_example>
Running %s on host %s <TaskInstance: my_example.start1 2020-08-24T09:00:00+00:00 [None]> airflow-scheduler-54797f7ddb-5bsb7
Traceback (most recent call last):
File "/home/airflow/.local/bin/airflow", line 37, in <module>
args.func(args)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/cli.py", line 76, in wrapper
return f(*args, **kwargs)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/bin/cli.py", line 579, in run
_run(args, dag, ti)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/bin/cli.py", line 500, in _run
executor.start()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 786, in start
self.clear_not_launched_queued_tasks()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/db.py", line 74, in wrapper
return func(*args, **kwargs)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 719, in clear_not_launched_queued_tasks
pod_list = self.kube_client.list_namespaced_pod(
File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 12803, in list_namespaced_pod
(data) = self.list_namespaced_pod_with_http_info(namespace, **kwargs) # noqa: E501
File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 12891, in list_namespaced_pod_with_http_info
return self.api_client.call_api(
File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 340, in call_api
return self.__call_api(resource_path, method,
File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 172, in __call_api
response_data = self.request(
File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 362, in request
return self.rest_client.GET(url,
File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 237, in GET
return self.request("GET", url,
File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 231, in request
raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Sat, 29 Aug 2020 02:51:26 GMT', 'Content-Length': '282'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods is forbidden: User \"system:serviceaccount:airflow:airflow\" cannot list resource \"pods\" in API group \"\" in the namespace \"default\"","reason":"Forbidden","details":{"kind":"pods"},"code":403}
```
This is my DAG:
```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.now() - timedelta(days=1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG('my_example', default_args=default_args)

start1 = KubernetesPodOperator(
    namespace='airflow',
    image="python:3.6",
    image_pull_policy="Always",
    cmds=["python", "-c"],
    arguments=["print('hello world')"],
    labels={"foo": "bar"},
    name="start1",
    resources={"request_cpu": "256m", "limit_cpu": "1",
               "request_memory": "256Mi", "limit_memory": "1Gi"},
    task_id="start1",
    get_logs=True,
    dag=dag
)
```
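The 403 above means the service account `system:serviceaccount:airflow:airflow` is not allowed to list pods in the `default` namespace; note the executor is looking in `default` while the operator targets `airflow` (in Airflow 1.10 the executor's namespace comes from the `[kubernetes] namespace` setting, i.e. `AIRFLOW__KUBERNETES__NAMESPACE`). A minimal RBAC sketch that would grant the missing permission, assuming the names from the error message; adjust the namespace to wherever the executor actually creates pods:

```yaml
# Hypothetical Role/RoleBinding granting the scheduler's service account
# permission to read pods in the namespace the executor queries.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: airflow-pod-reader
  namespace: default
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: airflow-pod-reader
  namespace: default
subjects:
  - kind: ServiceAccount
    name: airflow
    namespace: airflow
roleRef:
  kind: Role
  name: airflow-pod-reader
  apiGroup: rbac.authorization.k8s.io
```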