Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/08/25 07:28:41 UTC

[GitHub] [airflow] FloChehab opened a new issue #10541: KubernetesPodOperator stuck in `up_for_retry` state after scheduler restart.

FloChehab opened a new issue #10541:
URL: https://github.com/apache/airflow/issues/10541


   **Apache Airflow version**: 1.10.12 rc4
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl version`):  v1.16.11-gke.5
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: /
   - **OS** (e.g. from /etc/os-release): /
   - **Kernel** (e.g. `uname -a`): /
   - **Install tools**: /
   - **Others**: `apache/airflow@sha256:6de1374274f26836c98bbe9f8c065215491f8f5bd48bedc155765dec9b883144`
   
   **What happened**:
   
   This issue is a follow-up to the discussion on https://github.com/apache/airflow/pull/10230#issuecomment-679274286.
   
   Let's take this DAG:
   
   ```python
   from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
   from airflow.kubernetes.secret import Secret
   from airflow.models import DAG
   from airflow.utils.dates import days_ago
   
   
   default_args = {
       'owner': 'Airflow',
       'start_date': days_ago(2),
       'retries': 3
   }
   
   with DAG(
       dag_id='bug_kubernetes_pod_operator',
       default_args=default_args,
       schedule_interval=None
   ) as dag:
       k = KubernetesPodOperator(
           namespace='airflow',
           image="ubuntu:16.04",
           cmds=["bash", "-cx"],
           arguments=["sleep 100"],
           name="airflow-test-pod",
           task_id="task",
           get_logs=True,
           is_delete_operator_pod=True,
       )
   ```
   
   If you:
   1. Trigger the dag,
   2. Wait for the task to be up and running on Kubernetes,
   3. Kill everything related to Airflow (except the task running on Kubernetes),
   4. Wait for the task to complete on Kubernetes,
   5. Restart airflow.
   
   The task is then marked as `up_for_retry` and stays stuck in this state until another scheduler restart.
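   
   For reference, a minimal sketch (standard Airflow 1.10 model/session APIs, assuming access to an environment where the metadata DB is configured) to inspect the stuck task instance; the `dag_id` matches the example DAG above:
   
   ```python
   # Minimal sketch: query the metadata DB for the task instance state
   # after the scheduler restart (dag_id matches the example DAG above).
   from airflow.models import TaskInstance
   from airflow.settings import Session
   
   session = Session()
   tis = (
       session.query(TaskInstance)
       .filter(TaskInstance.dag_id == 'bug_kubernetes_pod_operator')
       .all()
   )
   for ti in tis:
       print(ti.task_id, ti.execution_date, ti.state, ti.try_number)
   session.close()
   ```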
   
   **What you expected to happen**:
   
   The task should be marked as success on the first scheduler restart, or at least not be stuck in the `up_for_retry` state.
   
   **How to reproduce it**:
   
   * Use the dag above,
   * Tested with both LocalExecutor and CeleryExecutor (on KEDA), each deployed with the helm chart from master, with no major changes except setting the timezone to Europe/Paris.
   
   
   **Anything else we need to know**:
   
   * The issue seems to occur every time,
   * Scheduler logs can be found here: https://github.com/apache/airflow/pull/10230#issuecomment-679304807 & https://github.com/apache/airflow/pull/10230#issuecomment-679314891



[GitHub] [airflow] FloChehab commented on issue #10541: KubernetesPodOperator stuck in `up_for_retry` state after scheduler restart.

Posted by GitBox <gi...@apache.org>.
FloChehab commented on issue #10541:
URL: https://github.com/apache/airflow/issues/10541#issuecomment-785719910


   Hello @kaxil, 
   Just tested this morning, and everything looks fine on 2.0.
   
   I am closing the issue now.



[GitHub] [airflow] kaxil commented on issue #10541: KubernetesPodOperator stuck in `up_for_retry` state after scheduler restart.

Posted by GitBox <gi...@apache.org>.
kaxil commented on issue #10541:
URL: https://github.com/apache/airflow/issues/10541#issuecomment-785360274


   Does this still occur with Airflow 2.0?



[GitHub] [airflow] FloChehab commented on issue #10541: KubernetesPodOperator stuck in `up_for_retry` state after scheduler restart.

Posted by GitBox <gi...@apache.org>.
FloChehab commented on issue #10541:
URL: https://github.com/apache/airflow/issues/10541#issuecomment-683281891


   Hi @luozhaoyu, I am not sure your issue is the same: based on the log you provided, I can see that you simply cannot interact with the Kubernetes API from within your cluster (it looks more like a configuration issue on your side).
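   
   For reference, one way to confirm that from inside the scheduler pod is to replay the API call that fails in your traceback. A minimal sketch with the kubernetes Python client (assumes in-cluster config, i.e. it runs inside a pod with the service account mounted):
   
   ```python
   # Minimal sketch: replay the list-pods call with the pod's service
   # account token to see which namespaces it can actually read.
   from kubernetes import client, config
   from kubernetes.client.rest import ApiException
   
   config.load_incluster_config()  # use the mounted service account token
   v1 = client.CoreV1Api()
   for ns in ("default", "airflow"):
       try:
           v1.list_namespaced_pod(namespace=ns, limit=1)
           print(f"{ns}: OK")
       except ApiException as e:
           print(f"{ns}: {e.status} {e.reason}")
   ```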



[GitHub] [airflow] FloChehab closed issue #10541: KubernetesPodOperator stuck in `up_for_retry` state after scheduler restart.

Posted by GitBox <gi...@apache.org>.
FloChehab closed issue #10541:
URL: https://github.com/apache/airflow/issues/10541


   



[GitHub] [airflow] luozhaoyu commented on issue #10541: KubernetesPodOperator stuck in `up_for_retry` state after scheduler restart.

Posted by GitBox <gi...@apache.org>.
luozhaoyu commented on issue #10541:
URL: https://github.com/apache/airflow/issues/10541#issuecomment-683225572


   I also encountered the same issue using:
   1. manifest generated from the helm chart master branch
   2. KubernetesPodOperator
   3. both minikube and a real k8s cluster
   4. docker image 1.10.12-python3.8
   
   
   ```
   airflow@airflow-scheduler-54797f7ddb-5bsb7:/opt/airflow$ airflow run my_example start1 2020-08-24T09:00:00+00:00 -sd /tmp/my_example.py
   [2020-08-29 02:51:24,996] {settings.py:233} DEBUG - Setting up DB connection pool (PID 22402)
   [2020-08-29 02:51:24,996] {settings.py:273} DEBUG - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=22402
   [2020-08-29 02:51:25,162] {sentry.py:179} DEBUG - Could not configure Sentry: No module named 'blinker', using DummySentry instead.
   [2020-08-29 02:51:25,228] {__init__.py:45} DEBUG - Cannot import  due to  doesn't look like a module path
   [2020-08-29 02:51:25,467] {cli_action_loggers.py:42} DEBUG - Adding <function default_action_log at 0x7f112d7b3430> to pre execution callback
   [2020-08-29 02:51:25,861] {cli_action_loggers.py:68} DEBUG - Calling callbacks: [<function default_action_log at 0x7f112d7b3430>]
   [2020-08-29 02:51:25,887] {settings.py:233} DEBUG - Setting up DB connection pool (PID 22402)
   [2020-08-29 02:51:25,887] {settings.py:241} DEBUG - settings.configure_orm(): Using NullPool
   /home/airflow/.local/lib/python3.8/site-packages/airflow/kubernetes/pod_generator.py:39: DeprecationWarning: This module is deprecated. Please use `airflow.kubernetes.pod`.
     from airflow.contrib.kubernetes.pod import _extract_volume_mounts
   [2020-08-29 02:51:26,196] {__init__.py:50} INFO - Using executor KubernetesExecutor
   [2020-08-29 02:51:26,200] {dagbag.py:417} INFO - Filling up the DagBag from /tmp/my_example.py
   [2020-08-29 02:51:26,201] {dagbag.py:245} DEBUG - Importing /tmp/my_example.py
   [2020-08-29 02:51:26,210] {dagbag.py:384} DEBUG - Loaded DAG <DAG: my_example>
   Running %s on host %s <TaskInstance: my_example.start1 2020-08-24T09:00:00+00:00 [None]> airflow-scheduler-54797f7ddb-5bsb7
   Traceback (most recent call last):
     File "/home/airflow/.local/bin/airflow", line 37, in <module>
       args.func(args)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/cli.py", line 76, in wrapper
       return f(*args, **kwargs)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/bin/cli.py", line 579, in run
       _run(args, dag, ti)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/bin/cli.py", line 500, in _run
       executor.start()
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 786, in start
       self.clear_not_launched_queued_tasks()
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/db.py", line 74, in wrapper
       return func(*args, **kwargs)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 719, in clear_not_launched_queued_tasks
       pod_list = self.kube_client.list_namespaced_pod(
     File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 12803, in list_namespaced_pod
       (data) = self.list_namespaced_pod_with_http_info(namespace, **kwargs)  # noqa: E501
     File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 12891, in list_namespaced_pod_with_http_info
       return self.api_client.call_api(
     File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 340, in call_api
       return self.__call_api(resource_path, method,
     File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 172, in __call_api
       response_data = self.request(
     File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 362, in request
       return self.rest_client.GET(url,
     File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 237, in GET
       return self.request("GET", url,
     File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 231, in request
       raise ApiException(http_resp=r)
   kubernetes.client.rest.ApiException: (403)
   Reason: Forbidden
   HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Sat, 29 Aug 2020 02:51:26 GMT', 'Content-Length': '282'})
   HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods is forbidden: User \"system:serviceaccount:airflow:airflow\" cannot list resource \"pods\" in API group \"\" in the namespace \"default\"","reason":"Forbidden","details":{"kind":"pods"},"code":403}
   ``` 
   
   This is my DAG:
   ```python
   from datetime import datetime, timedelta
   
   from airflow import DAG
   from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
   
   
   default_args = {
       'owner': 'airflow',
       'depends_on_past': False,
       'start_date': datetime.now() - timedelta(days=1),
       'email': ['airflow@example.com'],
       'email_on_failure': False,
       'email_on_retry': False,
       'retries': 1,
       'retry_delay': timedelta(minutes=5)
   }
   
   dag = DAG('my_example', default_args=default_args)
   
   
   start1 = KubernetesPodOperator(
       namespace='airflow',
       image="python:3.6",
       image_pull_policy="Always",
       cmds=["python", "-c"],
       arguments=["print('hello world')"],
       labels={"foo": "bar"},
       name="start1",
       resources={"request_cpu": "256m", "limit_cpu": "1",
                  "request_memory": "256Mi", "limit_memory": "1Gi"},
       task_id="start1",
       get_logs=True,
       dag=dag,
   )
   ```
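   
   The 403 in the traceback above says the `airflow` service account cannot list pods in the `default` namespace, while everything here is deployed in the `airflow` namespace, so the executor is likely querying a different namespace than the one Airflow runs in. A minimal sketch (standard Airflow 1.10 configuration API) to check which namespace the KubernetesExecutor will use on startup:
   
   ```python
   # Minimal sketch: print the namespace the KubernetesExecutor queries
   # on startup; "default" here, with Airflow deployed in "airflow",
   # would match the 403 above.
   from airflow.configuration import conf
   
   print(conf.get('kubernetes', 'namespace'))
   ```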

