You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/06/02 15:19:19 UTC

[GitHub] [airflow] easontm commented on issue #13542: Task stuck in "scheduled" or "queued" state, pool has all slots queued, nothing is executing

easontm commented on issue #13542:
URL: https://github.com/apache/airflow/issues/13542#issuecomment-853117451


   More data, if it can benefit:
   I'm using Airflow `2.1.0` (upgraded from `1.10.15` in our development environment) and trying to use the `KubernetesExecutor`. I am running _extremely_ basic DAGs that exist solely to test. In `airflow.cfg` I set `delete_worker_pods = False` so that I could try and examine what's going on. As far as I can tell, my worker pods are being created, immediately considering themselves successful (not sure if they actually receive a task to complete), and terminating. I don't think the pods are "successful but not writing the success state to the DB" because one of my tasks is a simple `CREATE TABLE` statement on my data warehouse.
   
   
   Here are the results of various logs:
   
   If I run `kubectl get pods`, I see my `airflow-webserver` and `airflow-scheduler` pods, as well as some "Completed" worker pods.
   
   ```
   NAME                                                           READY   STATUS      RESTARTS   AGE
   airflow-scheduler                                              1/1     Running     0          45m
   airflow-webserver-1                                            2/2     Running     0          56m
   airflow-webserver-2                                            2/2     Running     0          56m
   airflow-webserver-3                                            2/2     Running     0          45m
   airflow-webserver-4                                            2/2     Running     0          45m
   airflow-webserver-5                                            2/2     Running     0          45m
   <GENERATED_WORKER_POD_1>                                       0/1     Completed   0          15m
   <GENERATED_WORKER_POD_2>                                       0/1     Completed   0          56m
   <GENERATED_WORKER_POD_3>                                       0/1     Completed   0          45m
   ```
   
   `kubectl logs <my_worker_pod>` - I can tell that the pod was briefly alive but that did nothing, because the only output is a line that always appears as a side effect of the Docker image config. In contrast, if I get the logs from one of my functional `1.10.15` pods, I can see the task start:
   ```
   [2021-06-02 13:33:01,358] {__init__.py:50} INFO - Using executor LocalExecutor
   [2021-06-02 13:33:01,358] {dagbag.py:417} INFO - Filling up the DagBag from /usr/local/airflow/dags/my_dag.py
   ...
   etc
   ```
   
   `kubectl describe pod <my_worker_pod>` - the event log is quite tame:
   ```
   Events:
     Type    Reason     Age    From               Message
     ----    ------     ----   ----               -------
     Normal  Scheduled  3m21s  default-scheduler  Successfully assigned <GENERATED_POD_NAME> to <EC2_INSTANCE>
     Normal  Pulling    3m19s  kubelet            Pulling image <MY_AIRFLOW_2.1.0_DOCKER_IMAGE>
     Normal  Pulled     3m19s  kubelet            Successfully pulled image <MY_AIRFLOW_2.1.0_DOCKER_IMAGE>
     Normal  Created    3m19s  kubelet            Created container base
     Normal  Started    3m19s  kubelet            Started container base
   ```
   That's it. No further events.
   
   `kubectl logs airflow-scheduler` - I've trimmed the logs significantly, but here are the statements mentioning a stuck task
   ```
   [2021-06-02 14:49:42,742] {kubernetes_executor.py:369} INFO - Attempting to finish pod; pod_id: <GENERATED_POD_NAME>; state: None; annotations: {'dag_id': '<TEST_DAG>', 'task_id': '<TASK_ID>', 'execution_date': '2021-05-30T00:00:00+00:00', 'try_number': '1'}
   [2021-06-02 14:49:42,748] {kubernetes_executor.py:546} INFO - Changing state of (TaskInstanceKey(dag_id='<TEST_DAG>', task_id='<TASK_ID>', execution_date=datetime.datetime(2021, 5, 30, 0, 0, tzinfo=tzlocal()), try_number=1), None, '<GENERATED_POD_NAME>', 'airflow', '1000200742') to None
   [2021-06-02 14:49:42,751] {scheduler_job.py:1212} INFO - Executor reports execution of <TASK_ID> execution_date=2021-05-30 00:00:00+00:00 exited with status None for try_number 1
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org