You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/03/02 08:51:07 UTC

[GitHub] [airflow] katguev opened a new issue #14556: Task stuck in queued state after pod fails to start

katguev opened a new issue #14556:
URL: https://github.com/apache/airflow/issues/14556


   **Apache Airflow version**: `2.0.1`
   
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl version`):  
   `Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.1", GitCommit:"7879fc12a63337efff607952a323df90cdc7a335", GitTreeState:"clean", BuildDate:"2020-04-10T21:53:58Z", GoVersion:"go1.14.2", Compiler:"gc", Platform:"darwin/amd64"}`  
   
   `Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.15-gke.7800", GitCommit:"cef3156c566a1d1a4b23ee360a760f45bfbaaac1", GitTreeState:"clean", BuildDate:"2020-12-14T09:12:37Z", GoVersion:"go1.13.15b4", Compiler:"gc", Platform:"linux/amd64"}`
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: `GKE`
   - **OS** (e.g. from /etc/os-release):
   - **Kernel** (e.g. `uname -a`):
   - **Install tools**:
   - **Others**: We use scheduler HA with 2 instances. 
   
   **What happened**:
   - We are running Airflow 2.0.1 using KubernetesExecutor and PostgreSQL 9.6.2.
   - Task is `up_for_retry` after its worker pod fails to start
   - It gets stuck in `queued` state
   - It runs only after a scheduler restart. 
   
   **What you expected to happen**:
   - Task gets rescheduled and runs successfully.
   
   **How to reproduce it**:
   
   
   **Anything else we need to know**:
   
   Logs below. 
   
   <details>
   
   
   Its upstream task succeeds and schedules it via fast follow.  
   `[2021-03-01 17:08:51,306] {taskinstance.py:1166} INFO - Marking task as SUCCESS. dag_id=datalake_dag_id, task_id=processidlogs, execution_date=20210301T163000, start_date=20210301T170546, end_date=20210301T170851`
   
   `[2021-03-01 17:08:51,339] {taskinstance.py:1220} INFO - 1 downstream tasks scheduled from follow-on schedule check`
   
   `[2021-03-01 17:08:51,357] {local_task_job.py:146} INFO - Task exited with return code 0`
   
   
   Task is attempted to be run.  
   
   `[2021-03-01 17:08:52,229] {scheduler_job.py:1105} INFO - Sending TaskInstanceKey(dag_id='datalake_dag_id', task_id='delta_id_logs', execution_date=datetime.datetime(2021, 3, 1, 16, 30, tzinfo=Timezone('UTC')), try_number=1) to executor with priority 1 and queue default`
   
   `[2021-03-01 17:08:52,308] {kubernetes_executor.py:306} DEBUG - Kubernetes running for command ['airflow', 'tasks', 'run', 'datalake_dag_id', 'delta_id_logs', '2021-03-01T16:30:00+00:00', '--local', '--pool', 'default_pool', '--subdir', '/opt/airflow/dags/data_lake/some_tasks/some_tasks.py']`
   
   `[2021-03-01 17:08:52,332] {scheduler_job.py:1206} INFO - Executor reports execution of datalake_dag_id.delta_id_logs execution_date=2021-03-01 16:30:00+00:00 exited with status queued for try_number 1`
   
   
   
   Pod fails to start.  
   
   `[2021-03-01 17:12:17,319] {kubernetes_executor.py:197} INFO - Event: Failed to start pod datalakedagiddeltaidlogs.5fa98ae3856f4cb4b6c8810ac13e5c6a, will reschedule`
   
   
   It is put as up_for_reschedule.  
   
   `[2021-03-01 17:12:23,912] {kubernetes_executor.py:343} DEBUG - Processing task ('datalakedagiddeltaidlogs.5fa98ae3856f4cb4b6c8810ac13e5c6a', 'prod', 'up_for_reschedule', {'dag_id': 'datalake_dag_id', 'task_id': 'delta_id_logs', 'execution_date': '2021-03-01T16:30:00+00:00', 'try_number': '1'}, '1172208829')`
   
   
   `[2021-03-01 17:12:23,930] {kubernetes_executor.py:528} INFO - Changing state of (TaskInstanceKey(dag_id='datalake_dag_id', task_id='delta_id_logs', execution_date=datetime.datetime(2021, 3, 1, 16, 30, tzinfo=tzlocal()), try_number=1), 'up_for_reschedule', 'datalakedagiddeltaidlogs.5fa98ae3856f4cb4b6c8810ac13e5c6a', 'prod', '1172208829') to up_for_reschedule`
   
   `[2021-03-01 17:12:23,941] {scheduler_job.py:1206} INFO - Executor reports execution of datalake_dag_id.delta_id_logs execution_date=2021-03-01 16:30:00+00:00 exited with status up_for_reschedule for try_number 1`
   
   
   A few minutes later, another scheduler finds it in queued state.  
   
   `[2021-03-01 17:15:39,177] {taskinstance.py:851} DEBUG - Dependencies all met for <TaskInstance: datalake_dag_id.delta_id_logs 2021-03-01 16:30:00+00:00 [queued]>`
   
   `[2021-03-01 17:15:40,477] {taskinstance.py:866} DEBUG - <TaskInstance: datalake_dag_id.delta_id_logs 2021-03-01 16:30:00+00:00 [queued]> dependency 'Not In Retry Period' PASSED: True, The context specified that being in a retry period was permitted.`
   
   `[2021-03-01 17:15:40,478] {taskinstance.py:866} DEBUG - <TaskInstance: datalake_dag_id.delta_id_logs 2021-03-01 16:30:00+00:00 [queued]> dependency 'Previous Dagrun State' PASSED: True, The task did not have depends_on_past set.`
   
   
   It stays in that state for another hour and a half until the scheduler is restarted.  
   
   Finally, it is rescheduled.  
   `[2021-03-01 18:58:10,475] {kubernetes_executor.py:463} INFO - TaskInstance: <TaskInstance: datalake_dag_id.delta_id_logs 2021-03-01 16:30:00+00:00 [queued]> found in queued state but was not launched, rescheduling`
   
   </details>
   
   Other tasks run fine while that one is stuck in queued state. We have a cron job to restart the scheduler as a hack to recover from when such cases happen, but we would like to avoid it as much as possible. 
   
   We run 60 DAGs with 50-100 tasks each every 30 minutes. We have been seeing this issue at least once daily since we upgraded to Airflow 2.0.1. 
   
   I understand there are some open issues about the scheduler or tasks getting stuck. But I could not tell if this is related, since hundreds of other tasks run as expected. Apologies if this turns out to be a duplicate of an existing issue. Thank you. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jedcunningham commented on issue #14556: Task stuck in queued state after pod fails to start

Posted by GitBox <gi...@apache.org>.
jedcunningham commented on issue #14556:
URL: https://github.com/apache/airflow/issues/14556#issuecomment-797549475


   I've been able to replicate this and will pick it up. This also happens with a single scheduler.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on issue #14556: Task stuck in queued state after pod fails to start

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #14556:
URL: https://github.com/apache/airflow/issues/14556#issuecomment-788736382


   Thanks for opening your first issue here! Be sure to follow the issue template!
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil closed issue #14556: Task stuck in queued state after pod fails to start

Posted by GitBox <gi...@apache.org>.
kaxil closed issue #14556:
URL: https://github.com/apache/airflow/issues/14556


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org