You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/06/15 11:08:47 UTC

[GitHub] [airflow] shivanshs9 edited a comment on issue #13542: Task stuck in "scheduled" or "queued" state, pool has all slots queued, nothing is executing

shivanshs9 edited a comment on issue #13542:
URL: https://github.com/apache/airflow/issues/13542#issuecomment-861406671


   > `kill -USR2 <pid of scheduler>` -- how you get the pid depends upon how and where you are running it :)
   > 
   > Likely exec in to the container, run `ps auxww` and find the oldest scheduler processs (you'll see some sub processes, possibly named helpfully).
   
   @ashb @Jorricks  I'm facing a similar problem as OP on Airflow v2.0.1 with Celery Executor. If it helps, you can find the debug information from scheduler logs:
   
   ```
   [2021-06-15 07:13:48,623] {{dag_processing.py:1071}} INFO - Finding 'running' jobs without a recent heartbeat
   [2021-06-15 07:13:48,624] {{dag_processing.py:1075}} INFO - Failing jobs without heartbeat after 2021-06-15 07:08:48.624534+00:00
   [2021-06-15 07:13:49,539] {{scheduler_job.py:757}} INFO - --------------------------------------------------------------------------------
   SIGUSR2 received, printing debug
   --------------------------------------------------------------------------------
   [2021-06-15 07:13:49,539] {{base_executor.py:302}} INFO - executor.queued (0)
   	
   [2021-06-15 07:13:49,539] {{base_executor.py:307}} INFO - executor.running (40)
   	TaskInstanceKey(dag_id='ergo_job_collector', task_id='process_job_result', execution_date=datetime.datetime(2021, 6, 15, 4, 39, 10, 753973, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_task_queued', execution_date=datetime.datetime(2021, 6, 13, 22, 40, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='ergo_task_queuer', task_id='push_tasks', execution_date=datetime.datetime(2021, 6, 15, 4, 39, 1, 316420, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='smart_sensor_group_shard_4', task_id='smart_sensor_task', execution_date=datetime.datetime(2021, 6, 14, 10, 54, 17, 679502, tzinfo=Timezone('UTC')), try_number=32)
   	TaskInstanceKey(dag_id='smart_sensor_group_shard_1', task_id='smart_sensor_task', execution_date=datetime.datetime(2021, 6, 14, 7, 3, 23, 693482, tzinfo=Timezone('UTC')), try_number=41)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 13, 9, 10, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 13, 9, 10, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='smart_sensor_group_shard_3', task_id='smart_sensor_task', execution_date=datetime.datetime(2021, 6, 14, 6, 58, 43, 797668, tzinfo=Timezone('UTC')), try_number=39)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_task_queued', execution_date=datetime.datetime(2021, 6, 11, 6, 10, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 15, 4, 0, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 14, 22, 0, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 11, 6, 30, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_task_pusher', execution_date=datetime.datetime(2021, 6, 15, 4, 0, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_task_queued', execution_date=datetime.datetime(2021, 6, 13, 22, 20, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 14, 22, 20, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_task_queued', execution_date=datetime.datetime(2021, 6, 12, 21, 10, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 13, 9, 16, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_task_queued', execution_date=datetime.datetime(2021, 6, 11, 6, 20, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_task_pusher', execution_date=datetime.datetime(2021, 6, 15, 4, 0, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 15, 4, 0, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 15, 4, 20, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_task_queued', execution_date=datetime.datetime(2021, 6, 12, 21, 20, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 15, 4, 5, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='smart_sensor_group_shard_0', task_id='smart_sensor_task', execution_date=datetime.datetime(2021, 6, 14, 7, 0, 2, 374657, tzinfo=Timezone('UTC')), try_number=42)
   	TaskInstanceKey(dag_id='smart_sensor_group_shard_2', task_id='smart_sensor_task', execution_date=datetime.datetime(2021, 6, 14, 6, 58, 43, 797327, tzinfo=Timezone('UTC')), try_number=40)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 15, 4, 32, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_task_queued', execution_date=datetime.datetime(2021, 6, 12, 21, 0, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 14, 22, 15, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 15, 4, 10, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_task_pusher', execution_date=datetime.datetime(2021, 6, 15, 4, 0, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 14, 22, 20, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_task_queued', execution_date=datetime.datetime(2021, 6, 15, 3, 0, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_task_queued', execution_date=datetime.datetime(2021, 6, 15, 0, 0, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 15, 4, 20, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_task_queued', execution_date=datetime.datetime(2021, 6, 12, 21, 0, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 14, 22, 30, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='smart_sensor_group_shard_3', task_id='smart_sensor_task', execution_date=datetime.datetime(2021, 6, 14, 6, 58, 43, 797668, tzinfo=Timezone('UTC')), try_number=41)
   	TaskInstanceKey(dag_id='smart_sensor_group_shard_4', task_id='smart_sensor_task', execution_date=datetime.datetime(2021, 6, 14, 10, 54, 17, 679502, tzinfo=Timezone('UTC')), try_number=30)
   	TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_task_queued', execution_date=datetime.datetime(2021, 6, 11, 6, 0, tzinfo=Timezone('UTC')), try_number=1)
   	TaskInstanceKey(dag_id='smart_sensor_group_shard_0', task_id='smart_sensor_task', execution_date=datetime.datetime(2021, 6, 14, 7, 0, 2, 374657, tzinfo=Timezone('UTC')), try_number=39)
   [2021-06-15 07:13:49,540] {{base_executor.py:308}} INFO - executor.event_buffer (0)
   	
   [2021-06-15 07:13:49,540] {{celery_executor.py:387}} INFO - executor.tasks (27)
   	(TaskInstanceKey(dag_id='smart_sensor_group_shard_4', task_id='smart_sensor_task', execution_date=datetime.datetime(2021, 6, 14, 10, 54, 17, 679502, tzinfo=Timezone('UTC')), try_number=30), <AsyncResult: 6960c6d0-7e21-4ef9-8f04-d95efdd9d706>)
   	(TaskInstanceKey(dag_id='smart_sensor_group_shard_2', task_id='smart_sensor_task', execution_date=datetime.datetime(2021, 6, 14, 6, 58, 43, 797327, tzinfo=Timezone('UTC')), try_number=40), <AsyncResult: 4a815d59-d824-4373-8f96-34272174cfc0>)
   	(TaskInstanceKey(dag_id='smart_sensor_group_shard_3', task_id='smart_sensor_task', execution_date=datetime.datetime(2021, 6, 14, 6, 58, 43, 797668, tzinfo=Timezone('UTC')), try_number=39), <AsyncResult: 5296e550-5efd-474a-9e34-897330504886>)
   	(TaskInstanceKey(dag_id='smart_sensor_group_shard_0', task_id='smart_sensor_task', execution_date=datetime.datetime(2021, 6, 14, 7, 0, 2, 374657, tzinfo=Timezone('UTC')), try_number=39), <AsyncResult: cad2a01b-cde3-473e-85bc-9f96eae7da7c>)
   	(TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_task_pusher', execution_date=datetime.datetime(2021, 6, 15, 4, 0, tzinfo=Timezone('UTC')), try_number=1), <AsyncResult: cae54db9-56a5-4a27-8226-1dc5867265f7>)
   	(TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 14, 22, 0, tzinfo=Timezone('UTC')), try_number=1), <AsyncResult: 438e5379-ef8d-4bfe-8e55-cae461a3f62f>)
   	(TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 15, 4, 0, tzinfo=Timezone('UTC')), try_number=1), <AsyncResult: 9a953b60-b222-4d4c-aff4-5b831caef71f>)
   	(TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_task_pusher', execution_date=datetime.datetime(2021, 6, 15, 4, 0, tzinfo=Timezone('UTC')), try_number=1), <AsyncResult: 840e404c-1cee-4c18-8123-b354a63e8d80>)
   	(TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 14, 22, 20, tzinfo=Timezone('UTC')), try_number=1), <AsyncResult: 4410416b-7132-4713-a78d-caa0ce3cacb0>)
   	(TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 15, 4, 0, tzinfo=Timezone('UTC')), try_number=1), <AsyncResult: 91d5060a-ff5c-4997-832a-757c25069ac7>)
   	(TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 15, 4, 10, tzinfo=Timezone('UTC')), try_number=1), <AsyncResult: ade2d701-f632-41d9-a901-40531619dfba>)
   	(TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 15, 4, 20, tzinfo=Timezone('UTC')), try_number=1), <AsyncResult: febb3b1e-0f63-41d3-83dc-58f5b11a39ca>)
   	(TaskInstanceKey(dag_id='ergo_job_collector', task_id='process_job_result', execution_date=datetime.datetime(2021, 6, 15, 4, 39, 10, 753973, tzinfo=Timezone('UTC')), try_number=1), <AsyncResult: 390f1451-7f44-4415-8c57-e2542ed6b6d2>)
   	(TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 14, 22, 30, tzinfo=Timezone('UTC')), try_number=1), <AsyncResult: 7926f732-e671-4f64-afaf-7c352a85a929>)
   	(TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 15, 4, 32, tzinfo=Timezone('UTC')), try_number=1), <AsyncResult: ba1edc8e-66b4-4a20-9cc6-5b0c5758d8c4>)
   	(TaskInstanceKey(dag_id='ergo_task_queuer', task_id='push_tasks', execution_date=datetime.datetime(2021, 6, 15, 4, 39, 1, 316420, tzinfo=Timezone('UTC')), try_number=1), <AsyncResult: 0ef80d4f-1c4c-4e62-b5ed-cebd0b3ba075>)
   	(TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 15, 4, 20, tzinfo=Timezone('UTC')), try_number=1), <AsyncResult: fea3edfa-477d-4fb0-a580-03ea91faaef1>)
   	(TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_task_pusher', execution_date=datetime.datetime(2021, 6, 15, 4, 0, tzinfo=Timezone('UTC')), try_number=1), <AsyncResult: 4749177d-df70-49cf-a2fb-9b0a230d5ff4>)
   	(TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_task_queued', execution_date=datetime.datetime(2021, 6, 15, 0, 0, tzinfo=Timezone('UTC')), try_number=1), <AsyncResult: bf04f105-efc7-41be-864c-8161c4973f6b>)
   	(TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_task_queued', execution_date=datetime.datetime(2021, 6, 15, 3, 0, tzinfo=Timezone('UTC')), try_number=1), <AsyncResult: 7c8c6550-cdcc-4405-816b-15da1dbeb79e>)
   	(TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 14, 22, 15, tzinfo=Timezone('UTC')), try_number=1), <AsyncResult: ee255fe6-21d2-4d3f-8860-83e55265fea9>)
   	(TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 14, 22, 20, tzinfo=Timezone('UTC')), try_number=1), <AsyncResult: 46b77d60-43b5-41e6-99eb-e407e0d74791>)
   	(TaskInstanceKey(dag_id='[REDACTED]', task_id='chronos_job_sensor', execution_date=datetime.datetime(2021, 6, 15, 4, 5, tzinfo=Timezone('UTC')), try_number=1), <AsyncResult: ed5115f9-dec6-49da-a5bf-15860a907ab4>)
   	(TaskInstanceKey(dag_id='smart_sensor_group_shard_4', task_id='smart_sensor_task', execution_date=datetime.datetime(2021, 6, 14, 10, 54, 17, 679502, tzinfo=Timezone('UTC')), try_number=32), <AsyncResult: 49ee2d3f-eef1-4f7a-a5da-923e8cb8b914>)
   	(TaskInstanceKey(dag_id='smart_sensor_group_shard_1', task_id='smart_sensor_task', execution_date=datetime.datetime(2021, 6, 14, 7, 3, 23, 693482, tzinfo=Timezone('UTC')), try_number=41), <AsyncResult: 910a1d67-658d-4fe9-954a-2d093b64c95d>)
   	(TaskInstanceKey(dag_id='smart_sensor_group_shard_0', task_id='smart_sensor_task', execution_date=datetime.datetime(2021, 6, 14, 7, 0, 2, 374657, tzinfo=Timezone('UTC')), try_number=42), <AsyncResult: 8061f49f-8050-4b73-84fd-4a0222369040>)
   	(TaskInstanceKey(dag_id='smart_sensor_group_shard_3', task_id='smart_sensor_task', execution_date=datetime.datetime(2021, 6, 14, 6, 58, 43, 797668, tzinfo=Timezone('UTC')), try_number=41), <AsyncResult: 262cf0ac-1132-4176-aba0-77ec4a2bddb7>)
   [2021-06-15 07:13:49,540] {{celery_executor.py:390}} INFO - executor.adopted_task_timeouts (0)
   	
   [2021-06-15 07:13:49,540] {{scheduler_job.py:760}} INFO - --------------------------------------------------------------------------------
   [2021-06-15 07:13:51,028] {{dag_processing.py:838}} INFO -
   ================================================================================
   DAG File Processing Stats
   
   File Path                                                                                           PID  Runtime      # DAGs    # Errors  Last Runtime    Last Run
   ------------------------------------------------------------------------------------------------  -----  ---------  --------  ----------  --------------  -------------------
   /home/airflow/.local/lib/python3.8/site-packages/airflow/smart_sensor_dags/__init__.py                                     0           0  0.09s           2021-06-15T07:13:50
   /opt/airflow/dags/dag_ergo.py                                                                                              2           0  0.47s           2021-06-15T07:13:51
   /home/airflow/.local/lib/python3.8/site-packages/airflow/smart_sensor_dags/smart_sensor_group.py  16087  0.00s             5           0  0.15s           2021-06-15T07:13:50
   /opt/airflow/dags/dag_dagen.py                                                                    16081  1.48s            90           0  4.10s           2021-06-15T07:13:49
   ================================================================================
   ```
   
   Workers have already finished executing these tasks but the scheduler is stuck and doesn't schedule any new tasks. The only fix that works is by manually terminating and restarting the scheduler pod.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org