Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/05/12 10:20:02 UTC
[GitHub] [airflow] yuqian90 commented on issue #7935: scheduler gets stuck without a trace
yuqian90 commented on issue #7935:
URL: https://github.com/apache/airflow/issues/7935#issuecomment-839656436
Hi @ashb, I'd like to report that we've recently been seeing something similar to this issue on Airflow 2.0.2.
We are running Airflow 2.0.2 with a single airflow-scheduler plus a few airflow-worker instances, using the CeleryExecutor and a Postgres backend, with dozens of DAGs each containing hundreds to a few thousand tasks. The Python version is 3.8.7.
Here's what we saw:
airflow-scheduler sometimes stops heartbeating and stops scheduling any tasks. This seems to happen at random, about once or twice a week. When it does, the last lines in the scheduler log are the following, i.e. the scheduler stops writing out any log after `receiving signal 15`. I ran strace on the airflow scheduler process; it did not capture any other process sending it signal 15, so most likely the scheduler sent signal 15 to itself.
```
May 11 21:19:56 shaops43 airflow[12643]: [2021-05-11 21:19:56,908] {base_executor.py:82} INFO - Adding to queue: ['airflow', 'tasks', 'run', ...]
May 11 21:19:56 shaops43 airflow[12643]: [2021-05-11 21:19:56,973] {scheduler_job.py:746} INFO - Exiting gracefully upon receiving signal 15
```
When the scheduler was in this state, `ps` also showed a child `airflow scheduler` process spawned by the main `airflow scheduler` process. I forgot to run `py-spy dump`, but I did run `py-spy top` against the child `airflow scheduler` process, with the output below. It seems to be stuck somewhere in `celery_executor.py::_send_tasks_to_celery`. This sounds similar to what @milton0825 reported previously, although he mentioned he was on Airflow 1.10.8.
When I manually sent SIGTERM to the child `airflow scheduler` process, it died, and the main `airflow scheduler` immediately started heartbeating and scheduling tasks again as if nothing had happened. So I suspect the `airflow scheduler` got stuck while spawning a child process. But I still don't understand how that produced an `Exiting gracefully upon receiving signal 15` in the log.
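One diagnostic idea for next time (an assumption on my part, not something we had in place): register a `faulthandler` dump on a spare signal such as SIGUSR1, so that we can get a Python stack trace of every thread in the stuck process without killing it. A minimal sketch:

```python
import faulthandler
import os
import signal


def dump_on_signal(path: str) -> str:
    """Register a faulthandler traceback dump on SIGUSR1, fire it once to
    demonstrate, and return what was written. In a real deployment you would
    only register the handler at startup and send SIGUSR1 from the outside
    (e.g. `kill -USR1 <scheduler pid>`) when the process appears stuck."""
    with open(path, "w") as f:
        # SIGUSR1 is an arbitrary spare signal chosen for this sketch.
        # faulthandler writes every thread's Python stack directly from the
        # signal handler, so it works even when the main loop is wedged.
        faulthandler.register(signal.SIGUSR1, file=f, all_threads=True)
        os.kill(os.getpid(), signal.SIGUSR1)
        faulthandler.unregister(signal.SIGUSR1)
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    print(dump_on_signal("/tmp/scheduler_stack.txt"))
```

That would also tell us exactly which line the scheduler's signal-15 handler was entered from.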
```
Total Samples 7859
GIL: 0.00%, Active: 0.00%, Threads: 1
%Own %Total OwnTime TotalTime Function (filename:line)
0.00% 0.00% 0.540s 0.540s __enter__ (multiprocessing/synchronize.py:95)
0.00% 0.00% 0.000s 0.540s worker (multiprocessing/pool.py:114)
0.00% 0.00% 0.000s 0.540s _bootstrap (multiprocessing/process.py:315)
0.00% 0.00% 0.000s 0.540s _repopulate_pool (multiprocessing/pool.py:303)
0.00% 0.00% 0.000s 0.540s main (airflow/__main__.py:40)
0.00% 0.00% 0.000s 0.540s start (multiprocessing/process.py:121)
0.00% 0.00% 0.000s 0.540s _send_tasks_to_celery (airflow/executors/celery_executor.py:330)
0.00% 0.00% 0.000s 0.540s Pool (multiprocessing/context.py:119)
0.00% 0.00% 0.000s 0.540s run (airflow/jobs/base_job.py:237)
0.00% 0.00% 0.000s 0.540s _repopulate_pool_static (multiprocessing/pool.py:326)
0.00% 0.00% 0.000s 0.540s heartbeat (airflow/executors/base_executor.py:158)
0.00% 0.00% 0.000s 0.540s _launch (multiprocessing/popen_fork.py:75)
0.00% 0.00% 0.000s 0.540s wrapper (airflow/utils/cli.py:89)
0.00% 0.00% 0.000s 0.540s __init__ (multiprocessing/pool.py:212)
0.00% 0.00% 0.000s 0.540s _Popen (multiprocessing/context.py:277)
```
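My guess at the mechanism (and it is only a guess): `_send_tasks_to_celery` creates a `multiprocessing.Pool`, which forks. If some lock happens to be held elsewhere in the parent at fork time, the child inherits that lock in the "held" state with no owner, and blocks forever trying to acquire it, which would match the `__enter__ (multiprocessing/synchronize.py:95)` frame above. A minimal reproduction of that fork-time lock hazard, independent of the Airflow code itself:

```python
import multiprocessing as mp


def try_acquire(lock):
    # In the forked child the lock is inherited already "held", but its
    # owner only exists in the parent process -- acquire() hangs forever.
    with lock:
        pass


def demonstrate_fork_deadlock() -> bool:
    """Return True if the forked child got stuck acquiring the inherited lock."""
    ctx = mp.get_context("fork")  # "fork" is the default start method on Linux
    lock = ctx.Lock()
    lock.acquire()  # simulate some other component holding the lock at fork time
    child = ctx.Process(target=try_acquire, args=(lock,))
    child.start()
    child.join(timeout=2)  # the child should still be blocked after 2 seconds
    stuck = child.is_alive()
    child.terminate()
    child.join()
    lock.release()
    return stuck


if __name__ == "__main__":
    print(demonstrate_fork_deadlock())
```

If this is what is happening, avoiding locks held across the fork (or using the "spawn" start method) would sidestep it; again, that is speculation on my part.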
One other observation: while the airflow scheduler was in the stuck state, the `DagFileProcessor` processes it had started were still running. I could see them writing out logs to `dag_processor_manager.log`.