Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/05/12 10:20:02 UTC

[GitHub] [airflow] yuqian90 commented on issue #7935: scheduler gets stuck without a trace

yuqian90 commented on issue #7935:
URL: https://github.com/apache/airflow/issues/7935#issuecomment-839656436


   Hi @ashb, I would like to report that we've been seeing something similar to this issue in Airflow 2.0.2 recently.
   
   We are running Airflow 2.0.2 with a single airflow-scheduler plus a few airflow-worker instances, using the CeleryExecutor and a Postgres backend, with dozens of DAGs each containing hundreds to a few thousand tasks. The Python version is 3.8.7.
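   
   For context, the relevant parts of the deployment configuration look roughly like this (hosts, credentials and the broker URL are placeholders rather than our actual values):
   
   ```
   [core]
   executor = CeleryExecutor
   sql_alchemy_conn = postgresql+psycopg2://<user>:<password>@<postgres-host>/airflow
   
   [celery]
   broker_url = redis://<redis-host>:6379/0
   result_backend = db+postgresql://<user>:<password>@<postgres-host>/airflow
   ```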
   
   Here's what we saw:
   airflow-scheduler sometimes stops heartbeating and stops scheduling any tasks. This seems to happen at random times, about once or twice a week. When this happens, the last lines in the scheduler log show the following, i.e. it stopped writing out any log after `receiving signal 15`. I ran strace on the airflow scheduler process; it did not capture any other process sending it signal 15, so most likely the signal 15 was sent by the scheduler to itself.
   
   ```
   May 11 21:19:56 shaops43 airflow[12643]: [2021-05-11 21:19:56,908] {base_executor.py:82} INFO - Adding to queue: ['airflow', 'tasks', 'run', ...]
   May 11 21:19:56 shaops43 airflow[12643]: [2021-05-11 21:19:56,973] {scheduler_job.py:746} INFO - Exiting gracefully upon receiving signal 15
   ```
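   
   That `Exiting gracefully upon receiving signal 15` message appears to come from a signal handler in `scheduler_job.py` (line 746 in the log above). As a rough illustration only, not Airflow's actual implementation, the pattern looks something like the sketch below; the point to note is that handlers registered this way are inherited by fork()ed children:
   
   ```python
   import logging
   import signal
   import sys
   
   logging.basicConfig(level=logging.INFO)
   log = logging.getLogger("scheduler")
   
   def _exit_gracefully(signum, frame):
       # Log the signal number and exit, producing a line of the same shape
       # as the one at scheduler_job.py:746 in the log above.
       log.info("Exiting gracefully upon receiving signal %s", signum)
       sys.exit(0)
   
   # Handlers registered here are inherited by fork()ed children, so a SIGTERM
   # delivered to a child writes the same message to the shared log even if
   # the main process was never signalled.
   signal.signal(signal.SIGTERM, _exit_gracefully)
   signal.pause()  # wait until a signal arrives (e.g. `kill -15 <pid>`)
   ```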
   
   When the scheduler was in this state, there was also a child `airflow scheduler` process shown in `ps`, spawned by the main `airflow scheduler` process. I forgot to take a `py-spy dump`, but I did use `py-spy top` to look at the child `airflow scheduler` process; this is what I saw (output below). It seems to be stuck somewhere in `celery_executor.py::_send_tasks_to_celery`. This sounds similar to what @milton0825 reported previously, although he mentioned he was using Airflow 1.10.8.
   
   When I manually sent SIGTERM to the child airflow scheduler process, it died, and immediately the main `airflow scheduler` started to heartbeat and schedule tasks again as if nothing had ever happened. So I suspect the `airflow scheduler` got stuck somewhere while spawning a child process. But I still don't understand how it produced an `Exiting gracefully upon receiving signal 15` line in the log.
   
   ```
   Total Samples 7859
   GIL: 0.00%, Active: 0.00%, Threads: 1
   
     %Own   %Total  OwnTime  TotalTime  Function (filename:line)
     0.00%   0.00%   0.540s    0.540s   __enter__ (multiprocessing/synchronize.py:95)
     0.00%   0.00%   0.000s    0.540s   worker (multiprocessing/pool.py:114)
     0.00%   0.00%   0.000s    0.540s   _bootstrap (multiprocessing/process.py:315)
     0.00%   0.00%   0.000s    0.540s   _repopulate_pool (multiprocessing/pool.py:303)
     0.00%   0.00%   0.000s    0.540s   main (airflow/__main__.py:40)
     0.00%   0.00%   0.000s    0.540s   start (multiprocessing/process.py:121)
     0.00%   0.00%   0.000s    0.540s   _send_tasks_to_celery (airflow/executors/celery_executor.py:330)
     0.00%   0.00%   0.000s    0.540s   Pool (multiprocessing/context.py:119)
     0.00%   0.00%   0.000s    0.540s   run (airflow/jobs/base_job.py:237)
     0.00%   0.00%   0.000s    0.540s   _repopulate_pool_static (multiprocessing/pool.py:326)
     0.00%   0.00%   0.000s    0.540s   heartbeat (airflow/executors/base_executor.py:158)
     0.00%   0.00%   0.000s    0.540s   _launch (multiprocessing/popen_fork.py:75)
     0.00%   0.00%   0.000s    0.540s   wrapper (airflow/utils/cli.py:89)
     0.00%   0.00%   0.000s    0.540s   __init__ (multiprocessing/pool.py:212)
     0.00%   0.00%   0.000s    0.540s   _Popen (multiprocessing/context.py:277)
   ```
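   
   The stack above (a pool worker blocked in `__enter__` of `multiprocessing/synchronize.py` while `_send_tasks_to_celery` is creating a `multiprocessing.Pool`) looks like the classic fork-plus-held-lock failure mode. Purely as an illustration of that class of problem, and not the exact lock involved here, this standalone sketch reproduces the same kind of hang:
   
   ```python
   # Standalone sketch (not Airflow code): if any thread holds a lock at the
   # moment the Pool forks its workers, each child inherits a copy of that
   # lock in the locked state. The owning thread does not exist in the child,
   # so the child blocks forever, which is what a stuck pool worker looks
   # like under py-spy.
   import multiprocessing
   import threading
   import time
   
   lock = threading.Lock()
   
   def background():
       with lock:          # lock is held while the main thread forks the Pool
           time.sleep(30)
   
   def child_work(_):
       with lock:          # child's inherited copy stays locked forever -> hang
           return True
   
   if __name__ == "__main__":
       threading.Thread(target=background, daemon=True).start()
       time.sleep(0.1)     # make the fork happen while the lock is held
       with multiprocessing.Pool(processes=2) as pool:
           # never returns on fork-based platforms (e.g. Linux)
           print(pool.map(child_work, range(2)))
   ```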
   
   One other observation was that when the airflow scheduler was in the stuck state, the `DagFileProcessor` processes started by the airflow scheduler were still running; I could see them writing out logs to `dag_processor_manager.log`.
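   
   Until the root cause is understood, the workaround we use when this happens is simply to SIGTERM the stuck child, as described above. A small helper along these lines (the function name and the use of `pgrep` are just my own convenience, not anything Airflow ships):
   
   ```python
   # Hypothetical convenience helper mirroring the manual workaround: find the
   # child "airflow scheduler" processes of the main scheduler PID and send
   # them SIGTERM so the main scheduler resumes heartbeating.
   import os
   import signal
   import subprocess
   
   def kill_stuck_scheduler_children(scheduler_pid: int) -> None:
       result = subprocess.run(
           ["pgrep", "-P", str(scheduler_pid), "-f", "airflow scheduler"],
           capture_output=True, text=True, check=False,
       )
       for pid in result.stdout.split():
           os.kill(int(pid), signal.SIGTERM)
   ```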
   
    

