Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/02/18 18:04:41 UTC

[GitHub] [airflow] MatthewRBruce edited a comment on issue #7935: scheduler gets stuck without a trace

MatthewRBruce edited a comment on issue #7935:
URL: https://github.com/apache/airflow/issues/7935#issuecomment-781529908


   We just saw this on 2.0.1 when we added a largish number of new DAGs (we're adding around 6000 DAGs total, but the scheduler seems to lock up when about 200 try to be scheduled at once).
   
   Here are py-spy stack traces from our scheduler:
   ```
   Process 6: /usr/local/bin/python /usr/local/bin/airflow scheduler
   Python v3.8.7 (/usr/local/bin/python3.8)
   Thread 0x7FF5C09C8740 (active): "MainThread"
       _send (multiprocessing/connection.py:368)
       _send_bytes (multiprocessing/connection.py:411)
       send (multiprocessing/connection.py:206)
       send_callback_to_execute (airflow/utils/dag_processing.py:283)
       _send_dag_callbacks_to_processor (airflow/jobs/scheduler_job.py:1795)
       _schedule_dag_run (airflow/jobs/scheduler_job.py:1762)
       _do_scheduling (airflow/jobs/scheduler_job.py:1521)
       _run_scheduler_loop (airflow/jobs/scheduler_job.py:1382)
       _execute (airflow/jobs/scheduler_job.py:1280)
       run (airflow/jobs/base_job.py:237)
       scheduler (airflow/cli/commands/scheduler_command.py:63)
       wrapper (airflow/utils/cli.py:89)
       command (airflow/cli/cli_parser.py:48)
       main (airflow/__main__.py:40)
       <module> (airflow:8)
    
   Process 77: airflow scheduler -- DagFileProcessorManager
   Python v3.8.7 (/usr/local/bin/python3.8)
   Thread 0x7FF5C09C8740 (active): "MainThread"
       _send (multiprocessing/connection.py:368)
       _send_bytes (multiprocessing/connection.py:405)
       send (multiprocessing/connection.py:206)
       _run_parsing_loop (airflow/utils/dag_processing.py:698)
       start (airflow/utils/dag_processing.py:596)
       _run_processor_manager (airflow/utils/dag_processing.py:365)
       run (multiprocessing/process.py:108)
       _bootstrap (multiprocessing/process.py:315)
       _launch (multiprocessing/popen_fork.py:75)
       __init__ (multiprocessing/popen_fork.py:19)
       _Popen (multiprocessing/context.py:277)
       start (multiprocessing/process.py:121)
       start (airflow/utils/dag_processing.py:248)
       _execute (airflow/jobs/scheduler_job.py:1276)
       run (airflow/jobs/base_job.py:237)
       scheduler (airflow/cli/commands/scheduler_command.py:63)
       wrapper (airflow/utils/cli.py:89)
       command (airflow/cli/cli_parser.py:48)
       main (airflow/__main__.py:40)
       <module> (airflow:8)
   ```
   
   What I think is happening is that the pipe between the `DagFileProcessorAgent` and the `DagFileProcessorManager` is full, and that is causing the Scheduler to deadlock.
   
   From what I can see, the `DagFileProcessorAgent` only pulls data off the pipe in its `heartbeat` and `wait_until_finished` functions
   (https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/utils/dag_processing.py#L374),
   and the SchedulerJob is responsible for calling its `heartbeat` function each scheduler loop (https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/jobs/scheduler_job.py#L1388).
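   
   Paraphrasing those two links, each scheduler loop iteration looks roughly like this (a simplified sketch based on the stack traces above, not the actual Airflow code; the `processor_agent` attribute name and the exact call order are assumptions):
   ```
   # Simplified sketch of the loop described above (not the real implementation).
   while True:
       self._do_scheduling(session)        # -> _schedule_dag_run
                                           # -> _send_dag_callbacks_to_processor
                                           # -> conn.send(...)  (blocks if the pipe is full)
       self.processor_agent.heartbeat()    # the only place in the loop where the
                                           # agent reads (drains) data off the pipe
   ```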
   
   However, the SchedulerJob never gets to that `heartbeat` call: it is itself blocked forever trying to send data to the same full pipe via `_send_dag_callbacks_to_processor` in `_do_scheduling`, hence the deadlock.
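   
   For anyone who wants to see the failure mode in isolation, here's a minimal, self-contained sketch using plain `multiprocessing` (no Airflow code): two processes share a duplex pipe and both keep sending without ever reading, so once the OS pipe buffer fills they both block forever inside `Connection.send`, which is exactly what the two `_send` frames in the py-spy output show.
   ```
   import multiprocessing
   
   PAYLOAD = b"x" * 65536  # larger than a typical OS pipe buffer (~64 KiB)
   
   def child(conn):
       # Stands in for the DagFileProcessorManager: keeps sending results
       # without ever draining what the parent is sending to it.
       while True:
           conn.send(PAYLOAD)  # blocks once the buffer fills
   
   if __name__ == "__main__":
       parent_conn, child_conn = multiprocessing.Pipe()  # duplex by default
       multiprocessing.Process(target=child, args=(child_conn,)).start()
       while True:
           # Stands in for the Scheduler: keeps sending callback requests and
           # never reaches the heartbeat() call that would read from the pipe.
           parent_conn.send(PAYLOAD)  # eventually blocks -> both ends stuck in _send
   ```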
   

