You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/08/14 09:31:52 UTC

[GitHub] [airflow] tzanko-matev commented on issue #14205: Scheduler "deadlocks" itself when max_active_runs_per_dag is reached by up_for_retry tasks

tzanko-matev commented on issue #14205:
URL: https://github.com/apache/airflow/issues/14205#issuecomment-898871143


   Here is what I believe is causing the bug: 
   https://github.com/apache/airflow/blob/88199eefccb4c805f8d6527bab5bf600b397c35e/airflow/jobs/scheduler_job.py#L1765
   
   ```py
           if dag.max_active_runs:
               if (
                   len(currently_active_runs) >= dag.max_active_runs
                   and dag_run.execution_date not in currently_active_runs
               ):
                   self.log.info(
                       "DAG %s already has %d active runs, not queuing any tasks for run %s",
                       dag.dag_id,
                       len(currently_active_runs),
                       dag_run.execution_date,
                   )
                   return 0
   ```
   Notice that the dagrun state is not updated before `return 0`. This means that on the next iteration of the scheduler the same dagrun will be tried for processing. If there are enough dagruns which satisfy this condition then the scheduler will never try to process other dagruns and will deadlock.
   
   The code which makes the scheduler go forward is found in `dag_run.update_state`:
   https://github.com/apache/airflow/blob/88199eefccb4c805f8d6527bab5bf600b397c35e/airflow/models/dagrun.py#L384
   
   ```py
       @provide_session
       def update_state(
           self, session: Session = None, execute_callbacks: bool = True
       ) -> Tuple[List[TI], Optional[callback_requests.DagCallbackRequest]]:
           
           #..........
           start_dttm = timezone.utcnow()
           self.last_scheduling_decision = start_dttm
           # ........
           session.merge(self)
   
           return schedulable_tis, callback
   ```
   Updating `last_scheduling_decision` is what makes the scheduler chose a different set of dagruns on the next iteration. To fix the bug this update needs to be made before the `return 0` line in the first code snippet.
   
   In the current main branch the scheduling logic is different, so the bug might have disappeared. The bug exists in all versions up to 2.1.2.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org