Posted to commits@airflow.apache.org by "Sergei Iakhnin (JIRA)" <ji...@apache.org> on 2016/06/01 11:33:59 UTC

[jira] [Created] (AIRFLOW-203) Scheduler fails to reliably schedule tasks when many dag runs are triggered

Sergei Iakhnin created AIRFLOW-203:
--------------------------------------

             Summary: Scheduler fails to reliably schedule tasks when many dag runs are triggered
                 Key: AIRFLOW-203
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-203
             Project: Apache Airflow
          Issue Type: Bug
          Components: scheduler
    Affects Versions: Airflow 1.7.1
            Reporter: Sergei Iakhnin


Using Airflow with Celery, RabbitMQ, and a Postgres backend. Running 1 master node and 115 worker nodes, each with 8 cores. The workflow consists of a series of 27 tasks, some of which are nearly instantaneous and some of which take hours to complete. Dag runs are manually triggered, about 3000 at a time, resulting in roughly 75,000 tasks.
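For a sense of scale, the figures above can be turned into a quick back-of-the-envelope calculation. This is only a sketch using the rounded numbers from this report, not a measurement. With exactly 3000 dag runs the total comes to 81,000 task instances, so the "roughly 75,000" quoted above corresponds to a batch of somewhat fewer than 3000 runs.

```python
# Rounded figures taken from the report above; variable names are illustrative.
worker_nodes = 115
cores_per_node = 8
tasks_per_dag_run = 27
dag_runs_per_batch = 3000  # "about 3000" in the report

total_cores = worker_nodes * cores_per_node
total_tasks = dag_runs_per_batch * tasks_per_dag_run

print(f"worker cores available: {total_cores}")
print(f"task instances per batch: {total_tasks}")
```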

My observation is that the scheduling behaviour is extremely inconsistent: about 1000 tasks get scheduled and executed, and then no new tasks get scheduled after that. Sometimes restarting the scheduler is enough for new tasks to get scheduled; sometimes the scheduler and worker services need to be restarted multiple times to make any progress. When I look at the scheduler output, it seems to be chugging away at trying to schedule tasks, with messages like:

"[2016-06-01 11:28:25,908] {base_executor.py:34} INFO - Adding to queue: airflow run ..."

However, these tasks do not show up in queued status on the UI and don't actually get scheduled out to the workers.
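One way to quantify the mismatch would be to count the "Adding to queue" messages in the scheduler output per minute and compare that with the number of tasks shown as queued in the UI. The following is a minimal sketch: the regular expression is inferred from the single log line quoted above, and the function name `count_queue_messages` and the sample list are hypothetical, introduced only for illustration.

```python
import re
from collections import Counter

# Log layout assumed from the one sample line in this report:
#   [YYYY-MM-DD HH:MM:SS,mmm] {file:line} LEVEL - message
LOG_LINE = re.compile(
    r"(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3}\] "
    r"\{(?P<src>[^}]+)\} (?P<level>\w+) - (?P<msg>.*)"
)


def count_queue_messages(lines):
    """Count 'Adding to queue' messages per minute of scheduler output."""
    per_minute = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("msg").startswith("Adding to queue:"):
            per_minute[match.group("minute")] += 1
    return per_minute


# The sample line from this report; the elided command is left as-is.
sample = [
    "[2016-06-01 11:28:25,908] {base_executor.py:34} INFO - Adding to queue: airflow run ...",
]
print(count_queue_messages(sample))
```

In practice the lines would come from the scheduler's log file rather than a hard-coded list. A steady stream of such messages with no matching growth in queued tasks would point at the hand-off between the scheduler and the executor.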

It is unclear what may be causing this behaviour as no errors are produced anywhere. The impact is especially high when short-running tasks are concerned because the cluster should be able to blow through them within a couple of minutes, but instead it takes hours of manual restarts to get through them.

I'm happy to share logs or any other useful debug output as desired.

Thanks in advance.

Sergei.



