Posted to commits@airflow.apache.org by "Nidhi (Jira)" <ji...@apache.org> on 2019/12/11 21:27:00 UTC

[jira] [Commented] (AIRFLOW-203) Scheduler fails to reliably schedule tasks when many dag runs are triggered

    [ https://issues.apache.org/jira/browse/AIRFLOW-203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993910#comment-16993910 ] 

Nidhi commented on AIRFLOW-203:
-------------------------------

I am facing the same issue: I have around 60,000 tasks inside one DAG. When I trigger the DAG, its tasks are not scheduled and the DAG stays in the Running state. Please let me know if you know how to solve this. I am using the Celery Executor and have tried changing "dagbag_import_timeout" and "max_threads", but neither has made a difference in my case.

Any help in solving this issue would be appreciated.

 

> Scheduler fails to reliably schedule tasks when many dag runs are triggered
> ---------------------------------------------------------------------------
>
>                 Key: AIRFLOW-203
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-203
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 1.7.1.2
>            Reporter: Sergei Iakhnin
>            Priority: Major
>         Attachments: airflow.cfg, airflow_scheduler_non_working.log, airflow_scheduler_working.log
>
>
> Using Airflow with Celery, RabbitMQ, and a Postgres backend. Running 1 master node and 115 worker nodes, each with 8 cores. The workflow consists of a series of 27 tasks, some of which are nearly instantaneous and some of which take hours to complete. Dag runs are manually triggered, about 3000 at a time, resulting in roughly 75,000 tasks.
> My observations are that the scheduling behaviour is extremely inconsistent, i.e. about 1000 tasks get scheduled and executed and then no new tasks get scheduled after that. Sometimes it is enough to restart the scheduler for new tasks to get scheduled; sometimes the scheduler and worker services need to be restarted multiple times to make any progress. When I look at the scheduler output it seems to be chugging away at trying to schedule tasks with messages like:
> "2016-06-01 11:28:25,908] {base_executor.py:34} INFO - Adding to queue: airflow run ..."
> However, these tasks do not show up in queued status in the UI and don't actually get scheduled out to the workers (nor do they make it into the RabbitMQ queue or the task_instance table).
> It is unclear what may be causing this behaviour as no errors are produced anywhere. The impact is especially high when short-running tasks are concerned because the cluster should be able to blow through them within a couple of minutes, but instead it takes hours of manual restarts to get through them.
> I'm happy to share logs or any other useful debug output as desired.
> Thanks in advance.
> Sergei.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)