Posted to commits@airflow.apache.org by "Aditya Vishwakarma (Jira)" <ji...@apache.org> on 2019/10/15 15:42:00 UTC

[jira] [Created] (AIRFLOW-5660) Scheduler becomes unresponsive when processing large DAGs on kubernetes.

Aditya Vishwakarma created AIRFLOW-5660:
-------------------------------------------

             Summary: Scheduler becomes unresponsive when processing large DAGs on kubernetes.
                 Key: AIRFLOW-5660
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
             Project: Apache Airflow
          Issue Type: Bug
          Components: executor-kubernetes
    Affects Versions: 1.10.5
            Reporter: Aditya Vishwakarma
            Assignee: Daniel Imberman


For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop can take more than 5-10 minutes.

It seems that the `_labels_to_key` function in the kubernetes executor loads all tasks with a given execution date into memory, and it does this for every task in progress. So, if 100 tasks of a DAG with 10,000 tasks are in progress, it will load a million task instances from the DB on every tick of the scheduler.

[https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]
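To illustrate the cost, the fallback path is roughly equivalent to the sketch below (simplified, not the exact executor code; `make_safe_label_value` stands in for the executor's label-sanitizing helper and is passed in only for illustration):

{code:python}
# Simplified sketch of the current fallback (illustrative only).
# When the sanitized ids in the pod labels do not match a TaskInstance
# directly, every TaskInstance with the same execution_date is loaded
# and compared one by one -- for every pod event the executor handles.
from airflow.models import TaskInstance
from airflow.utils.db import create_session


def resolve_key_by_full_scan(safe_dag_id, safe_task_id, ex_time, make_safe_label_value):
    with create_session() as session:
        tasks = (
            session.query(TaskInstance)
            .filter(TaskInstance.execution_date == ex_time)
            .all()  # O(number of tasks for this execution_date) rows per call
        )
        for task in tasks:
            if (make_safe_label_value(task.dag_id) == safe_dag_id
                    and make_safe_label_value(task.task_id) == safe_task_id):
                return task.dag_id, task.task_id, ex_time
    return None
{code}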

A quick fix is to look the task up directly in the DB before falling back to a full scan. I can submit a PR for it.
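The quick fix could look roughly like this (a sketch only, against the 1.10.x session helpers; `make_safe_label_value` is again just a stand-in for the executor's sanitizer):

{code:python}
# Sketch of the proposed quick fix (illustrative): try a point lookup on
# (dag_id, task_id, execution_date) first; only fall back to the full scan
# when the sanitized ids differ from the real ones (long or non-label-safe ids).
from airflow.models import TaskInstance
from airflow.utils.db import create_session


def labels_to_key(safe_dag_id, safe_task_id, ex_time, make_safe_label_value):
    with create_session() as session:
        # Fast path: most dag/task ids are already label-safe, so the
        # sanitized value equals the real value and this single-row
        # lookup on the TaskInstance primary key succeeds.
        task = (
            session.query(TaskInstance)
            .filter_by(dag_id=safe_dag_id, task_id=safe_task_id, execution_date=ex_time)
            .one_or_none()
        )
        if task:
            return task.dag_id, task.task_id, ex_time

        # Slow path (unchanged): scan all tasks for the execution_date and
        # compare sanitized ids.
        for task in session.query(TaskInstance).filter_by(execution_date=ex_time):
            if (make_safe_label_value(task.dag_id) == safe_dag_id
                    and make_safe_label_value(task.task_id) == safe_task_id):
                return task.dag_id, task.task_id, ex_time
    return None
{code}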

A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id, dag_id, task_id, execution_date) somewhere, probably in the metadata database.
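One way to persist that mapping would be a small table in the metadata database, along these lines (a sketch only; the model and table names are hypothetical, not an existing Airflow model):

{code:python}
# Hypothetical mapping table for the "proper fix" (names made up for
# illustration). Rows would be written when a pod is launched, so the
# executor can resolve sanitized label values with a single indexed
# lookup instead of scanning every TaskInstance for the execution_date.
from sqlalchemy import Column, DateTime, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class KubeLabelMapping(Base):
    __tablename__ = 'kube_label_mapping'  # hypothetical table name

    safe_dag_id = Column(String(63), primary_key=True)   # k8s label values are capped at 63 chars
    safe_task_id = Column(String(63), primary_key=True)
    execution_date = Column(DateTime, primary_key=True)
    dag_id = Column(String(250), nullable=False)
    task_id = Column(String(250), nullable=False)
{code}

Lookups in `_labels_to_key` would then be a single primary-key query, and rows could be pruned alongside the corresponding task instances.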

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)