Posted to commits@airflow.apache.org by "Daniel Imberman (Jira)" <ji...@apache.org> on 2019/10/15 16:22:00 UTC

[jira] [Commented] (AIRFLOW-5660) Scheduler becomes unresponsive when processing large DAGs on kubernetes.

    [ https://issues.apache.org/jira/browse/AIRFLOW-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952090#comment-16952090 ] 

Daniel Imberman commented on AIRFLOW-5660:
------------------------------------------

Hi [~adivish], thank you for catching this. Yeah, this just looks like an implementation bug. If I'm reading this correctly, the issue is that there's no guarantee that the task id label will be the same as the task id in the DB. I think we can for the most part solve this with the following steps.

1. Do exactly what you said and first do a task_id/dag_id lookup in the DB to significantly reduce search time.
2. In the `_make_safe_label_value` function, add a warning if a task_id or dag_id will require hashing (which will slow down processing).
3. If there is no database match, fall back to a full scan for that execution date.
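To make the steps concrete, here's a rough sketch of what I have in mind. This is not the actual executor code; `make_safe_label_value` below is a hypothetical stand-in for Airflow's `_make_safe_label_value` (truncate-and-hash when a value exceeds the Kubernetes 63-character label limit), and the task rows are plain tuples standing in for the DB query:

```python
import hashlib

MAX_LABEL_LEN = 63  # Kubernetes label values are capped at 63 characters


def make_safe_label_value(value):
    # Hypothetical stand-in for _make_safe_label_value: values that fit
    # pass through unchanged; longer ones are truncated with a hash suffix.
    if len(value) <= MAX_LABEL_LEN:
        return value
    digest = hashlib.md5(value.encode()).hexdigest()[:9]
    return value[: MAX_LABEL_LEN - 10] + "-" + digest


def labels_to_key(safe_dag_id, safe_task_id, execution_date, task_rows):
    # Step 1: direct lookup. Most ids are short enough that the safe label
    # equals the real id, so an indexed dag_id/task_id query is enough.
    for dag_id, task_id, ex_date in task_rows:
        if (dag_id, task_id, ex_date) == (safe_dag_id, safe_task_id, execution_date):
            return dag_id, task_id, ex_date
    # Step 3: no direct match, so fall back to a full scan of tasks for
    # that execution date, re-deriving the safe label for each candidate.
    for dag_id, task_id, ex_date in task_rows:
        if ex_date != execution_date:
            continue
        if (make_safe_label_value(dag_id) == safe_dag_id
                and make_safe_label_value(task_id) == safe_task_id):
            return dag_id, task_id, ex_date
    return None
```

The point is that the expensive scan only runs for the (rare) hashed labels, which is also why step 2's warning is worth having.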

[~ash] does that sound good?

[~adivish] If you can make a PR I'll gladly review :)

> Scheduler becomes unresponsive when processing large DAGs on kubernetes.
> ------------------------------------------------------------------------
>
>                 Key: AIRFLOW-5660
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor-kubernetes
>    Affects Versions: 1.10.5
>            Reporter: Aditya Vishwakarma
>            Assignee: Daniel Imberman
>            Priority: Major
>
> For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop can take more than 5-10 minutes. 
> It seems that the `_labels_to_key` function in kubernetes_executor loads all tasks with a given execution date into memory, and it does this for every task in progress. So, if 100 tasks of a DAG with 10,000 tasks are in progress, it will load a million tasks from the DB on every tick of the scheduler.
> [https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]
> A quick fix is to search for the task in the DB directly before falling back to a full scan. I can submit a PR for it.
> A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id, dag_id, task_id, execution_date) somewhere, probably in the metadatabase.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)