Posted to commits@airflow.apache.org by "Fokko Driesprong (JIRA)" <ji...@apache.org> on 2018/05/13 18:55:00 UTC

[jira] [Resolved] (AIRFLOW-2430) Bad query patterns at scale prevent scheduler from starting

     [ https://issues.apache.org/jira/browse/AIRFLOW-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fokko Driesprong resolved AIRFLOW-2430.
---------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0
                   1.10.0

Issue resolved by pull request #3324
[https://github.com/apache/incubator-airflow/pull/3324]

> Bad query patterns at scale prevent scheduler from starting
> -----------------------------------------------------------
>
>                 Key: AIRFLOW-2430
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2430
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>            Reporter: Gabriel Silk
>            Priority: Major
>             Fix For: 1.10.0, 2.0.0
>
>
> h2. Summary
> Certain queries executed by the scheduler do not scale well with the number of tasks being operated on. Two example functions:
>  * reset_state_for_orphaned_tasks
>  * _execute_task_instances
>  
> Concretely, with a mere 75k tasks being operated on, the first query can take dozens of minutes to run, blocking the scheduler from making progress.
>  
> The cause is twofold:
> 1. As the query grows past a certain point, the MySQL planner will choose to do a full table scan as opposed to using an index. I assume the same is true of Postgres.
> 2. The query predicate size grows linearly in the number of tasks being operated on, thus increasing the amount of work that needs to be done per row.
>  
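> In a sense, you’re left with an operation that scales as O(n^2): a full scan over a table that grows with n, with each row checked against a predicate whose size is proportional to n.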
>  
> h2. Proposed Fix
> It appears that one of these bad query patterns was fixed in [3547cbffd|https://github.com/apache/incubator-airflow/commit/3547cbffdbffac2f98a8aa05526e8c9671221025] by introducing a configurable batch size, which can be set via max_tis_per_query.
>  
> I propose we extend the suggested fix to include other poorly-performing queries in the scheduler.
>  
> I’ve identified two queries that are directly affecting my work and included them in the diff, though the same approach can be extended to more queries as we see fit.
>  
> Thanks!
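
The batching idea in the proposed fix can be sketched roughly as follows. This is an illustration only, not the code from pull request #3324: the names chunked, run_in_batches and run_query are hypothetical, and treating a non-positive batch size as "no batching" is an assumption made for the sketch. The point is that each query sees at most batch_size keys, so the predicate stays small enough for the planner to keep using an index.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    # Assumption for this sketch: a non-positive size means "one batch for everything".
    size = batch_size if batch_size > 0 else max(len(items), 1)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(
    keys: Sequence[T],
    batch_size: int,
    run_query: Callable[[Sequence[T]], List[R]],
) -> List[R]:
    """Call `run_query` once per batch of keys and concatenate the results.

    `run_query` is a hypothetical stand-in for whatever issues the real
    database query with a predicate built from the given keys.
    """
    results: List[R] = []
    for batch in chunked(keys, batch_size):
        results.extend(run_query(batch))
    return results


if __name__ == "__main__":
    seen_batch_sizes: List[int] = []

    def fake_query(batch: Sequence[int]) -> List[int]:
        # Records the predicate size instead of touching a database.
        seen_batch_sizes.append(len(batch))
        return list(batch)

    run_in_batches(list(range(10)), 4, fake_query)
    print(seen_batch_sizes)
```

With 10 keys and a batch size of 4, the stand-in query is called three times, with 4, 4 and 2 keys. The trade-off is more round trips to the database in exchange for a bounded predicate per query.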



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)