Posted to dev@airflow.apache.org by Ping Zhang <pi...@umich.edu> on 2021/12/17 05:40:47 UTC

[DISCUSS] Move expensive dag run creation back to the DagFileProcessorManager loop

Hi Airflow community,

While reading the latest Airflow main branch, I noticed that dag run
creation, including the ti creation (in verify_integrity), was moved from
the `DagFileProcessorManager` loop to the scheduling loop (in
_do_scheduling). I would like to learn more about the context behind this.

In our production (Airbnb), we have a metric showing that
`verify_integrity` is very expensive for new dag runs: it can take ~47
seconds for a single dag run of one of our large dags (~20K tasks; we have
a few dozen dags of this size) on an AWS db.r5.16xlarge. Even though we
have optimized it down to ~17 seconds (we will open source this soon), it
is still very expensive.

This will greatly hurt the scheduling performance and lower the overall
throughput for large clusters. Creating dag runs for
all dags_needing_dagruns in the scheduling loop can exacerbate the
scheduling delay even if NUM_DAGS_PER_DAGRUN_QUERY is configurable.
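
As a rough illustration of the concern above, here is a back-of-the-envelope
sketch. It is not Airflow code. It assumes dag runs are created serially
inside one scheduling loop and that every dag in the batch is a large one;
the function name and the batch sizes are hypothetical, and only the ~47 s
and ~17 s per-run costs come from the numbers quoted in this message.

```python
def blocked_seconds(creation_cost_s: float, large_dags_in_batch: int) -> float:
    """Worst-case time one scheduling loop spends creating dag runs.

    Assumption: creation is serial and every dag in the batch costs the same.
    """
    return creation_cost_s * large_dags_in_batch


# Per-run costs quoted above (before and after optimization), combined with
# hypothetical batch sizes for how many large dags need a run in one loop.
for cost in (47.0, 17.0):
    for batch in (1, 5, 10):
        total = blocked_seconds(cost, batch)
        print(f"{cost:.0f}s per run x {batch} dags -> {total:.0f}s blocked")
```

Under these assumptions the cost per loop grows linearly with the batch size,
so a smaller batch shortens each loop but does not remove the per-run cost.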

I would like to chat more about this.

Best wishes

Ping Zhang

Re: [DISCUSS] Move expensive dag run creation back to the DagFileProcessorManager loop

Posted by Jarek Potiuk <ja...@potiuk.com>.
Yep. I second Ash. There were enormous changes under the hood in
Airflow 2, especially around performance. A lot of the
assumptions and problems from 1.10 no longer hold on Airflow 2,
so you might want to run your DAGs through Airflow 2 to find out
how they behave now.

On Fri, Dec 17, 2021 at 11:13 AM Ash Berlin-Taylor <as...@apache.org> wrote:
>
> We have massively re-worked (and benchmarked) verify_integrity as part of the HA work (including using a dummy sample of your large DAG structure provided by Kevin) since the 1.10.4 version, and it is no longer the bottleneck it once was. From memory this was mostly fixed around 1.10.12 by improving the queries issued.
>
> We have done performance benchmarks of 1000 concurrent dags with 1000 tasks each and verify_integrity barely showed up on the profile.
>
> -ash

Re: [DISCUSS] Move expensive dag run creation back to the DagFileProcessorManager loop

Posted by Ash Berlin-Taylor <as...@apache.org>.
We have massively re-worked (and benchmarked) verify_integrity as part 
of the HA work (including using a dummy sample of your large DAG 
structure provided by Kevin) since the 1.10.4 version, and it is no 
longer the bottleneck it once was. From memory this was mostly fixed 
around 1.10.12 by improving the queries issued.

We have done performance benchmarks of 1000 concurrent dags with 1000 
tasks each and verify_integrity barely showed up on the profile.

-ash
