You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/10/26 09:23:24 UTC

[GitHub] [airflow] yuqian90 opened a new pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

yuqian90 opened a new pull request #11010:
URL: https://github.com/apache/airflow/pull/11010


   This is cherry-picked from  #4751 for `v1-10-test`.
   
   Some investigation into the issue reported in #10790 led to the discovery that this loop in `scheduler_job.py` takes almost 90% of the time in `SchedulerJob.process_file()` for large DAGs (around 500 tasks). This causes the `DagFileProcessor` spawned by the scheduler to go slowly. The reason this loop is slow is that it creates a new `DepContext` for every `ti`. And every `DepContext` needs to populate its own `finished_tasks` even though this list is the same for every `DagRun`.
   
   ```python
               for ti in tis:
                   task = dag.get_task(ti.task_id)
   
                   # fixme: ti.task is transient but needs to be set
                   ti.task = task
   
                   if ti.are_dependencies_met(
                           dep_context=DepContext(flag_upstream_failed=True),
                           session=session):
                       self.log.debug('Queuing task: %s', ti)
                       task_instances_list.append(ti.key)
   ```
   
   This is the flamegraph generated by `py-spy` showing the performance of `DagFileProcessor` in Airflow 1.10.12 before this PR:
   https://raw.githubusercontent.com/yuqian90/airflow/gif_for_demo/airflow/www/static/flamegraph_before.svg
   
   This is the performance after 1.10.12 is patched with this PR:
   https://raw.githubusercontent.com/yuqian90/airflow/gif_for_demo/airflow/www/static/flamegraph_after.svg
   
   The nice thing is that #4751 already addressed this issue for master branch. We just need to cherry-pick it to fix this in 1.10.* with some very minor conflict fixes.
   
   While this PR will not fix every scenario that causes #10790, it does reduce the `DagFileProcessor` time from around 100s to just about 12s for our use case (a DAG with about 500 tasks, many of them are sensors in `reschedule` mode with `poke_interval` 60s.). 
   
   
   Original commit message in #4751:
   
   This decreases scheduler delay between tasks by about 20% for larger DAGs,
   sometimes more for larger or more complex DAGs.
   
   The delay between tasks can be a major issue, especially when we have dags with
   many subdags, figures out that the scheduling process spends plenty of time in
   dependency checking, we took the trigger rule dependency which calls the db for
   each task instance, we made it call the db just once for each dag_run
   
   (cherry picked from commit 50efda5c69c1ddfaa869b408540182fb19f1a286)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil edited a comment on pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

Posted by GitBox <gi...@apache.org>.
kaxil edited a comment on pull request #11010:
URL: https://github.com/apache/airflow/pull/11010#issuecomment-716731919


   Both commits have been cherry-picked to v1-10-test:
   
   https://github.com/apache/airflow/commit/cb750c11b1899a88cfdf25baa18cf9896c7cef8f
   https://github.com/apache/airflow/commit/edae056edd6b6fae94533ca8a4cc220ebfbadc68
   
   Thanks @yuqian90  -- Wanted to keep them as separate commits, hence not squashing and merging this PR, instead cherry-picked


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek edited a comment on pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

Posted by GitBox <gi...@apache.org>.
turbaszek edited a comment on pull request #11010:
URL: https://github.com/apache/airflow/pull/11010#issuecomment-696832830


   I'm not sure if we want to cherry-pick this fix to 1.10 as 2.0 is closer. On the other hand... the job of cherrypicking was done and we just need to merge. What others think @kaxil @ashb @potiuk @mik-laj ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek commented on pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

Posted by GitBox <gi...@apache.org>.
turbaszek commented on pull request #11010:
URL: https://github.com/apache/airflow/pull/11010#issuecomment-704387743


   As mentioned in https://github.com/apache/airflow/pull/11119#issuecomment-702372954 we are no more accepting new features to 1-10 branch :< I'm sorry. The 2.0 alpha should be released soon 🎉 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] tooptoop4 commented on pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

Posted by GitBox <gi...@apache.org>.
tooptoop4 commented on pull request #11010:
URL: https://github.com/apache/airflow/pull/11010#issuecomment-697006856


   pls merge


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil commented on pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

Posted by GitBox <gi...@apache.org>.
kaxil commented on pull request #11010:
URL: https://github.com/apache/airflow/pull/11010#issuecomment-716731919


   Both commits have been cherry-picked to v1-10-tests:
   
   https://github.com/apache/airflow/commit/cb750c11b1899a88cfdf25baa18cf9896c7cef8f
   https://github.com/apache/airflow/commit/edae056edd6b6fae94533ca8a4cc220ebfbadc68
   
   Thanks @yuqian90  -- Wanted to keep them as separate commits, hence not squashing and merging this PR, instead cherry-picked


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek commented on pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

Posted by GitBox <gi...@apache.org>.
turbaszek commented on pull request #11010:
URL: https://github.com/apache/airflow/pull/11010#issuecomment-696832830


   I'm not sure if we want to cherry pick this fix to 1.10, what others think @kaxil @ashb @potiuk @mik-laj 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] yuqian90 commented on pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

Posted by GitBox <gi...@apache.org>.
yuqian90 commented on pull request #11010:
URL: https://github.com/apache/airflow/pull/11010#issuecomment-694792832


   @amichai07 @kaxil @ashb  anyone interested to take a look? I'm porting #4751 to 1.10.* to fix a bad `DagFileProcessor` slowness.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek edited a comment on pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

Posted by GitBox <gi...@apache.org>.
turbaszek edited a comment on pull request #11010:
URL: https://github.com/apache/airflow/pull/11010#issuecomment-696832830


   I'm not sure if we want to cherry-pick this fix to 1.10 as 2.0 is closer. On the other hand... the job of cherrypicking was done and we just need to merge. What others think @kaxil @ashb @potiuk @mik-laj ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] yuqian90 commented on pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

Posted by GitBox <gi...@apache.org>.
yuqian90 commented on pull request #11010:
URL: https://github.com/apache/airflow/pull/11010#issuecomment-716540213


   > @yuqian90 Can you rebase it once more, please ? Appreciate it
   
   Done. Thank you!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] yuqian90 commented on pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

Posted by GitBox <gi...@apache.org>.
yuqian90 commented on pull request #11010:
URL: https://github.com/apache/airflow/pull/11010#issuecomment-704918738


   > As mentioned in [#11119 (comment)](https://github.com/apache/airflow/pull/11119#issuecomment-702372954) we are no more accepting new features to 1-10 branch :< I'm sorry. The 2.0 alpha should be released soon 🎉
   
   Hi, @turbaszek  and @dimberman  I agree that  #11119 can be closed since it's a brand new feature. However this one #11010 is not a new feature. It's more of a fix to 1.10.*. When DAGs are just a little bit larger than usual (500 tasks) and when user tries to run dozens of DagRuns in at the same time (around 30), airflow-scheduler becomes super slow. This PR is the fix for that backported for 1.10.*.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] yuqian90 commented on pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

Posted by GitBox <gi...@apache.org>.
yuqian90 commented on pull request #11010:
URL: https://github.com/apache/airflow/pull/11010#issuecomment-697067550


   > I'm not sure if we want to cherry-pick this fix to 1.10 as 2.0 is closer. On the other hand... the job of cherrypicking was done and we just need to merge. 
   
   Thanks @turbaszek . I cherry-picked this because the scheduler in 1.10.* is having trouble for large DAGs (not that large, just hundreds of tasks in one DAG). It queries the db too many times and was struggling to finish. (See [flamegraph.before](https://raw.githubusercontent.com/yuqian90/airflow/gif_for_demo/airflow/www/static/flamegraph_before.svg) and [flamegraph_after](https://raw.githubusercontent.com/yuqian90/airflow/gif_for_demo/airflow/www/static/flamegraph_after.svg) in the PR description.) Which caused us to hit  #10790 too. So this cherry-pick is more of a fix rather than an improvement in some sense.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek closed pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

Posted by GitBox <gi...@apache.org>.
turbaszek closed pull request #11010:
URL: https://github.com/apache/airflow/pull/11010


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] yuqian90 commented on pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

Posted by GitBox <gi...@apache.org>.
yuqian90 commented on pull request #11010:
URL: https://github.com/apache/airflow/pull/11010#issuecomment-697067550


   > I'm not sure if we want to cherry-pick this fix to 1.10 as 2.0 is closer. On the other hand... the job of cherrypicking was done and we just need to merge. 
   
   Thanks @turbaszek . I cherry-picked this because the scheduler in 1.10.* is having trouble for large DAGs (not that large, just hundreds of tasks in one DAG). It queries the db too many times and was struggling to finish. (See [flamegraph.before](https://raw.githubusercontent.com/yuqian90/airflow/gif_for_demo/airflow/www/static/flamegraph_before.svg) and [flamegraph_after](https://raw.githubusercontent.com/yuqian90/airflow/gif_for_demo/airflow/www/static/flamegraph_after.svg) in the PR description.) Which caused us to hit  #10790 too. So this cherry-pick is more of a fix rather than an improvement in some sense.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] pingzh commented on pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

Posted by GitBox <gi...@apache.org>.
pingzh commented on pull request #11010:
URL: https://github.com/apache/airflow/pull/11010#issuecomment-701863242


   @yuqian90 you will need this as well: https://github.com/apache/airflow/pull/7503


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

Posted by GitBox <gi...@apache.org>.
mik-laj commented on pull request #11010:
URL: https://github.com/apache/airflow/pull/11010#issuecomment-696836002


   @turbaszek  I am not a release manager. Unfortunately, I cannot help you. I am focusing on the development of Airflow 2.0  If I can help you with anything else, please let me know.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek commented on pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

Posted by GitBox <gi...@apache.org>.
turbaszek commented on pull request #11010:
URL: https://github.com/apache/airflow/pull/11010#issuecomment-696832830


   I'm not sure if we want to cherry pick this fix to 1.10, what others think @kaxil @ashb @potiuk @mik-laj 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil closed pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

Posted by GitBox <gi...@apache.org>.
kaxil closed pull request #11010:
URL: https://github.com/apache/airflow/pull/11010


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] yuqian90 commented on pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

Posted by GitBox <gi...@apache.org>.
yuqian90 commented on pull request #11010:
URL: https://github.com/apache/airflow/pull/11010#issuecomment-704065719


   > @yuqian90 you will need this as well: #7503
   
   Thanks for pointing out. I'm not using the features fixed in #7503 myself so it didn't affect me. However, you are right it should be included since it's a bug fix. So I cherry-picked it too.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

Posted by GitBox <gi...@apache.org>.
mik-laj commented on pull request #11010:
URL: https://github.com/apache/airflow/pull/11010#issuecomment-696836002


   @turbaszek  I am not a release manager. Unfortunately, I cannot help you. I am focusing on the development of Airflow 2.0  If I can help you with anything else, please let me know.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil commented on pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

Posted by GitBox <gi...@apache.org>.
kaxil commented on pull request #11010:
URL: https://github.com/apache/airflow/pull/11010#issuecomment-716505150


   @yuqian90 Can you rebase it once more, please ? Appreciate it


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on pull request #11010: [AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep

Posted by GitBox <gi...@apache.org>.
mik-laj commented on pull request #11010:
URL: https://github.com/apache/airflow/pull/11010#issuecomment-716422837


   @turbaszek I think this change should be in the next rellease. See: https://github.com/apache/airflow/issues/11780 WDYT?
   
   @potiuk @kaxil WDYT?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org