Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2019/08/06 13:22:51 UTC

[GitHub] [airflow] amichai07 opened a new pull request #4751: [AIRFLOW-3607] collected trigger rule dep check per dag run

URL: https://github.com/apache/airflow/pull/4751
 
 
   ### Jira
   * My PR addresses the Jira issue https://issues.apache.org/jira/browse/AIRFLOW-3607 and references it in the PR title
   * Decreasing scheduler delay between tasks
   
   ### Description
   * The delay between tasks can be a major issue, especially for dags with many subdags.
     We found that the scheduling process spends a large share of its time in dependency checking.
     The trigger rule dependency queried the db once for each task instance; we changed it to query
     the db just once for each dag_run.
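
The difference between a per-task-instance check and a per-dag-run check can be sketched as follows. This is a minimal illustration, not the code in this PR: the `task_instance` schema, the `upstream` map, and all function names are hypothetical, and an in-memory SQLite database stands in for the Airflow metadata db.

```python
import sqlite3
from collections import Counter

# Hypothetical, simplified schema; the real Airflow table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance "
    "(dag_id TEXT, execution_date TEXT, task_id TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?, ?)",
    [
        ("etl", "2019-08-06", "extract", "success"),
        ("etl", "2019-08-06", "transform", "success"),
        ("etl", "2019-08-06", "load", None),
        ("etl", "2019-08-06", "report", None),
    ],
)

# Hypothetical dependency map: task_id -> list of upstream task_ids.
upstream = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "report": ["load", "transform"],
}

queries = {"count": 0}  # counts db round trips


def run_query(sql, params):
    queries["count"] += 1
    return conn.execute(sql, params).fetchall()


def upstream_counts_per_ti(dag_id, execution_date, task_id):
    """Per-task-instance approach: one query for each task with upstreams."""
    ups = upstream[task_id]
    if not ups:
        return Counter()
    placeholders = ",".join("?" * len(ups))
    rows = run_query(
        "SELECT state, COUNT(*) FROM task_instance "
        "WHERE dag_id = ? AND execution_date = ? "
        f"AND task_id IN ({placeholders}) GROUP BY state",
        [dag_id, execution_date, *ups],
    )
    return Counter(dict(rows))


def load_states_for_dag_run(dag_id, execution_date):
    """Per-dag-run approach: fetch every task state in a single query."""
    rows = run_query(
        "SELECT task_id, state FROM task_instance "
        "WHERE dag_id = ? AND execution_date = ?",
        [dag_id, execution_date],
    )
    return dict(rows)


def upstream_counts_from_cache(states, task_id):
    """Count upstream states in memory, without touching the db."""
    return Counter(states[u] for u in upstream[task_id])


queries["count"] = 0
per_ti = {t: upstream_counts_per_ti("etl", "2019-08-06", t) for t in upstream}
per_ti_queries = queries["count"]

queries["count"] = 0
states = load_states_for_dag_run("etl", "2019-08-06")
batched = {t: upstream_counts_from_cache(states, t) for t in upstream}
batched_queries = queries["count"]

print("queries per task instance:", per_ti_queries)
print("queries per dag run:", batched_queries)
```

Both approaches produce the same upstream state counts; only the number of db round trips changes, and the gap grows with the number of tasks in the dag.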
   
   ### Tests
   * My PR does not need extra testing for this extremely good reason:
     My PR uses the code from the  and also has a fallback to the original behaviour. The CI already covers all of the logic and cases that might occur.
   
   ### Commits
   * Removed unnecessary queries: the check runs once per dag run instead of once per ti
   
   ### Documentation
   * No new docs are needed
   
   ### Code Quality
   * Passes `flake8`
   * Tested in a production environment for 3 days
   
   ### Results
   The tests were run on a dag with many tasks (35 tasks).
   The tasks themselves do not run any db queries.
   
   **On local environment**
   before changes:
   - avg delay between tasks: 4.22 sec
   - number of queries during 10 minutes: 118,879 
   
   after collecting dep check queries:
   - avg delay between tasks: 3.86 sec
   - number of queries during 10 minutes: 104,397
   
   Stress test - running the dag every 10 sec for an hour:
   before changes: 
   - avg delay between tasks: 16.7 sec
   - number of queries: 943,230
   
   after:
   - avg delay between tasks: 3.28 sec
   - number of queries: 734,563
   
   **On production environment**
   before changes:
   - avg delay between tasks: 2.45 sec
   
   after:
   - avg delay between tasks: 2.16 sec
   
   Stress test - running with 150 other dags:
   before changes: 
    - avg delay between tasks: 6.3 sec
   
   after:
    - avg delay between tasks: 4.74 sec
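
To compare the scenarios above on a common scale, the relative reduction for each before/after pair can be computed with a short script. The figures are copied from the results above; the helper `pct_reduction` and the labels are only illustrative names.

```python
def pct_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# (label, before, after), figures taken from the results listed above.
results = [
    ("local: avg delay between tasks (sec)", 4.22, 3.86),
    ("local: queries during 10 minutes", 118879, 104397),
    ("local stress test: avg delay between tasks (sec)", 16.7, 3.28),
    ("local stress test: queries", 943230, 734563),
    ("production: avg delay between tasks (sec)", 2.45, 2.16),
    ("production stress test: avg delay between tasks (sec)", 6.3, 4.74),
]

for label, before, after in results:
    change = pct_reduction(before, after)
    print(f"{label}: {before} -> {after} ({change:.1f}% lower)")
```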

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services