Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2019/07/15 17:40:39 UTC

[GitHub] [airflow] coufon opened a new pull request #5594: [AIRFLOW-4924] Loading DAGs asynchronously in Airflow webserver

URL: https://github.com/apache/airflow/pull/5594
 
 
   
   ### Jira
   
   - [(/)] My PR addresses the following issues:
     - [Airflow Jira](https://issues.apache.org/jira/browse/AIRFLOW-4924)
     - In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)).
   
   ### Description
   
   - [(/)] Here are some details about my PR, including screenshots of any UI changes:
   
   #### Scalability Issue in Webserver
   The Airflow webserver uses gunicorn workers to serve HTTP requests. Each worker loads all DAGs from the DAG files before it serves requests. If there are many DAGs (e.g., more than 1,000), loading all of them can take a significant amount of time.
   
   The Airflow webserver also relies on restarting gunicorn workers to refresh all DAGs. The refresh interval is set by webserver-worker_refresh_interval, which defaults to 30s. As a result, if loading all DAGs takes more than 30s, the workers are restarted before they finish loading, and the webserver is never ready to serve HTTP requests.
   
   The current workaround is to skip loading DAGs by setting the environment variable SKIP_DAGS_PARSING. This lets the webserver start, but no DAGs appear in the UI.
   
   #### Asynchronous DAG Loading
   The solution here is to load DAGs asynchronously in the background. A background process loads the DAGs, stringifies them, and sends them to the gunicorn worker processes. The stringifying step is needed because some fields cannot be pickled, e.g., locally defined functions and user-defined modules. It aggressively transforms all fields of each DAG and task to be string-compatible.
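
   The flow described above can be sketched as follows. This is a minimal illustration and not the code in this PR: `stringify`, `load_dags`, `loader_process`, and `collect_dags` are hypothetical names, and DAGs are modeled as plain dictionaries rather than Airflow `DAG` objects.

```python
import multiprocessing as mp
import pickle


def stringify(value):
    """Recursively convert a value into picklable, string-compatible data."""
    if isinstance(value, (str, int, float, bool, type(None))):
        return value
    if isinstance(value, dict):
        return {str(key): stringify(item) for key, item in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [stringify(item) for item in value]
    # Functions, modules, and arbitrary objects are replaced by their string form.
    return str(value)


def load_dags():
    """Hypothetical stand-in for parsing DAG files; returns dict-shaped DAGs."""

    def local_callback(context):
        # Locally defined functions like this one cannot be pickled.
        return None

    return [
        {
            "dag_id": "example",
            "on_failure_callback": local_callback,
            "tasks": ["extract", "load"],
        }
    ]


def loader_process(dag_queue):
    """Background process body: load, stringify, and send each DAG."""
    for dag in load_dags():
        dag_queue.put(stringify(dag))
    dag_queue.put(None)  # Sentinel: no more DAGs in this round.


def collect_dags():
    """Worker side: start the loader and receive DAGs until the sentinel."""
    dag_queue = mp.Queue()
    process = mp.Process(target=loader_process, args=(dag_queue,))
    process.start()
    dags = {}
    while True:
        dag = dag_queue.get()
        if dag is None:
            break
        dags[dag["dag_id"]] = dag
    process.join()
    return dags
```

   Calling `collect_dags()` from a script (under an `if __name__ == "__main__":` guard, which `multiprocessing` needs on platforms that spawn new interpreters) should return a dictionary keyed by `dag_id`. The unpicklable callback survives only as its string representation, which is the trade-off the stringifying step accepts.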
   
   This feature is enabled by setting webserver-async_dagbag_loader=True. The background process sends DAGs to the gunicorn workers gradually, once every webserver-dagbag_sync_interval. The DAG refresh interval is controlled by webserver-collect_dags_interval.
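
   As a rough illustration of how these three options might fit together, the sketch below reads them from a config fragment and drains a queue of DAGs without blocking. It assumes that `webserver-<name>` refers to option `<name>` in the `[webserver]` section and that the intervals are in seconds; the interval values are placeholders, not documented defaults, and `read_loader_settings` and `sync_once` are hypothetical helpers rather than functions from this PR.

```python
import configparser
import queue

# Hypothetical airflow.cfg fragment. The interval values are placeholders.
EXAMPLE_CFG = """
[webserver]
async_dagbag_loader = True
dagbag_sync_interval = 10
collect_dags_interval = 30
"""


def read_loader_settings(text):
    """Parse the three async-loader options from config text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["webserver"]
    return {
        "enabled": section.getboolean("async_dagbag_loader", fallback=False),
        "sync_interval": section.getint("dagbag_sync_interval"),
        "collect_interval": section.getint("collect_dags_interval"),
    }


def sync_once(dag_queue, dagbag):
    """Move every DAG currently queued into the dagbag without blocking.

    A worker would call this once per sync interval, so DAGs show up
    gradually while the worker keeps serving requests in between.
    """
    received = 0
    while True:
        try:
            dag = dag_queue.get_nowait()
        except queue.Empty:
            return received
        dagbag[dag["dag_id"]] = dag
        received += 1
```

   Because `sync_once` never blocks, a worker can serve requests with whatever DAGs have arrived so far instead of waiting for the full set.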
   
   Asynchronous DAG loading has been released in Google Cloud Composer as an Alpha feature:
   https://cloud.google.com/composer/docs/release-notes
   https://cloud.google.com/composer/docs/how-to/accessing/airflow-web-interface
   
   This pull request was created to merge the feature into upstream Airflow.
   
   ### Tests
   
   - [(/)] My PR adds the following unit tests:
     - tests/dags/test_stringified_dags.py: DAGs can be successfully stringified, pickled, and sent over a multiprocessing queue.
     - tests/www/test_async_dag_loaders.py: the asynchronous DAG loader can successfully load DAGs.
   
   ### Commits
   
   - [(/)] My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [(/)] In case of new functionality, my PR adds documentation that describes how to use it.
     - All the public functions and classes in the PR contain docstrings that explain what they do
     - If you implement backwards incompatible changes, please leave a note in the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so we can assign it to an appropriate release
   
   ### Code Quality
   
   - [ ] Passes `flake8`
   
