Posted to commits@airflow.apache.org by "Zhou Fang (JIRA)" <ji...@apache.org> on 2019/07/10 01:25:00 UTC

[jira] [Updated] (AIRFLOW-4924) Loading DAGs asynchronously in Airflow webserver

     [ https://issues.apache.org/jira/browse/AIRFLOW-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhou Fang updated AIRFLOW-4924:
-------------------------------
    Description: 
h2. Scalability Issue in Webserver

The Airflow webserver uses gunicorn workers to serve HTTP requests, and each worker loads all DAGs from the DAG files before it can serve requests. If there are many DAGs (e.g., > 1,000), loading them takes a significant amount of time.

The webserver also relies on periodically restarting gunicorn workers to refresh DAGs. The refresh interval is set by [webserver] worker_refresh_interval, which defaults to 30s. As a result, if loading all DAGs takes longer than 30s, workers are restarted before they finish loading, and the webserver never becomes ready to serve HTTP requests.

The current workaround is to skip DAG loading entirely via the env var SKIP_DAGS_PARSING. The webserver then starts, but no DAGs appear on the UI.
h2. Asynchronous DAG Loading

The solution proposed here is to load DAGs asynchronously in the background: a background process loads the DAGs, stringifies them, and sends them to the gunicorn worker processes. The stringifying step is needed because some fields cannot be pickled, e.g., locally defined functions and user-defined modules, so the loader aggressively transforms all DAG and task fields to be string-compatible.
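
A minimal sketch of that pipeline, assuming hypothetical helper names and a plain multiprocessing pipe (the actual patch may structure this differently):

{code:python}
import multiprocessing
import pickle

from airflow import settings
from airflow.models import DagBag


def _stringify(value):
    # Keep the value if it pickles; otherwise fall back to its string
    # representation (e.g. locally defined functions, user-defined modules).
    try:
        pickle.dumps(value)
        return value
    except Exception:
        return str(value)


def _collect_dags(conn):
    # Background process: parse the DAG files, make every task and DAG
    # attribute string-compatible, then ship the DAGs to the web worker.
    dagbag = DagBag(settings.DAGS_FOLDER)
    for dag in dagbag.dags.values():
        for task in dag.tasks:
            for attr, value in list(vars(task).items()):
                setattr(task, attr, _stringify(value))
        for attr, value in list(vars(dag).items()):
            setattr(dag, attr, _stringify(value))
    conn.send(dagbag.dags)  # Pipe.send pickles the stringified DAGs


if __name__ == '__main__':
    parent_conn, child_conn = multiprocessing.Pipe()
    loader = multiprocessing.Process(target=_collect_dags, args=(child_conn,))
    loader.start()
    dags = parent_conn.recv()  # received on the gunicorn worker side
    loader.join()
{code}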

The feature is enabled by setting [webserver] async_dagbag_loader = True. The background process pushes DAGs to the gunicorn workers gradually (every [webserver] dagbag_sync_interval seconds), and the DAG refresh interval is controlled by [webserver] collect_dags_interval.
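
For illustration, the proposed options would sit in the [webserver] section of airflow.cfg; the interval values below are placeholders, not proposed defaults:

{code}
[webserver]
# Load DAGs asynchronously in a background process.
async_dagbag_loader = True
# How often the background process pushes loaded DAGs to the
# gunicorn workers, in seconds (placeholder value).
dagbag_sync_interval = 10
# How often the background process re-collects DAGs from the
# DAG files, in seconds (placeholder value).
collect_dags_interval = 60
{code}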

Asynchronous DAG loading has been released in Google Cloud Composer as an Alpha feature:
[https://cloud.google.com/composer/docs/release-notes]
[https://cloud.google.com/composer/docs/how-to/accessing/airflow-web-interface]

This issue tracks merging the feature into upstream Airflow.

 

 


> Loading DAGs asynchronously in Airflow webserver
> ------------------------------------------------
>
>                 Key: AIRFLOW-4924
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4924
>             Project: Apache Airflow
>          Issue Type: New Feature
>          Components: webserver
>    Affects Versions: 1.10.2, 1.10.3, 1.10.4
>            Reporter: Zhou Fang
>            Assignee: Zhou Fang
>            Priority: Major
>              Labels: features, scalability, webserver
>             Fix For: 1.10.2, 1.10.3, 1.10.4
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)