Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2019/07/16 17:27:25 UTC

[GitHub] [airflow] coufon edited a comment on issue #5594: [AIRFLOW-4924] Loading DAGs asynchronously in Airflow webserver

URL: https://github.com/apache/airflow/pull/5594#issuecomment-511909913
 
 
   Hi Jarek, thanks for your comments. Here are my thoughts:
   
   > a starting point to implement part of DAG persistence
   
   We are working on storing 'stringified DAGs' in the DB, to be used by both the webserver and the scheduler. We found this is straightforward now because a 'stringified DAG' is always picklable. I will send out an AIP soon. This change (which still uses the current DAG classes) is not as fundamental as:
   https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-18+Persist+all+information+from+DAG+file+in+DB
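
   As a rough illustration of the "stringified DAG" idea (the function and field names below are made up for this sketch, not the actual implementation), the point is to reduce a DAG to plain, JSON-serializable data that any process can store in the DB and read back without importing the user's DAG file:

   ```python
   # Illustrative sketch only: reduce a DAG to plain, JSON-serializable data.
   import json

   from airflow.models import DAG


   def stringify_dag(dag: DAG) -> str:
       return json.dumps({
           "dag_id": dag.dag_id,
           "schedule_interval": str(dag.schedule_interval),
           "tasks": [
               {
                   "task_id": task.task_id,
                   "operator_class": type(task).__name__,
                   "downstream_task_ids": sorted(task.downstream_task_ids),
               }
               for task in dag.tasks
           ],
       })
   ```

   The resulting string can be stored in a DB table keyed by dag_id, and both the webserver and the scheduler can load it back without touching the DAG files.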
   
   > maybe you can share your experiences with an actual "production" usage of this?
   
   We implemented async_dag_loader in Composer because we observe more and more users running a large number of DAGs in a single Airflow cluster. The webserver frequently goes down because the time to collect all DAGs exceeds the gunicorn worker refresh interval. So this feature is suggested for any user running >= 1,000 DAGs.
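
   Roughly, the idea is the following (a sketch of the approach, not the PR's exact code): DAGs are collected in a separate process, and the resulting DagBag is handed back to the webserver main process in pickled form, so gunicorn workers never block on parsing DAG files.

   ```python
   # Sketch of the async loading idea, assuming Airflow 1.10-style APIs.
   import multiprocessing

   from airflow.models import DagBag


   def _collect_dags(queue):
       # Runs in a child process: imports user modules and parses all DAG files.
       queue.put(DagBag())


   def refresh_dagbag():
       queue = multiprocessing.Queue()
       proc = multiprocessing.Process(target=_collect_dags, args=(queue,))
       proc.start()
       dagbag = queue.get()  # the DagBag is unpickled here, in the main process
       proc.join()
       return dagbag
   ```

   The unpickling step in the main process is exactly where the "module not found" issue discussed below comes from.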
   
   Even with async_dag_loader, there is still a memory issue. Composer runs the Airflow webserver on a separate VM, and collecting thousands of DAGs is memory intensive (although individual DAG objects do not consume much memory). Users may therefore still see the webserver go down due to OOM, so we suggest setting [webserver] workers=1. We are currently working on storing 'stringified DAGs' in the DB as the long-term solution.
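
   For reference, the relevant knobs look like this in airflow.cfg (option names as in Airflow 1.10; the values are only illustrative):

   ```
   [webserver]
   # Fewer gunicorn workers means fewer in-memory copies of all DAGs.
   workers = 1
   # Give each worker more time before it is recycled, so DAG collection can finish.
   worker_refresh_interval = 3600
   ```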
   
   > casting to BaseOperator for non-airflow modules
   
   There may be errors when unpickling classes defined in non-airflow modules. These modules are imported in the 'DAG collecting' processes, but not in the webserver main process, so unpickling their objects leads to 'module not found' errors. If we imported these modules in the webserver main process, we would have to process the DAG files there, which brings us back to synchronous DAG loading.
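
   The failure mode itself is plain pickle behavior and can be reproduced outside Airflow (self-contained illustration, not code from the PR):

   ```python
   # pickle stores only a module path and a class name, so unpickling in a
   # process that cannot import that module fails with "module not found".
   import importlib
   import os
   import pickle
   import sys
   import textwrap

   sys.path.insert(0, os.getcwd())

   # A throwaway module standing in for a user's DAG file with a custom operator.
   with open("my_plugin.py", "w") as f:
       f.write(textwrap.dedent("""
           class MyCustomOperator:
               pass
       """))
   importlib.invalidate_caches()

   plugin = importlib.import_module("my_plugin")
   payload = pickle.dumps(plugin.MyCustomOperator())

   # Simulate the webserver main process, where my_plugin was never imported
   # and cannot be imported.
   del sys.modules["my_plugin"]
   os.remove("my_plugin.py")
   importlib.invalidate_caches()

   try:
       pickle.loads(payload)
   except ModuleNotFoundError as err:
       print(err)  # No module named 'my_plugin'
   ```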
   
   Here is an example: https://github.com/apache/airflow/blob/master/airflow/example_dags/example_skip_dag.py
   
   In this Airflow example DAG, there is a non-airflow operator, "class DummySkipOperator" (not defined in airflow/operators or airflow/contrib/operators). The DAG containing that operator cannot be unpickled unless we replace it with BaseOperator.
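
   Roughly, the replacement looks like this (a sketch of the idea, not the PR's exact code; the attribute list is only illustrative):

   ```python
   # Downgrade operators defined outside the airflow package to plain
   # BaseOperator instances before pickling, so the webserver can unpickle
   # the DAG without importing user code.
   from airflow.models import BaseOperator


   def cast_to_base_operator(task):
       if type(task).__module__.startswith("airflow."):
           return task  # class is importable in the webserver; keep as-is
       replacement = BaseOperator(task_id=task.task_id)
       # Copy the attributes the web UI reads; a real implementation would
       # also have to re-wire upstream/downstream task relationships.
       for attr in ("owner", "retries", "queue", "pool", "ui_color"):
           if hasattr(task, attr):
               setattr(replacement, attr, getattr(task, attr))
       return replacement
   ```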
   
   
   
   
