Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/07/18 11:24:11 UTC

[GitHub] [airflow] BasPH commented on a diff in pull request #25121: Add "Optimizing" chapter to dynamic-dags section

BasPH commented on code in PR #25121:
URL: https://github.com/apache/airflow/pull/25121#discussion_r923249662


##########
docs/apache-airflow/howto/dynamic-dag-generation.rst:
##########
@@ -140,3 +140,20 @@ Each of them can run separately with related configuration
 
 .. warning::
   Using this practice, pay attention to "late binding" behaviour in Python loops. See `that GitHub discussion <https://github.com/apache/airflow/discussions/21278#discussioncomment-2103559>`_ for more details
+
+
+Optimizing DAG parsing in workers/Kubernetes Pods
+-------------------------------------------------
+
+Sometimes when you generate a lot of Dynamic DAGs in single DAG file, it might cause unnecessary delays
+when the DAG file is parsed in worker or in Kubernetes POD. In Workers or Kubernetes PODs, you actually
+need only the single DAG (and even single Task of the DAG) to be instantiated in order to execute the task.
+If creating your DAG objects takes a lot of time, and each generated DAG is created independently from each
+other, this might be optimized away by simply skipping the generation of DAGs in worker.

Review Comment:
   Couple of suggestions for clarity/conciseness. Would also add a self-contained example, so that the reader can gather all information from just the docs.
   
   ```suggestion
   Sometimes when you generate a lot of Dynamic DAGs in single DAG file, it might cause unnecessary delays
   when the DAG file is parsed in worker or in Kubernetes POD. In Workers or Kubernetes PODs, you actually
   need only the single DAG (and even single Task of the DAG) to be instantiated in order to execute the task.
   If creating your DAG objects takes a lot of time, and each generated DAG is created independently from each
   other, this might be optimized away by simply skipping the generation of DAGs in worker.
   ```
   
   The parsing time of dynamically generated DAGs in Airflow workers can be optimized. This optimization is most effective when the number of generated DAGs is high. The Airflow scheduler requires loading of a complete DAG file to process all metadata. However, an Airflow worker requires only a single DAG object to execute a task. This allows us to skip the generation of unnecessary DAG objects in the worker, shortening the parsing time. Upon evaluation of a DAG file, command line arguments are supplied which we can use to determine whether the scheduler or worker evaluates the file:
   
   - Scheduler args: ``["scheduler"]``
   - Worker args: ``["airflow", "tasks", "run", "dag_id", "task_id", ...]``
   
   Upon iterating over the collection of things to generate DAGs for, use these arguments to determine whether you need to generate all DAG objects (when running in the scheduler), or to generate only a single DAG object (when running in a worker):
   
    .. code-block:: python
        :emphasize-lines: 3,4,5,9,10

        import sys

        current_dag = None
        if len(sys.argv) > 3:
            current_dag = sys.argv[3]  # dag_id from "airflow tasks run <dag_id> <task_id> ..."

        for thing in list_of_things:
            dag_id = f"generated_dag_{thing}"
            if current_dag is not None and current_dag != dag_id:
                continue  # skip generation of non-selected DAG

            dag = DAG(dag_id=dag_id, ...)
            globals()[dag_id] = dag
   
    A nice example is shown in the `Airflow's Magic Loop <https://medium.com/apache-airflow/airflows-magic-loop-ec424b05b629>`_
    blog post that describes how parsing in workers was reduced from 120 seconds to 200 ms.
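    
    Below is a minimal, self-contained sketch of the same idea (an editorial addition, not part of the suggestion above): it wraps
    the detection in a hypothetical ``get_current_dag_id`` helper and additionally checks the ``tasks run`` subcommand before
    trusting ``sys.argv[3]``, so other CLI invocations with four or more arguments do not accidentally suppress DAG generation.
    The ``list_of_things`` value is an illustrative stand-in for the real collection.
    
    .. code-block:: python
    
        import sys
        from datetime import datetime
    
        from airflow import DAG
    
    
        def get_current_dag_id():
            """Return the dag_id a worker is running, or None when the whole file should be parsed."""
            # A worker invocation looks like: airflow tasks run <dag_id> <task_id> ...
            if len(sys.argv) > 3 and sys.argv[1:3] == ["tasks", "run"]:
                return sys.argv[3]
            return None
    
    
        current_dag = get_current_dag_id()
        list_of_things = ["a", "b", "c"]  # illustrative stand-in for the real collection
    
        for thing in list_of_things:
            dag_id = f"generated_dag_{thing}"
            if current_dag is not None and current_dag != dag_id:
                continue  # skip generation of non-selected DAGs
    
            globals()[dag_id] = DAG(
                dag_id=dag_id,
                start_date=datetime(2022, 1, 1),
                schedule_interval=None,
            )
    
    Checking the subcommand is only a defensive refinement; the simpler ``len(sys.argv) > 3`` check in the example above is
    usually sufficient.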


