Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/05/15 00:36:15 UTC

[GitHub] [airflow] kaxil commented on a change in pull request #15183: Move common pitfall documentation to Airflow docs

kaxil commented on a change in pull request #15183:
URL: https://github.com/apache/airflow/pull/15183#discussion_r632870068



##########
File path: docs/apache-airflow/faq.rst
##########
@@ -159,72 +215,205 @@ simple dictionary.
         other_dag_id = f'bar_{i}'
         globals()[other_dag_id] = create_dag(other_dag_id)
 
-What are all the ``airflow tasks run`` commands in my process list?
--------------------------------------------------------------------
+Even though Airflow supports multiple DAG definitions per Python file, dynamically generated or otherwise, it is
+not recommended, as Airflow prefers better isolation between DAGs from a fault and deployment perspective, and
+multiple DAGs in the same file work against that.
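+
+As a minimal sketch of the preferred layout (the ``my_company.dag_factory`` module is a hypothetical shared
+helper, not an Airflow API), each generated DAG can instead live in its own file:
+
+.. code-block:: python
+
+        # dags/bar_1.py -- exactly one DAG per file
+        from my_company.dag_factory import create_dag  # hypothetical shared factory
+
+        dag = create_dag('bar_1')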
 
-There are many layers of ``airflow tasks run`` commands, meaning it can call itself.
 
-- Basic ``airflow tasks run``: fires up an executor, and tell it to run an
-  ``airflow tasks run --local`` command. If using Celery, this means it puts a
-  command in the queue for it to run remotely on the worker. If using
-  LocalExecutor, that translates into running it in a subprocess pool.
-- Local ``airflow tasks run --local``: starts an ``airflow tasks run --raw``
-  command (described below) as a subprocess and is in charge of
-  emitting heartbeats, listening for external kill signals
-  and ensures some cleanup takes place if the subprocess fails.
-- Raw ``airflow tasks run --raw`` runs the actual operator's execute method and
-  performs the actual work.
+Is top-level Python code allowed?
+---------------------------------
 
+While it is not recommended to write any code outside of defining Airflow constructs, Airflow does support
+arbitrary Python code as long as it does not break the DAG file processor or prolong file processing time past the
+:ref:`config:core__dagbag_import_timeout` value.
 
-How can my airflow dag run faster?
-----------------------------------
+A common example is violating this time limit when building a dynamic DAG, which usually requires querying data
+from another service such as a database. At the same time, that service is swamped by the DAG file processors'
+requests for the data needed to process the file. These unintended interactions may degrade the service and
+eventually cause DAG file processing to fail.
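+
+A minimal sketch of the distinction, assuming DAG ids are kept in a locally cached file that a separate job
+refreshes (the path is hypothetical, not an Airflow convention):
+
+.. code-block:: python
+
+        import json
+        from pathlib import Path
+
+        # Anti-pattern: a live database query here would run on every DAG file
+        # processor pass and can exceed the dagbag_import_timeout value.
+
+        # Cheaper: read a local file refreshed out of band by another process.
+        config_path = Path('/opt/airflow/dag_config.json')  # hypothetical location
+        dag_ids = json.loads(config_path.read_text()) if config_path.exists() else []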
 
-There are a few variables we can control to improve airflow dag performance:
+Refer to :ref:`DAG writing best practices<best_practice:writing_a_dag>` for more information.
 
-- ``parallelism``: This variable controls the number of task instances that runs simultaneously across the whole Airflow cluster. User could increase the ``parallelism`` variable in the ``airflow.cfg``.
-- ``concurrency``: The Airflow scheduler will run no more than ``concurrency`` task instances for your DAG at any given time. Concurrency is defined in your Airflow DAG. If you do not set the concurrency on your DAG, the scheduler will use the default value from the ``dag_concurrency`` entry in your ``airflow.cfg``.
-- ``task_concurrency``: This variable controls the number of concurrent running task instances across ``dag_runs`` per task.
-- ``max_active_runs``: the Airflow scheduler will run no more than ``max_active_runs`` DagRuns of your DAG at a given time. If you do not set the ``max_active_runs`` in your DAG, the scheduler will use the default value from the ``max_active_runs_per_dag`` entry in your ``airflow.cfg``.
-- ``pool``: This variable controls the number of concurrent running task instances assigned to the pool.
 
-How can we reduce the airflow UI page load time?
-------------------------------------------------
+Do Macros resolve in another Jinja template?
+--------------------------------------------
 
-If your dag takes long time to load, you could reduce the value of ``default_dag_run_display_number`` configuration in ``airflow.cfg`` to a smaller value. This configurable controls the number of dag run to show in UI with default value 25.
+It is not possible to render :ref:`Macros<macros>` or any Jinja template within another Jinja template. This is
+commonly attempted in ``user_defined_macros``.
 
+.. code-block:: python
 
-How to fix Exception: Global variable explicit_defaults_for_timestamp needs to be on (1)?
------------------------------------------------------------------------------------------
+        dag = DAG(
+            ...,
+            user_defined_macros={
+                'my_custom_macro': 'day={{ ds }}'
+            }
+        )
 
-This means ``explicit_defaults_for_timestamp`` is disabled in your mysql server and you need to enable it by:
+        bo = BashOperator(
+            task_id='my_task',
+            bash_command="echo {{ my_custom_macro }}",
+            dag=dag
+        )
 
-#. Set ``explicit_defaults_for_timestamp = 1`` under the ``mysqld`` section in your ``my.cnf`` file.
-#. Restart the Mysql server.
+This will echo ``day={{ ds }}`` instead of ``day=2020-01-01`` for a DAG run with the execution date 2020-01-01 00:00:00.
 
+.. code-block:: python
 
-How to reduce airflow dag scheduling latency in production?
------------------------------------------------------------
+        bo = BashOperator(
+            task_id='my_task',
+            bash_command="echo day={{ ds }}",
+            dag=dag
+        )
+
+By using the ``ds`` macro directly in the templated field, the rendered value results in ``day=2020-01-01``.
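+
+Alternatively, since callables passed through ``user_defined_macros`` are evaluated at render time, a reusable
+shortcut can be written as a function and invoked inside the template (a sketch; ``day_param`` is a name chosen
+here for illustration):
+
+.. code-block:: python
+
+        def day_param(ds):
+            return f'day={ds}'
+
+        dag = DAG(
+            ...,
+            user_defined_macros={'day_param': day_param}
+        )
+
+        bo = BashOperator(
+            task_id='my_task',
+            bash_command="echo {{ day_param(ds) }}",
+            dag=dag
+        )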
 
-Airflow 2 has low DAG scheduling latency out of the box (particularly when compared with Airflow 1.10.x),
-however if you need more throughput you can :ref:`start multiple schedulers<scheduler:ha>`.
 
-Why next_ds or prev_ds might not contain expected values?
----------------------------------------------------------
+Why might ``next_ds`` or ``prev_ds`` not contain the expected values?
+----------------------------------------------------------------------
 
 - When a DAG is scheduled, ``next_ds``, ``next_ds_nodash``, ``prev_ds`` and ``prev_ds_nodash`` are calculated
   using ``execution_date`` and ``schedule_interval``. If you set ``schedule_interval`` to ``None`` or ``@once``,
   the ``next_ds``, ``next_ds_nodash``, ``prev_ds`` and ``prev_ds_nodash`` values will be set to ``None``.
 - When a DAG is triggered manually, the schedule is ignored, and ``prev_ds == next_ds == ds`` (see the sketch below).
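+
+A minimal sketch of the manual-trigger case (task id and command chosen for illustration):
+
+.. code-block:: python
+
+        bo = BashOperator(
+            task_id='print_dates',
+            bash_command="echo prev={{ prev_ds }} cur={{ ds }} next={{ next_ds }}",
+            dag=dag
+        )
+
+For a manually triggered run, all three values render to the same date.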
 
+
+Task execution interactions
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+What does TemplateNotFound mean?
+---------------------------------

Review comment:
   ```suggestion
   What does ``TemplateNotFound`` mean?
   -------------------------------------
   ```

##########
File path: docs/apache-airflow/faq.rst
##########
@@ -159,72 +215,205 @@ simple dictionary.
+What does TemplateNotFound mean?
+---------------------------------
+
+TemplateNotFound errors are usually due to misalignment with user expectations when passing path to operator

Review comment:
   ```suggestion
   ``TemplateNotFound`` errors are usually due to misalignment with user expectations when passing path to operator
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org