Posted to commits@airflow.apache.org by ep...@apache.org on 2022/03/16 10:00:05 UTC

[airflow] 18/19: Extend documentation for states of DAGs & tasks and update trigger rules docs (#21382)

This is an automated email from the ASF dual-hosted git repository.

ephraimanierobi pushed a commit to branch v2-2-test
in repository https://gitbox.apache.org/repos/asf/airflow.git

commit cb37eb48c5abdb45b034104a8b3efe0de9820553
Author: Mateusz Nojek <ma...@gmail.com>
AuthorDate: Mon Feb 21 17:27:36 2022 +0100

    Extend documentation for states of DAGs & tasks and update trigger rules docs (#21382)
    
    (cherry picked from commit 4e959358ac4ef9554ff5d82cdc85ab7dc142a639)
---
 docs/apache-airflow/best-practices.rst |  67 +++++++++++++++++++++++++++++++++
 docs/apache-airflow/concepts/dags.rst  |   5 ++-
 docs/apache-airflow/concepts/tasks.rst |   6 +++
 docs/apache-airflow/dag-run.rst        |  27 +++++++++++--
 docs/apache-airflow/img/watcher.png    | Bin 0 -> 41592 bytes
 docs/spelling_wordlist.txt             |   1 +
 6 files changed, 101 insertions(+), 5 deletions(-)

diff --git a/docs/apache-airflow/best-practices.rst b/docs/apache-airflow/best-practices.rst
index 3b01b3e..15fcea1 100644
--- a/docs/apache-airflow/best-practices.rst
+++ b/docs/apache-airflow/best-practices.rst
@@ -326,6 +326,73 @@ each parameter by following the links):
 * :ref:`config:scheduler__parsing_processes`
 * :ref:`config:scheduler__file_parsing_sort_mode`
 
+Example of watcher pattern with trigger rules
+---------------------------------------------
+
+The watcher pattern is how we call a DAG with a task that is "watching" the states of the other tasks. Its primary purpose is to fail a DAG Run when any other task fails.
+The need came from the Airflow system tests, which are DAGs with different tasks (similar to a test containing steps).
+
+Normally, when any task fails, all other tasks are not executed and the whole DAG Run gets a failed status too. But when we use trigger rules, we can disrupt the normal flow of running tasks, and the whole DAG may end up with a different status than we expect.
+For example, we can have a teardown task (with trigger rule set to ``"all_done"``) that will be executed regardless of the state of the other tasks (e.g. to clean up the resources). In such a situation, the DAG would always run this task and the DAG Run will get the status of this particular task, so we can potentially lose the information about failing tasks.
+If we want to ensure that a DAG with a teardown task fails whenever any other task fails, we need to use the watcher pattern.
+The watcher task is a task that will always fail if triggered, but it should be triggered only when any other task fails. It needs to have its trigger rule set to ``"one_failed"``, and it also needs to be a downstream task of all other tasks in the DAG.
+Thanks to this, if every other task passes, the watcher will be skipped, but when something fails, the watcher task will be executed and fail, making the DAG Run fail too.
+
+.. note::
+
+    Be aware that trigger rules only rely on the direct upstream (parent) tasks, e.g. ``one_failed`` will ignore any failed (or ``upstream_failed``) tasks that are not a direct parent of the parameterized task.
+
+It's easier to grasp the concept with an example. Let's say that we have the following DAG:
+
+.. code-block:: python
+
+    from datetime import datetime
+    from airflow import DAG
+    from airflow.decorators import task
+    from airflow.exceptions import AirflowException
+    from airflow.operators.bash import BashOperator
+
+
+    @task(trigger_rule="one_failed", retries=0)
+    def watcher():
+        raise AirflowException("Failing task because one or more upstream tasks failed.")
+
+
+    with DAG(
+        dag_id="watcher_example",
+        schedule_interval="@once",
+        start_date=datetime(2021, 1, 1),
+        catchup=False,
+    ) as dag:
+        failing_task = BashOperator(
+            task_id="failing_task", bash_command="exit 1", retries=0
+        )
+        passing_task = BashOperator(
+            task_id="passing_task", bash_command="echo passing_task"
+        )
+        teardown = BashOperator(
+            task_id="teardown", bash_command="echo teardown", trigger_rule="all_done"
+        )
+
+        failing_task >> passing_task >> teardown
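+        # Set every task defined above as a direct upstream of the watcher,
+        # so that any single failure triggers it.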
+        list(dag.tasks) >> watcher()
+
+The visual representation of this DAG after execution looks like this:
+
+.. image:: /img/watcher.png
+
+We have several tasks that serve different purposes:
+
+- ``failing_task`` always fails,
+- ``passing_task`` always succeeds (if executed),
+- ``teardown`` is always triggered (regardless of the states of the other tasks) and it should always succeed,
+- ``watcher`` is a downstream task of every other task, i.e. it will be triggered when any task fails and thus fail the whole DAG Run, since it's a leaf task.
+
+It's important to note that without the ``watcher`` task, the whole DAG Run will get the ``success`` state, since the only failing task is not a leaf task and the ``teardown`` task will finish with ``success``.
+If we want the ``watcher`` to monitor the state of all tasks, we need to make it dependent on each of them separately. Thanks to this, we can fail the DAG Run if any of the tasks fail. Note that the watcher task has its trigger rule set to ``"one_failed"``.
+On the other hand, without the ``teardown`` task, the ``watcher`` task would not be needed, because ``failing_task`` would propagate its ``failed`` state to the downstream task ``passing_task``, and the whole DAG Run would also get the ``failed`` status.
+
 .. _best_practices/reducing_dag_complexity:
 
 Reducing DAG complexity
diff --git a/docs/apache-airflow/concepts/dags.rst b/docs/apache-airflow/concepts/dags.rst
index 32d21ce..b83b025 100644
--- a/docs/apache-airflow/concepts/dags.rst
+++ b/docs/apache-airflow/concepts/dags.rst
@@ -369,7 +369,7 @@ Note that if you are running the DAG at the very start of its life---specificall
 Trigger Rules
 ~~~~~~~~~~~~~
 
-By default, Airflow will wait for all upstream tasks for a task to be :ref:`successful <concepts:task-states>` before it runs that task.
+By default, Airflow will wait for all of a task's direct upstream (parent) tasks to be :ref:`successful <concepts:task-states>` before it runs that task.
 
 However, this is just the default behaviour, and you can control it using the ``trigger_rule`` argument to a Task. The options for ``trigger_rule`` are:
 
@@ -383,6 +383,7 @@ However, this is just the default behaviour, and you can control it using the ``
 * ``none_skipped``: No upstream task is in a ``skipped`` state - that is, all upstream tasks are in a ``success``, ``failed``, or ``upstream_failed`` state
 * ``always``: No dependencies at all, run this task at any time
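+
+For example, a cleanup task that should run regardless of how its upstream tasks finished can use the ``all_done`` rule (a minimal sketch; the task name is illustrative):
+
+.. code-block:: python
+
+    from airflow.operators.bash import BashOperator
+
+    # Runs once all upstream tasks are done, whatever their final states.
+    cleanup = BashOperator(
+        task_id="cleanup",
+        bash_command="echo cleanup",
+        trigger_rule="all_done",
+    )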
 
+
 You can also combine this with the :ref:`concepts:depends-on-past` functionality if you wish.
 
 .. note::
@@ -724,7 +725,7 @@ You can also prepare ``.airflowignore`` file for a subfolder in ``DAG_FOLDER`` a
 would only be applicable for that subfolder.
 
 DAG Dependencies
-================
+----------------
 
 *Added in Airflow 2.1*
 
diff --git a/docs/apache-airflow/concepts/tasks.rst b/docs/apache-airflow/concepts/tasks.rst
index f6c6619..41fa556 100644
--- a/docs/apache-airflow/concepts/tasks.rst
+++ b/docs/apache-airflow/concepts/tasks.rst
@@ -36,6 +36,12 @@ Relationships
 
 The key part of using Tasks is defining how they relate to each other - their *dependencies*, or as we say in Airflow, their *upstream* and *downstream* tasks. You declare your Tasks first, and then you declare their dependencies second.
 
+.. note::
+
+    A task's *upstream* tasks are only those that directly precede it (we used to call them parent tasks).
+    Be aware that this concept does not cover tasks that are higher in the task hierarchy but are not direct parents of the task.
+    The same definition applies to a *downstream* task, which needs to be a direct child of the other task.
+
 There are two ways of declaring dependencies - using the ``>>`` and ``<<`` (bitshift) operators::
 
     first_task >> second_task >> [third_task, fourth_task]
diff --git a/docs/apache-airflow/dag-run.rst b/docs/apache-airflow/dag-run.rst
index 1a2bbe3..52d90e0 100644
--- a/docs/apache-airflow/dag-run.rst
+++ b/docs/apache-airflow/dag-run.rst
@@ -18,6 +18,30 @@
 DAG Runs
 =========
 A DAG Run is an object representing an instantiation of the DAG in time.
+Any time the DAG is executed, a DAG Run is created and all tasks inside it are executed. The status of the DAG Run depends on the tasks' states.
+Each DAG Run is run separately from the others, meaning that you can have many runs of the same DAG at the same time.
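+
+If you need to limit how many runs of the same DAG may be active at once, the ``max_active_runs`` DAG argument controls this (a minimal sketch; the DAG id is illustrative):
+
+.. code-block:: python
+
+    from datetime import datetime
+    from airflow import DAG
+
+    # At most one run of this DAG will be active at any given time.
+    with DAG(
+        dag_id="example_dag",
+        schedule_interval="@daily",
+        start_date=datetime(2021, 1, 1),
+        max_active_runs=1,
+    ) as dag:
+        ...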
+
+.. _dag-run:dag-run-status:
+
+DAG Run Status
+''''''''''''''
+
+A DAG Run status is determined when the execution of the DAG is finished.
+The execution of the DAG depends on its containing tasks and their dependencies.
+The status is assigned to the DAG Run when all of the tasks are in one of the terminal states (i.e. there is no possible transition to another state), such as ``success``, ``failed`` or ``skipped``.
+The DAG Run's status is assigned based on the so-called "leaf nodes", or simply "leaves". Leaf nodes are the tasks with no children.
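+
+For example, you can list the leaves of a DAG by checking which tasks have no downstream dependencies (a minimal sketch using the ``downstream_task_ids`` attribute of a task):
+
+.. code-block:: python
+
+    # Leaf tasks are those with no downstream tasks.
+    leaves = [task for task in dag.tasks if not task.downstream_task_ids]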
+
+There are two possible terminal states for the DAG Run:
+
+- ``success`` if all of the leaf nodes' states are either ``success`` or ``skipped``,
+- ``failed`` if any of the leaf nodes' states is either ``failed`` or ``upstream_failed``.
+
+.. note::
+    Be careful if some of your tasks have defined some specific `trigger rule <dags.html#trigger-rules>`_.
+    These can lead to some unexpected behavior, e.g. if you have a leaf task with trigger rule ``"all_done"``, it will be executed regardless of the states of the rest of the tasks, and if it succeeds, then the whole DAG Run will also be marked as ``success``, even if something failed in the middle.
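+
+    A minimal sketch of this situation (the task names are illustrative):
+
+    .. code-block:: python
+
+        from datetime import datetime
+        from airflow import DAG
+        from airflow.operators.bash import BashOperator
+
+        with DAG(
+            dag_id="leaf_trigger_rule_example",
+            schedule_interval="@once",
+            start_date=datetime(2021, 1, 1),
+        ) as dag:
+            failing_task = BashOperator(task_id="failing_task", bash_command="exit 1", retries=0)
+            # ``cleanup`` is the only leaf; with ``all_done`` it runs and succeeds
+            # even though its upstream task failed, so the DAG Run is ``success``.
+            cleanup = BashOperator(
+                task_id="cleanup", bash_command="echo cleanup", trigger_rule="all_done"
+            )
+            failing_task >> cleanup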
+
+Cron Presets
+''''''''''''
 
 Each DAG may or may not have a schedule, which informs how DAG Runs are
 created. ``schedule_interval`` is defined as a DAG argument, which can be passed a
@@ -27,9 +51,6 @@ a ``str``, a ``datetime.timedelta`` object, or one of the following cron "preset
 .. tip::
     You can use an online editor for CRON expressions such as `Crontab guru <https://crontab.guru/>`_
 
-Cron Presets
-''''''''''''
-
 +----------------+----------------------------------------------------------------+-----------------+
 | preset         | meaning                                                        | cron            |
 +================+================================================================+=================+
diff --git a/docs/apache-airflow/img/watcher.png b/docs/apache-airflow/img/watcher.png
new file mode 100644
index 0000000..9e0ed2d
Binary files /dev/null and b/docs/apache-airflow/img/watcher.png differ
diff --git a/docs/spelling_wordlist.txt b/docs/spelling_wordlist.txt
index ed114b6..350b79d 100644
--- a/docs/spelling_wordlist.txt
+++ b/docs/spelling_wordlist.txt
@@ -1339,6 +1339,7 @@ taskflow
 taskinstance
 tblproperties
 tcp
+teardown
 templatable
 templateable
 templated