Posted to commits@airflow.apache.org by ba...@apache.org on 2022/07/11 13:17:00 UTC

[airflow] branch main updated: Rewrite the Airflow documentation home page (#24795)

This is an automated email from the ASF dual-hosted git repository.

basph pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/airflow.git


The following commit(s) were added to refs/heads/main by this push:
     new 32f5eb1e8d Rewrite the Airflow documentation home page (#24795)
32f5eb1e8d is described below

commit 32f5eb1e8da837eac7cd475f8a7baa9ed21fa351
Author: Bas Harenslak <Ba...@users.noreply.github.com>
AuthorDate: Mon Jul 11 15:16:53 2022 +0200

    Rewrite the Airflow documentation home page (#24795)
    
    * Rewrite the Airflow home page
    
    * Rename home to overview
    
    * Ignore parameterization
    
    * Remove history since that can be read elsewhere
    
    * Link companies to inthewild.md
    
    * Process comment
---
 docs/apache-airflow/img/airflow.gif                | Bin 416302 -> 0 bytes
 docs/apache-airflow/img/hello_world_graph_view.png | Bin 0 -> 73688 bytes
 docs/apache-airflow/img/hello_world_grid_view.png  | Bin 0 -> 132851 bytes
 docs/apache-airflow/index.rst                      | 121 +++++++++++++++------
 docs/spelling_wordlist.txt                         |   1 +
 5 files changed, 86 insertions(+), 36 deletions(-)

diff --git a/docs/apache-airflow/img/airflow.gif b/docs/apache-airflow/img/airflow.gif
deleted file mode 100644
index 076fe8e978..0000000000
Binary files a/docs/apache-airflow/img/airflow.gif and /dev/null differ
diff --git a/docs/apache-airflow/img/hello_world_graph_view.png b/docs/apache-airflow/img/hello_world_graph_view.png
new file mode 100644
index 0000000000..18ef6eb4ff
Binary files /dev/null and b/docs/apache-airflow/img/hello_world_graph_view.png differ
diff --git a/docs/apache-airflow/img/hello_world_grid_view.png b/docs/apache-airflow/img/hello_world_grid_view.png
new file mode 100644
index 0000000000..e2140c17eb
Binary files /dev/null and b/docs/apache-airflow/img/hello_world_grid_view.png differ
diff --git a/docs/apache-airflow/index.rst b/docs/apache-airflow/index.rst
index d6a781e0c1..66ac7f9d45 100644
--- a/docs/apache-airflow/index.rst
+++ b/docs/apache-airflow/index.rst
@@ -15,65 +15,114 @@
     specific language governing permissions and limitations
     under the License.
 
+What is Airflow?
+=========================================
+
+`Apache Airflow <https://github.com/apache/airflow>`_ is an open-source platform for developing, scheduling,
+and monitoring batch-oriented workflows. Airflow's extensible Python framework enables you to build workflows
+connecting with virtually any technology. A web interface helps manage the state of your workflows. Airflow is
+deployable in many ways, ranging from a single process on your laptop to a distributed setup that supports even
+the biggest workflows.
 
+Workflows as code
+=========================================
+The main characteristic of Airflow workflows is that all workflows are defined in Python code. "Workflows as
+code" serves several purposes:
 
+- **Dynamic**: Airflow pipelines are configured as Python code, allowing for dynamic pipeline generation.
+- **Extensible**: The Airflow framework contains operators to connect with numerous technologies. All Airflow components are extensible to easily adjust to your environment.
+- **Flexible**: Workflow parameterization is built in, leveraging the `Jinja <https://jinja.palletsprojects.com>`_ templating engine (see the templating sketch below).
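+
+As a minimal sketch of that flexibility (the DAG and task names here are illustrative, not part of Airflow),
+a templated field such as ``bash_command`` is rendered by Jinja at runtime, so ``{{ ds }}`` becomes the
+logical date of each run:
+
+.. code-block:: python
+
+    from datetime import datetime
+
+    from airflow import DAG
+    from airflow.operators.bash import BashOperator
+
+    with DAG(dag_id="templated_demo", start_date=datetime(2022, 1, 1), schedule_interval="@daily"):
+        # Jinja renders "{{ ds }}" into the logical date before the command runs
+        BashOperator(task_id="print_date", bash_command="echo 'Run date: {{ ds }}'")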
 
-.. image:: ../../airflow/www/static/pin_large.png
-    :width: 100
+Take a look at the following snippet of code:
 
-Apache Airflow Documentation
-=========================================
+.. code-block:: python
+
+    from datetime import datetime
+
+    from airflow import DAG
+    from airflow.operators.bash import BashOperator
+    from airflow.operators.python import PythonOperator
+
+    # A DAG represents a workflow, a collection of tasks
+    with DAG(dag_id="demo", start_date=datetime(2022, 1, 1), schedule_interval="0 0 * * *") as dag:
 
-Airflow is a platform to programmatically author, schedule and monitor
-workflows.
+        # Tasks are represented as operators
+        hello = BashOperator(task_id="hello", bash_command="echo hello")
+        airflow = PythonOperator(task_id="airflow", python_callable=lambda: print("airflow"))
 
-Use Airflow to author workflows as Directed Acyclic Graphs (DAGs) of tasks.
-The Airflow scheduler executes your tasks on an array of workers while
-following the specified dependencies. Rich command line utilities make
-performing complex surgeries on DAGs a snap. The rich user interface
-makes it easy to visualize pipelines running in production,
-monitor progress, and troubleshoot issues when needed.
+        # Set dependencies between tasks
+        hello >> airflow
 
-When workflows are defined as code, they become more maintainable,
-versionable, testable, and collaborative.
 
+Here you see:
 
+- A DAG named "demo", starting on Jan 1st 2022 and running once a day. A DAG is Airflow's representation of a workflow.
+- Two tasks: a BashOperator running a Bash command and a PythonOperator running a Python function.
+- ``>>`` between the tasks defines a dependency and controls the order in which the tasks are executed (equivalent forms are sketched below).
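+
+As a standalone sketch (the DAG and task names are illustrative), the same dependency can be declared in
+several equivalent ways using the standard task API:
+
+.. code-block:: python
+
+    from datetime import datetime
+
+    from airflow import DAG
+    from airflow.operators.empty import EmptyOperator
+
+    with DAG(dag_id="deps_demo", start_date=datetime(2022, 1, 1), schedule_interval=None):
+        first = EmptyOperator(task_id="first")
+        second = EmptyOperator(task_id="second")
+
+        # All three lines declare the same dependency: first runs before second
+        first >> second
+        second << first
+        first.set_downstream(second)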
 
-.. image:: img/airflow.gif
+Airflow evaluates this script and executes the tasks at the set interval and in the defined order. The status
+of the "demo" DAG is visible in the web interface:
 
-------------
+.. image:: /img/hello_world_graph_view.png
+  :alt: Demo DAG in the Graph View, showing the status of one DAG run
 
-Principles
-----------
+This example demonstrates a simple Bash command and Python function, but these tasks can run arbitrary code.
+Think of running a Spark job, moving data between two buckets, or sending an email. The same structure can
+also be seen running over time:
 
-- **Dynamic**:  Airflow pipelines are configuration as code (Python), allowing for dynamic pipeline generation. This allows for writing code that instantiates pipelines dynamically.
-- **Extensible**:  Easily define your own operators, executors and extend the library so that it fits the level of abstraction that suits your environment.
-- **Elegant**:  Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of Airflow using the powerful **Jinja** templating engine.
-- **Scalable**:  Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. Airflow is ready to scale to infinity.
+.. image:: /img/hello_world_grid_view.png
+  :alt: Demo DAG in the Grid View, showing the status of all DAG runs
 
+Each column represents one DAG run. These are two of the most used views in Airflow, but there are several
+other views that allow you to dive deep into the state of your workflows.
 
-Beyond the Horizon
-------------------
+Why Airflow?
+=========================================
+Airflow is a batch workflow orchestration platform. The Airflow framework contains operators to connect with
+many technologies and is easily extensible to connect with new technologies. If your workflows have a clear
+start and end, and run at regular intervals, they can be programmed as an Airflow DAG.
+
+If you prefer coding over clicking, Airflow is the tool for you. Workflows are defined as Python code, which
+means:
+
+- Workflows can be stored in version control so that you can roll back to previous versions
+- Workflows can be developed by multiple people simultaneously
+- Tests can be written to validate functionality (a sketch of one common approach follows this list)
+- Components are extensible and you can build on a wide collection of existing components
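+
+As a minimal sketch of such a test (the function name is illustrative; ``DagBag`` is Airflow's DAG loader),
+one common approach is to assert that every DAG file parses without import errors:
+
+.. code-block:: python
+
+    from airflow.models import DagBag
+
+
+    def test_no_import_errors():
+        dag_bag = DagBag(include_examples=False)
+        assert dag_bag.import_errors == {}
+
+A test like this is typically run with a test runner such as pytest as part of CI.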
+
+Rich scheduling and execution semantics enable you to easily define complex pipelines that run at regular
+intervals. Backfilling allows you to (re-)run pipelines on historical data after making changes to your logic.
+And the ability to rerun partial pipelines after resolving an error helps maximize efficiency.
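+
+As a minimal sketch of backfilling (the DAG name is illustrative), a DAG with ``catchup=True`` tells the
+scheduler to create a run for every interval between ``start_date`` and now, so historical data is processed
+automatically; the ``airflow dags backfill`` CLI command can likewise (re-)run a date range on demand:
+
+.. code-block:: python
+
+    from datetime import datetime
+
+    from airflow import DAG
+    from airflow.operators.bash import BashOperator
+
+    with DAG(
+        dag_id="backfill_demo",
+        start_date=datetime(2022, 1, 1),
+        schedule_interval="@daily",
+        catchup=True,  # schedule runs for all past intervals since start_date
+    ):
+        BashOperator(task_id="process", bash_command="echo 'Processing {{ ds }}'")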
 
-Airflow **is not** a data streaming solution. Tasks do not move data from
-one to the other (though tasks can exchange metadata!). Airflow is not
-in the `Spark Streaming <http://spark.apache.org/streaming/>`_
-or `Storm <https://storm.apache.org/>`_ space, it is more comparable to
-`Oozie <http://oozie.apache.org/>`_ or
-`Azkaban <https://azkaban.github.io/>`_.
+Airflow's user interface provides both in-depth views of pipelines and individual tasks, and an overview of
+pipelines over time. From the interface, you can inspect logs and manage tasks, for example by retrying a
+task after a failure.
+
+The open-source nature of Airflow ensures you work on components developed, tested, and used by many other
+`companies <https://github.com/apache/airflow/blob/main/INTHEWILD.md>`_ around the world. In the active
+`community <https://airflow.apache.org/community>`_, you can find plenty of helpful resources in the form of
+blog posts, articles, conferences, books, and more. You can connect with peers via several channels
+such as `Slack <https://s.apache.org/airflow-slack>`_ and mailing lists.
+
+Why not Airflow?
+=========================================
+Airflow was built for finite batch workflows. While the CLI and REST API do allow triggering workflows,
+Airflow was not built for infinitely running, event-based workflows. Airflow is not a streaming solution.
+However, a streaming system such as Apache Kafka is often seen working together with Apache Airflow: Kafka
+handles ingestion and processing in real time, event data is written to a storage location, and Airflow
+periodically starts a workflow that processes a batch of that data.
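+
+As a minimal sketch of that pattern (the DAG name, task name, and callable are illustrative), Airflow's part
+is simply a periodic batch job over whatever the streaming side has already landed in storage:
+
+.. code-block:: python
+
+    from datetime import datetime
+
+    from airflow import DAG
+    from airflow.operators.python import PythonOperator
+
+
+    def process_batch(ds, **kwargs):
+        # Hypothetical batch step: read the events a streaming pipeline
+        # (e.g. Kafka consumers) wrote to storage for this date, then process them
+        print(f"Processing events for {ds}")
+
+
+    with DAG(dag_id="event_batch_demo", start_date=datetime(2022, 1, 1), schedule_interval="@daily"):
+        PythonOperator(task_id="process_batch", python_callable=process_batch)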
 
-Workflows are expected to be mostly static or slowly changing. You can think
-of the structure of the tasks in your workflow as slightly more dynamic
-than a database structure would be. Airflow workflows are expected to look
-similar from a run to the next, this allows for clarity around
-unit of work and continuity.
+If you prefer clicking over coding, Airflow is probably not the right solution. The web interface aims to make
+managing workflows as easy as possible, and the Airflow framework is continuously improved to make the
+developer experience as smooth as possible. However, the philosophy of Airflow is to define workflows as code,
+so coding will always be required.
 
 
 .. toctree::
     :hidden:
     :caption: Content
 
-    Home <self>
+    Overview <self>
     project
     license
     start/index
diff --git a/docs/spelling_wordlist.txt b/docs/spelling_wordlist.txt
index 3af1c984b6..e6779724bf 100644
--- a/docs/spelling_wordlist.txt
+++ b/docs/spelling_wordlist.txt
@@ -1157,6 +1157,7 @@ param
 parametable
 parameterType
 parameterValue
+parameterization
 parameterizing
 paramiko
 params