Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/07/08 04:29:45 UTC

[GitHub] [airflow] jedcunningham commented on a diff in pull request #24795: Rewrite the Airflow documentation home page

jedcunningham commented on code in PR #24795:
URL: https://github.com/apache/airflow/pull/24795#discussion_r916451958


##########
docs/apache-airflow/index.rst:
##########
@@ -15,65 +15,114 @@
     specific language governing permissions and limitations
     under the License.
 
+What is Airflow?
+=========================================
+
+`Apache Airflow <https://github.com/apache/airflow>`_ is an open-source platform for developing, scheduling,
+and monitoring batch-oriented workflows. Airflow's extensible Python framework enables you to build workflows
+connecting with virtually any technology. A web interface helps manage the state of your workflows. Airflow is
+deployable in many ways, varying from a single process on your laptop to a distributed setup to support even
+the biggest workflows.
 
+Workflows as code
+=========================================
+The main characteristic of Airflow workflows is that all workflows are defined in Python code. "Workflows as
+code" serves several purposes:
 
+- **Dynamic**: Airflow pipelines are configured as Python code, allowing for dynamic pipeline generation.
+- **Extensible**: The Airflow framework contains operators to connect with numerous technologies. All Airflow components are extensible to easily adjust to your environment.
+- **Flexible**: Workflow parameterization is built in, leveraging the `Jinja <https://jinja.palletsprojects.com>`_ templating engine (a brief sketch follows).
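+
+For illustration, Jinja templating lets a task receive run-specific values at execution time. This is a
+minimal sketch, not part of the proposed page (the ``templated`` task is hypothetical): the built-in
+``{{ ds }}`` variable is rendered to the logical date of each run.
+
+.. code-block:: python
+
+    from airflow.operators.bash import BashOperator
+
+    # Jinja renders "{{ ds }}" to the run's logical date, e.g. "2022-01-01"
+    templated = BashOperator(task_id="templated", bash_command="echo {{ ds }}")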
 
-.. image:: ../../airflow/www/static/pin_large.png
-    :width: 100
+Take a look at the following snippet of code:
 
-Apache Airflow Documentation
-=========================================
+.. code-block:: python
+
+    from datetime import datetime
+
+    from airflow import DAG
+    from airflow.operators.bash import BashOperator
+    from airflow.operators.python import PythonOperator
+
+    # A DAG represents a workflow, a collection of tasks
+    with DAG(dag_id="demo", start_date=datetime(2022, 1, 1), schedule_interval="0 0 * * *") as dag:
 
-Airflow is a platform to programmatically author, schedule and monitor
-workflows.
+        # Tasks are represented as operators
+        hello = BashOperator(task_id="hello", bash_command="echo hello")
+        airflow = PythonOperator(task_id="airflow", python_callable=lambda: print("airflow"))
 
-Use Airflow to author workflows as Directed Acyclic Graphs (DAGs) of tasks.
-The Airflow scheduler executes your tasks on an array of workers while
-following the specified dependencies. Rich command line utilities make
-performing complex surgeries on DAGs a snap. The rich user interface
-makes it easy to visualize pipelines running in production,
-monitor progress, and troubleshoot issues when needed.
+        # Set dependencies between tasks
+        hello >> airflow
 
-When workflows are defined as code, they become more maintainable,
-versionable, testable, and collaborative.
 
+Here you see:
 
+- A DAG named "demo", starting on Jan 1st 2022 and running once a day. A DAG is Airflow's representation of a workflow.
+- Two tasks, a BashOperator running a Bash command and a PythonOperator running a Python function
+- ``>>`` between the tasks defines a dependency and controls the order in which the tasks are executed (equivalent forms are sketched below)
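+
+For illustration, the same dependency can be declared in several equivalent ways; this minimal sketch reuses
+the ``hello`` and ``airflow`` tasks from the snippet above:
+
+.. code-block:: python
+
+    # Each line below expresses the same thing: run "hello" before "airflow"
+    hello >> airflow
+    airflow << hello
+    hello.set_downstream(airflow)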
 
-.. image:: img/airflow.gif
+Airflow evaluates this script and executes the tasks at the set interval and in the defined order. The status
+of the "demo" DAG is visible in the web interface:
 
-------------
+.. image:: /img/hello_world_graph_view.png
+  :alt: Demo DAG in the Graph View, showing the status of one DAG run
 
-Principles
-----------
+This example demonstrates a simple Bash command and Python function, but these tasks can run arbitrary code. Think
+of running a Spark job, moving data between two buckets, or sending an email. The same structure can also be
+seen running over time:
 
-- **Dynamic**:  Airflow pipelines are configuration as code (Python), allowing for dynamic pipeline generation. This allows for writing code that instantiates pipelines dynamically.
-- **Extensible**:  Easily define your own operators, executors and extend the library so that it fits the level of abstraction that suits your environment.
-- **Elegant**:  Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of Airflow using the powerful **Jinja** templating engine.
-- **Scalable**:  Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. Airflow is ready to scale to infinity.
+.. image:: /img/hello_world_grid_view.png
+  :alt: Demo DAG in the Grid View, showing the status of all DAG runs
 
+Each column represents one DAG run. These are two of the most used views in Airflow, but there are several
+other views that allow you to dive deep into the state of your workflows.
 
-Beyond the Horizon
-------------------
+Why Airflow?
+=========================================
+Airflow is a batch workflow orchestration platform. The Airflow framework contains operators to connect with
+many technologies and is easily extended to connect with new technologies. If your workflows have a clear
+start and end, and run at regular intervals, they can be programmed as an Airflow DAG.
+
+If you prefer coding over clicking, Airflow is the tool for you. Workflows are defined as Python code which
+means:
+
+- Workflows can be stored in version control so that you can roll back to previous versions
+- Workflows can be developed by multiple people simultaneously
+- Tests can be written to validate functionality (see the sketch after this list)
+- Components are extensible and you can build on a wide collection of existing components
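+
+As a minimal sketch of the testing point above (assuming the DAG files live on the configured DAG path), a
+test can load all DAGs through Airflow's ``DagBag`` and assert that they import cleanly:
+
+.. code-block:: python
+
+    from airflow.models import DagBag
+
+
+    def test_no_import_errors():
+        # DagBag parses every DAG file and records any import errors it encounters
+        dag_bag = DagBag(include_examples=False)
+        assert dag_bag.import_errors == {}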
+
+Rich scheduling and execution semantics enable you to easily define complex pipelines that run at regular
+intervals. Backfilling allows you to (re-)run pipelines on historical data after making changes to your logic.
+And the ability to rerun partial pipelines after resolving an error helps maximize efficiency.
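+
+As an illustration of running on historical data, a DAG can opt in to "catchup": the scheduler then creates
+a run for every interval between ``start_date`` and now. This is a minimal sketch (the ``demo_catchup`` DAG
+and its task are hypothetical):
+
+.. code-block:: python
+
+    from datetime import datetime
+
+    from airflow import DAG
+    from airflow.operators.bash import BashOperator
+
+    # catchup=True asks the scheduler to also create runs for past, unprocessed intervals
+    with DAG(
+        dag_id="demo_catchup",
+        start_date=datetime(2022, 1, 1),
+        schedule_interval="@daily",
+        catchup=True,
+    ) as dag:
+        BashOperator(task_id="process", bash_command="echo {{ ds }}")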
 
-Airflow **is not** a data streaming solution. Tasks do not move data from
-one to the other (though tasks can exchange metadata!). Airflow is not
-in the `Spark Streaming <http://spark.apache.org/streaming/>`_
-or `Storm <https://storm.apache.org/>`_ space, it is more comparable to
-`Oozie <http://oozie.apache.org/>`_ or
-`Azkaban <https://azkaban.github.io/>`_.
+Airflow's user interface provides both in-depth views of pipelines and individual tasks, and an overview of
+pipelines over time. From the interface, you can inspect logs and manage tasks, for example retrying a task in
+case of failure.
+
+The open-source nature of Airflow ensures you work on components developed, tested, and used by many other
+`companies <https://github.com/apache/airflow/blob/main/INTHEWILD.md>`_ around the world. In the active
+`community <https://airflow.apache.org/community>`_ you can find plenty of helpful resources in the form of
+blog posts, articles, conferences, books, and more. You can connect with other peers via several channels
+such as `Slack <https://s.apache.org/airflow-slack>`_ and a mailing list.

Review Comment:
   ```suggestion
   such as `Slack <https://s.apache.org/airflow-slack>`_ and mailing lists.
   ```
   
   Nit, we have a couple.



##########
docs/apache-airflow/index.rst:
##########
@@ -15,65 +15,120 @@
     specific language governing permissions and limitations
     under the License.
 
+What is Airflow?
+=========================================
 
+`Apache Airflow <https://github.com/apache/airflow>`_ is an open-source platform for developing, scheduling,

Review Comment:
   I think adding orchestration here is probably a good idea, but I say we defer it to another PR as we want to match the wording on the homepage as well.



##########
docs/apache-airflow/index.rst:
##########
@@ -15,65 +15,120 @@
     specific language governing permissions and limitations
     under the License.
 
+What is Airflow?
+=========================================
 
+`Apache Airflow <https://github.com/apache/airflow>`_ is an open-source platform for developing, scheduling,
+and monitoring batch-oriented workflows. Airflow's extensible Python framework enables you to build workflows
+connecting with virtually any technology. A web interface helps manage the state of your workflows. Airflow is
+deployable in many ways, varying from a single process on your laptop to a distributed setup to support even
+the biggest workflows.
 
+Airflow was started in October 2014 by Maxime Beauchemin at Airbnb. It was open source from the very first
+commit and hosted under Airbnb's GitHub in June 2015. The project joined the `Apache Software Foundation
+Incubator program <https://incubator.apache.org/>`_ in March 2016 and the Foundation announced Apache Airflow
+as a Top-Level Project in January 2019. Ever since Airflow's inception, many companies and developers have
+used Airflow to manage their workflows. Airflow has grown to become one of the most-used open source workflow
+orchestration tools and has a very active developer community.
 
-.. image:: ../../airflow/www/static/pin_large.png
-    :width: 100
-
-Apache Airflow Documentation
+Workflows as code
 =========================================
+The main characteristic of Airflow workflows is that all workflows are defined in Python code. "Workflows as
+code" serves several purposes:
+
+- **Dynamic**: Airflow pipelines are configured as Python code, allowing for dynamic pipeline generation.
+- **Extensible**: The Airflow framework contains operators to connect with numerous technologies. All Airflow components are extensible to easily adjust to your environment.
+- **Flexible**: Workflow parameterization is built in, leveraging the `Jinja <https://jinja.palletsprojects.com>`_ templating engine.
+
+Take a look at the following snippet of code:
 
-Airflow is a platform to programmatically author, schedule and monitor
-workflows.
+.. code-block:: python
 
-Use Airflow to author workflows as Directed Acyclic Graphs (DAGs) of tasks.
-The Airflow scheduler executes your tasks on an array of workers while
-following the specified dependencies. Rich command line utilities make
-performing complex surgeries on DAGs a snap. The rich user interface
-makes it easy to visualize pipelines running in production,
-monitor progress, and troubleshoot issues when needed.
+    from datetime import datetime
 
-When workflows are defined as code, they become more maintainable,
-versionable, testable, and collaborative.
+    from airflow import DAG
+    from airflow.operators.bash import BashOperator
+    from airflow.operators.python import PythonOperator
 
+    # A DAG represents a workflow, a collection of tasks
+    with DAG(dag_id="demo", start_date=datetime(2022, 1, 1), schedule_interval="0 0 * * *") as dag:
 
+        # Tasks are represented as operators
+        hello = BashOperator(task_id="hello", bash_command="echo hello")
+        airflow = PythonOperator(task_id="airflow", python_callable=lambda: print("airflow"))
 
-.. image:: img/airflow.gif
+        # Set dependencies between tasks
+        hello >> airflow
 
-------------
 
-Principles
-----------
+Here you see:
 
-- **Dynamic**:  Airflow pipelines are configuration as code (Python), allowing for dynamic pipeline generation. This allows for writing code that instantiates pipelines dynamically.
-- **Extensible**:  Easily define your own operators, executors and extend the library so that it fits the level of abstraction that suits your environment.
-- **Elegant**:  Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of Airflow using the powerful **Jinja** templating engine.
-- **Scalable**:  Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. Airflow is ready to scale to infinity.
+- A DAG named "demo", starting on Jan 1st 2022 and running once a day. A DAG is Airflow's representation of a workflow.
+- Two tasks, a BashOperator running a Bash command and a PythonOperator running a Python function
+- ``>>`` between the tasks defines a dependency and controls the order in which the tasks are executed
 
+Airflow evaluates this script and executes the tasks at the set interval and in the defined order. The status
+of the "demo" DAG is visible in the web interface:
 
-Beyond the Horizon
-------------------
+.. image:: /img/hello_world_graph_view.png
+  :alt: Demo DAG in the Graph View, showing the status of one DAG run
 
-Airflow **is not** a data streaming solution. Tasks do not move data from
-one to the other (though tasks can exchange metadata!). Airflow is not
-in the `Spark Streaming <http://spark.apache.org/streaming/>`_
-or `Storm <https://storm.apache.org/>`_ space, it is more comparable to
-`Oozie <http://oozie.apache.org/>`_ or
-`Azkaban <https://azkaban.github.io/>`_.
+This example demonstrates a simple Bash command and Python function, but these tasks can run arbitrary code. Think
+of running a Spark job, moving data between two buckets, or sending an email. The same structure can also be
+seen running over time:
+
+.. image:: /img/hello_world_grid_view.png
+  :alt: Demo DAG in the Grid View, showing the status of all DAG runs
+
+Each column represents one DAG run. These are two of the most used views in Airflow, but there are several
+other views that allow you to dive deep into the state of your workflows.
+
+Why Airflow?
+=========================================
+Airflow is a batch workflow orchestration platform. The Airflow framework contains operators to connect with
+many technologies and is easily extended to connect with new technologies. If your workflows have a clear
+start and end, and run at regular intervals, they can be programmed as an Airflow DAG.
+
+If you prefer coding over clicking, Airflow is the tool for you. Workflows are defined as Python code which
+means:
+
+- Workflows can be stored in version control so that you can roll back to previous versions
+- Workflows can be developed by multiple people simultaneously
+- Tests can be written to validate functionality
+- Components are extensible and you can build on a wide collection of existing components
+

Review Comment:
   I see where you are going with these, but I think we can do without.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
