Posted to commits@airflow.apache.org by po...@apache.org on 2023/07/01 21:56:10 UTC

[airflow] branch main updated: Add information for users who ask for requirements (#32262)

This is an automated email from the ASF dual-hosted git repository.

potiuk pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/airflow.git


The following commit(s) were added to refs/heads/main by this push:
     new f6db66e163 Add information for users who ask for requirements (#32262)
f6db66e163 is described below

commit f6db66e16374e504665972feba0831d4148c6d50
Author: Jarek Potiuk <ja...@potiuk.com>
AuthorDate: Sat Jul 1 23:56:03 2023 +0200

    Add information for users who ask for requirements (#32262)
    
    * Add information for users who ask for requirements
    
    This change is based on a number of discussions with users asking
    what the minimum requirements for running Airflow are.
    
    While we cannot give a precise answer, we should make users aware
    that simple answers are not possible, and that when they decide to
    install Airflow and manage it on their own, they also take on the
    responsibility to monitor and adjust the resources they need, based
    on the monitoring they have to run.
    
    * Apply suggestions from code review
    
    Co-authored-by: Pankaj Koti <pa...@gmail.com>
    
    * Update docs/apache-airflow/installation/index.rst
    
    ---------
    
    Co-authored-by: Pankaj Koti <pa...@gmail.com>
---
 .../administration-and-deployment/scheduler.rst    |  1 +
 docs/apache-airflow/installation/index.rst         | 72 +++++++++++++++++++++-
 2 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/docs/apache-airflow/administration-and-deployment/scheduler.rst b/docs/apache-airflow/administration-and-deployment/scheduler.rst
index dfc97e6120..1a2a41b136 100644
--- a/docs/apache-airflow/administration-and-deployment/scheduler.rst
+++ b/docs/apache-airflow/administration-and-deployment/scheduler.rst
@@ -154,6 +154,7 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+.. _fine-tuning-scheduler:
 
 Fine-tuning your Scheduler performance
 --------------------------------------
diff --git a/docs/apache-airflow/installation/index.rst b/docs/apache-airflow/installation/index.rst
index 1ddbf5d66b..8f37ca208d 100644
--- a/docs/apache-airflow/installation/index.rst
+++ b/docs/apache-airflow/installation/index.rst
@@ -77,6 +77,9 @@ More details: :doc:`installing-from-sources`
 * You should develop and handle the deployment for all components of Airflow.
 * You are responsible for setting up the database, creating and managing the database schema with ``airflow db`` commands,
   automated startup and recovery, maintenance, cleanup and upgrades of Airflow and the Airflow Providers.
+* You need to set up monitoring of your system, allowing you to observe resources and react to
+  problems (a minimal sketch follows this list).
+* You are expected to configure and manage appropriate resources for the installation (memory, CPU, etc.)
+  based on the monitoring of your installation and the resulting feedback loop. See the notes about
+  minimum requirements below.
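+
+As a minimal sketch of what such monitoring could involve - assuming you use Airflow's StatsD
+integration (the exact option names can differ between Airflow versions) - metrics emission can be
+enabled via configuration environment variables:
+
+.. code-block:: bash
+
+    # Emit Airflow metrics to a StatsD daemon (assumed to listen on localhost:8125)
+    export AIRFLOW__METRICS__STATSD_ON=True
+    export AIRFLOW__METRICS__STATSD_HOST=localhost
+    export AIRFLOW__METRICS__STATSD_PORT=8125
+    export AIRFLOW__METRICS__STATSD_PREFIX=airflow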
 
 **What Apache Airflow Community provides for that method**
 
@@ -123,6 +126,9 @@ More details:  :doc:`/installation/installing-from-pypi`
 * You should develop and handle the deployment for all components of Airflow.
 * You are responsible for setting up the database, creating and managing the database schema with ``airflow db`` commands
   (a sketch follows this list), automated startup and recovery, maintenance, cleanup and upgrades of Airflow and Airflow Providers.
+* You need to set up monitoring of your system, allowing you to observe resources and react to problems.
+* You are expected to configure and manage appropriate resources for the installation (memory, CPU, etc.)
+  based on the monitoring of your installation and the resulting feedback loop.
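+
+As a hedged sketch of the database schema management mentioned above (availability of individual
+subcommands varies between Airflow versions), the ``airflow db`` commands are typically used like this:
+
+.. code-block:: bash
+
+    airflow db check    # verify that the database connection works
+    airflow db init     # create the database schema on first installation
+    airflow db upgrade  # apply schema migrations after upgrading Airflow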
 
 **What Apache Airflow Community provides for that method**
 
@@ -181,6 +187,9 @@ and official constraint files- same that are used for installing Airflow from Py
   deployments of containers. You can use your own custom mechanism, custom Kubernetes deployments,
   custom Docker Compose, custom Helm charts etc., and you should choose it based on your experience
   and expectations.
+* You need to set up monitoring of your system, allowing you to observe resources and react to problems.
+* You are expected to configure and manage appropriate resources for the installation (memory, CPU, etc.)
+  based on the monitoring of your installation and the resulting feedback loop (a sketch follows this list).
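+
+For illustration only (the image tag and the limits below are placeholders to be adjusted based on
+what your monitoring shows), container runtimes let you cap resources explicitly:
+
+.. code-block:: bash
+
+    # Run Airflow standalone with an explicit memory and CPU cap (placeholder values)
+    docker run --memory=4g --cpus=2 apache/airflow:2.6.2 standalone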
 
 **What Apache Airflow Community provides for that method**
 
@@ -238,6 +247,9 @@ More details: :doc:`helm-chart:index`
  those changes when released by upgrading the base image. However, you are responsible for creating a
  pipeline for building your own custom images with your own added dependencies and Providers, and need to
  repeat the customization step and rebuild your own image when a new version of the Airflow image is released.
+* You need to set up monitoring of your system, allowing you to observe resources and react to problems.
+* You are expected to configure and manage appropriate resources for the installation (memory, CPU, etc.)
+  based on the monitoring of your installation and the resulting feedback loop (a sketch follows this list).
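+
+As a sketch (the value paths below assume the official Airflow Helm chart - check the chart's
+``values.yaml`` for the authoritative names), per-component resources can be adjusted and re-applied
+as your monitoring dictates:
+
+.. code-block:: bash
+
+    # Placeholder values - adjust based on observed usage
+    helm upgrade airflow apache-airflow/airflow \
+        --set scheduler.resources.requests.cpu=1000m \
+        --set scheduler.resources.requests.memory=2Gi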
 
 **What Apache Airflow Community provides for that method**
 
@@ -256,7 +268,6 @@ More details: :doc:`helm-chart:index`
 * If you can provide a description of a reproducible problem with Airflow software, you can open an
   issue at `GitHub issues <https://github.com/apache/airflow/issues>`__
 
-
 Using Managed Airflow Services
 ''''''''''''''''''''''''''''''
 
@@ -316,3 +327,62 @@ Follow the  `Ecosystem <https://airflow.apache.org/ecosystem/>`__ page to find a
 **Where to ask for help**
 
 * Depends on what the 3rd-party provides. Look at the documentation of the 3rd-party deployment you use.
+
+
+Notes about minimum requirements
+''''''''''''''''''''''''''''''''
+
+Questions about the minimum requirements for running Airflow in production come up often, but it is
+not possible to give a simple answer.
+
+The requirements for running Airflow depend on many factors, including (but not limited to):
+  * The deployment mechanism your Airflow is installed with (see the ways of installing Airflow above)
+  * The requirements of the deployment environment (for example Kubernetes, Docker, Helm, etc.) that
+    are completely independent from Airflow (for example DNS resources, or sharing nodes/resources
+    with more or fewer pods and containers), which might depend on the particular choice of
+    technology, cloud, or monitoring integration
+  * Technical details of the database, hardware, network, etc. that your deployment runs on
+  * The complexity of the code you add to your DAGs, configuration, plugins, settings, etc. (note that
+    Airflow runs the code that DAG authors and Deployment Managers provide)
+  * The number and choice of providers you install and use (Airflow has more than 80 providers); they
+    are installed at the discretion of the Deployment Manager, and using them might require more resources
+  * The choice of parameters you use when tuning Airflow. Airflow has many configuration parameters
+    that can be fine-tuned to your needs (a hedged example follows this list)
+  * The number of DagRuns and task instances you run, including how many instances of each run in parallel
+  * How complex the tasks you run are
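+
+As a hedged illustration of the tuning point above (the option names exist in Airflow 2.x, but the
+values shown are placeholders - there is no universal "right" setting), such tuning typically happens
+through configuration options or their environment variable equivalents:
+
+.. code-block:: bash
+
+    export AIRFLOW__CORE__PARALLELISM=32               # max task instances running at once
+    export AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG=16  # per-DAG cap on concurrently running tasks
+    export AIRFLOW__SCHEDULER__PARSING_PROCESSES=2     # parallelism of DAG file parsing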
+
+The above "DAG" characteristics will change over time, and may even change depending on the time of
+day or the day of the week, so you have to be prepared to continuously monitor the system and adjust
+the parameters to make it work smoothly.
+
+While we can provide specific minimum requirements for a development "quick start" - as we do in
+our :ref:`running-airflow-in-docker` quick-start guide - it is not possible to provide any minimum
+requirements for production systems.
+
+The best way to think of resource allocation for an Airflow instance is in terms of process
+control theory, where there are two types of systems:
+
+1. Fully predictable, with few knobs and variables, where you can reliably set the values for the
+   knobs and have an easy way to determine the behaviour of the system
+
+2. Complex systems with multiple variables, which are hard to predict and where you need to monitor
+   the system and adjust the knobs continuously to make sure it is running smoothly.
+
+Airflow (and generally any modern system, usually running on cloud services, with multiple layers
+responsible for resources as well as multiple parameters to control their behaviour) is a complex
+system and falls much more into the second category. If you decide to run Airflow in production on
+your own, you should be prepared for the monitor/observe/adjust feedback loop needed to keep the
+system running smoothly.
+
+Having a good monitoring system that allows you to observe the system and adjust its parameters
+is a must to put that feedback loop into practice.
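+
+One minimal building block of such monitoring - assuming the ``airflow jobs check`` CLI available
+in Airflow 2 - is a periodic liveness probe of the scheduler:
+
+.. code-block:: bash
+
+    # Exits non-zero when no recent scheduler heartbeat is found; wire this into
+    # cron, a Kubernetes livenessProbe, or the alerting system of your choice.
+    airflow jobs check --job-type SchedulerJob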
+
+There are also a few guidelines that you can use to optimize your resource usage. The
+:ref:`fine-tuning-scheduler` section is a good starting point for fine-tuning your scheduler, and you can
+also follow the :ref:`best_practice` guide to make sure you are using Airflow in the most efficient way.
+
+Also, one of the important things that Managed Services for Airflow provide is that they make a lot
+of opinionated choices and fine-tune the system for you, so you don't have to worry about it too much.
+With such managed services there are usually far fewer knobs to turn and choices to make. One of the
+things you pay for is that the Managed Service provider manages the system for you, provides paid
+support, and allows you to scale the system as needed and allocate the right resources, following the
+choices made there for the kind of deployment you have.