You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/09/19 11:01:15 UTC

[GitHub] [airflow] potiuk opened a new pull request #18356: Explain scheduler fine-tuning better

potiuk opened a new pull request #18356:
URL: https://github.com/apache/airflow/pull/18356


   A lot of users have an expectations that Airflow Scheduler will
   `just work` and deliver the `optimal performance` for them without
   realising that in case of such comples systems as Airflow is you
   often have to decide what you should optimise for or accept some
   trade-offs or increase hardware capacity if you are not willing to
   make those trade-offs.
   
   Also it's not clear where the responsibility
   is - should it `just work` or should the user be responsible for
   understanding and fine tuning their system (both approaches are
   possible, there are some complex systmes which utilise a lot of
   automation/AI etc. to fine tune and optmise their behaviour but
   Airflow expects from the users to know a bit more on how the
   scheduling works and Airflow maintainers deliver a lot of
   knobs that can be turned to fine tune the system and to make
   trade-off decisions. This was not explicitely stated in our
   documentation and users could have different expectations about
   it (and they often had judging from issues they raised).
   
   This PR adds a "fine-tuning" chapter that aims to set the
   expectations of the users at the right level - it explains what
   Airflow provides, but also what is the user's responsibility - to
   decide what they are optimising, to see where their bottlenecks
   are and to decide if they need to change the configuration or
   increase hardware capacity (or make appropriate trade-offs).
   
   It also brings more of the fine-tuning parameters to the
   `tuneables` section of scheduler, based on some of the recent
   questions asked by the users - seems that having a specific
   overview of all performance-impacting parameters is a good idea,
   and we only had a very limited subset of those.
   
   Some user prefer `watch` rather than read that's why this PR
   also adds the link to the recording of talk from the
   Airlfow Summit 2021 where Ash describes - in a very concise
   and easy to grasp way - all the whys and hows of the scheduler.
   If you understand why and how the scheduler does what it does,
   fine-tuning decisions are much easier.
   
   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/main/UPDATING.md).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] edwardwang888 commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

edwardwang888 commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r711869019



##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler performs two

Review comment:
       ```suggestion
   fine-tune Scheduler behaviour. First of all you need to remember that Scheduler performs two
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#issuecomment-924071704


   I think all updated @ashb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] eladkal commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

eladkal commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712259107



##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,18 +141,172 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to include a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuous reading DAGs)
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See :doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new DAG runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking
+
+In order to perform fine-tuning, it's good to understand how Scheduler works under-the-hood.
+You can take a look at the ``Airflow Summit 2021``
+`Deep Dive into the Airflow Scheduler talk <https://youtu.be/DYC4-xElccE>`_ to perform the fine-tuning.
+
+How to approach Scheduler's fine-tuning
+"""""""""""""""""""""""""""""""""""""""
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but it's a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are ok with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, whereas some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* its extremely important to monitor your system with the right set of tools that you usually use to
+  monitor your system. This document does not go into details of particular metrics and tools that you
+  can use, it just describes what kind of resources you should monitor, but you should follow your best
+  practices for monitoring to grab the right data.
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are the usual limiting factors
+* based on your expectations and observations - decide what is your next improvement and go back to
+  the observation of your performance, bottlenecks. Performance improvement is an iterative process.
+
+What resources might limit Scheduler's performance
+""""""""""""""""""""""""""""""""""""""""""""""""""
+
+There are several areas of resource usage that you should pay attention to:
+
+* FileSystem performance. Airflow Scheduler relies heavily on parsing (sometimes a lot) of Python
+  files, which are often located on a shared filesystem. Airflow Scheduler continuously reads and
+  re-parses those files. The same files have to be made available to workers, so often they are
+  stored in a distributed filesystem. You can use various filesystems for that purpose (NFS, CIFS, EFS,
+  GCS fuse, Azure File System are good examples). There are various parameters you can control for those
+  filesystems and fine-tune their performance, but this is beyond the scope of this document. You should
+  observe statistics and usage of your filesystem to determine if problems come from the filesystem
+  performance. For example there are anecdotal evidences that increasing IOPS (and paying more) for the
+  EFS performance, dramatically improves stability and speed of parsing Airflow DAGs when EFS is used.
+* Another solution to FileSystem performance, if it becomes your bottleneck, is to turn to alternative
+  mechanisms of distributing your DAGs. Embedding DAGs in your image and GitSync distribution have both
+  the property that the files are available locally for Scheduler and it does not have to use a
+  distributed filesystem to read the files, the files are available locally for the Scheduler and it is
+  usually as fast as it can be, especially if your machines use fast SSD disks for local storage. Those
+  distribution mechanisms have other characteristics that might make them not the best choice for you,
+  but if your problems with performance come from distributed filesystem performance, they might be the
+  best approach to follow.
+* Database connections and Database usage might become a problem as you want to increase performance and
+  process more things in parallel. Airflow is known from being "database-connection hungry" - the more DAGs
+  you have and the more you want to process in parallel, the more database connections will be opened.
+  This is generally not a problem for MySQL as its model of handling connections is thread-based, but this
+  might be a problem for Postgres, where connection handling is process-based. It is a general consensus
+  that if you have even medium size Postgres-based Airflow installation, the best solution is to use
+  `PGBouncer <https://www.pgbouncer.org/>`_ as a proxy to your database. The :doc:`helm-chart:index`
+  supports PGBouncer out-of-the-box. For MsSQL we have not yet worked out the best practices as support
+  for MsSQL is still experimental.
+* CPU usage is most important for FileProcessors - those are the processes that parse and execute
+  Python DAG files. Since Schedulers triggers such parsing continuously, when you have a lot complex DAGs,

Review comment:
       What is a complex DAG? do we mean just high number of tasks?

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,18 +141,172 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to include a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuous reading DAGs)
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See :doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new DAG runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking
+
+In order to perform fine-tuning, it's good to understand how Scheduler works under-the-hood.
+You can take a look at the ``Airflow Summit 2021``
+`Deep Dive into the Airflow Scheduler talk <https://youtu.be/DYC4-xElccE>`_ to perform the fine-tuning.
+
+How to approach Scheduler's fine-tuning
+"""""""""""""""""""""""""""""""""""""""
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but it's a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are ok with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, whereas some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* its extremely important to monitor your system with the right set of tools that you usually use to
+  monitor your system. This document does not go into details of particular metrics and tools that you
+  can use, it just describes what kind of resources you should monitor, but you should follow your best
+  practices for monitoring to grab the right data.
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are the usual limiting factors
+* based on your expectations and observations - decide what is your next improvement and go back to
+  the observation of your performance, bottlenecks. Performance improvement is an iterative process.
+
+What resources might limit Scheduler's performance
+""""""""""""""""""""""""""""""""""""""""""""""""""
+
+There are several areas of resource usage that you should pay attention to:
+
+* FileSystem performance. Airflow Scheduler relies heavily on parsing (sometimes a lot) of Python
+  files, which are often located on a shared filesystem. Airflow Scheduler continuously reads and
+  re-parses those files. The same files have to be made available to workers, so often they are
+  stored in a distributed filesystem. You can use various filesystems for that purpose (NFS, CIFS, EFS,
+  GCS fuse, Azure File System are good examples). There are various parameters you can control for those
+  filesystems and fine-tune their performance, but this is beyond the scope of this document. You should
+  observe statistics and usage of your filesystem to determine if problems come from the filesystem
+  performance. For example there are anecdotal evidences that increasing IOPS (and paying more) for the
+  EFS performance, dramatically improves stability and speed of parsing Airflow DAGs when EFS is used.
+* Another solution to FileSystem performance, if it becomes your bottleneck, is to turn to alternative
+  mechanisms of distributing your DAGs. Embedding DAGs in your image and GitSync distribution have both
+  the property that the files are available locally for Scheduler and it does not have to use a
+  distributed filesystem to read the files, the files are available locally for the Scheduler and it is
+  usually as fast as it can be, especially if your machines use fast SSD disks for local storage. Those
+  distribution mechanisms have other characteristics that might make them not the best choice for you,
+  but if your problems with performance come from distributed filesystem performance, they might be the
+  best approach to follow.
+* Database connections and Database usage might become a problem as you want to increase performance and
+  process more things in parallel. Airflow is known from being "database-connection hungry" - the more DAGs
+  you have and the more you want to process in parallel, the more database connections will be opened.
+  This is generally not a problem for MySQL as its model of handling connections is thread-based, but this
+  might be a problem for Postgres, where connection handling is process-based. It is a general consensus
+  that if you have even medium size Postgres-based Airflow installation, the best solution is to use
+  `PGBouncer <https://www.pgbouncer.org/>`_ as a proxy to your database. The :doc:`helm-chart:index`
+  supports PGBouncer out-of-the-box. For MsSQL we have not yet worked out the best practices as support
+  for MsSQL is still experimental.

Review comment:
       is MsSQL experimental?
   I thought that starting 2.2 we consider it to be stable and fully supported

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,18 +141,172 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to include a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuous reading DAGs)
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See :doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new DAG runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking
+
+In order to perform fine-tuning, it's good to understand how Scheduler works under-the-hood.
+You can take a look at the ``Airflow Summit 2021``
+`Deep Dive into the Airflow Scheduler talk <https://youtu.be/DYC4-xElccE>`_ to perform the fine-tuning.
+
+How to approach Scheduler's fine-tuning
+"""""""""""""""""""""""""""""""""""""""
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but it's a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are ok with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, whereas some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* its extremely important to monitor your system with the right set of tools that you usually use to
+  monitor your system. This document does not go into details of particular metrics and tools that you
+  can use, it just describes what kind of resources you should monitor, but you should follow your best
+  practices for monitoring to grab the right data.
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are the usual limiting factors
+* based on your expectations and observations - decide what is your next improvement and go back to
+  the observation of your performance, bottlenecks. Performance improvement is an iterative process.
+
+What resources might limit Scheduler's performance
+""""""""""""""""""""""""""""""""""""""""""""""""""
+
+There are several areas of resource usage that you should pay attention to:
+
+* FileSystem performance. Airflow Scheduler relies heavily on parsing (sometimes a lot) of Python
+  files, which are often located on a shared filesystem. Airflow Scheduler continuously reads and
+  re-parses those files. The same files have to be made available to workers, so often they are
+  stored in a distributed filesystem. You can use various filesystems for that purpose (NFS, CIFS, EFS,
+  GCS fuse, Azure File System are good examples). There are various parameters you can control for those
+  filesystems and fine-tune their performance, but this is beyond the scope of this document. You should
+  observe statistics and usage of your filesystem to determine if problems come from the filesystem
+  performance. For example there are anecdotal evidences that increasing IOPS (and paying more) for the
+  EFS performance, dramatically improves stability and speed of parsing Airflow DAGs when EFS is used.
+* Another solution to FileSystem performance, if it becomes your bottleneck, is to turn to alternative
+  mechanisms of distributing your DAGs. Embedding DAGs in your image and GitSync distribution have both
+  the property that the files are available locally for Scheduler and it does not have to use a
+  distributed filesystem to read the files, the files are available locally for the Scheduler and it is
+  usually as fast as it can be, especially if your machines use fast SSD disks for local storage. Those
+  distribution mechanisms have other characteristics that might make them not the best choice for you,
+  but if your problems with performance come from distributed filesystem performance, they might be the
+  best approach to follow.
+* Database connections and Database usage might become a problem as you want to increase performance and
+  process more things in parallel. Airflow is known from being "database-connection hungry" - the more DAGs
+  you have and the more you want to process in parallel, the more database connections will be opened.
+  This is generally not a problem for MySQL as its model of handling connections is thread-based, but this
+  might be a problem for Postgres, where connection handling is process-based. It is a general consensus
+  that if you have even medium size Postgres-based Airflow installation, the best solution is to use
+  `PGBouncer <https://www.pgbouncer.org/>`_ as a proxy to your database. The :doc:`helm-chart:index`
+  supports PGBouncer out-of-the-box. For MsSQL we have not yet worked out the best practices as support
+  for MsSQL is still experimental.
+* CPU usage is most important for FileProcessors - those are the processes that parse and execute
+  Python DAG files. Since Schedulers triggers such parsing continuously, when you have a lot complex DAGs,
+  the processing might take a lot of CPU. You can mitigate it by decreasing the
+  :ref:`config:scheduler__min_file_process_interval`, but this is one of the mentioned trade-offs,
+  result of this is that changes to such files will be picked up slower and you will see delays between
+  submitting the files and getting them available in Airflow UI and executed by Scheduler. Optimizing
+  the way how your DAGs are built, avoiding external data sources is your best approach to improve CPU
+  usage. If you have more CPUs available, you can increase number of processing threads
+  :ref:`config:scheduler__parsing_processes`, Also Airflow Scheduler scales almost linearly with
+  several instances, so you can also add more Schedulers if your Scheduler's performance is CPU-bound.
+* Airflow might use quite significant amount of memory when you try to get more performance out of it.
+  Often more performance is achieved in Airflow by increasing number of processes handling the load,
+  and each process requires whole interpreter of Python loaded, a lot of classes imported, temporary
+  in-memory storage. This can lead to memory pressure. You need to observe if your system is not using
+  more memory than it has - which results with using swap disk, which dramatically decreases performance.
+  Note that Airflow Scheduler in versions prior to ``2.1.4`` generated a lot of ``Page Cache`` memory
+  used by log files (when the log files were not removed). This was generally harmless, as the memory
+  is just cache and could be reclaimed at any time by the system, however in version ``2.1.4`` and
+  beyond, writing logs will not generate excessive ``Page Cache`` memory. Regardless - make sure when you look
+  at memory usage, pay attention to the kind of memory you are observing. Usually you should look at
+  ``working memory``(names might vary depending on your deployment) rather than ``total memory used``.
+
+What can you do, to improve Scheduler's performance
+"""""""""""""""""""""""""""""""""""""""""""""""""""
+
+When you know what your resource usage is, the improvements that you can consider might be:
+
+* improve the logic, efficiency of parsing and reduce complexity of your DAG Python code. It is parsed
+  continuously so optimizing that code might bring tremendous improvements, especially if you try
+  to reach out to some external databases etc. while parsing DAGs (this should be avoided at all cost).
+  The :doc:`/best-practices` document shows a few examples on how you can approach dynamic DAG parsing
+  without reaching out to external sources.
+* improve utilization of your resources. This is when you have a free capacity in your system that
+  seems underutilized (again CPU, memory I/O, networking are the prime candidates) - you can take
+  actions like increasing number of schedulers, parsing processes or decreasing intervals for more
+  frequent actions might bring improvements in performance at the expense of higher utilization of those.

Review comment:
       Do we offer some way of identifying these?
   For example I use a test to check how much time it takes to load the dag and from that I deduce if an expensive call was made in operator init. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712926377



##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -180,10 +338,43 @@ The following config settings can be used to control aspects of the Scheduler HA
   SchedulerJobs.
 
   This setting controls how a dead scheduler will be noticed and the tasks it
-  was "supervising" get picked up by another scheduler. (The tasks will stay
-  running, so there is no harm in not detecting this for a while.)
+  was "supervising" get picked up by another scheduler. The tasks will stay
+  running, so there is no harm in not detecting this for a while.
 
   When a SchedulerJob is detected as "dead" (as determined by
   :ref:`config:scheduler__scheduler_health_check_threshold`) any running or
   queued tasks that were launched by the dead process will be "adopted" and
   monitored by this scheduler instead.
+
+- :ref:`config:scheduler__dag_dir_list_interval`
+  How often (in seconds) to scan the DAGs directory for new files.
+
+- :ref:`config:scheduler__file_parsing_sort_mode`
+  The scheduler will list and sort the DAG files to decide the parsing order.
+
+- :ref:`config:scheduler__max_tis_per_query`
+  The batch size of queries in the scheduling main loop. If this is too high, SQL query
+  performance may be impacted by one or more of the following:
+
+  - reversion to full table scan - complexity of query predicate

Review comment:
       I took it from the original description - so I will remove it there too.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712915300



##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -241,11 +243,51 @@ each parameter by following the links):
 * :ref:`config:scheduler__parsing_processes`
 * :ref:`config:scheduler__file_parsing_sort_mode`
 
+.. _best_practices/reducing_dag_complexity:
+
+Reducing DAG complexity
+^^^^^^^^^^^^^^^^^^^^^^^
+
+While Airflow is good in handling a lot of DAGs with a lot of task and dependencies between them, when you
+have many complex DAGs, their complexity might impact performance of scheduling. One of the ways to keep
+your Airflow instance performant and well utilized, you should strive to simplify and optimize your DAGs
+whenever possible - you have to remember that DAG parsing process and creation is just executing
+Python code and it's up to you to make it as performant as possible. There are no magic recipes for making
+your DAG "less complex" - since this is a Python code, it's the DAG writer who controls the complexity of
+their code.
+
+There are no "metrics" for DAG complexity, especially, there are no metrics that can tell you
+whether your DAG is "simple enough". However - as with any Python code you can definitely tell that
+your code is "simpler" or "faster" when you optimize it, the same can be said about DAG code. If you
+want to optimize your DAGs there are the following actions you can take:
+
+* Make your DAG load faster. This is a single improvement advice that might be implemented in various ways
+  but this is the one that has biggest impact on scheduler's performance. Whenever you have a chance to make
+  your DAG load faster - go for it, if your goal is to improve performance. See below
+  :ref:`best_practices/dag_loader_test` on how to asses your DAG loading time.

Review comment:
       Excellent one. That's often overlooked by people but can have tremendous impact. Will add..




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712050719



##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler performs two
+operations:
+
+* continuously parses DAG files and updates their starts in ``Serialized DAG`` form in the database
+* continuously finds and schedules for execution the next tasks to run and sends those tasks for
+  execution to the executor you have configured
+
+Those two tasks are executed in parallel by scheduler, they are fairly independent from each other and
+they are run using different processes. You can fine tune the behaviour of both components, however
+in order to fine-tune your scheduler, you need to included a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGS
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See:doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new dag runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking
+
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but it's a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are ok with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, where some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are the usual limiting factors
+* based on your expectations and observations - decide what is your next improvement and go back to
+  the observation of your performance, bottlenecks. Performance improvement is an iterative process
+
+The improvements that you can consider are:
+
+* improve utilization of your resources. This is when you have a free capacity in your system that
+  seems underutilized (again CPU, memory I/O, networking are the prime candidates) - you can take
+  actions like increasing number of schedulers, parsing processes or decreasing intervals for more
+  frequent actions might bring improvements in performance at the expense of higher utilization of those.
+* increase hardware capacity (for example if you see that CPU is limiting you or tha I/O you use for
+  DAG filesystem is at its limits). Often the problem with scheduler performance is
+  simply because your system is not "capable" enough and this might be the only way. For example if
+  you see that you are using all CPU you have on machine, you might want to add another scheduler on
+  a new machine - in most cases, when you add 2nd or 3rd scheduler, the capacity of scheduling grows
+  linearly (unless the shared database or filesystem is a bottleneck).
+* experiment with different values for the "scheduler tunables". Often you might get better effects by
+  simply exchanging one performance aspect for another. For example if you want to decrease the
+  cpu usage, you might increase file processing interval (but the result will be that new DAGs will
+  appear with bigger delay). Usually performance tuning is the art of balancing different aspects.
+* sometimes you change scheduler behaviour slightly (for example change parsing sort order)
+  in order to get better fine-tuned results for your particular deployment.
+
+In order to perform fine-tuning, it's good to understand how Scheduler works under-the-hood.
+You can take a look at the ``Airflow Summit 2021``
+`Deep Dive into the Airflow Scheduler talk <https://youtu.be/DYC4-xElccE>`_ to perform the fine-tuning.
+
+Here are the most important tunables you can use to impact various performance aspects of the Scheduler:
+
 .. _scheduler:ha:tunables:
 
-Scheduler Tuneables
-"""""""""""""""""""
+Scheduler Tunables

Review comment:
       Good for me . The "tunables" word sounds nice but is a bit strange indeed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] edwardwang888 commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

edwardwang888 commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r711869019



##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler performs two

Review comment:
       ```suggestion
   fine-tune Scheduler behaviour. First of all you need to remember that Scheduler performs two
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] edwardwang888 commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

edwardwang888 commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r711870134



##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler performs two
+operations:
+
+* continuously parses DAG files and updates their starts in ``Serialized DAG`` form in the database
+* continuously finds and schedules for execution the next tasks to run and sends those tasks for
+  execution to the executor you have configured
+
+Those two tasks are executed in parallel by scheduler, they are fairly independent from each other and
+they are run using different processes. You can fine tune the behaviour of both components, however
+in order to fine-tune your scheduler, you need to included a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGS
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See:doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new dag runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking
+
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but it's a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are ok with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, where some other users

Review comment:
       ```suggestion
   30 seconds delays of new DAG parsing, at the expense of lower CPU usage, whereas some other users
   ```

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler performs two
+operations:
+
+* continuously parses DAG files and updates their starts in ``Serialized DAG`` form in the database
+* continuously finds and schedules for execution the next tasks to run and sends those tasks for
+  execution to the executor you have configured
+
+Those two tasks are executed in parallel by scheduler, they are fairly independent from each other and
+they are run using different processes. You can fine tune the behaviour of both components, however
+in order to fine-tune your scheduler, you need to included a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGS

Review comment:
       ```suggestion
       * what kind of filesystem you have to share the DAGs
   ```

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler performs two
+operations:
+
+* continuously parses DAG files and updates their starts in ``Serialized DAG`` form in the database
+* continuously finds and schedules for execution the next tasks to run and sends those tasks for
+  execution to the executor you have configured
+
+Those two tasks are executed in parallel by scheduler, they are fairly independent from each other and
+they are run using different processes. You can fine tune the behaviour of both components, however
+in order to fine-tune your scheduler, you need to included a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGS
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See:doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new dag runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking
+
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but it's a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are ok with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, where some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are the usual limiting factors
+* based on your expectations and observations - decide what is your next improvement and go back to
+  the observation of your performance, bottlenecks. Performance improvement is an iterative process
+
+The improvements that you can consider are:
+
+* improve utilization of your resources. This is when you have a free capacity in your system that
+  seems underutilized (again CPU, memory I/O, networking are the prime candidates) - you can take
+  actions like increasing number of schedulers, parsing processes or decreasing intervals for more
+  frequent actions might bring improvements in performance at the expense of higher utilization of those.
+* increase hardware capacity (for example if you see that CPU is limiting you or tha I/O you use for
+  DAG filesystem is at its limits). Often the problem with scheduler performance is
+  simply because your system is not "capable" enough and this might be the only way. For example if
+  you see that you are using all CPU you have on machine, you might want to add another scheduler on
+  a new machine - in most cases, when you add 2nd or 3rd scheduler, the capacity of scheduling grows
+  linearly (unless the shared database or filesystem is a bottleneck).
+* experiment with different values for the "scheduler tunables". Often you might get better effects by
+  simply exchanging one performance aspect for another. For example if you want to decrease the
+  cpu usage, you might increase file processing interval (but the result will be that new DAGs will

Review comment:
       ```suggestion
     CPU usage, you might increase file processing interval (but the result will be that new DAGs will
   ```

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -187,3 +279,36 @@ The following config settings can be used to control aspects of the Scheduler HA
   :ref:`config:scheduler__scheduler_health_check_threshold`) any running or
   queued tasks that were launched by the dead process will be "adopted" and
   monitored by this scheduler instead.
+
+- :ref:`config:scheduler__dag_dir_list_interval`
+  How often (in seconds) to scan the DAGs directory for new files.
+
+- :ref:`config:scheduler__file_parsing_sort_mode`
+  The scheduler will list and sort the dag files to decide the parsing order.
+
+- :ref:`config:scheduler__max_tis_per_query`
+  The batch size of queries in the scheduling main loop. If this is too high, SQL query
+  performance may be impacted by one or more of the following:
+
+  - reversion to full table scan - complexity of query predicate
+  - excessive locking
+
+  Additionally, you may hit the maximum allowable query length for your db.
+  Set this to 0 for no limit (not advised)
+
+- :ref:`config:scheduler__min_file_process_interval`
+  Number of seconds after which a DAG file is parsed. The DAG file is parsed every
+  min_file_process_interval number of seconds. Updates to DAGs are reflected after
+  this interval. Keeping this number low will increase CPU usage.
+
+- :ref:`config:scheduler__parsing_processes`
+  The scheduler can run multiple processes in parallel to parse dags. This defines

Review comment:
       ```suggestion
     The scheduler can run multiple processes in parallel to parse DAGs. This defines
   ```

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler performs two

Review comment:
       ```suggestion
   fine-tune Scheduler behaviour. First of all you need to remember that Scheduler performs two
   ```

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler performs two
+operations:
+
+* continuously parses DAG files and updates their starts in ``Serialized DAG`` form in the database
+* continuously finds and schedules for execution the next tasks to run and sends those tasks for
+  execution to the executor you have configured
+
+Those two tasks are executed in parallel by scheduler, they are fairly independent from each other and
+they are run using different processes. You can fine tune the behaviour of both components, however
+in order to fine-tune your scheduler, you need to included a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGS
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See:doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new dag runs should be created/scheduled per loop

Review comment:
       ```suggestion
      * How many new DAG runs should be created/scheduled per loop
   ```

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler performs two
+operations:
+
+* continuously parses DAG files and updates their starts in ``Serialized DAG`` form in the database
+* continuously finds and schedules for execution the next tasks to run and sends those tasks for
+  execution to the executor you have configured
+
+Those two tasks are executed in parallel by scheduler, they are fairly independent from each other and
+they are run using different processes. You can fine tune the behaviour of both components, however
+in order to fine-tune your scheduler, you need to included a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGS
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See:doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new dag runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking
+
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but it's a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are ok with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, where some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are the usual limiting factors
+* based on your expectations and observations - decide what is your next improvement and go back to
+  the observation of your performance, bottlenecks. Performance improvement is an iterative process

Review comment:
       ```suggestion
     the observation of your performance, bottlenecks. Performance improvement is an iterative process.
   ```

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler performs two
+operations:
+
+* continuously parses DAG files and updates their starts in ``Serialized DAG`` form in the database
+* continuously finds and schedules for execution the next tasks to run and sends those tasks for
+  execution to the executor you have configured
+
+Those two tasks are executed in parallel by scheduler, they are fairly independent from each other and
+they are run using different processes. You can fine tune the behaviour of both components, however
+in order to fine-tune your scheduler, you need to included a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGS
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See:doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new dag runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking
+
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but it's a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are ok with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, where some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are the usual limiting factors
+* based on your expectations and observations - decide what is your next improvement and go back to
+  the observation of your performance, bottlenecks. Performance improvement is an iterative process
+
+The improvements that you can consider are:
+
+* improve utilization of your resources. This is when you have a free capacity in your system that
+  seems underutilized (again CPU, memory I/O, networking are the prime candidates) - you can take
+  actions like increasing number of schedulers, parsing processes or decreasing intervals for more
+  frequent actions might bring improvements in performance at the expense of higher utilization of those.
+* increase hardware capacity (for example if you see that CPU is limiting you or tha I/O you use for

Review comment:
       ```suggestion
   * increase hardware capacity (for example if you see that CPU is limiting you or that I/O you use for
   ```

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -187,3 +279,36 @@ The following config settings can be used to control aspects of the Scheduler HA
   :ref:`config:scheduler__scheduler_health_check_threshold`) any running or
   queued tasks that were launched by the dead process will be "adopted" and
   monitored by this scheduler instead.
+
+- :ref:`config:scheduler__dag_dir_list_interval`
+  How often (in seconds) to scan the DAGs directory for new files.
+
+- :ref:`config:scheduler__file_parsing_sort_mode`
+  The scheduler will list and sort the dag files to decide the parsing order.

Review comment:
       ```suggestion
     The scheduler will list and sort the DAG files to decide the parsing order.
   ```

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -187,3 +279,36 @@ The following config settings can be used to control aspects of the Scheduler HA
   :ref:`config:scheduler__scheduler_health_check_threshold`) any running or
   queued tasks that were launched by the dead process will be "adopted" and
   monitored by this scheduler instead.
+
+- :ref:`config:scheduler__dag_dir_list_interval`
+  How often (in seconds) to scan the DAGs directory for new files.
+
+- :ref:`config:scheduler__file_parsing_sort_mode`
+  The scheduler will list and sort the dag files to decide the parsing order.
+
+- :ref:`config:scheduler__max_tis_per_query`
+  The batch size of queries in the scheduling main loop. If this is too high, SQL query
+  performance may be impacted by one or more of the following:
+
+  - reversion to full table scan - complexity of query predicate
+  - excessive locking
+
+  Additionally, you may hit the maximum allowable query length for your db.
+  Set this to 0 for no limit (not advised)
+
+- :ref:`config:scheduler__min_file_process_interval`
+  Number of seconds after which a DAG file is parsed. The DAG file is parsed every
+  min_file_process_interval number of seconds. Updates to DAGs are reflected after
+  this interval. Keeping this number low will increase CPU usage.
+
+- :ref:`config:scheduler__parsing_processes`
+  The scheduler can run multiple processes in parallel to parse dags. This defines
+  how many processes will run.
+
+- :ref:`config:scheduler__processor_poll_interval`
+  The number of seconds to wait between consecutive DAG file processing
+
+- :ref:`config:scheduler__schedule_after_task_execution`
+  Should the Task supervisor process perform a “mini scheduler” to attempt to schedule more tasks of
+  the same DAG. Leaving this on will mean tasks in the same DAG execute quicker,
+  but might starve out other dags in some circumstances

Review comment:
       ```suggestion
     but might starve out other DAGs in some circumstances
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712384933



##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,18 +141,172 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to include a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuous reading DAGs)
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See :doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new DAG runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking
+
+In order to perform fine-tuning, it's good to understand how Scheduler works under-the-hood.
+You can take a look at the ``Airflow Summit 2021``
+`Deep Dive into the Airflow Scheduler talk <https://youtu.be/DYC4-xElccE>`_ to perform the fine-tuning.
+
+How to approach Scheduler's fine-tuning
+"""""""""""""""""""""""""""""""""""""""
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but it's a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are ok with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, whereas some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* its extremely important to monitor your system with the right set of tools that you usually use to
+  monitor your system. This document does not go into details of particular metrics and tools that you
+  can use, it just describes what kind of resources you should monitor, but you should follow your best
+  practices for monitoring to grab the right data.
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are the usual limiting factors
+* based on your expectations and observations - decide what is your next improvement and go back to
+  the observation of your performance, bottlenecks. Performance improvement is an iterative process.
+
+What resources might limit Scheduler's performance
+""""""""""""""""""""""""""""""""""""""""""""""""""
+
+There are several areas of resource usage that you should pay attention to:
+
+* FileSystem performance. Airflow Scheduler relies heavily on parsing (sometimes a lot) of Python
+  files, which are often located on a shared filesystem. Airflow Scheduler continuously reads and
+  re-parses those files. The same files have to be made available to workers, so often they are
+  stored in a distributed filesystem. You can use various filesystems for that purpose (NFS, CIFS, EFS,
+  GCS fuse, Azure File System are good examples). There are various parameters you can control for those
+  filesystems and fine-tune their performance, but this is beyond the scope of this document. You should
+  observe statistics and usage of your filesystem to determine if problems come from the filesystem
+  performance. For example there are anecdotal evidences that increasing IOPS (and paying more) for the
+  EFS performance, dramatically improves stability and speed of parsing Airflow DAGs when EFS is used.
+* Another solution to FileSystem performance, if it becomes your bottleneck, is to turn to alternative
+  mechanisms of distributing your DAGs. Embedding DAGs in your image and GitSync distribution have both
+  the property that the files are available locally for Scheduler and it does not have to use a
+  distributed filesystem to read the files, the files are available locally for the Scheduler and it is
+  usually as fast as it can be, especially if your machines use fast SSD disks for local storage. Those
+  distribution mechanisms have other characteristics that might make them not the best choice for you,
+  but if your problems with performance come from distributed filesystem performance, they might be the
+  best approach to follow.
+* Database connections and Database usage might become a problem as you want to increase performance and
+  process more things in parallel. Airflow is known from being "database-connection hungry" - the more DAGs
+  you have and the more you want to process in parallel, the more database connections will be opened.
+  This is generally not a problem for MySQL as its model of handling connections is thread-based, but this
+  might be a problem for Postgres, where connection handling is process-based. It is a general consensus
+  that if you have even medium size Postgres-based Airflow installation, the best solution is to use
+  `PGBouncer <https://www.pgbouncer.org/>`_ as a proxy to your database. The :doc:`helm-chart:index`
+  supports PGBouncer out-of-the-box. For MsSQL we have not yet worked out the best practices as support
+  for MsSQL is still experimental.

Review comment:
       When we release 2.2 it will not be. But I plan to merge this into v2-1 branch and publish it for 2.1.4




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712925076



##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,18 +141,173 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to include a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuously reading DAGs)
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are (i.e. how fast they can be parsed, how many tasks and dependencies they have)
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See :doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new DAG runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking
+
+In order to perform fine-tuning, it's good to understand how Scheduler works under-the-hood.
+You can take a look at the ``Airflow Summit 2021``
+`Deep Dive into the Airflow Scheduler talk <https://youtu.be/DYC4-xElccE>`_ to perform the fine-tuning.
+
+How to approach Scheduler's fine-tuning
+"""""""""""""""""""""""""""""""""""""""
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but it's a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are ok with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, whereas some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* its extremely important to monitor your system with the right set of tools that you usually use to
+  monitor your system. This document does not go into details of particular metrics and tools that you
+  can use, it just describes what kind of resources you should monitor, but you should follow your best
+  practices for monitoring to grab the right data.
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are the usual limiting factors
+* based on your expectations and observations - decide what is your next improvement and go back to
+  the observation of your performance, bottlenecks. Performance improvement is an iterative process.
+
+What resources might limit Scheduler's performance
+""""""""""""""""""""""""""""""""""""""""""""""""""
+
+There are several areas of resource usage that you should pay attention to:
+
+* FileSystem performance. Airflow Scheduler relies heavily on parsing (sometimes a lot) of Python
+  files, which are often located on a shared filesystem. Airflow Scheduler continuously reads and
+  re-parses those files. The same files have to be made available to workers, so often they are
+  stored in a distributed filesystem. You can use various filesystems for that purpose (NFS, CIFS, EFS,
+  GCS fuse, Azure File System are good examples). There are various parameters you can control for those
+  filesystems and fine-tune their performance, but this is beyond the scope of this document. You should
+  observe statistics and usage of your filesystem to determine if problems come from the filesystem
+  performance. For example there are anecdotal evidences that increasing IOPS (and paying more) for the
+  EFS performance, dramatically improves stability and speed of parsing Airflow DAGs when EFS is used.
+* Another solution to FileSystem performance, if it becomes your bottleneck, is to turn to alternative
+  mechanisms of distributing your DAGs. Embedding DAGs in your image and GitSync distribution have both
+  the property that the files are available locally for Scheduler and it does not have to use a
+  distributed filesystem to read the files, the files are available locally for the Scheduler and it is
+  usually as fast as it can be, especially if your machines use fast SSD disks for local storage. Those
+  distribution mechanisms have other characteristics that might make them not the best choice for you,
+  but if your problems with performance come from distributed filesystem performance, they might be the
+  best approach to follow.
+* Database connections and Database usage might become a problem as you want to increase performance and
+  process more things in parallel. Airflow is known from being "database-connection hungry" - the more DAGs
+  you have and the more you want to process in parallel, the more database connections will be opened.
+  This is generally not a problem for MySQL as its model of handling connections is thread-based, but this
+  might be a problem for Postgres, where connection handling is process-based. It is a general consensus
+  that if you have even medium size Postgres-based Airflow installation, the best solution is to use
+  `PGBouncer <https://www.pgbouncer.org/>`_ as a proxy to your database. The :doc:`helm-chart:index`
+  supports PGBouncer out-of-the-box. For MsSQL we have not yet worked out the best practices as support
+  for MsSQL is still experimental.
+* CPU usage is most important for FileProcessors - those are the processes that parse and execute
+  Python DAG files. Since Schedulers triggers such parsing continuously, when you have a lot of DAGs,
+  the processing might take a lot of CPU. You can mitigate it by decreasing the
+  :ref:`config:scheduler__min_file_process_interval`, but this is one of the mentioned trade-offs,
+  result of this is that changes to such files will be picked up slower and you will see delays between
+  submitting the files and getting them available in Airflow UI and executed by Scheduler. Optimizing
+  the way how your DAGs are built, avoiding external data sources is your best approach to improve CPU
+  usage. If you have more CPUs available, you can increase number of processing threads
+  :ref:`config:scheduler__parsing_processes`, Also Airflow Scheduler scales almost linearly with
+  several instances, so you can also add more Schedulers if your Scheduler's performance is CPU-bound.
+* Airflow might use quite significant amount of memory when you try to get more performance out of it.
+  Often more performance is achieved in Airflow by increasing number of processes handling the load,
+  and each process requires whole interpreter of Python loaded, a lot of classes imported, temporary

Review comment:
       Agree in general for all the "shared C libraries" and "shared python libraries loaded before fork" part. I will reword it to explain it better. It's indeed all about Heap Space and Stack under the hood, but it also depends a bit on a sequence of loading - for example with the "Optimized Hook imports" on one hand you have faster startup, but on the other hand, the lirbraries that are imported after the fork are duplicated - so the truth is somewhere in-between.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#issuecomment-922806024


   > * There are a lot of points that aren't incorrect. However, as a reader, I think "why?" a lot. For example here: "what kind of filesystem you have to share the DAGs". I assume the average reader understands there are different filesystems, but doesn't know how to tune Airflow for either filesystem. However, the text doesn't explain the consequences of using e.g. NFS vs local filesystem, or what to tune in the Airflow settings. Would add some context to those statements, or leave them out.
   
   Good point. Added a separate paragraph describing more about resources (filesystem/I/O, Database, memory) 
   
   > * Would explain the implications of changing every config option. E.g. what happens if you set `scheduler__processor_poll_interval` very low and what happens if you set it very high?
   
   I think this is pretty much already explained when you go to details for each of those - in the "configuration" - in the `scheduler fine tuning` I just wanted to put a general description (and the link to the parameter explains more details).
   
   > * Would mention the importance of proper monitoring. Without data you know nothing :-) Can we relate tweaking certain config options to certain metrics?
   
   Agree we should stress it (I added this as 'the most important point'). However I would avoid explaining specific metrics. First of all  - I have no idea and no practice here (and I think most of us don't). We know some general resources, their impact and the "knobs".  Each deployment , filesystem, monitoring tool etc. has its own specific way of naming/monitoring/alerting so I think high-level "what to check?" is much better from our point of view than "how to check?" and "what parameters should have which values". 
   
   Also this is a bit dangerous to be very specific. Airflow is a complex system and has many parameters to tune and whatever we write in such document, it will be taken "literally" and people will rely on it and complain if what we describe here "does not work the exact way it is described". I think we should be very clear about setting expectations about this document. and be very firm and even "assertive" here. We will not give people the "exact" answers they are looking for when it comes to performance. They will get the "knobs", information on generally what they should pay attention to, and general impact of the "knobs".
   
   But we cannot do this FOR the users who maintain Airflow instances. It's their job to fine-tune it. Airflow is not a self-tuning system (it could be but this would be a huge effort - and one that often takes years to master for Managed Service - see for example Kubernetes auto-scaling in GCP). In order to do the job - they need to experiment with their deployment. We will never answer "What cionfiguration we have to have to achive this and that performance with those kind of DAGs we have". People WANT that answer, of course. And it is asked many times. But they will never get that answer from us.
   
    We just give them the information what they can  do to get there, but they still need to learn, understand the knobs and experiment a bit with them. And knowing which knobs to turn and in which directions if you want certain effect is the exact purpose of the document.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712922936



##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,18 +141,173 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to include a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuously reading DAGs)
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are (i.e. how fast they can be parsed, how many tasks and dependencies they have)
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See :doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new DAG runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking
+
+In order to perform fine-tuning, it's good to understand how Scheduler works under-the-hood.
+You can take a look at the ``Airflow Summit 2021``
+`Deep Dive into the Airflow Scheduler talk <https://youtu.be/DYC4-xElccE>`_ to perform the fine-tuning.
+
+How to approach Scheduler's fine-tuning
+"""""""""""""""""""""""""""""""""""""""
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but it's a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are ok with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, whereas some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* its extremely important to monitor your system with the right set of tools that you usually use to
+  monitor your system. This document does not go into details of particular metrics and tools that you
+  can use, it just describes what kind of resources you should monitor, but you should follow your best
+  practices for monitoring to grab the right data.
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are the usual limiting factors
+* based on your expectations and observations - decide what is your next improvement and go back to
+  the observation of your performance, bottlenecks. Performance improvement is an iterative process.
+
+What resources might limit Scheduler's performance
+""""""""""""""""""""""""""""""""""""""""""""""""""
+
+There are several areas of resource usage that you should pay attention to:
+
+* FileSystem performance. Airflow Scheduler relies heavily on parsing (sometimes a lot) of Python

Review comment:
       I think It does make a lot of difference (at least in perception of performance). There are really (mostly anecdotal but I saw it several time) evidences of how  - for example buing extra EFS IOPS improved both perceived performance and stability of scheduler. I've seen many people who believe that by employing a distributed filesystem, they "magically" got  instantly distributed files, which - especially with cloud Filesystem like EFS is not at all true. Also when working with Composer - they used GCS fuse under the hood and it caused a looot of troubles and stability/perceived performance issues before it was properly optimized and fine-tuned - and even then there were cases which make it work terribly (for example changing one line in a 100s of huge DAG files will require to effectively download all of those files from scratch and it might take minutes, causing all kind of problems where the files are in inconsistent state. 
   
   I think we need to make people aware that choosing the right Filesystem matters and that it has huge impact on perceived  performance in a number of cases.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#issuecomment-923005461


   I also added some more structure to the whole chapter.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#issuecomment-922830675


   I will update it afternoon


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] github-actions[bot] commented on pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#issuecomment-923265216


   The PR is likely ready to be merged. No tests are needed as no important environment files, nor python files were modified by it. However, committers might decide that full test matrix is needed and add the 'full tests needed' label. Then you should rebase it to the latest main or amend the last commit of the PR, and push it with --force-with-lease.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712914945



##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -241,11 +243,51 @@ each parameter by following the links):
 * :ref:`config:scheduler__parsing_processes`
 * :ref:`config:scheduler__file_parsing_sort_mode`
 
+.. _best_practices/reducing_dag_complexity:
+
+Reducing DAG complexity
+^^^^^^^^^^^^^^^^^^^^^^^
+
+While Airflow is good in handling a lot of DAGs with a lot of task and dependencies between them, when you
+have many complex DAGs, their complexity might impact performance of scheduling. One of the ways to keep
+your Airflow instance performant and well utilized, you should strive to simplify and optimize your DAGs
+whenever possible - you have to remember that DAG parsing process and creation is just executing
+Python code and it's up to you to make it as performant as possible. There are no magic recipes for making
+your DAG "less complex" - since this is a Python code, it's the DAG writer who controls the complexity of
+their code.
+
+There are no "metrics" for DAG complexity, especially, there are no metrics that can tell you
+whether your DAG is "simple enough". However - as with any Python code you can definitely tell that
+your code is "simpler" or "faster" when you optimize it, the same can be said about DAG code. If you
+want to optimize your DAGs there are the following actions you can take:
+
+* Make your DAG load faster. This is a single improvement advice that might be implemented in various ways
+  but this is the one that has biggest impact on scheduler's performance. Whenever you have a chance to make
+  your DAG load faster - go for it, if your goal is to improve performance. See below
+  :ref:`best_practices/dag_loader_test` on how to asses your DAG loading time.
+
+* Make your DAG generate fewer tasks. Every task adds additional processing overhead for scheduling and
+  execution. If you can decrease the number of tasks that your DAG use, this will likely improve overall
+  scheduling and performance (however be aware that Airflow's flexibility comes from splitting the

Review comment:
       Good point. I will change it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712917928



##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,18 +141,173 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to include a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuously reading DAGs)
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are (i.e. how fast they can be parsed, how many tasks and dependencies they have)
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See :doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new DAG runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking

Review comment:
       Yeah. I think "no lock" is a bad idea (of mine) and we should remove it. It might have some undesireable effects 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#issuecomment-923000889


   @BasPH @edwardwang888 -> all things addressed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712917284



##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,18 +141,173 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to include a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuously reading DAGs)
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are (i.e. how fast they can be parsed, how many tasks and dependencies they have)
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See :doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new DAG runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks

Review comment:
       Agree,




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712925076



##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,18 +141,173 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to include a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuously reading DAGs)
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are (i.e. how fast they can be parsed, how many tasks and dependencies they have)
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See :doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new DAG runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking
+
+In order to perform fine-tuning, it's good to understand how Scheduler works under-the-hood.
+You can take a look at the ``Airflow Summit 2021``
+`Deep Dive into the Airflow Scheduler talk <https://youtu.be/DYC4-xElccE>`_ to perform the fine-tuning.
+
+How to approach Scheduler's fine-tuning
+"""""""""""""""""""""""""""""""""""""""
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but it's a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are ok with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, whereas some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* its extremely important to monitor your system with the right set of tools that you usually use to
+  monitor your system. This document does not go into details of particular metrics and tools that you
+  can use, it just describes what kind of resources you should monitor, but you should follow your best
+  practices for monitoring to grab the right data.
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are the usual limiting factors
+* based on your expectations and observations - decide what is your next improvement and go back to
+  the observation of your performance, bottlenecks. Performance improvement is an iterative process.
+
+What resources might limit Scheduler's performance
+""""""""""""""""""""""""""""""""""""""""""""""""""
+
+There are several areas of resource usage that you should pay attention to:
+
+* FileSystem performance. Airflow Scheduler relies heavily on parsing (sometimes a lot) of Python
+  files, which are often located on a shared filesystem. Airflow Scheduler continuously reads and
+  re-parses those files. The same files have to be made available to workers, so often they are
+  stored in a distributed filesystem. You can use various filesystems for that purpose (NFS, CIFS, EFS,
+  GCS fuse, Azure File System are good examples). There are various parameters you can control for those
+  filesystems and fine-tune their performance, but this is beyond the scope of this document. You should
+  observe statistics and usage of your filesystem to determine if problems come from the filesystem
+  performance. For example there are anecdotal evidences that increasing IOPS (and paying more) for the
+  EFS performance, dramatically improves stability and speed of parsing Airflow DAGs when EFS is used.
+* Another solution to FileSystem performance, if it becomes your bottleneck, is to turn to alternative
+  mechanisms of distributing your DAGs. Embedding DAGs in your image and GitSync distribution have both
+  the property that the files are available locally for Scheduler and it does not have to use a
+  distributed filesystem to read the files, the files are available locally for the Scheduler and it is
+  usually as fast as it can be, especially if your machines use fast SSD disks for local storage. Those
+  distribution mechanisms have other characteristics that might make them not the best choice for you,
+  but if your problems with performance come from distributed filesystem performance, they might be the
+  best approach to follow.
+* Database connections and Database usage might become a problem as you want to increase performance and
+  process more things in parallel. Airflow is known from being "database-connection hungry" - the more DAGs
+  you have and the more you want to process in parallel, the more database connections will be opened.
+  This is generally not a problem for MySQL as its model of handling connections is thread-based, but this
+  might be a problem for Postgres, where connection handling is process-based. It is a general consensus
+  that if you have even medium size Postgres-based Airflow installation, the best solution is to use
+  `PGBouncer <https://www.pgbouncer.org/>`_ as a proxy to your database. The :doc:`helm-chart:index`
+  supports PGBouncer out-of-the-box. For MsSQL we have not yet worked out the best practices as support
+  for MsSQL is still experimental.
+* CPU usage is most important for FileProcessors - those are the processes that parse and execute
+  Python DAG files. Since Schedulers triggers such parsing continuously, when you have a lot of DAGs,
+  the processing might take a lot of CPU. You can mitigate it by decreasing the
+  :ref:`config:scheduler__min_file_process_interval`, but this is one of the mentioned trade-offs,
+  result of this is that changes to such files will be picked up slower and you will see delays between
+  submitting the files and getting them available in Airflow UI and executed by Scheduler. Optimizing
+  the way how your DAGs are built, avoiding external data sources is your best approach to improve CPU
+  usage. If you have more CPUs available, you can increase number of processing threads
+  :ref:`config:scheduler__parsing_processes`, Also Airflow Scheduler scales almost linearly with
+  several instances, so you can also add more Schedulers if your Scheduler's performance is CPU-bound.
+* Airflow might use quite significant amount of memory when you try to get more performance out of it.
+  Often more performance is achieved in Airflow by increasing number of processes handling the load,
+  and each process requires whole interpreter of Python loaded, a lot of classes imported, temporary

Review comment:
       Agree in general for all the "shared C libraries" and "shared python libraries imported before fork" part. I will reword it to explain it better. It's indeed all about Heap Space and Stack under the hood, but it also depends a bit on a sequence of loading - for example with the "Optimized Hook imports" on one hand you have faster startup, but on the other hand, the lirbraries that are imported after the fork are duplicated - so the truth is somewhere in-between.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] BasPH commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

BasPH commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r711984761



##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler performs two
+operations:

Review comment:
       ```suggestion
   The Scheduler is responsible for two operations:
   ```

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler performs two
+operations:
+
+* continuously parses DAG files and updates their starts in ``Serialized DAG`` form in the database
+* continuously finds and schedules for execution the next tasks to run and sends those tasks for
+  execution to the executor you have configured
+
+Those two tasks are executed in parallel by scheduler, they are fairly independent from each other and
+they are run using different processes. You can fine tune the behaviour of both components, however
+in order to fine-tune your scheduler, you need to included a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGS
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See:doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new dag runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking
+
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but it's a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are ok with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, where some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are the usual limiting factors
+* based on your expectations and observations - decide what is your next improvement and go back to
+  the observation of your performance, bottlenecks. Performance improvement is an iterative process
+
+The improvements that you can consider are:
+
+* improve utilization of your resources. This is when you have a free capacity in your system that
+  seems underutilized (again CPU, memory I/O, networking are the prime candidates) - you can take
+  actions like increasing number of schedulers, parsing processes or decreasing intervals for more
+  frequent actions might bring improvements in performance at the expense of higher utilization of those.
+* increase hardware capacity (for example if you see that CPU is limiting you or tha I/O you use for
+  DAG filesystem is at its limits). Often the problem with scheduler performance is
+  simply because your system is not "capable" enough and this might be the only way. For example if
+  you see that you are using all CPU you have on machine, you might want to add another scheduler on
+  a new machine - in most cases, when you add 2nd or 3rd scheduler, the capacity of scheduling grows
+  linearly (unless the shared database or filesystem is a bottleneck).
+* experiment with different values for the "scheduler tunables". Often you might get better effects by
+  simply exchanging one performance aspect for another. For example if you want to decrease the
+  cpu usage, you might increase file processing interval (but the result will be that new DAGs will
+  appear with bigger delay). Usually performance tuning is the art of balancing different aspects.
+* sometimes you change scheduler behaviour slightly (for example change parsing sort order)
+  in order to get better fine-tuned results for your particular deployment.
+
+In order to perform fine-tuning, it's good to understand how Scheduler works under-the-hood.
+You can take a look at the ``Airflow Summit 2021``
+`Deep Dive into the Airflow Scheduler talk <https://youtu.be/DYC4-xElccE>`_ to perform the fine-tuning.
+
+Here are the most important tunables you can use to impact various performance aspects of the Scheduler:
+
 .. _scheduler:ha:tunables:
 
-Scheduler Tuneables
-"""""""""""""""""""
+Scheduler Tunables

Review comment:
       Would keep this aligned with Airflow terminology
   ```suggestion
   Scheduler configuration options
   ```

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler performs two
+operations:
+
+* continuously parses DAG files and updates their starts in ``Serialized DAG`` form in the database
+* continuously finds and schedules for execution the next tasks to run and sends those tasks for
+  execution to the executor you have configured

Review comment:
       Nothing wrong here, but would try to avoid internal details not necessary to the story
   
   ```suggestion
   1. continuously parsing DAG files and synchronizing with the DAG in the database
   2. continuously scheduling tasks for execution
   ```

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -187,3 +279,36 @@ The following config settings can be used to control aspects of the Scheduler HA
   :ref:`config:scheduler__scheduler_health_check_threshold`) any running or
   queued tasks that were launched by the dead process will be "adopted" and
   monitored by this scheduler instead.
+
+- :ref:`config:scheduler__dag_dir_list_interval`
+  How often (in seconds) to scan the DAGs directory for new files.
+
+- :ref:`config:scheduler__file_parsing_sort_mode`
+  The scheduler will list and sort the dag files to decide the parsing order.
+
+- :ref:`config:scheduler__max_tis_per_query`
+  The batch size of queries in the scheduling main loop. If this is too high, SQL query
+  performance may be impacted by one or more of the following:
+
+  - reversion to full table scan - complexity of query predicate
+  - excessive locking
+
+  Additionally, you may hit the maximum allowable query length for your db.
+  Set this to 0 for no limit (not advised)
+
+- :ref:`config:scheduler__min_file_process_interval`
+  Number of seconds after which a DAG file is parsed. The DAG file is parsed every
+  min_file_process_interval number of seconds. Updates to DAGs are reflected after
+  this interval. Keeping this number low will increase CPU usage.
+
+- :ref:`config:scheduler__parsing_processes`
+  The scheduler can run multiple processes in parallel to parse dags. This defines
+  how many processes will run.
+
+- :ref:`config:scheduler__processor_poll_interval`
+  The number of seconds to wait between consecutive DAG file processing
+
+- :ref:`config:scheduler__schedule_after_task_execution`
+  Should the Task supervisor process perform a “mini scheduler” to attempt to schedule more tasks of
+  the same DAG. Leaving this on will mean tasks in the same DAG execute quicker,
+  but might starve out other dags in some circumstances

Review comment:
       What circumstances?

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler performs two
+operations:
+
+* continuously parses DAG files and updates their starts in ``Serialized DAG`` form in the database
+* continuously finds and schedules for execution the next tasks to run and sends those tasks for
+  execution to the executor you have configured
+
+Those two tasks are executed in parallel by scheduler, they are fairly independent from each other and
+they are run using different processes. You can fine tune the behaviour of both components, however
+in order to fine-tune your scheduler, you need to included a number of factors:

Review comment:
       ```suggestion
   Those two tasks are executed in parallel by the scheduler and run independently of each other in different processes. In order to fine-tune your scheduler, you need to include a number of factors:
   ```

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler performs two
+operations:
+
+* continuously parses DAG files and updates their starts in ``Serialized DAG`` form in the database
+* continuously finds and schedules for execution the next tasks to run and sends those tasks for
+  execution to the executor you have configured
+
+Those two tasks are executed in parallel by scheduler, they are fairly independent from each other and
+they are run using different processes. You can fine tune the behaviour of both components, however
+in order to fine-tune your scheduler, you need to included a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGS
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See:doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new dag runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking
+
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but it's a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are ok with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, where some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are the usual limiting factors
+* based on your expectations and observations - decide what is your next improvement and go back to
+  the observation of your performance, bottlenecks. Performance improvement is an iterative process
+
+The improvements that you can consider are:
+
+* improve utilization of your resources. This is when you have a free capacity in your system that
+  seems underutilized (again CPU, memory I/O, networking are the prime candidates) - you can take
+  actions like increasing number of schedulers, parsing processes or decreasing intervals for more
+  frequent actions might bring improvements in performance at the expense of higher utilization of those.
+* increase hardware capacity (for example if you see that CPU is limiting you or tha I/O you use for
+  DAG filesystem is at its limits). Often the problem with scheduler performance is
+  simply because your system is not "capable" enough and this might be the only way. For example if
+  you see that you are using all CPU you have on machine, you might want to add another scheduler on
+  a new machine - in most cases, when you add 2nd or 3rd scheduler, the capacity of scheduling grows
+  linearly (unless the shared database or filesystem is a bottleneck).
+* experiment with different values for the "scheduler tunables". Often you might get better effects by
+  simply exchanging one performance aspect for another. For example if you want to decrease the
+  cpu usage, you might increase file processing interval (but the result will be that new DAGs will
+  appear with bigger delay). Usually performance tuning is the art of balancing different aspects.
+* sometimes you change scheduler behaviour slightly (for example change parsing sort order)
+  in order to get better fine-tuned results for your particular deployment.
+
+In order to perform fine-tuning, it's good to understand how Scheduler works under-the-hood.
+You can take a look at the ``Airflow Summit 2021``
+`Deep Dive into the Airflow Scheduler talk <https://youtu.be/DYC4-xElccE>`_ to perform the fine-tuning.
+
+Here are the most important tunables you can use to impact various performance aspects of the Scheduler:
+
 .. _scheduler:ha:tunables:
 
-Scheduler Tuneables
-"""""""""""""""""""
+Scheduler Tunables
+""""""""""""""""""
 
-The following config settings can be used to control aspects of the Scheduler HA loop.
+The following config settings can be used to control aspects of the Scheduler.
+However you can also look at other scheduler configuration parameters available at
+:doc:`../configurations-ref` in ``[scheduler]`` section.

Review comment:
       Would leave this out, or mention something like "non-performance related configuration options"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712916297



##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -255,9 +297,34 @@ No additional code needs to be written by the user to run this test.
 
 .. code-block:: bash
 
- python your-dag-file.py
+     python your-dag-file.py
+
+Running the above command without any error ensures your DAG does not contain any uninstalled dependency,
+syntax errors, etc. Make sure that you load your DAG in an environment that corresponds to your
+scheduler environment - with the same dependencies, environment variables, common code referred from the
+DAG.
+
+This is also a great way to check if your DAG loads faster after an optimization, if you want to attempt
+to optimize DAG loading time. Simply run the DAG and measure the time it takes, but again you have to
+make sure your DAG runs with the same dependencies, environment variables, common code.
+Make sure to run it several time in succession to account for caching effects. Compare the results
+before and after the optimization in order to assess the impact of the optimization.
+
+There are many ways to measure the time of processing, one of them in Linux environment is to
+use built-in ``time`` command
+
+.. code-block:: bash
+
+     time python your-dag-file.py
+
+Result:
+
+.. code-block:: text
+
+     python your-dag-file.py 0.05s user 0.02s system 1% cpu 1.033 total
 
-Running the above command without any error ensures your DAG does not contain any uninstalled dependency, syntax errors, etc.
+The important metrics is the "total time" - which tells you how long elapsed time it took
+to process the DAG.

Review comment:
       Thought about it, yeah. Might be worth mentioning indeed (though I tried to be quite clear here that it's about the "improvements" indeed it's good to make this statement.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#issuecomment-923188341


   Hey @eladkal - see the latest fixup commit. I tried to capture the right balance between telling users how they can approach to simplify their DAG complexity while not attempting to tell them exactly how to do it (which would be impossible) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712937017



##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -241,11 +243,51 @@ each parameter by following the links):
 * :ref:`config:scheduler__parsing_processes`
 * :ref:`config:scheduler__file_parsing_sort_mode`
 
+.. _best_practices/reducing_dag_complexity:
+
+Reducing DAG complexity
+^^^^^^^^^^^^^^^^^^^^^^^
+
+While Airflow is good in handling a lot of DAGs with a lot of task and dependencies between them, when you
+have many complex DAGs, their complexity might impact performance of scheduling. One of the ways to keep
+your Airflow instance performant and well utilized, you should strive to simplify and optimize your DAGs
+whenever possible - you have to remember that DAG parsing process and creation is just executing
+Python code and it's up to you to make it as performant as possible. There are no magic recipes for making
+your DAG "less complex" - since this is a Python code, it's the DAG writer who controls the complexity of
+their code.
+
+There are no "metrics" for DAG complexity, especially, there are no metrics that can tell you
+whether your DAG is "simple enough". However - as with any Python code you can definitely tell that
+your code is "simpler" or "faster" when you optimize it, the same can be said about DAG code. If you
+want to optimize your DAGs there are the following actions you can take:
+
+* Make your DAG load faster. This is a single improvement advice that might be implemented in various ways
+  but this is the one that has biggest impact on scheduler's performance. Whenever you have a chance to make
+  your DAG load faster - go for it, if your goal is to improve performance. See below
+  :ref:`best_practices/dag_loader_test` on how to asses your DAG loading time.
+
+* Make your DAG generate fewer tasks. Every task adds additional processing overhead for scheduling and
+  execution. If you can decrease the number of tasks that your DAG use, this will likely improve overall
+  scheduling and performance (however be aware that Airflow's flexibility comes from splitting the
+  work between multiple independent and sometimes parallel tasks and it makes it easier to reason
+  about the logic of your DAG when it is split to a number independent, standalone tasks. Also Airflow allows
+  to re-run only specific tasks when needed which might improve maintainability of the DAG - so you have to
+  strike the right balance between optimization, readability and maintainability which is best for your team.
+
+* Make smaller number of DAGs per file. While Airflow 2 is optimized for the case of having multiple DAGs
+  in one file, there are some parts of the system that make it sometimes less performant, or introduce more
+  delays than having those DAGs split among many files. Just the fact that one file can only be parsed by one
+  FileProcessor, makes it less scalable for example. If you have many DAGs generated from one file,
+  consider splitting them if you observe processing and scheduling delays.

Review comment:
       I clarified it a bit in upcoming change (to make it clear that it is in the case you see delays in changes propagated from the files to UI)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] edwardwang888 commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

edwardwang888 commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712664619



##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,18 +141,173 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to include a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuous reading DAGs)
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are (i.e. how fast they can be parsed, how many tasks and dependencies they have)
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See :doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new DAG runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking
+
+In order to perform fine-tuning, it's good to understand how Scheduler works under-the-hood.
+You can take a look at the ``Airflow Summit 2021``
+`Deep Dive into the Airflow Scheduler talk <https://youtu.be/DYC4-xElccE>`_ to perform the fine-tuning.
+
+How to approach Scheduler's fine-tuning
+"""""""""""""""""""""""""""""""""""""""
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but it's a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are ok with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, whereas some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* its extremely important to monitor your system with the right set of tools that you usually use to
+  monitor your system. This document does not go into details of particular metrics and tools that you
+  can use, it just describes what kind of resources you should monitor, but you should follow your best
+  practices for monitoring to grab the right data.
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are the usual limiting factors
+* based on your expectations and observations - decide what is your next improvement and go back to
+  the observation of your performance, bottlenecks. Performance improvement is an iterative process.
+
+What resources might limit Scheduler's performance
+""""""""""""""""""""""""""""""""""""""""""""""""""
+
+There are several areas of resource usage that you should pay attention to:
+
+* FileSystem performance. Airflow Scheduler relies heavily on parsing (sometimes a lot) of Python
+  files, which are often located on a shared filesystem. Airflow Scheduler continuously reads and
+  re-parses those files. The same files have to be made available to workers, so often they are
+  stored in a distributed filesystem. You can use various filesystems for that purpose (NFS, CIFS, EFS,
+  GCS fuse, Azure File System are good examples). There are various parameters you can control for those
+  filesystems and fine-tune their performance, but this is beyond the scope of this document. You should
+  observe statistics and usage of your filesystem to determine if problems come from the filesystem
+  performance. For example there are anecdotal evidences that increasing IOPS (and paying more) for the
+  EFS performance, dramatically improves stability and speed of parsing Airflow DAGs when EFS is used.
+* Another solution to FileSystem performance, if it becomes your bottleneck, is to turn to alternative
+  mechanisms of distributing your DAGs. Embedding DAGs in your image and GitSync distribution have both
+  the property that the files are available locally for Scheduler and it does not have to use a
+  distributed filesystem to read the files, the files are available locally for the Scheduler and it is
+  usually as fast as it can be, especially if your machines use fast SSD disks for local storage. Those
+  distribution mechanisms have other characteristics that might make them not the best choice for you,
+  but if your problems with performance come from distributed filesystem performance, they might be the
+  best approach to follow.
+* Database connections and Database usage might become a problem as you want to increase performance and
+  process more things in parallel. Airflow is known from being "database-connection hungry" - the more DAGs
+  you have and the more you want to process in parallel, the more database connections will be opened.
+  This is generally not a problem for MySQL as its model of handling connections is thread-based, but this
+  might be a problem for Postgres, where connection handling is process-based. It is a general consensus
+  that if you have even medium size Postgres-based Airflow installation, the best solution is to use
+  `PGBouncer <https://www.pgbouncer.org/>`_ as a proxy to your database. The :doc:`helm-chart:index`
+  supports PGBouncer out-of-the-box. For MsSQL we have not yet worked out the best practices as support
+  for MsSQL is still experimental.
+* CPU usage is most important for FileProcessors - those are the processes that parse and execute
+  Python DAG files. Since Schedulers triggers such parsing continuously, when you have a lot of DAGs,
+  the processing might take a lot of CPU. You can mitigate it by decreasing the
+  :ref:`config:scheduler__min_file_process_interval`, but this is one of the mentioned trade-offs,
+  result of this is that changes to such files will be picked up slower and you will see delays between
+  submitting the files and getting them available in Airflow UI and executed by Scheduler. Optimizing
+  the way how your DAGs are built, avoiding external data sources is your best approach to improve CPU
+  usage. If you have more CPUs available, you can increase number of processing threads
+  :ref:`config:scheduler__parsing_processes`, Also Airflow Scheduler scales almost linearly with
+  several instances, so you can also add more Schedulers if your Scheduler's performance is CPU-bound.
+* Airflow might use quite significant amount of memory when you try to get more performance out of it.
+  Often more performance is achieved in Airflow by increasing number of processes handling the load,
+  and each process requires whole interpreter of Python loaded, a lot of classes imported, temporary
+  in-memory storage. This can lead to memory pressure. You need to observe if your system is not using
+  more memory than it has - which results with using swap disk, which dramatically decreases performance.

Review comment:
       Since you are referring to the scenario of using more memory than you have, I think it's clearer to word this way?
   ```suggestion
     in-memory storage. This can lead to memory pressure. You need to observe if your system is using
     more memory than it has - which results with using swap disk, which dramatically decreases performance.
   ```

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -255,9 +297,34 @@ No additional code needs to be written by the user to run this test.
 
 .. code-block:: bash
 
- python your-dag-file.py
+     python your-dag-file.py
+
+Running the above command without any error ensures your DAG does not contain any uninstalled dependency,
+syntax errors, etc. Make sure that you load your DAG in an environment that corresponds to your
+scheduler environment - with the same dependencies, environment variables, common code referred from the
+DAG.
+
+This is also a great way to check if your DAG loads faster after an optimization, if you wan to attempt
+to optimize DAG loading time. Simply run the DAG and measure the time it takes, but again you have to
+make sure your DAG runs wit the same dependencies, environment variables, common code.
+Make sure to run it several time in succession to account for caching effects. Compare the results
+before and after the optimization in order to asses the impact of the optimization.

Review comment:
       ```suggestion
   before and after the optimization in order to assess the impact of the optimization.
   ```

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -241,11 +243,51 @@ each parameter by following the links):
 * :ref:`config:scheduler__parsing_processes`
 * :ref:`config:scheduler__file_parsing_sort_mode`
 
+.. _best_practices/reducing_dag_complexity:
+
+Reducing DAG complexity
+^^^^^^^^^^^^^^^^^^^^^^^
+
+While Airflow is good in handling a lot of DAGs with a lot of task and dependencies between them, when you
+have many complex DAGs, their complexity might impact performance of scheduling. One of the ways to keep
+your Airflow instance performant and well utilized, you should strive to simplify and optimize your DAGs
+whenever possible - you have to remember that DAG parsing process and creation is just executing
+Python code and it's up to you to make it as performant as possible. There are no magic recipes for making
+your DAG "less complex" - since this is a Python code, it's the DAG writer who controls the complexity of
+their code.
+
+There are no "metrics" for DAG complexity, especially, there are no metrics that can tell you
+whether your DAG is "simple enough". However - as with any Python code you can definitely tell that
+your code is "simpler" or "faster" when you optimize it, the same can be said about DAG code. If you
+want to optimize your DAGs there are the following actions you can take:
+
+* Make your DAG load faster. This is a single improvement advice that might be implemented in various ways
+  but this is the one that has biggest impact on scheduler's performance. Whenever you have a chance to make
+  your DAG load faster - go for it, if your goal is to improve performance. See below
+  :ref:`best_practices/dag_loader_test` on how to asses your DAG loading time.
+
+* Make your DAG generate lest tasks. Every task adds additional processing overhead for scheduling and

Review comment:
       Typo, but `fewer` is gramatically correct 🙂 
   ```suggestion
   * Make your DAG generate fewer tasks. Every task adds additional processing overhead for scheduling and
   ```

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -255,9 +297,34 @@ No additional code needs to be written by the user to run this test.
 
 .. code-block:: bash
 
- python your-dag-file.py
+     python your-dag-file.py
+
+Running the above command without any error ensures your DAG does not contain any uninstalled dependency,
+syntax errors, etc. Make sure that you load your DAG in an environment that corresponds to your
+scheduler environment - with the same dependencies, environment variables, common code referred from the
+DAG.
+
+This is also a great way to check if your DAG loads faster after an optimization, if you wan to attempt

Review comment:
       ```suggestion
   This is also a great way to check if your DAG loads faster after an optimization, if you want to attempt
   ```

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -255,9 +297,34 @@ No additional code needs to be written by the user to run this test.
 
 .. code-block:: bash
 
- python your-dag-file.py
+     python your-dag-file.py
+
+Running the above command without any error ensures your DAG does not contain any uninstalled dependency,
+syntax errors, etc. Make sure that you load your DAG in an environment that corresponds to your
+scheduler environment - with the same dependencies, environment variables, common code referred from the
+DAG.
+
+This is also a great way to check if your DAG loads faster after an optimization, if you wan to attempt
+to optimize DAG loading time. Simply run the DAG and measure the time it takes, but again you have to
+make sure your DAG runs wit the same dependencies, environment variables, common code.

Review comment:
       ```suggestion
   make sure your DAG runs with the same dependencies, environment variables, common code.
   ```

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,18 +141,173 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to include a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuous reading DAGs)

Review comment:
       ```suggestion
       * what kind of filesystem you have to share the DAGs (impacts performance of continuously reading DAGs)
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712396478



##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,18 +141,172 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to include a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuous reading DAGs)
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See :doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new DAG runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking
+
+In order to perform fine-tuning, it's good to understand how Scheduler works under-the-hood.
+You can take a look at the ``Airflow Summit 2021``
+`Deep Dive into the Airflow Scheduler talk <https://youtu.be/DYC4-xElccE>`_ to perform the fine-tuning.
+
+How to approach Scheduler's fine-tuning
+"""""""""""""""""""""""""""""""""""""""
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but it's a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are ok with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, whereas some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* its extremely important to monitor your system with the right set of tools that you usually use to
+  monitor your system. This document does not go into details of particular metrics and tools that you
+  can use, it just describes what kind of resources you should monitor, but you should follow your best
+  practices for monitoring to grab the right data.
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are the usual limiting factors
+* based on your expectations and observations - decide what is your next improvement and go back to
+  the observation of your performance, bottlenecks. Performance improvement is an iterative process.
+
+What resources might limit Scheduler's performance
+""""""""""""""""""""""""""""""""""""""""""""""""""
+
+There are several areas of resource usage that you should pay attention to:
+
+* FileSystem performance. Airflow Scheduler relies heavily on parsing (sometimes a lot) of Python
+  files, which are often located on a shared filesystem. Airflow Scheduler continuously reads and
+  re-parses those files. The same files have to be made available to workers, so often they are
+  stored in a distributed filesystem. You can use various filesystems for that purpose (NFS, CIFS, EFS,
+  GCS fuse, Azure File System are good examples). There are various parameters you can control for those
+  filesystems and fine-tune their performance, but this is beyond the scope of this document. You should
+  observe statistics and usage of your filesystem to determine if problems come from the filesystem
+  performance. For example there are anecdotal evidences that increasing IOPS (and paying more) for the
+  EFS performance, dramatically improves stability and speed of parsing Airflow DAGs when EFS is used.
+* Another solution to FileSystem performance, if it becomes your bottleneck, is to turn to alternative
+  mechanisms of distributing your DAGs. Embedding DAGs in your image and GitSync distribution have both
+  the property that the files are available locally for Scheduler and it does not have to use a
+  distributed filesystem to read the files, the files are available locally for the Scheduler and it is
+  usually as fast as it can be, especially if your machines use fast SSD disks for local storage. Those
+  distribution mechanisms have other characteristics that might make them not the best choice for you,
+  but if your problems with performance come from distributed filesystem performance, they might be the
+  best approach to follow.
+* Database connections and Database usage might become a problem as you want to increase performance and
+  process more things in parallel. Airflow is known from being "database-connection hungry" - the more DAGs
+  you have and the more you want to process in parallel, the more database connections will be opened.
+  This is generally not a problem for MySQL as its model of handling connections is thread-based, but this
+  might be a problem for Postgres, where connection handling is process-based. It is a general consensus
+  that if you have even medium size Postgres-based Airflow installation, the best solution is to use
+  `PGBouncer <https://www.pgbouncer.org/>`_ as a proxy to your database. The :doc:`helm-chart:index`
+  supports PGBouncer out-of-the-box. For MsSQL we have not yet worked out the best practices as support
+  for MsSQL is still experimental.
+* CPU usage is most important for FileProcessors - those are the processes that parse and execute
+  Python DAG files. Since Schedulers triggers such parsing continuously, when you have a lot complex DAGs,

Review comment:
       I think there are multiple things that might be seen as "complex" , For example "one that takes a lot of time to parse", generates "plenty of dags", "generates plenty of tasks", "generates plenty of tasks interconnection". I have no (and I think no-one else has) a ready to give recipe "this is too complex" or "this is simple enough". 
   
   Each user can write their dags in their own ways. We can determine some bad practices (and we did already - with external DB usage for example). Python is so versatile that it's difficult to say "this is complex" this is "simple". However it's rather easy to say "this is less complex than it's been before" when you simplify it" and I think users should be gently guided that they should understand what they do when they write Python code and work out their own "ways" of developing DAGs which lead to less complexity.
   
   But this is a good point to point people to some ways of at least assessing it - we have the right chapter in `best practices` about testing dag - for example "check tha your DAG imports <fast>" and try to speed it up, if you perceive this being too long (again we cannot tell what "fast" is  - for some users this might be ms, for some several seconds is still ok. But every one of those users can be told `if you want to improve performance - made DAG loading "faster", and here is the tool to test it`. This is precisely what I am going to do.
   
   I will describe the above. I think users should be made aware that there is no ready-to-use-recipe for simple vs. complex and they should simply strive to improve their DAG in terms of loading time, number of Tasks and interconnections between them.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712396962



##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,18 +141,172 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to include a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuous reading DAGs)
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See :doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new DAG runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking
+
+In order to perform fine-tuning, it's good to understand how Scheduler works under-the-hood.
+You can take a look at the ``Airflow Summit 2021``
+`Deep Dive into the Airflow Scheduler talk <https://youtu.be/DYC4-xElccE>`_ to perform the fine-tuning.
+
+How to approach Scheduler's fine-tuning
+"""""""""""""""""""""""""""""""""""""""
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but it's a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are ok with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, whereas some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* its extremely important to monitor your system with the right set of tools that you usually use to
+  monitor your system. This document does not go into details of particular metrics and tools that you
+  can use, it just describes what kind of resources you should monitor, but you should follow your best
+  practices for monitoring to grab the right data.
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are the usual limiting factors
+* based on your expectations and observations - decide what is your next improvement and go back to
+  the observation of your performance, bottlenecks. Performance improvement is an iterative process.
+
+What resources might limit Scheduler's performance
+""""""""""""""""""""""""""""""""""""""""""""""""""
+
+There are several areas of resource usage that you should pay attention to:
+
+* FileSystem performance. Airflow Scheduler relies heavily on parsing (sometimes a lot) of Python
+  files, which are often located on a shared filesystem. Airflow Scheduler continuously reads and
+  re-parses those files. The same files have to be made available to workers, so often they are
+  stored in a distributed filesystem. You can use various filesystems for that purpose (NFS, CIFS, EFS,
+  GCS fuse, Azure File System are good examples). There are various parameters you can control for those
+  filesystems and fine-tune their performance, but this is beyond the scope of this document. You should
+  observe statistics and usage of your filesystem to determine if problems come from the filesystem
+  performance. For example there are anecdotal evidences that increasing IOPS (and paying more) for the
+  EFS performance, dramatically improves stability and speed of parsing Airflow DAGs when EFS is used.
+* Another solution to FileSystem performance, if it becomes your bottleneck, is to turn to alternative
+  mechanisms of distributing your DAGs. Embedding DAGs in your image and GitSync distribution have both
+  the property that the files are available locally for Scheduler and it does not have to use a
+  distributed filesystem to read the files, the files are available locally for the Scheduler and it is
+  usually as fast as it can be, especially if your machines use fast SSD disks for local storage. Those
+  distribution mechanisms have other characteristics that might make them not the best choice for you,
+  but if your problems with performance come from distributed filesystem performance, they might be the
+  best approach to follow.
+* Database connections and Database usage might become a problem as you want to increase performance and
+  process more things in parallel. Airflow is known from being "database-connection hungry" - the more DAGs
+  you have and the more you want to process in parallel, the more database connections will be opened.
+  This is generally not a problem for MySQL as its model of handling connections is thread-based, but this
+  might be a problem for Postgres, where connection handling is process-based. It is a general consensus
+  that if you have even medium size Postgres-based Airflow installation, the best solution is to use
+  `PGBouncer <https://www.pgbouncer.org/>`_ as a proxy to your database. The :doc:`helm-chart:index`
+  supports PGBouncer out-of-the-box. For MsSQL we have not yet worked out the best practices as support
+  for MsSQL is still experimental.
+* CPU usage is most important for FileProcessors - those are the processes that parse and execute
+  Python DAG files. Since Schedulers triggers such parsing continuously, when you have a lot complex DAGs,
+  the processing might take a lot of CPU. You can mitigate it by decreasing the
+  :ref:`config:scheduler__min_file_process_interval`, but this is one of the mentioned trade-offs,
+  result of this is that changes to such files will be picked up slower and you will see delays between
+  submitting the files and getting them available in Airflow UI and executed by Scheduler. Optimizing
+  the way how your DAGs are built, avoiding external data sources is your best approach to improve CPU
+  usage. If you have more CPUs available, you can increase number of processing threads
+  :ref:`config:scheduler__parsing_processes`, Also Airflow Scheduler scales almost linearly with
+  several instances, so you can also add more Schedulers if your Scheduler's performance is CPU-bound.
+* Airflow might use quite significant amount of memory when you try to get more performance out of it.
+  Often more performance is achieved in Airflow by increasing number of processes handling the load,
+  and each process requires whole interpreter of Python loaded, a lot of classes imported, temporary
+  in-memory storage. This can lead to memory pressure. You need to observe if your system is not using
+  more memory than it has - which results with using swap disk, which dramatically decreases performance.
+  Note that Airflow Scheduler in versions prior to ``2.1.4`` generated a lot of ``Page Cache`` memory
+  used by log files (when the log files were not removed). This was generally harmless, as the memory
+  is just cache and could be reclaimed at any time by the system, however in version ``2.1.4`` and
+  beyond, writing logs will not generate excessive ``Page Cache`` memory. Regardless - make sure when you look
+  at memory usage, pay attention to the kind of memory you are observing. Usually you should look at
+  ``working memory``(names might vary depending on your deployment) rather than ``total memory used``.
+
+What can you do, to improve Scheduler's performance
+"""""""""""""""""""""""""""""""""""""""""""""""""""
+
+When you know what your resource usage is, the improvements that you can consider might be:
+
+* improve the logic, efficiency of parsing and reduce complexity of your DAG Python code. It is parsed
+  continuously so optimizing that code might bring tremendous improvements, especially if you try
+  to reach out to some external databases etc. while parsing DAGs (this should be avoided at all cost).
+  The :doc:`/best-practices` document shows a few examples on how you can approach dynamic DAG parsing
+  without reaching out to external sources.
+* improve utilization of your resources. This is when you have a free capacity in your system that
+  seems underutilized (again CPU, memory I/O, networking are the prime candidates) - you can take
+  actions like increasing number of schedulers, parsing processes or decreasing intervals for more
+  frequent actions might bring improvements in performance at the expense of higher utilization of those.

Review comment:
       Yeah. That's the https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#testing-a-dag which I will refer to.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk merged pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk merged pull request #18356:
URL: https://github.com/apache/airflow/pull/18356


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] ashb commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

ashb commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712881658



##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -241,11 +243,51 @@ each parameter by following the links):
 * :ref:`config:scheduler__parsing_processes`
 * :ref:`config:scheduler__file_parsing_sort_mode`
 
+.. _best_practices/reducing_dag_complexity:
+
+Reducing DAG complexity
+^^^^^^^^^^^^^^^^^^^^^^^
+
+While Airflow is good in handling a lot of DAGs with a lot of task and dependencies between them, when you
+have many complex DAGs, their complexity might impact performance of scheduling. One of the ways to keep
+your Airflow instance performant and well utilized, you should strive to simplify and optimize your DAGs
+whenever possible - you have to remember that DAG parsing process and creation is just executing
+Python code and it's up to you to make it as performant as possible. There are no magic recipes for making
+your DAG "less complex" - since this is a Python code, it's the DAG writer who controls the complexity of
+their code.
+
+There are no "metrics" for DAG complexity, especially, there are no metrics that can tell you
+whether your DAG is "simple enough". However - as with any Python code you can definitely tell that
+your code is "simpler" or "faster" when you optimize it, the same can be said about DAG code. If you
+want to optimize your DAGs there are the following actions you can take:
+
+* Make your DAG load faster. This is a single improvement advice that might be implemented in various ways
+  but this is the one that has biggest impact on scheduler's performance. Whenever you have a chance to make
+  your DAG load faster - go for it, if your goal is to improve performance. See below
+  :ref:`best_practices/dag_loader_test` on how to asses your DAG loading time.
+
+* Make your DAG generate fewer tasks. Every task adds additional processing overhead for scheduling and
+  execution. If you can decrease the number of tasks that your DAG use, this will likely improve overall
+  scheduling and performance (however be aware that Airflow's flexibility comes from splitting the
+  work between multiple independent and sometimes parallel tasks and it makes it easier to reason
+  about the logic of your DAG when it is split to a number independent, standalone tasks. Also Airflow allows
+  to re-run only specific tasks when needed which might improve maintainability of the DAG - so you have to
+  strike the right balance between optimization, readability and maintainability which is best for your team.
+
+* Make smaller number of DAGs per file. While Airflow 2 is optimized for the case of having multiple DAGs
+  in one file, there are some parts of the system that make it sometimes less performant, or introduce more
+  delays than having those DAGs split among many files. Just the fact that one file can only be parsed by one
+  FileProcessor, makes it less scalable for example. If you have many DAGs generated from one file,
+  consider splitting them if you observe processing and scheduling delays.

Review comment:
       This makes zero difference to scheduling though, so we'll need to re-word this section.
   
   Since the scheduler operates soley on the serialized format, _once_ the dag is parsed once and in the serialized table the time to parse will have no impact on how fast the scheduler makes decisions about a DAG.
   
   The only time this would come back in to play is when executing the task (which is also only a problem if you are really trying to squeeze task start up time to a minimum.

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -241,11 +243,51 @@ each parameter by following the links):
 * :ref:`config:scheduler__parsing_processes`
 * :ref:`config:scheduler__file_parsing_sort_mode`
 
+.. _best_practices/reducing_dag_complexity:
+
+Reducing DAG complexity
+^^^^^^^^^^^^^^^^^^^^^^^
+
+While Airflow is good in handling a lot of DAGs with a lot of task and dependencies between them, when you
+have many complex DAGs, their complexity might impact performance of scheduling. One of the ways to keep
+your Airflow instance performant and well utilized, you should strive to simplify and optimize your DAGs
+whenever possible - you have to remember that DAG parsing process and creation is just executing
+Python code and it's up to you to make it as performant as possible. There are no magic recipes for making
+your DAG "less complex" - since this is a Python code, it's the DAG writer who controls the complexity of
+their code.
+
+There are no "metrics" for DAG complexity, especially, there are no metrics that can tell you
+whether your DAG is "simple enough". However - as with any Python code you can definitely tell that
+your code is "simpler" or "faster" when you optimize it, the same can be said about DAG code. If you
+want to optimize your DAGs there are the following actions you can take:
+
+* Make your DAG load faster. This is a single improvement advice that might be implemented in various ways
+  but this is the one that has biggest impact on scheduler's performance. Whenever you have a chance to make
+  your DAG load faster - go for it, if your goal is to improve performance. See below
+  :ref:`best_practices/dag_loader_test` on how to asses your DAG loading time.
+
+* Make your DAG generate fewer tasks. Every task adds additional processing overhead for scheduling and
+  execution. If you can decrease the number of tasks that your DAG use, this will likely improve overall
+  scheduling and performance (however be aware that Airflow's flexibility comes from splitting the

Review comment:
       I'm not sure it does -- not to any really noticeable degree.
   
   The tree structure of the DAG has more impact than the number of the tasks I think (i.e. a linear chain of a->b->...z doesn't have to do all that much work as it only looks at the next task in the chain each time. I think?).
   
   But I'm not sure we should ever suggest that users make fewer tasks.

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -255,9 +297,34 @@ No additional code needs to be written by the user to run this test.
 
 .. code-block:: bash
 
- python your-dag-file.py
+     python your-dag-file.py
+
+Running the above command without any error ensures your DAG does not contain any uninstalled dependency,
+syntax errors, etc. Make sure that you load your DAG in an environment that corresponds to your
+scheduler environment - with the same dependencies, environment variables, common code referred from the
+DAG.
+
+This is also a great way to check if your DAG loads faster after an optimization, if you want to attempt
+to optimize DAG loading time. Simply run the DAG and measure the time it takes, but again you have to
+make sure your DAG runs with the same dependencies, environment variables, common code.
+Make sure to run it several time in succession to account for caching effects. Compare the results
+before and after the optimization in order to assess the impact of the optimization.
+
+There are many ways to measure the time of processing, one of them in Linux environment is to
+use built-in ``time`` command
+
+.. code-block:: bash
+
+     time python your-dag-file.py
+
+Result:
+
+.. code-block:: text
+
+     python your-dag-file.py 0.05s user 0.02s system 1% cpu 1.033 total
 
-Running the above command without any error ensures your DAG does not contain any uninstalled dependency, syntax errors, etc.
+The important metrics is the "total time" - which tells you how long elapsed time it took
+to process the DAG.

Review comment:
       Might be worth adding a note that this is a slightly inflated time -- as _this_ command will have to `import airflow` etc, but when being loaded form within Airflow the `airflow` modules are already loaded pre-fork.

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -241,11 +243,51 @@ each parameter by following the links):
 * :ref:`config:scheduler__parsing_processes`
 * :ref:`config:scheduler__file_parsing_sort_mode`
 
+.. _best_practices/reducing_dag_complexity:
+
+Reducing DAG complexity
+^^^^^^^^^^^^^^^^^^^^^^^
+
+While Airflow is good in handling a lot of DAGs with a lot of task and dependencies between them, when you
+have many complex DAGs, their complexity might impact performance of scheduling. One of the ways to keep
+your Airflow instance performant and well utilized, you should strive to simplify and optimize your DAGs
+whenever possible - you have to remember that DAG parsing process and creation is just executing
+Python code and it's up to you to make it as performant as possible. There are no magic recipes for making
+your DAG "less complex" - since this is a Python code, it's the DAG writer who controls the complexity of
+their code.
+
+There are no "metrics" for DAG complexity, especially, there are no metrics that can tell you
+whether your DAG is "simple enough". However - as with any Python code you can definitely tell that
+your code is "simpler" or "faster" when you optimize it, the same can be said about DAG code. If you
+want to optimize your DAGs there are the following actions you can take:
+
+* Make your DAG load faster. This is a single improvement advice that might be implemented in various ways
+  but this is the one that has biggest impact on scheduler's performance. Whenever you have a chance to make
+  your DAG load faster - go for it, if your goal is to improve performance. See below
+  :ref:`best_practices/dag_loader_test` on how to asses your DAG loading time.

Review comment:
       A good way of doing this when using Python callables is move imports from top level of file in to inside of the python callables.
   
   (Importing a big module such as numpy or pandas would do a surprising amount of disk io, and none of it should be needed at DAG definition/parse time)

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,18 +141,173 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to include a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuously reading DAGs)
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are (i.e. how fast they can be parsed, how many tasks and dependencies they have)
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See :doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new DAG runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking

Review comment:
       I'm not sure this should be mentioned, at least not without a massive warning -- turning it off means 1 scheduler only (or else incorrect behaviour!)

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -180,10 +338,43 @@ The following config settings can be used to control aspects of the Scheduler HA
   SchedulerJobs.
 
   This setting controls how a dead scheduler will be noticed and the tasks it
-  was "supervising" get picked up by another scheduler. (The tasks will stay
-  running, so there is no harm in not detecting this for a while.)
+  was "supervising" get picked up by another scheduler. The tasks will stay
+  running, so there is no harm in not detecting this for a while.
 
   When a SchedulerJob is detected as "dead" (as determined by
   :ref:`config:scheduler__scheduler_health_check_threshold`) any running or
   queued tasks that were launched by the dead process will be "adopted" and
   monitored by this scheduler instead.
+
+- :ref:`config:scheduler__dag_dir_list_interval`
+  How often (in seconds) to scan the DAGs directory for new files.
+
+- :ref:`config:scheduler__file_parsing_sort_mode`
+  The scheduler will list and sort the DAG files to decide the parsing order.
+
+- :ref:`config:scheduler__max_tis_per_query`
+  The batch size of queries in the scheduling main loop. If this is too high, SQL query
+  performance may be impacted by one or more of the following:
+
+  - reversion to full table scan - complexity of query predicate

Review comment:
       I don't think this is true -- it looks like it just applies a `LIMIT` to the query.
   
   (i.e. I can't see anywhere the gets TIs and then passes `n` of these to TI.filter_for_tis)

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -180,10 +338,43 @@ The following config settings can be used to control aspects of the Scheduler HA
   SchedulerJobs.
 
   This setting controls how a dead scheduler will be noticed and the tasks it
-  was "supervising" get picked up by another scheduler. (The tasks will stay
-  running, so there is no harm in not detecting this for a while.)
+  was "supervising" get picked up by another scheduler. The tasks will stay
+  running, so there is no harm in not detecting this for a while.
 
   When a SchedulerJob is detected as "dead" (as determined by
   :ref:`config:scheduler__scheduler_health_check_threshold`) any running or
   queued tasks that were launched by the dead process will be "adopted" and
   monitored by this scheduler instead.
+
+- :ref:`config:scheduler__dag_dir_list_interval`
+  How often (in seconds) to scan the DAGs directory for new files.
+
+- :ref:`config:scheduler__file_parsing_sort_mode`
+  The scheduler will list and sort the DAG files to decide the parsing order.
+
+- :ref:`config:scheduler__max_tis_per_query`
+  The batch size of queries in the scheduling main loop. If this is too high, SQL query
+  performance may be impacted by one or more of the following:
+
+  - reversion to full table scan - complexity of query predicate
+  - excessive locking
+
+  Additionally, you may hit the maximum allowable query length for your db.
+  Set this to 0 for no limit (not advised).
+
+- :ref:`config:scheduler__min_file_process_interval`
+  Number of seconds after which a DAG file is parsed. The DAG file is parsed every

Review comment:
       ```suggestion
     Number of seconds after which a DAG file is reparsed. The DAG file is parsed every
   ```

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -180,10 +338,43 @@ The following config settings can be used to control aspects of the Scheduler HA
   SchedulerJobs.
 
   This setting controls how a dead scheduler will be noticed and the tasks it
-  was "supervising" get picked up by another scheduler. (The tasks will stay
-  running, so there is no harm in not detecting this for a while.)
+  was "supervising" get picked up by another scheduler. The tasks will stay
+  running, so there is no harm in not detecting this for a while.
 
   When a SchedulerJob is detected as "dead" (as determined by
   :ref:`config:scheduler__scheduler_health_check_threshold`) any running or
   queued tasks that were launched by the dead process will be "adopted" and
   monitored by this scheduler instead.
+
+- :ref:`config:scheduler__dag_dir_list_interval`
+  How often (in seconds) to scan the DAGs directory for new files.
+
+- :ref:`config:scheduler__file_parsing_sort_mode`
+  The scheduler will list and sort the DAG files to decide the parsing order.
+
+- :ref:`config:scheduler__max_tis_per_query`
+  The batch size of queries in the scheduling main loop. If this is too high, SQL query
+  performance may be impacted by one or more of the following:
+
+  - reversion to full table scan - complexity of query predicate
+  - excessive locking
+
+  Additionally, you may hit the maximum allowable query length for your db.
+  Set this to 0 for no limit (not advised).
+
+- :ref:`config:scheduler__min_file_process_interval`
+  Number of seconds after which a DAG file is parsed. The DAG file is parsed every
+  min_file_process_interval number of seconds. Updates to DAGs are reflected after
+  this interval. Keeping this number low will increase CPU usage.
+
+- :ref:`config:scheduler__parsing_processes`
+  The scheduler can run multiple processes in parallel to parse DAGs. This defines
+  how many processes will run.
+
+- :ref:`config:scheduler__processor_poll_interval`
+  The number of seconds to wait between consecutive DAG file processing.

Review comment:
       This parameter is now badly named, as this is nothing to do with file processing anymore. Whoops.
   
   This controls how long the scheduler will sleep between loops, but _iff_ there was nothing to do in the loop. i.e. if it scheduled something then it will start the next loop iteration straight away.

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,18 +141,173 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to include a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuously reading DAGs)
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are (i.e. how fast they can be parsed, how many tasks and dependencies they have)
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See :doc:`/best-practices`)

Review comment:
       ```suggestion
       * whether parsing your DAG file involves heavy processing at the top level (Hint! It should not. See :ref:`best-practices/top_level_code`)
   ```
   

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,18 +141,173 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to include a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuously reading DAGs)
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are (i.e. how fast they can be parsed, how many tasks and dependencies they have)
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See :doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new DAG runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking
+
+In order to perform fine-tuning, it's good to understand how Scheduler works under-the-hood.
+You can take a look at the ``Airflow Summit 2021``

Review comment:
       ```suggestion
   You can take a look at the Airflow Summit 2021 talk
   ```

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,18 +141,173 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to include a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuously reading DAGs)
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are (i.e. how fast they can be parsed, how many tasks and dependencies they have)
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See :doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new DAG runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking
+
+In order to perform fine-tuning, it's good to understand how Scheduler works under-the-hood.
+You can take a look at the ``Airflow Summit 2021``
+`Deep Dive into the Airflow Scheduler talk <https://youtu.be/DYC4-xElccE>`_ to perform the fine-tuning.
+
+How to approach Scheduler's fine-tuning
+"""""""""""""""""""""""""""""""""""""""
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but it's a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are ok with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, whereas some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* its extremely important to monitor your system with the right set of tools that you usually use to
+  monitor your system. This document does not go into details of particular metrics and tools that you
+  can use, it just describes what kind of resources you should monitor, but you should follow your best
+  practices for monitoring to grab the right data.
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are the usual limiting factors
+* based on your expectations and observations - decide what is your next improvement and go back to
+  the observation of your performance, bottlenecks. Performance improvement is an iterative process.
+
+What resources might limit Scheduler's performance
+""""""""""""""""""""""""""""""""""""""""""""""""""
+
+There are several areas of resource usage that you should pay attention to:
+
+* FileSystem performance. Airflow Scheduler relies heavily on parsing (sometimes a lot) of Python
+  files, which are often located on a shared filesystem. Airflow Scheduler continuously reads and
+  re-parses those files. The same files have to be made available to workers, so often they are
+  stored in a distributed filesystem. You can use various filesystems for that purpose (NFS, CIFS, EFS,
+  GCS fuse, Azure File System are good examples). There are various parameters you can control for those
+  filesystems and fine-tune their performance, but this is beyond the scope of this document. You should
+  observe statistics and usage of your filesystem to determine if problems come from the filesystem
+  performance. For example there are anecdotal evidences that increasing IOPS (and paying more) for the
+  EFS performance, dramatically improves stability and speed of parsing Airflow DAGs when EFS is used.
+* Another solution to FileSystem performance, if it becomes your bottleneck, is to turn to alternative
+  mechanisms of distributing your DAGs. Embedding DAGs in your image and GitSync distribution have both
+  the property that the files are available locally for Scheduler and it does not have to use a
+  distributed filesystem to read the files, the files are available locally for the Scheduler and it is
+  usually as fast as it can be, especially if your machines use fast SSD disks for local storage. Those
+  distribution mechanisms have other characteristics that might make them not the best choice for you,
+  but if your problems with performance come from distributed filesystem performance, they might be the
+  best approach to follow.
+* Database connections and Database usage might become a problem as you want to increase performance and
+  process more things in parallel. Airflow is known from being "database-connection hungry" - the more DAGs
+  you have and the more you want to process in parallel, the more database connections will be opened.
+  This is generally not a problem for MySQL as its model of handling connections is thread-based, but this
+  might be a problem for Postgres, where connection handling is process-based. It is a general consensus
+  that if you have even medium size Postgres-based Airflow installation, the best solution is to use
+  `PGBouncer <https://www.pgbouncer.org/>`_ as a proxy to your database. The :doc:`helm-chart:index`
+  supports PGBouncer out-of-the-box. For MsSQL we have not yet worked out the best practices as support
+  for MsSQL is still experimental.
+* CPU usage is most important for FileProcessors - those are the processes that parse and execute
+  Python DAG files. Since Schedulers triggers such parsing continuously, when you have a lot of DAGs,
+  the processing might take a lot of CPU. You can mitigate it by decreasing the
+  :ref:`config:scheduler__min_file_process_interval`, but this is one of the mentioned trade-offs,
+  result of this is that changes to such files will be picked up slower and you will see delays between
+  submitting the files and getting them available in Airflow UI and executed by Scheduler. Optimizing
+  the way how your DAGs are built, avoiding external data sources is your best approach to improve CPU
+  usage. If you have more CPUs available, you can increase number of processing threads
+  :ref:`config:scheduler__parsing_processes`, Also Airflow Scheduler scales almost linearly with
+  several instances, so you can also add more Schedulers if your Scheduler's performance is CPU-bound.
+* Airflow might use quite significant amount of memory when you try to get more performance out of it.
+  Often more performance is achieved in Airflow by increasing number of processes handling the load,
+  and each process requires whole interpreter of Python loaded, a lot of classes imported, temporary

Review comment:
       Because of forking and copy-on-write, each process doesn't pay the full cost of this memory usage: it is shared between processes,
   
   i.e. one process might have 300Mb of active memory, but a second process doesn't need an extra 300Mb, it needs _almost_ nothing extra apart from the data it processes in memory.

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,18 +141,173 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to include a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuously reading DAGs)
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)

Review comment:
       ```suggestion
       * how large the DAG files are (remember dag parser needs to read and parse the file every n seconds)
   ```

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,18 +141,173 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to include a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuously reading DAGs)
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are (i.e. how fast they can be parsed, how many tasks and dependencies they have)
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See :doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new DAG runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks

Review comment:
       I think we shouldn't mention this -- as modulo bugs that we've fixed this is _awlays_ going to be better as it distributes the work, and also performs scheduling on a smaller subset of the task graph!
   
   ```suggestion
   ```

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,18 +141,173 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to include a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuously reading DAGs)
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are (i.e. how fast they can be parsed, how many tasks and dependencies they have)
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See :doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new DAG runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking
+
+In order to perform fine-tuning, it's good to understand how Scheduler works under-the-hood.
+You can take a look at the ``Airflow Summit 2021``
+`Deep Dive into the Airflow Scheduler talk <https://youtu.be/DYC4-xElccE>`_ to perform the fine-tuning.
+
+How to approach Scheduler's fine-tuning
+"""""""""""""""""""""""""""""""""""""""
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but it's a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are ok with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, whereas some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* its extremely important to monitor your system with the right set of tools that you usually use to
+  monitor your system. This document does not go into details of particular metrics and tools that you
+  can use, it just describes what kind of resources you should monitor, but you should follow your best
+  practices for monitoring to grab the right data.
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are the usual limiting factors
+* based on your expectations and observations - decide what is your next improvement and go back to
+  the observation of your performance, bottlenecks. Performance improvement is an iterative process.
+
+What resources might limit Scheduler's performance
+""""""""""""""""""""""""""""""""""""""""""""""""""
+
+There are several areas of resource usage that you should pay attention to:
+
+* FileSystem performance. Airflow Scheduler relies heavily on parsing (sometimes a lot) of Python

Review comment:
       This is less of a factor to scheduler overall, so I think we should move this further down the list.

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,18 +141,173 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to include a number of factors:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuously reading DAGs)
+    * how fast the filesystem is (in many cases of distributed cloud filesystem you can pay extra to get
+      more throughput/faster filesystem
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember scheduler needs to read and parse the file every n seconds)
+    * how complex they are (i.e. how fast they can be parsed, how many tasks and dependencies they have)
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See :doc:`/best-practices`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances scheduler processes in one loop
+   * How many new DAG runs should be created/scheduled per loop
+   * Whether to execute "mini-scheduler" after completed task to speed up scheduling dependent tasks
+   * How often the scheduler should perform cleanup and check for orphaned tasks/adopting them
+   * Whether scheduler uses row-level locking
+
+In order to perform fine-tuning, it's good to understand how Scheduler works under-the-hood.
+You can take a look at the ``Airflow Summit 2021``
+`Deep Dive into the Airflow Scheduler talk <https://youtu.be/DYC4-xElccE>`_ to perform the fine-tuning.
+
+How to approach Scheduler's fine-tuning
+"""""""""""""""""""""""""""""""""""""""
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but it's a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are ok with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, whereas some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* its extremely important to monitor your system with the right set of tools that you usually use to
+  monitor your system. This document does not go into details of particular metrics and tools that you
+  can use, it just describes what kind of resources you should monitor, but you should follow your best
+  practices for monitoring to grab the right data.
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are the usual limiting factors
+* based on your expectations and observations - decide what is your next improvement and go back to
+  the observation of your performance, bottlenecks. Performance improvement is an iterative process.
+
+What resources might limit Scheduler's performance
+""""""""""""""""""""""""""""""""""""""""""""""""""
+
+There are several areas of resource usage that you should pay attention to:
+
+* FileSystem performance. Airflow Scheduler relies heavily on parsing (sometimes a lot) of Python
+  files, which are often located on a shared filesystem. Airflow Scheduler continuously reads and
+  re-parses those files. The same files have to be made available to workers, so often they are
+  stored in a distributed filesystem. You can use various filesystems for that purpose (NFS, CIFS, EFS,
+  GCS fuse, Azure File System are good examples). There are various parameters you can control for those
+  filesystems and fine-tune their performance, but this is beyond the scope of this document. You should
+  observe statistics and usage of your filesystem to determine if problems come from the filesystem
+  performance. For example there are anecdotal evidences that increasing IOPS (and paying more) for the
+  EFS performance, dramatically improves stability and speed of parsing Airflow DAGs when EFS is used.
+* Another solution to FileSystem performance, if it becomes your bottleneck, is to turn to alternative
+  mechanisms of distributing your DAGs. Embedding DAGs in your image and GitSync distribution have both
+  the property that the files are available locally for Scheduler and it does not have to use a
+  distributed filesystem to read the files, the files are available locally for the Scheduler and it is
+  usually as fast as it can be, especially if your machines use fast SSD disks for local storage. Those
+  distribution mechanisms have other characteristics that might make them not the best choice for you,
+  but if your problems with performance come from distributed filesystem performance, they might be the
+  best approach to follow.
+* Database connections and Database usage might become a problem as you want to increase performance and
+  process more things in parallel. Airflow is known from being "database-connection hungry" - the more DAGs
+  you have and the more you want to process in parallel, the more database connections will be opened.
+  This is generally not a problem for MySQL as its model of handling connections is thread-based, but this
+  might be a problem for Postgres, where connection handling is process-based. It is a general consensus
+  that if you have even medium size Postgres-based Airflow installation, the best solution is to use
+  `PGBouncer <https://www.pgbouncer.org/>`_ as a proxy to your database. The :doc:`helm-chart:index`
+  supports PGBouncer out-of-the-box. For MsSQL we have not yet worked out the best practices as support
+  for MsSQL is still experimental.
+* CPU usage is most important for FileProcessors - those are the processes that parse and execute
+  Python DAG files. Since Schedulers triggers such parsing continuously, when you have a lot of DAGs,
+  the processing might take a lot of CPU. You can mitigate it by decreasing the
+  :ref:`config:scheduler__min_file_process_interval`, but this is one of the mentioned trade-offs,
+  result of this is that changes to such files will be picked up slower and you will see delays between
+  submitting the files and getting them available in Airflow UI and executed by Scheduler. Optimizing
+  the way how your DAGs are built, avoiding external data sources is your best approach to improve CPU
+  usage. If you have more CPUs available, you can increase number of processing threads
+  :ref:`config:scheduler__parsing_processes`, Also Airflow Scheduler scales almost linearly with
+  several instances, so you can also add more Schedulers if your Scheduler's performance is CPU-bound.
+* Airflow might use quite significant amount of memory when you try to get more performance out of it.
+  Often more performance is achieved in Airflow by increasing number of processes handling the load,
+  and each process requires whole interpreter of Python loaded, a lot of classes imported, temporary
+  in-memory storage. This can lead to memory pressure. You need to observe if your system is using
+  more memory than it has - which results with using swap disk, which dramatically decreases performance.
+  Note that Airflow Scheduler in versions prior to ``2.1.4`` generated a lot of ``Page Cache`` memory
+  used by log files (when the log files were not removed). This was generally harmless, as the memory
+  is just cache and could be reclaimed at any time by the system, however in version ``2.1.4`` and
+  beyond, writing logs will not generate excessive ``Page Cache`` memory. Regardless - make sure when you look
+  at memory usage, pay attention to the kind of memory you are observing. Usually you should look at
+  ``working memory``(names might vary depending on your deployment) rather than ``total memory used``.
+
+What can you do, to improve Scheduler's performance
+"""""""""""""""""""""""""""""""""""""""""""""""""""
+
+When you know what your resource usage is, the improvements that you can consider might be:
+
+* improve the logic, efficiency of parsing and reduce complexity of your top-level DAG Python code. It is
+  parsed continuously so optimizing that code might bring tremendous improvements, especially if you try
+  to reach out to some external databases etc. while parsing DAGs (this should be avoided at all cost).
+  The :ref:`best_practices/top_level_code` explains what are the best practices for writing your top-level
+  Python code. The :ref:`best_practices/reducing_dag_complexity` document provides some ares that you might
+  look at when you want to reduce complexity of your code.
+* improve utilization of your resources. This is when you have a free capacity in your system that
+  seems underutilized (again CPU, memory I/O, networking are the prime candidates) - you can take
+  actions like increasing number of schedulers, parsing processes or decreasing intervals for more
+  frequent actions might bring improvements in performance at the expense of higher utilization of those.
+* increase hardware capacity (for example if you see that CPU is limiting you or that I/O you use for
+  DAG filesystem is at its limits). Often the problem with scheduler performance is
+  simply because your system is not "capable" enough and this might be the only way. For example if
+  you see that you are using all CPU you have on machine, you might want to add another scheduler on
+  a new machine - in most cases, when you add 2nd or 3rd scheduler, the capacity of scheduling grows
+  linearly (unless the shared database or filesystem is a bottleneck).
+* experiment with different values for the "scheduler tunables". Often you might get better effects by
+  simply exchanging one performance aspect for another. For example if you want to decrease the
+  CPU usage, you might increase file processing interval (but the result will be that new DAGs will
+  appear with bigger delay). Usually performance tuning is the art of balancing different aspects.
+* sometimes you change scheduler behaviour slightly (for example change parsing sort order)
+  in order to get better fine-tuned results for your particular deployment.
+
+
 .. _scheduler:ha:tunables:
 
-Scheduler Tuneables
-"""""""""""""""""""
+Scheduler Configuration options
+"""""""""""""""""""""""""""""""
 
-The following config settings can be used to control aspects of the Scheduler HA loop.
+The following config settings can be used to control aspects of the Scheduler.
+However you can also look at other non-performance-related scheduler configuration parameters available at
+:doc:`../configurations-ref` in ``[scheduler]`` section.
 
 - :ref:`config:scheduler__max_dagruns_to_create_per_loop`
 
-  This changes the number of dags that are locked by each scheduler when
-  creating dag runs. One possible reason for setting this lower is if you
-  have huge dags and are running multiple schedules, you won't want one
+  This changes the number of DAGs that are locked by each scheduler when
+  creating DAG runs. One possible reason for setting this lower is if you
+  have huge DAGs and are running multiple schedules, you won't want one

Review comment:
       ```suggestion
     have huge DAGs (in the order of 10k+ tasks per DAG) and are running multiple schedulers, you won't want one
   ```

##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -180,10 +338,43 @@ The following config settings can be used to control aspects of the Scheduler HA
   SchedulerJobs.
 
   This setting controls how a dead scheduler will be noticed and the tasks it
-  was "supervising" get picked up by another scheduler. (The tasks will stay
-  running, so there is no harm in not detecting this for a while.)
+  was "supervising" get picked up by another scheduler. The tasks will stay
+  running, so there is no harm in not detecting this for a while.
 
   When a SchedulerJob is detected as "dead" (as determined by
   :ref:`config:scheduler__scheduler_health_check_threshold`) any running or
   queued tasks that were launched by the dead process will be "adopted" and
   monitored by this scheduler instead.
+
+- :ref:`config:scheduler__dag_dir_list_interval`
+  How often (in seconds) to scan the DAGs directory for new files.
+
+- :ref:`config:scheduler__file_parsing_sort_mode`
+  The scheduler will list and sort the DAG files to decide the parsing order.
+
+- :ref:`config:scheduler__max_tis_per_query`
+  The batch size of queries in the scheduling main loop. If this is too high, SQL query
+  performance may be impacted by one or more of the following:
+
+  - reversion to full table scan - complexity of query predicate
+  - excessive locking
+
+  Additionally, you may hit the maximum allowable query length for your db.
+  Set this to 0 for no limit (not advised).
+
+- :ref:`config:scheduler__min_file_process_interval`
+  Number of seconds after which a DAG file is parsed. The DAG file is parsed every
+  min_file_process_interval number of seconds. Updates to DAGs are reflected after
+  this interval. Keeping this number low will increase CPU usage.
+
+- :ref:`config:scheduler__parsing_processes`
+  The scheduler can run multiple processes in parallel to parse DAGs. This defines

Review comment:
       ```suggestion
     The scheduler can run multiple processes in parallel to parse DAG files. This defines
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712914552



##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -241,11 +243,51 @@ each parameter by following the links):
 * :ref:`config:scheduler__parsing_processes`
 * :ref:`config:scheduler__file_parsing_sort_mode`
 
+.. _best_practices/reducing_dag_complexity:
+
+Reducing DAG complexity
+^^^^^^^^^^^^^^^^^^^^^^^
+
+While Airflow is good in handling a lot of DAGs with a lot of task and dependencies between them, when you
+have many complex DAGs, their complexity might impact performance of scheduling. One of the ways to keep
+your Airflow instance performant and well utilized, you should strive to simplify and optimize your DAGs
+whenever possible - you have to remember that DAG parsing process and creation is just executing
+Python code and it's up to you to make it as performant as possible. There are no magic recipes for making
+your DAG "less complex" - since this is a Python code, it's the DAG writer who controls the complexity of
+their code.
+
+There are no "metrics" for DAG complexity, especially, there are no metrics that can tell you
+whether your DAG is "simple enough". However - as with any Python code you can definitely tell that
+your code is "simpler" or "faster" when you optimize it, the same can be said about DAG code. If you
+want to optimize your DAGs there are the following actions you can take:
+
+* Make your DAG load faster. This is a single improvement advice that might be implemented in various ways
+  but this is the one that has biggest impact on scheduler's performance. Whenever you have a chance to make
+  your DAG load faster - go for it, if your goal is to improve performance. See below
+  :ref:`best_practices/dag_loader_test` on how to asses your DAG loading time.
+
+* Make your DAG generate fewer tasks. Every task adds additional processing overhead for scheduling and
+  execution. If you can decrease the number of tasks that your DAG use, this will likely improve overall
+  scheduling and performance (however be aware that Airflow's flexibility comes from splitting the
+  work between multiple independent and sometimes parallel tasks and it makes it easier to reason
+  about the logic of your DAG when it is split to a number independent, standalone tasks. Also Airflow allows
+  to re-run only specific tasks when needed which might improve maintainability of the DAG - so you have to
+  strike the right balance between optimization, readability and maintainability which is best for your team.
+
+* Make smaller number of DAGs per file. While Airflow 2 is optimized for the case of having multiple DAGs
+  in one file, there are some parts of the system that make it sometimes less performant, or introduce more
+  delays than having those DAGs split among many files. Just the fact that one file can only be parsed by one
+  FileProcessor, makes it less scalable for example. If you have many DAGs generated from one file,
+  consider splitting them if you observe processing and scheduling delays.

Review comment:
       Good point. However, I think for many people Scheduling and File Proces are part of the same "performance" camp (at least for now) when they are run as the same "scheduler". They might visibly impact the "perception" that scheduling is slow - even if it's the "parsing" that causes the perception (for users it does not really matter if the problem is with "parsing" or scheduling - when the submit a new version of DAG, they will see delays simply. And when they debug performance problems they do not know whether it's parser or scheduler, so I'd say we are explicit that this is both for Parsing and Scheduling, but until we separate it in the way that we can clearly decouple those from users, I'd keep those two together (maybe better explained when things relate to scheduler and when to parser).
   
   WDYT?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712927003



##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -180,10 +338,43 @@ The following config settings can be used to control aspects of the Scheduler HA
   SchedulerJobs.
 
   This setting controls how a dead scheduler will be noticed and the tasks it
-  was "supervising" get picked up by another scheduler. (The tasks will stay
-  running, so there is no harm in not detecting this for a while.)
+  was "supervising" get picked up by another scheduler. The tasks will stay
+  running, so there is no harm in not detecting this for a while.
 
   When a SchedulerJob is detected as "dead" (as determined by
   :ref:`config:scheduler__scheduler_health_check_threshold`) any running or
   queued tasks that were launched by the dead process will be "adopted" and
   monitored by this scheduler instead.
+
+- :ref:`config:scheduler__dag_dir_list_interval`
+  How often (in seconds) to scan the DAGs directory for new files.
+
+- :ref:`config:scheduler__file_parsing_sort_mode`
+  The scheduler will list and sort the DAG files to decide the parsing order.
+
+- :ref:`config:scheduler__max_tis_per_query`
+  The batch size of queries in the scheduling main loop. If this is too high, SQL query
+  performance may be impacted by one or more of the following:
+
+  - reversion to full table scan - complexity of query predicate
+  - excessive locking
+
+  Additionally, you may hit the maximum allowable query length for your db.
+  Set this to 0 for no limit (not advised).
+
+- :ref:`config:scheduler__min_file_process_interval`
+  Number of seconds after which a DAG file is parsed. The DAG file is parsed every
+  min_file_process_interval number of seconds. Updates to DAGs are reflected after
+  this interval. Keeping this number low will increase CPU usage.
+
+- :ref:`config:scheduler__parsing_processes`
+  The scheduler can run multiple processes in parallel to parse DAGs. This defines
+  how many processes will run.
+
+- :ref:`config:scheduler__processor_poll_interval`
+  The number of seconds to wait between consecutive DAG file processing.

Review comment:
       I will explain it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712927003



##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -180,10 +338,43 @@ The following config settings can be used to control aspects of the Scheduler HA
   SchedulerJobs.
 
   This setting controls how a dead scheduler will be noticed and the tasks it
-  was "supervising" get picked up by another scheduler. (The tasks will stay
-  running, so there is no harm in not detecting this for a while.)
+  was "supervising" get picked up by another scheduler. The tasks will stay
+  running, so there is no harm in not detecting this for a while.
 
   When a SchedulerJob is detected as "dead" (as determined by
   :ref:`config:scheduler__scheduler_health_check_threshold`) any running or
   queued tasks that were launched by the dead process will be "adopted" and
   monitored by this scheduler instead.
+
+- :ref:`config:scheduler__dag_dir_list_interval`
+  How often (in seconds) to scan the DAGs directory for new files.
+
+- :ref:`config:scheduler__file_parsing_sort_mode`
+  The scheduler will list and sort the DAG files to decide the parsing order.
+
+- :ref:`config:scheduler__max_tis_per_query`
+  The batch size of queries in the scheduling main loop. If this is too high, SQL query
+  performance may be impacted by one or more of the following:
+
+  - reversion to full table scan - complexity of query predicate
+  - excessive locking
+
+  Additionally, you may hit the maximum allowable query length for your db.
+  Set this to 0 for no limit (not advised).
+
+- :ref:`config:scheduler__min_file_process_interval`
+  Number of seconds after which a DAG file is parsed. The DAG file is parsed every
+  min_file_process_interval number of seconds. Updates to DAGs are reflected after
+  this interval. Keeping this number low will increase CPU usage.
+
+- :ref:`config:scheduler__parsing_processes`
+  The scheduler can run multiple processes in parallel to parse DAGs. This defines
+  how many processes will run.
+
+- :ref:`config:scheduler__processor_poll_interval`
+  The number of seconds to wait between consecutive DAG file processing.

Review comment:
       I will explain it (and we might want to rename the parameter separately)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712052857



##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -187,3 +279,36 @@ The following config settings can be used to control aspects of the Scheduler HA
   :ref:`config:scheduler__scheduler_health_check_threshold`) any running or
   queued tasks that were launched by the dead process will be "adopted" and
   monitored by this scheduler instead.
+
+- :ref:`config:scheduler__dag_dir_list_interval`
+  How often (in seconds) to scan the DAGs directory for new files.
+
+- :ref:`config:scheduler__file_parsing_sort_mode`
+  The scheduler will list and sort the dag files to decide the parsing order.
+
+- :ref:`config:scheduler__max_tis_per_query`
+  The batch size of queries in the scheduling main loop. If this is too high, SQL query
+  performance may be impacted by one or more of the following:
+
+  - reversion to full table scan - complexity of query predicate
+  - excessive locking
+
+  Additionally, you may hit the maximum allowable query length for your db.
+  Set this to 0 for no limit (not advised)
+
+- :ref:`config:scheduler__min_file_process_interval`
+  Number of seconds after which a DAG file is parsed. The DAG file is parsed every
+  min_file_process_interval number of seconds. Updates to DAGs are reflected after
+  this interval. Keeping this number low will increase CPU usage.
+
+- :ref:`config:scheduler__parsing_processes`
+  The scheduler can run multiple processes in parallel to parse dags. This defines
+  how many processes will run.
+
+- :ref:`config:scheduler__processor_poll_interval`
+  The number of seconds to wait between consecutive DAG file processing
+
+- :ref:`config:scheduler__schedule_after_task_execution`
+  Should the Task supervisor process perform a “mini scheduler” to attempt to schedule more tasks of
+  the same DAG. Leaving this on will mean tasks in the same DAG execute quicker,
+  but might starve out other dags in some circumstances

Review comment:
       Don;t know. took it from the reference. I think this might be a bit different PR to review description of those config variables - as this also goes into "airflow.cfg" and the .yaml with configuration.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#issuecomment-923373712


   @BasPH @edwardwang888  - any more comments :)? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on a change in pull request #18356: Explain scheduler fine-tuning better

Posted by GitBox <gi...@apache.org>.

potiuk commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712939727



##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -241,11 +243,51 @@ each parameter by following the links):
 * :ref:`config:scheduler__parsing_processes`
 * :ref:`config:scheduler__file_parsing_sort_mode`
 
+.. _best_practices/reducing_dag_complexity:
+
+Reducing DAG complexity
+^^^^^^^^^^^^^^^^^^^^^^^
+
+While Airflow is good in handling a lot of DAGs with a lot of task and dependencies between them, when you
+have many complex DAGs, their complexity might impact performance of scheduling. One of the ways to keep
+your Airflow instance performant and well utilized, you should strive to simplify and optimize your DAGs
+whenever possible - you have to remember that DAG parsing process and creation is just executing
+Python code and it's up to you to make it as performant as possible. There are no magic recipes for making
+your DAG "less complex" - since this is a Python code, it's the DAG writer who controls the complexity of
+their code.
+
+There are no "metrics" for DAG complexity, especially, there are no metrics that can tell you
+whether your DAG is "simple enough". However - as with any Python code you can definitely tell that
+your code is "simpler" or "faster" when you optimize it, the same can be said about DAG code. If you
+want to optimize your DAGs there are the following actions you can take:
+
+* Make your DAG load faster. This is a single improvement advice that might be implemented in various ways
+  but this is the one that has biggest impact on scheduler's performance. Whenever you have a chance to make
+  your DAG load faster - go for it, if your goal is to improve performance. See below
+  :ref:`best_practices/dag_loader_test` on how to asses your DAG loading time.
+
+* Make your DAG generate fewer tasks. Every task adds additional processing overhead for scheduling and
+  execution. If you can decrease the number of tasks that your DAG use, this will likely improve overall
+  scheduling and performance (however be aware that Airflow's flexibility comes from splitting the

Review comment:
       Updated.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org