Posted to commits@airflow.apache.org by po...@apache.org on 2021/09/23 15:19:08 UTC

[airflow] branch v2-1-test updated (6c79a01 -> 1a598ad)

This is an automated email from the ASF dual-hosted git repository.

potiuk pushed a change to branch v2-1-test
in repository https://gitbox.apache.org/repos/asf/airflow.git.


 discard 6c79a01  Explain scheduler fine-tuning better (#18356)
     new 1a598ad  Explain scheduler fine-tuning better (#18356)

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (6c79a01)
            \
             N -- N -- N   refs/heads/v2-1-test (1a598ad)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

The 1 revision listed above as "new" is entirely new to this
repository and will be described in a separate email.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 airflow/config_templates/config.yml          | 140 +++++++++------------------
 airflow/config_templates/default_airflow.cfg |  92 +++++++-----------
 docs/apache-airflow/concepts/scheduler.rst   |  11 ---
 docs/spelling_wordlist.txt                   |   1 +
 4 files changed, 85 insertions(+), 159 deletions(-)

[airflow] 01/01: Explain scheduler fine-tuning better (#18356)

Posted by po...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

potiuk pushed a commit to branch v2-1-test
in repository https://gitbox.apache.org/repos/asf/airflow.git

commit 1a598ad092a4e1061139bb431745dc3ae467a2dc
Author: Jarek Potiuk <ja...@potiuk.com>
AuthorDate: Tue Sep 21 22:26:47 2021 +0200

    Explain scheduler fine-tuning better (#18356)
    
    * Explain scheduler fine-tuning better
    
    A lot of users expect that the Airflow Scheduler will
    `just work` and deliver `optimal performance` for them, without
    realising that in complex systems such as Airflow you
    often have to decide what to optimise for, accept some
    trade-offs, or increase hardware capacity if you are not willing to
    make those trade-offs.
    
    Also, it's not clear where the responsibility
    lies - should it `just work`, or should the user be responsible for
    understanding and fine-tuning their system? Both approaches are
    possible; there are some complex systems which use a lot of
    automation/AI etc. to fine-tune and optimise their behaviour, but
    Airflow expects the users to know a bit more about how the
    scheduling works, and Airflow maintainers deliver a lot of
    knobs that can be turned to fine-tune the system and to make
    trade-off decisions. This was not explicitly stated in our
    documentation, and users could have different expectations about
    it (and they often did, judging from the issues they raised).
    
    This PR adds a "fine-tuning" chapter that aims to set the
    users' expectations at the right level - it explains what
    Airflow provides, but also what the user's responsibility is: to
    decide what they are optimising for, to see where their bottlenecks
    are, and to decide whether they need to change the configuration or
    increase hardware capacity (or make appropriate trade-offs).
    
    It also brings more of the fine-tuning parameters into the
    `tuneables` section of the scheduler docs, based on some of the
    recent questions asked by users - it seems that having a dedicated
    overview of all performance-impacting parameters is a good idea,
    and we only had a very limited subset of those.
    
    Some users prefer to `watch` rather than read; that's why this PR
    also adds a link to the recording of a talk from the
    Airflow Summit 2021 where Ash describes - in a very concise
    and easy-to-grasp way - all the whys and hows of the scheduler.
    If you understand why and how the scheduler does what it does,
    fine-tuning decisions are much easier.
    
    * fixup! Explain scheduler fine-tuning better
    
    (cherry picked from commit eed2ef65e1d1283fa9a34e6002f456b0aceb17c1)
---
 docs/apache-airflow/best-practices.rst     | 177 ++++++++++++++++++++--
 docs/apache-airflow/concepts/scheduler.rst | 229 ++++++++++++++++++++++++++---
 docs/spelling_wordlist.txt                 |   4 +-
 3 files changed, 376 insertions(+), 34 deletions(-)

diff --git a/docs/apache-airflow/best-practices.rst b/docs/apache-airflow/best-practices.rst
index 6b88776..15ef926 100644
--- a/docs/apache-airflow/best-practices.rst
+++ b/docs/apache-airflow/best-practices.rst
@@ -88,8 +88,10 @@ and the downstream tasks can pull the path from XCom and use it to read the data
 The tasks should also not store any authentication parameters such as passwords or token inside them.
 Where at all possible, use :doc:`Connections </concepts/connections>` to store data securely in Airflow backend and retrieve them using a unique connection id.
 
-Top level Python Code and Dynamic DAGs
---------------------------------------
+.. _best_practices/top_level_code:
+
+Top level Python Code
+---------------------
 
 You should avoid writing the top level code which is not necessary to create Operators
 and build DAG relations between them. This is because of the design decision for the scheduler of Airflow
@@ -103,15 +105,87 @@ in DAGs is correctly reflected in scheduled tasks.
 
 Specifically you should not run any database access, heavy computations and networking operations.
 
-This limitation is especially important in case of dynamic DAG configuration, which can be configured
-essentially in one of those ways:
+One important factor impacting DAG loading time that might be overlooked by Python developers is
+that top-level imports might take a surprisingly long time and can generate a lot of overhead,
+and this can easily be avoided by converting them to local imports inside Python callables, for example.
+
+Consider the example below - the first DAG will parse significantly slower (on the order of seconds)
+than the equivalent DAG where the ``numpy`` module is imported as a local import in the callable.
+
+Bad example:
+
+.. code-block:: python
+
+  from datetime import datetime
+
+  from airflow import DAG
+  from airflow.operators.python import PythonOperator
+
+  import numpy as np  # <-- THIS IS A VERY BAD IDEA! DON'T DO THAT!
+
+  with DAG(
+      dag_id="example_python_operator",
+      schedule_interval=None,
+      start_date=datetime(2021, 1, 1),
+      catchup=False,
+      tags=["example"],
+  ) as dag:
+
+      def print_array():
+          """Print Numpy array."""
+          a = np.arange(15).reshape(3, 5)
+          print(a)
+          return a
+
+      run_this = PythonOperator(
+          task_id="print_the_context",
+          python_callable=print_array,
+      )
+
+Good example:
+
+.. code-block:: python
+
+  from datetime import datetime
+
+  from airflow import DAG
+  from airflow.operators.python import PythonOperator
+
+  with DAG(
+      dag_id="example_python_operator",
+      schedule_interval=None,
+      start_date=datetime(2021, 1, 1),
+      catchup=False,
+      tags=["example"],
+  ) as dag:
+
+      def print_array():
+          """Print Numpy array."""
+          import numpy as np  # <- THIS IS HOW NUMPY SHOULD BE IMPORTED IN THIS CASE
+
+          a = np.arange(15).reshape(3, 5)
+          print(a)
+          return a
+
+      run_this = PythonOperator(
+          task_id="print_the_context",
+          python_callable=print_array,
+      )
+
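+As a quick sanity check - a minimal sketch, not an official recipe - you can measure what a single
+top-level import costs by timing it in isolation:
+
+.. code-block:: python
+
+  import time
+
+  start = time.perf_counter()
+  import numpy as np  # the import whose cost we want to measure
+
+  print(f"numpy {np.__version__} imported in {time.perf_counter() - start:.3f}s")
+
+Every DAG file that imports such a module at the top level pays this cost on every parse, which is
+what makes top-level imports expensive.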
+
+
+Dynamic DAG Generation
+----------------------
+
+Avoiding excessive processing in top-level code, as described in the previous section, is especially important
+in the case of dynamic DAG configuration, which can essentially be configured in one of these ways:
 
 * via `environment variables <https://wiki.archlinux.org/title/environment_variables>`_ (not to be mistaken
   with the :doc:`Airflow Variables </concepts/variables>`)
 * via externally provided, generated Python code, containing meta-data in the DAG folder
 * via externally provided, generated configuration meta-data file in the DAG folder
 
-All cases are described in the following chapters.
+All cases are described in the following sections.
 
 Dynamic DAGs with environment variables
 .......................................
@@ -241,11 +315,52 @@ each parameter by following the links):
 * :ref:`config:scheduler__parsing_processes`
 * :ref:`config:scheduler__file_parsing_sort_mode`
 
+.. _best_practices/reducing_dag_complexity:
+
+Reducing DAG complexity
+^^^^^^^^^^^^^^^^^^^^^^^
+
+While Airflow is good at handling a lot of DAGs with many tasks and dependencies between them, when you
+have many complex DAGs, their complexity might impact the performance of scheduling. One of the ways to keep
+your Airflow instance performant and well utilized is to strive to simplify and optimize your DAGs
+whenever possible - you have to remember that DAG parsing and creation is just executing
+Python code and it's up to you to make it as performant as possible. There are no magic recipes for making
+your DAG "less complex" - since it is Python code, it's the DAG writer who controls the complexity of
+their code.
+
+There are no "metrics" for DAG complexity; in particular, there are no metrics that can tell you
+whether your DAG is "simple enough". However, just as with any Python code, you can definitely tell that
+your code is "simpler" or "faster" when you optimize it, and the same can be said about DAG code. If you
+want to optimize your DAGs, there are the following actions you can take:
+
+* Make your DAG load faster. This is a single piece of advice that might be implemented in various ways,
+  but it is the one that has the biggest impact on the scheduler's performance. Whenever you have a chance to make
+  your DAG load faster - go for it, if your goal is to improve performance. Look at
+  :ref:`best_practices/top_level_code` to get some tips on how you can do it. Also see
+  :ref:`best_practices/dag_loader_test` on how to assess your DAG loading time.
+
+* Make your DAG generate a simpler structure. Every task dependency adds additional processing overhead for
+  scheduling and execution. A DAG with a simple linear structure ``A -> B -> C`` will experience
+  fewer delays in task scheduling than a DAG with a deeply nested tree structure and an exponentially growing
+  number of dependent tasks, for example. If you can make your DAGs more linear - so that at any single point
+  in the execution there are as few potential candidate tasks to run as possible - this will likely improve
+  overall scheduling performance (see the sketch after this list).
+
+* Keep the number of DAGs per file small. While Airflow 2 is optimized for the case of having multiple DAGs
+  in one file, some parts of the system make it less performant, or introduce more
+  delays, than having those DAGs split among many files. Just the fact that one file can only be parsed by one
+  FileProcessor makes it less scalable, for example. If you have many DAGs generated from one file,
+  consider splitting them if you observe that it takes a long time to reflect changes to your DAG files in the
+  Airflow UI.
+
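+To illustrate the "more linear" advice above - a minimal sketch with hypothetical task names - compare
+a linear chain, where the scheduler has at most one candidate task per DAG run at any point, with a
+wide fan-out, which produces many candidates at once:
+
+.. code-block:: python
+
+  from datetime import datetime
+
+  from airflow import DAG
+  from airflow.operators.dummy import DummyOperator
+
+  with DAG(
+      dag_id="linear_structure_example",
+      schedule_interval=None,
+      start_date=datetime(2021, 1, 1),
+      catchup=False,
+  ) as dag:
+      # Linear: at most one task is a scheduling candidate at any point
+      extract = DummyOperator(task_id="extract")
+      transform = DummyOperator(task_id="transform")
+      load = DummyOperator(task_id="load")
+      extract >> transform >> load
+
+      # By contrast, a wide fan-out creates many candidates at once and
+      # adds scheduling overhead:
+      #   extract >> [DummyOperator(task_id=f"t_{i}") for i in range(100)]
+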
 Testing a DAG
 ^^^^^^^^^^^^^
 
-Airflow users should treat DAGs as production level code, and DAGs should have various associated tests to ensure that they produce expected results.
-You can write a wide variety of tests for a DAG. Let's take a look at some of them.
+Airflow users should treat DAGs as production level code, and DAGs should have various associated tests to
+ensure that they produce expected results. You can write a wide variety of tests for a DAG.
+Let's take a look at some of them.
+
+.. _best_practices/dag_loader_test:
 
 DAG Loader Test
 ---------------
@@ -255,9 +370,53 @@ No additional code needs to be written by the user to run this test.
 
 .. code-block:: bash
 
- python your-dag-file.py
+     python your-dag-file.py
+
+Running the above command without any errors ensures your DAG does not contain any uninstalled dependencies,
+syntax errors, etc. Make sure that you load your DAG in an environment that corresponds to your
+scheduler environment - with the same dependencies, environment variables, and common code referred to from the
+DAG.
+
+This is also a great way to check whether your DAG loads faster after an optimization, if you want to attempt
+to optimize DAG loading time. Simply run the DAG file and measure the time it takes, but again you have to
+make sure your DAG runs with the same dependencies, environment variables, and common code.
+
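+One way to measure this - a minimal sketch, where ``dags/your-dag-file.py`` is a placeholder path - is to
+time the parsing from within Python using ``DagBag``, which loads files in roughly the same way the
+scheduler's file processor does:
+
+.. code-block:: python
+
+  import time
+
+  from airflow.models import DagBag
+
+  start = time.perf_counter()
+  # Parse a single DAG file, skipping the bundled example DAGs
+  dag_bag = DagBag(dag_folder="dags/your-dag-file.py", include_examples=False)
+  duration = time.perf_counter() - start
+
+  print(f"Parsed {len(dag_bag.dags)} DAG(s) in {duration:.3f}s")
+  print(f"Import errors: {dag_bag.import_errors}")
+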
+Another way to measure the processing time, in a Linux environment, is to
+use the built-in ``time`` command. Make sure to run it several times in succession to account for
+caching effects. Compare the results before and after the optimization (under the same conditions - using
+the same machine, environment, etc.) in order to assess the impact of the optimization.
+
+.. code-block:: bash
+
+     time python airflow/example_dags/example_python_operator.py
+
+Result:
+
+.. code-block:: text
+
+    real    0m0.699s
+    user    0m0.590s
+    sys     0m0.108s
+
+The important metric is the "real time" - which tells you how long it took
+to process the DAG. Note that when loading the file this way, you are starting a new Python interpreter, so there
+is an initial loading time that is not present when Airflow parses the DAG. You can assess the
+initialization time by running:
+
+.. code-block:: bash
+
+     time python -c ''
+
+Result:
+
+.. code-block:: text
+
+    real    0m0.073s
+    user    0m0.037s
+    sys     0m0.039s
 
-Running the above command without any error ensures your DAG does not contain any uninstalled dependency, syntax errors, etc.
+In this case the initial interpreter startup time is ~0.07s, which is about 10% of the time needed to parse
+the ``example_python_operator.py`` above, so the actual parsing time is about ~0.62s (0.699s minus 0.073s) for the example DAG.
 
 You can look into :ref:`Testing a DAG <testing>` for details on how to test individual operators.
 
diff --git a/docs/apache-airflow/concepts/scheduler.rst b/docs/apache-airflow/concepts/scheduler.rst
index 0a1079e..1cd4ecc 100644
--- a/docs/apache-airflow/concepts/scheduler.rst
+++ b/docs/apache-airflow/concepts/scheduler.rst
@@ -20,6 +20,9 @@
 Scheduler
 ==========
 
+.. contents:: :local:
+
+
 The Airflow scheduler monitors all tasks and DAGs, then triggers the
 task instances once their dependencies are complete. Behind the scenes,
 the scheduler spins up a subprocess, which monitors and stays in sync with all
@@ -43,13 +46,13 @@ Your DAGs will start executing once the scheduler is running successfully.
 
 .. note::
 
-    The first DAG Run is created based on the minimum ``start_date`` for the tasks in your DAG.
-    Subsequent DAG Runs are created by the scheduler process, based on your DAG’s ``schedule_interval``,
+    The first DAG Run is created based on the minimum ``start_date`` for the tasks in your DAG.
+    Subsequent DAG Runs are created by the scheduler process, based on your DAG’s ``schedule_interval``,
     sequentially.
 
 
 The scheduler won't trigger your tasks until the period it covers has ended e.g., A job with ``schedule_interval`` set as ``@daily`` runs after the day
-has ended. This technique makes sure that whatever data is required for that period is fully available before the dag is executed.
+has ended. This technique makes sure that whatever data is required for that period is fully available before the DAG is executed.
 In the UI, it appears as if Airflow is running your tasks a day **late**
 
 .. note::
@@ -138,18 +141,174 @@ The following databases are fully supported and provide an "optimal" experience:
 
   Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing them with the DAGs in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to take a number of factors into account:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuously reading DAGs)
+    * how fast the filesystem is (for many distributed cloud filesystems you can pay extra to get
+      more throughput/a faster filesystem)
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember the DAG parser needs to read and parse the file every n seconds)
+    * how complex they are (i.e. how fast they can be parsed, how many tasks and dependencies they have)
+    * whether parsing your DAG file involves importing a lot of libraries or heavy processing at the top level
+      (Hint! It should not. See :ref:`best_practices/top_level_code`)
+
+* The scheduler configuration
+   * How many schedulers you have
+   * How many parsing processes you have in your scheduler
+   * How much time the scheduler waits between re-parsing of the same DAG (it happens continuously)
+   * How many task instances the scheduler processes in one loop
+   * How many new DAG runs should be created/scheduled per loop
+   * How often the scheduler should perform cleanup, check for orphaned tasks, and adopt them
+
+In order to perform fine-tuning, it's good to understand how the Scheduler works under the hood.
+You can take a look at the Airflow Summit 2021 talk
+`Deep Dive into the Airflow Scheduler <https://youtu.be/DYC4-xElccE>`_ to learn more.
+
+How to approach Scheduler's fine-tuning
+"""""""""""""""""""""""""""""""""""""""
+
+Airflow gives you a lot of "knobs" to turn to fine-tune the performance, but it's a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get the best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are ok with
+30-second delays in new DAG parsing, at the expense of lower CPU usage, whereas other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs folder, at the
+expense of higher CPU usage, for example.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* it's extremely important to monitor your system with the right set of tools - the ones you usually use.
+  This document does not go into the details of particular metrics and tools that you
+  can use; it just describes what kind of resources you should monitor. You should follow your best
+  practices for monitoring to grab the right data.
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are the usual limiting factors
+* based on your expectations and observations, decide what your next improvement is, and go back to
+  observing your performance and bottlenecks. Performance improvement is an iterative process.
+
+What resources might limit Scheduler's performance
+""""""""""""""""""""""""""""""""""""""""""""""""""
+
+There are several areas of resource usage that you should pay attention to:
+
+* Filesystem performance. The Airflow Scheduler relies heavily on parsing (sometimes a lot of) Python
+  files, which are often located on a shared filesystem. The Airflow Scheduler continuously reads and
+  re-parses those files. The same files have to be made available to workers, so often they are
+  stored in a distributed filesystem. You can use various filesystems for that purpose (NFS, CIFS, EFS,
+  GCS fuse, and Azure File System are good examples). There are various parameters you can control for those
+  filesystems to fine-tune their performance, but this is beyond the scope of this document. You should
+  observe the statistics and usage of your filesystem to determine whether problems come from the filesystem
+  performance. For example, there is anecdotal evidence that increasing IOPS (and paying more) for
+  EFS performance dramatically improves the stability and speed of parsing Airflow DAGs when EFS is used.
+* Another solution to filesystem performance, if it becomes your bottleneck, is to turn to alternative
+  mechanisms of distributing your DAGs. Embedding DAGs in your image and GitSync distribution both have
+  the property that the files are available locally for the Scheduler, so it does not have to use a
+  distributed filesystem to read them; this is
+  usually as fast as it can be, especially if your machines use fast SSD disks for local storage. Those
+  distribution mechanisms have other characteristics that might make them not the best choice for you,
+  but if your performance problems come from distributed filesystem performance, they might be the
+  best approach to follow.
+* Database connections and database usage might become a problem as you want to increase performance and
+  process more things in parallel. Airflow is known for being "database-connection hungry" - the more DAGs
+  you have and the more you want to process in parallel, the more database connections will be opened.
+  This is generally not a problem for MySQL, as its model of handling connections is thread-based, but it
+  might be a problem for Postgres, where connection handling is process-based. It is a general consensus
+  that if you have even a medium-sized Postgres-based Airflow installation, the best solution is to use
+  `PGBouncer <https://www.pgbouncer.org/>`_ as a proxy to your database. The :doc:`helm-chart:index`
+  supports PGBouncer out-of-the-box. For MsSQL we have not yet worked out the best practices, as support
+  for MsSQL is still experimental.
+* CPU usage is most important for FileProcessors - those are the processes that parse and execute
+  Python DAG files. Since the Scheduler triggers such parsing continuously, when you have a lot of DAGs,
+  the processing might take a lot of CPU. You can mitigate it by increasing
+  :ref:`config:scheduler__min_file_process_interval`, but this is one of the mentioned trade-offs: the
+  result is that changes to such files will be picked up more slowly and you will see delays between
+  submitting the files and getting them available in the Airflow UI and executed by the Scheduler. Optimizing
+  the way your DAGs are built and avoiding external data sources is your best approach to improving CPU
+  usage. If you have more CPUs available, you can increase the number of parsing processes
+  (:ref:`config:scheduler__parsing_processes`). Also, the Airflow Scheduler scales almost linearly across
+  several instances, so you can also add more Schedulers if your Scheduler's performance is CPU-bound.
+* Airflow might use quite a significant amount of memory when you try to get more performance out of it.
+  Often more performance is achieved in Airflow by increasing the number of processes handling the load,
+  and each process requires a whole Python interpreter loaded, a lot of classes imported, and temporary
+  in-memory storage. A lot of this is optimized by Airflow using forking and copy-on-write memory,
+  but if new classes are imported after forking, this can lead to extra memory pressure.
+  You need to observe whether your system is using more memory than it has - which results in swapping to disk,
+  which dramatically decreases performance. Note that the Airflow Scheduler in versions prior to ``2.1.4``
+  generated a lot of ``Page Cache`` memory used by log files (when the log files were not removed).
+  This was generally harmless, as the memory is just cache and could be reclaimed at any time by the system;
+  however, in version ``2.1.4`` and beyond, writing logs will not generate excessive ``Page Cache`` memory.
+  Regardless, when you look at memory usage, make sure you pay attention to the kind of memory you are observing.
+  Usually you should look at ``working memory`` (names might vary depending on your deployment) rather
+  than ``total memory used``.
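+
+As an illustration of the kind of observation this requires - a minimal sketch assuming ``psutil``
+is available (Airflow itself depends on it); use whatever monitoring stack you normally rely on - you
+can spot-check the CPU and resident memory of the scheduler processes on a host:
+
+.. code-block:: python
+
+  import psutil
+
+  for proc in psutil.process_iter(["pid", "cmdline"]):
+      cmdline = " ".join(proc.info["cmdline"] or [])
+      if "airflow scheduler" in cmdline:
+          rss_mib = proc.memory_info().rss / 1024 ** 2  # resident set size
+          cpu = proc.cpu_percent(interval=1.0)
+          print(f"pid={proc.info['pid']} cpu={cpu:.0f}% rss={rss_mib:.0f} MiB")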
+
+What can you do, to improve Scheduler's performance
+"""""""""""""""""""""""""""""""""""""""""""""""""""
+
+When you know what your resource usage is, the improvements that you can consider might be:
+
+* improve the logic and efficiency of parsing and reduce the complexity of your top-level DAG Python code. It is
+  parsed continuously, so optimizing that code might bring tremendous improvements, especially if you try
+  to reach out to some external databases etc. while parsing DAGs (this should be avoided at all costs).
+  :ref:`best_practices/top_level_code` explains the best practices for writing your top-level
+  Python code. :ref:`best_practices/reducing_dag_complexity` provides some areas that you might
+  look at when you want to reduce the complexity of your code.
+* improve the utilization of your resources. This is when you have free capacity in your system that
+  seems underutilized (again, CPU, memory, I/O and networking are the prime candidates). Actions
+  like increasing the number of schedulers or parsing processes, or decreasing intervals for more
+  frequent actions, might bring improvements in performance at the expense of higher utilization of those resources.
+* increase hardware capacity (for example if you see that CPU is limiting you, or that the I/O you use for
+  the DAG filesystem is at its limits). Often the problem with scheduler performance is
+  simply that your system is not "capable" enough, and this might be the only way. For example, if
+  you see that you are using all the CPU you have on a machine, you might want to add another scheduler on
+  a new machine - in most cases, when you add a 2nd or 3rd scheduler, the scheduling capacity grows
+  linearly (unless the shared database or filesystem is a bottleneck).
+* experiment with different values for the "scheduler tunables". Often you might get better effects by
+  simply exchanging one performance aspect for another. For example, if you want to decrease
+  CPU usage, you might increase the file processing interval (but the result will be that new DAGs will
+  appear with a bigger delay). Usually performance tuning is the art of balancing different aspects.
+* sometimes you can change the scheduler behaviour slightly (for example, change the parsing sort order)
+  in order to get better fine-tuned results for your particular deployment.
+
+
 .. _scheduler:ha:tunables:
 
-Scheduler Tuneables
-"""""""""""""""""""
+Scheduler Configuration options
+"""""""""""""""""""""""""""""""
 
-The following config settings can be used to control aspects of the Scheduler HA loop.
+The following config settings can be used to control aspects of the Scheduler.
+However, you can also look at other non-performance-related scheduler configuration parameters available in
+the ``[scheduler]`` section of :doc:`../configurations-ref`.
 
 - :ref:`config:scheduler__max_dagruns_to_create_per_loop`
 
-  This changes the number of dags that are locked by each scheduler when
-  creating dag runs. One possible reason for setting this lower is if you
-  have huge dags and are running multiple schedules, you won't want one
+  This changes the number of DAGs that are locked by each scheduler when
+  creating DAG runs. One possible reason for setting this lower is if you
+  have huge DAGs (on the order of 10k+ tasks per DAG) and are running multiple schedulers - you won't want one
   scheduler to do all the work.
 
 - :ref:`config:scheduler__max_dagruns_per_loop_to_schedule`
@@ -158,14 +317,14 @@ The following config settings can be used to control aspects of the Scheduler HA
   and queuing tasks. Increasing this limit will allow more throughput for
   smaller DAGs but will likely slow down throughput for larger (>500
   tasks for example) DAGs. Setting this too high when using multiple
-  schedulers could also lead to one scheduler taking all the dag runs
+  schedulers could also lead to one scheduler taking all the DAG runs
   leaving no work for the others.
 
 - :ref:`config:scheduler__use_row_level_locking`
 
   Should the scheduler issue ``SELECT ... FOR UPDATE`` in relevant queries.
   If this is set to False then you should not run more than a single
-  scheduler at once
+  scheduler at once.
 
 - :ref:`config:scheduler__pool_metrics_interval`
 
@@ -174,27 +333,49 @@ The following config settings can be used to control aspects of the Scheduler HA
   this, so this should be set to match the same period as your statsd roll-up
   period.
 
-- :ref:`config:scheduler__clean_tis_without_dagrun_interval`
-
-  How often should each scheduler run a check to "clean up" TaskInstance rows
-  that are found to no longer have a matching DagRun row.
-
-  In normal operation the scheduler won't do this, it is only possible to do
-  this by deleting rows via the UI, or directly in the DB. You can set this
-  lower if this check is not important to you -- tasks will be left in what
-  ever state they are until the cleanup happens, at which point they will be
-  set to failed.
-
 - :ref:`config:scheduler__orphaned_tasks_check_interval`
 
   How often (in seconds) should the scheduler check for orphaned tasks or dead
   SchedulerJobs.
 
   This setting controls how a dead scheduler will be noticed and the tasks it
-  was "supervising" get picked up by another scheduler. (The tasks will stay
-  running, so there is no harm in not detecting this for a while.)
+  was "supervising" get picked up by another scheduler. The tasks will stay
+  running, so there is no harm in not detecting this for a while.
 
   When a SchedulerJob is detected as "dead" (as determined by
   :ref:`config:scheduler__scheduler_health_check_threshold`) any running or
   queued tasks that were launched by the dead process will be "adopted" and
   monitored by this scheduler instead.
+
+- :ref:`config:scheduler__dag_dir_list_interval`
+
+  How often (in seconds) to scan the DAGs directory for new files.
+
+- :ref:`config:scheduler__file_parsing_sort_mode`
+
+  The scheduler will list and sort the DAG files to decide the parsing order.
+
+- :ref:`config:scheduler__max_tis_per_query`
+
+  The batch size of queries in the scheduling main loop. If this is too high, SQL query
+  performance may be impacted by the complexity of the query predicate, and/or excessive locking.
+
+  Additionally, you may hit the maximum allowable query length for your db.
+  Set this to 0 for no limit (not advised).
+
+- :ref:`config:scheduler__min_file_process_interval`
+
+  Number of seconds after which a DAG file is re-parsed. The DAG file is parsed every
+  ``min_file_process_interval`` seconds. Updates to DAGs are reflected after
+  this interval. Keeping this number low will increase CPU usage.
+
+- :ref:`config:scheduler__parsing_processes`
+
+  The scheduler can run multiple processes in parallel to parse DAG files. This defines
+  how many processes will run.
+
+- :ref:`config:scheduler__processor_poll_interval`
+
+  Controls how long the scheduler will sleep between loops, but only if there was nothing to do
+  in the loop, i.e. if it scheduled something, then it will start the next loop
+  iteration straight away. This parameter is badly named (for historical reasons) and it will be
+  renamed in the future, with deprecation of the current name.
+
+- :ref:`config:scheduler__schedule_after_task_execution`
+
+  Should the Task supervisor process perform a "mini scheduler" to attempt to schedule more tasks of
+  the same DAG. Leaving this on will mean tasks in the same DAG execute quicker,
+  but might starve out other DAGs in some circumstances.
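+
+As an example of how these trade-offs look in practice - a minimal sketch, where the values shown are
+the Airflow defaults rather than tuning recommendations - the performance-related options live in the
+``[scheduler]`` section of ``airflow.cfg``:
+
+.. code-block:: ini
+
+  [scheduler]
+  # A higher value means less frequent re-parsing (lower CPU usage), but
+  # changes to DAG files are picked up with a bigger delay
+  min_file_process_interval = 30
+  # More parsing processes speed up DAG parsing if spare CPU cores exist
+  parsing_processes = 2
+  # Batch size of queries in the scheduling main loop
+  max_tis_per_query = 512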
diff --git a/docs/spelling_wordlist.txt b/docs/spelling_wordlist.txt
index 914aeef..6c0b089 100644
--- a/docs/spelling_wordlist.txt
+++ b/docs/spelling_wordlist.txt
@@ -347,7 +347,7 @@ Tez
 Thinknear
 ToC
 Tooltip
-Tuneables
+Tunables
 UA
 Umask
 Un
@@ -636,6 +636,7 @@ dimensionX
 dingding
 dir
 dirs
+discoverability
 discoverable
 displayName
 distcp
@@ -1299,6 +1300,7 @@ trino
 trojan
 tsv
 ttl
+tunables
 txt
 tz
 tzinfo