Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/02/28 23:55:28 UTC

[GitHub] [airflow] jedcunningham commented on a change in pull request #21879: Add docs re upgrade / downgrade

jedcunningham commented on a change in pull request #21879:
URL: https://github.com/apache/airflow/pull/21879#discussion_r816340156



##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -577,3 +577,41 @@ For connection, use :envvar:`AIRFLOW_CONN_{CONN_ID}`.
     conn_uri = conn.get_uri()
     with mock.patch.dict("os.environ", AIRFLOW_CONN_MY_CONN=conn_uri):
         assert "cat" == Connection.get("my_conn").login
+
+Metadata DB maintenance
+-----------------------
+
+Over time, the metadata database will increase its storage footprint as more dag and task runs and event logs accumulate.
+
+You can use the Airflow CLI to purge old data with command ``airflow db clean``.
+
+See :ref:`db clean usage<cli-db-clean>` for more details.
+
+Upgrades and downgrades
+-----------------------
+
+Backup your database
+^^^^^^^^^^^^^^^^^^^^
+
+It's always a wise idea to backup the metadata database before undertaking any operation modifying the database.
+
+Disabling the scheduler
+^^^^^^^^^^^^^^^^^^^^^^^
+
+You might consider disabling the Airflow cluster while you perform such maintenance.  One way to do so would be to set the param ``[scheduler] > use_job_schedule`` and wait for any running dags to complete; after this no new dag runs will be created unless externally triggered.

Review comment:
       ```suggestion
   You might consider disabling the Airflow cluster while you perform such maintenance.  One way to do so would be to set the param ``[scheduler] > use_job_schedule`` to ``False`` and wait for any running DAGs to complete; after this no new DAG runs will be created unless externally triggered.
   ```
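For illustration, a minimal sketch of what this looks like in practice, assuming Airflow's standard ``AIRFLOW__{SECTION}__{KEY}`` environment-variable convention for overriding ``airflow.cfg``:

```bash
# equivalent to setting, in airflow.cfg:
#   [scheduler]
#   use_job_schedule = False
export AIRFLOW__SCHEDULER__USE_JOB_SCHEDULE=False

# the scheduler reads config at startup, so restart it for this to take effect
airflow scheduler
```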

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -577,3 +577,41 @@ For connection, use :envvar:`AIRFLOW_CONN_{CONN_ID}`.
     conn_uri = conn.get_uri()
     with mock.patch.dict("os.environ", AIRFLOW_CONN_MY_CONN=conn_uri):
         assert "cat" == Connection.get("my_conn").login
+
+Metadata DB maintenance
+-----------------------
+
+Over time, the metadata database will increase its storage footprint as more dag and task runs and event logs accumulate.
+
+You can use the Airflow CLI to purge old data with command ``airflow db clean``.
+
+See :ref:`db clean usage<cli-db-clean>` for more details.
+
+Upgrades and downgrades
+-----------------------
+
+Backup your database
+^^^^^^^^^^^^^^^^^^^^
+
+It's always a wise idea to backup the metadata database before undertaking any operation modifying the database.
+
+Disabling the scheduler
+^^^^^^^^^^^^^^^^^^^^^^^
+
+You might consider disabling the Airflow cluster while you perform such maintenance.  One way to do so would be to set the param ``[scheduler] > use_job_schedule`` and wait for any running dags to complete; after this no new dag runs will be created unless externally triggered.
+
+Another way to accomplish roughly the same thing is to use the ``dags pause`` command.  You *must* keep track of the dags that are paused before you begin this operation, otherwise when it comes time to unpause, you won't know which ones should remain paused!  So first run ``airflow dags list``, then store the list of unpaused dags, and keep this list somewhere so that later you can unpause only these.
+
+Upgrades
+^^^^^^^^
+
+Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command.
+
+If desired, you may apply each upgrade migration manually, one at a time.  To do so use the ``--revision-range`` option with ``db upgrade``.  Do *not* skip running the alembic revision id update commands; this is how airflow will know where you are upgrading from or two the next time you need to.  See :doc:`/migrations-ref.rst` for a mapping between revision and version.

Review comment:
       ```suggestion
    If desired, you may apply each upgrade migration manually, one at a time.  To do so use the ``--revision-range`` option with ``db upgrade``.  Do *not* skip running the Alembic revision id update commands; this is how Airflow will know where you are upgrading from or to the next time you need to.  See :doc:`/migrations-ref.rst` for a mapping between revision and version.
   ```
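A sketch of what applying a single migration step might look like, assuming ``--revision-range`` accepts an Alembic-style ``from:to`` range (the revision ids below are placeholders; the real ones are listed in the migrations reference):

```bash
# apply only the migrations between two known revisions
airflow db upgrade --revision-range "aaaa1111bbbb:cccc2222dddd"
```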

##########
File path: docs/apache-airflow/usage-cli.rst
##########
@@ -199,3 +199,52 @@ Both ``json`` and ``yaml`` formats make it easier to manipulate the data using c
     "sd": "2020-11-29T14:53:56.931243+00:00",
     "ed": "2020-11-29T14:53:57.126306+00:00"
   }
+
+.. _cli-db-clean:
+
+Purge history from metadata database
+------------------------------------
+
+.. note::
+
+  It's strongly recommended that you backup the metadata database before running the ``db clean`` command.
+
+The ``db clean`` command works by deleting from each table the records older than the provided ``--clean-before-timestamp``.
+
+You can optionally provide a list of tables to perform deletes on. If no list of tables is supplied, all tables will be included.
+
+You can use the ``--dry-run`` option to print the row counts in the primary tables to be cleaned.
+
+Beware cascading deletes
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Keep in mind that some tables have foreign key relationships defined with ``ON DELETE CASCADE`` so deletes in one table may trigger deletes in others.  For example, the ``task_instance`` table keys to the ``dag_run`` table, so if a dag run record is deleted, all of its associated task instances will also be deleted.
+
+
+Special handling for dag runs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Commonly, Airflow determines which DagRun to run next by looking up the latest DagRun.  If you delete all dag runs, Airflow may schedule an old dag run that was already completed, e.g. if you have set ``catchup=True``.  So the ``db clean`` will preserve the latest non-manually-triggered dag run to preserve continuity in scheduling.
+
+Considerations for backfillable dags
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Not all dags are designed for use with airflow's backfill command.  But for those which are, special care is warranted.  If you delete dag runs, and if you run backfill over a range of dates that includes the deleted dag runs, those runs will be recreated and run again.  For this reason, if you have dags that fall into this category you may want to refrain from deleting dag runs and only clean other large tables such as task instance and log etc.
+
+
+.. _cli-db-downgrade:
+
+Downgrading airflow
+-------------------
+
+.. note::
+
+    It's recommended that you backup your database before running ``db downgrade`` or any other database operation.
+
+You can downgrade to a particular Airflow version with the ``db downgrade`` command.  Alternatively you may provide an alembic revision id to downgrade to.
+
+If you want to preview the commands but not execute them, use option ``--sql-only``.
+
+Options ``--from-revision`` and ``--from-version`` may only be used in conjunction with the ``--sql-only`` option, because if actually *running* migrations we should always downgrade from current revision.

Review comment:
       ```suggestion
   Options ``--from-revision`` and ``--from-version`` may only be used in conjunction with the ``--sql-only`` option, because when actually *running* migrations we should always downgrade from current revision.
   ```
   Maybe?
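For what it's worth, a hypothetical preview invocation, assuming ``--to-version`` is the flag that selects the target (version numbers are placeholders):

```bash
# print the downgrade SQL without executing anything
airflow db downgrade --from-version "2.3.0" --to-version "2.2.4" --sql-only
```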

##########
File path: docs/apache-airflow/usage-cli.rst
##########
@@ -199,3 +199,52 @@ Both ``json`` and ``yaml`` formats make it easier to manipulate the data using c
     "sd": "2020-11-29T14:53:56.931243+00:00",
     "ed": "2020-11-29T14:53:57.126306+00:00"
   }
+
+.. _cli-db-clean:
+
+Purge history from metadata database
+------------------------------------
+
+.. note::
+
+  It's strongly recommended that you backup the metadata database before running the ``db clean`` command.
+
+The ``db clean`` command works by deleting from each table the records older than the provided ``--clean-before-timestamp``.
+
+You can optionally provide a list of tables to perform deletes on. If no list of tables is supplied, all tables will be included.
+
+You can use the ``--dry-run`` option to print the row counts in the primary tables to be cleaned.
+
+Beware cascading deletes
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Keep in mind that some tables have foreign key relationships defined with ``ON DELETE CASCADE`` so deletes in one table may trigger deletes in others.  For example, the ``task_instance`` table keys to the ``dag_run`` table, so if a dag run record is deleted, all of its associated task instances will also be deleted.
+
+
+Special handling for dag runs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Commonly, Airflow determines which DagRun to run next by looking up the latest DagRun.  If you delete all dag runs, Airflow may schedule an old dag run that was already completed, e.g. if you have set ``catchup=True``.  So the ``db clean`` will preserve the latest non-manually-triggered dag run to preserve continuity in scheduling.

Review comment:
       ```suggestion
   Commonly, Airflow determines which DagRun to run next by looking up the latest DagRun.  If you delete all DAG runs, Airflow may schedule an old DAG run that was already completed, e.g. if you have set ``catchup=True``.  So the ``db clean`` will preserve the latest non-manually-triggered DAG run to preserve continuity in scheduling.
   ```

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -577,3 +577,41 @@ For connection, use :envvar:`AIRFLOW_CONN_{CONN_ID}`.
     conn_uri = conn.get_uri()
     with mock.patch.dict("os.environ", AIRFLOW_CONN_MY_CONN=conn_uri):
         assert "cat" == Connection.get("my_conn").login
+
+Metadata DB maintenance
+-----------------------
+
+Over time, the metadata database will increase its storage footprint as more dag and task runs and event logs accumulate.
+
+You can use the Airflow CLI to purge old data with command ``airflow db clean``.
+
+See :ref:`db clean usage<cli-db-clean>` for more details.
+
+Upgrades and downgrades
+-----------------------
+
+Backup your database
+^^^^^^^^^^^^^^^^^^^^
+
+It's always a wise idea to backup the metadata database before undertaking any operation modifying the database.
+
+Disabling the scheduler
+^^^^^^^^^^^^^^^^^^^^^^^
+
+You might consider disabling the Airflow cluster while you perform such maintenance.  One way to do so would be to set the param ``[scheduler] > use_job_schedule`` and wait for any running dags to complete; after this no new dag runs will be created unless externally triggered.
+
+Another way to accomplish roughly the same thing is to use the ``dags pause`` command.  You *must* keep track of the dags that are paused before you begin this operation, otherwise when it comes time to unpause, you won't know which ones should remain paused!  So first run ``airflow dags list``, then store the list of unpaused dags, and keep this list somewhere so that later you can unpause only these.
+
+Upgrades
+^^^^^^^^
+
+Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command.
+
+If desired, you may apply each upgrade migration manually, one at a time.  To do so use the ``--revision-range`` option with ``db upgrade``.  Do *not* skip running the alembic revision id update commands; this is how airflow will know where you are upgrading from or two the next time you need to.  See :doc:`/migrations-ref.rst` for a mapping between revision and version.

Review comment:
       ```suggestion
   If desired, you may apply each upgrade migration manually, one at a time.  To do so use the ``--revision-range`` option with ``db upgrade``.  Do *not* skip running the alembic revision id update commands; this is how airflow will know where you are upgrading from the next time you need to.  See :doc:`/migrations-ref.rst` for a mapping between revision and version.
   ```
   
    It's just used for the "from", no?

##########
File path: docs/apache-airflow/usage-cli.rst
##########
@@ -199,3 +199,52 @@ Both ``json`` and ``yaml`` formats make it easier to manipulate the data using c
     "sd": "2020-11-29T14:53:56.931243+00:00",
     "ed": "2020-11-29T14:53:57.126306+00:00"
   }
+
+.. _cli-db-clean:
+
+Purge history from metadata database
+------------------------------------
+
+.. note::
+
+  It's strongly recommended that you backup the metadata database before running the ``db clean`` command.
+
+The ``db clean`` command works by deleting from each table the records older than the provided ``--clean-before-timestamp``.
+
+You can optionally provide a list of tables to perform deletes on. If no list of tables is supplied, all tables will be included.
+
+You can use the ``--dry-run`` option to print the row counts in the primary tables to be cleaned.
+
+Beware cascading deletes
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Keep in mind that some tables have foreign key relationships defined with ``ON DELETE CASCADE`` so deletes in one table may trigger deletes in others.  For example, the ``task_instance`` table keys to the ``dag_run`` table, so if a dag run record is deleted, all of its associated task instances will also be deleted.
+
+
+Special handling for dag runs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Commonly, Airflow determines which DagRun to run next by looking up the latest DagRun.  If you delete all dag runs, Airflow may schedule an old dag run that was already completed, e.g. if you have set ``catchup=True``.  So the ``db clean`` will preserve the latest non-manually-triggered dag run to preserve continuity in scheduling.
+
+Considerations for backfillable dags
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Not all dags are designed for use with airflow's backfill command.  But for those which are, special care is warranted.  If you delete dag runs, and if you run backfill over a range of dates that includes the deleted dag runs, those runs will be recreated and run again.  For this reason, if you have dags that fall into this category you may want to refrain from deleting dag runs and only clean other large tables such as task instance and log etc.

Review comment:
       ```suggestion
   Not all DAGs are designed for use with Airflow's backfill command.  But for those which are, special care is warranted.  If you delete DAG runs, and if you run backfill over a range of dates that includes the deleted DAG runs, those runs will be recreated and run again.  For this reason, if you have DAGs that fall into this category you may want to refrain from deleting DAG runs and only clean other large tables such as task instance and log etc.
   ```
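To make the risk concrete, a sketch of a backfill that would recreate deleted runs (dag id and dates are placeholders):

```bash
# any deleted DAG runs between these dates will be recreated and re-run
airflow dags backfill --start-date 2022-01-01 --end-date 2022-01-31 example_dag
```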

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -577,3 +577,41 @@ For connection, use :envvar:`AIRFLOW_CONN_{CONN_ID}`.
     conn_uri = conn.get_uri()
     with mock.patch.dict("os.environ", AIRFLOW_CONN_MY_CONN=conn_uri):
         assert "cat" == Connection.get("my_conn").login
+
+Metadata DB maintenance
+-----------------------
+
+Over time, the metadata database will increase its storage footprint as more dag and task runs and event logs accumulate.

Review comment:
       ```suggestion
   Over time, the metadata database will increase its storage footprint as more DAG and task runs and event logs accumulate.
   ```

##########
File path: docs/apache-airflow/usage-cli.rst
##########
@@ -199,3 +199,52 @@ Both ``json`` and ``yaml`` formats make it easier to manipulate the data using c
     "sd": "2020-11-29T14:53:56.931243+00:00",
     "ed": "2020-11-29T14:53:57.126306+00:00"
   }
+
+.. _cli-db-clean:
+
+Purge history from metadata database
+------------------------------------
+
+.. note::
+
+  It's strongly recommended that you backup the metadata database before running the ``db clean`` command.
+
+The ``db clean`` command works by deleting from each table the records older than the provided ``--clean-before-timestamp``.
+
+You can optionally provide a list of tables to perform deletes on. If no list of tables is supplied, all tables will be included.
+
+You can use the ``--dry-run`` option to print the row counts in the primary tables to be cleaned.
+
+Beware cascading deletes
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Keep in mind that some tables have foreign key relationships defined with ``ON DELETE CASCADE`` so deletes in one table may trigger deletes in others.  For example, the ``task_instance`` table keys to the ``dag_run`` table, so if a dag run record is deleted, all of its associated task instances will also be deleted.

Review comment:
       ```suggestion
   Keep in mind that some tables have foreign key relationships defined with ``ON DELETE CASCADE`` so deletes in one table may trigger deletes in others.  For example, the ``task_instance`` table keys to the ``dag_run`` table, so if a DAG run record is deleted, all of its associated task instances will also be deleted.
   ```

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -577,3 +577,41 @@ For connection, use :envvar:`AIRFLOW_CONN_{CONN_ID}`.
     conn_uri = conn.get_uri()
     with mock.patch.dict("os.environ", AIRFLOW_CONN_MY_CONN=conn_uri):
         assert "cat" == Connection.get("my_conn").login
+
+Metadata DB maintenance
+-----------------------
+
+Over time, the metadata database will increase its storage footprint as more dag and task runs and event logs accumulate.
+
+You can use the Airflow CLI to purge old data with command ``airflow db clean``.

Review comment:
       ```suggestion
   You can use the Airflow CLI to purge old data with the command ``airflow db clean``.
   ```
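A sketch of such an invocation (the timestamp is a placeholder, and ``--tables`` is an assumed flag name for restricting which tables are cleaned):

```bash
# preview the row counts that would be affected
airflow db clean --clean-before-timestamp '2022-01-01 00:00:00+00:00' --dry-run

# then actually delete, optionally limited to specific tables
airflow db clean --clean-before-timestamp '2022-01-01 00:00:00+00:00' --tables log,task_instance
```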

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -577,3 +577,41 @@ For connection, use :envvar:`AIRFLOW_CONN_{CONN_ID}`.
     conn_uri = conn.get_uri()
     with mock.patch.dict("os.environ", AIRFLOW_CONN_MY_CONN=conn_uri):
         assert "cat" == Connection.get("my_conn").login
+
+Metadata DB maintenance
+-----------------------
+
+Over time, the metadata database will increase its storage footprint as more dag and task runs and event logs accumulate.
+
+You can use the Airflow CLI to purge old data with command ``airflow db clean``.
+
+See :ref:`db clean usage<cli-db-clean>` for more details.
+
+Upgrades and downgrades
+-----------------------
+
+Backup your database
+^^^^^^^^^^^^^^^^^^^^
+
+It's always a wise idea to backup the metadata database before undertaking any operation modifying the database.
+
+Disabling the scheduler
+^^^^^^^^^^^^^^^^^^^^^^^
+
+You might consider disabling the Airflow cluster while you perform such maintenance.  One way to do so would be to set the param ``[scheduler] > use_job_schedule`` and wait for any running dags to complete; after this no new dag runs will be created unless externally triggered.
+
+Another way to accomplish roughly the same thing is to use the ``dags pause`` command.  You *must* keep track of the dags that are paused before you begin this operation, otherwise when it comes time to unpause, you won't know which ones should remain paused!  So first run ``airflow dags list``, then store the list of unpaused dags, and keep this list somewhere so that later you can unpause only these.

Review comment:
       ```suggestion
   Another way to accomplish roughly the same thing is to use the ``dags pause`` command.  You *must* keep track of the DAGs that are paused before you begin this operation, otherwise when it comes time to unpause, you won't know which ones should remain paused!  So first run ``airflow dags list``, then store the list of unpaused DAGs, and keep this list somewhere so that later you can unpause only these.
   ```
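A minimal sketch of that bookkeeping (the file name and dag id are placeholders):

```bash
# record which DAGs are unpaused before maintenance begins
airflow dags list > dags_before_maintenance.txt

airflow dags pause example_dag
# ... perform the maintenance ...
airflow dags unpause example_dag
```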

##########
File path: docs/apache-airflow/usage-cli.rst
##########
@@ -199,3 +199,52 @@ Both ``json`` and ``yaml`` formats make it easier to manipulate the data using c
     "sd": "2020-11-29T14:53:56.931243+00:00",
     "ed": "2020-11-29T14:53:57.126306+00:00"
   }
+
+.. _cli-db-clean:
+
+Purge history from metadata database
+------------------------------------
+
+.. note::
+
+  It's strongly recommended that you backup the metadata database before running the ``db clean`` command.
+
+The ``db clean`` command works by deleting from each table the records older than the provided ``--clean-before-timestamp``.
+
+You can optionally provide a list of tables to perform deletes on. If no list of tables is supplied, all tables will be included.
+
+You can use the ``--dry-run`` option to print the row counts in the primary tables to be cleaned.
+
+Beware cascading deletes
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Keep in mind that some tables have foreign key relationships defined with ``ON DELETE CASCADE`` so deletes in one table may trigger deletes in others.  For example, the ``task_instance`` table keys to the ``dag_run`` table, so if a dag run record is deleted, all of its associated task instances will also be deleted.
+
+
+Special handling for dag runs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Commonly, Airflow determines which DagRun to run next by looking up the latest DagRun.  If you delete all dag runs, Airflow may schedule an old dag run that was already completed, e.g. if you have set ``catchup=True``.  So the ``db clean`` will preserve the latest non-manually-triggered dag run to preserve continuity in scheduling.
+
+Considerations for backfillable dags
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Not all dags are designed for use with airflow's backfill command.  But for those which are, special care is warranted.  If you delete dag runs, and if you run backfill over a range of dates that includes the deleted dag runs, those runs will be recreated and run again.  For this reason, if you have dags that fall into this category you may want to refrain from deleting dag runs and only clean other large tables such as task instance and log etc.
+
+
+.. _cli-db-downgrade:
+
+Downgrading airflow
+-------------------
+
+.. note::
+
+    It's recommended that you backup your database before running ``db downgrade`` or any other database operation.
+
+You can downgrade to a particular Airflow version with the ``db downgrade`` command.  Alternatively you may provide an alembic revision id to downgrade to.
+
+If you want to preview the commands but not execute them, use option ``--sql-only``.
+
+Options ``--from-revision`` and ``--from-version`` may only be used in conjunction with the ``--sql-only`` option, because if actually *running* migrations we should always downgrade from current revision.
+
+For a mapping between Airflow version and alembic revision see :doc:`/migrations-ref.rst`.

Review comment:
       ```suggestion
   For a mapping between Airflow version and Alembic revision see :doc:`/migrations-ref.rst`.
   ```

##########
File path: docs/apache-airflow/usage-cli.rst
##########
@@ -199,3 +199,52 @@ Both ``json`` and ``yaml`` formats make it easier to manipulate the data using c
     "sd": "2020-11-29T14:53:56.931243+00:00",
     "ed": "2020-11-29T14:53:57.126306+00:00"
   }
+
+.. _cli-db-clean:
+
+Purge history from metadata database
+------------------------------------
+
+.. note::
+
+  It's strongly recommended that you backup the metadata database before running the ``db clean`` command.
+
+The ``db clean`` command works by deleting from each table the records older than the provided ``--clean-before-timestamp``.
+
+You can optionally provide a list of tables to perform deletes on. If no list of tables is supplied, all tables will be included.
+
+You can use the ``--dry-run`` option to print the row counts in the primary tables to be cleaned.
+
+Beware cascading deletes
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Keep in mind that some tables have foreign key relationships defined with ``ON DELETE CASCADE`` so deletes in one table may trigger deletes in others.  For example, the ``task_instance`` table keys to the ``dag_run`` table, so if a dag run record is deleted, all of its associated task instances will also be deleted.
+
+
+Special handling for dag runs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Commonly, Airflow determines which DagRun to run next by looking up the latest DagRun.  If you delete all dag runs, Airflow may schedule an old dag run that was already completed, e.g. if you have set ``catchup=True``.  So the ``db clean`` will preserve the latest non-manually-triggered dag run to preserve continuity in scheduling.
+
+Considerations for backfillable dags
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Not all dags are designed for use with airflow's backfill command.  But for those which are, special care is warranted.  If you delete dag runs, and if you run backfill over a range of dates that includes the deleted dag runs, those runs will be recreated and run again.  For this reason, if you have dags that fall into this category you may want to refrain from deleting dag runs and only clean other large tables such as task instance and log etc.
+
+
+.. _cli-db-downgrade:
+
+Downgrading airflow
+-------------------
+
+.. note::
+
+    It's recommended that you backup your database before running ``db downgrade`` or any other database operation.
+
+You can downgrade to a particular Airflow version with the ``db downgrade`` command.  Alternatively you may provide an alembic revision id to downgrade to.

Review comment:
       ```suggestion
   You can downgrade to a particular Airflow version with the ``db downgrade`` command.  Alternatively you may provide an Alembic revision id to downgrade to.
   ```
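As a sketch of the two forms (version and revision values are placeholders, and ``--to-version``/``--to-revision`` are assumed flag names):

```bash
# downgrade to a particular Airflow version...
airflow db downgrade --to-version "2.2.4"

# ...or to a specific Alembic revision id
airflow db downgrade --to-revision "aaaa1111bbbb"
```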




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org