You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by "potiuk (via GitHub)" <gi...@apache.org> on 2023/12/29 11:41:35 UTC

[PR] Add description on the ways how users should approach DB monitoring [airflow]

potiuk opened a new pull request, #36483:
URL: https://github.com/apache/airflow/pull/36483

   Often our users are not aware that they are responsible for setting up and monitoring the database they chose as the metaa-data backend.
   
   While details of the tables and database structure of the metadata DB used by Airflow is internal detail, the monitoring, tracking the usage, fine-tuning and optimisation of the database configuration and detecting some cases where database becomes a bottle neck is generally a task that Deployment Manager should be aware of and it should be approached in a generic way - specific to the database chosen by the Deployment Manager and it also depends a lot on the choice of managed database if managed database is chosen by the Deployment Manager.
   
   This chapter makes it explicit and gives enough leads to the Deployment Manager to be able to follow after they chose the database, it also explain the specific parameters tha the Deployment Manager should pay attention to when setting up such monitoring.
   
   We also add an explanation of how Deployment Manager can setup client-side logging of SQL queries generated by Airflow in case database access is suspected for performance issues with Airflow, as a poor-man's version of complete, server-side monitoring and explains caveats of such client side configuraiton.
   
   <!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at
   
      http://www.apache.org/licenses/LICENSE-2.0
   
    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
    -->
   
   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of an existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   
   
   <!-- Please keep an empty line above the dashes. -->
   ---
   **^ Add meaningful description above**
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code changes, an Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in a newsfragment file, named `{pr_number}.significant.rst` or `{issue_number}.significant.rst`, in [newsfragments](https://github.com/apache/airflow/tree/main/newsfragments).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add description on the ways how users should approach DB monitoring [airflow]

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk merged PR #36483:
URL: https://github.com/apache/airflow/pull/36483


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add description on the ways how users should approach DB monitoring [airflow]

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on PR #36483:
URL: https://github.com/apache/airflow/pull/36483#issuecomment-1872217674

   > Great initiative. I have ideas along similar lines. I have observed in our use case where db grows (1GB per day) indefinitely unless manually run the db clean cli command. So, it will eventually degrade the performance of the database over time. If we can add a feature to automatically clean the data based on the retention parameter that we set in the airflow configuration.
   
   Good points . I will add them 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add description on the ways how users should approach DB monitoring [airflow]

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on PR #36483:
URL: https://github.com/apache/airflow/pull/36483#issuecomment-1872241850

   Here it is - after employing Chat GPT (after few iterations and discussion with it) and applying the comments above. @BasPH - does it look better ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add description on the ways how users should approach DB monitoring [airflow]

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on PR #36483:
URL: https://github.com/apache/airflow/pull/36483#issuecomment-1872229139

   > https://www.datadoghq.com/dg/monitor/databases/ is great for this kind of monitoring.
   
   I am not sure if we should endorse specific services in our docs, but I will leave enough clues for the users to search for services that offer it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add description on the ways how users should approach DB monitoring [airflow]

Posted by "jscheffl (via GitHub)" <gi...@apache.org>.
jscheffl commented on code in PR #36483:
URL: https://github.com/apache/airflow/pull/36483#discussion_r1438203426


##########
docs/apache-airflow/howto/set-up-database.rst:
##########
@@ -383,6 +383,69 @@ After configuring the database and connecting to it in Airflow configuration, yo
 
     airflow db migrate
 
+Monitoring your database and logging queries
+--------------------------------------------
+
+Airflow uses the relational metadata DB a LOT. When scheduling and executing tasks, database is the central
+and crucial part of all the calculations and synchronization and it is important to monitor your database
+and make sure it is configured properly and that there are no excessive and long-running queries that impact
+Airflow performance. Such long or excessive queries might occur due to bugs in the code, missing optimizations,
+there is also a possibility that database optimization engine makes wrong decisions based on statistics of
+data - for example when the statistics of data in your database gets outdated.

Review Comment:
   Very long sentences. I'd propose to re-phrase a bit, e.g.:
   ```suggestion
   Airflow uses the relational metadata DB a lot. When scheduling and executing tasks, database is the central
   and crucial part of all the calculations and synchronization. It is important to monitor your database
   and make sure it is configured properly. Excessive and long-running queries will most probably impact
   Airflow performance. Such long or excessive queries might occur due to specific cases in your workflow, 
   missing optimizations or even due to bugs in the code (which maintainers try to take care for but glitches
   can appear). There is also a possibility that database optimization engine makes wrong decisions based
   on (outdated) statistics of data - for example when the set of data in your database changes and database
   statistics gets outdated.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add description on the ways how users should approach DB monitoring [airflow]

Posted by "hussein-awala (via GitHub)" <gi...@apache.org>.
hussein-awala commented on PR #36483:
URL: https://github.com/apache/airflow/pull/36483#issuecomment-1872189592

   > If we can add a feature to automatically clean the data based on the retention parameter that we set in the airflow configuration.
   
   IMHO, the `db clean` command available in the Airflow CLI and the Python methods provided in the module `db_cleanup` are sufficient for the users because they can be easily used in a simple dag with a bash operator for the CLI or Python operator for the Python API. (that's what most of Airflow users do) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add description on the ways how users should approach DB monitoring [airflow]

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on code in PR #36483:
URL: https://github.com/apache/airflow/pull/36483#discussion_r1438414061


##########
docs/apache-airflow/core-concepts/tasks.rst:
##########
@@ -244,7 +244,7 @@ Zombie/Undead Tasks
 No system runs perfectly, and task instances are expected to die once in a while. Airflow detects two kinds of task/process mismatch:
 
 * *Zombie tasks* are ``TaskInstances`` stuck in a ``running`` state despite their associated jobs being inactive
-  (e.g. their process didn't send a recent heartbeat as it got killed, or the machine died). Airflow will find these
+  (e.g. their process did not send a recent heartbeat as it got killed, or the machine died). Airflow will find these

Review Comment:
   This failed when I run --clean-build locally so i fixed it :)



##########
docs/apache-airflow/core-concepts/tasks.rst:
##########
@@ -273,7 +273,7 @@ The explanation of the criteria used in the above snippet to detect zombie tasks
 
 3. **Job Type**
 
-    The job associated with the task must be of type "LocalTaskJob."
+    The job associated with the task must be of type ``LocalTaskJob``.

Review Comment:
   Same here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add description on the ways how users should approach DB monitoring [airflow]

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on PR #36483:
URL: https://github.com/apache/airflow/pull/36483#issuecomment-1871984595

   Inspired by discussion in https://github.com/apache/airflow/discussions/36454 - whenever user/issue is likely related to excessive DB usage, we should be able to just link the discussion/issue to this documentation, and possibly it will even inspire Deployment Managers to set-up monitoring of their DB as they might not be aware this falls under their responsibility when they choose the DB backend and that they have to learn how to monitor the DB.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add description on the ways how users should approach DB monitoring [airflow]

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on PR #36483:
URL: https://github.com/apache/airflow/pull/36483#issuecomment-1872483058

   I will merge it for now - we can always correct it later. I believe the structured way is much cleaner and responds to the "too many details" concern.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add description on the ways how users should approach DB monitoring [airflow]

Posted by "BasPH (via GitHub)" <gi...@apache.org>.
BasPH commented on PR #36483:
URL: https://github.com/apache/airflow/pull/36483#issuecomment-1872078454

   I agree with explaining monitoring of Airflow's meta-database in the docs. However, this chapter contains a lot of (IMO) unnecessary details which makes it difficult to extract the key information.
   
   How about restructuring it as:
   - 1 sentence explaining "everything you see and do in Airflow is registered in the database, therefore it's important to monitor"
   - 1 sentence with disclaimer that monitoring the database is up to yourself, not part of Airflow
   - 1 paragraph stating "Here are a few key considerations when monitoring your Airflow database" with the bulletpoints from your text
   - 1 paragraph explaining how to enable query logging together with the "why" and a warning that it can have a performance impact


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add description on the ways how users should approach DB monitoring [airflow]

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on PR #36483:
URL: https://github.com/apache/airflow/pull/36483#issuecomment-1872226453

   > > If we can add a feature to automatically clean the data based on the retention parameter that we set in the airflow configuration.
   > 
   > IMHO, the `db clean` command available in the Airflow CLI and the Python methods provided in the module `db_cleanup` are sufficient for the users because they can be easily used in a simple dag with a bash operator for the CLI or Python operator for the Python API. (that's what most of Airflow users do)
   
   Agree here with @hussein-awala . Those are tools we give the Deployment Managers but they should adjust the usage to their needs and run those tools as and when needed with the parameters they decide are best for their deployment.
   
   Generally speaking I think we should stress (and this is partially my motivation) that Airflow (components) and Airflow maintainers will NOT solve the problems of DB retention, configuration and optimization. Also commenting to @BasPH - I think  the part where we explicilty say `Hey Deployment Manager - it's Your job to make sure to keep your database healthy and to analyse if you see some performance issues` is extremely important  - and also FAIR,
   
   I think (and in parts that is my motivation behind this change) is not only to help our users to see what they should do but also make it `crystal clear` they are responsible for taking care about all those things. 
   
   It's extremely complex and rather brittle to develop a generic and applicable to all cases ways to keep the database healthy and optimized. There is a good reason why databases need maintenance and someone to look after that - otherwise we would have it already implemented in Postgres and MySQL so that it's completely `worry-free` and handle all usage cases. Airflow and the way it can be used - from small single node installation to 100s of nodes and 1000s of DAGs and tasks does not narrow down the DB maintenance problem to much smaller and simpler case - not enough that we provide a "zero maintenance" solution out-of the box.  Even the @dirrao 's request is going in that direction. But I think it's far too much to expect this kind of "zero maintenance" from a non-managed, free, open-source software. 
   
   But some of our users **expect** this will happen. And with that documentation chapter  I want to make it crystal clear and set expectations 
   
   Also mentioning "managed database" in this context is important - those databases **might** be closer to "zero-maintenance". And then if you go to "managed Airflow" - that's even more "zero-maintenance" when you pay someone to manage Airflow, then absolutely - yes, you should expect you do not have to worry about database maintenance and the one who provides "managed Airflow" should take care about it - this is precisely why you pay  (among other things) - you pay for the maintanance that you do not have to do.
   
   @BasPH - yep I hear you. I will try to remove some duplications and restructure it a bit (I just bought Chat GPT 4.5 subscription **just** to see if it can help with that). But the points about "Deployment Manager responsibity" especially and settting clear expectations what they have to do is crucial part of this change and one that is the main reason I am doing it in the first place.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org