You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/06/30 07:20:07 UTC

[GitHub] [airflow] rotemseekingalpha opened a new issue, #24745: Airflow DB cleanup is very slow

rotemseekingalpha opened a new issue, #24745:
URL: https://github.com/apache/airflow/issues/24745

   ### Apache Airflow version
   
   2.3.2 (latest released)
   
   ### What happened
   
   When deleting Dag Runs or DAGs from the UI, the response is HTTP ERROR 504 (timeout).
   IMHO, this is mainly happening because the delete process relies on the foreign key cascade effect.
   After investigating the issue in the database (PostgreSQL), I suspect the problem is the locks on the `xcom` table.
   I found that there are multiple `delete` operations, performed on the `xcom` table, regardless if the task is returning a value.
   Could it be that for every task instance, the Airflow code tries to delete the relevant `xcom` value?
   
   ```
   $ psql -h airflow2-******* -U airflow -f task_instance_delete.sql 
   Timing is on.
   DELETE 57
   Time: 26510.282 ms (00:26.510)
   DELETE 57
   Time: 31985.225 ms (00:31.985)
   DELETE 57
   Time: 27645.672 ms (00:27.646)
   DELETE 57
   Time: 30971.105 ms (00:30.971)
   DELETE 57
   Time: 22710.749 ms (00:22.711)
   DELETE 57
   Time: 75835.418 ms (01:15.835)
   DELETE 57
   Time: 52803.962 ms (00:52.804)
   DELETE 57
   Time: 48853.255 ms (00:48.853)
   DELETE 57
   Time: 49665.834 ms (00:49.666)
   DELETE 57
   Time: 38279.934 ms (00:38.280)
   DELETE 57
   Time: 26739.524 ms (00:26.740)
   DELETE 57
   Time: 30795.999 ms (00:30.796)
   DELETE 57
   Time: 31985.267 ms (00:31.985)
   DELETE 57
   Time: 36109.525 ms (00:36.110)
   DELETE 57
   Time: 25500.953 ms (00:25.501)
   DELETE 57
   Time: 19761.695 ms (00:19.762)
   DELETE 57
   Time: 19623.964 ms (00:19.624)
   DELETE 57
   Time: 19174.745 ms (00:19.175)
   DELETE 57
   Time: 25555.729 ms (00:25.556)
   DELETE 57
   Time: 22745.851 ms (00:22.746)
   DELETE 57
   Time: 23406.203 ms (00:23.406)
   DELETE 57
   Time: 19464.922 ms (00:19.465)
   DELETE 57
   Time: 27405.270 ms (00:27.405)
   DELETE 57
   Time: 84066.345 ms (01:24.066)
   DELETE 57
   Time: 58763.224 ms (00:58.763)
   ```
   
   `task_instance_delete.sql` contains the following:
   ```
   DELETE from task_instance where dag_id='some_dag_id' and run_id='some_run_id';
   ```
   
   
   ### What you think should happen instead
   
   `delete` operations should be faster and UI should not receive a timeout.
   Perhaps have a poll mechanism to check if DAG was deleted
   
   ### How to reproduce
   
   Try to delete task_instance table from a large deployment of Airflow (hundreds of DAGs, task instances, and xcom values).
   
   ### Operating System
   
   Ubuntu 20.04.3 LTS
   
   ### Versions of Apache Airflow Providers
   
   ```
   apache-airflow-providers-amazon    @ apache_airflow_providers_amazon-4.0.0-py3-none-any.whl
   apache-airflow-providers-ftp       @ apache_airflow_providers_ftp-3.0.0-py3-none-any.whl
   apache-airflow-providers-http      @ apache_airflow_providers_http-3.0.0-py3-none-any.whl
   apache-airflow-providers-imap      @ apache_airflow_providers_imap-3.0.0-py3-none-any.whl
   apache-airflow-providers-mongo     @ apache_airflow_providers_mongo-3.0.0-py3-none-any.whl
   apache-airflow-providers-mysql     @ apache_airflow_providers_mysql-3.0.0-py3-none-any.whl
   apache-airflow-providers-pagerduty @ apache_airflow_providers_pagerduty-3.0.0-py3-none-any.whl
   apache-airflow-providers-postgres  @ apache_airflow_providers_postgres-5.0.0-py3-none-any.whl
   apache-airflow-providers-redis     @ apache_airflow_providers_redis-3.0.0-py3-none-any.whl
   apache-airflow-providers-sendgrid  @ apache_airflow_providers_sendgrid-3.0.0-py3-none-any.whl
   apache-airflow-providers-slack     @ apache_airflow_providers_slack-5.0.0-py3-none-any.whl
   apache-airflow-providers-sqlite    @ apache_airflow_providers_sqlite-3.0.0-py3-none-any.whl
   apache-airflow-providers-ssh       @ apache_airflow_providers_ssh-3.0.0-py3-none-any.whl
   apache-airflow-providers-vertica   @ apache_airflow_providers_vertica-3.0.0-py3-none-any.whl
   ```
   
   ### Deployment
   
   Virtualenv installation
   
   ### Deployment details
   
   Python 3.8.10
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] rotemseekingalpha commented on issue #24745: Airflow DB cleanup is very slow

Posted by GitBox <gi...@apache.org>.
rotemseekingalpha commented on issue #24745:
URL: https://github.com/apache/airflow/issues/24745#issuecomment-1170868151

   @uranusjr This was just an example of what happens in the database.
   I am not saying I actively deleting data directly from the database.
   I am using any of the [airflow_db_cleanup](https://cloud.google.com/composer/docs/cleanup-airflow-database) DAGs provided online


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on issue #24745: Airflow DB cleanup is very slow

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #24745:
URL: https://github.com/apache/airflow/issues/24745#issuecomment-1170915187

   @rotemseekingalpha - if you feel like it, you can provide a PR for that (but I am afraid this is a very complex change to implement from the UI. The UI is not built in the way that it will handle ALL cases of all problems of database of all sizes. This is just impossible, and entirely not needed to not necessarily complicate the architecture of the UI.
   
   If you would like to introduce asynchronouse delete and polling and long deletion of a DAG with REALLY huge history (like apparently is your case), then it means you have to add a separate process that will control it and run it and UI just triggering the deletion and polling for it. Yes. It is possible, no we might choose not to do it because it unnecessary complicates things and there are other options:
   
   1) you can manually delete the dag history from the DB.
   
   2) (better) in latest Airflow 2.3. you have `airflow db clean` - CLI command whch will purge historical data (records older than a given date) and is a recommended way to keep your database small and snappy. You should upgrade and start using this command instead of the cleanup dags, which (while being helpful in the past and useful have been really replaced by the CLI command). We highly recommend using it. You can run those cleanup scripts in the way you find best (including running them in bash oparetor of airlfow). 
   
   If you feel like it and want to become one of the > 2100 contributors (most - users like you) - you are most welcome to propose such a solution. This is a free software, and anyone can do it. You can do it too, if you feel it will help with your problem and you find the other ways of solving the problem as  not "enough". Many of our users actually contributed something because they felt they should give back for the free software they use and their case was specific enough for their case  but also replicable by others, that they felt the need they can help others by contributing a change. 
   
   But just be warned -  this one might take quite some time as it migh require some architectural decisions and possibly an Airflow Improvement Proposal (https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals) to write.
   
   But if you really think it's useful and needed - you are most welcome if you want to join those more than 2100 contributors.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr closed issue #24745: Airflow DB cleanup is very slow

Posted by GitBox <gi...@apache.org>.
uranusjr closed issue #24745: Airflow DB cleanup is very slow
URL: https://github.com/apache/airflow/issues/24745


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on issue #24745: Airflow DB cleanup is very slow

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #24745:
URL: https://github.com/apache/airflow/issues/24745#issuecomment-1170944229

   Airflow db clean is an official airflew feature. Not a db operation per se.that's why 1) from my answer is far worse (because it is not supported and if you do it you have to be prepared for it to stop working any time. There is nothing to forbid you doing so. But by doing it you take over responsibility of maintaining your cleanup scripts. We are not policemen forbidding you to do stuff, you just need to bear the consequences of your choices 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] rotemseekingalpha commented on issue #24745: Airflow DB cleanup is very slow

Posted by GitBox <gi...@apache.org>.
rotemseekingalpha commented on issue #24745:
URL: https://github.com/apache/airflow/issues/24745#issuecomment-1170939339

   @potiuk Thank you for your comment. I will definitely move to performing cleanups using the Airflow cli
   @uranusjr said operations on the database are not supported.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on issue #24745: Airflow DB cleanup is very slow

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #24745:
URL: https://github.com/apache/airflow/issues/24745#issuecomment-1170871977

   That’s still directly interacting with the database. As I mentioned, it’s not actively supported.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] rotemseekingalpha commented on issue #24745: Airflow DB cleanup is very slow

Posted by GitBox <gi...@apache.org>.
rotemseekingalpha commented on issue #24745:
URL: https://github.com/apache/airflow/issues/24745#issuecomment-1170950878

   @potiuk thank you


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] rotemseekingalpha commented on issue #24745: Airflow DB cleanup is very slow

Posted by GitBox <gi...@apache.org>.
rotemseekingalpha commented on issue #24745:
URL: https://github.com/apache/airflow/issues/24745#issuecomment-1170872413

   @uranusjr What about the timeout when deleting DAG/DagRun from UI?
   Is that supported?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on issue #24745: Airflow DB cleanup is very slow

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #24745:
URL: https://github.com/apache/airflow/issues/24745#issuecomment-1170905940

   UI operations are supported so we should improve those.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on issue #24745: Airflow DB cleanup is very slow

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #24745:
URL: https://github.com/apache/airflow/issues/24745#issuecomment-1170866081

   Directly interacting with the database is not supported. If you have a thing you want to improve, we can accept a pull request if it does not go against other design decisions, but we are not going to actively provide solutions for you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org