You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/06/15 21:21:21 UTC

[GitHub] [airflow] wolfier opened a new issue, #24484: `airflow db clean task_instance` takes a long time

wolfier opened a new issue, #24484:
URL: https://github.com/apache/airflow/issues/24484

   ### Apache Airflow version
   
   2.3.1
   
   ### What happened
   
   When I ran the `airflow db clean task_instance` command, it can take up to 9 hours to complete. The database around 3215220 rows in the `task_instance` table and  51602 rows in the `dag_run` table. The overall size of the database is around 1 TB.
   
   I believe the issue is because of the cascade constraints on others tables as well as the lack of indexes on task_instance foreign keys. 
   
   Running delete on a small number of rows gives this shows most of the time is spent in xcom and task_fail tables
   
   ```
   explain (analyze,buffers,timing) delete from task_instance t1 where t1.run_id = 'manual__2022-05-11T01:09:05.856703+00:00'; rollback;
   Trigger for constraint task_reschedule_ti_fkey: time=3.208 calls=23
   Trigger for constraint task_map_task_instance_fkey: time=1.848 calls=23
   Trigger for constraint xcom_task_instance_fkey: time=4457.779 calls=23
   Trigger for constraint rtif_ti_fkey: time=3.135 calls=23
   Trigger for constraint task_fail_ti_fkey: time=1164.183 calls=23
   ```
   
   I temporarily fixed it by adding these indexes.
   
   ```
   create index idx_task_reschedule_dr_fkey on task_reschedule (dag_id, run_id);
   create index idx_xcom_ti_fkey on xcom (dag_id, task_id, run_id, map_index);
   create index idx_task_fail_ti_fkey on task_fail (dag_id, task_id, run_id, map_index);
   ```
   
   ### What you think should happen instead
   
   It should not take 9 hours to complete a clean up process. Before upgrading to 2.3.x, it was taking no more than 5 minutes.
   
   ### How to reproduce
   
   _No response_
   
   ### Operating System
   
   N/A
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Astronomer
   
   ### Deployment details
   
   _No response_
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] dstandish commented on issue #24484: `airflow db clean task_instance` takes a long time

Posted by GitBox <gi...@apache.org>.
dstandish commented on issue #24484:
URL: https://github.com/apache/airflow/issues/24484#issuecomment-1157113234

   @wolfier do you have a pgdump of this database so we can do testing?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] dstandish closed issue #24484: `airflow db clean task_instance` takes a long time

Posted by GitBox <gi...@apache.org>.
dstandish closed issue #24484: `airflow db clean task_instance` takes a long time
URL: https://github.com/apache/airflow/issues/24484


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org