You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/12/13 07:31:20 UTC

[GitHub] [airflow] kushsharma opened a new issue #20249: Slow deletion of a DAG

kushsharma opened a new issue #20249:
URL: https://github.com/apache/airflow/issues/20249


   ### Apache Airflow version
   
   2.2.1
   
   ### What happened
   
   We have an airflow instance for approximately 6k DAGs. 
   - If we delete a DAG from UI, the UI times out
   - If we delete a DAG from CLI, it completes but sometimes takes up to a half-hour to finish.
   
   Most of the execution time appears to be consumed in database queries. I know I can just throw more CPU and memory to the db instance and hope it works but I think we can do better during delete operation. Correct me if I am wrong but I think this is the code that gets executed when deleting a DAG from UI or CLI via `delete_dag.py`
   ```python
       for model in models.base.Base._decl_class_registry.values():
           if hasattr(model, "dag_id"):
               if keep_records_in_log and model.__name__ == 'Log':
                   continue
               cond = or_(model.dag_id == dag_id, model.dag_id.like(dag_id + ".%"))
               count += session.query(model).filter(cond).delete(synchronize_session='fetch')
       if dag.is_subdag:
           parent_dag_id, task_id = dag_id.rsplit(".", 1)
           for model in TaskFail, models.TaskInstance:
               count += (
                   session.query(model).filter(model.dag_id == parent_dag_id, model.task_id == task_id).delete()
               )
   ```
   
   I see we are iterating over all the models and doing a `dag_id` match. Some of the tables don't have an index over `dag_id` column like `job` which is making this operation really slow. This could be one easy fix for this issue.
   
   For example, the following query took 20 mins to finish in 16cpu 32gb Postgres instance:
   ```
   SELECT job.id AS job_id FROM job WHERE job.dag_id = $1 OR job.dag_id LIKE $2
   ```
   This is just one of the many queries that are being executed during the delete operation.
   
   ### What you expected to happen
   
   Deletion of DAG should not take this much time.
   
   ### How to reproduce
   
   _No response_
   
   ### Operating System
   
   nix
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Other Docker-based deployment
   
   ### Deployment details
   
   _No response_
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on issue #20249: Slow deletion of a DAG

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #20249:
URL: https://github.com/apache/airflow/issues/20249#issuecomment-992182957


   Thanks for opening your first issue here! Be sure to follow the issue template!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kushsharma commented on issue #20249: DAG deletion is slow due to lack of database indexes on dag_id

Posted by GitBox <gi...@apache.org>.
kushsharma commented on issue #20249:
URL: https://github.com/apache/airflow/issues/20249#issuecomment-992604422


   There are composite indexes on some of the tables, I don't think we need explicit index of `dag_id` for them like 
   - `task_reschedule`: `idx_task_reschedule_dag_task_date" btree (dag_id, task_id, execution_date)`
   - `sensor_instance`: `ti_primary_key" UNIQUE, btree (dag_id, task_id, execution_date)`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil closed issue #20249: DAG deletion is slow due to lack of database indexes on dag_id

Posted by GitBox <gi...@apache.org>.
kaxil closed issue #20249:
URL: https://github.com/apache/airflow/issues/20249


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr commented on issue #20249: Slow deletion of a DAG

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #20249:
URL: https://github.com/apache/airflow/issues/20249#issuecomment-992244972


   Let’s add indexes to those tables then. Would you be interested in looking into those various tables and create one for each of them?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] sanjayhallan commented on issue #20249: DAG deletion is slow due to lack of database indexes on dag_id

Posted by GitBox <gi...@apache.org>.
sanjayhallan commented on issue #20249:
URL: https://github.com/apache/airflow/issues/20249#issuecomment-992540275


   bit of analysis for this ticket, these are the models which do and dont have indexes on dag id
   
   ```>>> for model in models.base.Base._decl_class_registry.values():
   ...     if hasattr(model, "dag_id"):
   ...             print(model)
   ... 
   <class 'airflow.models.log.Log'> yes btree index
   <class 'airflow.models.taskfail.TaskFail'> no 
   <class 'airflow.models.taskreschedule.TaskReschedule'> no
   <class 'airflow.models.xcom.BaseXCom'> no
   <class 'airflow.models.taskinstance.TaskInstance'> no
   <class 'airflow.models.dagrun.DagRun'> no
   <class 'airflow.models.dag.DagTag'> no
   <class 'airflow.models.dag.DagModel'> yes PK index
   <class 'airflow.models.renderedtifields.RenderedTaskInstanceFields'> no
   <class 'airflow.models.sensorinstance.SensorInstance'> no
   <class 'airflow.models.slamiss.SlaMiss'> yes btree index
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org