You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/02/07 17:59:11 UTC

[GitHub] [airflow] SamWheating opened a new issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs

SamWheating opened a new issue #21397:
URL: https://github.com/apache/airflow/issues/21397


   ### Apache Airflow version
   
   2.2.2
   
   ### What happened
   
   When we upgraded from Airflow 2.1.2 to 2.2.2 we noticed that our MySQL instance was running near 100% CPU utilization while it was closer to 20% on Airflow 2.1.2.
   
   After a long investigation, we found that the queries introduced in https://github.com/apache/airflow/pull/17121 are extremely hard on the database, especially when there are multiple schedulers, a high value of `parsing_processes` and a large number of DAGs. 
   
   Our setup is as follows:
    - 4x Schedulers
    - ~20k rows in the `dag` table
    - ~10k DAG files
    - `parsing_processes=32`
    
   So each time a DAG is parsed, a query like this will be run:
   ```sql
   UPDATE dag
   SET is_active=0
   WHERE dag.fileloc = '/path/to_my/dag.py'
   AND dag.is_active = 1
   AND dag.dag_id NOT IN ('my-dag-1', 'my-dag-2');
   ```
   
   And because `dag` isn't indexed on `fileloc`, the query ends up doing a full-table-scan (or nearly a full-table-scan), and this is repeated for every single file which is processed. 
   
   When I added these queries, I tested the change in a local `breeze` environment with a relatively small number of DAGs and thus did not notice the performance implications.
   
   At our scale / configuration, we have approximately 128 of these poorly-performant queries running in parallel, each scanning approximately 20,000 rows. Understandably this was really hard on the database which ended up drastically impacting the performance of other queries. 
   
   We were able to reduce the impact by lowering the `parsing_processes`, cleaning up old entries in the `dag` table and increasing the `min_file_processing_interval`, but none of these mitigations really address the root of the problem. 
   
   We are currently working on a fix which moves this cleanup to the DAG Processor manager and eliminates un-indexed queries and should be able to submit a preliminary pull request for review in the next few days. 
   
   ### What you expected to happen
   
   Removing DAGs which no longer exist in files should not put so much strain on the database. 
   
   ### How to reproduce
   
   _No response_
   
   ### Operating System
   
   Debian GNU/Linux 10 (buster)
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Other 3rd-party Helm chart
   
   ### Deployment details
   
   _No response_
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] SamWheating commented on issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs

Posted by GitBox <gi...@apache.org>.
SamWheating commented on issue #21397:
URL: https://github.com/apache/airflow/issues/21397#issuecomment-1077901406


   Closed in https://github.com/apache/airflow/pull/21399


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb edited a comment on issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs

Posted by GitBox <gi...@apache.org>.
ashb edited a comment on issue #21397:
URL: https://github.com/apache/airflow/issues/21397#issuecomment-1031800935


   Oh thanks Mysql.
   
   Ah yes, that's why we've already got the fileloc_hash column on DagCode and SerializedDag tables.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #21397:
URL: https://github.com/apache/airflow/issues/21397#issuecomment-1031796932


   It sounds like this would be fixed by adding an index on dag model table, right?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] SamWheating commented on issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs

Posted by GitBox <gi...@apache.org>.
SamWheating commented on issue #21397:
URL: https://github.com/apache/airflow/issues/21397#issuecomment-1031799103


   Yes, but we can't add an index on the `fileloc` column because the column size is too large (I think its a `VARCHAR(2000)` or something)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] SamWheating commented on issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs

Posted by GitBox <gi...@apache.org>.
SamWheating commented on issue #21397:
URL: https://github.com/apache/airflow/issues/21397#issuecomment-1031808756


   > Ah yes, that's why we've already got the fileloc_hash column on DagCode and SerializedDag tables.
   
   In this case would it make sense to add a `fileloc_hash` column to the `DAG` table? I can also push my proposed alternate implementation shortly. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] SamWheating closed issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs

Posted by GitBox <gi...@apache.org>.
SamWheating closed issue #21397:
URL: https://github.com/apache/airflow/issues/21397


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #21397:
URL: https://github.com/apache/airflow/issues/21397#issuecomment-1031809580


   It might, but if you have another solution that feels less hacky (as I don't like the "split" column all that much) lets take a look


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #21397:
URL: https://github.com/apache/airflow/issues/21397#issuecomment-1031800935


   Oh thanks Mysql.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #21397:
URL: https://github.com/apache/airflow/issues/21397#issuecomment-1031891355


   > Oh thanks Mysql.
   
   A lot.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] SamWheating commented on issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs

Posted by GitBox <gi...@apache.org>.
SamWheating commented on issue #21397:
URL: https://github.com/apache/airflow/issues/21397#issuecomment-1049268618


   Now that one fix has been proposed (https://github.com/apache/airflow/pull/21399) and I have validated it in our Airflow deployments, what do you think about the proposed implementation? Any preferences between:
   
   1) Adding an indexed `fileloc_hash` to the DAG table and refactoring the existing code to use this index.
   2) bulk-deactivating deactivating files from the processor manager (as proposed above)
   
   I am happy to help implement either change but would really like some careful review and discussion of the proposed methods.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org