You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/02/07 17:59:11 UTC
[GitHub] [airflow] SamWheating opened a new issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs
SamWheating opened a new issue #21397:
URL: https://github.com/apache/airflow/issues/21397
### Apache Airflow version
2.2.2
### What happened
When we upgraded from Airflow 2.1.2 to 2.2.2 we noticed that our MySQL instance was running near 100% CPU utilization while it was closer to 20% on Airflow 2.1.2.
After a long investigation, we found that the queries introduced in https://github.com/apache/airflow/pull/17121 are extremely hard on the database, especially when there are multiple schedulers, a high value of `parsing_processes` and a large number of DAGs.
Our setup is as follows:
- 4x Schedulers
- ~20k rows in the `dag` table
- ~10k DAG files
- `parsing_processes=32`
So each time a DAG is parsed, a query like this will be run:
```sql
UPDATE dag
SET is_active=0
WHERE dag.fileloc = '/path/to_my/dag.py'
AND dag.is_active = 1
AND dag.dag_id NOT IN ('my-dag-1', 'my-dag-2');
```
And because `dag` isn't indexed on `fileloc`, the query ends up doing a full-table-scan (or nearly a full-table-scan), and this is repeated for every single file which is processed.
When I added these queries, I tested the change in a local `breeze` environment with a relatively small number of DAGs and thus did not notice the performance implications.
At our scale / configuration, we have approximately 128 of these poorly-performant queries running in parallel, each scanning approximately 20,000 rows. Understandably this was really hard on the database which ended up drastically impacting the performance of other queries.
We were able to reduce the impact by lowering the `parsing_processes`, cleaning up old entries in the `dag` table and increasing the `min_file_processing_interval`, but none of these mitigations really address the root of the problem.
We are currently working on a fix which moves this cleanup to the DAG Processor manager and eliminates un-indexed queries and should be able to submit a preliminary pull request for review in the next few days.
### What you expected to happen
Removing DAGs which no longer exist in files should not put so much strain on the database.
### How to reproduce
_No response_
### Operating System
Debian GNU/Linux 10 (buster)
### Versions of Apache Airflow Providers
_No response_
### Deployment
Other 3rd-party Helm chart
### Deployment details
_No response_
### Anything else
_No response_
### Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] SamWheating commented on issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs
Posted by GitBox <gi...@apache.org>.
SamWheating commented on issue #21397:
URL: https://github.com/apache/airflow/issues/21397#issuecomment-1077901406
Closed in https://github.com/apache/airflow/pull/21399
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] ashb edited a comment on issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs
Posted by GitBox <gi...@apache.org>.
ashb edited a comment on issue #21397:
URL: https://github.com/apache/airflow/issues/21397#issuecomment-1031800935
Oh thanks Mysql.
Ah yes, that's why we've already got the fileloc_hash column on DagCode and SerializedDag tables.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] ashb commented on issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs
Posted by GitBox <gi...@apache.org>.
ashb commented on issue #21397:
URL: https://github.com/apache/airflow/issues/21397#issuecomment-1031796932
It sounds like this would be fixed by adding an index on dag model table, right?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] SamWheating commented on issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs
Posted by GitBox <gi...@apache.org>.
SamWheating commented on issue #21397:
URL: https://github.com/apache/airflow/issues/21397#issuecomment-1031799103
Yes, but we can't add an index on the `fileloc` column because the column size is too large (I think its a `VARCHAR(2000)` or something)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] SamWheating commented on issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs
Posted by GitBox <gi...@apache.org>.
SamWheating commented on issue #21397:
URL: https://github.com/apache/airflow/issues/21397#issuecomment-1031808756
> Ah yes, that's why we've already got the fileloc_hash column on DagCode and SerializedDag tables.
In this case would it make sense to add a `fileloc_hash` column to the `DAG` table? I can also push my proposed alternate implementation shortly.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] SamWheating closed issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs
Posted by GitBox <gi...@apache.org>.
SamWheating closed issue #21397:
URL: https://github.com/apache/airflow/issues/21397
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] ashb commented on issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs
Posted by GitBox <gi...@apache.org>.
ashb commented on issue #21397:
URL: https://github.com/apache/airflow/issues/21397#issuecomment-1031809580
It might, but if you have another solution that feels less hacky (as I don't like the "split" column all that much) lets take a look
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] ashb commented on issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs
Posted by GitBox <gi...@apache.org>.
ashb commented on issue #21397:
URL: https://github.com/apache/airflow/issues/21397#issuecomment-1031800935
Oh thanks Mysql.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] potiuk commented on issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs
Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #21397:
URL: https://github.com/apache/airflow/issues/21397#issuecomment-1031891355
> Oh thanks Mysql.
A lot.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] SamWheating commented on issue #21397: Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs
Posted by GitBox <gi...@apache.org>.
SamWheating commented on issue #21397:
URL: https://github.com/apache/airflow/issues/21397#issuecomment-1049268618
Now that one fix has been proposed (https://github.com/apache/airflow/pull/21399) and I have validated it in our Airflow deployments, what do you think about the proposed implementation? Any preferences between:
1) Adding an indexed `fileloc_hash` to the DAG table and refactoring the existing code to use this index.
2) bulk-deactivating deactivating files from the processor manager (as proposed above)
I am happy to help implement either change but would really like some careful review and discussion of the proposed methods.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org