You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2019/01/04 03:59:00 UTC
[jira] [Commented] (AIRFLOW-3627) Queries in /task_stats of both
/www & /www_rbac can be refined significantly by minor changes
[ https://issues.apache.org/jira/browse/AIRFLOW-3627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733814#comment-16733814 ]
ASF GitHub Bot commented on AIRFLOW-3627:
-----------------------------------------
XD-DENG commented on pull request #4433: [AIRFLOW-3627] Improve queries performance in /task_stats in views.py by ~20%
URL: https://github.com/apache/incubator-airflow/pull/4433
### Jira
- https://issues.apache.org/jira/browse/AIRFLOW-3627
### Description
`/task_stats` is used quite heavily. Every time the main page is loaded, it will be called. But actually the performance of it can be improved significantly by minor changes.
1. In the sqlalchemy `.filter()` statement, no need to explicitly add `==True`. The value itself is already Boolean. Doing another explicit comparison is adding heavy overhead. This change helps improve query performance significantly
2. We can merge the multiple `.filter()`s (https://docs.sqlalchemy.org/en/latest/orm/query.html#sqlalchemy.orm.query.Query.filter).
**After these changes, the query time can be reduced by ~20%**
#### Benchmarking
```python
from airflow.settings import Session
from airflow import models
from datetime import datetime
def test0():
s.query(DM).filter(DM.is_active == True).filter(DM.dag_id == 'tutorial')
def test1():
s.query(DM).filter(DM.is_active).filter(DM.dag_id == 'tutorial')
def test2():
s.query(DM).filter(DM.is_active, DM.dag_id == 'tutorial')
if __name__ == '__main__':
import timeit
s = Session()
DM = models.DagModel
n = 300000
print(s.query(DM).filter(DM.is_paused == True).count() == s.query(DM).filter(DM.is_paused).count())
print(timeit.timeit("test0()", number=n, setup="from __main__ import test0"))
print(timeit.timeit("test1()", number=n, setup="from __main__ import test1"))
print(timeit.timeit("test2()", number=n, setup="from __main__ import test2"))
print("\n --- Run Tests in Reverse Order --- \n")
print(timeit.timeit("test2()", number=n, setup="from __main__ import test2"))
print(timeit.timeit("test1()", number=n, setup="from __main__ import test1"))
print(timeit.timeit("test0()", number=n, setup="from __main__ import test0"))
```
Result:
```
True
94.96372179000173
65.610312083998
65.09665720800695
--- Run Tests in Reverse Order ---
61.59604939300334
65.70635982099338
83.37481481899158
```
### Code Quality
- [x] Passes `flake8`
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
> Queries in /task_stats of both /www & /www_rbac can be refined significantly by minor changes
> ---------------------------------------------------------------------------------------------
>
> Key: AIRFLOW-3627
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3627
> Project: Apache Airflow
> Issue Type: Improvement
> Components: webserver
> Reporter: Xiaodong DENG
> Assignee: Xiaodong DENG
> Priority: Critical
>
> /task_stats is used heavily in UI. Every time the main page is loaded, it will be called.
> But actually the performance of it can be improved significantly by minor changes.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)