You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2019/01/04 03:59:00 UTC

[jira] [Commented] (AIRFLOW-3627) Queries in /task_stats of both /www & /www_rbac can be refined significantly by minor changes

    [ https://issues.apache.org/jira/browse/AIRFLOW-3627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733814#comment-16733814 ] 

ASF GitHub Bot commented on AIRFLOW-3627:
-----------------------------------------

XD-DENG commented on pull request #4433: [AIRFLOW-3627] Improve queries performance in /task_stats in views.py by ~20%
URL: https://github.com/apache/incubator-airflow/pull/4433
 
 
   
   ### Jira
   
     - https://issues.apache.org/jira/browse/AIRFLOW-3627
   
   ### Description
   
   `/task_stats` is used quite heavily. Every time the main page is loaded, it will be called. But actually the performance of it can be improved significantly by minor changes.
   
   1. In the sqlalchemy `.filter()` statement, no need to explicitly add `==True`. The value itself is already Boolean. Doing another explicit comparison is adding heavy overhead. This change helps improve query performance significantly
   
   2. We can merge the multiple `.filter()`s (https://docs.sqlalchemy.org/en/latest/orm/query.html#sqlalchemy.orm.query.Query.filter).
   
   **After these changes, the query time can be reduced by ~20%**
   
   #### Benchmarking
   
   ```python
   from airflow.settings import Session
   from airflow import models
   from datetime import datetime
   
   def test0():
       s.query(DM).filter(DM.is_active == True).filter(DM.dag_id == 'tutorial')
   
   def test1():
       s.query(DM).filter(DM.is_active).filter(DM.dag_id == 'tutorial')
   
   def test2():
       s.query(DM).filter(DM.is_active, DM.dag_id == 'tutorial')
   
   if __name__ == '__main__':
   
       import timeit
   
       s = Session()
       DM = models.DagModel
       n = 300000
   
       print(s.query(DM).filter(DM.is_paused == True).count() == s.query(DM).filter(DM.is_paused).count())
   
       print(timeit.timeit("test0()", number=n, setup="from __main__ import test0"))
       print(timeit.timeit("test1()", number=n, setup="from __main__ import test1"))
       print(timeit.timeit("test2()", number=n, setup="from __main__ import test2"))
   
       print("\n --- Run Tests in Reverse Order --- \n")
   
       print(timeit.timeit("test2()", number=n, setup="from __main__ import test2"))
       print(timeit.timeit("test1()", number=n, setup="from __main__ import test1"))
       print(timeit.timeit("test0()", number=n, setup="from __main__ import test0"))
   ```
   
   Result:
   ```
   True
   
   94.96372179000173
   65.610312083998
   65.09665720800695
   
    --- Run Tests in Reverse Order ---
   
   61.59604939300334
   65.70635982099338
   83.37481481899158
   ```
   ### Code Quality
   
   - [x] Passes `flake8`
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Queries in /task_stats of both /www & /www_rbac can be refined significantly by minor changes
> ---------------------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-3627
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3627
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: webserver
>            Reporter: Xiaodong DENG
>            Assignee: Xiaodong DENG
>            Priority: Critical
>
> /task_stats is used heavily in UI. Every time the main page is loaded, it will be called.
> But actually the performance of it can be improved significantly by minor changes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)