You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/05/29 10:01:50 UTC

[GitHub] [airflow] turbaszek edited a comment on pull request #8227: Add run_type to DagRun

turbaszek edited a comment on pull request #8227:
URL: https://github.com/apache/airflow/pull/8227#issuecomment-635886652


   Ok, I did a quick test on 30k DagRun records.
   
   ```python
   def make():
       with create_session() as s:
           for j, t in enumerate(DagRunType):
               for i in range(10000):
                   now = days_ago(i)
                   run_id = f"{t.value}__{str(uuid.uuid4())}"
                   run = DagRun(
                       dag_id=f"test_{j}_{t.value}",
                       run_id=run_id,
                       execution_date=now,
                       run_type=t.value,
                   )
                   s.add(run)
           s.commit()
   
   if __name__ == '__main__':
       make()
   
       DR = DagRun
       N = 100
   
       def query(kwargs):
           with create_session() as s:
               s.query(DR).filter_by(**kwargs)
   
       @timing(N)
       @repeat(N)
       def test(kwargs):
           query(kwargs)
   
   
       kwargs = (
           ("dag_id", "test_0_backfill"),
           ("run_id", "backfill__b7d81c6b-a460-4a47-97c0-b2d587a48b12"),
           ("state", "running"),
           ("run_type", "backfill")
       )
       for i in range(1, 5):
           cs = combinations(kwargs, i)
           for c in cs:
               k = dict(c)
               print(f"Case: {k.keys()}")
               test(k)
   ```
   
   Then for each filter keys combination I counted how many times given idex yielded the minimum time, here is the result:
   ```json
   "Index('dag_id_state_type', dag_id, _state, run_type),": 4,
   "Index('dag_id_state', dag_id, _state),Index('dag_id_type', dag_id, run_type),": 0,
   "Index('dag_id_type', dag_id, run_type),": 4,
   "Index('dag_id_state', dag_id, _state),": 7}
   ```
   
   Of course this means that "if you use each combination of keys as often as others" then use `Index('dag_id_state', dag_id, _state)` and for sure do not use a double index, unless you remember to specify which index should be used in query (easy to abuse).
   
   Here are exact result: https://docs.google.com/spreadsheets/d/102Y-u7Uy2YF_-f0y-sxhUIZzNTrfPPYQVqfjBmAp1Oo/edit?usp=sharing
   
   And what you can see is that `Index('dag_id_state_type', dag_id, _state, run_type),` wins sometimes but when it loses it's quite visible. And when `Index('dag_id_type', dag_id, run_type)` wins it's not a spectacular victory. Additionally I am unable to decide what type of queries are performed often than others. 
   
   Thus, I would say that we should keep `Index('dag_id_state', dag_id, _state)`
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org