You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/07/08 16:29:24 UTC

[GitHub] [airflow] dstandish commented on pull request #24925: Fix cartesian join re dataset deps in update_state

dstandish commented on PR #24925:
URL: https://github.com/apache/airflow/pull/24925#issuecomment-1179171012

   @jedcunningham the reason for the subquery is to avoid join explosion.  
   
   We could do this and it would get to the same place:
   ```
   SELECT distinct ddr.dag_id
   FROM dataset_task_ref dtr
   JOIN dataset_dag_ref ddr ON ddr.dataset_id = dtr.dataset_id
   WHERE dtr.dag_id = 'dag1';
   ```
   
   But because DTR can have multiple dags per dataset, if we don't dedupe before joining, you'll double the matches in DDR.
   
   So for example suppose you have 
   
   dataset1 > [dag3,dag4,dag5,dag6]
   
   Then suppose you have 
   
   dag1+task1 > dataset1
   
   Then when you join to DDR you'll get 4 rows.
   
   But if you have
   
   dag1+task1 > dataset1
   dag1+task2 > dataset1
   dag1+task3 > dataset1
   
   Then this join produces 12 rows.
   
   But if you use the subquery, it will only produce 4 rows.
   
   And this is what it looks like with the subquery:
   
   SELECT ddr.dag_id
   FROM dataset_dag_ref ddr
   JOIN (
       SELECT DISTINCT dtr.dataset_id AS dataset_id
       FROM dataset_task_ref dtr
       WHERE dtr.dag_id = 'dag1'
   ) AS a ON a.dataset_id = ddr.dataset_id
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org