You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/07/08 16:29:24 UTC
[GitHub] [airflow] dstandish commented on pull request #24925: Fix cartesian join re dataset deps in update_state
dstandish commented on PR #24925:
URL: https://github.com/apache/airflow/pull/24925#issuecomment-1179171012
@jedcunningham the reason for the subquery is to avoid join explosion.
We could do this and it would get to the same place:
```
SELECT distinct ddr.dag_id
FROM dataset_task_ref dtr
JOIN dataset_dag_ref ddr ON ddr.dataset_id = dtr.dataset_id
WHERE dtr.dag_id = 'dag1';
```
But because DTR can have multiple dags per dataset, if we don't dedupe before joining, you'll double the matches in DDR.
So for example suppose you have
dataset1 > [dag3,dag4,dag5,dag6]
Then suppose you have
dag1+task1 > dataset1
Then when you join to DDR you'll get 4 rows.
But if you have
dag1+task1 > dataset1
dag1+task2 > dataset1
dag1+task3 > dataset1
Then this join produces 12 rows.
But if you use the subquery, it will only produce 4 rows.
And this is what it looks like with the subquery:
SELECT ddr.dag_id
FROM dataset_dag_ref ddr
JOIN (
SELECT DISTINCT dtr.dataset_id AS dataset_id
FROM dataset_task_ref dtr
WHERE dtr.dag_id = 'dag1'
) AS a ON a.dataset_id = ddr.dataset_id
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org