You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/07/08 14:00:16 UTC
[GitHub] [airflow] dstandish opened a new pull request, #24925: Fix cartesian join re dataset deps in update_state
dstandish opened a new pull request, #24925:
URL: https://github.com/apache/airflow/pull/24925
We use the subquery to get distinct downstream dataset references for the dag (there could be
multiple tasks that touch the same dataset), then join to DDR to get the dags pointing to
those datasets.
resolves https://github.com/apache/airflow/projects/16#card-83978429
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] dstandish commented on pull request #24925: Fix cartesian join re dataset deps in update_state
Posted by GitBox <gi...@apache.org>.
dstandish commented on PR #24925:
URL: https://github.com/apache/airflow/pull/24925#issuecomment-1179171012
@jedcunningham the reason for the subquery is to avoid join explosion.
We could do this and it would get to the same place:
```
SELECT distinct ddr.dag_id
FROM dataset_task_ref dtr
JOIN dataset_dag_ref ddr ON ddr.dataset_id = dtr.dataset_id
WHERE dtr.dag_id = 'dag1';
```
But because DTR can have multiple dags per dataset, if we don't dedupe before joining, you'll double the matches in DDR.
So for example suppose you have
dataset1 > [dag3,dag4,dag5,dag6]
Then suppose you have
dag1+task1 > dataset1
Then when you join to DDR you'll get 4 rows.
But if you have
dag1+task1 > dataset1
dag1+task2 > dataset1
dag1+task3 > dataset1
Then this join produces 12 rows.
But if you use the subquery, it will only produce 4 rows.
And this is what it looks like with the subquery:
SELECT ddr.dag_id
FROM dataset_dag_ref ddr
JOIN (
SELECT DISTINCT dtr.dataset_id AS dataset_id
FROM dataset_task_ref dtr
WHERE dtr.dag_id = 'dag1'
) AS a ON a.dataset_id = ddr.dataset_id
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] jedcunningham commented on pull request #24925: Fix cartesian join re dataset deps in update_state
Posted by GitBox <gi...@apache.org>.
jedcunningham commented on PR #24925:
URL: https://github.com/apache/airflow/pull/24925#issuecomment-1179176960
The distinct would only return the 4, no. Either way, enough bikeshedding from me 👍
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] dstandish commented on pull request #24925: Fix cartesian join re dataset deps in update_state
Posted by GitBox <gi...@apache.org>.
dstandish commented on PR #24925:
URL: https://github.com/apache/airflow/pull/24925#issuecomment-1179179793
> The distinct would only return the 4, no.
@jedcunningham no i was saying the _join_ would produce 12 rows (which would have to be deduped by distinct)
the query does produce the same result i'm just saying it is many-to-many and that's what i'm avoiding with the subquery
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] dstandish merged pull request #24925: Fix cartesian join re dataset deps in update_state
Posted by GitBox <gi...@apache.org>.
dstandish merged PR #24925:
URL: https://github.com/apache/airflow/pull/24925
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] jedcunningham commented on pull request #24925: Fix cartesian join re dataset deps in update_state
Posted by GitBox <gi...@apache.org>.
jedcunningham commented on PR #24925:
URL: https://github.com/apache/airflow/pull/24925#issuecomment-1179130188
Did you profile this against an explicit join with proper conditionals? I'm curious if the subquery is actually faster?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org