Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/09/01 13:35:21 UTC

[GitHub] [airflow] ashb opened a new pull request, #26103: Don't error when multiple tasks produce the same dataset

ashb opened a new pull request, #26103:
URL: https://github.com/apache/airflow/pull/26103

   Previously this was racy, so running multiple DAGs that all update the
   same outlet dataset (or, how I ran into this, mapped tasks) would cause
   some of them to fail with a unique constraint violation.
   
   The fix has two paths, one generic and an optimized version for
   Postgres.
   
   The generic one is likely slightly slower, and uses the pattern that the
   SQLAlchemy docs give for exactly this case. To quote:
   
   > This pattern is ideal for situations such as using PostgreSQL and
   > catching IntegrityError to detect duplicate rows; PostgreSQL normally
   > aborts the entire transaction when such an error is raised, however when
   > using SAVEPOINT, the outer transaction is maintained. In the example
   > below a list of data is persisted into the database, with the occasional
   > "duplicate primary key" record skipped, without rolling back the entire
   > operation:
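   
A minimal sketch of the SAVEPOINT pattern the quote describes, not the code from this PR. It uses the standard-library `sqlite3` driver with raw SQL instead of SQLAlchemy and PostgreSQL so it runs without a server. The `queue` table, its columns, and the function names are hypothetical. Note that SQLite would not abort the outer transaction here even without the savepoint; the savepoint is kept to show the shape that PostgreSQL needs.

```python
import sqlite3


def open_demo_db():
    """Open an in-memory database with a hypothetical queue table."""
    # isolation_level=None: we issue BEGIN/COMMIT/SAVEPOINT ourselves.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE queue ("
        " dataset_id INTEGER NOT NULL,"
        " target_dag_id TEXT NOT NULL,"
        " PRIMARY KEY (dataset_id, target_dag_id))"
    )
    return conn


def insert_skipping_duplicates(conn, rows):
    """Insert each row in its own SAVEPOINT and skip rows that already exist.

    Returns the number of rows actually inserted.
    """
    inserted = 0
    conn.execute("BEGIN")
    for dataset_id, dag_id in rows:
        conn.execute("SAVEPOINT one_row")
        try:
            conn.execute(
                "INSERT INTO queue (dataset_id, target_dag_id) VALUES (?, ?)",
                (dataset_id, dag_id),
            )
        except sqlite3.IntegrityError:
            # Undo only this row; the outer transaction stays usable.
            conn.execute("ROLLBACK TO SAVEPOINT one_row")
        else:
            inserted += 1
        conn.execute("RELEASE SAVEPOINT one_row")
    conn.execute("COMMIT")
    return inserted
```

In SQLAlchemy the same shape is usually written with `session.begin_nested()` around each insert, catching `IntegrityError` outside the nested block.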
   
   However, for PostgreSQL specifically there is a better approach: use
   its `ON CONFLICT DO NOTHING` clause. This also allows us to do the whole
   process in a single SQL statement (vs. 1 select + 1 insert per dataset
   for the slow path).
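
As a rough illustration of the single-statement idea, and not the code from this PR, the sketch below sends one multi-row `INSERT ... ON CONFLICT DO NOTHING`. It assumes SQLite 3.24 or newer, which accepts the same clause as PostgreSQL, so the example runs with only the standard library. The `queue` table, its columns, and the function names are hypothetical.

```python
import sqlite3


def open_demo_db():
    """Open an in-memory database with a hypothetical queue table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE queue ("
        " dataset_id INTEGER NOT NULL,"
        " target_dag_id TEXT NOT NULL,"
        " PRIMARY KEY (dataset_id, target_dag_id))"
    )
    return conn


def enqueue_once(conn, dataset_id, dag_ids):
    """Queue every DAG for a dataset in one statement, ignoring duplicates."""
    dag_ids = list(dag_ids)
    if not dag_ids:
        return
    # One "(?, ?)" group per row, so the whole batch is a single INSERT.
    placeholders = ", ".join(["(?, ?)"] * len(dag_ids))
    params = [value for dag_id in dag_ids for value in (dataset_id, dag_id)]
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT INTO queue (dataset_id, target_dag_id) "
            f"VALUES {placeholders} ON CONFLICT DO NOTHING",
            params,
        )
```

Rows that collide with an existing key, or with an earlier row in the same batch, are dropped silently, so concurrent writers do not need to check for existence first.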
   
   Fixes #25210


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] ashb merged pull request #26103: Don't error when multiple tasks produce the same dataset

Posted by GitBox <gi...@apache.org>.
ashb merged PR #26103:
URL: https://github.com/apache/airflow/pull/26103




[GitHub] [airflow] dstandish commented on pull request #26103: Don't error when multiple tasks produce the same dataset

Posted by GitBox <gi...@apache.org>.
dstandish commented on PR #26103:
URL: https://github.com/apache/airflow/pull/26103#issuecomment-1234396902

   Nice 👍




[GitHub] [airflow] ashb commented on a diff in pull request #26103: Don't error when multiple tasks produce the same dataset

Posted by GitBox <gi...@apache.org>.
ashb commented on code in PR #26103:
URL: https://github.com/apache/airflow/pull/26103#discussion_r960665262


##########
tests/models/test_taskinstance.py:
##########
@@ -1714,10 +1714,19 @@ def test_outlet_datasets(self, create_task_instance, clear_datasets):
         ti.refresh_from_db()
         assert ti.state == TaskInstanceState.SUCCESS
 
+        # check that no other dataset events recorded
+        event = (
+            session.query(DatasetEvent)
+            .join(DatasetEvent.dataset)
+            .filter(DatasetEvent.source_task_instance == ti)
+            .one()
+        )
+        assert event
+        assert event.dataset
+
         # check that one queue record created for each dag that depends on dataset 1
-        assert session.query(DatasetDagRunQueue.target_dag_id).filter(

Review Comment:
   This change is because SQLA was warning us about a (likely) unintended cross-product effect!


