Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/07/04 09:39:02 UTC

[GitHub] [airflow] ashb commented on a diff in pull request #24743: Get dataset-driven scheduling working

ashb commented on code in PR #24743:
URL: https://github.com/apache/airflow/pull/24743#discussion_r912820582


##########
airflow/models/dagrun.py:
##########
@@ -631,6 +631,32 @@ def update_state(
         session.merge(self)
         # We do not flush here for performance reasons (it increases the query count by +20)
 
+        from airflow.models import Dataset
+        from airflow.models.dataset_dag_run_event import DatasetDagRunEvent as DDRE
+        from airflow.models.serialized_dag import SerializedDagModel
+
+        datasets = []
+        for task in self.dag.tasks:
+            for outlet in getattr(task, '_outlets', []):
+                if isinstance(outlet, Dataset):
+                    datasets.append(outlet)
+        dataset_ids = [x.get_dataset_id(session=session) for x in datasets]
+        events_to_process = session.query(DDRE).filter(DDRE.dataset_id.in_(dataset_ids)).all()

Review Comment:
   > add a new top-level scheduler query that partitions DAGs using the same kind of SKIP LOCKED logic as the main scheduler query, and creates the necessary dag runs
   
   I was hoping explicitly not to have to do this (as mentioned in the thread I linked).
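   
   For context, a minimal sketch (not code from this PR) of the pattern the quoted suggestion describes: a separate query that claims pending dataset events with `SELECT ... FOR UPDATE SKIP LOCKED`, so concurrent schedulers partition the rows between them. `DatasetDagRunEvent` is taken from the diff above; the ordering column and batch size are illustrative assumptions.
   
   ```python
   # Illustrative sketch only, not the PR's implementation.
   from sqlalchemy.orm import Session
   
   from airflow.models.dataset_dag_run_event import DatasetDagRunEvent as DDRE
   
   
   def claim_dataset_events(session: Session, batch_size: int = 20):
       """Claim up to ``batch_size`` pending dataset events.
   
       ``skip_locked=True`` emits SELECT ... FOR UPDATE SKIP LOCKED, so rows
       already locked by another scheduler are skipped and each scheduler
       ends up holding a disjoint batch.
       """
       return (
           session.query(DDRE)
           .order_by(DDRE.id)  # assumed surrogate-key column, for stable batching
           .limit(batch_size)
           .with_for_update(skip_locked=True)
           .all()
       )
   ```
   
   This is exactly the kind of extra top-level scheduler query the comment above hopes to avoid.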



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org