You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by "error418 (via GitHub)" <gi...@apache.org> on 2023/12/05 20:11:19 UTC

[PR] fix(datasets): persist dataset extra dict to dataset event [airflow]

error418 opened a new pull request, #36075:
URL: https://github.com/apache/airflow/pull/36075

   <!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at
   
      http://www.apache.org/licenses/LICENSE-2.0
   
    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
    -->
   
   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of an existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   
   
   <!-- Please keep an empty line above the dashes. -->
   ---
   Persists Dataset extras to the according DatasetEvent to be used in Tasks invoked by Data aware scheduling. Please see linked issue for more information
   
   closes: #35297
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] fix(datasets): persist dataset extra dict to dataset event [airflow]

Posted by "blag (via GitHub)" <gi...@apache.org>.
blag commented on PR #36075:
URL: https://github.com/apache/airflow/pull/36075#issuecomment-1843333200

   Please refer to the PR implementing this, and the discussion about an `extra` parameter [here](https://github.com/apache/airflow/pull/25419#pullrequestreview-1061066994).
   
   IIRC, the intent here was _not_ to break database normalization and copy `DatasetModel.extra` for every DatasetEvent, it was to enable third party `DatasetEventManager`s to record additional data from external sources (including non-Airflow sources) with a minimum of fuss. The `extra` fields of `DatasetModel` and `DatasetEvent` are similarly named, but used for different and distinct purposes, and I did _not_ intend for `DatasetEvent.extra` to be copied from `DatasetModel.extra`.
   
   Given that, I would not expect `DatasetEvent.extra` to be used within Airflow, as if we wanted to store more information in the `DatasetEvent` table, we would just update the database schema. However, just because it is not used internally within Airflow does not mean may not be useful.
   
   IMO #35297 is invalid, because tasks should be able to query the Airflow DB directly for the dataset extra field using the keys in `triggering_dataset_events`, and should _absolutely not_ expect the dataset extra field to be copied to all dataset event extra fields.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] fix(datasets): persist dataset extra dict to dataset event [airflow]

Posted by "error418 (via GitHub)" <gi...@apache.org>.
error418 commented on PR #36075:
URL: https://github.com/apache/airflow/pull/36075#issuecomment-1842371559

   @uranusjr I also thought the implementation was intended this way. I removed the merge with the latest commit to this PR for the review.
   
   In case the discussion comes to the result that a merge of the dictionaries is required we can quickly roll back.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] fix(datasets): persist dataset extra dict to dataset event [airflow]

Posted by "blag (via GitHub)" <gi...@apache.org>.
blag commented on PR #36075:
URL: https://github.com/apache/airflow/pull/36075#issuecomment-1843390310

   The docs are not optimally worded and definitely conflate datasets and dataset events, possibly leading to this confusion. See [here](https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/datasets.html).
   
   > Fetching information from a Triggering Dataset Event
   > 
   > A triggered DAG can fetch information from the Dataset that triggered it using the `triggering_dataset_events` template or parameter.
   
   Do datasets trigger DAGs, or do dataset _events_ trigger DAGs? The title of this section and its text seem to disagree. The answer is: dataset events trigger DAGs.
   
   A one word change would vastly improve the precision of this section:
   
   > A triggered DAG can fetch information from the Dataset event that triggered it using the `triggering_dataset_events` template or parameter.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] fix(datasets): persist dataset extra dict to dataset event [airflow]

Posted by "uranusjr (via GitHub)" <gi...@apache.org>.
uranusjr commented on PR #36075:
URL: https://github.com/apache/airflow/pull/36075#issuecomment-1842180077

   Do we need to do the merge at all? From what I can tell, the `extra` parameter on `register_dataset_change` is currently not used anywhere, and I suspect the original intention was to use that field to forward extra from the dataset directly to dataset event instead. Maybe we should just remove `extra=None` from `register_dataset_change` and just do the forwarding.
   
   cc @blag for some history on this


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] fix(datasets): persist dataset extra dict to dataset event [airflow]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #36075:
URL: https://github.com/apache/airflow/pull/36075#issuecomment-1905060329

   This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] fix(datasets): persist dataset extra dict to dataset event [airflow]

Posted by "error418 (via GitHub)" <gi...@apache.org>.
error418 closed pull request #36075: fix(datasets): persist dataset extra dict to dataset event
URL: https://github.com/apache/airflow/pull/36075


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org