You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/07/23 04:19:06 UTC

[GitHub] [airflow] dstandish opened a new pull request, #25250: Add dataset to dataset event response

dstandish opened a new pull request, #25250:
URL: https://github.com/apache/airflow/pull/25250

   Added `dataset` node on dataset event object in API response.
   
   Response now looks like this:
   
   ```
   {
     "dataset_events": [
       {
         "created_at": "2022-07-11T02:16:25.934787+00:00",
         "dataset": {
           "created_at": "2022-07-11T02:15:52.567828+00:00",
           "extra": null,
           "id": 25,
           "updated_at": "2022-07-11T02:15:52.567838+00:00",
           "uri": "s3://downstream_1_task/dataset_other.txt"
         },
         "dataset_id": 25,
         "extra": null,
         "id": 17,
         "source_dag_id": "dag3",
         "source_map_index": -1,
         "source_run_id": "dataset_triggered__2022-07-11T02:16:08.050068+00:00",
         "source_task_id": "downstream_1"
       },
   ...
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] dstandish commented on pull request #25250: Add URI to dataset event response

Posted by GitBox <gi...@apache.org>.
dstandish commented on PR #25250:
URL: https://github.com/apache/airflow/pull/25250#issuecomment-1194092914

   > For consistency should we call the field `dataset_uri`? (It's longer so I'm not a huge fan of it, but 🤷)
   
   thought about that actually.  probably a good idea.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] bbovenzi commented on pull request #25250: Add dataset to dataset event response

Posted by GitBox <gi...@apache.org>.
bbovenzi commented on PR #25250:
URL: https://github.com/apache/airflow/pull/25250#issuecomment-1194006553

   > @bbovenzi what's the context again? TI (i.e. downstream) or DAG (i.e. upstream)? Either way we know, or could know, the datasets that it maps to, and could provide a dictionary from ID to URI (e.g. in serialized repr). Could we use something like that to lookup the ID in the event response? (and therefore keep the API "normalized" _and_ avoid multiple queries?
   
   It's both types of events. But as I think about it more, let's just have the UI do the extra requests. Although, we need to get the correct datasets details without querying for all datasets. So can we pass an array of dataset_ids we want to get details for or specify "datasets upstream of this dag id" or "datasets downstream of this dag id - task id"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] ashb commented on pull request #25250: Add URI to dataset event response

Posted by GitBox <gi...@apache.org>.
ashb commented on PR #25250:
URL: https://github.com/apache/airflow/pull/25250#issuecomment-1194083324

   For consistency should we call the field `dataset_uri`? (It's longer so I'm not a huge fan of it, but :shrug:)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] dstandish commented on pull request #25250: Add dataset to dataset event response

Posted by GitBox <gi...@apache.org>.
dstandish commented on PR #25250:
URL: https://github.com/apache/airflow/pull/25250#issuecomment-1193989243

   > > You are right @ephraimbuddy , was trying to add something that Brent needed but there must be a better way
   > 
   > A few options:
   > 
   > 1. The UI is just inefficient and has to do multiple API requests.
   > 2. We make a UI-only API endpoint with exactly what we need
   > 3. We refactor the id and display name for a dataset. Task id, DAG id and Run id are all readable
   
   Yeah i guess what I'm agreeing with is, maybe it doesn't make sense to include the full dataset object (as i've done here) in the event API.  cus thinking, if you're just after the events, i guess it's a bit inefficient.  but, we could just add just the URI, for example, instead of the full object.
   
   and, that's what i set out to do initially, but then i couldn't immediately figure out how to map the response field to `dataset.uri` (i.e. the attr on the attr -- there's a `datatset` attr on DatasetEvent through SQLA relationship).  i could made it happen by adding a `uri` property on DatasetEvent but i didn't really want to.... so i just added the full object.  if you know how to map `dataset.uri` to a field in response, lemme know.
   
   ---
   
   all of these options, i don't have a strong feeling either way really.  but i don't think there's anything wrong with having the endpoint be more friendly than the database schema.  the database schema is optimized for db access and storage.  but the rest API is not bound by those considerations necessarily.
   
   @bbovenzi what's the context again? TI (i.e. downstream) or DAG (i.e. upstream)?  Either way we know, or could know, the datasets that it maps to, and could provide a dictionary from ID to URI (e.g. in serialized repr).  Could we use something like that to lookup the ID in the event response?  (and therefore keep the API "normalized" _and_ avoid multiple queries?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] bbovenzi commented on pull request #25250: Add dataset to dataset event response

Posted by GitBox <gi...@apache.org>.
bbovenzi commented on PR #25250:
URL: https://github.com/apache/airflow/pull/25250#issuecomment-1193802421

   > You are right @ephraimbuddy , was trying to add something that Brent needed but there must be a better way
   
   A few options:
   
   1. The UI is just inefficient and has to do multiple API requests.
   2. We make a UI-only API endpoint with exactly what we need
   3. We refactor the id and display name for a dataset. Task id, DAG id and Run id are all readable


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] dstandish commented on pull request #25250: Add URI to dataset event response

Posted by GitBox <gi...@apache.org>.
dstandish commented on PR #25250:
URL: https://github.com/apache/airflow/pull/25250#issuecomment-1194079246

   i added just URI to dataset event response at top level. i think it's a good solution.  it's reasonable to include the logical key along with the surr key in a response like this..... certainly more so than including the full object nested. and it saves complexity in UI so, let's do it!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] dstandish commented on pull request #25250: Add dataset to dataset event response

Posted by GitBox <gi...@apache.org>.
dstandish commented on PR #25250:
URL: https://github.com/apache/airflow/pull/25250#issuecomment-1193331492

   > For me, I prefer that we just have the ID of the dataset in the events table instead of the whole dataset. However, if we want the dataset to be there, I think we should do it at the schema level and not extract it at the endpoint level. This way it's validated and guarded from future changes
   
   You are right @ephraimbuddy , was trying to add something that Brent needed but there must be a better way


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] dstandish merged pull request #25250: Add URI to dataset event response

Posted by GitBox <gi...@apache.org>.
dstandish merged PR #25250:
URL: https://github.com/apache/airflow/pull/25250


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] dstandish commented on pull request #25250: Add dataset to dataset event response

Posted by GitBox <gi...@apache.org>.
dstandish commented on PR #25250:
URL: https://github.com/apache/airflow/pull/25250#issuecomment-1194004358

   update: figured out how to just add URI... will do that


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] dstandish commented on pull request #25250: Add URI to dataset event response

Posted by GitBox <gi...@apache.org>.
dstandish commented on PR #25250:
URL: https://github.com/apache/airflow/pull/25250#issuecomment-1194097345

   > > For consistency should we call the field `dataset_uri`? (It's longer so I'm not a huge fan of it, but 🤷)
   > 
   > thought about that actually. probably a good idea.
   
   actually yeah def should do


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org