You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/09/02 09:27:08 UTC

[GitHub] [airflow] mobuchowski opened a new issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

mobuchowski opened a new issue #17984:
URL: https://github.com/apache/airflow/issues/17984


   ### Description
   
   Provide way for `LineageBackend` to be notified of task instance start and failed execution. 
   
   Proposed implementation:
   
   1) add start and fail methods to LineageBackend
   2) prepare_lineage decorator should call LineageBackend's start method the same way that apply_lineage calls send_lineage.
   3) call fail method on where on_failure_callback is called.
   
   ### Use case/motivation
   
   I'm working on [OpenLineage](https://openlineage.io/) integration for Airflow 2. [I've created MVP of this.](https://github.com/OpenLineage/OpenLineage/blob/airflow/2/integration/airflow/openlineage/airflow/backend.py)
   
   In contrast to Airflow 1.10's integration via subclassing DAG, `LineageBackend` provides easy way to integrate OpenLineage just via configuration.
   
   However, `LineageBackend.send_lineage` is called just on successful task execution, while we also want to track newly started and failed tasks.  
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] julienledem commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
julienledem commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-957011915


   It makes sense to me. The intension is to track task execution including the lineage (what the task reads and where it writes). We're happy to adjust to this. It depends a bit what you put in the context object


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] julienledem commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
julienledem commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-926986837


   Just adding my two cents to what @mobuchowski is saying as one of the OpenLineage contributors.
   
   I'll start by describing how OpenLineage defines data lineage as this is where opinions might diverge. (my goal being to help make a decision on the discussion above)
   
   OpenLineage tracks lineage in real time as a task runs. 
   - An event is sent when the task starts
   - Another event is sent when the task completes with a status (success, failure, ...)
   
   This gives information in real-time that the data is being modified (or not). In terms of lineage it is important to track data transformations that should have happened as much as transformation that did happen.
   
   Tracking a start event let us know the difference between "it is in progress" and "it's not even started". And tracking a failed events tells us "it's not happening".
   
   It is also important to track failed tasks because, if it is a best practice for tasks to be atomic, a lot of tasks are not.
    - A failed task might have mutated data and it is important to know
    - Failed jobs still have read data
   
   This lineage is really at the task instance level which allows tracking how lineage changes over time and how code changes (for example) impact it.
   That's why the metadata related to that particular task instance is collected:
    - what was the schema of the inputs and outputs at the time it ran
    - what was the version of the code
    - etc
   
   I agree that this is a wider definition than just tracking the current dependencies between tables. It is more granular and it is a superset of what the current Lineage Backend does.
   
   Adding a new contract would work but I would think that over time it would subsume the current Lineage Backend as it would provide more functionality for a goal that is very similar and possibly that would be confusing.
   
   My opinion is that expanding that contract makes sense but we (OpenLineage) are happy to adapt to what the community thinks is best.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] bolkedebruin commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
bolkedebruin commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-972796562


   `on_state_change` can be a `@hookspec` I guess for a plugin, maybe leaving some freedom to a developer to do 
   
   ```
   class MyOperator(BaseOperator):
     def  on_state_change(xxx):
       super()
       do_my_own_stuff()
   
   ```
   
   So that the hook still gets called?
   
   I think most of what you want is included in the standard context, but I am not entirely sure. So we would need to verifiy this.
     


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mobuchowski commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
mobuchowski commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-916862802


   Hey @potiuk, maybe you had time to look at this :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-926490876


   My main question is this: does this actually belong in the Lineage backend, or is this just how OpenLineage works?
   
   Cos as you've described it that is going far beyond my impression of what a lineage system is all about.
   
   I'm not saying it's not useful -- I just question if this belongs in the lineage backend or if it should be elsewhere in Airflow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-926539116


   (I'm willing to discuss this and be told I'm wrong, so nothing I say is a hard no yet.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-925854261


   Can you explain why a lineage backend might even care about a task failing?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mobuchowski commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
mobuchowski commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-926503778


   That might be true, and I'm open for suggestions if you think there's better place. For 1.10, we subclassed DAG, and that approach is way more brittle.
   
   From my perspective, using LineageBackend is nice, because
   
   1. as LineageBackend developer, it gives me access to context, which contains all important metadata I'd want to extract 
   2. as LineageBackend developer, it gives me rather stable API, compared to messing with the internals
   3. for users, it's really easy to setup and configure - it does not require any change to their existing DAGs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mobuchowski edited a comment on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
mobuchowski edited a comment on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-926587909


   >So my understanding of lineage for a data workflow is two things alone:
   
   Just a general thought, but I think all the implementations of LineageBackend did something more than pure inputs and outputs tracking. How the dataset was produced is at least as important. That means consuming metadata.
   
   [Atlas (previously in Airflow)](https://github.com/crazzysun/airflow/blob/0f1d0a7e4e14886a7e74e6b0feaace28b68c64ec/airflow/lineage/backend/atlas/__init__.py)
   [Datahub](https://github.com/linkedin/datahub/blob/214215759011fc983d4bfda16dab5630a02bfa14/metadata-ingestion/src/datahub_provider/_lineage_core.py#L40)
   
   I don't think any of them are interested in failed runs though. 
   
   >In my opinion you are building more than just lineage tracking -- but something larger, as , so it's my opnion that these events do not belong in the lineage backend interface.
   
   Yes, we're interested in broad metadata around data. Ultimately, you could use OpenLineage events in data discovery tool, presenting schemas of your datasets. You could build alerting around data quality facets of those events.  
   
   >I'm leaning towards the ability to add/configure global task hook points for this sort of thing, rather than forcing something in to the lineage api that only OpenLineage wants.
   
   Sure. I'm willing to contribute solution that fits Airflow best. 
   
   >Where should this run? On the scheduler, or the runner?
   
   My best guess is scheduler. Ideally it would be independent of particular executor. My question is: could we get same kind of `context` there? For example, I'd want to look at `PostgresOperator` instance and get `sql` property.   
   
   >If there was a problem launching the runner (pod failure, celery farted) and the task is marked as failed early does that change A1?
   
    I'd want to get notification for this situation - we scheduled the job, but it did not run.
   How is this handled with retries? I think I'd not really need information about retry now. I'd want information about rerun though.
   
   Overall, I think key feature here would be to make it as simple for end users as possible to use. That means no changes in user DAGs. Ideally we'd use similar mechanism to load class as we have with LineageBackend.
   
   For inspiration, I'd look at what Spark does with [`SparkListener`](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala). There is general api that you can implement to receive various events during spark job run. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-926068388


   > and see that Airflow DAG updating it failed
   
   Without meaning to be to much of a purist - if the task failed it shouldn't have updated any result, so _shouldn't_ have the case you describe.
   
   :thinking:


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-911549932


   Certainly :) . I will take a look over weekend - we are a bit busy with upcoming releases now and some "scrambling" around that but that souds super-interesting


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] pgzmnk commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
pgzmnk commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-924862638


   A few thoughts:
   - This is a great idea. It'd make OpenLineage an easier choice since it'd be easy to implement for multiple DAGs at once. It'd also be easier to maintain as Airflow evolves.
   - Proposition 2 sounds more aligned with the existing state. Let's ensure it's backward compatible. Projects such as [DataHub subclass LineageBackend](https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub_provider/lineage/datahub.py#L38).
   - Otherwise, for proposition 3 it might make sense to decorate `on_failure_callback` (and maybe `on_retry_callback` ?) to keep the method available. 
   
   On a further downstream note: What differences do you foresee for the lineage 'payload' for successful/started/failed tasks in terms of the OpenLineage schema? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-916871219


   Not yet. Life took over. But I will


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mobuchowski commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
mobuchowski commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-925847867


   @pgzmnk @potiuk created PR for this issue: https://github.com/apache/airflow/pull/18470


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mobuchowski commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
mobuchowski commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-966505971


   I think that having something global rather than per operator makes more sense in our case.
   Conceptually, it's easier to inject - Airflow user specifies in config that they want to use `OpenLineagePlugin`.
   The plugin registers what events it wants to listen for.
   Airflow notifies plugin when one of those events fire.
   
   Looks like the `pluggy` approach works like that - just with different implementation, calls instead of events.
   
   The `on_state_change` approach might work better when you want to specify different behavior for different operators - or just as a hook interface for pluggy?
   
   >What do you want in context?
   
   Besides particular task and task instance, we use `DagRun` for stuff like execution date and id, and `Dag` for getting id and description. Generally the more metadata we can have about it, the better.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] bolkedebruin edited a comment on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
bolkedebruin edited a comment on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-956272791


   I thin it is okay to streach the defintion of lineage like this. However, then I would not call it data lineage anymore. It is more like a generic auditing system then as it not only tracks data but starts tracking tasks as well. Is that the intention? We can expose this kind of information of course. `post_execute` seems ambigious in its defintion as it is unclear whether is is ran on success only. I prefer airflow to have a `on_state_change(old, new, context))` function which we can decorate with a plugin kind of way to handle this so you can tie in the LineageBackend to register to these state changes.
   
   Does that make sense?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mobuchowski commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
mobuchowski commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-911449974


   @potiuk @pgzmnk you might be interested :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mobuchowski commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
mobuchowski commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-926061181


   @ashb Sure. I want to implement OpenLineage integration for Airflow 2.1+ using `LineageBackend`. Actually, I've [done MVP of it here](https://github.com/OpenLineage/OpenLineage/blob/airflow/2/integration/airflow/openlineage/airflow/backend.py) :wink: 
   
   As for failed runs, let's imagine we detected data quality issues in some dataset, which caused our ML models to misbehave. We could look at multiple places to understand what happened, or look at lineage graph for our dataset, and see that Airflow updating it failed.
   
   This is not something that is just conceptual, it exists already for Airflow 1.10, Spark, dbt - you can take a look at https://demo.datakin.com or Marquez's Airflow example https://github.com/MarquezProject/marquez/tree/main/examples/airflow


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mobuchowski edited a comment on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
mobuchowski edited a comment on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-926061181


   @ashb Sure. I want to implement OpenLineage integration for Airflow 2.1+ using `LineageBackend`. Actually, I've [done MVP of it here](https://github.com/OpenLineage/OpenLineage/blob/airflow/2/integration/airflow/openlineage/airflow/backend.py) :wink: 
   
   As for failed runs, let's imagine we detected data quality issues in some dataset, which caused our ML models to misbehave. We could look at multiple places to understand what happened, or look at lineage graph for our dataset, and see that Airflow DAG updating it failed.
   
   This is not something that is just conceptual, it exists already for Airflow 1.10, Spark, dbt - you can take a look at https://demo.datakin.com or Marquez's Airflow example https://github.com/MarquezProject/marquez/tree/main/examples/airflow


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] bolkedebruin commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
bolkedebruin commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-956272791


   I thin it is okay to streach the defintion of lineage like this. However, then I would not call it data lineage anymore. It is more like a generic auditing system then as it not only tracks data but starts tracking tasks as well. We can expose this kind of information of course. `post_execute` seems ambigious in its defintion as it is unclear whether is is ran on success only. I prefer airflow to have a `on_state_change(old, new, context))` function which we can decorate with a plugin kind of way to handle this so you can tie in the LineageBackend to register to these state changes.
   
   Does that make sense?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] bolkedebruin commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
bolkedebruin commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-966276136


   I see `on_state_change` as part of the operator so `BashOperator.on_state_change` that inherits everthing the tasks knows about itself including the standard context (ontext contains references to related objects to the task instance and is documented under the macros section of the API). What do you want in context? Maybe we should pass more in context for security reasons to limit the 'surface of exposure of the task'.
   
   Designing this a bit more I think this would be nice to have in Airflow:
   
   ```
   # returns new state
   def state_handler(old: State, new: State, ctx -> Context) -> State
   
   def on_state_change(old: State, new: State, ctx -> Context) -> None
     # python 3.10
     # call legacy functions
     match State:
       case State.FAILED:
         self.on_failure_callback(ctx)
       case State.ON_RETRY:
         self.on_retry_callback(ctx)
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mobuchowski edited a comment on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
mobuchowski edited a comment on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-926587909


   >So my understanding of lineage for a data workflow is two things alone:
   
   Just a general thought, but I think all the implementations of LineageBackend did something more than pure inputs and outputs tracking. How the dataset was produced is at least as important. That means consuming metadata.
   
   [Atlas (previously in Airflow)](https://github.com/crazzysun/airflow/blob/0f1d0a7e4e14886a7e74e6b0feaace28b68c64ec/airflow/lineage/backend/atlas/__init__.py)
   [Datahub](https://github.com/linkedin/datahub/blob/214215759011fc983d4bfda16dab5630a02bfa14/metadata-ingestion/src/datahub_provider/_lineage_core.py#L40)
   
   I don't think any of them are interested in failed runs though. 
   
   >In my opinion you are building more than just lineage tracking -- but something larger, as , so it's my opnion that these events do not belong in the lineage backend interface.
   
   Yes, we're interested in broad metadata around data. Ultimately, you could use OpenLineage events in data discovery tool, presenting schemas of your datasets. You could build alerting around data quality facets of those events.  
   
   >I'm leaning towards the ability to add/configure global task hook points for this sort of thing, rather than forcing something in to the lineage api that only OpenLineage wants.
   
   Sure. I'm willing to contribute solution that fits Airflow best. 
   
   >Where should this run? On the scheduler, or the runner?
   
   My best guess is scheduler. Ideally it would be independent of particular executor. My question is: could we get same kind of `context` there? For example, I'd want to look at `PostgresOperator` instance and get `sql` property.   
   
   >If there was a problem launching the runner (pod failure, celery farted) and the task is marked as failed early does that change A1?
   
    I'd want to get notification for situation 2) - we scheduled the job, but it did not run.
   How is this handled with retries? I think I'd not really need information about retry now. I'd want information about rerun though.
   
   Overall, I think key feature here would be to make it as simple for end users as possible to use. That means no changes in user DAGs. Ideally we'd use similar mechanism to load class as we have with LineageBackend.
   
   For inspiration, I'd look at what Spark does with [`SparkListener`](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala). There is general api that you can implement to receive various events during spark job run. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-926536817


   So my understanding of lineage for a data workflow is two things alone:
   
   - What inputs did I consume
   - What outputs did I produce.
   
   In my opinion you are building more than just lineage tracking -- but something larger, as , so it's my opnion that these events do not belong in the lineage backend interface.
   
   So lets talk about an alternative interface to get you the stable API you want.
   
   Some questions:
   
   1. Where should this run? On the scheduler, or the runner?
   2. If there was a problem launching the runner (pod failure, celery farted) and the task is marked as failed early does that change A1?
   
   I'm leaning towards the ability to add/configure global task hook points for this sort of thing, rather than forcing something in to the lineage api  that only OpenLineage wants.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mobuchowski commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
mobuchowski commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-926081284


   Yes, but only assuming Airflow is the only system here, and that something else did not trigger running the model.
   For example, the ETL team might use Airflow for their tasks, and team training the models uses something else (I'm not sure what's popular there, Kubeflow?)
   That actually is one of points of OpenLineage - it brings it all together.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] julienledem commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
julienledem commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-957011915


   It makes sense to me. The intension is to track task execution including the lineage (what the task reads and where it writes). We're happy to adjust to this. It depends a bit what you put in the context object


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] julienledem commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
julienledem commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-957011915


   It makes sense to me. The intension is to track task execution including the lineage (what the task reads and where it writes). We're happy to adjust to this. It depends a bit what you put in the context object


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-966336050


   This might be a good time to bring in a formal plugin system via https://pypi.org/project/pluggy/


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mobuchowski edited a comment on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
mobuchowski edited a comment on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-966505971


   I think that having something "global" rather than per operator makes more sense in our case.
   Conceptually, it's easier to inject - Airflow user specifies in config that they want to use `OpenLineagePlugin`.
   The plugin registers what events it wants to listen for.
   Airflow notifies plugin when one of those events fire.
   
   Looks like the `pluggy` approach works like that - just with different implementation, calls instead of events.
   
   The `on_state_change` approach might work better when you want to specify different behavior for different operators - or just as a hook interface for pluggy?
   
   >What do you want in context?
   
   Besides particular task and task instance, we use `DagRun` for stuff like execution date and id, and `Dag` for getting id and description. Generally the more metadata we can have about it, the better.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-911447743


   Thanks for opening your first issue here! Be sure to follow the issue template!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] pgzmnk commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
pgzmnk commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-924870130


   @bolkedebruin what factors should be considered to decide the best implementation?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mobuchowski commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
mobuchowski commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-926587909


   >So my understanding of lineage for a data workflow is two things alone:
   
   Just a general thought, but I think all the implementations of LineageBackend did something more than pure inputs and outputs tracking. How the dataset was produced is at least as important. That means consuming metadata.
   
   [Atlas (previously in Airflow)](https://github.com/crazzysun/airflow/blob/0f1d0a7e4e14886a7e74e6b0feaace28b68c64ec/airflow/lineage/backend/atlas/__init__.py)
   [Datahub](https://github.com/linkedin/datahub/blob/214215759011fc983d4bfda16dab5630a02bfa14/metadata-ingestion/src/datahub_provider/_lineage_core.py#L40)
   
   I don't think any of them are interested in failed runs though. 
   
   >In my opinion you are building more than just lineage tracking -- but something larger, as , so it's my opnion that these events do not belong in the lineage backend interface.
   
   Yes, we're interested in broad metadata around data. Ultimately, you could use OpenLineage events in data discovery tool, presenting schemas of your datasets. You could build alerting around data quality facets of those events.  
   
   >I'm leaning towards the ability to add/configure global task hook points for this sort of thing, rather than forcing something in to the lineage api that only OpenLineage wants.
   
   Sure. I'm willing to contribute solution that fits Airflow best. 
   
   >Where should this run? On the scheduler, or the runner?
   
   My best guess is scheduler. Ideally it would be independent of particular executor. My question is: could we get same kind of `context` there? For example, I'd want to look at `PostgresOperator` instance and get `sql` property.   
   
    I'd want to get notification for situation 2) - we scheduled the job, but it did not run.
   How is this handled with retries? I think I'd not really need information about retry now. I'd want information about rerun though.
   
   Overall, I think key feature here would be to make it as simple for end users as possible to use. That means no changes in user DAGs. Ideally we'd use similar mechanism to load class as we have with LineageBackend.
   
   For inspiration, I'd look at what Spark does with [`SparkListener`](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala). There is general api that you can implement to receive various events during spark job run. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-956224557


   Sorry, I was on vactaion. Let me collect my thoughts again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mobuchowski edited a comment on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
mobuchowski edited a comment on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-926587909


   >So my understanding of lineage for a data workflow is two things alone:
   
   Just a general thought, but I think all the implementations of LineageBackend did something more than pure inputs and outputs tracking. How the dataset was produced is at least as important. That means consuming metadata.
   
   [Atlas (previously in Airflow)](https://github.com/crazzysun/airflow/blob/0f1d0a7e4e14886a7e74e6b0feaace28b68c64ec/airflow/lineage/backend/atlas/__init__.py)
   [Datahub](https://github.com/linkedin/datahub/blob/214215759011fc983d4bfda16dab5630a02bfa14/metadata-ingestion/src/datahub_provider/_lineage_core.py#L40)
   
   I don't think any of them are interested in failed runs though. 
   
   >In my opinion you are building more than just lineage tracking -- but something larger, as , so it's my opnion that these events do not belong in the lineage backend interface.
   
   Yes, we're interested in broad metadata around data. Ultimately, you could use OpenLineage events in data discovery tool, presenting schemas of your datasets. You could build alerting around data quality facets of those events.  
   
   >I'm leaning towards the ability to add/configure global task hook points for this sort of thing, rather than forcing something in to the lineage api that only OpenLineage wants.
   
   Sure. I'm willing to contribute solution that fits Airflow best. 
   
   >Where should this run? On the scheduler, or the runner?
   
   My best guess is scheduler. Ideally it would be independent of particular executor. My question is: could we get same kind of `context` there? For example, I'd want to look at `PostgresOperator` instance and get `sql` property.   
   
   >If there was a problem launching the runner (pod failure, celery farted) and the task is marked as failed early does that change A1?
   
    I'd want to get notification for this situation - we scheduled the job, but it did not run.
   How is this handled with retries? I think I'd not really need information about retry now. I'd want information about rerun though.
   
   Overall, I think key feature here would be to make it as simple for end users as possible to use. That means no changes in user DAGs. Ideally we'd use similar mechanism to load class as we have with LineageBackend.
   
   For inspiration, I'd look at what Spark does with [`SparkListener`](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala#L296). There is general api that you can implement to receive various events during spark job run. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] joshi95 commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
joshi95 commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-922668169


   @potiuk Did you get a chance to look into it ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mobuchowski commented on issue #17984: Add possibility to LineageBackend to be notified of task instance execution start and failure

Posted by GitBox <gi...@apache.org>.
mobuchowski commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-933450001


   Hey @ashb, any thoughts on this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org