Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/07/23 14:16:46 UTC

[GitHub] [airflow] collinmcnulty opened a new issue, #25254: Add Task Instance Lifecycle History table

collinmcnulty opened a new issue, #25254:
URL: https://github.com/apache/airflow/issues/25254

   ### Description
   
   Any time an Airflow component changes the state of a task instance, it should record that change in an audit-log-like table of changes. Thus the user will be able to easily see what happened to their tasks.
   
   | dag_id      | task_id            | run_id            | map_index | state   | time_changed        | component_type | component_id |
   |-------------|--------------------|-------------------|-----------|---------|---------------------|----------------|--------------|
   | example_dag | config_file_sensor | scheduled_2022... | -1        | queued  | 2022-07-25T12:01:01 | scheduler      | <uuid>       |
   | example_dag | config_file_sensor | scheduled_2022... | -1        | running | 2022-07-25T12:24:01 | worker         | <uuid>       |
   
   Since task_instance is already one of the biggest tables, this table definitely has the potential to be very big. I think it should probably be off by default with a config flag for turning it on. It seems like it should probably only be used in conjunction with regular runs of `airflow db clean`.
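
A minimal sketch of what such a table and an opt-in write path might look like, using Python's standard-library `sqlite3` rather than Airflow's actual models. The table name `task_instance_history`, the helpers `open_history_db` and `record_state_change`, the column types, and the `enabled` flag are all hypothetical; only the column names come from the example table above.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: column names mirror the example table; types are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS task_instance_history (
    dag_id         TEXT    NOT NULL,
    task_id        TEXT    NOT NULL,
    run_id         TEXT    NOT NULL,
    map_index      INTEGER NOT NULL DEFAULT -1,
    state          TEXT    NOT NULL,
    time_changed   TEXT    NOT NULL,
    component_type TEXT    NOT NULL,
    component_id   TEXT    NOT NULL
)
"""


def open_history_db(path=":memory:"):
    """Open a SQLite database and make sure the history table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_state_change(conn, dag_id, task_id, run_id, state,
                        component_type, component_id,
                        map_index=-1, time_changed=None, enabled=True):
    """Append one row per state change; do nothing when the feature is off."""
    if not enabled:  # stands in for the proposed config flag (hypothetical)
        return False
    if time_changed is None:
        time_changed = datetime.now(timezone.utc).isoformat(timespec="seconds")
    conn.execute(
        "INSERT INTO task_instance_history VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (dag_id, task_id, run_id, map_index, state,
         time_changed, component_type, component_id),
    )
    conn.commit()
    return True
```

Rows are only ever appended in this sketch, so pruning by `time_changed` stays simple, which fits the suggestion to pair the feature with regular cleanup.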
   
   ### Use case/motivation
   
   Tracing the lifecycle of a task instance across Airflow component logs is quite tedious and involves effectively building the described table in your head or on a notepad. Many times when I'm trying to understand what happened to a task, such investigation is necessary. It would also help answer questions like "which task instances were in [state] at this particular time in the past".
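
The point-in-time question could be answered from such a history roughly as follows. This is a sketch in plain Python, not an Airflow API: `StateChange` and `instances_in_state_at` are hypothetical names, and it assumes every timestamp uses the same ISO 8601 format so that string comparison matches chronological order.

```python
from typing import NamedTuple


class StateChange(NamedTuple):
    dag_id: str
    task_id: str
    run_id: str
    map_index: int
    state: str
    time_changed: str  # ISO 8601, same format on every row


def instances_in_state_at(changes, state, at):
    """Return the task instance keys whose latest state at time `at` was `state`."""
    latest = {}
    for change in sorted(changes, key=lambda c: c.time_changed):
        if change.time_changed > at:
            break  # everything after this happened later than `at`
        # The first four fields identify the task instance.
        latest[change[:4]] = change.state
    return sorted(key for key, current in latest.items() if current == state)


history = [
    StateChange("example_dag", "config_file_sensor", "scheduled_2022...", -1,
                "queued", "2022-07-25T12:01:01"),
    StateChange("example_dag", "config_file_sensor", "scheduled_2022...", -1,
                "running", "2022-07-25T12:24:01"),
]
queued_at_noon_ten = instances_in_state_at(history, "queued", "2022-07-25T12:10:00")
```

With the two example rows, the sensor counts as `queued` at 12:10:00 and no longer does at 12:30:00, because the later `running` row supersedes the earlier one.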
   
   ### Related issues
   
   #25252 
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on issue #25254: Add Task Instance Lifecycle History table

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #25254:
URL: https://github.com/apache/airflow/issues/25254#issuecomment-1193143649

   I am not sure if keeping it in airflow MetaData DB makes sense. This will put ENORMOUS pressure on the database. We are not going to use it anywhere else in the MetaData DB; it would be there just to "dump" the information.
   
   Since we are going to make most Airflow components DB-less (see https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API), this would also add communication overhead to those components just to write such a database entry.
   
   I think (but I will not close this one yet) that this, similarly to #25252, has much better potential when implemented as part of our OpenTelemetry effort (which has already been approved and voted on: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow). We are talking about gathering much more information from Airflow via standard telemetry interfaces, so keeping it in external systems that are designed to manage system telemetry and can track various kinds of telemetry (including traces, which are the best-matching part of the OpenTelemetry proposal) is a much better choice than storing it in the MetaData DB, IMHO.
   
   I'd say any "database" entries here that we need should rather keep track of changes of the DAG structure (which should be part of another AIP: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-36+DAG+Versioning), and this is the type of information that should be stored in Airflow Metadata. This is because such versioning can be used by Airflow itself to make decisions (back-filling).
   
   Following this analogy, I personally think such a table of state changes would only make sense if we are going to use it for something else. For example, IF such a table (or similar) were a side effect of implementing the SLA feature "properly", then yeah, we could consider it part of Airflow Metadata. But if the only reason is to "track the history of changes by a human", then we are simply trying to implement in Airflow what telemetry systems do way better than any of our implementations could, and we should rather focus on making sure our OpenTelemetry integration allows for it than try to replicate it in Airflow.
   
   This is what I think, but I am curious what others think about it.
   




[GitHub] [airflow] julienledem commented on issue #25254: Add Task Instance Lifecycle History table

Posted by GitBox <gi...@apache.org>.
julienledem commented on issue #25254:
URL: https://github.com/apache/airflow/issues/25254#issuecomment-1194826508

   Another place where it would be useful to surface this information is through the OpenLineage integration. Currently, it sees only state transitions that occur on the Worker through the TaskInstanceListener. It would be useful to collect more of those through additional listeners in Airflow. The OL integration could send more events besides start and end and capture the additional info as metadata.




[GitHub] [airflow] collinmcnulty closed issue #25254: Add Task Instance Lifecycle History table

Posted by "collinmcnulty (via GitHub)" <gi...@apache.org>.
collinmcnulty closed issue #25254: Add Task Instance Lifecycle History table
URL: https://github.com/apache/airflow/issues/25254




[GitHub] [airflow] potiuk commented on issue #25254: Add Task Instance Lifecycle History table

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #25254:
URL: https://github.com/apache/airflow/issues/25254#issuecomment-1196876199

   > Another place where it would be useful to surface this information is through the OpenLineage integration. Currently, it sees only state transitions that occurs on the Worker through the TaskInstanceListener. It would be useful to collect more of those through additional listeners in Airflow. The OL integration could send more events besides start and end and capture the additional info as metadata.
   
   On that front (this might be a good place to discuss it): what do you think, @julienledem and @howardyoo, about the relation between OTEL traces and OpenLineage ones?
   
   I see those two as pretty orthogonal. OTEL is task/DAG-based, and OpenLineage (and lineage in general) is dataset-based. As I see it, those are two rather separate and different dimensions you can look at when it comes to Airflow DAGs. They have some things in common, but for anything beyond a basic HelloWorld the two will share only a few common points, and their topologies will be quite dramatically different. I see the OTEL trace as more "technical", more of a DevOps thing (where you look at the Airflow UI and try to figure out if the "system" works as expected), whereas OpenLineage looks at "data" provenance and lineage (i.e. when you try to see if your data is right). I've recently heard from a few places that the term "Observability" is quite overloaded and that it really should be "Data Observability" vs. "Software Observability".
   Those two overlap of course (and problems in one might even impact the other), but they are essentially two rather different dimensions. If my view is correct, Task Instance State is in fact much more a matter of "Software Observability" than "Data Observability", and as such belongs more to OTEL than to lineage.
   
   But I am curious what's your take on it :)
   
   
   
   
   




[GitHub] [airflow] dstandish commented on issue #25254: Add Task Instance Lifecycle History table

Posted by GitBox <gi...@apache.org>.
dstandish commented on issue #25254:
URL: https://github.com/apache/airflow/issues/25254#issuecomment-1194387317

   other options to consider... 
   
   you could set up CDC on your database for this table
   
   we could add taskinstance state changes to the existing `log` table.  but yeah, this would produce a _lot_ of events.  one could imagine making it optional, to enable verbose state tracking for debugging purposes.  but i do wonder what impact this would have on the codebase, whether it would be overly burdensome and introduce complexity.  and would it be trustworthy, e.g. if we miss a logging call somewhere....




[GitHub] [airflow] howardyoo commented on issue #25254: Add Task Instance Lifecycle History table

Posted by GitBox <gi...@apache.org>.
howardyoo commented on issue #25254:
URL: https://github.com/apache/airflow/issues/25254#issuecomment-1194606372

   Traces should be worth looking into for this sort of data (using logs to track changes? That would be possible, but not ideal), and as @potiuk mentioned, specifications such as OTEL were made to keep track of this information. Rather than storing this information inside Airflow's database, these so-called `events` can be published out of Airflow into sophisticated third-party monitoring tools that can better monitor this sort of information.




[GitHub] [airflow] potiuk commented on issue #25254: Add Task Instance Lifecycle History table

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #25254:
URL: https://github.com/apache/airflow/issues/25254#issuecomment-1194565953

   > also... could consider just more regimented debug logging through normal logger, so that we could conceivably derive the state changes after the fact?
   
   Yep. This is precisely what the OTel integration, and especially Traces, should provide. And they could provide it at a much better level. For example, you would be able to trace way more than just task instance state changes, including all the "deferred" behaviours in the Triggerer. I think asking the Triggerer, for example, to save all the possible state changes to the metadata DB is quite counter-productive when you look at the usage patterns of the Triggerer: everything is asynchronous there except deferring/resuming a task. But with OTEL I think we could trace more events related to a particular task instance (and asynchronously, much more efficiently, send them to a remote telemetry service). Some Triggers might have multiple states between "defer" and "resume". @howardyoo I think your comments might be cool too.
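
To illustrate the "send it asynchronously" point, here is a minimal standard-library sketch of a buffered emitter: the component that changes state only enqueues an event, and a background thread forwards it to a sink. This is not the OpenTelemetry API and not Airflow code; `AsyncEventEmitter`, its `sink` callable, and the drop-when-full policy are assumptions made for the example. A real integration would presumably rely on the telemetry SDK's own batching and export.

```python
import queue
import threading


class AsyncEventEmitter:
    """Buffer events in memory and hand them to `sink` on a background thread."""

    def __init__(self, sink, maxsize=1000):
        self._sink = sink  # hypothetical callable, e.g. a telemetry exporter
        self._queue = queue.Queue(maxsize=maxsize)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def emit(self, event):
        """Never block the caller; return False if the buffer is full."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            return False  # event dropped (an assumed policy, not a requirement)

    def _drain(self):
        while True:
            event = self._queue.get()
            if event is None:  # sentinel placed by close()
                break
            self._sink(event)

    def close(self):
        """Flush everything queued so far, then stop the worker thread."""
        self._queue.put(None)
        self._worker.join()
```

The trade-off in this sketch is that a full buffer loses events instead of slowing down the component, which is acceptable for telemetry but would not be for data Airflow itself depends on.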




[GitHub] [airflow] eladkal commented on issue #25254: Add Task Instance Lifecycle History table

Posted by "eladkal (via GitHub)" <gi...@apache.org>.
eladkal commented on issue #25254:
URL: https://github.com/apache/airflow/issues/25254#issuecomment-1419249575

   > I am not sure if keeping it in airflow MetaData DB makes sense, This will put ENORMOUS pressure on the database.
   
   I agree. Airflow is an application. Applications don't store data that is not needed for their defined functionality. That information is usually streamed out from the application to BI services.
   
   > you could set up CDC on your database for this table
   
   Totally agree! Users can set up [debezium](https://debezium.io/) for that.
   
   
   I am inclined to close this request as won't fix.

