Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/01/01 23:36:56 UTC

[GitHub] [airflow] mjpieters commented on issue #12771: Instrument Airflow with Opentracing/Opentelemetry

mjpieters commented on issue #12771:
URL: https://github.com/apache/airflow/issues/12771#issuecomment-753404477


   Interesting little test; but note that it generates a trace from the tasks after the fact, without support for additional spans created __inside__ each task invocation, and the tracing context is not shared over any RPC / REST / etc. calls to other services, so those services can't inherit it.
   
   I integrated Jaeger tracing (opentracing) support into a production Airflow setup using Celery. Specific challenges I had to overcome:
   
   - The dagrun span is 'virtual', in that it exists from when the dagrun is created until the last task is completed. The scheduler will then update the dagrun state in the database. But tasks need a valid parent span to attach their own spans to.
   
     I solved this by creating a span in a monkey-patch of `DAG.create_dagrun()`, injecting the span info into the dagrun configuration together with the start time, then _discarding the span_.  Then, in `DagRun.set_state()`, when the state changes to a finished state, I create a `jaeger_client.span.Span()` object from scratch using the data stored in the dagrun conf, and submit that to Jaeger. Both halves are sketched just after this list.
   
   - Tasks inherit the parent (dagrun) span context from the dagrun config; I patched `Task.run_raw_task()` to run the actual task code under a `tracer.start_active_span()` context manager. This captures timing and any exceptions (see the second sketch below).
   
   - You need an active tracer for traces to be captured and sent on to the tracer agent. So I registered code to run in the `cli_action_loggers.register_pre_exec_callback()` hook when the `scheduler` or `dag_trigger` sub-commands run, which in turn registers a closing callback with `cli_action_loggers.register_post_exec_callback()`. Closing a tracer in `dag_trigger` takes careful work with the asyncio / tornado loop used by the Jaeger client; you'll lose traces if you don't watch out. The only foolproof method I found of getting the last traces out was to hunt down the I/O loop attached to the trace reporter object and call `tracer.close()` from a callback sent to that loop. I don't know if opentracing needs this level of awareness of the implementation details. (This wiring is sketched further below.)
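
   Roughly, the shape of the dagrun-span trick from the first bullet. This is a sketch rather than the production code: `TRACE_CONF_KEY` is a made-up name here, the patch assumes `conf` is passed by keyword, and the `jaeger_client` internals (`Span`, `SpanContext` and their constructor arguments) should be checked against the client version you run:

   ```python
   import time

   import opentracing
   from airflow.models import DAG, DagRun
   from jaeger_client.span import Span
   from jaeger_client.span_context import SpanContext

   TRACE_CONF_KEY = "_tracing"  # hypothetical conf key, pick your own

   _orig_create_dagrun = DAG.create_dagrun
   _orig_set_state = DagRun.set_state


   def create_dagrun(self, *args, **kwargs):
       """Open a span for the dagrun, stash its ids in the conf, discard it."""
       tracer = opentracing.global_tracer()
       span = tracer.start_span(f"dagrun:{self.dag_id}")
       conf = kwargs.get("conf") or {}
       conf[TRACE_CONF_KEY] = {
           "trace_id": span.context.trace_id,
           "span_id": span.context.span_id,
           "flags": span.context.flags,
           "start_time": time.time(),
       }
       kwargs["conf"] = conf  # assumes conf arrives as a keyword argument
       # Deliberately never call span.finish() here: the dagrun outlives this
       # process, so only the identifiers survive, inside the dagrun conf.
       return _orig_create_dagrun(self, *args, **kwargs)


   def set_state(self, state):
       """On reaching a finished state, rebuild the span and submit it."""
       _orig_set_state(self, state)
       info = (self.conf or {}).get(TRACE_CONF_KEY)
       if info and state in ("success", "failed"):  # match your State constants
           tracer = opentracing.global_tracer()  # assumed to be a Jaeger tracer
           context = SpanContext(
               trace_id=info["trace_id"],
               span_id=info["span_id"],
               parent_id=None,
               flags=info["flags"],
           )
           span = Span(
               context=context,
               tracer=tracer,
               operation_name=f"dagrun:{self.dag_id}",
               start_time=info["start_time"],
           )
           span.finish()  # sets the end time and hands the span to the reporter


   DAG.create_dagrun = create_dagrun
   DagRun.set_state = set_state
   ```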
   
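   And the task side, in the same sketch style, reusing `TRACE_CONF_KEY` from the previous sketch. The method I show patched here is `TaskInstance._run_raw_task()`, which is what recent Airflow versions actually call; adjust the target to your version:

   ```python
   import opentracing
   from airflow.models import TaskInstance
   from jaeger_client.span_context import SpanContext

   _orig_run_raw_task = TaskInstance._run_raw_task


   def _run_raw_task(self, *args, **kwargs):
       dagrun = self.get_dagrun()
       info = (dagrun.conf or {}).get(TRACE_CONF_KEY) if dagrun else None
       if not info:
           return _orig_run_raw_task(self, *args, **kwargs)
       tracer = opentracing.global_tracer()
       parent = SpanContext(
           trace_id=info["trace_id"],
           span_id=info["span_id"],
           parent_id=None,
           flags=info["flags"],
       )
       # start_active_span() times the task, records a raised exception on
       # the span, and makes the span the implicit parent for anything the
       # task code itself creates or injects.
       with tracer.start_active_span(f"task:{self.task_id}", child_of=parent):
           return _orig_run_raw_task(self, *args, **kwargs)


   TaskInstance._run_raw_task = _run_raw_task
   ```

   Inside the task, `tracer.active_span` is then set, so task code can propagate the context over REST calls with the standard opentracing pattern: build a `headers` dict, call `tracer.inject(tracer.active_span.context, Format.HTTP_HEADERS, headers)` (with `Format` from `opentracing.propagation`), and pass those headers on the outgoing request.
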
   But, with that work in place, we now get traces in mostly real time, with full support for tracing contexts being shared with other services called from Airflow tasks. We can trace a request through the frontend, through the job it submits to Airflow, and on through any calls the tasks make to further REST APIs, all as one system.
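
   For completeness, the tracer lifecycle from the last bullet looks roughly like this. The `Config` API is standard `jaeger_client`, but where exactly the reporter's I/O loop hangs off the tracer is an implementation detail, so treat the `reporter.io_loop` attribute as an assumption to verify, and match the sub-command names to what your Airflow version reports:

   ```python
   import opentracing
   from airflow.utils import cli_action_loggers
   from jaeger_client import Config

   _tracer = None


   def _open_tracer(**kwargs):
       """Install a global Jaeger tracer when an interesting sub-command starts."""
       global _tracer
       # Airflow passes the CLI metrics as keyword arguments; adjust the
       # sub-command names to your Airflow version.
       if kwargs.get("sub_command") not in ("scheduler", "dag_trigger"):
           return
       config = Config(
           config={"sampler": {"type": "const", "param": 1}},
           service_name="airflow",
           validate=True,
       )
       _tracer = config.initialize_tracer()  # also sets the opentracing global tracer


   def _close_tracer(**kwargs):
       """Flush and close the tracer on the reporter's own I/O loop."""
       global _tracer
       if _tracer is None:
           return
       # tracer.close() returns a future that resolves once buffered spans are
       # flushed; calling it from anywhere but the reporter's tornado loop is
       # what loses traces. The attribute path below is an assumption -- hunt
       # down the loop in your version of jaeger_client. In practice you may
       # also need to block on the returned future before the process exits.
       io_loop = _tracer.reporter.io_loop
       io_loop.add_callback(_tracer.close)
       _tracer = None


   cli_action_loggers.register_pre_exec_callback(_open_tracer)
   cli_action_loggers.register_post_exec_callback(_close_tracer)
   ```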
   
   I'd prefer it if the tracing context were not shoehorned into the dagrun configuration; had I done this inside the Airflow project itself, I'd have created additional database tables or columns for it in the Airflow database model.
   

