Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/01/01 23:45:57 UTC

[GitHub] [airflow] mjpieters edited a comment on issue #12771: Instrument Airflow with Opentracing/Opentelemetry

mjpieters edited a comment on issue #12771:
URL: https://github.com/apache/airflow/issues/12771#issuecomment-753404477


   Interesting little test; but note it generates a trace from the tasks after the fact, without support for additional spans created __inside__ each task invocation, and the tracing context is not shared over any RPC / REST / etc. calls to other services, so those services can't inherit the tracing context.
   
   I integrated Jaeger tracing (opentracing) support into a production Airflow setup using Celery. Specific challenges I had to overcome:
   
   - The dagrun span is 'virtual', in that it exists from when the dagrun is created until the last task is completed. The scheduler will then update the dagrun state in the database. But tasks need a valid parent span to attach their own spans to.
   
     I solved this by creating a span in a monkey-patch to `DAG.create_dagrun()`, injecting the span info into the dagrun configuration together with the start time, then _discarding the span_. Then, in `DagRun.set_state()`, when the state changes to a finished state, I create a `jaeger_client.span.Span()` object from scratch using the conf-stored data, and submit that to Jaeger.
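   The shape of that pattern, with stand-in names (a real implementation would fill the carrier via `tracer.inject()` and build a real `jaeger_client` span instead of a dict):

```python
import time

def create_dagrun_with_span(conf: dict) -> dict:
    """Monkey-patch sketch: open a span, stash its context plus the start
    time in the dagrun conf, then discard the span object itself."""
    conf = dict(conf or {})
    # In the real patch this carrier is filled by
    # tracer.inject(span.context, Format.TEXT_MAP, carrier).
    conf["_trace"] = {
        "trace_id": "abc123",       # placeholder span context
        "span_id": "def456",
        "start_time": time.time(),  # recorded for later reconstruction
    }
    return conf

def finish_dagrun_span(conf: dict, end_time: float) -> dict:
    """On DagRun.set_state() reaching a finished state, rebuild the span
    from the conf-stored data and report it (here: just return it)."""
    trace = conf["_trace"]
    return {
        "trace_id": trace["trace_id"],
        "span_id": trace["span_id"],
        "duration": end_time - trace["start_time"],
    }
```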
   
   - Tasks inherit the parent (dagrun) span context from the dagrun config; I patched `Task.run_raw_task()` to run the actual code under a `tracer.start_active_span()` context manager. This captures timing and any exceptions.
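   A minimal stand-in for what `tracer.start_active_span()` does around the task body: it records timing and marks the span as errored when the wrapped code raises. The names here are illustrative, not Airflow's:

```python
import time
from contextlib import contextmanager

@contextmanager
def traced_task(span_record: dict, parent_context: dict):
    """Wrap a task invocation, capturing timing and any exception."""
    span_record.update(parent=parent_context, start=time.time(), error=False)
    try:
        yield span_record
    except Exception:
        span_record["error"] = True  # record the failure on the span
        raise                        # the task failure still propagates
    finally:
        span_record["finish"] = time.time()
```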
   
   - You need an active tracer for traces to be captured and sent on to the tracer agent. So I registered code to run in the `cli_action_loggers.register_pre_exec_callback()` hook when the `scheduler` or `dag_trigger` sub-commands run, which then registers a closing callback with `cli_action_loggers.register_post_exec_callback()`. Closing a tracer in `dag_trigger` takes careful work with the asyncio / tornado loop used by the Jaeger client; you'll lose traces if you don't watch out. I found that the only fail-safe method of getting the last traces out was to hunt down the I/O loop attached to the trace reporter object and call `tracer.close()` from a callback sent to that loop. I don't know if opentracing needs this level of awareness of the implementation details.
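   The "close from a callback on the reporter's own loop" idea, sketched with a plain asyncio loop standing in for the tornado loop the Jaeger reporter uses:

```python
import asyncio
import threading

def close_on_io_loop(loop: asyncio.AbstractEventLoop, tracer_close,
                     timeout: float = 5.0) -> bool:
    """Schedule tracer_close() on the loop that owns the reporter and
    block until it has run, so the final spans get flushed before the
    process exits. Returns False if the close did not run in time."""
    done = threading.Event()

    def _close():
        tracer_close()   # e.g. tracer.close() in the real setup
        done.set()

    loop.call_soon_threadsafe(_close)
    return done.wait(timeout)
```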
   
   - You generally want to start a trace when the Airflow webserver receives a trigger, so instrument the Flask layer too. This is where the sampling decision needs to be made as well; tracing often only instruments a subset of all jobs, but in this project I set the sampling frequency to 100% at this level.
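   That 100% sampling decision maps to Jaeger's constant sampler; with `jaeger_client` it is expressed in the tracer configuration roughly like this (only the `sampler` block is standard; everything else would come from your own setup):

```python
# Jaeger constant sampler: every trace started at the Flask ingress
# layer is sampled. 'const' / 'param' are jaeger_client's standard
# sampler settings; param=1 samples everything, param=0 nothing.
JAEGER_SAMPLING = {
    "sampler": {
        "type": "const",
        "param": 1,
    },
}
```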
   
   - I added a custom configuration section to airflow.cfg to allow me to tweak Jaeger tracing parameters.
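   Such a custom section might look like the following in `airflow.cfg`; the section name and every key here are hypothetical, chosen for illustration:

```ini
# Hypothetical section; key names are illustrative, not Airflow's.
[jaeger_tracing]
enabled = True
agent_host = localhost
agent_port = 6831
sampling_rate = 1.0
```

   Custom sections are readable through Airflow's `conf.get()` family of accessors, so the tracer setup code can read these values back at start-up.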
   
   But, with that work in place, we now get traces in mostly real time, with full support for tracing contexts being shared with other services called from Airflow tasks. We can trace a job through the frontend, submitting a job to Airflow, then follow any calls from tasks to further REST APIs, all as one system.
   
   I'd prefer it if the tracing context was not shoehorned into the dagrun configuration; I'd have created additional database tables or columns for this in the Airflow database model if I had to do this inside the Airflow project itself.
   
   Note that I did *not* use the Celery task hooks here to track the task spans, because Celery has its own overhead that we wanted to keep separate. The `opentracing_instrumentation` package already has Celery hooks you can use, but I needed the timings to be closer to the actual task invocation.
   
   Another thing to consider is any per-DAG or per-task tagging you want to add to the spans. For this project I needed to track specific data from the submitted DAG config so that we could compare task runs using the same input configuration.
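   A sketch of that tagging step; with opentracing the last line would be `span.set_tag(key, value)`, and the key names here are purely illustrative:

```python
def tag_span_from_conf(span_tags: dict, dag_conf: dict,
                       keys=("input_dataset", "model_version")) -> dict:
    """Copy selected values from the submitted DAG conf onto the span
    as tags, so runs with the same input configuration can be compared."""
    for key in keys:
        if key in dag_conf:
            span_tags[f"dag.{key}"] = dag_conf[key]
    return span_tags
```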
   

