Posted to dev@arrow.apache.org by David Li <li...@apache.org> on 2021/07/12 13:47:27 UTC

Re: [C++] Adopting a library for (distributed) tracing

A quick update on this: I don't think this will happen for 5.0. The upstream library still hasn't reached 1.0, and I don't want to cram this in at the end of a cycle, especially as each of their release candidates has needed an upstream fix to keep all our CI platforms working. Furthermore, there appear to be some issues with their exporters (or in our usage of them) that I'd like to resolve. Finally, I'd like to have a more complete example here, especially one using an existing tool like Jaeger instead of an ad-hoc visualization.

-David

On Wed, Jun 9, 2021, at 13:01, David Li wrote:
> I just updated the PR with support for exporting to Jaeger[1], which
> has a built-in trace viewer.
> 
> 1. Download and run the all-in-one Jaeger binary locally[2] (or their
>    Docker image)
> 2. Build Arrow with `-DARROW_WITH_OPENTELEMETRY=ON -DARROW_THRIFT=ON`
> 3. Run your application with `env ARROW_TRACING_BACKEND=jaeger`
> 4. Visit http://localhost:16686 and search for "unknown_service".
> 
> This gives you a variety of ways to drill into the captured data. Let
> me know what you think if you get a chance.
> 
> Now, while this is convenient, I'm not so sure about bundling it with
> Arrow; as a library, we should be leaving all this config up to the
> end-user application. But since this is all in C++, it can be
> hard/annoying to configure in PyArrow and this is helpful for
> development and debugging. At the very least, it's behind an optional
> build flag so it won't ship by default.
> 
> Also I see that Kibana (with proprietary xpack) and Grafana have trace
> viewers now; OpenTelemetry doesn't include exporters for trace data to
> those backends (only metrics/logs) but that could be another option.
> 
> Best,
> David
> 
> [1]: https://www.jaegertracing.io/
> [2]: https://www.jaegertracing.io/docs/1.22/getting-started/#all-in-one
> 
> On 2021/06/08 19:30:06, David Li <li...@apache.org> wrote: 
> > I'll have to do some more digging into that and get back to you. So
> > far I've been using a quick-and-dirty tool that I whipped up using
> > Vega-Lite but that's probably not something we want to maintain. I
> > tried the Chrome trace viewer ("Catapult") but it's not quite built
> > for this kind of trace; I hear Jaeger's trace viewer can be used
> > standalone but needs some setup.
> > 
> > Though that does raise a good point: we should eventually have
> > documentation on this knob and how to use it.
> > 
> > -David
> > 
> > On 2021/06/08 19:21:16, Weston Pace <we...@gmail.com> wrote: 
> > > FWIW, I tried this out yesterday since I was profiling the execution
> > > of the async API reader.  It worked great so +1 from me on that basis.
> > > I did struggle finding a good simple visualization tool.  Do you have
> > > any good recommendations on that front?
> > > 
> > > On Mon, Jun 7, 2021 at 10:50 AM David Li <li...@apache.org> wrote:
> > > >
> > > > Just to give an update on where this stands:
> > > >
> > > > Upstream recently released v1.0.0-RC1 and I've updated the PR[1] to
> > > > use it. This contains a few fixes I submitted for the platforms our
> > > > various CI jobs use, as well as an explicit build flag to support
> > > > header-only use - I think this should alleviate any concerns over it
> > > > adding to our build too much. I'm hopeful this means it can make it
> > > > into 5.0.0, at least with minimal functionality.
> > > >
> > > > For anyone interested in using OpenTelemetry with Arrow, I hope you'll
> > > > have a chance to look through the PR and see if there are any places
> > > > where adding tracing may be useful.
> > > >
> > > > I also touched base with upstream about Python/C++ interop[2] - it
> > > > turns out upstream has thought about this before but doesn't have the
> > > > resources to pursue it at the moment, as the idea is to write an
> > > > API-compatible binding of the C++ library for Python (and presumably
> > > > R, Ruby, etc.) which is more work.
> > > >
> > > > Best,
> > > > David
> > > >
> > > > [1]: https://github.com/apache/arrow/pull/10260
> > > > [2]: https://github.com/open-telemetry/community/discussions/734
> > > >
> > > > On 2021/05/06 18:23:05, David Li <li...@apache.org> wrote:
> > > > > I've created ARROW-12671 [1] to track this work and filed a draft PR
> > > > > [2]; I'd appreciate any feedback, particularly from anyone already
> > > > > trying to use OpenTelemetry/Tracing/Census with Arrow.
> > > > >
> > > > > For dependencies: now we use OpenTelemetry as header-only by
> > > > > default. I also slimmed down the build, avoiding making the build wait
> > > > > on OpenTelemetry. By setting a CMake flag, you can link Arrow against
> > > > > OpenTelemetry, which will bundle a simple JSON-to-stderr exporter that
> > > > > can be toggled via environment variable.
> > > > >
> > > > > For Python: the PR includes basic integration with Flight/Python. The
> > > > > C++ side will start a span, then propagate it to Python. Spans in
> > > > > Python will not propagate back to C++, and Python/C++ need to both set
> > > > > up their respective exporters. I plan to poke the upstream community
> > > > > about whether there's a good solution to this kind of issue.
> > > > >
> > > > > For ABI compatibility: this will be an issue until upstream reaches
> > > > > 1.0. Even currently, there's an unreleased change on their main branch
> > > > > which will break the current PR when it's released. Hopefully, they
> > > > > will reach 1.0 in the Arrow 5.0 release cycle, else, we probably want
> > > > > to avoid shipping this until there is a 1.0. I have confirmed that
> > > > > linking an application which itself links OpenTelemetry to Arrow
> > > > > works.
> > > > >
> > > > > As for the overhead: I measured the impact on a dataset scan recording
> > > > > ~900 spans per iteration and there was no discernible effect on
> > > > > runtime compared to an uninstrumented scan (though again, this is not
> > > > > that many spans).
> > > > >
> > > > > Best,
> > > > > David
> > > > >
> > > > > [1]: https://issues.apache.org/jira/browse/ARROW-12671
> > > > > [2]: https://github.com/apache/arrow/pull/10260
> > > > >
> > > > > On 2021/05/01 19:53:45, "David Li" <li...@apache.org> wrote:
> > > > > > Thanks everyone for all the comments. Responding to a few things:
> > > > > >
> > > > > > > It seems to me it would be fairly implementation dependent -- so each
> > > > > > > language implementation would choose if it made sense for them and then
> > > > > > > implement the appropriate connection to that language's open telemetry
> > > > > > > ecosystem.
> > > > > >
> > > > > > Agreed - I think the important thing is to agree on using OpenTelemetry itself so that the various Flight implementations, for instance, can all contribute compatible trace data. And there will be details like naming of keys for extra metadata we might want to attach, or trying to make (some) span names consistent.
> > > > > >
> > > > > > > > My main question is: does integrating OpenTelemetry complicate our build
> > > > > > > procedure?  Is it header-only as long as you use the no-op tracer?  Or
> > > > > > > do you have to build it and link with it nonetheless?
> > > > > >
> > > > > > I need to look into this more and will follow up. I believe we can use it header-only. It's fairly simple to depend on (and has no required dependencies), but it is a synchronous build step (you must build it to have its headers available) - perhaps that could be resolved upstream or I am configuring CMake wrongly. Right now, I've linked in OpenTelemetry to provide a few utilities (e.g. logging data to stdout as JSON), but that could be split out into a libarrow_tracing.so if we keep them.
> > > > > >
> > > > > > > Also, are there ABI issues that may complicate integration into
> > > > > > > > applications that were compiled against another version of OpenTelemetry?
> > > > > >
> > > > > > Upstream already seems to be considering ABI compatibility. However, until they reach 1.0, of course they need not keep any promises, and that is a worry depending on their timeline. As pointed out already, they are moving quickly, but they are behind the other languages' OpenTelemetry implementations.
> > > > > >
> > > > > > > I'm not sure what the overhead is when disabled--I think it is probably minimal or else it wouldn't be used so widely. But if we're not ready to jump right in, we could introduce our own @WithSpan annotation which by default is a no-op. To build an instrumented Arrow lib, you'd hook it up with a shim.
> > > > > >
> > > > > > I am focusing on C++ here but of course the other languages come into play. A similar idea for C++ may be useful if we need to have OpenTelemetry be optional to avoid ABI worries. A branch may also work, but I'd like to avoid that if possible.
> > > > > >
> > > > > > Best,
> > > > > > David
> > > > > >
> > > > > > On Sat, May 1, 2021, at 10:52, Bob Tinsman wrote:
> > > > > > > I agree that OpenTelemetry is the future; I have been following the observability space off and on and I knew about OpenTracing; I just realized that OpenTelemetry is its successor. [1]
> > > > > > > I have found tracing to be a very powerful approach; at one point, I did a POC of a trace recorder inside a Java webapp, which shed light on some nasty bottlenecks. If integrated properly, it can be left on all the time, so it's valuable for doing root-cause analysis in production. At least in Java, there are already a lot of packages with OpenTelemetry hooks built in. [2]
> > > > > > > I'm not sure what the overhead is when disabled--I think it is probably minimal or else it wouldn't be used so widely. But if we're not ready to jump right in, we could introduce our own @WithSpan annotation which by default is a no-op. To build an instrumented Arrow lib, you'd hook it up with a shim. Or you could just maintain a branch with instrumentation for people to try it out.
> > > > > > >
> > > > > > > [1] https://lightstep.com/blog/brief-history-of-opentelemetry/
> > > > > > > [2] https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/supported-libraries.md
> > > > > > >
> > > > > > > On 2021/04/30 22:18:46, Evan Chan <evan@urbanlogiq.com> wrote:
> > > > > > > > Dear David,
> > > > > > > >
> > > > > > > > OpenTelemetry tracing is definitely the future; I guess the question is how far down the stack we want to put it. I think it would be useful for Flight and other higher-level modules, and it would be really useful for DataFusion, for example.
> > > > > > > > As for being alpha, I don’t think it will stay that way very long; there is a ton of industry momentum behind OpenTelemetry.
> > > > > > > >
> > > > > > > > -Evan
> > > > > > > >
> > > > > > > > > On Apr 29, 2021, at 1:21 PM, David Li <lidavidm@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > For Arrow Datasets, I've been working to instrument the scanner to find
> > > > > > > > > bottlenecks. For example, here's a demo comparing the current async
> > > > > > > > > scanner, which doesn't truly read asynchronously, to one that does; it
> > > > > > > > > should be fairly evident where the bottleneck is:
> > > > > > > > > https://gistcdn.rawgit.org/lidavidm/b326f151fdecb2a5281b1a8be38ec1a6/a1e1a7516c5ce8f87a87ce196c6a726d1cdacf6f/index.html
> > > > > > > > >
> > > > > > > > > I'd like to upstream this, but I'd like to run some questions by
> > > > > > > > > everyone first:
> > > > > > > > > - Does this look useful to developers working on other sub-projects?
> > > > > > > > > - This uses OpenTelemetry[1], which is still in alpha, so are we
> > > > > > > > >  comfortable with adopting it? Is the overhead acceptable?
> > > > > > > > > - Is there anyone using Arrow to build services, that would find more
> > > > > > > > >  general integration useful?
> > > > > > > > >
> > > > > > > > > How it works: OpenTelemetry[1] is used to annotate and record a "span"
> > > > > > > > > for operations like reading a single record batch. The data is saved as
> > > > > > > > > JSON, then rendered by some JavaScript. The branch is at [2].
> > > > > > > > >
> > > > > > > > > As a quick summary, OpenTelemetry implements distributed tracing, in
> > > > > > > > > which a request is tracked as a directed acyclic graph of spans. A span
> > > > > > > > > is just metadata (name, ID, start/end time, parent span, ...) about an
> > > > > > > > > operation (function call, network request, ...). Typically, it's used in
> > > > > > > > > services. Spans can reference each other across machines, so you can
> > > > > > > > > track a request across multiple services (e.g. finding which service
> > > > > > > > > failed/is unusually slow in a chain of services that call each other).
> > > > > > > > >
> > > > > > > > > As opposed to a (sampling) profiler, this gives you application-level
> > > > > > > > > metadata, like filenames or S3 download rates, that you can use in
> > > > > > > > > analysis (as in the demo). It's also something you'd always keep turned
> > > > > > > > > on (at least when running a service). If integrated with Flight,
> > > > > > > > > OpenTelemetry would also give us a performance picture across multiple
> > > > > > > > > machines - speculatively, something like making a request to a Flight
> > > > > > > > > service and being able to trace all the requests it makes to S3.
> > > > > > > > >
> > > > > > > > > It does have some overhead; you wouldn't annotate every function in a
> > > > > > > > > codebase. This is rather anecdotal, but for the demo above, there was
> > > > > > > > > essentially zero impact on runtime. Of course, that demo records very
> > > > > > > > > little data overall, so it's not very representative.
> > > > > > > > >
> > > > > > > > > Alternatives:
> > > > > > > > > - Add a simple Span class of our own, and defer Flight until later.
> > > > > > > > > - Integrate OpenTelemetry in such a way that it gets compiled out if not
> > > > > > > > >  enabled at build time. This would be messier but should alleviate any
> > > > > > > > >  performance questions.
> > > > > > > > > - Use something like Perfetto[3] or LLVM XRay[4]. They have their own
> > > > > > > > >  caveats (e.g. XRay is LLVM-specific) and aren't intended for the
> > > > > > > > >  multi-machine use case, but would otherwise work. I haven't looked
> > > > > > > > >  into these much, but could evaluate them, especially if they seem more
> > > > > > > > >  fit for purpose for use in other Arrow subprojects.
> > > > > > > > >
> > > > > > > > > If people aren't super enthused, I'll most likely go with adding a
> > > > > > > > > custom Span class for Datasets, and defer the question of whether we
> > > > > > > > > should integrate Flight/Datasets with OpenTelemetry until another use
> > > > > > > > > case arises. But recently we have seen interest in this - so I see this
> > > > > > > > > as perhaps a chance to take care of two problems at once.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > David
> > > > > > > > >
> > > > > > > > > [1]: https://opentelemetry.io/
> > > > > > > > > [2]: https://github.com/lidavidm/arrow/tree/arrow-opentelemetry
> > > > > > > > > [3]: https://perfetto.dev/
> > > > > > > > > [4]: https://llvm.org/docs/XRay.html
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > 
> > 
> 

Re: [C++] Adopting a library for (distributed) tracing

Posted by David Li <li...@apache.org>.
Ah, sorry, I meant Perfetto solely as an example of a related library and not something that we had actually evaluated. Thanks for the details on the use cases.

I think the PR should be ready now, and then we can start instrumenting the engine/Flight and seeing how we can best make use of this. The PR now doesn't enable OpenTelemetry at all unless a build flag is passed.

For the visualization, Phillip showed how to use the "native" OTel collector in that PR and I confirmed it works for me. That at least handles collecting data, but visualization will need more work.

-David

On Wed, Nov 17, 2021, at 16:22, Weston Pace wrote:
> Hmm, I see the mention but I don't recall actually working with
> Perfetto (though, it's entirely possible I did and just forgot).  My
> goal isn't entirely identifying code bottlenecks however.  I'd divide
> it into two:
> 
> Improving Arrow's C++ engine: OT is very helpful here, especially when
> working on threading / scheduling type concerns, because it isn't so
> much an "am I computing XYZ as fast as possible?" but more "are we
> working on the correct tasks and utilizing the cores efficiently?"  I
> have found OT is necessary but not sufficient as OT doesn't handle
> analysis / visualization.  I experimented a bit with different
> visualization tools (maybe I mentioned Perfetto then) but I've yet to
> successfully get one configured (you and I encountered issues with
> Jaeger and I haven't tried since then but I think you fixed the
> issues).  So the latest (though not great) workflow I've been using is
> OT + python notebook + perf/vtune/etc.  This sort of task is a
> development-focused task.  Perfetto might be useful here, I can't say.
> 
> Query visibility: This task is less of an "improving the C++ engine"
> and more "introducing visibility into the engine for consumers".  For
> example, people might wonder why a particular query is running slowly
> and need to be able to trace down further.  The resulting fix _might_
> be a JIRA on the C++ engine but it also might be a realization that
> the user has an inefficient query and the user switches to some other
> query.  This case isn't a development use case but more of a user use
> case.  I don't think Perfetto would fit this use case very well.
> 
> -Weston
> 
> On Wed, Nov 17, 2021 at 10:21 AM David Li <li...@apache.org> wrote:
> >
> > Ah, right - I'm not suggesting we use Perfetto, rather I'm just generally curious about people's experience with these kinds of tools.
> >
> > -David
> >
> > On Wed, Nov 17, 2021, at 13:00, Antoine Pitrou wrote:
> > >
> > > Le 16/11/2021 à 17:18, David Li a écrit :
> > > > Following up here: I'm hoping we can enable this in 7.0.0 and am still working on getting all the builds passing (currently RPM packages fail to build with it enabled). OpenTelemetry released their v1.0.0 recently so that should not be a problem anymore.
> > > >
> > > > Some changes in approach:
> > > >   * For now, I've removed integration with Flight and any other components, focusing on just getting the builds working. I'll file follow-up issues for the Flight integration.
> > > >   * Unlike before, I'll change this to be built only when enabled, instead of always. Flight will implicitly enable OpenTelemetry once integrated. (Thanks to @Kou for questioning this.)
> > > >   * I'm now looking at using this for evaluating performance issues/bottlenecks in the C++ query engine, instead of/in addition to the original use case in Flight. I'm curious if others have used OpenTelemetry or similar libraries for this purpose before. I know tools like Perfetto [1] are similar in concept if not approach, and @Weston was experimenting with it for this purpose as well earlier in the thread.
> > > > [1]: https://perfetto.dev/
> > >
> > > Isn't OpenTelemetry language-agnostic while Perfetto is a C++-only
> > > library? (or are the two interoperable?)
> > >
> > > > It seems that being language-agnostic would make OpenTelemetry a better
> > > fit for Arrow (ideally, one could mingle C++, Rust or Java calls and
> > > trace them together).
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> 

Re: [C++] Adopting a library for (distributed) tracing

Posted by Weston Pace <we...@gmail.com>.
Hmm, I see the mention but I don't recall actually working with
Perfetto (though, it's entirely possible I did and just forgot).  My
goal isn't entirely identifying code bottlenecks however.  I'd divide
it into two:

Improving Arrow's C++ engine: OT is very helpful here, especially when
working on threading / scheduling type concerns, because it isn't so
much an "am I computing XYZ as fast as possible?" but more "are we
working on the correct tasks and utilizing the cores efficiently?"  I
have found OT is necessary but not sufficient as OT doesn't handle
analysis / visualization.  I experimented a bit with different
visualization tools (maybe I mentioned Perfetto then) but I've yet to
successfully get one configured (you and I encountered issues with
Jaeger and I haven't tried since then but I think you fixed the
issues).  So the latest (though not great) workflow I've been using is
OT + python notebook + perf/vtune/etc.  This sort of task is a
development-focused task.  Perfetto might be useful here, I can't say.

Query visibility: This task is less of an "improving the C++ engine"
and more "introducing visibility into the engine for consumers".  For
example, people might wonder why a particular query is running slowly
and need to be able to trace down further.  The resulting fix _might_
be a JIRA on the C++ engine but it also might be a realization that
the user has an inefficient query and the user switches to some other
query.  This case isn't a development use case but more of a user use
case.  I don't think Perfetto would fit this use case very well.
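
To make the development-focused workflow above (OT + Python notebook) a bit more concrete, here is a rough sketch of the kind of notebook analysis that could answer "are we utilizing the cores efficiently?". It is only an illustration: the record fields (`name`, `thread`, `start_ns`, `end_ns`) and the sample spans are hypothetical and do not reflect the actual schema of any OpenTelemetry exporter.

```python
from collections import defaultdict


def busy_fraction_per_thread(spans):
    """Return {thread: fraction of the trace's wall-clock time covered by spans}.

    Overlapping or nested spans on the same thread are merged so that time
    is not counted twice.
    """
    if not spans:
        return {}
    trace_start = min(s["start_ns"] for s in spans)
    trace_end = max(s["end_ns"] for s in spans)
    wall = trace_end - trace_start

    by_thread = defaultdict(list)
    for s in spans:
        by_thread[s["thread"]].append((s["start_ns"], s["end_ns"]))

    result = {}
    for thread, intervals in by_thread.items():
        intervals.sort()
        busy = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping or nested span: extend the current interval.
                cur_end = max(cur_end, end)
            else:
                busy += cur_end - cur_start
                cur_start, cur_end = start, end
        busy += cur_end - cur_start
        result[thread] = busy / wall if wall else 0.0
    return result


# Hypothetical spans, as they might look after loading exported JSON.
spans = [
    {"name": "ReadBatch", "thread": "io-0", "start_ns": 0, "end_ns": 40},
    {"name": "Decode", "thread": "cpu-0", "start_ns": 40, "end_ns": 100},
    {"name": "ReadBatch", "thread": "io-0", "start_ns": 50, "end_ns": 60},
    {"name": "Filter", "thread": "cpu-0", "start_ns": 60, "end_ns": 90},
]
fractions = busy_fraction_per_thread(spans)
```

A low fraction for a CPU thread would point at scheduling gaps rather than slow kernels, which is where perf/vtune would take over.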

-Weston

On Wed, Nov 17, 2021 at 10:21 AM David Li <li...@apache.org> wrote:
>
> Ah, right - I'm not suggesting we use Perfetto, rather I'm just generally curious about people's experience with these kinds of tools.
>
> -David
>
> On Wed, Nov 17, 2021, at 13:00, Antoine Pitrou wrote:
> >
> > Le 16/11/2021 à 17:18, David Li a écrit :
> > > Following up here: I'm hoping we can enable this in 7.0.0 and am still working on getting all the builds passing (currently RPM packages fail to build with it enabled). OpenTelemetry released their v1.0.0 recently so that should not be a problem anymore.
> > >
> > > Some changes in approach:
> > >   * For now, I've removed integration with Flight and any other components, focusing on just getting the builds working. I'll file follow-up issues for the Flight integration.
> > >   * Unlike before, I'll change this to be built only when enabled, instead of always. Flight will implicitly enable OpenTelemetry once integrated. (Thanks to @Kou for questioning this.)
> > >   * I'm now looking at using this for evaluating performance issues/bottlenecks in the C++ query engine, instead of/in addition to the original use case in Flight. I'm curious if others have used OpenTelemetry or similar libraries for this purpose before. I know tools like Perfetto [1] are similar in concept if not approach, and @Weston was experimenting with it for this purpose as well earlier in the thread.
> > > [1]: https://perfetto.dev/
> >
> > Isn't OpenTelemetry language-agnostic while Perfetto is a C++-only
> > library? (or are the two interoperable?)
> >
> > It seems that being language-agnostic would make OpenTelemetry a better
> > fit for Arrow (ideally, one could mingle C++, Rust or Java calls and
> > trace them together).
> >
> > Regards
> >
> > Antoine.
> >

Re: [C++] Adopting a library for (distributed) tracing

Posted by David Li <li...@apache.org>.
Ah, right - I'm not suggesting we use Perfetto, rather I'm just generally curious about people's experience with these kinds of tools.

-David

On Wed, Nov 17, 2021, at 13:00, Antoine Pitrou wrote:
> 
> Le 16/11/2021 à 17:18, David Li a écrit :
> > Following up here: I'm hoping we can enable this in 7.0.0 and am still working on getting all the builds passing (currently RPM packages fail to build with it enabled). OpenTelemetry released their v1.0.0 recently so that should not be a problem anymore.
> > 
> > Some changes in approach:
> >   * For now, I've removed integration with Flight and any other components, focusing on just getting the builds working. I'll file follow-up issues for the Flight integration.
> >   * Unlike before, I'll change this to be built only when enabled, instead of always. Flight will implicitly enable OpenTelemetry once integrated. (Thanks to @Kou for questioning this.)
> >   * I'm now looking at using this for evaluating performance issues/bottlenecks in the C++ query engine, instead of/in addition to the original use case in Flight. I'm curious if others have used OpenTelemetry or similar libraries for this purpose before. I know tools like Perfetto [1] are similar in concept if not approach, and @Weston was experimenting with it for this purpose as well earlier in the thread.
> > [1]: https://perfetto.dev/
> 
> Isn't OpenTelemetry language-agnostic while Perfetto is a C++-only 
> library? (or are the two interoperable?)
> 
> It seems that being language-agnostic would make OpenTelemetry a better
> fit for Arrow (ideally, one could mingle C++, Rust or Java calls and 
> trace them together).
> 
> Regards
> 
> Antoine.
> 

Re: [C++] Adopting a library for (distributed) tracing

Posted by Antoine Pitrou <an...@python.org>.
Le 16/11/2021 à 17:18, David Li a écrit :
> Following up here: I'm hoping we can enable this in 7.0.0 and am still working on getting all the builds passing (currently RPM packages fail to build with it enabled). OpenTelemetry released their v1.0.0 recently so that should not be a problem anymore.
> 
> Some changes in approach:
>   * For now, I've removed integration with Flight and any other components, focusing on just getting the builds working. I'll file follow-up issues for the Flight integration.
>   * Unlike before, I'll change this to be built only when enabled, instead of always. Flight will implicitly enable OpenTelemetry once integrated. (Thanks to @Kou for questioning this.)
>   * I'm now looking at using this for evaluating performance issues/bottlenecks in the C++ query engine, instead of/in addition to the original use case in Flight. I'm curious if others have used OpenTelemetry or similar libraries for this purpose before. I know tools like Perfetto [1] are similar in concept if not approach, and @Weston was experimenting with it for this purpose as well earlier in the thread.
> [1]: https://perfetto.dev/

Isn't OpenTelemetry language-agnostic while Perfetto is a C++-only 
library? (or are the two interoperable?)

It seems that being language-agnostic would make OpenTelemetry a better
fit for Arrow (ideally, one could mingle C++, Rust or Java calls and 
trace them together).
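
As a sketch of what makes that kind of mingling possible: spans from different languages end up in one trace when each component passes along a shared trace ID, commonly via the W3C Trace Context `traceparent` header. The helper names below are hypothetical, and whether Arrow's integration would propagate context this way is an assumption, not something established in this thread.

```python
import re
import secrets

# traceparent layout: version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def make_traceparent(trace_id=None, span_id=None, sampled=True):
    """Format a version-00 traceparent value, generating random IDs if not given."""
    trace_id = trace_id or secrets.token_hex(16)
    span_id = span_id or secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Parse a traceparent value into its parts, rejecting malformed input."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    _version, trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        raise ValueError("all-zero trace or span ID is invalid")
    return {
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def child_header(incoming):
    """Build the header a callee would send onward: same trace ID, new span ID."""
    ctx = parse_traceparent(incoming)
    return make_traceparent(trace_id=ctx["trace_id"], sampled=ctx["sampled"])


# Example value taken from the W3C Trace Context specification.
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parent = parse_traceparent(incoming)
outgoing = parse_traceparent(child_header(incoming))
```

Because the format is plain text, a C++ caller and a Rust or Java callee only need to agree on this header, not on a shared library.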

Regards

Antoine.

Re: [C++] Adopting a library for (distributed) tracing

Posted by David Li <li...@apache.org>.
Following up here: I'm hoping we can enable this in 7.0.0 and am still working on getting all the builds passing (currently RPM packages fail to build with it enabled). OpenTelemetry released their v1.0.0 recently so that should not be a problem anymore.

Some changes in approach:
 * For now, I've removed integration with Flight and any other components, focusing on just getting the builds working. I'll file follow-up issues for the Flight integration.
 * Unlike before, I'll change this to be built only when enabled, instead of always. Flight will implicitly enable OpenTelemetry once integrated. (Thanks to @Kou for questioning this.)
 * I'm now looking at using this for evaluating performance issues/bottlenecks in the C++ query engine, instead of/in addition to the original use case in Flight. I'm curious if others have used OpenTelemetry or similar libraries for this purpose before. I know tools like Perfetto [1] are similar in concept if not approach, and @Weston was experimenting with it for this purpose as well earlier in the thread.
[1]: https://perfetto.dev/

-David

On Mon, Jul 12, 2021, at 09:47, David Li wrote:
> A quick update on this: I don't think this will happen for 5.0. The upstream library still hasn't reached 1.0, and I don't want to cram this in at the end of a cycle, especially as each of their release candidates has needed an upstream fix to keep all our CI platforms working. Furthermore, there appear to be some issues with their exporters (or in our usage of them) that I'd like to resolve. Finally, I'd like to have a more complete example here, especially one using an existing tool like Jaeger instead of an ad-hoc visualization.
> 
> -David
> 
> On Wed, Jun 9, 2021, at 13:01, David Li wrote:
> > I just updated the PR with support for exporting to Jaeger[1], which
> > has a built-in trace viewer.
> > 
> > 1. Download and run the all-in-one Jaeger binary locally[2] (or their
> >    Docker image)
> > 2. Build Arrow with `-DARROW_WITH_OPENTELEMETRY=ON -DARROW_THRIFT=ON`
> > 3. Run your application with `env ARROW_TRACING_BACKEND=jaeger`
> > 4. Visit http://localhost:16686 and search for "unknown_service".
> > 
> > This gives you a variety of ways to drill into the captured data. Let
> > me know what you think if you get a chance.
> > 
> > Now, while this is convenient, I'm not so sure about bundling it with
> > Arrow; as a library, we should be leaving all this config up to the
> > end-user application. But since this is all in C++, it can be
> > hard/annoying to configure in PyArrow and this is helpful for
> > development and debugging. At the very least, it's behind an optional
> > build flag so it won't ship by default.
> > 
> > Also I see that Kibana (with proprietary xpack) and Grafana have trace
> > viewers now; OpenTelemetry doesn't include exporters for trace data to
> > those backends (only metrics/logs) but that could be another option.
> > 
> > Best,
> > David
> > 
> > [1]: https://www.jaegertracing.io/
> > [2]: https://www.jaegertracing.io/docs/1.22/getting-started/#all-in-one
> > 
> > On 2021/06/08 19:30:06, David Li <li...@apache.org> wrote: 
> > > I'll have to do some more digging into that and get back to you. So
> > > far I've been using a quick-and-dirty tool that I whipped up using
> > > Vega-Lite but that's probably not something we want to maintain. I
> > > tried the Chrome trace viewer ("Catapult") but it's not quite built
> > > for this kind of trace; I hear Jaeger's trace viewer can be used
> > > standalone but needs some setup.
> > > 
> > > Though that does raise a good point: we should eventually have
> > > documentation on this knob and how to use it.
> > > 
> > > -David
> > > 
> > > On 2021/06/08 19:21:16, Weston Pace <we...@gmail.com> wrote: 
> > > > FWIW, I tried this out yesterday since I was profiling the execution
> > > > of the async API reader.  It worked great so +1 from me on that basis.
> > > > I did struggle finding a good simple visualization tool.  Do you have
> > > > any good recommendations on that front?
> > > > 
> > > > On Mon, Jun 7, 2021 at 10:50 AM David Li <li...@apache.org> wrote:
> > > > >
> > > > > Just to give an update on where this stands:
> > > > >
> > > > > Upstream recently released v1.0.0-RC1 and I've updated the PR[1] to
> > > > > use it. This contains a few fixes I submitted for the platforms our
> > > > > various CI jobs use, as well as an explicit build flag to support
> > > > > header-only use - I think this should alleviate any concerns over it
> > > > > adding to our build too much. I'm hopeful this means it can make it
> > > > > into 5.0.0, at least with minimal functionality.
> > > > >
> > > > > For anyone interested in using OpenTelemetry with Arrow, I hope you'll
> > > > > have a chance to look through the PR and see if there are any places
> > > > > where adding tracing may be useful.
> > > > >
> > > > > I also touched base with upstream about Python/C++ interop[2] - it
> > > > > turns out upstream has thought about this before but doesn't have the
> > > > > resources to pursue it at the moment, as the idea is to write an
> > > > > API-compatible binding of the C++ library for Python (and presumably
> > > > > R, Ruby, etc.) which is more work.
> > > > >
> > > > > Best,
> > > > > David
> > > > >
> > > > > [1]: https://github.com/apache/arrow/pull/10260
> > > > > [2]: https://github.com/open-telemetry/community/discussions/734
> > > > >
> > > > > On 2021/05/06 18:23:05, David Li <li...@apache.org> wrote:
> > > > > > I've created ARROW-12671 [1] to track this work and filed a draft PR
> > > > > > [2]; I'd appreciate any feedback, particularly from anyone already
> > > > > > trying to use OpenTelemetry/Tracing/Census with Arrow.
> > > > > >
> > > > > > For dependencies: now we use OpenTelemetry as header-only by
> > > > > > default. I also slimmed down the build, avoiding making the build wait
> > > > > > on OpenTelemetry. By setting a CMake flag, you can link Arrow against
> > > > > > OpenTelemetry, which will bundle a simple JSON-to-stderr exporter that
> > > > > > can be toggled via environment variable.
> > > > > >
> > > > > > For Python: the PR includes basic integration with Flight/Python. The
> > > > > > C++ side will start a span, then propagate it to Python. Spans in
> > > > > > Python will not propagate back to C++, and Python/C++ need to both set
> > > > > > > up their respective exporters. I plan to ask the upstream community
> > > > > > > whether there's a good solution to this kind of issue.
> > > > > >
> > > > > > For ABI compatibility: this will be an issue until upstream reaches
> > > > > > 1.0. Even currently, there's an unreleased change on their main branch
> > > > > > which will break the current PR when it's released. Hopefully, they
> > > > > > > will reach 1.0 in the Arrow 5.0 release cycle; otherwise, we probably want
> > > > > > to avoid shipping this until there is a 1.0. I have confirmed that
> > > > > > linking an application which itself links OpenTelemetry to Arrow
> > > > > > works.
> > > > > >
> > > > > > As for the overhead: I measured the impact on a dataset scan recording
> > > > > > ~900 spans per iteration and there was no discernible effect on
> > > > > > runtime compared to an uninstrumented scan (though again, this is not
> > > > > > that many spans).
> > > > > >
> > > > > > Best,
> > > > > > David
> > > > > >
> > > > > > [1]: https://issues.apache.org/jira/browse/ARROW-12671
> > > > > > [2]: https://github.com/apache/arrow/pull/10260
> > > > > >
> > > > > > On 2021/05/01 19:53:45, "David Li" <li...@apache.org> wrote:
> > > > > > > Thanks everyone for all the comments. Responding to a few things:
> > > > > > >
> > > > > > > > It seems to me it would be fairly implementation dependent -- so each
> > > > > > > > language implementation would choose if it made sense for them and then
> > > > > > > > implement the appropriate connection to that language's open telemetry
> > > > > > > > ecosystem.
> > > > > > >
> > > > > > > Agreed - I think the important thing is to agree on using OpenTelemetry itself so that the various Flight implementations, for instance, can all contribute compatible trace data. And there will be details like naming of keys for extra metadata we might want to attach, or trying to make (some) span names consistent.
> > > > > > >
> > > > > > > > My main question is: does integrating OpenTracing complicate our build
> > > > > > > > procedure?  Is it header-only as long as you use the no-op tracer?  Or
> > > > > > > > do you have to build it and link with it nonetheless?
> > > > > > >
> > > > > > > I need to look into this more and will follow up. I believe we can use it header-only. It's fairly simple to depend on (and has no required dependencies), but it is a synchronous build step (you must build it to have its headers available) - perhaps that could be resolved upstream or I am configuring CMake wrongly. Right now, I've linked in OpenTelemetry to provide a few utilities (e.g. logging data to stdout as JSON), but that could be split out into a libarrow_tracing.so if we keep them.
> > > > > > >
> > > > > > > > Also, are there ABI issues that may complicate integration into
> > > > > > > > applications that were compiled against another version of OpenTracing?
> > > > > > >
> > > > > > > Upstream already seems to be considering ABI compatibility. However, until they reach 1.0, of course they need not keep any promises, and that is a worry depending on their timeline. As pointed out already, they are moving quickly, but they are behind the other languages' OpenTelemetry implementations.
> > > > > > >
> > > > > > > > I'm not sure what the overhead is when disabled--I think it is probably minimal or else it wouldn't be used so widely. But if we're not ready to jump right in, we could introduce our own @WithSpan annotation which by default is a no-op. To build an instrumented Arrow lib, you'd hook it up with a shim.
> > > > > > >
> > > > > > > I am focusing on C++ here but of course the other languages come into play. A similar idea for C++ may be useful if we need to have OpenTelemetry be optional to avoid ABI worries. A branch may also work, but I'd like to avoid that if possible.
> > > > > > >
> > > > > > > Best,
> > > > > > > David
> > > > > > >
> > > > > > > On Sat, May 1, 2021, at 10:52, Bob Tinsman wrote:
> > > > > > > > I agree that OpenTelemetry is the future; I have been following the observability space off and on and I knew about OpenTracing; I just realized that OpenTelemetry is its successor. [1]
> > > > > > > > I have found tracing to be a very powerful approach; at one point, I did a POC of a trace recorder inside a Java webapp, which shed light on some nasty bottlenecks. If integrated properly, it can be left on all the time, so it's valuable for doing root-cause analysis in production. At least in Java, there are already a lot of packages with OpenTelemetry hooks built in. [2]
> > > > > > > > I'm not sure what the overhead is when disabled--I think it is probably minimal or else it wouldn't be used so widely. But if we're not ready to jump right in, we could introduce our own @WithSpan annotation which by default is a no-op. To build an instrumented Arrow lib, you'd hook it up with a shim. Or you could just maintain a branch with instrumentation for people to try it out.
> > > > > > > >
> > > > > > > > [1] https://lightstep.com/blog/brief-history-of-opentelemetry/
> > > > > > > > [2] https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/supported-libraries.md
> > > > > > > >
> > > > > > > > > On 2021/04/30 22:18:46, Evan Chan <evan@urbanlogiq.com> wrote:
> > > > > > > > > Dear David,
> > > > > > > > >
> > > > > > > > > > OpenTelemetry tracing is definitely the future; I guess the question is how far down the stack we want to put it. I think it would be useful for Flight and other higher-level modules, and for DataFusion, for example, it would be really useful.
> > > > > > > > > > As for being alpha, I don’t think it will stay that way very long; there is a ton of industry momentum behind OpenTelemetry.
> > > > > > > > >
> > > > > > > > > -Evan
> > > > > > > > >
> > > > > > > > > > > On Apr 29, 2021, at 1:21 PM, David Li <lidavidm@apache.org> wrote:
> > > > > > > > > >
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > For Arrow Datasets, I've been working to instrument the scanner to find
> > > > > > > > > > bottlenecks. For example, here's a demo comparing the current async
> > > > > > > > > > scanner, which doesn't truly read asynchronously, to one that does; it
> > > > > > > > > > should be fairly evident where the bottleneck is:
> > > > > > > > > > https://gistcdn.rawgit.org/lidavidm/b326f151fdecb2a5281b1a8be38ec1a6/a1e1a7516c5ce8f87a87ce196c6a726d1cdacf6f/index.html
> > > > > > > > > >
> > > > > > > > > > I'd like to upstream this, but I'd like to run some questions by
> > > > > > > > > > everyone first:
> > > > > > > > > > - Does this look useful to developers working on other sub-projects?
> > > > > > > > > > - This uses OpenTelemetry[1], which is still in alpha, so are we
> > > > > > > > > >  comfortable with adopting it? Is the overhead acceptable?
> > > > > > > > > > > - Is there anyone using Arrow to build services who would find more
> > > > > > > > > > >  general integration useful?
> > > > > > > > > >
> > > > > > > > > > How it works: OpenTelemetry[1] is used to annotate and record a "span"
> > > > > > > > > > for operations like reading a single record batch. The data is saved as
> > > > > > > > > > JSON, then rendered by some JavaScript. The branch is at [2].
> > > > > > > > > >
> > > > > > > > > > As a quick summary, OpenTelemetry implements distributed tracing, in
> > > > > > > > > > which a request is tracked as a directed acyclic graph of spans. A span
> > > > > > > > > > is just metadata (name, ID, start/end time, parent span, ...) about an
> > > > > > > > > > operation (function call, network request, ...). Typically, it's used in
> > > > > > > > > > services. Spans can reference each other across machines, so you can
> > > > > > > > > > track a request across multiple services (e.g. finding which service
> > > > > > > > > > failed/is unusually slow in a chain of services that call each other).
> > > > > > > > > >
> > > > > > > > > > As opposed to a (sampling) profiler, this gives you application-level
> > > > > > > > > > metadata, like filenames or S3 download rates, that you can use in
> > > > > > > > > > analysis (as in the demo). It's also something you'd always keep turned
> > > > > > > > > > on (at least when running a service). If integrated with Flight,
> > > > > > > > > > OpenTelemetry would also give us a performance picture across multiple
> > > > > > > > > > machines - speculatively, something like making a request to a Flight
> > > > > > > > > > service and being able to trace all the requests it makes to S3.
> > > > > > > > > >
> > > > > > > > > > It does have some overhead; you wouldn't annotate every function in a
> > > > > > > > > > codebase. This is rather anecdotal, but for the demo above, there was
> > > > > > > > > > essentially zero impact on runtime. Of course, that demo records very
> > > > > > > > > > little data overall, so it's not very representative.
> > > > > > > > > >
> > > > > > > > > > Alternatives:
> > > > > > > > > > - Add a simple Span class of our own, and defer Flight until later.
> > > > > > > > > > - Integrate OpenTelemetry in such a way that it gets compiled out if not
> > > > > > > > > >  enabled at build time. This would be messier but should alleviate any
> > > > > > > > > >  performance questions.
> > > > > > > > > > - Use something like Perfetto[3] or LLVM XRay[4]. They have their own
> > > > > > > > > >  caveats (e.g. XRay is LLVM-specific) and aren't intended for the
> > > > > > > > > >  multi-machine use case, but would otherwise work. I haven't looked
> > > > > > > > > >  into these much, but could evaluate them, especially if they seem more
> > > > > > > > > >  fit for purpose for use in other Arrow subprojects.
> > > > > > > > > >
> > > > > > > > > > If people aren't super enthused, I'll most likely go with adding a
> > > > > > > > > > custom Span class for Datasets, and defer the question of whether we
> > > > > > > > > > should integrate Flight/Datasets with OpenTelemetry until another use
> > > > > > > > > > case arises. But recently we have seen interest in this - so I see this
> > > > > > > > > > as perhaps a chance to take care of two problems at once.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > David
> > > > > > > > > >
> > > > > > > > > > [1]: https://opentelemetry.io/
> > > > > > > > > > [2]: https://github.com/lidavidm/arrow/tree/arrow-opentelemetry
> > > > > > > > > > [3]: https://perfetto.dev/
> > > > > > > > > > [4]: https://llvm.org/docs/XRay.html
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > 
> > > 
> > 
>
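
For readers new to the terminology, the span concept from the original
proposal above (name, ID, start/end time, parent span) can be sketched in a
few lines of C++. This is a minimal illustration only: `sketch::Span`,
`sketch::Recorder`, and `sketch::ScanTwoBatches` are hypothetical names, not
the OpenTelemetry C++ API and not code from the PR, and the recorder assumed
here is in-memory and not thread-safe.

```cpp
#include <chrono>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

namespace sketch {  // hypothetical namespace, for illustration only

// Metadata for one finished operation, mirroring the fields listed in the
// thread: name, ID, parent span, start/end time.
struct SpanRecord {
  std::string name;
  std::uint64_t id = 0;
  std::uint64_t parent_id = 0;  // 0 means "no parent" (a root span)
  std::chrono::steady_clock::time_point start;
  std::chrono::steady_clock::time_point end;
};

// Hands out span IDs and collects finished spans in memory.
// Assumption: single-threaded use; a real tracer would need synchronization.
class Recorder {
 public:
  std::uint64_t NextId() { return ++last_id_; }
  void Finish(SpanRecord record) { records_.push_back(std::move(record)); }
  const std::vector<SpanRecord>& records() const { return records_; }

 private:
  std::uint64_t last_id_ = 0;
  std::vector<SpanRecord> records_;
};

// RAII span: the start time is taken in the constructor, and the end time is
// taken (and the record handed to the recorder) in the destructor.
class Span {
 public:
  Span(Recorder* recorder, std::string name, const Span* parent = nullptr)
      : recorder_(recorder) {
    record_.name = std::move(name);
    record_.id = recorder_->NextId();
    record_.parent_id = parent ? parent->id() : 0;
    record_.start = std::chrono::steady_clock::now();
  }

  ~Span() {
    record_.end = std::chrono::steady_clock::now();
    recorder_->Finish(std::move(record_));
  }

  Span(const Span&) = delete;
  Span& operator=(const Span&) = delete;

  std::uint64_t id() const { return record_.id; }

 private:
  Recorder* recorder_;
  SpanRecord record_;
};

// Hypothetical instrumented operation: one parent span for the scan and one
// child span per record batch read.
inline void ScanTwoBatches(Recorder* recorder) {
  Span scan(recorder, "Scan");
  for (int i = 0; i < 2; ++i) {
    Span read(recorder, "ReadBatch", &scan);
    // ... the actual work for one batch would go here ...
  }
}

}  // namespace sketch
```

Calling `sketch::ScanTwoBatches(&recorder)` leaves three records in
`recorder.records()`: two `ReadBatch` spans whose `parent_id` is the ID of the
enclosing `Scan` span, followed by the `Scan` span itself, since spans are
recorded when they finish. The "compiled out if not enabled at build time"
alternative from the thread could, under the same assumptions, be approximated
by swapping `Span` for an empty class behind a build flag.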