Posted to user@beam.apache.org by Rion Williams <ri...@gmail.com> on 2020/04/17 15:38:07 UTC

Distributed Tracing in Apache Beam

Hi all,

I'm reaching out today to ask whether Apache Beam has any support or mechanisms for some type of distributed tracing similar to Jaeger (https://www.jaegertracing.io/). Jaeger itself has a Java SDK; however, since Beam works with transforms that yield immutable collections, I wasn't sure what the best avenue would be to correlate various transforms against a particular element.

Coming from a Kafka Streams background, this process was pretty trivial: I'd simply store my correlation identifier within the message headers, and those would be persisted as the element traveled through Kafka into various applications and topics. I'm hoping to still leverage some of that in Beam if at all possible, or to see what recommended approaches, if any, are out there.
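For context, the Kafka Streams side looked roughly like this (a sketch from memory; the Transformer and the "correlation-id" header name are illustrative rather than my actual code):

    import java.nio.charset.StandardCharsets;
    import org.apache.kafka.common.header.Header;
    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.kstream.Transformer;
    import org.apache.kafka.streams.processor.ProcessorContext;

    // The correlation id rides along in the record headers; any Transformer in the
    // topology can read it (or append to it) via the ProcessorContext.
    class CorrelatingTransformer implements Transformer<String, String, KeyValue<String, String>> {
      private ProcessorContext context;

      @Override
      public void init(ProcessorContext context) {
        this.context = context;
      }

      @Override
      public KeyValue<String, String> transform(String key, String value) {
        Header correlation = context.headers().lastHeader("correlation-id");
        if (correlation != null) {
          String correlationId = new String(correlation.value(), StandardCharsets.UTF_8);
          // ... associate whatever this step does with correlationId ...
        }
        return KeyValue.pair(key, value);
      }

      @Override
      public void close() {}
    }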

My current approach involves creating a "Tracing Context", which would just be a wrapper for each of my elements carrying its own associated trace. Instead of passing around a PCollection<X>, I would use a PCollection<Traceable<X>>, where the Traceable is simply a wrapper around the tracer and the underlying element, so that I could access the tracer during any element-wise operations in the pipeline.
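Concretely, I'm picturing something like this (a minimal sketch; Traceable and its fields are made-up names, and a real version would still need a Coder for the pipeline, e.g. falling back to SerializableCoder):

    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.Map;

    // Pairs an element with the tracing metadata needed to rebuild its tracer/span
    // later (e.g. the trace/span ids copied from the incoming Kafka headers).
    public class Traceable<T> implements Serializable {
      private final T element;
      private final Map<String, String> traceContext;

      public Traceable(T element, Map<String, String> traceContext) {
        this.element = element;
        this.traceContext = new HashMap<>(traceContext);
      }

      public T getElement() {
        return element;
      }

      public Map<String, String> getTraceContext() {
        return traceContext;
      }
    }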

Any recommendations or suggestions are more than welcome! I'm very new to the Beam ecosystem, so I'd love to leverage anything out there that might help me from reinventing the wheel.

Thanks much!

Rion

Re: Distributed Tracing in Apache Beam

Posted by Kenneth Knowles <ke...@apache.org>.
+dev <de...@beam.apache.org>

I don't have a ton of time to dig into this, but I wanted to say that this
is very cool and just drop a couple pointers (which you may already know
about) like Explaining Outputs in Modern Data Analytics [1] which was
covered by The Morning Paper [2]. This just happens to be something I read
a few years ago - following citations of the paper or pingbacks on the blog
post yields a lot more work, some of which may be helpful. There seems to
be a slight difference of emphasis between tracing in an arbitrary
distributed system versus explaining big data results. I would expect
general tracing (which Jaeger is?) to be more complex and expensive to run,
but that's just an intuition.

Kenn

[1] http://www.vldb.org/pvldb/vol9/p1137-chothia.pdf
[2]
https://blog.acolyer.org/2017/02/01/explaining-outputs-in-modern-data-analytics/

On Fri, Apr 17, 2020 at 10:56 AM Rion Williams <ri...@gmail.com>
wrote:

> Hi Alexey,
>
> I think you’re right about the wrapper, it’s likely unnecessary as I think
> I’d have enough information in the headers to rehydrate my “tracer” that
> communicates the traces/spans to Jaeger as needed. I’d love to not have to
> touch those or muddy the waters with a wrapper class, additional conversion
> steps, custom coder, etc.
>
> Speaking of conversions, I agree entirely with the unified interface for
> reading/writing to Kafka. I’ll openly admit I spent far too long fighting
> with it before discovering that `withoutMetadata()` existed. So if those
> were unified and writeRecords could accept a Kafka one, that’d be great.
>
>
> > On Apr 17, 2020, at 12:47 PM, Alexey Romanenko <ar...@gmail.com>
> wrote:
> >
> > Hi Rion,
> >
> > In general, yes, it sounds reasonable to me. I just do not see why you
> need to have extra Traceable wrapper? Do you need to keep some temporary
> information there that you don’t want to store in Kafka record headers?
> >
> > PS: Now I started to think that we probably have to change an interface
> of KafkaIO.writeRecords() from ProducerRecord to the same KafkaRecord as we
> use for read. In this case we won’t expose Kafka API and use only own
> wrapper.
> > Also, user won’t need to convert types between Read and Write (like in
> this topic case).
> >
> >> On 17 Apr 2020, at 19:28, Rion Williams <ri...@gmail.com> wrote:
> >>
> >> Hi Alexey,
> >>
> >> So this is currently the approach that I'm taking. Basically creating a
> wrapper Traceable<K,V> class that will contain all of my record information
> as well as the data necessary to update the traces for that record. It
> requires an extra step and will likely mean persisting something along side
> each record as it comes in, but I'm not sure if there's another way around
> it.
> >>
> >> My current approach goes something like this:
> >> - Read records via KafkaIO (with metadata)
> >> - Apply a transform to convert all KafkaRecord<K,V> into Traceable<K,V>
> instances (which just contain a Tracer object as well as the original
> KV<K,V> record itself)
> >> - Pass this Traceable through all of the appropriate transforms,
> creating new spans for the trace as necessary via the tracer element on the
> Traceable<K,V> object.
> >> - Prior to output to a Kafka topic, transform the Traceable<K, V>
> object in to a ProducerRecord<K,V> that contains the key, value, and
> headers (from the tracer) prior to writing back to Kafka
> >>
> >> I think this will work, but it'll likely take quite a bit of
> experimentation to verify. Does this sound reasonable?
> >>
> >> Thanks,
> >>
> >> Rion
> >>
> >>> On 2020/04/17 17:14:58, Alexey Romanenko <ar...@gmail.com>
> wrote:
> >>> Not sure if it will help, but KafkaIO allows to keep all meta
> information while reading (using KafkaRecord) and writing (using
> ProducerRecord).
> >>> So, you can keep your tracing id in the record headers as you did with
> Kafka Streams.
> >>>
> >>>> On 17 Apr 2020, at 18:58, Rion Williams <ri...@gmail.com>
> wrote:
> >>>>
> >>>> Hi Alex,
> >>>>
> >>>> As mentioned before, I'm in the process of migrating a pipeline of
> several Kafka Streams applications over to Apache Beam and I'm hoping to
> leverage the tracing infrastructure that I had established using Jaeger
> whenever I can, but specifically to trace an element as it flows through a
> pipeline or potentially multiple pipelines.
> >>>>
> >>>> An example might go something like this:
> >>>>
> >>>> - An event is produced from some service and sent to a Kafka Topic
> (with a tracing id in the headers)
> >>>> - The event enters my pipeline (Beam reads from that topic) and
> begins applying a series of transforms that evaluate the element itself
> (e.g. does it have any users associated with it, IP addresses, other
> interesting information).
> >>>> - When interesting information is encountered on the element (or
> errors), I'd like to be able to associate them with the trace (e.g. a user
> was found, this is some information about the user, this is the unique
> identifier associated with them, or there was an error because the user had
> a malformed e-mail address)
> >>>> - The traces themselves would be cumulative, so if an event was
> processed through one pipeline, it would contain all the necessary tracing
> headers in the message so if another pipeline picked it up from its
> destination topic (e.g. the destination of the first pipeline), the trace
> could be continued.
> >>>>
> >>>> I think that being able to pick up interactive systems would be a
> nice to have (e.g. this record is being sent to Parquet, Mongo, or some
> other topic), but I'm just trying to focus on being able to add to the
> trace at the ParDo/element level for now.
> >>>>
> >>>> I hope that helps.
> >>>>
> >>>> Rion
> >>>>
> >>>> On 2020/04/17 16:30:14, Alex Van Boxel <al...@vanboxel.be> wrote:
> >>>>> Can you explain a bit more of what you want to achieve here?
> >>>>>
> >>>>> Do you want to trace how your elements go to the pipeline or do you
> want to
> >>>>> see how every ParDo interacts with external systems?
> >>>>>
> >>>>> On Fri, Apr 17, 2020, 17:38 Rion Williams <ri...@gmail.com>
> wrote:
> >>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I'm reaching out today to inquire if Apache Beam has any support or
> >>>>>> mechanisms to support some type of distributed tracing similar to
> something
> >>>>>> like Jaeger (https://www.jaegertracing.io/). Jaeger itself has a
> Java
> >>>>>> SDK, however due to the nature of Beam working with transforms that
> yield
> >>>>>> immutable collections, I wasn't sure what the best avenue would be
> to
> >>>>>> correlate various transforms against a particular element would be?
> >>>>>>
> >>>>>> Coming from a Kafka Streams background, this process was pretty
> trivial as
> >>>>>> I'd simply store my correlation identifier within the message
> headers and
> >>>>>> those would be persisted as the element traveled through Kafka into
> various
> >>>>>> applications and topics. I'm hoping to still leverage some of that
> in Beam
> >>>>>> if at all possible or see what, if any, recommended approaches
> there are
> >>>>>> out there.
> >>>>>>
> >>>>>> My current approach involves the creation of a "Tracing Context"
> which
> >>>>>> would just be a wrapper for each of my elements that had their own
> >>>>>> associated trace with them and instead of just passing around a
> >>>>>> PCollection<X> I would use a PCollection<Tracable<X>> that would
> just be a
> >>>>>> wrapper for the tracer and the underlying element so that I could
> access
> >>>>>> the tracer during any element-wise operations in the pipeline.
> >>>>>>
> >>>>>> Any recommendations or suggestions are more than welcome! I'm very
> new to
> >>>>>> the Beam ecosystem, so I'd love to leverage anything out there that
> might
> >>>>>> help me from reinventing the wheel.
> >>>>>>
> >>>>>> Thanks much!
> >>>>>>
> >>>>>> Rion
> >>>>>>
> >>>>>
> >>>
> >>>
> >
>

Re: Distributed Tracing in Apache Beam

Posted by Rion Williams <ri...@gmail.com>.
Hi Alexey, 

I think you’re right about the wrapper; it’s likely unnecessary, as I’d have enough information in the headers to rehydrate my “tracer” that communicates the traces/spans to Jaeger as needed. I’d love not to have to touch those or muddy the waters with a wrapper class, additional conversion steps, a custom coder, etc.

Speaking of conversions, I agree entirely about a unified interface for reading from and writing to Kafka. I’ll openly admit I spent far too long fighting with it before discovering that `withoutMetadata()` existed. So if those were unified and writeRecords could accept a KafkaRecord, that’d be great.
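For the rehydration step I'm imagining something along these lines (a sketch only; TraceHeaders is an illustrative helper name, and it assumes the upstream producer injected the span context as plain text-map headers):

    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;
    import io.opentracing.SpanContext;
    import io.opentracing.Tracer;
    import io.opentracing.propagation.Format;
    import io.opentracing.propagation.TextMapAdapter;
    import org.apache.kafka.common.header.Header;
    import org.apache.kafka.common.header.Headers;

    public class TraceHeaders {
      // Rebuild the parent SpanContext from the headers that KafkaIO exposes on a
      // KafkaRecord, by copying them into a text map and letting the tracer extract it.
      public static SpanContext extract(Tracer tracer, Headers headers) {
        Map<String, String> carrier = new HashMap<>();
        for (Header header : headers) {
          carrier.put(header.key(), new String(header.value(), StandardCharsets.UTF_8));
        }
        return tracer.extract(Format.Builtin.TEXT_MAP, new TextMapAdapter(carrier));
      }
    }

Any child spans created downstream could then use asChildOf() against the extracted context.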


> On Apr 17, 2020, at 12:47 PM, Alexey Romanenko <ar...@gmail.com> wrote:
> 
> Hi Rion,
> 
> In general, yes, it sounds reasonable to me. I just do not see why you need to have extra Traceable wrapper? Do you need to keep some temporary information there that you don’t want to store in Kafka record headers? 
> 
> PS: Now I started to think that we probably have to change an interface of KafkaIO.writeRecords() from ProducerRecord to the same KafkaRecord as we use for read. In this case we won’t expose Kafka API and use only own wrapper. 
> Also, user won’t need to convert types between Read and Write (like in this topic case).
> 
>> On 17 Apr 2020, at 19:28, Rion Williams <ri...@gmail.com> wrote:
>> 
>> Hi Alexey,
>> 
>> So this is currently the approach that I'm taking. Basically creating a wrapper Traceable<K,V> class that will contain all of my record information as well as the data necessary to update the traces for that record. It requires an extra step and will likely mean persisting something along side each record as it comes in, but I'm not sure if there's another way around it.
>> 
>> My current approach goes something like this:
>> - Read records via KafkaIO (with metadata)
>> - Apply a transform to convert all KafkaRecord<K,V> into Traceable<K,V> instances (which just contain a Tracer object as well as the original KV<K,V> record itself)
>> - Pass this Traceable through all of the appropriate transforms, creating new spans for the trace as necessary via the tracer element on the Traceable<K,V> object.
>> - Prior to output to a Kafka topic, transform the Traceable<K, V> object in to a ProducerRecord<K,V> that contains the key, value, and headers (from the tracer) prior to writing back to Kafka
>> 
>> I think this will work, but it'll likely take quite a bit of experimentation to verify. Does this sound reasonable?
>> 
>> Thanks,
>> 
>> Rion
>> 
>>> On 2020/04/17 17:14:58, Alexey Romanenko <ar...@gmail.com> wrote: 
>>> Not sure if it will help, but KafkaIO allows to keep all meta information while reading (using KafkaRecord) and writing (using ProducerRecord). 
>>> So, you can keep your tracing id in the record headers as you did with Kafka Streams. 
>>> 
>>>> On 17 Apr 2020, at 18:58, Rion Williams <ri...@gmail.com> wrote:
>>>> 
>>>> Hi Alex,
>>>> 
>>>> As mentioned before, I'm in the process of migrating a pipeline of several Kafka Streams applications over to Apache Beam and I'm hoping to leverage the tracing infrastructure that I had established using Jaeger whenever I can, but specifically to trace an element as it flows through a pipeline or potentially multiple pipelines.
>>>> 
>>>> An example might go something like this:
>>>> 
>>>> - An event is produced from some service and sent to a Kafka Topic (with a tracing id in the headers)
>>>> - The event enters my pipeline (Beam reads from that topic) and begins applying a series of transforms that evaluate the element itself (e.g. does it have any users associated with it, IP addresses, other interesting information).
>>>> - When interesting information is encountered on the element (or errors), I'd like to be able to associate them with the trace (e.g. a user was found, this is some information about the user, this is the unique identifier associated with them, or there was an error because the user had a malformed e-mail address)
>>>> - The traces themselves would be cumulative, so if an event was processed through one pipeline, it would contain all the necessary tracing headers in the message so if another pipeline picked it up from its destination topic (e.g. the destination of the first pipeline), the trace could be continued.
>>>> 
>>>> I think that being able to pick up interactive systems would be a nice to have (e.g. this record is being sent to Parquet, Mongo, or some other topic), but I'm just trying to focus on being able to add to the trace at the ParDo/element level for now.
>>>> 
>>>> I hope that helps.
>>>> 
>>>> Rion
>>>> 
>>>> On 2020/04/17 16:30:14, Alex Van Boxel <al...@vanboxel.be> wrote: 
>>>>> Can you explain a bit more of what you want to achieve here?
>>>>> 
>>>>> Do you want to trace how your elements go to the pipeline or do you want to
>>>>> see how every ParDo interacts with external systems?
>>>>> 
>>>>> On Fri, Apr 17, 2020, 17:38 Rion Williams <ri...@gmail.com> wrote:
>>>>> 
>>>>>> Hi all,
>>>>>> 
>>>>>> I'm reaching out today to inquire if Apache Beam has any support or
>>>>>> mechanisms to support some type of distributed tracing similar to something
>>>>>> like Jaeger (https://www.jaegertracing.io/). Jaeger itself has a Java
>>>>>> SDK, however due to the nature of Beam working with transforms that yield
>>>>>> immutable collections, I wasn't sure what the best avenue would be to
>>>>>> correlate various transforms against a particular element would be?
>>>>>> 
>>>>>> Coming from a Kafka Streams background, this process was pretty trivial as
>>>>>> I'd simply store my correlation identifier within the message headers and
>>>>>> those would be persisted as the element traveled through Kafka into various
>>>>>> applications and topics. I'm hoping to still leverage some of that in Beam
>>>>>> if at all possible or see what, if any, recommended approaches there are
>>>>>> out there.
>>>>>> 
>>>>>> My current approach involves the creation of a "Tracing Context" which
>>>>>> would just be a wrapper for each of my elements that had their own
>>>>>> associated trace with them and instead of just passing around a
>>>>>> PCollection<X> I would use a PCollection<Tracable<X>> that would just be a
>>>>>> wrapper for the tracer and the underlying element so that I could access
>>>>>> the tracer during any element-wise operations in the pipeline.
>>>>>> 
>>>>>> Any recommendations or suggestions are more than welcome! I'm very new to
>>>>>> the Beam ecosystem, so I'd love to leverage anything out there that might
>>>>>> help me from reinventing the wheel.
>>>>>> 
>>>>>> Thanks much!
>>>>>> 
>>>>>> Rion
>>>>>> 
>>>>> 
>>> 
>>> 
> 

Re: Distributed Tracing in Apache Beam

Posted by Alexey Romanenko <ar...@gmail.com>.
Hi Rion,

In general, yes, it sounds reasonable to me. I just do not see why you need the extra Traceable wrapper. Do you need to keep some temporary information there that you don’t want to store in the Kafka record headers? 

PS: Now I’m starting to think that we probably have to change the interface of KafkaIO.writeRecords() from ProducerRecord to the same KafkaRecord that we use for reads. In that case we wouldn’t expose the Kafka API and would use only our own wrapper. 
Also, users wouldn’t need to convert types between Read and Write (as in this thread’s case).

> On 17 Apr 2020, at 19:28, Rion Williams <ri...@gmail.com> wrote:
> 
> Hi Alexey,
> 
> So this is currently the approach that I'm taking. Basically creating a wrapper Traceable<K,V> class that will contain all of my record information as well as the data necessary to update the traces for that record. It requires an extra step and will likely mean persisting something along side each record as it comes in, but I'm not sure if there's another way around it.
> 
> My current approach goes something like this:
> - Read records via KafkaIO (with metadata)
> - Apply a transform to convert all KafkaRecord<K,V> into Traceable<K,V> instances (which just contain a Tracer object as well as the original KV<K,V> record itself)
> - Pass this Traceable through all of the appropriate transforms, creating new spans for the trace as necessary via the tracer element on the Traceable<K,V> object.
> - Prior to output to a Kafka topic, transform the Traceable<K, V> object in to a ProducerRecord<K,V> that contains the key, value, and headers (from the tracer) prior to writing back to Kafka
> 
> I think this will work, but it'll likely take quite a bit of experimentation to verify. Does this sound reasonable?
> 
> Thanks,
> 
> Rion
> 
> On 2020/04/17 17:14:58, Alexey Romanenko <ar...@gmail.com> wrote: 
>> Not sure if it will help, but KafkaIO allows to keep all meta information while reading (using KafkaRecord) and writing (using ProducerRecord). 
>> So, you can keep your tracing id in the record headers as you did with Kafka Streams. 
>> 
>>> On 17 Apr 2020, at 18:58, Rion Williams <ri...@gmail.com> wrote:
>>> 
>>> Hi Alex,
>>> 
>>> As mentioned before, I'm in the process of migrating a pipeline of several Kafka Streams applications over to Apache Beam and I'm hoping to leverage the tracing infrastructure that I had established using Jaeger whenever I can, but specifically to trace an element as it flows through a pipeline or potentially multiple pipelines.
>>> 
>>> An example might go something like this:
>>> 
>>> - An event is produced from some service and sent to a Kafka Topic (with a tracing id in the headers)
>>> - The event enters my pipeline (Beam reads from that topic) and begins applying a series of transforms that evaluate the element itself (e.g. does it have any users associated with it, IP addresses, other interesting information).
>>> - When interesting information is encountered on the element (or errors), I'd like to be able to associate them with the trace (e.g. a user was found, this is some information about the user, this is the unique identifier associated with them, or there was an error because the user had a malformed e-mail address)
>>> - The traces themselves would be cumulative, so if an event was processed through one pipeline, it would contain all the necessary tracing headers in the message so if another pipeline picked it up from its destination topic (e.g. the destination of the first pipeline), the trace could be continued.
>>> 
>>> I think that being able to pick up interactive systems would be a nice to have (e.g. this record is being sent to Parquet, Mongo, or some other topic), but I'm just trying to focus on being able to add to the trace at the ParDo/element level for now.
>>> 
>>> I hope that helps.
>>> 
>>> Rion
>>> 
>>> On 2020/04/17 16:30:14, Alex Van Boxel <al...@vanboxel.be> wrote: 
>>>> Can you explain a bit more of what you want to achieve here?
>>>> 
>>>> Do you want to trace how your elements go to the pipeline or do you want to
>>>> see how every ParDo interacts with external systems?
>>>> 
>>>> On Fri, Apr 17, 2020, 17:38 Rion Williams <ri...@gmail.com> wrote:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> I'm reaching out today to inquire if Apache Beam has any support or
>>>>> mechanisms to support some type of distributed tracing similar to something
>>>>> like Jaeger (https://www.jaegertracing.io/). Jaeger itself has a Java
>>>>> SDK, however due to the nature of Beam working with transforms that yield
>>>>> immutable collections, I wasn't sure what the best avenue would be to
>>>>> correlate various transforms against a particular element would be?
>>>>> 
>>>>> Coming from a Kafka Streams background, this process was pretty trivial as
>>>>> I'd simply store my correlation identifier within the message headers and
>>>>> those would be persisted as the element traveled through Kafka into various
>>>>> applications and topics. I'm hoping to still leverage some of that in Beam
>>>>> if at all possible or see what, if any, recommended approaches there are
>>>>> out there.
>>>>> 
>>>>> My current approach involves the creation of a "Tracing Context" which
>>>>> would just be a wrapper for each of my elements that had their own
>>>>> associated trace with them and instead of just passing around a
>>>>> PCollection<X> I would use a PCollection<Tracable<X>> that would just be a
>>>>> wrapper for the tracer and the underlying element so that I could access
>>>>> the tracer during any element-wise operations in the pipeline.
>>>>> 
>>>>> Any recommendations or suggestions are more than welcome! I'm very new to
>>>>> the Beam ecosystem, so I'd love to leverage anything out there that might
>>>>> help me from reinventing the wheel.
>>>>> 
>>>>> Thanks much!
>>>>> 
>>>>> Rion
>>>>> 
>>>> 
>> 
>> 


Re: Distributed Tracing in Apache Beam

Posted by Rion Williams <ri...@gmail.com>.
Hi Alexey,

So this is currently the approach that I'm taking: basically creating a wrapper Traceable<K,V> class that will contain all of my record information as well as the data necessary to update the traces for that record. It requires an extra step and will likely mean persisting something alongside each record as it comes in, but I'm not sure there's another way around it.

My current approach goes something like this:
- Read records via KafkaIO (with metadata)
- Apply a transform to convert all KafkaRecord<K,V> into Traceable<K,V> instances (which just contain a Tracer object as well as the original KV<K,V> record itself)
- Pass this Traceable through all of the appropriate transforms, creating new spans for the trace as necessary via the tracer element on the Traceable<K,V> object.
- Prior to output, transform the Traceable<K, V> object into a ProducerRecord<K,V> that contains the key, value, and headers (from the tracer) before writing back to the Kafka topic (a rough sketch of this flow follows below)
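Something like this end to end (a sketch only: topic names and bootstrap servers are placeholders, the element-wise transforms and the Traceable conversion are elided, and depending on the Beam version the ProducerRecord step may need its coder set explicitly):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.io.kafka.KafkaRecord;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class TracedPipelineSkeleton {
      public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        pipeline
            // 1. Read with metadata so the headers (and the trace id inside them) survive.
            .apply(KafkaIO.<String, String>read()
                .withBootstrapServers("kafka:9092")
                .withTopic("input-topic")
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class))
            // 2. The element-wise transforms would sit here; this step just shows copying
            //    the key, value, and headers across to the output record type.
            .apply(ParDo.of(
                new DoFn<KafkaRecord<String, String>, ProducerRecord<String, String>>() {
                  @ProcessElement
                  public void process(ProcessContext c) {
                    KafkaRecord<String, String> record = c.element();
                    c.output(new ProducerRecord<>(
                        "output-topic", null,
                        record.getKV().getKey(), record.getKV().getValue(),
                        record.getHeaders()));
                  }
                }))
            // 3. Write the ProducerRecords back out, headers included.
            .apply(KafkaIO.<String, String>writeRecords()
                .withBootstrapServers("kafka:9092")
                .withTopic("output-topic")
                .withKeySerializer(StringSerializer.class)
                .withValueSerializer(StringSerializer.class));

        pipeline.run();
      }
    }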

I think this will work, but it'll likely take quite a bit of experimentation to verify. Does this sound reasonable?

Thanks,

Rion

On 2020/04/17 17:14:58, Alexey Romanenko <ar...@gmail.com> wrote: 
> Not sure if it will help, but KafkaIO allows to keep all meta information while reading (using KafkaRecord) and writing (using ProducerRecord). 
> So, you can keep your tracing id in the record headers as you did with Kafka Streams. 
> 
> > On 17 Apr 2020, at 18:58, Rion Williams <ri...@gmail.com> wrote:
> > 
> > Hi Alex,
> > 
> > As mentioned before, I'm in the process of migrating a pipeline of several Kafka Streams applications over to Apache Beam and I'm hoping to leverage the tracing infrastructure that I had established using Jaeger whenever I can, but specifically to trace an element as it flows through a pipeline or potentially multiple pipelines.
> > 
> > An example might go something like this:
> > 
> > - An event is produced from some service and sent to a Kafka Topic (with a tracing id in the headers)
> > - The event enters my pipeline (Beam reads from that topic) and begins applying a series of transforms that evaluate the element itself (e.g. does it have any users associated with it, IP addresses, other interesting information).
> > - When interesting information is encountered on the element (or errors), I'd like to be able to associate them with the trace (e.g. a user was found, this is some information about the user, this is the unique identifier associated with them, or there was an error because the user had a malformed e-mail address)
> > - The traces themselves would be cumulative, so if an event was processed through one pipeline, it would contain all the necessary tracing headers in the message so if another pipeline picked it up from its destination topic (e.g. the destination of the first pipeline), the trace could be continued.
> > 
> > I think that being able to pick up interactive systems would be a nice to have (e.g. this record is being sent to Parquet, Mongo, or some other topic), but I'm just trying to focus on being able to add to the trace at the ParDo/element level for now.
> > 
> > I hope that helps.
> > 
> > Rion
> > 
> > On 2020/04/17 16:30:14, Alex Van Boxel <al...@vanboxel.be> wrote: 
> >> Can you explain a bit more of what you want to achieve here?
> >> 
> >> Do you want to trace how your elements go to the pipeline or do you want to
> >> see how every ParDo interacts with external systems?
> >> 
> >> On Fri, Apr 17, 2020, 17:38 Rion Williams <ri...@gmail.com> wrote:
> >> 
> >>> Hi all,
> >>> 
> >>> I'm reaching out today to inquire if Apache Beam has any support or
> >>> mechanisms to support some type of distributed tracing similar to something
> >>> like Jaeger (https://www.jaegertracing.io/). Jaeger itself has a Java
> >>> SDK, however due to the nature of Beam working with transforms that yield
> >>> immutable collections, I wasn't sure what the best avenue would be to
> >>> correlate various transforms against a particular element would be?
> >>> 
> >>> Coming from a Kafka Streams background, this process was pretty trivial as
> >>> I'd simply store my correlation identifier within the message headers and
> >>> those would be persisted as the element traveled through Kafka into various
> >>> applications and topics. I'm hoping to still leverage some of that in Beam
> >>> if at all possible or see what, if any, recommended approaches there are
> >>> out there.
> >>> 
> >>> My current approach involves the creation of a "Tracing Context" which
> >>> would just be a wrapper for each of my elements that had their own
> >>> associated trace with them and instead of just passing around a
> >>> PCollection<X> I would use a PCollection<Tracable<X>> that would just be a
> >>> wrapper for the tracer and the underlying element so that I could access
> >>> the tracer during any element-wise operations in the pipeline.
> >>> 
> >>> Any recommendations or suggestions are more than welcome! I'm very new to
> >>> the Beam ecosystem, so I'd love to leverage anything out there that might
> >>> help me from reinventing the wheel.
> >>> 
> >>> Thanks much!
> >>> 
> >>> Rion
> >>> 
> >> 
> 
> 

Re: Distributed Tracing in Apache Beam

Posted by Alexey Romanenko <ar...@gmail.com>.
Not sure if it will help, but KafkaIO allows you to keep all of the metadata while reading (using KafkaRecord) and writing (using ProducerRecord). 
So, you can keep your tracing id in the record headers as you did with Kafka Streams. 
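For example, something along these lines on the read side (just a sketch; "trace-id" is whatever header name your producer uses, and ExtractTraceIdFn is only an illustrative name):

    import java.nio.charset.StandardCharsets;
    import org.apache.beam.sdk.io.kafka.KafkaRecord;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;
    import org.apache.kafka.common.header.Header;

    // Pull the tracing id back off the KafkaRecord that the metadata-preserving read
    // emits, and pass it downstream next to the payload.
    class ExtractTraceIdFn extends DoFn<KafkaRecord<String, String>, KV<String, String>> {
      @ProcessElement
      public void process(ProcessContext c) {
        KafkaRecord<String, String> record = c.element();
        Header header = record.getHeaders().lastHeader("trace-id");
        String traceId =
            header == null ? "" : new String(header.value(), StandardCharsets.UTF_8);
        // Emits (traceId, value); a real pipeline would keep the original key as well.
        c.output(KV.of(traceId, record.getKV().getValue()));
      }
    }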

> On 17 Apr 2020, at 18:58, Rion Williams <ri...@gmail.com> wrote:
> 
> Hi Alex,
> 
> As mentioned before, I'm in the process of migrating a pipeline of several Kafka Streams applications over to Apache Beam and I'm hoping to leverage the tracing infrastructure that I had established using Jaeger whenever I can, but specifically to trace an element as it flows through a pipeline or potentially multiple pipelines.
> 
> An example might go something like this:
> 
> - An event is produced from some service and sent to a Kafka Topic (with a tracing id in the headers)
> - The event enters my pipeline (Beam reads from that topic) and begins applying a series of transforms that evaluate the element itself (e.g. does it have any users associated with it, IP addresses, other interesting information).
> - When interesting information is encountered on the element (or errors), I'd like to be able to associate them with the trace (e.g. a user was found, this is some information about the user, this is the unique identifier associated with them, or there was an error because the user had a malformed e-mail address)
> - The traces themselves would be cumulative, so if an event was processed through one pipeline, it would contain all the necessary tracing headers in the message so if another pipeline picked it up from its destination topic (e.g. the destination of the first pipeline), the trace could be continued.
> 
> I think that being able to pick up interactive systems would be a nice to have (e.g. this record is being sent to Parquet, Mongo, or some other topic), but I'm just trying to focus on being able to add to the trace at the ParDo/element level for now.
> 
> I hope that helps.
> 
> Rion
> 
> On 2020/04/17 16:30:14, Alex Van Boxel <al...@vanboxel.be> wrote: 
>> Can you explain a bit more of what you want to achieve here?
>> 
>> Do you want to trace how your elements go to the pipeline or do you want to
>> see how every ParDo interacts with external systems?
>> 
>> On Fri, Apr 17, 2020, 17:38 Rion Williams <ri...@gmail.com> wrote:
>> 
>>> Hi all,
>>> 
>>> I'm reaching out today to inquire if Apache Beam has any support or
>>> mechanisms to support some type of distributed tracing similar to something
>>> like Jaeger (https://www.jaegertracing.io/). Jaeger itself has a Java
>>> SDK, however due to the nature of Beam working with transforms that yield
>>> immutable collections, I wasn't sure what the best avenue would be to
>>> correlate various transforms against a particular element would be?
>>> 
>>> Coming from a Kafka Streams background, this process was pretty trivial as
>>> I'd simply store my correlation identifier within the message headers and
>>> those would be persisted as the element traveled through Kafka into various
>>> applications and topics. I'm hoping to still leverage some of that in Beam
>>> if at all possible or see what, if any, recommended approaches there are
>>> out there.
>>> 
>>> My current approach involves the creation of a "Tracing Context" which
>>> would just be a wrapper for each of my elements that had their own
>>> associated trace with them and instead of just passing around a
>>> PCollection<X> I would use a PCollection<Tracable<X>> that would just be a
>>> wrapper for the tracer and the underlying element so that I could access
>>> the tracer during any element-wise operations in the pipeline.
>>> 
>>> Any recommendations or suggestions are more than welcome! I'm very new to
>>> the Beam ecosystem, so I'd love to leverage anything out there that might
>>> help me from reinventing the wheel.
>>> 
>>> Thanks much!
>>> 
>>> Rion
>>> 
>> 


Re: Distributed Tracing in Apache Beam

Posted by Rion Williams <ri...@gmail.com>.
Hi Alex,

As mentioned before, I'm in the process of migrating a pipeline of several Kafka Streams applications over to Apache Beam and I'm hoping to leverage the tracing infrastructure that I had established using Jaeger whenever I can, but specifically to trace an element as it flows through a pipeline or potentially multiple pipelines.

An example might go something like this:

- An event is produced from some service and sent to a Kafka Topic (with a tracing id in the headers)
- The event enters my pipeline (Beam reads from that topic) and begins applying a series of transforms that evaluate the element itself (e.g. does it have any users associated with it, IP addresses, other interesting information).
- When interesting information is encountered on the element (or errors), I'd like to be able to associate them with the trace (e.g. a user was found, this is some information about the user, this is the unique identifier associated with them, or there was an error because the user had a malformed e-mail address)
- The traces themselves would be cumulative, so if an event was processed through one pipeline, it would contain all the necessary tracing headers in the message so if another pipeline picked it up from its destination topic (e.g. the destination of the first pipeline), the trace could be continued.

I think that being able to pick up interactions with external systems would be a nice-to-have (e.g. this record is being sent to Parquet, Mongo, or some other topic), but I'm just trying to focus on being able to add to the trace at the ParDo/element level for now.
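At the ParDo level I'm picturing something like this (purely illustrative; the DoFn name, span/tag names, and the idea of carrying the trace context as a Map next to the payload are all placeholders of mine, and the Jaeger tracer is assumed to be configurable from JAEGER_* environment variables):

    import java.util.Map;
    import io.jaegertracing.Configuration;
    import io.opentracing.Span;
    import io.opentracing.SpanContext;
    import io.opentracing.Tracer;
    import io.opentracing.propagation.Format;
    import io.opentracing.propagation.TextMapAdapter;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    // Each element carries its trace context as a Map<String, String> copied from the
    // incoming Kafka headers; we open a child span per element, tag what we found,
    // and finish the span before emitting the element unchanged.
    class EnrichUserFn extends DoFn<KV<Map<String, String>, String>, KV<Map<String, String>, String>> {
      private transient Tracer tracer;

      @Setup
      public void setup() {
        // Built once per worker from JAEGER_* environment variables; never serialized with the DoFn.
        tracer = Configuration.fromEnv("beam-enrichment").getTracer();
      }

      @ProcessElement
      public void process(ProcessContext c) {
        SpanContext parent =
            tracer.extract(Format.Builtin.TEXT_MAP, new TextMapAdapter(c.element().getKey()));
        Span span = tracer.buildSpan("enrich-user").asChildOf(parent).start();
        try {
          // ... the actual element-wise enrichment would happen here ...
          span.setTag("user.found", true);
          c.output(c.element());
        } catch (RuntimeException e) {
          span.setTag("error", true);
          span.log(e.getMessage());
          throw e;
        } finally {
          span.finish();
        }
      }
    }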

I hope that helps.

Rion

On 2020/04/17 16:30:14, Alex Van Boxel <al...@vanboxel.be> wrote: 
> Can you explain a bit more of what you want to achieve here?
> 
> Do you want to trace how your elements go to the pipeline or do you want to
> see how every ParDo interacts with external systems?
> 
> On Fri, Apr 17, 2020, 17:38 Rion Williams <ri...@gmail.com> wrote:
> 
> > Hi all,
> >
> > I'm reaching out today to inquire if Apache Beam has any support or
> > mechanisms to support some type of distributed tracing similar to something
> > like Jaeger (https://www.jaegertracing.io/). Jaeger itself has a Java
> > SDK, however due to the nature of Beam working with transforms that yield
> > immutable collections, I wasn't sure what the best avenue would be to
> > correlate various transforms against a particular element would be?
> >
> > Coming from a Kafka Streams background, this process was pretty trivial as
> > I'd simply store my correlation identifier within the message headers and
> > those would be persisted as the element traveled through Kafka into various
> > applications and topics. I'm hoping to still leverage some of that in Beam
> > if at all possible or see what, if any, recommended approaches there are
> > out there.
> >
> > My current approach involves the creation of a "Tracing Context" which
> > would just be a wrapper for each of my elements that had their own
> > associated trace with them and instead of just passing around a
> > PCollection<X> I would use a PCollection<Tracable<X>> that would just be a
> > wrapper for the tracer and the underlying element so that I could access
> > the tracer during any element-wise operations in the pipeline.
> >
> > Any recommendations or suggestions are more than welcome! I'm very new to
> > the Beam ecosystem, so I'd love to leverage anything out there that might
> > help me from reinventing the wheel.
> >
> > Thanks much!
> >
> > Rion
> >
> 

Re: Distributed Tracing in Apache Beam

Posted by Alex Van Boxel <al...@vanboxel.be>.
Can you explain a bit more about what you want to achieve here?

Do you want to trace how your elements go through the pipeline, or do you want
to see how every ParDo interacts with external systems?

On Fri, Apr 17, 2020, 17:38 Rion Williams <ri...@gmail.com> wrote:

> Hi all,
>
> I'm reaching out today to inquire if Apache Beam has any support or
> mechanisms to support some type of distributed tracing similar to something
> like Jaeger (https://www.jaegertracing.io/). Jaeger itself has a Java
> SDK, however due to the nature of Beam working with transforms that yield
> immutable collections, I wasn't sure what the best avenue would be to
> correlate various transforms against a particular element would be?
>
> Coming from a Kafka Streams background, this process was pretty trivial as
> I'd simply store my correlation identifier within the message headers and
> those would be persisted as the element traveled through Kafka into various
> applications and topics. I'm hoping to still leverage some of that in Beam
> if at all possible or see what, if any, recommended approaches there are
> out there.
>
> My current approach involves the creation of a "Tracing Context" which
> would just be a wrapper for each of my elements that had their own
> associated trace with them and instead of just passing around a
> PCollection<X> I would use a PCollection<Tracable<X>> that would just be a
> wrapper for the tracer and the underlying element so that I could access
> the tracer during any element-wise operations in the pipeline.
>
> Any recommendations or suggestions are more than welcome! I'm very new to
> the Beam ecosystem, so I'd love to leverage anything out there that might
> help me from reinventing the wheel.
>
> Thanks much!
>
> Rion
>