Posted to user@flink.apache.org by Maxwell Pospischil <ma...@cloudkitchens.com> on 2022/10/31 23:34:04 UTC

Passive lineage collection?

Hey all,

We've got a number of Flink jobs deployed, and I'm wrestling with how best
to collect data lineage information from them. Our most common connector is
Kafka, as either a source or a sink. For the jobs that use the DataStream
API, I've found that I can subclass the Kafka record de/serializers and
collect topics off of the messages on the way in/out, but it's significantly
more cumbersome to do the same for the jobs that go through the SQL
connector.

I've taken a step back from that approach and am wondering whether there's
a passive way to collect this information by observing the Flink job
externally. I've found my source topics sprinkled throughout the logs, the
checkpoint files, and the metric names on the job graph, so if source topics
were all I was worried about, I think I could collect them from any of
those places. I'm having a hard time finding sink topics by any passive
means, though. Is it possible to get that information anywhere?
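
To make the metric-name route concrete for the source side, here is a rough
sketch of what I have in mind. It assumes the metric names contain a segment
shaped like KafkaSourceReader.topic.<topic>.partition.<n>.<metric>, which may
differ between Flink and connector versions. The function name and the sample
metric names are made up for illustration; in practice the names could be
pulled from the REST API's per-vertex metrics listing.

```python
import re

# Assumed metric-name shape (not verified for every Flink/connector version):
#   <prefix>.KafkaSourceReader.topic.<topic>.partition.<n>.<metric>
# The non-greedy group lets topic names contain dots (e.g. "orders.v1").
TOPIC_PATTERN = re.compile(
    r"KafkaSourceReader\.topic\.(?P<topic>.+?)\.partition\.\d+\."
)


def source_topics(metric_names):
    """Return the sorted, de-duplicated topic names found in metric names."""
    topics = set()
    for name in metric_names:
        match = TOPIC_PATTERN.search(name)
        if match:
            topics.add(match.group("topic"))
    return sorted(topics)


if __name__ == "__main__":
    # Hypothetical metric names, for illustration only.
    sample = [
        "0.Source__orders.KafkaSourceReader.topic.orders.v1.partition.0.currentOffset",
        "0.Source__orders.KafkaSourceReader.topic.orders.v1.partition.1.currentOffset",
        "0.Source__orders.numRecordsIn",
    ]
    print(source_topics(sample))
```

This only covers sources. I haven't found an equivalent topic-bearing metric
name on the sink side, which is the gap I'm asking about.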

Thanks!