You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by Sherin Thomas <sh...@gmail.com> on 2021/02/03 23:28:02 UTC
[DISCUSS] End to End latency measurement

Hello Folks,

We have some pipelines that contain multiple hops of Flink jobs with Kafka
transport layer in between. Data is finally written to analytical stores.
We want to measure e-2-e from the first source all the way to the last
sink(that writes to the analytical stores) and possibly also at other hops
in the middle.

Here are some strategies I'm thinking about, would love your thoughts on
the various approaches:
1. *<Processing_time - Event_time>* at various hops. Each data will contain
event_time based on the when it is written to the first kafka source ->
When there are windowed aggregations and such it's tricky to translate
correct event time to the derived event. So this is tightly coupled with
user logic and hence not favorable.
2. *Latency markers introduced in the Kafka stream *that will be consumed
by the Flink jobs -> We can potentially introduce latency markers along
with regular data, this will share the same data envelope schema so it can
travel with the regular data. Operators will need to identify it and
forward it appropriately and also exclude it from aggregations and such
which makes this approach complex. Unless there is an elegant way to
piggyback on the internal Flink latency marker movement for e-2-e latency
tracking? *Would love to hear your thoughts about this.*
3. *Sum of Kafka consumer lag* across all the Kafka topics in the pipeline
- Will give us tail latencies. We would ideally love to get a histogram of
latencies across all the events.
4. *Global minimum watermark *- In this approach, I'm thinking about
periodically checking the global minimum watermark and using that to
determine tail latency - this would probably not give a good histogram of
latencies across all partitions and data. But so far this approach seems
like the easiest to generalize across all types of operators and
aggregations. But would love to hear your thoughts on the feasibility of
this.


Let me know what you think. And if there are better ways to measure
end-2-end latency that would be lovely!


Thanks,
Sherin