You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@beam.apache.org by Marcin Kuthan <ma...@gmail.com> on 2020/07/01 10:52:59 UTC
Watermark lag for GenerateSequence on Dataflow is always ~40 minutes
Hi
My processing pipeline utilizes GenerateSequence transform to query
BigQuery periodically.
It works, the query is performed exactly with rate defined by
GenerateSequence but the watermark is always ~40 minutes behind generated
timestamps.
The code looks as follows (Beam: 2.19, Scio: 0.8.4):
val sc: ScioContext = ...
val startTime = Instant.now()
val interval = Duration.standardMinutes(10)
val sequence = sc.customInput(
GenerateSequence.from(1)
.withRate(1, interval)
.withTimestampFn(i => startTime.plus(interval.multipliedBy(i))))
.withGlobalWindow(WindowOptions(
trigger = Repeatedly.forever(AfterPane.elementCountAtLeast(1)),
accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES))
.withTimestamp
.map { case (_, timestamp) => timestamp }
val finalStream = sequence.flatMap { timestamp => // load data from BQ, the
timestamps from the sequence are preserved }
I'm looking for the reason for the GenerateSequence watermark lag when the
code is running on Dataflow runner. Any clue how to debug the issue further?
Marcin