You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@beam.apache.org by Marcin Kuthan <ma...@gmail.com> on 2020/07/01 10:52:59 UTC

Watermark lag for GenerateSequence on Dataflow is always ~40 minutes

Hi

My processing pipeline utilizes GenerateSequence transform to query
BigQuery periodically.
It works, the query is performed exactly with rate defined by
GenerateSequence but the watermark is always ~40 minutes behind generated
timestamps.

The code looks as follows (Beam: 2.19, Scio: 0.8.4):

val sc: ScioContext = ...

val startTime = Instant.now()
val interval = Duration.standardMinutes(10)

val sequence = sc.customInput(
  GenerateSequence.from(1)
    .withRate(1, interval)
    .withTimestampFn(i => startTime.plus(interval.multipliedBy(i))))
  .withGlobalWindow(WindowOptions(
    trigger = Repeatedly.forever(AfterPane.elementCountAtLeast(1)),
    accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES))
  .withTimestamp
  .map { case (_, timestamp) => timestamp }

val finalStream = sequence.flatMap { timestamp => // load data from BQ, the
timestamps from the sequence are preserved }

I'm looking for the reason for the GenerateSequence watermark lag when the
code is running on Dataflow runner. Any clue how to debug the issue further?

Marcin