You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Shen Li <cs...@gmail.com> on 2017/04/28 17:09:49 UTC

How to control watermark when using BoundedSource

Hi,

Say I want to replay a data trace of last week using fixed windows. The
data trace is read from a file using TextIO. In order to trigger windows at
right times, how can I control the watermark emitted by the BoundedSource?

Thanks,

Shen

Re: How to control watermark when using BoundedSource

Posted by Thomas Groh <tg...@google.com.INVALID>.
This is very much runner dependent, but generally the gradual output will
not occur. The timestamps of elements within the pipeline should be used to
describe when the event the element represents occurs, and a watermark
describes when all of the data that occurred before a timestamp have
arrived. As a result, the watermark advances based on the data you have
read, not the movement of "wall clock time", which means that as processing
speeds up, you can have elements emitted significantly faster than you
would if the processing occurred instantaneously in real time - once you
have all of the data, you know you cannot receive more data - so as soon as
you've processed everything you have, you're done (and can emit all of your
output).

For the latter - the runner can read the entirety of the bounded source,
which permits it to advance the watermark of downstream transforms to the
timestamp of the earliest unprocessed element. That in turn permits it to
emit output for a window when it knows it has processed the entirety of the
data for that window (because all of the elements have been read, the
source's watermark is at positive infinity, and does not hold back any
downstream computation). It is not, however, required to do this - the
runner can buffer as much as it thinks is "reasonable" - which permits some
optimizations when processing only bounded data.

On Fri, Apr 28, 2017 at 10:41 AM, Shen Li <cs...@gmail.com> wrote:

> Hi Thomas,
>
> Thanks for the explanation. Does it mean I cannot reproduce the real-time
> behavior of the replayed trace?  Say the watermarks are perfect and
> FixedWindows groups elements into 1-minute windows, will the watermarks
> trigger the FixedWindows to fire roughly every minute?
>
> I am a little confused about the "when available" behavior of the runner.
> Since the watermarks emitted by the BoundedSource will always be
> BoundedWindow.TIMESTAMP_MIN_VALUE except for the last watermark, how could
> the runner know when to trigger the computation on a window?
>
> Thanks,
>
> Shen
>
> On Fri, Apr 28, 2017 at 1:13 PM, Thomas Groh <tg...@google.com.invalid>
> wrote:
>
> > You can't directly control the watermark that a BoundedSource emits.
> > Windowing into FixedWindows will still work as you expect, however: your
> > elements will be assigned to their windows based on the time the event
> > occurred. Depending on the runner, triggers may be run either "when
> > available" or after all the work is completed, but your output data will
> be
> > as if you had a perfect watermark.
> >
> > On Fri, Apr 28, 2017 at 10:09 AM, Shen Li <cs...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Say I want to replay a data trace of last week using fixed windows. The
> > > data trace is read from a file using TextIO. In order to trigger
> windows
> > at
> > > right times, how can I control the watermark emitted by the
> > BoundedSource?
> > >
> > > Thanks,
> > >
> > > Shen
> > >
> >
>

Re: How to control watermark when using BoundedSource

Posted by Shen Li <cs...@gmail.com>.
Hi Thomas,

Thanks for the explanation. Does it mean I cannot reproduce the real-time
behavior of the replayed trace?  Say the watermarks are perfect and
FixedWindows groups elements into 1-minute windows, will the watermarks
trigger the FixedWindows to fire roughly every minute?

I am a little confused about the "when available" behavior of the runner.
Since the watermarks emitted by the BoundedSource will always be
BoundedWindow.TIMESTAMP_MIN_VALUE except for the last watermark, how could
the runner know when to trigger the computation on a window?

Thanks,

Shen

On Fri, Apr 28, 2017 at 1:13 PM, Thomas Groh <tg...@google.com.invalid>
wrote:

> You can't directly control the watermark that a BoundedSource emits.
> Windowing into FixedWindows will still work as you expect, however: your
> elements will be assigned to their windows based on the time the event
> occurred. Depending on the runner, triggers may be run either "when
> available" or after all the work is completed, but your output data will be
> as if you had a perfect watermark.
>
> On Fri, Apr 28, 2017 at 10:09 AM, Shen Li <cs...@gmail.com> wrote:
>
> > Hi,
> >
> > Say I want to replay a data trace of last week using fixed windows. The
> > data trace is read from a file using TextIO. In order to trigger windows
> at
> > right times, how can I control the watermark emitted by the
> BoundedSource?
> >
> > Thanks,
> >
> > Shen
> >
>

Re: How to control watermark when using BoundedSource

Posted by Thomas Groh <tg...@google.com.INVALID>.
You can't directly control the watermark that a BoundedSource emits.
Windowing into FixedWindows will still work as you expect, however: your
elements will be assigned to their windows based on the time the event
occurred. Depending on the runner, triggers may be run either "when
available" or after all the work is completed, but your output data will be
as if you had a perfect watermark.

On Fri, Apr 28, 2017 at 10:09 AM, Shen Li <cs...@gmail.com> wrote:

> Hi,
>
> Say I want to replay a data trace of last week using fixed windows. The
> data trace is read from a file using TextIO. In order to trigger windows at
> right times, how can I control the watermark emitted by the BoundedSource?
>
> Thanks,
>
> Shen
>