Posted to dev@beam.apache.org by Shen Li <cs...@gmail.com> on 2018/03/06 21:06:39 UTC

When should ParDo advance output watermarks?

Hi,

Should ParDo advance its output watermark based on only the main input or on
all inputs? Say the watermark from a side input falls behind: should it
block the output watermark of the ParDo?

If there are pushed back elements, should the ParDo hold back its output
watermarks until corresponding pushed back elements are all processed?

Thanks,
Shen

Re: When should ParDo advance output watermarks?

Posted by Shen Li <cs...@gmail.com>.
Hi Kenn,

Thank you!

Shen

On Tue, Mar 6, 2018 at 5:21 PM, Kenneth Knowles <kl...@google.com> wrote:

> On Tue, Mar 6, 2018 at 1:06 PM Shen Li <cs...@gmail.com> wrote:
>
>> Hi,
>>
>> Should ParDo advance its output watermark based on only the main input or on
>> all inputs? Say the watermark from a side input falls behind: should it
>> block the output watermark of the ParDo?
>>
>
> The rule is that if the user's DoFn might output data with a timestamp,
> that timestamp should be a bound on the output watermark. For side inputs,
> I don't think this is the case: the side input's watermark does not bound
> the output watermark directly. Instead, the readiness of the side input plus
> the info in the WindowMappingFn will determine which main input elements
> must be pushed back, and those elements will bound the output watermark.
>
> The exception to the rule is that if data is behind the watermark, it is
> "already late", so it is OK to let the watermark advance because that doesn't
> make the data "more late". Instead, apply all the same holding rules to GC
> time so the data doesn't become droppable. The reason for this is that a
> large influx of late data could otherwise cause a backlog that prevents more
> recent data from achieving good latency.
>
>> If there are pushed back elements, should the ParDo hold back its output
>> watermarks until corresponding pushed back elements are all processed?
>>
>
> Yes, it should hold the watermark for these.
>
> Kenn
>
>
>>
>> Thanks,
>> Shen
>>
>

Re: When should ParDo advance output watermarks?

Posted by Kenneth Knowles <kl...@google.com>.
On Tue, Mar 6, 2018 at 1:06 PM Shen Li <cs...@gmail.com> wrote:

> Hi,
>
> Should ParDo advance its output watermark based on only the main input or on
> all inputs? Say the watermark from a side input falls behind: should it
> block the output watermark of the ParDo?
>

The rule is that if the user's DoFn might output data with a timestamp,
that timestamp should be a bound on the output watermark. For side inputs,
I don't think this is the case: the side input's watermark does not bound
the output watermark directly. Instead, the readiness of the side input plus
the info in the WindowMappingFn will determine which main input elements
must be pushed back, and those elements will bound the output watermark.
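
A minimal sketch of this rule, assuming the DoFn outputs at the timestamp of
its input element and that timestamps are plain millisecond values. The class
and method names are hypothetical and are not Beam runner APIs. The side
input's watermark does not appear as an argument: it affects the result only
through the set of main input elements that end up pushed back.

```java
import java.util.List;

// Hypothetical sketch of the holding rule; not Beam runner code.
final class OutputWatermarkSketch {

    // The output watermark is bounded by the main input watermark and by the
    // earliest timestamp among pushed-back main input elements, because each
    // of those elements may still produce output at its own timestamp.
    static long outputWatermark(long mainInputWatermark, List<Long> pushedBackTimestamps) {
        long hold = Long.MAX_VALUE; // no hold when nothing is pushed back
        for (long ts : pushedBackTimestamps) {
            hold = Math.min(hold, ts);
        }
        return Math.min(mainInputWatermark, hold);
    }

    public static void main(String[] args) {
        // Two elements are waiting on the side input; the earliest one holds the watermark.
        System.out.println(outputWatermark(100L, List.of(40L, 70L))); // prints 40
        // Nothing is pushed back, so the output watermark follows the main input.
        System.out.println(outputWatermark(100L, List.of())); // prints 100
    }
}
```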

The exception to the rule is that if data is behind the watermark, it is
"already late", so it is OK to let the watermark advance because that doesn't
make the data "more late". Instead, apply all the same holding rules to GC
time so the data doesn't become droppable. The reason for this is that a
large influx of late data could otherwise cause a backlog that prevents more
recent data from achieving good latency.
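
A minimal sketch of this exception, assuming "behind the watermark" means a
timestamp strictly earlier than the current output watermark. The names are
hypothetical and are not Beam APIs; the sketch only classifies which kind of
hold an element would need, not how a runner computes GC time.

```java
// Hypothetical sketch of the late-data exception; not Beam runner code.
final class LateDataHoldSketch {

    // Which quantity a pushed-back element should hold.
    enum Hold { OUTPUT_WATERMARK, GC_TIME }

    // An element already behind the output watermark cannot become "more late",
    // so it does not hold the watermark. It holds GC time instead, so that it
    // does not become droppable before it is processed.
    static Hold holdFor(long elementTimestamp, long outputWatermark) {
        return elementTimestamp < outputWatermark ? Hold.GC_TIME : Hold.OUTPUT_WATERMARK;
    }

    public static void main(String[] args) {
        System.out.println(holdFor(120L, 100L)); // prints OUTPUT_WATERMARK
        System.out.println(holdFor(80L, 100L));  // prints GC_TIME
    }
}
```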

> If there are pushed back elements, should the ParDo hold back its output
> watermarks until corresponding pushed back elements are all processed?
>

Yes, it should hold the watermark for these.

Kenn


>
> Thanks,
> Shen
>