Posted to dev@beam.apache.org by Rui Wang <ru...@google.com> on 2019/01/17 18:43:04 UTC

Confusing sentence in Windowing section in Beam programming guide

Hi Community,

In the Beam programming guide [1], there is a sentence: "Data that arrives with
a timestamp after the watermark is considered *late data*"

People seem to get confused by it; for example, see the Stack Overflow
comment [2]. It makes people think that an event timestamp that is
greater than the watermark is considered late (because of that "after").

Although there is an example right after this sentence that explains late data,
the sentence seems incomplete to me. A more complete version
could be: "The watermark consistently advances from -inf to +inf. Data that
arrives with a timestamp after the watermark is considered late data."

Am I understanding this correctly? Is there a better way to describe the
ordering of late data and the watermark? I would be happy to send a PR to
update the Beam documentation.

-Rui

[1]: https://beam.apache.org/documentation/programming-guide/#windowing
[2]:
https://stackoverflow.com/questions/54141352/dataflow-to-process-late-and-out-of-order-data-for-batch-and-stream-messages/54188971?noredirect=1#comment95302476_54188971

Re: Confusing sentence in Windowing section in Beam programming guide

Posted by Rui Wang <ru...@google.com>.
Thanks all! The documentation has been updated.

-Rui

On Tue, Jan 22, 2019 at 12:24 PM Reuven Lax <re...@google.com> wrote:

> Ah yes, Kenn is correct, and I forgot we made that change.
>
> To clarify - Beam does not expose late elements as a concept; rather, it
> exposes late panes in its triggering API. The reason we made the change was
> not just that we wanted to include as much data as possible, but also
> that we wanted to preserve reasonable invariants around late data (e.g.
> one invariant should be that the result of processing an on-time pane
> should not produce late data). Simply checking element timestamps against
> the watermark is nondeterministic (since the watermark advances
> asynchronously), so we moved to a different definition of late data.
>
> Reuven
>
> On Fri, Jan 18, 2019 at 11:24 AM Kenneth Knowles <kl...@google.com> wrote:
>
>> That is correct. For the global window there is no such thing as late data.
>>
>> Kenn
>>
>> On Fri, Jan 18, 2019, 11:13 Ruoyun Huang <ruoyun@google.com wrote:
>>
>>> Very helpful discussion (and the fixing PR).
>>>
>>> To make sure my takeaway is correct: the status quo is a) "for a global
>>> window, there is *no possible scenario* where data is identified
>>> as late", rather than b) "for a global window we *no longer* compare
>>> against the watermark to identify late data, but *there are still other
>>> criteria* that determine whether data is late".
>>>
>>> a) is correct and b) is not.  Is that so?
>>>
>>> On Thu, Jan 17, 2019 at 8:57 PM Kenneth Knowles <kl...@google.com> wrote:
>>>
>>>> Actually, Reuven, that's no longer the case.
>>>>
>>>> It used to be that incoming data was compared to the watermark, but it
>>>> is not today. Instead, Jeff's first phrasing is perfect.
>>>>
>>>> One way to see it is to think about the consequences of late
>>>> data: if there is a grouping/aggregation by key+window, the window
>>>> determines when the grouping is complete. We go ahead and include any data
>>>> that shows up before the window is complete. And if you set up allowed
>>>> lateness it matches exactly: any data that arrives before the ON_TIME
>>>> output gets to be in that output.
>>>>
>>>> Previously, when we compared incoming elements to the watermark
>>>> directly, you could have a window that was still being aggregated but the
>>>> elements that fell in the window were dropped. There was no technical
>>>> benefit to losing this data, so we stopped dropping it. We also had lots of
>>>> tricky bugs and hard-to-manage code related to what to do if an element
>>>> arrives after the watermark. And you could have an ON_TIME firing that
>>>> included a bunch of "late" data, which is confusing.
>>>>
>>>> Now it is simple: if the window is still alive, the element goes into
>>>> it.
>>>>
>>>> I very rarely use the term "late data" when describing Beam's semantics
>>>> anyhow. I always found the term / definition a bit arbitrary.
>>>>
>>>> Kenn
>>>>
>>>> On Thu, Jan 17, 2019 at 8:13 PM Rui Wang <ru...@google.com> wrote:
>>>>
>>>>> I created this PR: https://github.com/apache/beam/pull/7556
>>>>>
>>>>> Feel free to review or comment on it.
>>>>>
>>>>> -Rui
>>>>>
>>>>> On Thu, Jan 17, 2019 at 2:37 PM Rui Wang <ru...@google.com> wrote:
>>>>>
>>>>>> It might be better to keep something like "the watermark usually
>>>>>> moves forward consistently". But "Elements that arrive with a smaller
>>>>>> timestamp than the current watermark are considered late data." already
>>>>>> gives the ordering of the late data's timestamp and the watermark.
>>>>>>
>>>>>>
>>>>>> -Rui
>>>>>>
>>>>>> On Thu, Jan 17, 2019 at 1:39 PM Jeff Klukas <jk...@mozilla.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Reuven - I don't think I realized it was possible to have late data
>>>>>>> with the global window, so I'm definitely learning things through this
>>>>>>> discussion.
>>>>>>>
>>>>>>> New suggested wording, then:
>>>>>>>
>>>>>>>     Elements that arrive with a smaller timestamp than the current
>>>>>>>     watermark are considered late data.
>>>>>>>
>>>>>>> That says basically the same thing as the wording currently in the
>>>>>>> guide, but uses "smaller" (which implies a less-than-watermark comparison)
>>>>>>> rather than "after" (which folks have interpreted as a
>>>>>>> greater-than-watermark comparison).
>>>>>>>
>>>>>>> On Thu, Jan 17, 2019 at 3:40 PM Reuven Lax <re...@google.com> wrote:
>>>>>>>
>>>>>>>> Though it's not tied to the window. You could be in the global window,
>>>>>>>> so the watermark never advances past the end of the window, yet still get
>>>>>>>> late data.
>>>>>>>>
>>>>>>>> On Thu, Jan 17, 2019, 11:14 AM Jeff Klukas <jklukas@mozilla.com
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> How about: "Once the watermark progresses past the end of a
>>>>>>>>> window, any further elements that arrive with a timestamp in that window
>>>>>>>>> are considered late data."
>>>
>>> --
>>> ================
>>> Ruoyun  Huang
>>>
>>>

Re: Confusing sentence in Windowing section in Beam programming guide

Posted by Reuven Lax <re...@google.com>.
Ah yes, Kenn is correct, and I forgot we made that change.

To clarify - Beam does not expose late elements as a concept; rather, it
exposes late panes in its triggering API. The reason we made the change was
not just that we wanted to include as much data as possible, but also
that we wanted to preserve reasonable invariants around late data (e.g.
one invariant should be that the result of processing an on-time pane
should not produce late data). Simply checking element timestamps against
the watermark is nondeterministic (since the watermark advances
asynchronously), so we moved to a different definition of late data.
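
As a rough illustration of the pane-based view, here is a minimal Python sketch of a single fixed window. It is a toy model under stated assumptions, not Beam code: the class WindowState, its fields, and the choice to fire one LATE pane per late element are all hypothetical. What it tries to show is that a pane's label depends on when the pane fires relative to the ON_TIME firing, not on comparing each element's timestamp with the watermark at arrival.

```python
from dataclasses import dataclass, field

# Hypothetical model of one fixed window; names are illustrative, not Beam API.


@dataclass
class WindowState:
    end: float
    allowed_lateness: float = 0.0
    watermark: float = float("-inf")
    buffered: list = field(default_factory=list)
    on_time_fired: bool = False
    panes: list = field(default_factory=list)  # (timing, element timestamps)

    def advance_watermark(self, new_watermark: float) -> None:
        # The watermark only moves forward.
        self.watermark = max(self.watermark, new_watermark)
        # The ON_TIME pane fires once, when the watermark reaches the window
        # end, and holds everything that arrived before that moment.
        if not self.on_time_fired and self.watermark >= self.end:
            self.panes.append(("ON_TIME", list(self.buffered)))
            self.buffered.clear()
            self.on_time_fired = True

    def add(self, element_ts: float) -> bool:
        # The window has expired: the element is dropped.
        if self.watermark >= self.end + self.allowed_lateness:
            return False
        if self.on_time_fired:
            # Assumed trigger choice: emit a LATE pane for each late element.
            self.panes.append(("LATE", [element_ts]))
        else:
            self.buffered.append(element_ts)
        return True


w = WindowState(end=60, allowed_lateness=30)
w.add(5)
w.advance_watermark(40)
w.add(10)                 # behind the watermark (10 < 40), window still open
w.advance_watermark(60)   # ON_TIME pane fires with [5, 10]
w.add(20)                 # after ON_TIME, within allowed lateness: LATE pane
w.advance_watermark(95)
accepted = w.add(30)      # window expired (95 >= 60 + 30): dropped

print(w.panes)    # [('ON_TIME', [5, 10]), ('LATE', [20])]
print(accepted)   # False
```

In this model the element with timestamp 10 lands in the ON_TIME pane even though it arrived behind the watermark, so the outcome does not depend on exactly where the watermark happened to be when the element showed up.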

Reuven

Re: Confusing sentence in Windowing section in Beam programming guide

Posted by Kenneth Knowles <kl...@google.com>.
That is correct. For the global window there is no such thing as late data.

Kenn

Re: Confusing sentence in Windowing section in Beam programming guide

Posted by Ruoyun Huang <ru...@google.com>.
Very helpful discussion (and the fixing PR).

To make sure my takeaway is correct: the status quo is a) "for a global
window, there is *no possible scenario* where data is identified as
late", rather than b) "for a global window we *no longer* compare
against the watermark to identify late data, but *there are still other
criteria* that determine whether data is late".

a) is correct and b) is not.  Is that so?

-- 
================
Ruoyun  Huang

Re: Confusing sentence in Windowing section in Beam programming guide

Posted by Kenneth Knowles <kl...@google.com>.
Actually, Reuven, that's no longer the case.

It used to be that incoming data was compared to the watermark, but it is
not today. Instead, Jeff's first phrasing is perfect.

One way to see it is to think about the consequences of late
data: if there is a grouping/aggregation by key+window, the window
determines when the grouping is complete. We go ahead and include any data
that shows up before the window is complete. And if you set up allowed
lateness it matches exactly: any data that arrives before the ON_TIME
output gets to be in that output.

Previously, when we compared incoming elements to the watermark directly,
you could have a window that was still being aggregated but the elements
that fell in the window were dropped. There was no technical benefit to
losing this data, so we stopped dropping it. We also had lots of tricky
bugs and hard-to-manage code related to what to do if an element arrives
after the watermark. And you could have an ON_TIME firing that included a
bunch of "late" data, which is confusing.

Now it is simple: if the window is still alive, the element goes into it.
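
The contrast between the two rules can be sketched in a few lines of plain Python. This is only a model of the behaviour described above, not Beam's implementation: Window, window_is_alive, old_rule_accepts, and new_rule_accepts are hypothetical names, the old rule is simplified by ignoring allowed lateness, and the exact boundary condition is an assumption.

```python
import math
from dataclasses import dataclass

# Hypothetical names throughout; this models the rule described in the
# thread and is not Beam's actual API.

GLOBAL_WINDOW_END = math.inf  # the global window never ends


@dataclass(frozen=True)
class Window:
    start: float
    end: float  # exclusive upper bound on event time


def window_is_alive(window: Window, watermark: float,
                    allowed_lateness: float = 0.0) -> bool:
    """A window stays alive until the watermark passes end + allowed lateness."""
    return watermark < window.end + allowed_lateness


def old_rule_accepts(element_ts: float, watermark: float) -> bool:
    """Former rule (simplified): compare the element directly to the watermark."""
    return element_ts >= watermark


def new_rule_accepts(window: Window, watermark: float,
                     allowed_lateness: float = 0.0) -> bool:
    """Current rule: if the window is still alive, the element goes into it."""
    return window_is_alive(window, watermark, allowed_lateness)


window = Window(start=0, end=60)

# An element with timestamp 10 arrives while the watermark is at 30: it is
# behind the watermark, but its window [0, 60) is still being aggregated.
print(old_rule_accepts(10, watermark=30))       # False: dropped before
print(new_rule_accepts(window, watermark=30))   # True: goes into the window

# Once the watermark passes the end of the window, the window is gone...
print(new_rule_accepts(window, watermark=61))                       # False
# ...unless allowed lateness keeps it alive a little longer.
print(new_rule_accepts(window, watermark=61, allowed_lateness=30))  # True

# The global window never completes, so nothing is ever too late for it.
global_window = Window(-math.inf, GLOBAL_WINDOW_END)
print(new_rule_accepts(global_window, watermark=1e12))              # True
```

The last call mirrors the point about the global window: because its end is never reached, there is no moment after which an element could be considered late.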

I very rarely use the term "late data" when describing Beam's semantics
anyhow. I always found the term / definition a bit arbitrary.

Kenn

Re: Confusing sentence in Windowing section in Beam programming guide

Posted by Rui Wang <ru...@google.com>.
I created this PR: https://github.com/apache/beam/pull/7556

Feel free to review or comment on it.

-Rui

Re: Confusing sentence in Windowing section in Beam programming guide

Posted by Rui Wang <ru...@google.com>.
It might be better to keep something like "the watermark usually moves
forward consistently". But "Elements that arrive with a smaller timestamp than
the current watermark are considered late data." already gives the ordering of
the late data's timestamp and the watermark.


-Rui

On Thu, Jan 17, 2019 at 1:39 PM Jeff Klukas <jk...@mozilla.com> wrote:

> Reuven - I don't think I realized it was possible to have late data with
> the global window, so I'm definitely learning things through this
> discussion.
>
> New suggested wording, then:
>
>     Elements that arrive with a smaller timestamp than the current
>     watermark are considered late data.
>
> That says basically the same thing as the wording currently in the guide,
> but uses "smaller" (which implies a less-than-watermark comparison) rather
> than "later" (which folks have interpreted as a greater-than-watermark
> comparison).

Re: Confusing sentence in Windowing section in Beam programming guide

Posted by Jeff Klukas <jk...@mozilla.com>.
Reuven - I don't think I realized it was possible to have late data with
the global window, so I'm definitely learning things through this
discussion.

New suggested wording, then:

    Elements that arrive with a smaller timestamp than the current
    watermark are considered late data.

That says basically the same thing as the wording currently in the guide,
but uses "smaller" (which implies a less-than-watermark comparison) rather
than "later" (which folks have interpreted as a greater-than-watermark
comparison).
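
As a rough illustration of the comparison this wording describes, here is a minimal sketch in plain Python. It is not Beam's API: the function name `is_late` and the sample timestamps are hypothetical, and it models lateness per element only, which is a simplification of what a runner actually does.

```python
def is_late(element_timestamp: float, watermark: float) -> bool:
    """Simplified element-level check following the suggested wording:
    an element is late when its timestamp is smaller than the current watermark."""
    return element_timestamp < watermark


# Hypothetical values, in seconds of event time.
watermark = 100.0
arrivals = [95.0, 100.0, 130.0]

# A timestamp "after" (greater than) the watermark is NOT late in this sketch;
# only the one behind the watermark is.
labels = [(ts, "late" if is_late(ts, watermark) else "on time") for ts in arrivals]
for ts, label in labels:
    print(ts, label)
```

The strict less-than is taken directly from "smaller timestamp than the current watermark"; how an element whose timestamp equals the watermark is treated is an assumption of this sketch.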

On Thu, Jan 17, 2019 at 3:40 PM Reuven Lax <re...@google.com> wrote:

> Though it's not tied to a window. You could be in the global window, so the
> watermark never advances past the end of the window, yet still get late
> data.

Re: Confusing sentence in Windowing section in Beam programming guide

Posted by Reuven Lax <re...@google.com>.
Though it's not tied to a window. You could be in the global window, so the
watermark never advances past the end of the window, yet still get late
data.
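
To make the global-window case concrete, here is a small sketch in plain Python (not Beam code). It assumes the global window can be modelled as an interval with no finite end, and both function names are hypothetical. It contrasts a check tied to the window's end with a check tied only to the watermark.

```python
import math

# Assumption: model the global window as spanning all of event time.
GLOBAL_WINDOW_START = -math.inf
GLOBAL_WINDOW_END = math.inf


def late_by_window_end(ts, window_start, window_end, watermark):
    """Window-based wording: the element falls in a window whose end
    the watermark has already reached."""
    return window_start <= ts < window_end and watermark >= window_end


def late_by_watermark(ts, watermark):
    """Watermark-based wording: the element's timestamp is behind the watermark."""
    return ts < watermark


ts, watermark = 50.0, 100.0  # hypothetical event time values
window_late = late_by_window_end(ts, GLOBAL_WINDOW_START, GLOBAL_WINDOW_END, watermark)
watermark_late = late_by_watermark(ts, watermark)
print(window_late, watermark_late)
```

In this model the window-based check can never be true for the global window, because no finite watermark reaches its end, while the watermark-based check still flags the element.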

On Thu, Jan 17, 2019, 11:14 AM Jeff Klukas <jklukas@mozilla.com> wrote:

> How about: "Once the watermark progresses past the end of a window, any
> further elements that arrive with a timestamp in that window are considered
> late data."

Re: Confusing sentence in Windowing section in Beam programming guide

Posted by Rui Wang <ru...@google.com>.
It is better, and it also fits the following example.

-Rui

On Thu, Jan 17, 2019 at 11:14 AM Jeff Klukas <jk...@mozilla.com> wrote:

> How about: "Once the watermark progresses past the end of a window, any
> further elements that arrive with a timestamp in that window are considered
> late data."

Re: Confusing sentence in Windowing section in Beam programming guide

Posted by Jeff Klukas <jk...@mozilla.com>.
How about: "Once the watermark progresses past the end of a window, any
further elements that arrive with a timestamp in that window are considered
late data."
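
A minimal sketch of this window-based wording, in plain Python rather than Beam code. It assumes fixed windows of a hypothetical size with an exclusive end, and treats the watermark as "past the end" once it reaches that end; the names `window_for` and `is_late_for_window` are illustrative only.

```python
WINDOW_SIZE = 60  # hypothetical fixed-window length, in seconds of event time


def window_for(ts):
    """Return the (start, end) of the fixed window containing ts; end is exclusive."""
    start = ts - ts % WINDOW_SIZE
    return start, start + WINDOW_SIZE


def is_late_for_window(ts, watermark):
    """Late under the window-based wording: the watermark has already
    reached the end of the window that ts belongs to."""
    _, end = window_for(ts)
    return watermark >= end


# Each pair is (event timestamp, watermark when the element arrives); values are made up.
arrivals = [(10, 30), (55, 61), (70, 61)]
results = [is_late_for_window(ts, wm) for ts, wm in arrivals]
print(results)
```

Note that the first element (timestamp 10, watermark 30) is behind the watermark but its window [0, 60) is still open, so this wording does not count it as late.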

On Thu, Jan 17, 2019 at 1:43 PM Rui Wang <ru...@google.com> wrote:

> Hi Community,
>
> In Beam programming guide [1], there is a sentence: "Data that arrives
> with a timestamp after the watermark is considered *late data*"
>
> Seems like people get confused by it. For example, see Stackoverflow
> comment [2]. Basically it makes people think that a event timestamp that is
> bigger than watermark is considered late (due to that "after").
>
> Although there is a example right after this sentence to explain late
> data, seems to me that this sentence is incomplete. The complete sentence
> to me can be: "The watermark consistently advances from -inf to +inf. Data
> that arrives with a timestamp after the watermark is considered late data."
>
> Am I understand correctly? Is there better description for the order of
> late data and watermark? I would happy to send PR to update Beam
> documentation.
>
> -Rui
>
> [1]: https://beam.apache.org/documentation/programming-guide/#windowing
> [2]:
> https://stackoverflow.com/questions/54141352/dataflow-to-process-late-and-out-of-order-data-for-batch-and-stream-messages/54188971?noredirect=1#comment95302476_54188971