You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by pnowojski <gi...@git.apache.org> on 2018/05/25 07:56:43 UTC

[GitHub] flink pull request #6076: [hotfix][docs] Specify operators behaviour on proc...

GitHub user pnowojski opened a pull request:

    https://github.com/apache/flink/pull/6076

    [hotfix][docs] Specify operators behaviour on processing watermarks

    This PR is a simple documentation improvement

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/pnowojski/flink docs3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6076.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6076
    
----
commit 97599bd0e547e7f413ceeab5d26b08a6a5684930
Author: Piotr Nowojski <pi...@...>
Date:   2018-05-25T07:55:04Z

    [hotfix][docs] Specify operators behaviour on processing watermarks

----


---

[GitHub] flink pull request #6076: [hotfix][docs] Specify operators behaviour on proc...

Posted by pnowojski <gi...@git.apache.org>.
Github user pnowojski commented on a diff in the pull request:

    https://github.com/apache/flink/pull/6076#discussion_r194390932
  
    --- Diff: docs/dev/event_time.md ---
    @@ -213,10 +213,33 @@ arrive after the system's event time clock (as signaled by the watermarks) has a
     timestamp. See [Allowed Lateness]({{ site.baseurl }}/dev/stream/operators/windows.html#allowed-lateness) for more information on how to work
     with late elements in event time windows.
     
    +## Idling sources
    +
    +Currently with pure event time watermarks generators, watermarks can not progress if there are no elements
    +to be processed. That means in case of gap in the incoming data, even time will not progress and for
    +example window operator will not be triggered and thus existing windows will not be able produce any
    +output data.
    +
    +To circumvent this one can use periodic watermark assigners that don't only assign based on
    +element timestamps. Example solution could be an assigner that switches to using current processing time
    +as the time basis after not observing new events for a while.
     
     ## Debugging Watermarks
     
     Please refer to the [Debugging Windows & Event Time]({{ site.baseurl }}/monitoring/debugging_event_time.html) section for debugging
     watermarks at runtime.
     
    +## How operators are processing watermarks
    +
    +As a general rule, operators are required to completely process a given watermark before forwarding it downstream. For example,
    +`WindowOperator` will first evaluate which windows should be fired, and only after producing all of the output triggered by
    +the watermark will the watermark itself be sent downstream. In other words, all elements produced due to occurrence of a watermark
    +will be emitted before the watermark.
    +
    +The same rule applies to `TwoInputStreamOperator`. However, in this case the current watermark of the operator is defined as
    --- End diff --
    
    They are part of the public api, so I don't see a big problem referring to them. Especially that users are asking how to override the default behaviour defined by `TwoInputStreamOperator.processWatermark1` and `TwoInputStreamOperator.processWatermark2` (that was one of the reasons for this doc improvement).


---

[GitHub] flink issue #6076: [hotfix][docs] Specify operators behaviour on processing ...

Posted by eliaslevy <gi...@git.apache.org>.
Github user eliaslevy commented on the issue:

    https://github.com/apache/flink/pull/6076
  
    @pnowojski what about refactoring the doc into something like [this](https://docs.google.com/document/d/1b5d-hTdJQsPH3YD0zTB4ZqodinZVHFomKvt41FfUPMc/edit?usp=sharing). 


---

[GitHub] flink issue #6076: [hotfix][docs] Specify operators behaviour on processing ...

Posted by eliaslevy <gi...@git.apache.org>.
Github user eliaslevy commented on the issue:

    https://github.com/apache/flink/pull/6076
  
    Thanks, that helps.
    
    Regarding debugging watermarks, at least in 1.4.2 sources did not display their emitted watermarks, although this may have been fixed in 1.5.0.  In turn, that makes jobs with task chaining turned on more difficult to debug.  It may be useful to suggest turning off task chaining for debugging.
    
    Also, the Kafka connector page discusses the fact that the Kafka source only emits watermarks as the minimum of the per partition watermark, which means an idle partition may delay watermark emission. It may be good to reiterate that behavior here, as Kafka is the most commonly used connector.



---

[GitHub] flink pull request #6076: [hotfix][docs] Specify operators behaviour on proc...

Posted by zentol <gi...@git.apache.org>.
Github user zentol commented on a diff in the pull request:

    https://github.com/apache/flink/pull/6076#discussion_r194702541
  
    --- Diff: docs/dev/event_time.md ---
    @@ -213,10 +213,36 @@ arrive after the system's event time clock (as signaled by the watermarks) has a
     timestamp. See [Allowed Lateness]({{ site.baseurl }}/dev/stream/operators/windows.html#allowed-lateness) for more information on how to work
     with late elements in event time windows.
     
    +## Idling sources
    +
    +Currently with pure event time watermarks generators, watermarks can not progress if there are no elements
    +to be processed. That means in case of gap in the incoming data, even time will not progress and for
    +example window operator will not be triggered and thus existing windows will not be able produce any
    +output data.
    +
    +To circumvent this one can use periodic watermark assigners that don't only assign based on
    +element timestamps. Example solution could be an assigner that switches to using current processing time
    --- End diff --
    
    `An` example


---

[GitHub] flink pull request #6076: [hotfix][docs] Specify operators behaviour on proc...

Posted by pnowojski <gi...@git.apache.org>.
Github user pnowojski commented on a diff in the pull request:

    https://github.com/apache/flink/pull/6076#discussion_r190858003
  
    --- Diff: docs/dev/event_time.md ---
    @@ -219,4 +219,17 @@ with late elements in event time windows.
     Please refer to the [Debugging Windows & Event Time]({{ site.baseurl }}/monitoring/debugging_event_time.html) section for debugging
     watermarks at runtime.
     
    +## How operators are processing watermarks
    +
    +General rule is that operator are required to completely process a given watermark before forwarding it downstream. For example,
    +`WindowOperator` will first evaluate which windows should be fired and only after producing all of the output triggered by
    +the watermark, the watermark itself will be handled downstream. In other words, all elements produced due to occurrence of
    +the watermark will be emitted before such watermark.
    +
    +Same rule applies to `TwoInputStreamOperator`. However in this case current watermark of the operator is defined as a minimum
    +of both of it's inputs.
    +
    +Details of this behaviour is defined by implementations of methods `OneInputStreamOperator.processWatermark`,
    +`TwoInputStreamOperator.processWatermark1` and `TwoInputStreamOperator.processWatermark2`.
    +
    --- End diff --
    
    Thanks for the corrections! +1 for american version
    
    I didn't want to make a definitive statement one way or the other, since I think there always can be some special exceptions. Also, if user changes this semantic because of reasons in his custom operator, then Flink will not brake, it will have different semantic.


---

[GitHub] flink pull request #6076: [hotfix][docs] Specify operators behaviour on proc...

Posted by zentol <gi...@git.apache.org>.
Github user zentol commented on a diff in the pull request:

    https://github.com/apache/flink/pull/6076#discussion_r194702106
  
    --- Diff: docs/dev/event_time.md ---
    @@ -213,10 +213,36 @@ arrive after the system's event time clock (as signaled by the watermarks) has a
     timestamp. See [Allowed Lateness]({{ site.baseurl }}/dev/stream/operators/windows.html#allowed-lateness) for more information on how to work
     with late elements in event time windows.
     
    +## Idling sources
    +
    +Currently with pure event time watermarks generators, watermarks can not progress if there are no elements
    +to be processed. That means in case of gap in the incoming data, even time will not progress and for
    --- End diff --
    
    typo: event time


---

[GitHub] flink pull request #6076: [hotfix][docs] Specify operators behaviour on proc...

Posted by zentol <gi...@git.apache.org>.
Github user zentol commented on a diff in the pull request:

    https://github.com/apache/flink/pull/6076#discussion_r194702812
  
    --- Diff: docs/dev/event_time.md ---
    @@ -213,10 +213,36 @@ arrive after the system's event time clock (as signaled by the watermarks) has a
     timestamp. See [Allowed Lateness]({{ site.baseurl }}/dev/stream/operators/windows.html#allowed-lateness) for more information on how to work
     with late elements in event time windows.
     
    +## Idling sources
    +
    +Currently with pure event time watermarks generators, watermarks can not progress if there are no elements
    --- End diff --
    
    remote comma


---

[GitHub] flink pull request #6076: [hotfix][docs] Specify operators behaviour on proc...

Posted by eliaslevy <gi...@git.apache.org>.
Github user eliaslevy commented on a diff in the pull request:

    https://github.com/apache/flink/pull/6076#discussion_r193619225
  
    --- Diff: docs/dev/event_time.md ---
    @@ -213,10 +213,33 @@ arrive after the system's event time clock (as signaled by the watermarks) has a
     timestamp. See [Allowed Lateness]({{ site.baseurl }}/dev/stream/operators/windows.html#allowed-lateness) for more information on how to work
     with late elements in event time windows.
     
    +## Idling sources
    --- End diff --
    
    Good point.  Probably a good idea to include the here the description in [StreamStatus](https://ci.apache.org/projects/flink/flink-docs-release-1.5/api/java/org/apache/flink/streaming/runtime/streamstatus/StreamStatus.html).


---

[GitHub] flink issue #6076: [hotfix][docs] Specify operators behaviour on processing ...

Posted by pnowojski <gi...@git.apache.org>.
Github user pnowojski commented on the issue:

    https://github.com/apache/flink/pull/6076
  
    Thanks for the feedback @eliaslevy. Pushed one extra commit about idling sources.


---

[GitHub] flink pull request #6076: [hotfix][docs] Specify operators behaviour on proc...

Posted by alpinegizmo <gi...@git.apache.org>.
Github user alpinegizmo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/6076#discussion_r190841061
  
    --- Diff: docs/dev/event_time.md ---
    @@ -219,4 +219,17 @@ with late elements in event time windows.
     Please refer to the [Debugging Windows & Event Time]({{ site.baseurl }}/monitoring/debugging_event_time.html) section for debugging
     watermarks at runtime.
     
    +## How operators are processing watermarks
    +
    +General rule is that operator are required to completely process a given watermark before forwarding it downstream. For example,
    +`WindowOperator` will first evaluate which windows should be fired and only after producing all of the output triggered by
    +the watermark, the watermark itself will be handled downstream. In other words, all elements produced due to occurrence of
    +the watermark will be emitted before such watermark.
    +
    +Same rule applies to `TwoInputStreamOperator`. However in this case current watermark of the operator is defined as a minimum
    +of both of it's inputs.
    +
    +Details of this behaviour is defined by implementations of methods `OneInputStreamOperator.processWatermark`,
    +`TwoInputStreamOperator.processWatermark1` and `TwoInputStreamOperator.processWatermark2`.
    +
    --- End diff --
    
    As a general rule, operators are required to completely process a given watermark before forwarding it downstream. For example,
    `WindowOperator` will first evaluate which windows should be fired, and only after producing all of the output triggered by
    the watermark will the watermark itself be sent downstream. In other words, all elements produced due to occurrence of a watermark will be emitted before the watermark.
    
    The same rule applies to `TwoInputStreamOperator`. However, in this case the current watermark of the operator is defined as the minimum
    of both of its inputs.
    
    The details of this behavior are defined by the implementations of the `OneInputStreamOperator.processWatermark`,
    `TwoInputStreamOperator.processWatermark1` and `TwoInputStreamOperator.processWatermark2` methods.


---

[GitHub] flink pull request #6076: [hotfix][docs] Specify operators behaviour on proc...

Posted by twalthr <gi...@git.apache.org>.
Github user twalthr commented on a diff in the pull request:

    https://github.com/apache/flink/pull/6076#discussion_r192981069
  
    --- Diff: docs/dev/event_time.md ---
    @@ -213,10 +213,33 @@ arrive after the system's event time clock (as signaled by the watermarks) has a
     timestamp. See [Allowed Lateness]({{ site.baseurl }}/dev/stream/operators/windows.html#allowed-lateness) for more information on how to work
     with late elements in event time windows.
     
    +## Idling sources
    +
    +Currently with pure event time watermarks generators, watermarks can not progress if there are no elements
    +to be processed. That means in case of gap in the incoming data, even time will not progress and for
    +example window operator will not be triggered and thus existing windows will not be able produce any
    +output data.
    +
    +To circumvent this one can use periodic watermark assigners that don't only assign based on
    +element timestamps. Example solution could be an assigner that switches to using current processing time
    +as the time basis after not observing new events for a while.
     
     ## Debugging Watermarks
     
     Please refer to the [Debugging Windows & Event Time]({{ site.baseurl }}/monitoring/debugging_event_time.html) section for debugging
     watermarks at runtime.
     
    +## How operators are processing watermarks
    +
    +As a general rule, operators are required to completely process a given watermark before forwarding it downstream. For example,
    +`WindowOperator` will first evaluate which windows should be fired, and only after producing all of the output triggered by
    +the watermark will the watermark itself be sent downstream. In other words, all elements produced due to occurrence of a watermark
    +will be emitted before the watermark.
    +
    +The same rule applies to `TwoInputStreamOperator`. However, in this case the current watermark of the operator is defined as
    --- End diff --
    
    `TwoInputStreamOperator` and `OneInputStreamOperator` are internal classes. We should not mention them here but use a generic `operator with one input` term.


---

[GitHub] flink pull request #6076: [hotfix][docs] Specify operators behaviour on proc...

Posted by zentol <gi...@git.apache.org>.
Github user zentol commented on a diff in the pull request:

    https://github.com/apache/flink/pull/6076#discussion_r194702405
  
    --- Diff: docs/dev/event_time.md ---
    @@ -213,10 +213,36 @@ arrive after the system's event time clock (as signaled by the watermarks) has a
     timestamp. See [Allowed Lateness]({{ site.baseurl }}/dev/stream/operators/windows.html#allowed-lateness) for more information on how to work
     with late elements in event time windows.
     
    +## Idling sources
    +
    +Currently with pure event time watermarks generators, watermarks can not progress if there are no elements
    +to be processed. That means in case of gap in the incoming data, even time will not progress and for
    +example window operator will not be triggered and thus existing windows will not be able produce any
    +output data.
    +
    +To circumvent this one can use periodic watermark assigners that don't only assign based on
    +element timestamps. Example solution could be an assigner that switches to using current processing time
    +as the time basis after not observing new events for a while.
    +
    +Sources can be marked as idle using `SourceFunction.SourceContext#markAsTemporarilyIdle`. For details please refer to Javadoc of
    --- End diff --
    
    to `the` Javadoc


---

[GitHub] flink issue #6076: [hotfix][docs] Specify operators behaviour on processing ...

Posted by eliaslevy <gi...@git.apache.org>.
Github user eliaslevy commented on the issue:

    https://github.com/apache/flink/pull/6076
  
    Something else missing from this page is a discussion of what happens to the timestamp attached to records when they go through different operations.  E.g. what happens to a record timestamp when it goes through `map` or when multiple records are emitted from `flatMap`?  What is the timestamp of an record emitted by an operation on a window or an aggregation operation on non-windowed stream?  None of these are discussed anywhere in the documentation from what I can tell.


---

[GitHub] flink pull request #6076: [hotfix][docs] Specify operators behaviour on proc...

Posted by pnowojski <gi...@git.apache.org>.
Github user pnowojski commented on a diff in the pull request:

    https://github.com/apache/flink/pull/6076#discussion_r194393116
  
    --- Diff: docs/dev/event_time.md ---
    @@ -213,10 +213,33 @@ arrive after the system's event time clock (as signaled by the watermarks) has a
     timestamp. See [Allowed Lateness]({{ site.baseurl }}/dev/stream/operators/windows.html#allowed-lateness) for more information on how to work
     with late elements in event time windows.
     
    +## Idling sources
    --- End diff --
    
    I have added a reference to both things


---

[GitHub] flink issue #6076: [hotfix][docs] Specify operators behaviour on processing ...

Posted by pnowojski <gi...@git.apache.org>.
Github user pnowojski commented on the issue:

    https://github.com/apache/flink/pull/6076
  
    I have added a sentence covering this, but it would be nice to merge it and not prolonging and ever extending this PR.


---

[GitHub] flink pull request #6076: [hotfix][docs] Specify operators behaviour on proc...

Posted by zentol <gi...@git.apache.org>.
Github user zentol commented on a diff in the pull request:

    https://github.com/apache/flink/pull/6076#discussion_r194702199
  
    --- Diff: docs/dev/event_time.md ---
    @@ -213,10 +213,36 @@ arrive after the system's event time clock (as signaled by the watermarks) has a
     timestamp. See [Allowed Lateness]({{ site.baseurl }}/dev/stream/operators/windows.html#allowed-lateness) for more information on how to work
     with late elements in event time windows.
     
    +## Idling sources
    +
    +Currently with pure event time watermarks generators, watermarks can not progress if there are no elements
    +to be processed. That means in case of gap in the incoming data, even time will not progress and for
    +example window operator will not be triggered and thus existing windows will not be able produce any
    --- End diff --
    
    example `the` window
    able `to` produce


---

[GitHub] flink pull request #6076: [hotfix][docs] Specify operators behaviour on proc...

Posted by twalthr <gi...@git.apache.org>.
Github user twalthr commented on a diff in the pull request:

    https://github.com/apache/flink/pull/6076#discussion_r192984621
  
    --- Diff: docs/dev/event_time.md ---
    @@ -213,10 +213,33 @@ arrive after the system's event time clock (as signaled by the watermarks) has a
     timestamp. See [Allowed Lateness]({{ site.baseurl }}/dev/stream/operators/windows.html#allowed-lateness) for more information on how to work
     with late elements in event time windows.
     
    +## Idling sources
    --- End diff --
    
    Source can be marked as temporarily idle. Maybe we should also mentioned it `SourceFunction.SourceContext#markAsTemporarilyIdle` here.


---

[GitHub] flink pull request #6076: [hotfix][docs] Specify operators behaviour on proc...

Posted by pnowojski <gi...@git.apache.org>.
Github user pnowojski commented on a diff in the pull request:

    https://github.com/apache/flink/pull/6076#discussion_r195004441
  
    --- Diff: docs/dev/event_time.md ---
    @@ -213,10 +213,36 @@ arrive after the system's event time clock (as signaled by the watermarks) has a
     timestamp. See [Allowed Lateness]({{ site.baseurl }}/dev/stream/operators/windows.html#allowed-lateness) for more information on how to work
     with late elements in event time windows.
     
    +## Idling sources
    +
    +Currently with pure event time watermarks generators, watermarks can not progress if there are no elements
    --- End diff --
    
    I have changed it to 
    >Currently, with pure event time watermarks generators, watermarks can not progress if there are no elements to be processed. 
    
    (added comma after currently)


---

[GitHub] flink pull request #6076: [hotfix][docs] Specify operators behaviour on proc...

Posted by alpinegizmo <gi...@git.apache.org>.
Github user alpinegizmo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/6076#discussion_r190841926
  
    --- Diff: docs/dev/event_time.md ---
    @@ -219,4 +219,17 @@ with late elements in event time windows.
     Please refer to the [Debugging Windows & Event Time]({{ site.baseurl }}/monitoring/debugging_event_time.html) section for debugging
     watermarks at runtime.
     
    +## How operators are processing watermarks
    +
    +General rule is that operator are required to completely process a given watermark before forwarding it downstream. For example,
    +`WindowOperator` will first evaluate which windows should be fired and only after producing all of the output triggered by
    +the watermark, the watermark itself will be handled downstream. In other words, all elements produced due to occurrence of
    +the watermark will be emitted before such watermark.
    +
    +Same rule applies to `TwoInputStreamOperator`. However in this case current watermark of the operator is defined as a minimum
    +of both of it's inputs.
    +
    +Details of this behaviour is defined by implementations of methods `OneInputStreamOperator.processWatermark`,
    +`TwoInputStreamOperator.processWatermark1` and `TwoInputStreamOperator.processWatermark2`.
    +
    --- End diff --
    
    I offer some grammatical improvements. Also, is it correct to describe "operators are required to completely process a given watermark before forwarding it downstream" as a general rule, meaning that it might have exceptions, or should we simply say "operators are required ..." without adding this caveat?
    
    I changed behaviour to behavior because most of the docs seem to be using American spellings rather than English ones, but I'm not sure if we have a policy regarding this.


---