You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@samza.apache.org by "Yi Pan (Data Infrastructure) (JIRA)" <ji...@apache.org> on 2015/03/02 20:23:05 UTC

[jira] [Updated] (SAMZA-552) Tuple or time window semantics in physical operator

     [ https://issues.apache.org/jira/browse/SAMZA-552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yi Pan (Data Infrastructure) updated SAMZA-552:
-----------------------------------------------
    Attachment: DESIGN-SAMZA-552-3.pdf
                DESIGN-SAMZA-552-3.md

Attaching a draft on window operator that we have been discussing internally in LinkedIn. I wrote this before reading through the MillWheel talk and here are a few comparisons to MillWheel strategy:
1. MillWheel "stream time" is documented as "system time" in this draft
2. The punctuation mark in the draft is equivalent to the "trigger" in MillWheel talk
3. The window size measurement in the draft is a specific example of "watermark" in MillWheel.
The followings is the interesting concept from MillWheel that we are planning to extend our model to support:
1. Re-emit window results for late arrivals. This would include two parts: a. extend the current window state to a store that keeps all past windows; b. add policies to allow late arrivals to trigger re-emitting results for a past window.
With the above extension, we will be able to fully incorporate MillWheel's model in the draft design.

> Tuple or time window semantics in physical operator
> ---------------------------------------------------
>
>                 Key: SAMZA-552
>                 URL: https://issues.apache.org/jira/browse/SAMZA-552
>             Project: Samza
>          Issue Type: Sub-task
>          Components: sql
>    Affects Versions: 0.9.0
>            Reporter: Yi Pan (Data Infrastructure)
>            Assignee: Yi Pan (Data Infrastructure)
>         Attachments: DESIGN-SAMZA-552-3.md, DESIGN-SAMZA-552-3.pdf
>
>
> The discussion is based on how to support tuple and/or time based window operators in Samza physical operator layer.
> Here are the few observations:
> # Tuple represents the “physical ordering” of events while time-based window has semantic meanings to users
> # Total ordering between tuples are possible within Samza/Kafka given a deterministic MessageSelector on all input streams and offsets within each stream
> # No matter whether tuple or time is used to measure the window size, the window termination condition is needed to close a window to avoid the job to be wedged forever
> The following questions have to be answered to fully implement a window operator:
> # how to determine that a window is closed and no new tuples will be added?
> ## For tuple based, how do we close the window if messages do not come or get delayed?
> ## For time based, how do we close the window if
> ### the messages are not strictly in order w/ the time?
> ### the message w/ timestamp greater than the window boundary does not come or gets delayed?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)