You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Thilo Schneider (JIRA)" <ji...@apache.org> on 2019/04/10 06:46:00 UTC

[jira] [Updated] (SPARK-27424) Joining of one stream against the most recent update in another stream

     [ https://issues.apache.org/jira/browse/SPARK-27424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thilo Schneider updated SPARK-27424:
------------------------------------
    Attachment: join-last-update-design.pdf

> Joining of one stream against the most recent update in another stream
> ----------------------------------------------------------------------
>
>                 Key: SPARK-27424
>                 URL: https://issues.apache.org/jira/browse/SPARK-27424
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.4.1
>            Reporter: Thilo Schneider
>            Priority: Major
>         Attachments: join-last-update-design.pdf
>
>
> Currently, adding the most recent update of a row with a given key to another stream is not possible. This situation arises if one wants to use the current state, of one object, for example when joining the room temperature with the current weather.
> This ticket covers creation of a {{stream_lead}} and modification of the streaming join logic (and state store) to additionally allow joins of the form 
> {code:sql}
> SELECT *
> FROM A, B
> WHERE 
>     A.key = B.key 
>     AND A.time >= B.time 
>     AND A.time < stream_lead(B.time)
> {code}
> The major aspect of this change is that we actually need a third watermark to cover how late updates may come. 
> A rough sketch may be found in the attached document.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org