You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Jungtaek Lim (Jira)" <ji...@apache.org> on 2023/03/07 04:45:00 UTC

[jira] [Resolved] (SPARK-42376) Introduce watermark propagation among operators

     [ https://issues.apache.org/jira/browse/SPARK-42376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim resolved SPARK-42376.
----------------------------------
    Fix Version/s: 3.5.0
       Resolution: Fixed

Issue resolved by pull request 39931
[https://github.com/apache/spark/pull/39931]

> Introduce watermark propagation among operators
> -----------------------------------------------
>
>                 Key: SPARK-42376
>                 URL: https://issues.apache.org/jira/browse/SPARK-42376
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.5.0
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Major
>             Fix For: 3.5.0
>
>
> With introduction of SPARK-40925, we enabled workloads containing multiple stateful operators in a single streaming query.
> The JIRA ticket clearly described out-of-scope, "Here we propose fixing the late record filtering in stateful operators to allow chaining of stateful operators {*}which do not produce delayed records (like time-interval join or potentially flatMapGroupsWithState){*}".
> We identified production use case for stream-stream time-interval join followed by stateful operator (e.g. window aggregation), and propose to address such use case via this ticket.
> The design will be described in the PR, but the sketched idea is introducing simulation of watermark propagation among operators. As of now, Spark considers all stateful operators to have same input watermark and output watermark, which introduced the limitation. With this ticket, we construct the logic to simulate watermark propagation so that each operator can have its own (input watermark, output watermark). Operators introducing delayed records will produce delayed output watermark, and downstream operator can take the delay into account as input watermark will be adjusted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org