Posted to issues@spark.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2023/03/27 09:03:00 UTC

[jira] [Commented] (SPARK-42931) dropDuplicates within watermark

    [ https://issues.apache.org/jira/browse/SPARK-42931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705252#comment-17705252 ] 

ASF GitHub Bot commented on SPARK-42931:
----------------------------------------

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/40561

> dropDuplicates within watermark
> -------------------------------
>
>                 Key: SPARK-42931
>                 URL: https://issues.apache.org/jira/browse/SPARK-42931
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 3.5.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>         Attachments: [External] Mini design doc_ dropDuplicates within watermark.pdf
>
>
> We got many reports that dropDuplicates does not clean up state even though a watermark is set for the query. We clearly document that the event time column must be part of the subset columns for deduplication in order for state to be cleaned up, but users often cannot apply this, because timestamps are not exactly the same across duplicated events in their use cases.
> We propose to introduce a new API for dropDuplicates which has the following characteristics, compared to the existing dropDuplicates:
>  * Weaker constraints on the subset (key)
>  ** Does not require the event time column to be part of the subset.
>  * Looser semantics on deduplication
>  ** Only guarantees deduplication of events arriving within the watermark.
> Since the new API leverages event time, it has the following new requirements:
>  * The input must be a streaming DataFrame.
>  * A watermark must be defined on the query.
>  * The event time column must be defined in the input DataFrame.
> More specifically on the semantics: once the operator processes the first arrived event, later events for the same key arriving within the watermark delay of that first event will be deduplicated.
> (Technically, the expiration time of the state should be "event time of the first arrived event + watermark delay threshold", so that it matches up with future events.)
> Users are encouraged to set the watermark delay threshold to be longer than the maximum timestamp difference among duplicated events. (If they are unsure, they can set the delay threshold to a generously large value, e.g. 48 hours.)
> A longer design doc will be attached.
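The semantics described above can be sketched with a small, simplified Python model. This is not the proposed operator itself: the function name is illustrative, and it models a single pass over events in arrival order, whereas the real operator manages state across micro-batches and evicts it via the watermark. It only shows the core rule that state for the first arrived event expires at "event time of the first arrived event + watermark delay threshold".

```python
from datetime import datetime, timedelta

def dedup_within_watermark(events, delay):
    """Simplified model of the proposed deduplication semantics.

    events: list of (key, event_time) tuples in arrival order.
    delay:  the watermark delay threshold (a timedelta).

    For each key, the first arrived event is emitted and its state is
    kept with expiration time = event time + delay. Later events whose
    event time falls before that expiration are dropped as duplicates.
    """
    state = {}  # key -> expiration time of the current state entry
    out = []
    for key, ts in events:
        exp = state.get(key)
        if exp is not None and ts < exp:
            continue  # duplicate within the watermark window; drop it
        # First arrival (or prior state expired): emit and (re)set state.
        state[key] = ts + delay
        out.append((key, ts))
    return out
```

For example, with a 1-hour delay threshold, a duplicate arriving 5 minutes after the first event is dropped, while an event with the same key 2 hours later is treated as new. This also illustrates why the delay threshold should exceed the maximum timestamp difference among duplicates.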



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
