Posted to issues@spark.apache.org by "Tathagata Das (Jira)" <ji...@apache.org> on 2020/02/01 00:36:00 UTC

[jira] [Commented] (SPARK-30657) Streaming limit after streaming dropDuplicates can throw error

    [ https://issues.apache.org/jira/browse/SPARK-30657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027910#comment-17027910 ] 

Tathagata Das commented on SPARK-30657:
---------------------------------------

This fix by itself (separate from the fix for SPARK-30658) may be backported. My solution of always injecting StreamingLocalLimitExec is safe from a correctness point of view, but a little risky from a performance point of view (a risk I tried to minimize with the optimization). With 2.4.4+, unless this is a serious bug that affects many users, I don't think we should backport it. And I don't think limit on streaming is used extensively enough for this to be a big bug (it went unreported for 1.5 years).

What do you think, [~zsxwing]?

> Streaming limit after streaming dropDuplicates can throw error
> --------------------------------------------------------------
>
>                 Key: SPARK-30657
>                 URL: https://issues.apache.org/jira/browse/SPARK-30657
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
>            Reporter: Tathagata Das
>            Assignee: Tathagata Das
>            Priority: Critical
>             Fix For: 3.0.0
>
>
> {{LocalLimitExec}} does not fully consume the iterator of the child plan. So if there is a limit after a stateful operator like streaming dedup in append mode (e.g. {{streamingdf.dropDuplicates().limit(5)}}), the state changes of the streaming dedup may not be committed (most stateful ops commit state changes only after the generated iterator is fully consumed). This leads to the next batch failing with {{java.lang.IllegalStateException: Error reading delta file .../N.delta does not exist}} as the state store delta file was never generated.
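
The failure mode described above can be reproduced in miniature without Spark. The following is only an illustrative sketch in plain Python, not Spark code: FakeStateStore, dedup, naive_limit and draining_limit are hypothetical names invented for this example. It assumes a stateful operator that commits only after its output iterator is exhausted, and contrasts a limit that stops pulling early with one that keeps draining its child after the limit is reached.

```python
from itertools import islice
from typing import Iterable, Iterator


class FakeStateStore:
    """Hypothetical stand-in for a state store; this is not Spark's API."""

    def __init__(self) -> None:
        self.committed = False

    def commit(self) -> None:
        self.committed = True


def dedup(rows: Iterable[int], store: FakeStateStore) -> Iterator[int]:
    """Yield distinct rows; commit state only once the input is exhausted."""
    seen = set()
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row
    # Reached only if the caller pulls from this generator until it ends.
    store.commit()


def naive_limit(child: Iterator[int], n: int) -> Iterator[int]:
    """Return the first n rows and never pull from the child again."""
    return islice(child, n)


def draining_limit(child: Iterator[int], n: int) -> Iterator[int]:
    """Return the first n rows, then discard the rest so the child finishes."""
    count = 0
    for row in child:
        if count < n:
            count += 1
            yield row
        # Rows past the limit are dropped, but the child is still drained.


if __name__ == "__main__":
    data = [1, 1, 2, 3, 4, 5, 6, 7]

    store_a = FakeStateStore()
    print(list(naive_limit(dedup(data, store_a), 5)), store_a.committed)

    store_b = FakeStateStore()
    print(list(draining_limit(dedup(data, store_b), 5)), store_b.committed)
```

In this toy model the naive limit leaves the store uncommitted, which is the analogue of the missing delta file. The draining variant commits, but it has to read every remaining row from the child, which is the kind of performance cost the comment above is concerned about.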



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
