Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2018/11/13 00:38:00 UTC

[jira] [Commented] (SPARK-26008) Structured Streaming Manual clock for simulation

    [ https://issues.apache.org/jira/browse/SPARK-26008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684560#comment-16684560 ] 

Hyukjin Kwon commented on SPARK-26008:
--------------------------------------

Discussion of a rough idea or wish should ideally start on the dev mailing list. New features can be proposed there as well. Such proposals are generally not helpful unless accompanied by detail, such as a design document and/or a code change. If you are not going to work on this, please don't reopen the issue; start the discussion on the mailing list instead.

> Structured Streaming Manual clock for simulation
> ------------------------------------------------
>
>                 Key: SPARK-26008
>                 URL: https://issues.apache.org/jira/browse/SPARK-26008
>             Project: Spark
>          Issue Type: Wish
>          Components: Structured Streaming
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>            Reporter: Tom Bar Yacov
>            Priority: Major
>
> Structured Streaming's internal StreamTest class allows testing incremental logic and verifying outputs between multiple triggers. It supports changing the internal Spark clock to get a fully deterministic simulation of the incremental state and APIs. This is not possible outside tests, since DataStreamWriter hides the triggerClock parameter and is final.
> This can be very useful not only in unit tests but also for a real running query. For example, when you have all the historical Kafka data persisted to HDFS with its Kafka timestamps, you may want to "play" the data and simulate the streaming application's output as if it were running on this data in live streaming, including the incremental output between triggers.
> Currently I can simulate multiple triggers and incremental logic for some of the APIs, but for APIs that depend on the execution clock, such as mapGroupsWithState with an execution-based timeout, I did not find a way to do this.
> I would like to allow passing an externally controlled clock as a parameter to DataStreamWriter and to the query itself.
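
To make the request above concrete, here is a minimal, standalone sketch of what an externally controlled clock buys you. It does not use any Spark API: ManualClock, SessionState, ToyQuery, and run_trigger are hypothetical names invented for this illustration, and the expiry rule is a simplified stand-in for an execution-time timeout, not Spark's actual semantics. The point is only that when the query reads time from an injected clock, replayed data produces the same timeouts on every run.

```python
from dataclasses import dataclass, field


class ManualClock:
    """Hypothetical clock whose time moves only when the caller advances it."""

    def __init__(self, start_ms: int = 0) -> None:
        self._now_ms = start_ms

    def now_ms(self) -> int:
        return self._now_ms

    def advance(self, delta_ms: int) -> int:
        if delta_ms < 0:
            raise ValueError("time cannot move backwards")
        self._now_ms += delta_ms
        return self._now_ms


@dataclass
class SessionState:
    count: int = 0
    expires_at_ms: int = 0


@dataclass
class ToyQuery:
    """Hypothetical stand-in for a stateful query with a clock-based timeout."""

    clock: ManualClock
    timeout_ms: int
    state: dict = field(default_factory=dict)

    def run_trigger(self, keys):
        """Process one micro-batch; return sorted (key, count) pairs that timed out."""
        now = self.clock.now_ms()
        batch_keys = set(keys)
        # A key times out if its deadline has passed and this batch has no data for it.
        expired = sorted(
            (k, s.count)
            for k, s in self.state.items()
            if s.expires_at_ms <= now and k not in batch_keys
        )
        for k, _ in expired:
            del self.state[k]
        # Every record updates its key's state and pushes the deadline forward.
        for k in keys:
            s = self.state.setdefault(k, SessionState())
            s.count += 1
            s.expires_at_ms = now + self.timeout_ms
        return expired


# Replay three "historical" batches, moving the clock by hand between triggers.
clock = ManualClock()
query = ToyQuery(clock, timeout_ms=10_000)
out1 = query.run_trigger(["a", "b"])
clock.advance(5_000)
out2 = query.run_trigger(["a"])
clock.advance(6_000)
out3 = query.run_trigger([])
clock.advance(4_000)
out4 = query.run_trigger([])
print(out1, out2, out3, out4)
```

Because the clock is passed in rather than read from the system, the same replay always yields the same sequence of timeouts, which is the property the issue asks to expose outside of tests.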


