Posted to issues@spark.apache.org by "Tom Bar Yacov (JIRA)" <ji...@apache.org> on 2018/11/12 20:41:00 UTC

[jira] [Comment Edited] (SPARK-26008) Structured Streaming Manual clock for simulation

    [ https://issues.apache.org/jira/browse/SPARK-26008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684341#comment-16684341 ] 

Tom Bar Yacov edited comment on SPARK-26008 at 11/12/18 8:40 PM:
-----------------------------------------------------------------

I think this should be a wish, not a question. I reopened it so that the wish can be reviewed and any technical risks of such an implementation can be discussed.


was (Author: tombarya):
I believe this is more a wish not a question. I reopened to allow review of the wish and discuss any technical risks of such implementation.

> Structured Streaming Manual clock for simulation
> ------------------------------------------------
>
>                 Key: SPARK-26008
>                 URL: https://issues.apache.org/jira/browse/SPARK-26008
>             Project: Spark
>          Issue Type: Wish
>          Components: Structured Streaming
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>            Reporter: Tom Bar Yacov
>            Priority: Major
>
> Structured Streaming's internal StreamTest class makes it possible to test incremental logic and verify outputs between multiple triggers. It supports changing the internal Spark clock to get a fully deterministic simulation of the incremental state and APIs. This is not possible outside tests, since DataStreamWriter hides the triggerClock parameter and is final.
> This can be very useful not only in unit tests but also for a real running query. For example, when all the historical Kafka data is persisted to HDFS with its Kafka timestamps, you may want to "play" the data and simulate the streaming application's output as if it were running on this data live, including the incremental output between triggers.
> Currently I can simulate multiple triggers and incremental logic for some of the APIs, but for APIs that depend on the execution clock, such as mapGroupsWithState with an execution-clock-based timeout, I did not find a way to do this.
> I would like to be able to pass an externally controlled clock as a parameter to DataStreamWriter and to the query itself.
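
To illustrate why an externally controlled clock matters for timeout-based state, here is a minimal sketch in plain Python. It is not Spark code and does not use any Spark API: ManualClock, SessionTracker, run_trigger and simulate are hypothetical names chosen for this example, and the 10-second timeout is an arbitrary assumption. The tracker keeps a per-key count and expires a key once the clock passes its deadline, which roughly mimics the kind of clock-dependent timeout described above.

```python
from typing import Dict, List, Tuple


class ManualClock:
    """Hypothetical externally controlled clock, in milliseconds."""

    def __init__(self, start_ms: int = 0) -> None:
        self._now_ms = start_ms

    def now_ms(self) -> int:
        return self._now_ms

    def advance(self, delta_ms: int) -> None:
        # Time only moves when the caller says so, never on its own.
        if delta_ms < 0:
            raise ValueError("clock can only move forward")
        self._now_ms += delta_ms


class SessionTracker:
    """Toy per-key state with a clock-based timeout (illustration only)."""

    def __init__(self, clock: ManualClock, timeout_ms: int) -> None:
        self._clock = clock
        self._timeout_ms = timeout_ms
        self._counts: Dict[str, int] = {}
        self._deadlines: Dict[str, int] = {}

    def run_trigger(self, batch: List[str]) -> List[Tuple[str, int]]:
        """Process one batch of keys; return (key, count) for timed-out keys."""
        now = self._clock.now_ms()
        expired: List[Tuple[str, int]] = []
        # A key times out if it got no data in this batch and its deadline passed.
        for key in sorted(self._deadlines):
            if key not in batch and self._deadlines[key] <= now:
                expired.append((key, self._counts.pop(key)))
                del self._deadlines[key]
        # Update the count and reset the deadline for every key seen in the batch.
        for key in batch:
            self._counts[key] = self._counts.get(key, 0) + 1
            self._deadlines[key] = now + self._timeout_ms
        return expired


def simulate() -> List[List[Tuple[str, int]]]:
    """Replay four triggers, advancing the clock by hand between them."""
    clock = ManualClock()
    tracker = SessionTracker(clock, timeout_ms=10_000)
    outputs = [tracker.run_trigger(["a", "b"])]  # t = 0
    clock.advance(6_000)
    outputs.append(tracker.run_trigger(["a"]))   # t = 6000, "a" is refreshed
    clock.advance(6_000)
    outputs.append(tracker.run_trigger([]))      # t = 12000, "b" times out
    clock.advance(6_000)
    outputs.append(tracker.run_trigger([]))      # t = 18000, "a" times out
    return outputs


if __name__ == "__main__":
    print(simulate())
```

Because the clock advances only when advance() is called, replaying the same batches always produces the same timeouts, regardless of how long the replay takes in wall-clock time. With a system clock, the same replay of historical data would time out keys depending on how fast the machine processes each batch.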


