Posted to issues@spark.apache.org by "Tom Bar Yacov (JIRA)" <ji...@apache.org> on 2018/11/11 15:20:00 UTC

[jira] [Created] (SPARK-26008) Structured Streaming Manual clock for simulation

Tom Bar Yacov created SPARK-26008:
-------------------------------------

             Summary: Structured Streaming Manual clock for simulation
                 Key: SPARK-26008
                 URL: https://issues.apache.org/jira/browse/SPARK-26008
             Project: Spark
          Issue Type: Question
          Components: Structured Streaming
    Affects Versions: 2.4.0, 2.3.2, 2.3.1, 2.3.0
            Reporter: Tom Bar Yacov


The internal Structured Streaming StreamTest class makes it possible to test incremental logic and verify outputs between multiple triggers. It supports changing the internal Spark clock to get a fully deterministic simulation of the incremental state and APIs. This is not possible outside tests, since DataStreamWriter hides the triggerClock parameter and is final.

This can be very useful not only in unit tests but also for a real running query. For example, when all the historical Kafka data is persisted to HDFS with its Kafka timestamps, you may want to "play" the data and simulate the output of the streaming application as if it were running on this data in live streaming, including the incremental output between triggers.

Today I can simulate multiple triggers and incremental logic for some of the APIs, but for APIs that depend on the execution clock, such as mapGroupsWithState with an execution-time-based timeout, I did not find a way to do this.

The question is: is it possible to support a solution similar to the one in StreamTest, that is, to allow passing an external manual clock as a parameter to DataStreamWriter and to give the user external control over this clock? What failures could occur when running with a manual clock in real cluster mode?
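
To make the request concrete, here is a minimal, self-contained sketch of the idea in plain Python. It does not use Spark at all, and every name in it (ManualClock, replay, the record layout) is hypothetical and chosen only for illustration. It assumes records are (timestamp_ms, key) pairs and that a key's state times out once the manually advanced clock has moved timeout_ms past the trigger that last updated it.

```python
class ManualClock:
    """A clock that only moves when the caller advances it (hypothetical helper)."""

    def __init__(self, start_ms=0):
        self._now_ms = start_ms

    def now_ms(self):
        return self._now_ms

    def advance(self, delta_ms):
        if delta_ms < 0:
            raise ValueError("clock cannot move backwards")
        self._now_ms += delta_ms
        return self._now_ms


def replay(records, clock, trigger_interval_ms, timeout_ms):
    """Replay (timestamp_ms, key) records one simulated trigger at a time.

    Returns a list of (trigger_time_ms, expired) tuples, where `expired`
    maps each key that timed out in that trigger to its final count.
    """
    if trigger_interval_ms <= 0:
        raise ValueError("trigger_interval_ms must be positive")

    pending = sorted(records)
    state = {}  # key -> (count, clock time of the trigger that last updated it)
    outputs = []
    i = 0
    while i < len(pending) or state:
        # The only source of time is the manual clock, so every run is identical.
        now = clock.advance(trigger_interval_ms)

        # Feed the trigger every record whose timestamp has been reached.
        while i < len(pending) and pending[i][0] <= now:
            key = pending[i][1]
            count = state.get(key, (0, now))[0]
            state[key] = (count + 1, now)
            i += 1

        # Expire state that has not been updated for at least timeout_ms.
        expired = {k: c for k, (c, seen) in state.items() if now - seen >= timeout_ms}
        for k in expired:
            del state[k]
        outputs.append((now, expired))
    return outputs


if __name__ == "__main__":
    history = [(100, "a"), (900, "a"), (1500, "b")]
    for trigger_time, expired in replay(history, ManualClock(0), 1000, 2000):
        print(trigger_time, expired)
```

Because time advances only through the clock, replaying the same historical records always yields the same per-trigger output, including the timeouts. That is the behaviour I would like to be able to get from a real query.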

Thanks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org