
Posted to issues@spark.apache.org by "Jungtaek Lim (Jira)" <ji...@apache.org> on 2021/10/19 20:46:00 UTC

[jira] [Created] (SPARK-37062) Introduce a new data source for providing consistent set of rows per microbatch

Jungtaek Lim created SPARK-37062:
------------------------------------

             Summary: Introduce a new data source for providing consistent set of rows per microbatch
                 Key: SPARK-37062
                 URL: https://issues.apache.org/jira/browse/SPARK-37062
             Project: Spark
          Issue Type: New Feature
          Components: Structured Streaming
    Affects Versions: 3.3.0
            Reporter: Jungtaek Lim


The "rate" data source is commonly used to benchmark streaming queries.

While this helps to push a query to its limit (how many rows the query can process per second), the rate data source doesn't provide a consistent set of rows per batch to the stream, which makes two environments hard to compare.

For example, in many cases you may want to compare per-batch metrics between test environments (such as running the same streaming query with different options). These metrics are strongly affected when the distribution of input rows across batches changes, especially when a micro-batch lags (for any reason) and the rate data source produces more input rows for the next batch.
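
The effect can be illustrated with a small simulation. This is only a simplified model, not Spark code: it assumes the rate source generates rows at a constant rows-per-second with no ramp-up, and that each micro-batch receives every row generated since the previous batch was planned. The function name and arguments are hypothetical.

```python
def rate_rows_per_batch(rows_per_second, batch_planned_at_seconds):
    """Simplified model of a time-driven source (assumption, not Spark code).

    Each batch receives all rows generated between the time the previous
    batch was planned and the time this batch is planned.
    """
    counts = []
    previous = 0
    for planned_at in batch_planned_at_seconds:
        counts.append(rows_per_second * (planned_at - previous))
        previous = planned_at
    return counts


# Every batch finishes within one second: the input is evenly distributed.
steady = rate_rows_per_batch(100, [1, 2, 3, 4])

# The second batch lags, so the third batch is only planned at second 5
# and has to absorb three seconds' worth of rows.
lagged = rate_rows_per_batch(100, [1, 2, 5, 6])

print(steady)
print(lagged)
```

Both runs use the same query and the same rate, yet the per-batch input differs, so their per-batch metrics are not directly comparable.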

Also, when you test streaming aggregation, you may want the data source to produce the same set of input rows per batch (deterministically), so that you can plan how these input rows will be aggregated and how state rows will be evicted, and craft the test query based on that plan.

The requirements for the new data source are as follows:

* it should produce the requested number of input rows per batch
* it should also include a timestamp (event time) in each row
** to make the input rows fully deterministic, the timestamp should be configurable as well (e.g. a start timestamp and the amount to advance per batch)
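
A minimal sketch of what these requirements imply, written in plain Python rather than against any Spark API. All names here (rows_for_batch, rows_per_batch, start_timestamp, advance_per_batch) are hypothetical, and giving every row in a batch the same event time is an assumption made for illustration, not something this issue specifies.

```python
from datetime import datetime, timedelta, timezone


def rows_for_batch(batch_id, rows_per_batch, start_timestamp, advance_per_batch):
    """Return the rows of one micro-batch as (event_time, value) tuples.

    The output depends only on the arguments, never on wall-clock time,
    so the same batch_id always yields the same rows.
    """
    # Assumption: all rows of a batch share one event time, which advances
    # by a fixed amount from one batch to the next.
    event_time = start_timestamp + batch_id * advance_per_batch
    first_value = batch_id * rows_per_batch
    return [(event_time, first_value + i) for i in range(rows_per_batch)]


start = datetime(2021, 1, 1, tzinfo=timezone.utc)
batch_2 = rows_for_batch(2, 3, start, timedelta(seconds=1))
for row in batch_2:
    print(row)
```

Because the rows are a pure function of the batch number, a lagging batch does not change what the next batch contains, and the event times needed to reason about state eviction can be computed up front.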


