Posted to issues@spark.apache.org by "Jungtaek Lim (Jira)" <ji...@apache.org> on 2022/12/05 07:21:00 UTC

[jira] [Created] (SPARK-41387) Add assertion on end offset range for Kafka data source with Trigger.AvailableNow

Jungtaek Lim created SPARK-41387:
------------------------------------

             Summary: Add assertion on end offset range for Kafka data source with Trigger.AvailableNow
                 Key: SPARK-41387
                 URL: https://issues.apache.org/jira/browse/SPARK-41387
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 3.4.0
            Reporter: Jungtaek Lim


Trigger.AvailableNow provides many benefits, but we have found one caveat: it is very sensitive to the offset range.

Trigger.AvailableNow stops the query when the start offset and the end offset are the same, that is, when the data source produces no data. Given the semantics of Trigger.AvailableNow, the data source implementation is expected to retrieve the final offset at the start of the query and gradually advance the offset range until it eventually reaches the final offset.
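
To make the expected behavior concrete, here is a minimal sketch in plain Python (not Spark code). The function name plan_batches, the per-partition cap max_per_partition, and the dict-based offset representation are all hypothetical and chosen only for illustration. It assumes the final offsets were captured once at query start and that start and final cover the same partitions.

```python
from typing import Dict, List, Tuple

Offsets = Dict[int, int]  # hypothetical model: partition id -> next offset to read


def plan_batches(start: Offsets, final: Offsets,
                 max_per_partition: int) -> List[Tuple[Offsets, Offsets]]:
    """Plan (start, end) offset ranges until the end offset stops moving."""
    batches = []
    current = dict(start)
    while True:
        # Advance each partition by at most max_per_partition,
        # but never past the final offset captured at query start.
        end = {p: min(current[p] + max_per_partition, final[p]) for p in final}
        if end == current:
            # Start and end offsets are the same: no data, so the query stops.
            break
        batches.append((dict(current), end))
        current = end
    return batches


batches = plan_batches({0: 0, 1: 5}, {0: 25, 1: 8}, max_per_partition=10)
```

In this sketch the loop terminates only because every planned end offset converges to the final offset. If the source ever reported an end offset that kept differing from the start offset without reaching the final offset, the loop would never exit.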

Any bug that breaks this expectation makes the query run indefinitely. Hence all data source implementations supporting Trigger.AvailableNow are encouraged to have an assertion that catches such a case up front.

Among the built-in data sources that support Trigger.AvailableNow, only the Kafka data source has no assertion on the offset range. We'd like to add such an assertion to the Kafka data source for Trigger.AvailableNow.
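
As a rough illustration of the kind of check meant here, the following Python sketch validates a planned offset range against the final offsets captured at query start. It is not the actual Spark change: the function name check_offset_range, the partition-to-offset dicts, and the specific conditions checked are assumptions made for this example.

```python
from typing import Dict

Offsets = Dict[int, int]  # hypothetical model: partition id -> offset


def check_offset_range(start: Offsets, end: Offsets, final: Offsets) -> None:
    """Fail fast if a planned range could never converge to the final offsets."""
    for partition, final_offset in final.items():
        # Every partition known at query start must still be planned.
        assert partition in end, f"partition {partition} missing from end offsets"
        end_offset = end[partition]
        # The end offset must not move past the final offset.
        assert end_offset <= final_offset, (
            f"partition {partition}: end offset {end_offset} "
            f"exceeds final offset {final_offset}")
        # The end offset must not move backwards from the start offset.
        assert end_offset >= start.get(partition, 0), (
            f"partition {partition}: end offset {end_offset} "
            f"is behind the start offset")


# A valid range passes silently.
check_offset_range({0: 0}, {0: 10}, {0: 25})
```

Failing with a clear message at planning time is preferable to a query that silently runs forever, which is the failure mode described above.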



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
