Posted to issues@spark.apache.org by "Jungtaek Lim (Jira)" <ji...@apache.org> on 2022/07/18 06:45:00 UTC

[jira] [Created] (SPARK-39805) Deprecate Trigger.Once and Promote Trigger.AvailableNow

Jungtaek Lim created SPARK-39805:
------------------------------------

             Summary: Deprecate Trigger.Once and Promote Trigger.AvailableNow
                 Key: SPARK-39805
                 URL: https://issues.apache.org/jira/browse/SPARK-39805
             Project: Spark
          Issue Type: Task
          Components: Structured Streaming
    Affects Versions: 3.4.0
            Reporter: Jungtaek Lim


Quoting the discussion in spark dev@: [link|https://lists.apache.org/thread/2xnxlxhw245cmspd8nd17cq5doj2c7hc]

Rationale:

Trigger.Once is expected to read all data that became available since the last trigger and process it. This holds when the last run terminated gracefully, but streaming queries are not always terminated gracefully. The last run may have written the offset for a new batch before terminating; in that case, a new run of Trigger.Once only processes the data planned in that latest unfinished batch and does not process newer data.

The behavior is not deterministic from the users' point of view: end users cannot know whether the last run wrote the offset unless they inspect the query's checkpoint themselves.
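
The scenario above can be sketched with a toy model. This is not Spark's implementation: the `Checkpoint` class, the `run_once` and `run_available_now` functions, and the integer offsets are all hypothetical simplifications, assuming only that a batch's end offset is recorded before the batch is processed and committed.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Toy checkpoint (hypothetical): planned offsets plus committed batch ids."""
    offsets: dict = field(default_factory=dict)  # batch id -> planned end offset
    commits: set = field(default_factory=set)    # batch ids that finished


def run_once(cp, source_end, crash_after_offset_write=False):
    """Run a single batch; return the (start, end) range processed, or None on crash."""
    last = max(cp.offsets, default=-1)
    if last >= 0 and last not in cp.commits:
        # An unfinished batch exists: replay it with its recorded end offset,
        # ignoring any data that arrived afterwards.
        batch_id, end = last, cp.offsets[last]
    else:
        batch_id, end = last + 1, source_end
        cp.offsets[batch_id] = end  # the offset is written before processing
        if crash_after_offset_write:
            return None  # simulated ungraceful termination
    start = cp.offsets.get(batch_id - 1, 0)
    cp.commits.add(batch_id)
    return (start, end)


def run_available_now(cp, source_end):
    """Keep running batches until everything available at trigger time is done."""
    processed = []
    while True:
        last = max(cp.offsets, default=-1)
        unfinished = last >= 0 and last not in cp.commits
        caught_up = last >= 0 and cp.offsets[last] >= source_end
        if caught_up and not unfinished:
            return processed
        processed.append(run_once(cp, source_end))


# The previous run planned offsets up to 10 and then died; 5 more records arrived.
cp = Checkpoint()
run_once(cp, source_end=10, crash_after_offset_write=True)
print(run_once(cp, source_end=15))           # (0, 10): records 10..15 are left behind

cp = Checkpoint()
run_once(cp, source_end=10, crash_after_offset_write=True)
print(run_available_now(cp, source_end=15))  # [(0, 10), (10, 15)]
```

In this model the single-batch run cannot tell the user that it merely replayed the unfinished batch, while the looping run finishes the replay and then picks up the newer records.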

While Trigger.AvailableNow was introduced to solve the scalability issue of Trigger.Once, it also ensures that the query tries to process all data available at the time it is triggered, which consistently matches the expected behavior of Trigger.Once.
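
For users, switching triggers is a one-line change. A minimal PySpark sketch, assuming Spark 3.3 or later (where `DataStreamWriter.trigger` accepts `availableNow=True`); the function name, the parquet sink, and the path parameters are placeholders for illustration only.

```python
def start_available_now(df, checkpoint_dir, output_dir):
    """Start a one-shot query on a streaming DataFrame `df` (hypothetical helper).

    The sink format and both paths are illustrative placeholders.
    """
    return (
        df.writeStream
        .format("parquet")
        .option("checkpointLocation", checkpoint_dir)
        .option("path", output_dir)
        # Previously: .trigger(once=True)
        .trigger(availableNow=True)
        .start()
    )
```

The returned query stops on its own once the data available at trigger time has been processed, so callers would typically wait on it with `awaitTermination()`.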

Another issue with Trigger.Once is that it does not trigger a no-data batch immediately. When the watermark is calculated in batch N, it takes effect in batch N + 1. If the query is scheduled to run once per day, the output driven by the new watermark only appears in the next day's run. Trigger.AvailableNow, by contrast, also handles the no-data batch before the query terminates.
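
The one-batch lag of the watermark can be illustrated with another toy model. It is a deliberate simplification, not Spark's logic: `run_batches` is a hypothetical function, events are bare integer timestamps, and an event is emitted as soon as the watermark in effect reaches its timestamp.

```python
def run_batches(batches, delay, run_no_data_batch):
    """Return the events emitted per batch in one query run (toy model).

    batches: list of lists of event timestamps, one inner list per batch.
    delay: allowed lateness subtracted from the max event time seen.
    run_no_data_batch: whether to run one extra empty batch at the end.
    """
    watermark = 0   # watermark in effect for the current batch
    max_seen = 0
    buffered = []
    outputs = []
    for events in batches:
        buffered.extend(events)
        # Emit only what the watermark from the PREVIOUS batch allows.
        outputs.append(sorted(t for t in buffered if t <= watermark))
        buffered = [t for t in buffered if t > watermark]
        # The watermark computed here takes effect in the next batch.
        max_seen = max([max_seen, *events])
        watermark = max(watermark, max_seen - delay)
    if run_no_data_batch:
        # An extra batch with no input applies the latest watermark right away.
        outputs.append(sorted(t for t in buffered if t <= watermark))
    return outputs


print(run_batches([[1, 2, 10]], delay=5, run_no_data_batch=False))  # [[]]
print(run_batches([[1, 2, 10]], delay=5, run_no_data_batch=True))   # [[], [1, 2]]
```

Without the extra batch, events 1 and 2 stay buffered until the next scheduled run even though the watermark has already advanced past them.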

There was no strong feedback in the discussion thread, but given that only a very small number of contributors (including committers/PMC members) are active in the SS area, we have to go with lazy consensus.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
