Posted to dev@spark.apache.org by Jungtaek Lim <ka...@gmail.com> on 2021/12/08 07:16:35 UTC

[Proposal] Deprecate Trigger.Once and replace with Trigger.AvailableNow

Hi dev,

I would like to hear your thoughts on deprecating Trigger.Once and replacing
it with Trigger.AvailableNow [1] in Structured Streaming.

Rationale:

Trigger.Once is expected to read all data that has become available since
the last trigger and process it. This holds when the last run terminated
gracefully, but streaming queries are not always terminated gracefully. The
last run may have written the offset (WAL) for a new batch before
terminating; in that case, a new run of Trigger.Once only processes the data
planned in that latest unfinished batch and does not process newer data.

The behavior is non-deterministic from the users' point of view: end users
cannot know whether the last run wrote the offset unless they inspect the
query's checkpoint themselves.

Although Trigger.AvailableNow was introduced to solve the scalability issue
of Trigger.Once, it also tries to process all data available at the time it
is triggered, so it consistently delivers the behavior users expect from
Trigger.Once.
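
To make the difference concrete, here is a toy Python model of the scenario
described above. It is not Spark code and does not reflect Spark's actual
implementation: Checkpoint, run_once and run_available_now are hypothetical
names, and the "offset log" is just a list of planned end offsets.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    # Planned end offset of each batch (stands in for the offset WAL).
    offsets: list = field(default_factory=list)
    # Number of batches that were fully processed and committed.
    commits: int = 0


def _run_pending_batch(cp, source, out):
    """Process the oldest planned-but-uncommitted batch, then commit it."""
    start = cp.offsets[cp.commits - 1] if cp.commits else 0
    end = cp.offsets[cp.commits]
    out.extend(source[start:end])
    cp.commits += 1


def run_once(cp, source, out):
    """Toy Trigger.Once: run exactly one batch, then stop."""
    if cp.commits == len(cp.offsets):
        # No unfinished batch: plan a new one up to the latest data.
        cp.offsets.append(len(source))
    # Otherwise the unfinished batch is re-run with its old end offset.
    _run_pending_batch(cp, source, out)


def run_available_now(cp, source, out):
    """Toy Trigger.AvailableNow: run until the data seen at start is done."""
    target = len(source)
    if cp.commits < len(cp.offsets):
        _run_pending_batch(cp, source, out)  # finish the unfinished batch
    if not cp.offsets or cp.offsets[-1] < target:
        cp.offsets.append(target)            # then catch up to the target
        _run_pending_batch(cp, source, out)


# The last run wrote the offset for batch 0 (end offset 3) but died before
# committing it. Two more records ("d", "e") arrived afterwards.
source = ["a", "b", "c", "d", "e"]

once_out = []
run_once(Checkpoint(offsets=[3], commits=0), source, once_out)
# once_out == ["a", "b", "c"]; "d" and "e" are left unprocessed

now_out = []
run_available_now(Checkpoint(offsets=[3], commits=0), source, now_out)
# now_out == ["a", "b", "c", "d", "e"]
```

In this model, run_once picks up "d" and "e" only if the previous run had
committed its batch, which is exactly the state end users cannot see without
looking into the checkpoint.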

Proposed Plan:

- Deprecate Trigger.Once in Apache Spark 3.3
- Add guidance on migrating to Trigger.AvailableNow to the migration guide
- Replace all usages of Trigger.Once with Trigger.AvailableNow, except the
test cases of Trigger.Once itself
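
For user code, the migration would likely be a one-line change to the
trigger. A minimal PySpark sketch, assuming Spark 3.3+ where
DataStreamWriter.trigger accepts availableNow=True; the helper name, the
parquet sink and the path parameters are illustrative assumptions, not part
of the proposal.

```python
def start_available_now(df, sink_path, checkpoint_path):
    """Start a one-shot streaming run over `df` (hypothetical helper).

    `df` is expected to be a streaming DataFrame; the sink format and
    the paths are placeholders chosen for illustration.
    """
    return (
        df.writeStream
        .format("parquet")
        .option("checkpointLocation", checkpoint_path)
        # Previously: .trigger(once=True)
        .trigger(availableNow=True)
        .start(sink_path)
    )
```

The rest of the query definition stays the same; only the trigger argument
changes.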

Please review the proposal and share your thoughts.

Thanks!
Jungtaek Lim

1. https://issues.apache.org/jira/browse/SPARK-36533

Re: [Proposal] Deprecate Trigger.Once and replace with Trigger.AvailableNow

Posted by Jungtaek Lim <ka...@gmail.com>.
Friendly reminder. I'll submit the proposed change if no objections are
raised this week.