
Posted to user@spark.apache.org by Bryan Jeffrey <br...@gmail.com> on 2020/03/30 19:50:32 UTC

Data Source - State (SPARK-28190)

Hi, Jungtaek.

We've been investigating the use of Spark Structured Streaming to replace
our Spark Streaming operations.  We have several cases where we're using
mapWithState to maintain state across batches, often with high volumes of
data.  We took a look at the Structured Streaming stateful processing.
Structured Streaming state processing looks great, but has some
shortcomings:
1. State can only be hydrated from a checkpoint, which means that
modification of the state is not possible.
2. You cannot clean up or normalize state data after it has been processed.

These shortcomings appear to be potentially addressed by your
ticket SPARK-28190 - "Data Source - State".  I see little activity on this
ticket. Can you help me to understand where this feature currently stands?

Thank you,

Bryan Jeffrey

Re: Data Source - State (SPARK-28190)

Posted by Jungtaek Lim <ka...@gmail.com>.

Hi Bryan,

Thanks for the interest! Unfortunately, there's a lack of support from
committers for SPARK-28190 (I have been struggling with a lack of support for
Structured Streaming contributions). I hope things will get better, but in
the meantime, could you please try out my own project instead?

https://github.com/HeartSaVioR/spark-state-tools

It's not super convenient to use as of now, as Structured Streaming doesn't
store the schema for state. The schema has to be provided manually or derived
from the actual query. An improvement is being proposed via SPARK-27237, but it
also has no activity right now, due to the same lack of support.

Thanks,
Jungtaek Lim (HeartSaVioR)
