Posted to issues@spark.apache.org by "Jungtaek Lim (JIRA)" <ji...@apache.org> on 2019/06/27 21:28:00 UTC
[jira] [Created] (SPARK-28190) Data Source - State

Jungtaek Lim created SPARK-28190:
------------------------------------

             Summary: Data Source - State
                 Key: SPARK-28190
                 URL: https://issues.apache.org/jira/browse/SPARK-28190
             Project: Spark
          Issue Type: Umbrella
          Components: Structured Streaming
    Affects Versions: 3.0.0
            Reporter: Jungtaek Lim


"State" is becoming one of the most important pieces of data in most streaming frameworks: it is what lets a query keep producing continuous results. In other words, a query may no longer be valid once its state is corrupted or lost.

Ideally we could rerun the query from the beginning of the data to construct brand-new state for the current query, but in reality this may not be possible for many reasons, such as the input data source having limited retention, or the large amount of resources wasted by rerunning from the start.

There are other cases in which end users want to work with state directly, such as creating initial state from existing data via a batch query (given that a batch query could be far more efficient and faster).

I'd like to propose a new data source which handles "state" in batch queries, enabling both reads and writes on state.

Allowing state reads brings a couple of benefits:
 * You can analyze the state from "outside" of your streaming query
 * It could be useful when something can be derived from the existing state of an existing query. Note that state is not designed to be shared among multiple queries.

Allowing state (re)writes brings several major benefits:
 * State can be repartitioned physically
 * The schema of the state can be changed, which means you don't need to rerun the query from the start when the query has to change
 * You can remove state rows if you want, e.g. to reduce state size or to drop corrupt rows
 * You can bootstrap the state of a new query from existing data efficiently, without running the streaming query from the starting point
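
To make the (re)write operations above concrete, here is a toy model in plain Python. It is only a sketch and is not Spark's API or its state store format: the names `partition_of` and `rewrite_state`, the CRC32-based partitioning, and the dict-per-partition layout are all assumptions made for illustration. It shows repartitioning, dropping rows, and a schema change applied in a single batch pass over existing state.

```python
import zlib
from typing import Callable, Dict, List

# Toy model (assumption): state is a list of partitions, each a key -> row dict.
State = List[Dict[str, dict]]


def partition_of(key: str, num_partitions: int) -> int:
    """Pick a partition for a key with a deterministic hash.

    CRC32 is used only for illustration; it is not the partitioner Spark uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def rewrite_state(
    state: State,
    num_partitions: int,
    keep: Callable[[str, dict], bool] = lambda key, row: True,
    migrate: Callable[[dict], dict] = lambda row: row,
) -> State:
    """Rewrite state in one batch pass.

    - repartition: every surviving key is re-hashed into `num_partitions`
    - remove rows: rows for which `keep` returns False are dropped
    - schema change: `migrate` maps each old row to the new row shape
    """
    new_state: State = [dict() for _ in range(num_partitions)]
    for partition in state:
        for key, row in partition.items():
            if not keep(key, row):
                continue
            new_state[partition_of(key, num_partitions)][key] = migrate(row)
    return new_state


# Usage: go from 2 to 4 partitions, drop a corrupt row (negative count),
# and add a hypothetical `last_seen` column to every remaining row.
old_state: State = [
    {"user-1": {"count": 3}, "user-2": {"count": -1}},
    {"user-3": {"count": 7}},
]
new_state = rewrite_state(
    old_state,
    num_partitions=4,
    keep=lambda key, row: row["count"] >= 0,
    migrate=lambda row: {**row, "last_seen": None},
)
```

Bootstrapping state for a new query has the same shape: the input would come from existing batch data instead of an old state, and the output would become the initial state of the new query.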

Btw, I'm planning to contribute my own work ([https://github.com/HeartSaVioR/spark-state-tools]), so many of the sub-issues should not require much effort to submit patches. I'll try to apply the new DSv2, though, which could be a major effort while preparing to donate the code.



