You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@spark.apache.org by Jungtaek Lim <ka...@gmail.com> on 2018/10/19 03:20:31 UTC

Plan on Structured Streaming in next major/minor release?

Hi devs,

While Spark 2.4.0 is still in progress of release votes, I'm seeing some
pull requests on non-SS are being reviewed and merged into master branch,
so I guess discussion about next release is OK.

Looks like there's a major TODO left on structured streaming: allowing
stateful operation in continuous mode (watermark, stateful exactly-once)
and no other major milestone is shared. (Please let me know if I'm missing
here!) As a structured streaming contributor's point of view, there're
another features we could discuss and see which are good to have, and
prioritize if possible (NOTE: just a brainstorming and some items might not
be valid for structured streaming):

* Native support on session window (SPARK-10816 [1])
  ** patch available
* Support delegation token on Kafka (SPARK-25501 [2])
  ** patch available
* Queryable State (SPARK-16738 [3])
  ** some discussion took place, but no action is taken yet
* End to end exactly-once with Kafka sink
  ** given Kafka is the first class on streaming source/sink nowadays
* Custom window / custom watermark
* Physically scale (up/down) streaming state
* State TTL (especially for non-watermark state)
  ** "timeout" in map/flatmapGroupsWithState fits it, but just to check
whether we want to have it for normal streaming aggregation
* Provide discarded events due to late via side output or similar feature
  ** for me it looks like tricky one, since Spark's RDD as well as SQL
semantic provide one output
* more?

Would like to hear others opinions about this. Please also share if
there're ongoing efforts on other items for structured streaming. Happy to
help out if it needs another hand.

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://issues.apache.org/jira/browse/SPARK-10816
2. https://issues.apache.org/jira/browse/SPARK-25501
3. https://issues.apache.org/jira/browse/SPARK-16738

Re: Plan on Structured Streaming in next major/minor release?

Posted by kant kodali <ka...@gmail.com>.

+1 For Raising all this.
+1 For Queryable State (SPARK-16738 [3])

On Thu, Oct 18, 2018 at 9:59 PM Jungtaek Lim <ka...@gmail.com> wrote:

> Small correction: "timeout" in map/flatmapGroupsWithState would not work
> similar as State TTL when event time and watermark is set. So timeout in
> map/flatmapGroupsWithState is to guarantee removal of state when the state
> will not be used, as similar as what we do with streaming aggregation,
> whereas State TTL is just work as its name is represented
> (self-explanatory). Hence State TTL looks valid for all the cases.
>
> 2018년 10월 19일 (금) 오후 12:20, Jungtaek Lim <ka...@gmail.com>님이 작성:
>
>> Hi devs,
>>
>> While Spark 2.4.0 is still in progress of release votes, I'm seeing some
>> pull requests on non-SS are being reviewed and merged into master branch,
>> so I guess discussion about next release is OK.
>>
>> Looks like there's a major TODO left on structured streaming: allowing
>> stateful operation in continuous mode (watermark, stateful exactly-once)
>> and no other major milestone is shared. (Please let me know if I'm missing
>> here!) As a structured streaming contributor's point of view, there're
>> another features we could discuss and see which are good to have, and
>> prioritize if possible (NOTE: just a brainstorming and some items might not
>> be valid for structured streaming):
>>
>> * Native support on session window (SPARK-10816 [1])
>>   ** patch available
>> * Support delegation token on Kafka (SPARK-25501 [2])
>>   ** patch available
>> * Queryable State (SPARK-16738 [3])
>>   ** some discussion took place, but no action is taken yet
>> * End to end exactly-once with Kafka sink
>>   ** given Kafka is the first class on streaming source/sink nowadays
>> * Custom window / custom watermark
>> * Physically scale (up/down) streaming state
>> * State TTL (especially for non-watermark state)
>>   ** "timeout" in map/flatmapGroupsWithState fits it, but just to check
>> whether we want to have it for normal streaming aggregation
>> * Provide discarded events due to late via side output or similar feature
>>   ** for me it looks like tricky one, since Spark's RDD as well as SQL
>> semantic provide one output
>> * more?
>>
>> Would like to hear others opinions about this. Please also share if
>> there're ongoing efforts on other items for structured streaming. Happy to
>> help out if it needs another hand.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-10816
>> 2. https://issues.apache.org/jira/browse/SPARK-25501
>> 3. https://issues.apache.org/jira/browse/SPARK-16738
>>
>>

Re: Plan on Structured Streaming in next major/minor release?

Posted by Stavros Kontopoulos <st...@lightbend.com>.

@Michael any update about queryable state?

Stavros

On Tue, Oct 30, 2018 at 10:43 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> Thanks for bringing up some possible future directions for streaming. Here
> are some thoughts:
>  - I personally view all of the activity on Spark SQL also as activity on
> Structured Streaming. The great thing about building streaming on catalyst
> / tungsten is that continued improvement to these components improves
> streaming use cases as well.
>  - I think the biggest on-going project is DataSourceV2, whose goal is to
> provide a stable / performant API for streaming and batch data sources to
> plug in.  I think connectivity to many different systems is one of the most
> powerful aspects of Spark and right now there is no stable public API for
> streaming. A lot of committer / PMC time is being spent here at the moment.
>  - As you mention, 2.4.0 significantly improves the built in connectivity
> for Kafka, giving us the ability to read exactly once from a topic being
> written to transactional producers. I think projects to extend this
> guarantee to the Kafka Sink and also to improve authentication with Kafka
> are a great idea (and it seems like there is a lot of review activity on
> the latter).
>
> You bring up some other possible projects like session window support.
> This is an interesting project, but as far as I can tell it still there is
> still a lot of work that would need to be done before this feature could be
> merged.  We'd need to understand how it works with update mode amongst
> other things. Additionally, a 3000+ line patch is really time consuming to
> review. This coupled with the fact that all the users that I have
> interacted with need "session windows + some custom business logic"
> (usually implemented with flatMapGroupsWithState), mean that I'm more
> inclined to direct limited review bandwidth to incremental improvements in
> that feature than to something large/new. This is not to say that this
> feature isn't useful / shouldn't be merge, just a bit of explanation as to
> why there might be less activity here than you would hope.
>
> Similarly, multiple aggregations are an often requested feature.  However,
> fundamentally, this is going to be a fairly large investment (I think we'd
> need to combine the unsupported operation checker and the query planner and
> also create a high performance (i.e. whole stage code-gened) aggregation
> operator that understands negation).
>
> Thanks again for starting the discussion, and looking forward to hearing
> about what features are most requested!
>
> On Tue, Oct 30, 2018 at 12:23 AM Jungtaek Lim <ka...@gmail.com> wrote:
>
>> Adding more: again, it doesn't mean they're feasible to do. Just a kind
>> of brainstorming.
>>
>> * SPARK-20568: Delete files after processing in structured streaming
>>   * There hasn't been consensus regarding supporting this: there were
>> voices for both YES and NO.
>> * Support multiple levels of aggregations in structured streaming
>>   * There're plenty of questions in SO regarding this. While I don't
>> think it makes sense on structured streaming if it requires additional
>> shuffle, there might be another case: group by keys, apply aggregation,
>> apply aggregation on aggregated result (grouped keys don't change)
>>
>> 2018년 10월 22일 (월) 오후 12:25, Jungtaek Lim <ka...@gmail.com>님이 작성:
>>
>>> Yeah, the main intention of this thread is to collect interest on
>>> possible feature list for structured streaming. From what I can see in
>>> Spark community, most of the discussions as well as contributions are for
>>> SQL, and I'd wish to see similar activeness / efforts on structured
>>> streaming.
>>> (Unfortunately there's less effort to review others' works - design doc
>>> as well as pull request - most of efforts looks like being spent to their
>>> own works.)
>>>
>>> I respect the role of PMC member, so the final decision would be up to
>>> PMC members, but contributors as well as end users could show the interest
>>> as well as discuss about requirements on SPIP, which could be a good
>>> background to persuade PMC members.
>>>
>>> Before going into the deep I guess we could use this thread to discuss
>>> about possible use cases, and if we would like to move forward to
>>> individual thread we could initiate (or resurrect) its discussion thread.
>>>
>>> For queryable state, at least there seems no workaround in Spark to
>>> provide similar thing, especially state is getting bigger. I may have some
>>> concerns on the details, but I'll add my thought on the discussion thread.
>>>
>>> - Jungtaek Lim (HeartSaVioR)
>>>
>>> 2018년 10월 22일 (월) 오전 1:15, Stavros Kontopoulos <stavros.kontopoulos@
>>> lightbend.com>님이 작성:
>>>
>>>> Hi Jungtaek,
>>>>
>>>> I just tried to start the discussion in the dev list along time ago.
>>>> I enumerated some uses cases as Michael proposed here
>>>> <http://mail-archives.apache.org/mod_mbox/spark-dev/201712.mbox/%3CCACTd3c_snT=y4R9VOD+EBTy1FDgTQsxzGjguboX-k8arAUrp1w@mail.gmail.com%3E>.
>>>> The discussion didn't go further.
>>>>
>>>> If people find it useful we should start discussing it in detail again.
>>>>
>>>> Stavros
>>>>
>>>> On Sun, Oct 21, 2018 at 4:54 PM, Jungtaek Lim <ka...@gmail.com>
>>>> wrote:
>>>>
>>>>> Stavros, if my memory is right, you were trying to drive queryable
>>>>> state, right?
>>>>>
>>>>> Could you summary the progress and the reason why the progress got
>>>>> stopped?
>>>>>
>>>>> 2018년 10월 21일 (일) 오후 10:27, Stavros Kontopoulos <stavros.kontopoulos@
>>>>> lightbend.com>님이 작성:
>>>>>
>>>>>> That is a very interesting list thanks. I could create a design doc
>>>>>> as a starting pointing for discussion if this is a feature we would like to
>>>>>> have.
>>>>>>
>>>>>> Regards,
>>>>>> Stavros
>>>>>>
>>>>>> On Sun, Oct 21, 2018 at 3:04 PM, JackyLee <qc...@163.com> wrote:
>>>>>>
>>>>>>> Thanks for raising them.
>>>>>>>
>>>>>>> FYI, I believe this open issues could also be considered:
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/SPARK-24630
>>>>>>> <https://issues.apache.org/jira/browse/SPARK-24630>
>>>>>>>
>>>>>>> An new ability to express Struct Streaming on pure SQL.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Sent from: http://apache-spark-developers-list.1001551.n3.
>>>>>>> nabble.com/
>>>>>>>
>>>>>>> ------------------------------------------------------------
>>>>>>> ---------
>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>

Re: Plan on Structured Streaming in next major/minor release?

Posted by JackyLee <qc...@163.com>.

Can these things be added into this list?
1. [SPARK-24630] Support SQLStreaming in Spark
      This patch defines the Table API for StructStreaming
2. [SPARK-25937] Support user-defined schema in Kafka Source & Sink
      This patch make user easier to work with StructStreaming
3. SS supports dynamic partition scheduling 
       SS uses the serial execution engine, which means, SS can not catch up
with data output effectively when back pressure or computing speed is
reduced. If the dynamic partition scheduling for SS can be realized, the
partition number will be automatically increased when needed, then SS can
effectively catch up with the calculation speed.The main idea is to replace
time with computing resources.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Plan on Structured Streaming in next major/minor release?

Posted by Jungtaek Lim <ka...@gmail.com>.

My 2 cents, "micro-batch" is the way how Spark handles stream, not a
semantic we are considering. Semantically and ideally, same SQL query
should provide same result between batch and streaming except late events
once the operations in query are supported.

2018년 11월 2일 (금) 오후 3:54, kant kodali <ka...@gmail.com>님이 작성:

> If I can add one thing to this list I would say stateless aggregations
> using Raw SQL.
>
> For example: As I read micro-batches from Kafka I want to do say a count
> of that micro batch and spit it out using Raw SQL . (No Count aggregation
> across batches.)
>
>
>
> On Tue, Oct 30, 2018 at 4:55 PM Jungtaek Lim <ka...@gmail.com> wrote:
>
>> OK thanks for clarifying. I guess it is one of major features in
>> streaming area and nice to add, but also agree it would require huge
>> investigation.
>>
>> 2018년 10월 31일 (수) 오전 8:06, Michael Armbrust <mi...@databricks.com>님이
>> 작성:
>>
>>> Agree. Just curious, could you explain what do you mean by "negation"?
>>>> Does it mean applying retraction on aggregated?
>>>>
>>>
>>> Yeah exactly.  Our current streaming aggregation assumes that the input
>>> is in append-mode and multiple aggregations break this.
>>>
>>

Re: Plan on Structured Streaming in next major/minor release?

Posted by kant kodali <ka...@gmail.com>.

If I can add one thing to this list I would say stateless aggregations
using Raw SQL.

For example: As I read micro-batches from Kafka I want to do say a count of
that micro batch and spit it out using Raw SQL . (No Count aggregation
across batches.)

On Tue, Oct 30, 2018 at 4:55 PM Jungtaek Lim <ka...@gmail.com> wrote:

> OK thanks for clarifying. I guess it is one of major features in streaming
> area and nice to add, but also agree it would require huge investigation.
>
> 2018년 10월 31일 (수) 오전 8:06, Michael Armbrust <mi...@databricks.com>님이 작성:
>
>> Agree. Just curious, could you explain what do you mean by "negation"?
>>> Does it mean applying retraction on aggregated?
>>>
>>
>> Yeah exactly.  Our current streaming aggregation assumes that the input
>> is in append-mode and multiple aggregations break this.
>>
>

Re: Plan on Structured Streaming in next major/minor release?

Posted by Jungtaek Lim <ka...@gmail.com>.

OK thanks for clarifying. I guess it is one of major features in streaming
area and nice to add, but also agree it would require huge investigation.

2018년 10월 31일 (수) 오전 8:06, Michael Armbrust <mi...@databricks.com>님이 작성:

> Agree. Just curious, could you explain what do you mean by "negation"?
>> Does it mean applying retraction on aggregated?
>>
>
> Yeah exactly.  Our current streaming aggregation assumes that the input is
> in append-mode and multiple aggregations break this.
>

Re: Plan on Structured Streaming in next major/minor release?

Posted by Michael Armbrust <mi...@databricks.com>.

>
> Agree. Just curious, could you explain what do you mean by "negation"?
> Does it mean applying retraction on aggregated?
>

Yeah exactly.  Our current streaming aggregation assumes that the input is
in append-mode and multiple aggregations break this.

Re: Plan on Structured Streaming in next major/minor release?

Posted by Jungtaek Lim <ka...@gmail.com>.

Thanks Micheal for explaining activity on SS as well as giving opinion on
some items!

Replying inline.

2018년 10월 31일 (수) 오전 5:44, Michael Armbrust <mi...@databricks.com>님이 작성:

> Thanks for bringing up some possible future directions for streaming. Here
> are some thoughts:
>  - I personally view all of the activity on Spark SQL also as activity on
> Structured Streaming. The great thing about building streaming on catalyst
> / tungsten is that continued improvement to these components improves
> streaming use cases as well.
>

While I agree with you (in terms of performance and improvement on built-in
functions), like I enumerated, streaming area has its own features which
require another efforts. It would be also great if someone resumes putting
major efforts on continuous mode (Spark specific project): I guess we were
waiting on barrier execution. I'm happy to help on reviewing design doc,
taking up implementing part(s) of.


>  - I think the biggest on-going project is DataSourceV2, whose goal is to
> provide a stable / performant API for streaming and batch data sources to
> plug in.  I think connectivity to many different systems is one of the most
> powerful aspects of Spark and right now there is no stable public API for
> streaming. A lot of committer / PMC time is being spent here at the moment.
>

100% agree that DSv2 should be the thing to be stabilized sooner than
later, and understand major efforts are going there.


>  - As you mention, 2.4.0 significantly improves the built in connectivity
> for Kafka, giving us the ability to read exactly once from a topic being
> written to transactional producers. I think projects to extend this
> guarantee to the Kafka Sink and also to improve authentication with Kafka
> are a great idea (and it seems like there is a lot of review activity on
> the latter).
>

Actually I was spending time to design former, and realized that it should
give up either scalability or transactional to respect Spark's contract on
exactly-once. (Most of storages don't support transaction on multiple
connections so transaction can't be achieved among tasks. They also don't
support moving data without resending.) That's what I sent a mail in
different mail thread on lessening contract. I think it is related to DSv2
and need to be considered while discussing DSv2, since the issue is not
only for Kafka, but also most of external storages.


> You bring up some other possible projects like session window support.
> This is an interesting project, but as far as I can tell it still there is
> still a lot of work that would need to be done before this feature could be
> merged.  We'd need to understand how it works with update mode amongst
> other things. Additionally, a 3000+ line patch is really time consuming to
> review. This coupled with the fact that all the users that I have
> interacted with need "session windows + some custom business logic"
> (usually implemented with flatMapGroupsWithState), mean that I'm more
> inclined to direct limited review bandwidth to incremental improvements in
> that feature than to something large/new. This is not to say that this
> feature isn't useful / shouldn't be merge, just a bit of explanation as to
> why there might be less activity here than you would hope.
>

Yeah while I would like to get another feedbacks on session window stuff
(because without feedback I need to explore all possible paths by myself
without any help), I didn't intend to get attraction on session window in
this mail thread. The rationalization on this mail thread is to get
attraction on broader area: features support on streaming area.

Anyway, thanks for explaining! For individual contributors, determining
whether the proposal is (softly) rejected or not is very important in terms
of further investigation, and it helped much on understanding current
status.


> Similarly, multiple aggregations are an often requested feature.  However,
> fundamentally, this is going to be a fairly large investment (I think we'd
> need to combine the unsupported operation checker and the query planner and
> also create a high performance (i.e. whole stage code-gened) aggregation
> operator that understands negation).
>

Agree. Just curious, could you explain what do you mean by "negation"? Does
it mean applying retraction on aggregated?


> Thanks again for starting the discussion, and looking forward to hearing
> about what features are most requested!
>
> On Tue, Oct 30, 2018 at 12:23 AM Jungtaek Lim <ka...@gmail.com> wrote:
>
>> Adding more: again, it doesn't mean they're feasible to do. Just a kind
>> of brainstorming.
>>
>> * SPARK-20568: Delete files after processing in structured streaming
>>   * There hasn't been consensus regarding supporting this: there were
>> voices for both YES and NO.
>> * Support multiple levels of aggregations in structured streaming
>>   * There're plenty of questions in SO regarding this. While I don't
>> think it makes sense on structured streaming if it requires additional
>> shuffle, there might be another case: group by keys, apply aggregation,
>> apply aggregation on aggregated result (grouped keys don't change)
>>
>> 2018년 10월 22일 (월) 오후 12:25, Jungtaek Lim <ka...@gmail.com>님이 작성:
>>
>>> Yeah, the main intention of this thread is to collect interest on
>>> possible feature list for structured streaming. From what I can see in
>>> Spark community, most of the discussions as well as contributions are for
>>> SQL, and I'd wish to see similar activeness / efforts on structured
>>> streaming.
>>> (Unfortunately there's less effort to review others' works - design doc
>>> as well as pull request - most of efforts looks like being spent to their
>>> own works.)
>>>
>>> I respect the role of PMC member, so the final decision would be up to
>>> PMC members, but contributors as well as end users could show the interest
>>> as well as discuss about requirements on SPIP, which could be a good
>>> background to persuade PMC members.
>>>
>>> Before going into the deep I guess we could use this thread to discuss
>>> about possible use cases, and if we would like to move forward to
>>> individual thread we could initiate (or resurrect) its discussion thread.
>>>
>>> For queryable state, at least there seems no workaround in Spark to
>>> provide similar thing, especially state is getting bigger. I may have some
>>> concerns on the details, but I'll add my thought on the discussion thread.
>>>
>>> - Jungtaek Lim (HeartSaVioR)
>>>
>>> 2018년 10월 22일 (월) 오전 1:15, Stavros Kontopoulos <
>>> stavros.kontopoulos@lightbend.com>님이 작성:
>>>
>>>> Hi Jungtaek,
>>>>
>>>> I just tried to start the discussion in the dev list along time ago.
>>>> I enumerated some uses cases as Michael proposed here
>>>> <http://mail-archives.apache.org/mod_mbox/spark-dev/201712.mbox/%3CCACTd3c_snT=y4R9VOD+EBTy1FDgTQsxzGjguboX-k8arAUrp1w@mail.gmail.com%3E>.
>>>> The discussion didn't go further.
>>>>
>>>> If people find it useful we should start discussing it in detail again.
>>>>
>>>> Stavros
>>>>
>>>> On Sun, Oct 21, 2018 at 4:54 PM, Jungtaek Lim <ka...@gmail.com>
>>>> wrote:
>>>>
>>>>> Stavros, if my memory is right, you were trying to drive queryable
>>>>> state, right?
>>>>>
>>>>> Could you summary the progress and the reason why the progress got
>>>>> stopped?
>>>>>
>>>>> 2018년 10월 21일 (일) 오후 10:27, Stavros Kontopoulos <
>>>>> stavros.kontopoulos@lightbend.com>님이 작성:
>>>>>
>>>>>> That is a very interesting list thanks. I could create a design doc
>>>>>> as a starting pointing for discussion if this is a feature we would like to
>>>>>> have.
>>>>>>
>>>>>> Regards,
>>>>>> Stavros
>>>>>>
>>>>>> On Sun, Oct 21, 2018 at 3:04 PM, JackyLee <qc...@163.com> wrote:
>>>>>>
>>>>>>> Thanks for raising them.
>>>>>>>
>>>>>>> FYI, I believe this open issues could also be considered:
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/SPARK-24630
>>>>>>> <https://issues.apache.org/jira/browse/SPARK-24630>
>>>>>>>
>>>>>>> An new ability to express Struct Streaming on pure SQL.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Sent from:
>>>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>

Re: Plan on Structured Streaming in next major/minor release?

Posted by Michael Armbrust <mi...@databricks.com>.

Thanks for bringing up some possible future directions for streaming. Here
are some thoughts:
 - I personally view all of the activity on Spark SQL also as activity on
Structured Streaming. The great thing about building streaming on catalyst
/ tungsten is that continued improvement to these components improves
streaming use cases as well.
 - I think the biggest on-going project is DataSourceV2, whose goal is to
provide a stable / performant API for streaming and batch data sources to
plug in.  I think connectivity to many different systems is one of the most
powerful aspects of Spark and right now there is no stable public API for
streaming. A lot of committer / PMC time is being spent here at the moment.
 - As you mention, 2.4.0 significantly improves the built in connectivity
for Kafka, giving us the ability to read exactly once from a topic being
written to transactional producers. I think projects to extend this
guarantee to the Kafka Sink and also to improve authentication with Kafka
are a great idea (and it seems like there is a lot of review activity on
the latter).

You bring up some other possible projects like session window support.
This is an interesting project, but as far as I can tell it still there is
still a lot of work that would need to be done before this feature could be
merged.  We'd need to understand how it works with update mode amongst
other things. Additionally, a 3000+ line patch is really time consuming to
review. This coupled with the fact that all the users that I have
interacted with need "session windows + some custom business logic"
(usually implemented with flatMapGroupsWithState), mean that I'm more
inclined to direct limited review bandwidth to incremental improvements in
that feature than to something large/new. This is not to say that this
feature isn't useful / shouldn't be merge, just a bit of explanation as to
why there might be less activity here than you would hope.

Similarly, multiple aggregations are an often requested feature.  However,
fundamentally, this is going to be a fairly large investment (I think we'd
need to combine the unsupported operation checker and the query planner and
also create a high performance (i.e. whole stage code-gened) aggregation
operator that understands negation).

Thanks again for starting the discussion, and looking forward to hearing
about what features are most requested!

On Tue, Oct 30, 2018 at 12:23 AM Jungtaek Lim <ka...@gmail.com> wrote:

> Adding more: again, it doesn't mean they're feasible to do. Just a kind of
> brainstorming.
>
> * SPARK-20568: Delete files after processing in structured streaming
>   * There hasn't been consensus regarding supporting this: there were
> voices for both YES and NO.
> * Support multiple levels of aggregations in structured streaming
>   * There're plenty of questions in SO regarding this. While I don't think
> it makes sense on structured streaming if it requires additional shuffle,
> there might be another case: group by keys, apply aggregation, apply
> aggregation on aggregated result (grouped keys don't change)
>
> 2018년 10월 22일 (월) 오후 12:25, Jungtaek Lim <ka...@gmail.com>님이 작성:
>
>> Yeah, the main intention of this thread is to collect interest on
>> possible feature list for structured streaming. From what I can see in
>> Spark community, most of the discussions as well as contributions are for
>> SQL, and I'd wish to see similar activeness / efforts on structured
>> streaming.
>> (Unfortunately there's less effort to review others' works - design doc
>> as well as pull request - most of efforts looks like being spent to their
>> own works.)
>>
>> I respect the role of PMC member, so the final decision would be up to
>> PMC members, but contributors as well as end users could show the interest
>> as well as discuss about requirements on SPIP, which could be a good
>> background to persuade PMC members.
>>
>> Before going into the deep I guess we could use this thread to discuss
>> about possible use cases, and if we would like to move forward to
>> individual thread we could initiate (or resurrect) its discussion thread.
>>
>> For queryable state, at least there seems no workaround in Spark to
>> provide similar thing, especially state is getting bigger. I may have some
>> concerns on the details, but I'll add my thought on the discussion thread.
>>
>> - Jungtaek Lim (HeartSaVioR)
>>
>> 2018년 10월 22일 (월) 오전 1:15, Stavros Kontopoulos <
>> stavros.kontopoulos@lightbend.com>님이 작성:
>>
>>> Hi Jungtaek,
>>>
>>> I just tried to start the discussion in the dev list along time ago.
>>> I enumerated some uses cases as Michael proposed here
>>> <http://mail-archives.apache.org/mod_mbox/spark-dev/201712.mbox/%3CCACTd3c_snT=y4R9VOD+EBTy1FDgTQsxzGjguboX-k8arAUrp1w@mail.gmail.com%3E>.
>>> The discussion didn't go further.
>>>
>>> If people find it useful we should start discussing it in detail again.
>>>
>>> Stavros
>>>
>>> On Sun, Oct 21, 2018 at 4:54 PM, Jungtaek Lim <ka...@gmail.com> wrote:
>>>
>>>> Stavros, if my memory is right, you were trying to drive queryable
>>>> state, right?
>>>>
>>>> Could you summary the progress and the reason why the progress got
>>>> stopped?
>>>>
>>>> 2018년 10월 21일 (일) 오후 10:27, Stavros Kontopoulos <
>>>> stavros.kontopoulos@lightbend.com>님이 작성:
>>>>
>>>>> That is a very interesting list thanks. I could create a design doc
>>>>> as a starting pointing for discussion if this is a feature we would like to
>>>>> have.
>>>>>
>>>>> Regards,
>>>>> Stavros
>>>>>
>>>>> On Sun, Oct 21, 2018 at 3:04 PM, JackyLee <qc...@163.com> wrote:
>>>>>
>>>>>> Thanks for raising them.
>>>>>>
>>>>>> FYI, I believe this open issues could also be considered:
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/SPARK-24630
>>>>>> <https://issues.apache.org/jira/browse/SPARK-24630>
>>>>>>
>>>>>> An new ability to express Struct Streaming on pure SQL.
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>

Re: Plan on Structured Streaming in next major/minor release?

Posted by Jungtaek Lim <ka...@gmail.com>.

Adding more: again, it doesn't mean they're feasible to do. Just a kind of
brainstorming.

* SPARK-20568: Delete files after processing in structured streaming
  * There hasn't been consensus regarding supporting this: there were
voices for both YES and NO.
* Support multiple levels of aggregations in structured streaming
  * There're plenty of questions in SO regarding this. While I don't think
it makes sense on structured streaming if it requires additional shuffle,
there might be another case: group by keys, apply aggregation, apply
aggregation on aggregated result (grouped keys don't change)

2018년 10월 22일 (월) 오후 12:25, Jungtaek Lim <ka...@gmail.com>님이 작성:

> Yeah, the main intention of this thread is to collect interest on possible
> feature list for structured streaming. From what I can see in Spark
> community, most of the discussions as well as contributions are for SQL,
> and I'd wish to see similar activeness / efforts on structured streaming.
> (Unfortunately there's less effort to review others' works - design doc as
> well as pull request - most of efforts looks like being spent to their own
> works.)
>
> I respect the role of PMC member, so the final decision would be up to PMC
> members, but contributors as well as end users could show the interest as
> well as discuss about requirements on SPIP, which could be a good
> background to persuade PMC members.
>
> Before going into the deep I guess we could use this thread to discuss
> about possible use cases, and if we would like to move forward to
> individual thread we could initiate (or resurrect) its discussion thread.
>
> For queryable state, at least there seems no workaround in Spark to
> provide similar thing, especially state is getting bigger. I may have some
> concerns on the details, but I'll add my thought on the discussion thread.
>
> - Jungtaek Lim (HeartSaVioR)
>
> 2018년 10월 22일 (월) 오전 1:15, Stavros Kontopoulos <
> stavros.kontopoulos@lightbend.com>님이 작성:
>
>> Hi Jungtaek,
>>
>> I just tried to start the discussion in the dev list along time ago.
>> I enumerated some uses cases as Michael proposed here
>> <http://mail-archives.apache.org/mod_mbox/spark-dev/201712.mbox/%3CCACTd3c_snT=y4R9VOD+EBTy1FDgTQsxzGjguboX-k8arAUrp1w@mail.gmail.com%3E>.
>> The discussion didn't go further.
>>
>> If people find it useful we should start discussing it in detail again.
>>
>> Stavros
>>
>> On Sun, Oct 21, 2018 at 4:54 PM, Jungtaek Lim <ka...@gmail.com> wrote:
>>
>>> Stavros, if my memory is right, you were trying to drive queryable
>>> state, right?
>>>
>>> Could you summary the progress and the reason why the progress got
>>> stopped?
>>>
>>> 2018년 10월 21일 (일) 오후 10:27, Stavros Kontopoulos <
>>> stavros.kontopoulos@lightbend.com>님이 작성:
>>>
>>>> That is a very interesting list thanks. I could create a design doc as
>>>> a starting pointing for discussion if this is a feature we would like to
>>>> have.
>>>>
>>>> Regards,
>>>> Stavros
>>>>
>>>> On Sun, Oct 21, 2018 at 3:04 PM, JackyLee <qc...@163.com> wrote:
>>>>
>>>>> Thanks for raising them.
>>>>>
>>>>> FYI, I believe this open issues could also be considered:
>>>>>
>>>>> https://issues.apache.org/jira/browse/SPARK-24630
>>>>> <https://issues.apache.org/jira/browse/SPARK-24630>
>>>>>
>>>>> An new ability to express Struct Streaming on pure SQL.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>
>>

Re: Plan on Structured Streaming in next major/minor release?

Posted by Jungtaek Lim <ka...@gmail.com>.

Yeah, the main intention of this thread is to collect interest on possible
feature list for structured streaming. From what I can see in Spark
community, most of the discussions as well as contributions are for SQL,
and I'd wish to see similar activeness / efforts on structured streaming.
(Unfortunately there's less effort to review others' works - design doc as
well as pull request - most of efforts looks like being spent to their own
works.)

I respect the role of PMC member, so the final decision would be up to PMC
members, but contributors as well as end users could show the interest as
well as discuss about requirements on SPIP, which could be a good
background to persuade PMC members.

Before going into the deep I guess we could use this thread to discuss
about possible use cases, and if we would like to move forward to
individual thread we could initiate (or resurrect) its discussion thread.

For queryable state, at least there seems no workaround in Spark to provide
similar thing, especially state is getting bigger. I may have some concerns
on the details, but I'll add my thought on the discussion thread.

- Jungtaek Lim (HeartSaVioR)

2018년 10월 22일 (월) 오전 1:15, Stavros Kontopoulos <
stavros.kontopoulos@lightbend.com>님이 작성:

> Hi Jungtaek,
>
> I just tried to start the discussion in the dev list along time ago.
> I enumerated some uses cases as Michael proposed here
> <http://mail-archives.apache.org/mod_mbox/spark-dev/201712.mbox/%3CCACTd3c_snT=y4R9VOD+EBTy1FDgTQsxzGjguboX-k8arAUrp1w@mail.gmail.com%3E>.
> The discussion didn't go further.
>
> If people find it useful we should start discussing it in detail again.
>
> Stavros
>
> On Sun, Oct 21, 2018 at 4:54 PM, Jungtaek Lim <ka...@gmail.com> wrote:
>
>> Stavros, if my memory is right, you were trying to drive queryable state,
>> right?
>>
>> Could you summary the progress and the reason why the progress got
>> stopped?
>>
>> 2018년 10월 21일 (일) 오후 10:27, Stavros Kontopoulos <
>> stavros.kontopoulos@lightbend.com>님이 작성:
>>
>>> That is a very interesting list thanks. I could create a design doc as
>>> a starting pointing for discussion if this is a feature we would like to
>>> have.
>>>
>>> Regards,
>>> Stavros
>>>
>>> On Sun, Oct 21, 2018 at 3:04 PM, JackyLee <qc...@163.com> wrote:
>>>
>>>> Thanks for raising them.
>>>>
>>>> FYI, I believe this open issues could also be considered:
>>>>
>>>> https://issues.apache.org/jira/browse/SPARK-24630
>>>> <https://issues.apache.org/jira/browse/SPARK-24630>
>>>>
>>>> An new ability to express Struct Streaming on pure SQL.
>>>>
>>>>
>>>>
>>>> --
>>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>
>>>>
>>>
>>>
>>>
>>>
>
>

Re: Plan on Structured Streaming in next major/minor release?

Posted by Stavros Kontopoulos <st...@lightbend.com>.

Hi Jungtaek,

I just tried to start the discussion in the dev list along time ago.
I enumerated some uses cases as Michael proposed here
<http://mail-archives.apache.org/mod_mbox/spark-dev/201712.mbox/%3CCACTd3c_snT=y4R9VOD+EBTy1FDgTQsxzGjguboX-k8arAUrp1w@mail.gmail.com%3E>.
The discussion didn't go further.

If people find it useful we should start discussing it in detail again.

Stavros

On Sun, Oct 21, 2018 at 4:54 PM, Jungtaek Lim <ka...@gmail.com> wrote:

> Stavros, if my memory is right, you were trying to drive queryable state,
> right?
>
> Could you summary the progress and the reason why the progress got stopped?
>
> 2018년 10월 21일 (일) 오후 10:27, Stavros Kontopoulos <
> stavros.kontopoulos@lightbend.com>님이 작성:
>
>> That is a very interesting list thanks. I could create a design doc as a
>> starting pointing for discussion if this is a feature we would like to have.
>>
>> Regards,
>> Stavros
>>
>> On Sun, Oct 21, 2018 at 3:04 PM, JackyLee <qc...@163.com> wrote:
>>
>>> Thanks for raising them.
>>>
>>> FYI, I believe this open issues could also be considered:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-24630
>>> <https://issues.apache.org/jira/browse/SPARK-24630>
>>>
>>> An new ability to express Struct Streaming on pure SQL.
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>
>>>
>>
>>
>>
>>

Re: Plan on Structured Streaming in next major/minor release?

Posted by Jungtaek Lim <ka...@gmail.com>.

Stavros, if my memory is right, you were trying to drive queryable state,
right?

Could you summary the progress and the reason why the progress got stopped?

2018년 10월 21일 (일) 오후 10:27, Stavros Kontopoulos <
stavros.kontopoulos@lightbend.com>님이 작성:

> That is a very interesting list thanks. I could create a design doc as a
> starting pointing for discussion if this is a feature we would like to have.
>
> Regards,
> Stavros
>
> On Sun, Oct 21, 2018 at 3:04 PM, JackyLee <qc...@163.com> wrote:
>
>> Thanks for raising them.
>>
>> FYI, I believe this open issues could also be considered:
>>
>> https://issues.apache.org/jira/browse/SPARK-24630
>> <https://issues.apache.org/jira/browse/SPARK-24630>
>>
>> An new ability to express Struct Streaming on pure SQL.
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>
>
>
>
>

Re: Plan on Structured Streaming in next major/minor release?

Posted by Stavros Kontopoulos <st...@lightbend.com>.

That is a very interesting list thanks. I could create a design doc as a
starting pointing for discussion if this is a feature we would like to have.

Regards,
Stavros

On Sun, Oct 21, 2018 at 3:04 PM, JackyLee <qc...@163.com> wrote:

> Thanks for raising them.
>
> FYI, I believe this open issues could also be considered:
>
> https://issues.apache.org/jira/browse/SPARK-24630
> <https://issues.apache.org/jira/browse/SPARK-24630>
>
> An new ability to express Struct Streaming on pure SQL.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: Plan on Structured Streaming in next major/minor release?

Posted by JackyLee <qc...@163.com>.

Thanks for raising them.

FYI, I believe this open issues could also be considered:

https://issues.apache.org/jira/browse/SPARK-24630
<https://issues.apache.org/jira/browse/SPARK-24630>  

An new ability to express Struct Streaming on pure SQL. 



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Plan on Structured Streaming in next major/minor release?

Posted by Jungtaek Lim <ka...@gmail.com>.

Small correction: "timeout" in map/flatmapGroupsWithState would not work
similar as State TTL when event time and watermark is set. So timeout in
map/flatmapGroupsWithState is to guarantee removal of state when the state
will not be used, as similar as what we do with streaming aggregation,
whereas State TTL is just work as its name is represented
(self-explanatory). Hence State TTL looks valid for all the cases.

2018년 10월 19일 (금) 오후 12:20, Jungtaek Lim <ka...@gmail.com>님이 작성:

> Hi devs,
>
> While Spark 2.4.0 is still in progress of release votes, I'm seeing some
> pull requests on non-SS are being reviewed and merged into master branch,
> so I guess discussion about next release is OK.
>
> Looks like there's a major TODO left on structured streaming: allowing
> stateful operation in continuous mode (watermark, stateful exactly-once)
> and no other major milestone is shared. (Please let me know if I'm missing
> here!) As a structured streaming contributor's point of view, there're
> another features we could discuss and see which are good to have, and
> prioritize if possible (NOTE: just a brainstorming and some items might not
> be valid for structured streaming):
>
> * Native support on session window (SPARK-10816 [1])
>   ** patch available
> * Support delegation token on Kafka (SPARK-25501 [2])
>   ** patch available
> * Queryable State (SPARK-16738 [3])
>   ** some discussion took place, but no action is taken yet
> * End to end exactly-once with Kafka sink
>   ** given Kafka is the first class on streaming source/sink nowadays
> * Custom window / custom watermark
> * Physically scale (up/down) streaming state
> * State TTL (especially for non-watermark state)
>   ** "timeout" in map/flatmapGroupsWithState fits it, but just to check
> whether we want to have it for normal streaming aggregation
> * Provide discarded events due to late via side output or similar feature
>   ** for me it looks like tricky one, since Spark's RDD as well as SQL
> semantic provide one output
> * more?
>
> Would like to hear others opinions about this. Please also share if
> there're ongoing efforts on other items for structured streaming. Happy to
> help out if it needs another hand.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1. https://issues.apache.org/jira/browse/SPARK-10816
> 2. https://issues.apache.org/jira/browse/SPARK-25501
> 3. https://issues.apache.org/jira/browse/SPARK-16738
>
>