Posted to dev@spark.apache.org by 张万新 <ke...@gmail.com> on 2019/05/20 08:12:54 UTC

What's the root cause of not supporting multiple aggregations in structured streaming?

Hi there,

I'd like to know the root reason why multiple aggregations on a
streaming DataFrame are not allowed, since it's a very useful feature and
Flink has supported it for a long time.

Thanks.

Re: What's the root cause of not supporting multiple aggregations in structured streaming?

Posted by Jungtaek Lim <ka...@gmail.com>.
To be clear, what Arun meant in the old PR is that watermark and output mode
are unrelated concerns. As long as we deal only with the watermark, the feature
is limited to append mode anyway. So in this phase we don't (and shouldn't)
bring output mode into the discussion and complicate things, unless we really
have a solid plan to introduce retraction.
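
To illustrate why chaining aggregations in "update" mode needs retraction
(the point Arun makes in the quoted thread), here is a minimal toy sketch in
plain Python. It is not Spark code: the function names and the event list are
made up for illustration, and the model assumes the first aggregation emits a
new (key, count) row every time a count changes.

```python
from collections import defaultdict


def upstream_updates(events):
    """First aggregation in a toy 'update' mode: emit (key, new_count) on every change."""
    counts = defaultdict(int)
    for key in events:
        counts[key] += 1
        yield key, counts[key]


def downstream_naive(updates):
    """Second aggregation that treats every update as a brand-new row."""
    total = 0
    for _key, count in updates:
        total += count
    return total


def downstream_with_retraction(updates):
    """Second aggregation that retracts the previous value for a key before adding the new one."""
    last = {}
    total = 0
    for key, count in updates:
        total -= last.get(key, 0)  # retract the stale partial count
        total += count
        last[key] = count
    return total


events = ["a", "a", "b", "a"]
naive_total = downstream_naive(upstream_updates(events))
correct_total = downstream_with_retraction(upstream_updates(events))
print(naive_total, correct_total)
```

The naive total counts every intermediate update of a key again, while the
retracting version ends up with the number of input events. Append mode
avoids the problem in this toy model because each upstream result would be
emitted only once, after it is final.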

On Fri, Nov 27, 2020 at 12:08 PM Yuanjian Li <xy...@gmail.com> wrote:

> Nice blog! Thanks for sharing, Etienne!
>
> Let's try to raise this discussion again after the 3.1 release. I do think
> more committers/contributors had realized the issue of global watermark per
> SPARK-24634 <https://issues.apache.org/jira/browse/SPARK-24634> and
> SPARK-33259 <https://issues.apache.org/jira/browse/SPARK-33259>.
>
> Leaving some thoughts on my end:
> 1. Compatibility: The per-operation watermark should be compatible with
> the original global one when there are no multi-aggregations.
> 2. Versioning: If we need to change checkpoints' format, versioning info
> should be added for the first time.
> 3. Fix more things together: We'd better fix more issues(e.g.
> per-operation output mode for multi-aggregations) together, which would
> require versioning changes in the same Spark version.
>
> Best,
> Yuanjian
>
>
> Etienne Chauchot <ec...@apache.org> wrote on Thu, Nov 26, 2020 at 5:29 PM:
>
>> Hi,
>>
>> Regarding this subject I wrote a blog article that gives details about
>> the watermark architecture proposal that was discussed in the design doc
>> and in the PR:
>>
>>
>> https://echauchot.blogspot.com/2020/11/watermark-architecture-proposal-for.html
>>
>> Best
>>
>> Etienne
>> On 29/09/2020 03:24, Yuanjian Li wrote:
>>
>> Thanks for the great discussion!
>>
>> Also interested in this feature and did some investigation before. As
>> Arun mentioned, similar to the "update" mode, the "complete" mode also
>> needs more design. We might need an operation level output mode for the
>> complete mode support. That is to say, if we use "complete" mode for every
>> aggregation operators, the wrong result will return.
>>
>> SPARK-26655 would be a good start, which only considers about "append"
>> mode. Maybe we need more discussion on the watermark interface. I will take
>> a close look at the doc and PR. Hope we will have the first version with
>> limitations and fix/remove them gradually.
>>
>> Best,
>> Yuanjian
>>
>> Jungtaek Lim <ka...@gmail.com> wrote on Sat, Sep 26, 2020 at 10:31 AM:
>>
>>> Thanks Etienne! Yeah I forgot to say nice talking with you again. And
>>> sorry I forgot to send the reply (was in draft).
>>>
>>> Regarding investment in SS, well, unfortunately I don't know - I'm just
>>> an individual. There might be various reasons to do so, most probably
>>> "priority" among the stuff. There's not much I could change.
>>>
>>> I agree the workaround is sub-optimal, but unless I see sufficient
>>> support in the community I probably couldn't make it go forward. I'll just
>>> say there's an elephant in the room - as the project goes forward for more
>>> than 10 years, backward compatibility is a top priority concern in the
>>> project, even across the major versions along the features/APIs. It is
>>> great for end users to migrate the version easily, but also blocks devs to
>>> fix the bad design once it ships. I'm the one complaining about these
>>> issues in the dev list, and I don't see willingness to correct them.
>>>
>>>
>>> On Fri, Sep 4, 2020 at 5:55 PM Etienne Chauchot <ec...@apache.org>
>>> wrote:
>>>
>>>> Hi Jungtaek Lim,
>>>>
>>>> Nice to hear from you again since last time we talked :) and congrats
>>>> on becoming a Spark committer in the meantime ! (if I'm not mistaking you
>>>> were not at the time)
>>>>
>>>> I totally agree with what you're saying on merging structural parts of
>>>> Spark without having a broader consensus. What I don't understand is why
>>>> there is not more investment in SS. Especially because in another thread
>>>> the community is discussing about deprecating the regular DStream streaming
>>>> framework.
>>>>
>>>> Is the orientation of Spark now mostly batch ?
>>>>
>>>> PS: yeah I saw your update on the doc when I took a look at 3.0 preview
>>>> 2 searching for this particular feature. And regarding the workaround, I'm
>>>> not sure it meets my needs as it will add delays and also may mess up with
>>>> watermarks.
>>>>
>>>> Best
>>>>
>>>> Etienne Chauchot
>>>>
>>>>
>>>> On 04/09/2020 08:06, Jungtaek Lim wrote:
>>>>
>>>> Unfortunately I don't see enough active committers working on
>>>> Structured Streaming; I don't expect major features/improvements can be
>>>> brought in this situation.
>>>>
>>>> Technically I can review and merge the PR on major improvements in SS,
>>>> but that depends on how huge the proposal is changing. If the proposal
>>>> brings conceptual change, being reviewed by a committer wouldn't still be
>>>> enough.
>>>>
>>>> So that's not due to the fact we think it's worthless. (That might be
>>>> only me though.) I'd understand as there's not much investment on SS.
>>>> There's also a known workaround for multiple aggregations (I've documented
>>>> in the SS guide doc, in "Limitation of global watermark" section), though I
>>>> totally agree the workaround is bad.
>>>>
>>>> On Tue, Sep 1, 2020 at 12:28 AM Etienne Chauchot <ec...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm also very interested in this feature but the PR is open since
>>>>> January 2019 and was not updated. It raised a design discussion around
>>>>> watermarks and a design doc was written (
>>>>> https://docs.google.com/document/d/1IAH9UQJPUiUCLd7H6dazRK2k1szDX38SnM6GVNZYvUo/edit#heading=h.npkueh4bbkz1).
>>>>> We also commented this design but no matter what it seems that the subject
>>>>> is still stale.
>>>>>
>>>>> Is there any interest in the community in delivering this feature or
>>>>> is it considered worthless ? If the latter, can you explain why ?
>>>>>
>>>>> Best
>>>>>
>>>>> Etienne
>>>>> On 22/05/2019 03:38, 张万新 wrote:
>>>>>
>>>>> Thanks, I'll check it out.
>>>>>
>>>>> Arun Mahadevan <ar...@apache.org> wrote on Tue, May 21, 2019 at 01:31:
>>>>>
>>>>>> Heres the proposal for supporting it in "append" mode -
>>>>>> https://github.com/apache/spark/pull/23576. You could see if it
>>>>>> addresses your requirement and post your feedback in the PR.
>>>>>> For "update" mode its going to be much harder to support this without
>>>>>> first adding support for "retractions", otherwise we would end up with
>>>>>> wrong results.
>>>>>>
>>>>>> - Arun
>>>>>>
>>>>>>
>>>>>> On Mon, 20 May 2019 at 01:34, Gabor Somogyi <
>>>>>> gabor.g.somogyi@gmail.com> wrote:
>>>>>>
>>>>>>> There is PR for this but not yet merged.
>>>>>>>
>>>>>>> On Mon, May 20, 2019 at 10:13 AM 张万新 <ke...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> I'd like to know what's the root reason why multiple aggregations
>>>>>>>> on streaming dataframe is not allowed since it's a very useful feature, and
>>>>>>>> flink has supported it for a long time.
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>

Re: What's the root cause of not supporting multiple aggregations in structured streaming?

Posted by Yuanjian Li <xy...@gmail.com>.
Nice blog! Thanks for sharing, Etienne!

Let's try to raise this discussion again after the 3.1 release. I do think
more committers/contributors have realized the issue of the global watermark per
SPARK-24634 <https://issues.apache.org/jira/browse/SPARK-24634> and
SPARK-33259 <https://issues.apache.org/jira/browse/SPARK-33259>.

Leaving some thoughts on my end:
1. Compatibility: The per-operation watermark should be compatible with the
original global one when there are no multi-aggregations.
2. Versioning: If we need to change checkpoints' format, versioning info
should be added for the first time.
3. Fix more things together: We'd better fix more issues(e.g. per-operation
output mode for multi-aggregations) together, which would require
versioning changes in the same Spark version.
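
The difference between a global and a per-operation watermark can be
sketched with a toy model in plain Python. This is not Spark code and not the
design from the PR or the design doc: the watermark has zero allowed
lateness, events arrive in order, and the per-operator rule used here (the
second operator's watermark follows the timestamps of the rows it has
received) is a hypothetical simplification chosen only to show the effect.

```python
def tumbling_start(ts, size):
    """Start of the tumbling window of the given size that contains ts."""
    return ts - ts % size


def run(events, per_operator_watermark):
    """Chain two tumbling-window aggregations in a toy append mode.

    op1 counts events per 10-unit window; op2 would sum those counts per
    20-unit window. Returns (accepted, dropped): the op1 output rows that
    op2 kept or discarded as late.
    """
    op1_state = {}   # window start -> event count
    accepted, dropped = [], []
    global_wm = 0    # max event time seen at the source
    op2_wm = 0       # hypothetical watermark driven by op2's own input
    for ts in events:
        start = tumbling_start(ts, 10)
        op1_state[start] = op1_state.get(start, 0) + 1
        global_wm = max(global_wm, ts)
        # Append mode: op1 emits a window only once the watermark has passed its end.
        for s in sorted(w for w in op1_state if w + 10 <= global_wm):
            row = (s, op1_state.pop(s))
            wm = op2_wm if per_operator_watermark else global_wm
            op2_window_end = tumbling_start(s, 20) + 20
            if op2_window_end <= wm:
                dropped.append(row)   # op2 considers the row late
            else:
                accepted.append(row)
                op2_wm = max(op2_wm, s)
    return accepted, dropped


events = [1, 5, 12, 18, 23, 27, 31]
global_accepted, global_dropped = run(events, per_operator_watermark=False)
per_op_accepted, per_op_dropped = run(events, per_operator_watermark=True)
print(global_dropped, per_op_dropped)
```

With the single global watermark, the row for op1's window starting at 10 is
emitted only after the watermark has already passed the end of op2's window
[0, 20), so op2 drops it and its sum would be incomplete. With the
per-operator rule of this sketch, no row is dropped.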

Best,
Yuanjian


Etienne Chauchot <ec...@apache.org> wrote on Thu, Nov 26, 2020 at 5:29 PM:

> Hi,
>
> Regarding this subject I wrote a blog article that gives details about the
> watermark architecture proposal that was discussed in the design doc and in
> the PR:
>
>
> https://echauchot.blogspot.com/2020/11/watermark-architecture-proposal-for.html
>
> Best
>
> Etienne
> On 29/09/2020 03:24, Yuanjian Li wrote:
>
> Thanks for the great discussion!
>
> Also interested in this feature and did some investigation before. As Arun
> mentioned, similar to the "update" mode, the "complete" mode also needs
> more design. We might need an operation level output mode for the complete
> mode support. That is to say, if we use "complete" mode for every
> aggregation operators, the wrong result will return.
>
> SPARK-26655 would be a good start, which only considers about "append"
> mode. Maybe we need more discussion on the watermark interface. I will take
> a close look at the doc and PR. Hope we will have the first version with
> limitations and fix/remove them gradually.
>
> Best,
> Yuanjian
>
> Jungtaek Lim <ka...@gmail.com> wrote on Sat, Sep 26, 2020 at 10:31 AM:
>
>> Thanks Etienne! Yeah I forgot to say nice talking with you again. And
>> sorry I forgot to send the reply (was in draft).
>>
>> Regarding investment in SS, well, unfortunately I don't know - I'm just
>> an individual. There might be various reasons to do so, most probably
>> "priority" among the stuff. There's not much I could change.
>>
>> I agree the workaround is sub-optimal, but unless I see sufficient
>> support in the community I probably couldn't make it go forward. I'll just
>> say there's an elephant in the room - as the project goes forward for more
>> than 10 years, backward compatibility is a top priority concern in the
>> project, even across the major versions along the features/APIs. It is
>> great for end users to migrate the version easily, but also blocks devs to
>> fix the bad design once it ships. I'm the one complaining about these
>> issues in the dev list, and I don't see willingness to correct them.
>>
>>
>> On Fri, Sep 4, 2020 at 5:55 PM Etienne Chauchot <ec...@apache.org>
>> wrote:
>>
>>> Hi Jungtaek Lim,
>>>
>>> Nice to hear from you again since last time we talked :) and congrats on
>>> becoming a Spark committer in the meantime ! (if I'm not mistaking you were
>>> not at the time)
>>>
>>> I totally agree with what you're saying on merging structural parts of
>>> Spark without having a broader consensus. What I don't understand is why
>>> there is not more investment in SS. Especially because in another thread
>>> the community is discussing about deprecating the regular DStream streaming
>>> framework.
>>>
>>> Is the orientation of Spark now mostly batch ?
>>>
>>> PS: yeah I saw your update on the doc when I took a look at 3.0 preview
>>> 2 searching for this particular feature. And regarding the workaround, I'm
>>> not sure it meets my needs as it will add delays and also may mess up with
>>> watermarks.
>>>
>>> Best
>>>
>>> Etienne Chauchot
>>>
>>>
>>> On 04/09/2020 08:06, Jungtaek Lim wrote:
>>>
>>> Unfortunately I don't see enough active committers working on Structured
>>> Streaming; I don't expect major features/improvements can be brought in
>>> this situation.
>>>
>>> Technically I can review and merge the PR on major improvements in SS,
>>> but that depends on how huge the proposal is changing. If the proposal
>>> brings conceptual change, being reviewed by a committer wouldn't still be
>>> enough.
>>>
>>> So that's not due to the fact we think it's worthless. (That might be
>>> only me though.) I'd understand as there's not much investment on SS.
>>> There's also a known workaround for multiple aggregations (I've documented
>>> in the SS guide doc, in "Limitation of global watermark" section), though I
>>> totally agree the workaround is bad.
>>>
>>> On Tue, Sep 1, 2020 at 12:28 AM Etienne Chauchot <ec...@apache.org>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm also very interested in this feature but the PR is open since
>>>> January 2019 and was not updated. It raised a design discussion around
>>>> watermarks and a design doc was written (
>>>> https://docs.google.com/document/d/1IAH9UQJPUiUCLd7H6dazRK2k1szDX38SnM6GVNZYvUo/edit#heading=h.npkueh4bbkz1).
>>>> We also commented this design but no matter what it seems that the subject
>>>> is still stale.
>>>>
>>>> Is there any interest in the community in delivering this feature or is
>>>> it considered worthless ? If the latter, can you explain why ?
>>>>
>>>> Best
>>>>
>>>> Etienne
>>>> On 22/05/2019 03:38, 张万新 wrote:
>>>>
>>>> Thanks, I'll check it out.
>>>>
>>>> Arun Mahadevan <ar...@apache.org> wrote on Tue, May 21, 2019 at 01:31:
>>>>
>>>>> Heres the proposal for supporting it in "append" mode -
>>>>> https://github.com/apache/spark/pull/23576. You could see if it
>>>>> addresses your requirement and post your feedback in the PR.
>>>>> For "update" mode its going to be much harder to support this without
>>>>> first adding support for "retractions", otherwise we would end up with
>>>>> wrong results.
>>>>>
>>>>> - Arun
>>>>>
>>>>>
>>>>> On Mon, 20 May 2019 at 01:34, Gabor Somogyi <ga...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> There is PR for this but not yet merged.
>>>>>>
>>>>>> On Mon, May 20, 2019 at 10:13 AM 张万新 <ke...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>> I'd like to know what's the root reason why multiple aggregations on
>>>>>>> streaming dataframe is not allowed since it's a very useful feature, and
>>>>>>> flink has supported it for a long time.
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>

Re: What's the root cause of not supporting multiple aggregations in structured streaming?

Posted by Etienne Chauchot <ec...@apache.org>.
Hi,

Regarding this subject I wrote a blog article that gives details about 
the watermark architecture proposal that was discussed in the design doc 
and in the PR:

https://echauchot.blogspot.com/2020/11/watermark-architecture-proposal-for.html

Best

Etienne

On 29/09/2020 03:24, Yuanjian Li wrote:
> Thanks for the great discussion!
>
> Also interested in this feature and did some investigation before. As 
> Arun mentioned, similar to the "update" mode, the "complete" mode also 
> needs more design. We might need an operation level output mode for 
> the complete mode support. That is to say, if we use "complete" mode 
> for every aggregation operators, the wrong result will return.
>
> SPARK-26655 would be a good start, which only considers about "append" 
> mode. Maybe we need more discussion on the watermark interface. I will 
> take a close look at the doc and PR. Hope we will have the first 
> version with limitations and fix/remove them gradually.
>
> Best,
> Yuanjian
>
> Jungtaek Lim <kabhwan.opensource@gmail.com> wrote on Sat, Sep 26, 2020 at 10:31 AM:
>
>     Thanks Etienne! Yeah I forgot to say nice talking with you again.
>     And sorry I forgot to send the reply (was in draft).
>
>     Regarding investment in SS, well, unfortunately I don't know - I'm
>     just an individual. There might be various reasons to do so, most
>     probably "priority" among the stuff. There's not much I could change.
>
>     I agree the workaround is sub-optimal, but unless I see sufficient
>     support in the community I probably couldn't make it go forward.
>     I'll just say there's an elephant in the room - as the project
>     goes forward for more than 10 years, backward compatibility is a
>     top priority concern in the project, even across the major
>     versions along the features/APIs. It is great for end users to
>     migrate the version easily, but also blocks devs to fix the bad
>     design once it ships. I'm the one complaining about these issues
>     in the dev list, and I don't see willingness to correct them.
>
>
>     On Fri, Sep 4, 2020 at 5:55 PM Etienne Chauchot
>     <echauchot@apache.org <ma...@apache.org>> wrote:
>
>         Hi Jungtaek Lim,
>
>         Nice to hear from you again since last time we talked :) and
>         congrats on becoming a Spark committer in the meantime ! (if
>         I'm not mistaking you were not at the time)
>
>         I totally agree with what you're saying on merging structural
>         parts of Spark without having a broader consensus. What I
>         don't understand is why there is not more investment in SS.
>         Especially because in another thread the community is
>         discussing about deprecating the regular DStream streaming
>         framework.
>
>         Is the orientation of Spark now mostly batch ?
>
>         PS: yeah I saw your update on the doc when I took a look at
>         3.0 preview 2 searching for this particular feature. And
>         regarding the workaround, I'm not sure it meets my needs as it
>         will add delays and also may mess up with watermarks.
>
>         Best
>
>         Etienne Chauchot
>
>
>         On 04/09/2020 08:06, Jungtaek Lim wrote:
>>         Unfortunately I don't see enough active committers working on
>>         Structured Streaming; I don't expect major
>>         features/improvements can be brought in this situation.
>>
>>         Technically I can review and merge the PR on major
>>         improvements in SS, but that depends on how huge the proposal
>>         is changing. If the proposal brings conceptual change, being
>>         reviewed by a committer wouldn't still be enough.
>>
>>         So that's not due to the fact we think it's worthless. (That
>>         might be only me though.) I'd understand as there's not much
>>         investment on SS. There's also a known workaround for
>>         multiple aggregations (I've documented in the SS guide doc,
>>         in "Limitation of global watermark" section), though I
>>         totally agree the workaround is bad.
>>
>>         On Tue, Sep 1, 2020 at 12:28 AM Etienne Chauchot
>>         <echauchot@apache.org <ma...@apache.org>> wrote:
>>
>>             Hi all,
>>
>>             I'm also very interested in this feature but the PR is
>>             open since January 2019 and was not updated. It raised a
>>             design discussion around watermarks and a design doc was
>>             written
>>             (https://docs.google.com/document/d/1IAH9UQJPUiUCLd7H6dazRK2k1szDX38SnM6GVNZYvUo/edit#heading=h.npkueh4bbkz1).
>>             We also commented this design but no matter what it seems
>>             that the subject is still stale.
>>
>>             Is there any interest in the community in delivering this
>>             feature or is it considered worthless ? If the latter,
>>             can you explain why ?
>>
>>             Best
>>
>>             Etienne
>>
>>             On 22/05/2019 03:38, 张万新 wrote:
>>>             Thanks, I'll check it out.
>>>
>>>             Arun Mahadevan <arunm@apache.org> wrote on Tue, May 21, 2019 at 01:31:
>>>
>>>                 Heres the proposal for supporting it in "append"
>>>                 mode - https://github.com/apache/spark/pull/23576.
>>>                 You could see if it addresses your requirement and
>>>                 post your feedback in the PR.
>>>                 For "update" mode its going to be much harder to
>>>                 support this without first adding support for
>>>                 "retractions", otherwise we would end up with wrong
>>>                 results.
>>>
>>>                 - Arun
>>>
>>>
>>>                 On Mon, 20 May 2019 at 01:34, Gabor Somogyi
>>>                 <gabor.g.somogyi@gmail.com
>>>                 <ma...@gmail.com>> wrote:
>>>
>>>                     There is PR for this but not yet merged.
>>>
>>>                     On Mon, May 20, 2019 at 10:13 AM 张万新
>>>                     <kevinzwx1992@gmail.com
>>>                     <ma...@gmail.com>> wrote:
>>>
>>>                         Hi there,
>>>
>>>                         I'd like to know what's the root reason why
>>>                         multiple aggregations on streaming dataframe
>>>                         is not allowed since it's a very useful
>>>                         feature, and flink has supported it for a
>>>                         long time.
>>>
>>>                         Thanks.
>>>

Re: What's the root cause of not supporting multiple aggregations in structured streaming?

Posted by Yuanjian Li <xy...@gmail.com>.
Thanks for the great discussion!

I'm also interested in this feature and did some investigation before. As Arun
mentioned, similar to "update" mode, "complete" mode also needs more design.
We might need an operation-level output mode to support complete mode. That
is to say, if we use "complete" mode for every aggregation operator, wrong
results will be returned.

SPARK-26655 would be a good start, as it only considers "append" mode.
Maybe we need more discussion on the watermark interface. I will take a
close look at the doc and PR. I hope we will have a first version with
limitations and then fix/remove them gradually.

Best,
Yuanjian

Jungtaek Lim <ka...@gmail.com> wrote on Sat, Sep 26, 2020 at 10:31 AM:

> Thanks Etienne! Yeah I forgot to say nice talking with you again. And
> sorry I forgot to send the reply (was in draft).
>
> Regarding investment in SS, well, unfortunately I don't know - I'm just an
> individual. There might be various reasons to do so, most probably
> "priority" among the stuff. There's not much I could change.
>
> I agree the workaround is sub-optimal, but unless I see sufficient support
> in the community I probably couldn't make it go forward. I'll just say
> there's an elephant in the room - as the project goes forward for more than
> 10 years, backward compatibility is a top priority concern in the project,
> even across the major versions along the features/APIs. It is great for end
> users to migrate the version easily, but also blocks devs to fix the bad
> design once it ships. I'm the one complaining about these issues in the dev
> list, and I don't see willingness to correct them.
>
>
> On Fri, Sep 4, 2020 at 5:55 PM Etienne Chauchot <ec...@apache.org>
> wrote:
>
>> Hi Jungtaek Lim,
>>
>> Nice to hear from you again since last time we talked :) and congrats on
>> becoming a Spark committer in the meantime ! (if I'm not mistaking you were
>> not at the time)
>>
>> I totally agree with what you're saying on merging structural parts of
>> Spark without having a broader consensus. What I don't understand is why
>> there is not more investment in SS. Especially because in another thread
>> the community is discussing about deprecating the regular DStream streaming
>> framework.
>>
>> Is the orientation of Spark now mostly batch ?
>>
>> PS: yeah I saw your update on the doc when I took a look at 3.0 preview 2
>> searching for this particular feature. And regarding the workaround, I'm
>> not sure it meets my needs as it will add delays and also may mess up with
>> watermarks.
>>
>> Best
>>
>> Etienne Chauchot
>>
>>
>> On 04/09/2020 08:06, Jungtaek Lim wrote:
>>
>> Unfortunately I don't see enough active committers working on Structured
>> Streaming; I don't expect major features/improvements can be brought in
>> this situation.
>>
>> Technically I can review and merge the PR on major improvements in SS,
>> but that depends on how huge the proposal is changing. If the proposal
>> brings conceptual change, being reviewed by a committer wouldn't still be
>> enough.
>>
>> So that's not due to the fact we think it's worthless. (That might be
>> only me though.) I'd understand as there's not much investment on SS.
>> There's also a known workaround for multiple aggregations (I've documented
>> in the SS guide doc, in "Limitation of global watermark" section), though I
>> totally agree the workaround is bad.
>>
>> On Tue, Sep 1, 2020 at 12:28 AM Etienne Chauchot <ec...@apache.org>
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm also very interested in this feature but the PR is open since
>>> January 2019 and was not updated. It raised a design discussion around
>>> watermarks and a design doc was written (
>>> https://docs.google.com/document/d/1IAH9UQJPUiUCLd7H6dazRK2k1szDX38SnM6GVNZYvUo/edit#heading=h.npkueh4bbkz1).
>>> We also commented this design but no matter what it seems that the subject
>>> is still stale.
>>>
>>> Is there any interest in the community in delivering this feature or is
>>> it considered worthless ? If the latter, can you explain why ?
>>>
>>> Best
>>>
>>> Etienne
>>> On 22/05/2019 03:38, 张万新 wrote:
>>>
>>> Thanks, I'll check it out.
>>>
>>> Arun Mahadevan <ar...@apache.org> wrote on Tue, May 21, 2019 at 01:31:
>>>
>>>> Heres the proposal for supporting it in "append" mode -
>>>> https://github.com/apache/spark/pull/23576. You could see if it
>>>> addresses your requirement and post your feedback in the PR.
>>>> For "update" mode its going to be much harder to support this without
>>>> first adding support for "retractions", otherwise we would end up with
>>>> wrong results.
>>>>
>>>> - Arun
>>>>
>>>>
>>>> On Mon, 20 May 2019 at 01:34, Gabor Somogyi <ga...@gmail.com>
>>>> wrote:
>>>>
>>>>> There is PR for this but not yet merged.
>>>>>
>>>>> On Mon, May 20, 2019 at 10:13 AM 张万新 <ke...@gmail.com> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> I'd like to know what's the root reason why multiple aggregations on
>>>>>> streaming dataframe is not allowed since it's a very useful feature, and
>>>>>> flink has supported it for a long time.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>

Re: What's the root cause of not supporting multiple aggregations in structured streaming?

Posted by Jungtaek Lim <ka...@gmail.com>.
Thanks Etienne! Yeah I forgot to say nice talking with you again. And sorry
I forgot to send the reply (was in draft).

Regarding investment in SS, well, unfortunately I don't know - I'm just an
individual. There might be various reasons for it, most probably
"priority" among other work. There's not much I could change.

I agree the workaround is sub-optimal, but unless I see sufficient support
in the community I probably couldn't move it forward. I'll just say
there's an elephant in the room - as the project has been going for more than
10 years, backward compatibility is a top-priority concern in the project,
even across major versions, for features and APIs alike. It is great for end
users to be able to migrate versions easily, but it also blocks devs from
fixing a bad design once it ships. I'm the one complaining about these issues
on the dev list, and I don't see willingness to correct them.


On Fri, Sep 4, 2020 at 5:55 PM Etienne Chauchot <ec...@apache.org>
wrote:

> Hi Jungtaek Lim,
>
> Nice to hear from you again since last time we talked :) and congrats on
> becoming a Spark committer in the meantime ! (if I'm not mistaking you were
> not at the time)
>
> I totally agree with what you're saying on merging structural parts of
> Spark without having a broader consensus. What I don't understand is why
> there is not more investment in SS. Especially because in another thread
> the community is discussing about deprecating the regular DStream streaming
> framework.
>
> Is the orientation of Spark now mostly batch ?
>
> PS: yeah I saw your update on the doc when I took a look at 3.0 preview 2
> searching for this particular feature. And regarding the workaround, I'm
> not sure it meets my needs as it will add delays and also may mess up with
> watermarks.
>
> Best
>
> Etienne Chauchot
>
>
> On 04/09/2020 08:06, Jungtaek Lim wrote:
>
> Unfortunately I don't see enough active committers working on Structured
> Streaming; I don't expect major features/improvements can be brought in
> this situation.
>
> Technically I can review and merge the PR on major improvements in SS, but
> that depends on how huge the proposal is changing. If the proposal brings
> conceptual change, being reviewed by a committer wouldn't still be enough.
>
> So that's not due to the fact we think it's worthless. (That might be only
> me though.) I'd understand as there's not much investment on SS. There's
> also a known workaround for multiple aggregations (I've documented in the
> SS guide doc, in "Limitation of global watermark" section), though I
> totally agree the workaround is bad.
>
> On Tue, Sep 1, 2020 at 12:28 AM Etienne Chauchot <ec...@apache.org>
> wrote:
>
>> Hi all,
>>
>> I'm also very interested in this feature but the PR is open since January
>> 2019 and was not updated. It raised a design discussion around watermarks
>> and a design doc was written (
>> https://docs.google.com/document/d/1IAH9UQJPUiUCLd7H6dazRK2k1szDX38SnM6GVNZYvUo/edit#heading=h.npkueh4bbkz1).
>> We also commented this design but no matter what it seems that the subject
>> is still stale.
>>
>> Is there any interest in the community in delivering this feature or is
>> it considered worthless ? If the latter, can you explain why ?
>>
>> Best
>>
>> Etienne
>> On 22/05/2019 03:38, 张万新 wrote:
>>
>> Thanks, I'll check it out.
>>
>> Arun Mahadevan <ar...@apache.org> wrote on Tue, May 21, 2019 at 01:31:
>>
>>> Heres the proposal for supporting it in "append" mode -
>>> https://github.com/apache/spark/pull/23576. You could see if it
>>> addresses your requirement and post your feedback in the PR.
>>> For "update" mode its going to be much harder to support this without
>>> first adding support for "retractions", otherwise we would end up with
>>> wrong results.
>>>
>>> - Arun
>>>
>>>
>>> On Mon, 20 May 2019 at 01:34, Gabor Somogyi <ga...@gmail.com>
>>> wrote:
>>>
>>>> There is PR for this but not yet merged.
>>>>
>>>> On Mon, May 20, 2019 at 10:13 AM 张万新 <ke...@gmail.com> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> I'd like to know what's the root reason why multiple aggregations on
>>>>> streaming dataframe is not allowed since it's a very useful feature, and
>>>>> flink has supported it for a long time.
>>>>>
>>>>> Thanks.
>>>>>
>>>>

Re: What's the root cause of not supporting multiple aggregations in structured streaming?

Posted by Etienne Chauchot <ec...@apache.org>.
Hi Jungtaek Lim,

Nice to hear from you again since we last talked :) and congrats on
becoming a Spark committer in the meantime! (If I'm not mistaken, you
were not one at the time.)

I totally agree with what you're saying about merging structural parts of
Spark without having a broader consensus. What I don't understand is why
there is not more investment in SS, especially since in another thread
the community is discussing deprecating the regular DStream
streaming framework.

Is the orientation of Spark now mostly batch?

PS: yeah, I saw your update to the doc when I took a look at 3.0 preview
2 while searching for this particular feature. And regarding the workaround,
I'm not sure it meets my needs, as it will add delays and may also interfere
with watermarks.

Best

Etienne Chauchot


On 04/09/2020 08:06, Jungtaek Lim wrote:
> Unfortunately I don't see enough active committers working on 
> Structured Streaming; I don't expect major features/improvements can 
> be brought in this situation.
>
> Technically I can review and merge the PR on major improvements in SS, 
> but that depends on how huge the proposal is changing. If the proposal 
> brings conceptual change, being reviewed by a committer wouldn't still 
> be enough.
>
> So that's not due to the fact we think it's worthless. (That might be 
> only me though.) I'd understand as there's not much investment on SS. 
> There's also a known workaround for multiple aggregations (I've 
> documented in the SS guide doc, in "Limitation of global watermark" 
> section), though I totally agree the workaround is bad.
>
> On Tue, Sep 1, 2020 at 12:28 AM Etienne Chauchot <echauchot@apache.org>
> wrote:
>
>     Hi all,
>
>     I'm also very interested in this feature but the PR is open since
>     January 2019 and was not updated. It raised a design discussion
>     around watermarks and a design doc was written
>     (https://docs.google.com/document/d/1IAH9UQJPUiUCLd7H6dazRK2k1szDX38SnM6GVNZYvUo/edit#heading=h.npkueh4bbkz1).
>     We also commented this design but no matter what it seems that the
>     subject is still stale.
>
>     Is there any interest in the community in delivering this feature
>     or is it considered worthless ? If the latter, can you explain why ?
>
>     Best
>
>     Etienne
>
>     On 22/05/2019 03:38, 张万新 wrote:
>>     Thanks, I'll check it out.
>>
>>     Arun Mahadevan <arunm@apache.org> wrote on Tue, May 21, 2019
>>     at 01:31:
>>
>>         Heres the proposal for supporting it in "append" mode -
>>         https://github.com/apache/spark/pull/23576. You could see if
>>         it addresses your requirement and post your feedback in the PR.
>>         For "update" mode its going to be much harder to support this
>>         without first adding support for "retractions", otherwise we
>>         would end up with wrong results.
>>
>>         - Arun
>>
>>
>>         On Mon, 20 May 2019 at 01:34, Gabor Somogyi
>>         <gabor.g.somogyi@gmail.com> wrote:
>>
>>             There is PR for this but not yet merged.
>>
>>             On Mon, May 20, 2019 at 10:13 AM 张万新
>>             <kevinzwx1992@gmail.com> wrote:
>>
>>                 Hi there,
>>
>>                 I'd like to know what's the root reason why multiple
>>                 aggregations on streaming dataframe is not allowed
>>                 since it's a very useful feature, and flink has
>>                 supported it for a long time.
>>
>>                 Thanks.
>>

Re: What's the root cause of not supporting multiple aggregations in structured streaming?

Posted by Jungtaek Lim <ka...@gmail.com>.
Unfortunately I don't see enough active committers working on Structured
Streaming; I don't expect major features or improvements to land in this
situation.

Technically I can review and merge a PR with major improvements to SS, but
that depends on how much the proposal changes. If the proposal brings a
conceptual change, review by a single committer still wouldn't be enough.

So it's not that we think the feature is worthless. (That might be only
me, though.) My understanding is that there's just not much investment in
SS. There's also a known workaround for multiple aggregations (I've
documented it in the SS guide, in the "Limitation of global watermark"
section), though I totally agree the workaround is bad.

On Tue, Sep 1, 2020 at 12:28 AM Etienne Chauchot <ec...@apache.org>
wrote:

> Hi all,
>
> I'm also very interested in this feature but the PR is open since January
> 2019 and was not updated. It raised a design discussion around watermarks
> and a design doc was written (
> https://docs.google.com/document/d/1IAH9UQJPUiUCLd7H6dazRK2k1szDX38SnM6GVNZYvUo/edit#heading=h.npkueh4bbkz1).
> We also commented this design but no matter what it seems that the subject
> is still stale.
>
> Is there any interest in the community in delivering this feature or is it
> considered worthless ? If the latter, can you explain why ?
>
> Best
>
> Etienne
> On 22/05/2019 03:38, 张万新 wrote:
>
> Thanks, I'll check it out.
>
> Arun Mahadevan <ar...@apache.org> wrote on Tue, May 21, 2019 at 01:31:
>
>> Heres the proposal for supporting it in "append" mode -
>> https://github.com/apache/spark/pull/23576. You could see if it
>> addresses your requirement and post your feedback in the PR.
>> For "update" mode its going to be much harder to support this without
>> first adding support for "retractions", otherwise we would end up with
>> wrong results.
>>
>> - Arun
>>
>>
>> On Mon, 20 May 2019 at 01:34, Gabor Somogyi <ga...@gmail.com>
>> wrote:
>>
>>> There is PR for this but not yet merged.
>>>
>>> On Mon, May 20, 2019 at 10:13 AM 张万新 <ke...@gmail.com> wrote:
>>>
>>>> Hi there,
>>>>
>>>> I'd like to know what's the root reason why multiple aggregations on
>>>> streaming dataframe is not allowed since it's a very useful feature, and
>>>> flink has supported it for a long time.
>>>>
>>>> Thanks.
>>>>
>>>

Re: What's the root cause of not supporting multiple aggregations in structured streaming?

Posted by Etienne Chauchot <ec...@apache.org>.
Hi all,

I'm also very interested in this feature, but the PR has been open since 
January 2019 and has not been updated. It raised a design discussion 
around watermarks, and a design doc was written 
(https://docs.google.com/document/d/1IAH9UQJPUiUCLd7H6dazRK2k1szDX38SnM6GVNZYvUo/edit#heading=h.npkueh4bbkz1). 
We also commented on this design, but regardless, the subject seems to 
have gone stale.

Is there any interest in the community in delivering this feature, or is 
it considered worthless? If the latter, can you explain why?

Best

Etienne

On 22/05/2019 03:38, 张万新 wrote:
> Thanks, I'll check it out.
>
> Arun Mahadevan <arunm@apache.org> wrote on Tue, May 21, 2019 at 01:31:
>
>     Heres the proposal for supporting it in "append" mode -
>     https://github.com/apache/spark/pull/23576. You could see if it
>     addresses your requirement and post your feedback in the PR.
>     For "update" mode its going to be much harder to support this
>     without first adding support for "retractions", otherwise we would
>     end up with wrong results.
>
>     - Arun
>
>
>     On Mon, 20 May 2019 at 01:34, Gabor Somogyi
>     <gabor.g.somogyi@gmail.com> wrote:
>
>         There is PR for this but not yet merged.
>
>         On Mon, May 20, 2019 at 10:13 AM 张万新 <kevinzwx1992@gmail.com>
>         wrote:
>
>             Hi there,
>
>             I'd like to know what's the root reason why multiple
>             aggregations on streaming dataframe is not allowed since
>             it's a very useful feature, and flink has supported it for
>             a long time.
>
>             Thanks.
>

Re: What's the root cause of not supporting multiple aggregations in structured streaming?

Posted by 张万新 <ke...@gmail.com>.
Thanks, I'll check it out.

Arun Mahadevan <ar...@apache.org> wrote on Tue, May 21, 2019 at 01:31:

> Heres the proposal for supporting it in "append" mode -
> https://github.com/apache/spark/pull/23576. You could see if it addresses
> your requirement and post your feedback in the PR.
> For "update" mode its going to be much harder to support this without
> first adding support for "retractions", otherwise we would end up with
> wrong results.
>
> - Arun
>
>
> On Mon, 20 May 2019 at 01:34, Gabor Somogyi <ga...@gmail.com>
> wrote:
>
>> There is PR for this but not yet merged.
>>
>> On Mon, May 20, 2019 at 10:13 AM 张万新 <ke...@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>> I'd like to know what's the root reason why multiple aggregations on
>>> streaming dataframe is not allowed since it's a very useful feature, and
>>> flink has supported it for a long time.
>>>
>>> Thanks.
>>>
>>

Re: What's the root cause of not supporting multiple aggregations in structured streaming?

Posted by Arun Mahadevan <ar...@apache.org>.
Here's the proposal for supporting it in "append" mode:
https://github.com/apache/spark/pull/23576. You could see if it addresses
your requirement and post your feedback in the PR.
For "update" mode it's going to be much harder to support this without first
adding support for "retractions"; otherwise we would end up with wrong
results.
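
To make the "wrong results" point concrete, here is a toy sketch in plain
Python. It is not Spark code and does not reflect how the PR works; the
function names and the tuple layout are made up for illustration. The
first aggregation keeps a running count per key and emits a row on every
change, as "update" mode would. The second aggregation sums those counts.
Without retractions the second stage adds every intermediate value; with
retractions the stale value is subtracted first.

```python
from collections import defaultdict

def first_agg_update_mode(events, with_retractions):
    """Running count per key; emits a row each time a key's count changes.

    Rows are (key, count, sign): sign is +1 for a new value and -1 for a
    retraction of the previously emitted value (hypothetical layout).
    """
    counts = defaultdict(int)
    for key in events:
        old = counts[key]
        counts[key] = old + 1
        if with_retractions and old > 0:
            yield (key, old, -1)       # withdraw the stale count
        yield (key, counts[key], +1)   # emit the updated count

def second_agg(rows):
    """Downstream aggregation: sum of the per-key counts."""
    total = 0
    for _key, count, sign in rows:
        total += sign * count
    return total

events = ["a", "b", "a", "a"]
naive = second_agg(first_agg_update_mode(events, with_retractions=False))
fixed = second_agg(first_agg_update_mode(events, with_retractions=True))
print(naive, fixed)
```

With four input events the correct sum of counts is 4. The run without
retractions counts the intermediate values for key "a" (1, 2 and 3) all
over again, so it overshoots.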

- Arun


On Mon, 20 May 2019 at 01:34, Gabor Somogyi <ga...@gmail.com>
wrote:

> There is PR for this but not yet merged.
>
> On Mon, May 20, 2019 at 10:13 AM 张万新 <ke...@gmail.com> wrote:
>
>> Hi there,
>>
>> I'd like to know what's the root reason why multiple aggregations on
>> streaming dataframe is not allowed since it's a very useful feature, and
>> flink has supported it for a long time.
>>
>> Thanks.
>>
>

Re: What's the root cause of not supporting multiple aggregations in structured streaming?

Posted by Gabor Somogyi <ga...@gmail.com>.
There is a PR for this, but it is not yet merged.

On Mon, May 20, 2019 at 10:13 AM 张万新 <ke...@gmail.com> wrote:

> Hi there,
>
> I'd like to know what's the root reason why multiple aggregations on
> streaming dataframe is not allowed since it's a very useful feature, and
> flink has supported it for a long time.
>
> Thanks.
>