Posted to dev@beam.apache.org by Rui Wang <ru...@google.com> on 2019/07/10 17:58:36 UTC

[Discuss] Retractions in Beam

Hi Community,

Retractions are a part of the core Beam model [1]. I have put together a doc
to discuss retractions: use cases, model and API (see the link below).
This is a very early discussion of retractions, but I do hope we can
reach consensus and eventually implement retractions in a useful way.


doc link:
https://docs.google.com/document/d/14WRfxwk_iLUHGPty3C6ZenddPsp_d6jhmx0vuafXqmE/edit?usp=sharing


[1]: https://issues.apache.org/jira/browse/BEAM-91


-Rui

Re: [Discuss] Retractions in Beam

Posted by Ismaël Mejía <ie...@gmail.com>.
Can you please add this to the design documents webpage?
https://beam.apache.org/contribute/design-documents/



Re: [Discuss] Retractions in Beam

Posted by Rui Wang <ru...@google.com>.
Thanks Kenn.

Some points to address the concerns:
1. Adding retractions won't break existing users (because it is a new
accumulation mode).
2. Adding retractions won't affect the performance of existing pipelines
(this can be done by making sure the existing modes never call the retracting
mode's core components, e.g. ReduceFn).
3. If we keep going in the direction of "retracting and discarding", the
performance decrease should be minimal: a small input change that leads to a
small output change will be cheap to compute. For example, for Combining, a
retraction should never require a full recompute over every combined element.
It should be the same for Join.
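
To illustrate point 3, here is a minimal Python sketch of a combiner whose
accumulator can undo a single input in constant time. It is not Beam API:
MeanAccumulator and retract_input are hypothetical names chosen for this
example, and the sketch assumes the combine function has an inverse (true for
sum/count, not for something like max).

```python
from dataclasses import dataclass


@dataclass
class MeanAccumulator:
    """Running sum and count; both are updated in O(1) per input."""
    total: float = 0.0
    count: int = 0

    def add_input(self, value: float) -> None:
        self.total += value
        self.count += 1

    def retract_input(self, value: float) -> None:
        # Hypothetical inverse of add_input: undo one earlier contribution
        # without revisiting any of the other combined elements.
        self.total -= value
        self.count -= 1

    def extract_output(self) -> float:
        return self.total / self.count if self.count else 0.0


acc = MeanAccumulator()
for v in [2.0, 4.0, 9.0]:
    acc.add_input(v)
before = acc.extract_output()

acc.retract_input(9.0)  # upstream withdrew the value 9.0 ...
acc.add_input(6.0)      # ... and replaced it with 6.0
after = acc.extract_output()
```

The cost of the correction depends only on the size of the change (one
retraction plus one addition), not on how many elements were combined before.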


Please feel free to combine/add any other pieces you want, even if there
could be overlaps.

-Rui



Re: [Discuss] Retractions in Beam

Posted by Kenneth Knowles <ke...@apache.org>.
I reviewed your PR (https://github.com/apache/beam/pull/9199) and Anton's
as another reference (https://github.com/apache/beam/pull/4742). Nice work.
I thought I would summarize for the list a little bit. I think we have not
done too much with retractions because it seems like a big job. You both
have shown that it is maybe not that hard to implement the core. But it
will have a lot of user-facing things that we have to test very carefully.

 - the technical changes are primarily to ReduceFn aka GroupAlsoByWindow
which is the core of stateful aggregation and is straightforward, which is
cool
 - the boilerplate through the codebase is a lot (most of the ~1000 lines
of both PRs) but it could have been a lot worse, so we are lucky :-)
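
As a rough picture of what a retracting accumulation mode means for a
stateful aggregation such as the streaming wordcount, here is a small Python
sketch. It is not the code from either PR and not the Beam ReduceFn API:
RetractingCounter, Output and fire are hypothetical names, and the sketch
assumes that each firing first retracts the previously emitted value for a
key and then emits the new one.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Output(NamedTuple):
    key: str
    value: int
    is_retraction: bool


class RetractingCounter:
    """Per-key count that remembers what it last emitted (hypothetical)."""

    def __init__(self) -> None:
        self._counts: Dict[str, int] = defaultdict(int)
        self._last_emitted: Dict[str, int] = {}

    def add(self, word: str) -> None:
        self._counts[word] += 1

    def fire(self) -> List[Output]:
        """Emit changed keys: a retraction of the old value, then the new one."""
        out: List[Output] = []
        for word, count in sorted(self._counts.items()):
            previous = self._last_emitted.get(word)
            if previous == count:
                continue  # nothing changed for this key since the last firing
            if previous is not None:
                out.append(Output(word, previous, True))
            out.append(Output(word, count, False))
            self._last_emitted[word] = count
        return out


counter = RetractingCounter()
counter.add("a")
counter.add("b")
first = counter.fire()

counter.add("a")
second = counter.fire()
```

In this sketch the second firing only touches the key that changed, so a
downstream consumer sees a retraction of ("a", 1) followed by ("a", 2) and
nothing for "b".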

Here are steps forward that I can think of:

 - we need backwards compatibility, which is trivial because it is a new
accumulation mode
 - we need a little more mathematical analysis (at least personally to have
more confidence there are no bad surprises)
 - we need more description of the user-facing impact and API changes (same
reason)
 - lots and lots of @ValidatesRunner tests
 - some opinions here from runner authors about efficiency in their system
 - and merging/unmerging window support matters since that is a key
retractions use case too, but I would save it for later in my opinion (if
you've seen Tyler's hack to do Validity Windows then that is even harder)
 - we also need protections so that things which do not work with
retractions are rejected, which will be all existing user DoFns and all
sinks
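
The last point, rejecting things that do not work with retractions, could be
pictured as an opt-in check at pipeline construction time. The Python sketch
below is only an assumption about how such a protection might look: Step,
handles_retractions and validate are hypothetical names, not existing Beam
classes or methods.

```python
from dataclasses import dataclass
from typing import List


class UnsupportedRetractionError(ValueError):
    """Raised when a step would receive retractions it cannot handle."""


@dataclass(frozen=True)
class Step:
    name: str
    # Hypothetical opt-in flag; existing DoFns and sinks would default to False.
    handles_retractions: bool = False


def validate(steps: List[Step], retracting: bool) -> None:
    """Reject the pipeline if retractions would reach a step that has not opted in."""
    if not retracting:
        return  # existing accumulation modes are unaffected
    offenders = [s.name for s in steps if not s.handles_retractions]
    if offenders:
        raise UnsupportedRetractionError(
            "steps do not support retractions: " + ", ".join(offenders)
        )


steps = [Step("CountWords", handles_retractions=True), Step("WriteToFile")]

validate(steps, retracting=False)  # passes: no retractions are produced

try:
    validate(steps, retracting=True)
    rejected = False
except UnsupportedRetractionError as err:
    rejected = True
    message = str(err)
```

Defaulting the flag to False matches the observation above that all existing
user DoFns and sinks would be rejected until they explicitly opt in.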

I've got an old doc made w/ Anton, Ben, and a couple others that I can try
to find time to edit and share that deals a little bit with mathematics
(messy and incomplete) and the API/compatibility questions (more useful,
probably). You've seen it offline but the list has not seen a public
version. I was going to try to merge it with yours but I can get it out
quicker if I just allow for the overlaps.

Kenn


Re: [Discuss] Retractions in Beam

Posted by Rui Wang <ru...@google.com>.
Hello!

I have also been building a proof of concept (PR
<https://github.com/apache/beam/pull/9199>), which implements the streaming
wordcount example in the design doc.

What is missing in the PoC is an implementation of the ordering guarantee in
the sink (which I am working on).


-Rui


Re: [Discuss] Retractions in Beam

Posted by Rui Wang <ru...@google.com>.
Hello!

In case you are not aware, I have added a modified streaming wordcount
example at the end of the doc to illustrate retractions.


-Rui
