You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@beam.apache.org by Pablo Estrada <pa...@google.com> on 2021/09/23 23:30:12 UTC

Documenting per-key delivery semantics for various runners

Hi all,
I've been spending some time thinking about CDC use cases on Beam. One
valuable piece to enable these use cases is to define how Beam deals with
ordering of elements in streaming pipelines.
With that in mind, I wrote a document[1] that proposes a definition of the
ordering semantics supported by most Beam runners, and a pull request [2]
with ValidatesRunner tests and documentation updates.

Would you please review these, add your comments and thoughts, and let me
know if they make sense?

Thanks!
-P.

[1]
https://docs.google.com/document/d/1_7WRJznXlOtWuVaHl_dpy8OZcx_M8BUmeWVA4G0-wEc/edit#

[2] https://github.com/apache/beam/pull/15378

Re: Documenting per-key delivery semantics for various runners

Posted by Jan Lukavský <je...@seznam.cz>.

Hi Pablo,

thanks for the motivating examples. I understand the motivation now, but 
one question comes to mind - we do not mind the case, when in-between 
the emitting PTransform and the comsuming PTransform is another 
(grouping) PTransform which _changes the key_? Via the change of key, 
two elements originally emitted with the same key, can change the key to 
two different ones, and then back to the same one, which would obviously 
violate any ordering defined on the original emitting transform. I'm 
aware, that the per-key definition describes the two PTransforms to be 
directly connected. I'm just asking, if we would not want to solve this 
for the more general case.

In the original design document for @RequiresTimeSortedInput [1], there 
was (not implemented) mention about "User supplied sorting criterion", 
which I believe is exactly what for instance Kafka offset offers, and 
what is actually described by the per-key delivery semantics. The key 
point is that each element is (anyhow) assigned a sequential ID, which 
then defines order as seen by a per-key authoritative _observer_ (for 
instance, the source emitting transform, in the case of Kafka that is 
the leader for a partition, and so on). If this per-key sequential ID is 
carried along the element, then the order can be reconstructed at any 
downstream stage.

I'm +1 to splitting the documentation and validation / implementation 
parts, that sounds good to me.

  Jan

[1] 
https://docs.google.com/document/d/1ObLVUFsf1NcG8ZuIZE4aVy2RYKx2FfyMhkZYWPnI9-c/edit?usp=sharing

On 9/27/21 11:56 PM, Pablo Estrada wrote:
> Hello all,
> thanks for your comments.
>
> All runners that I've tested have these semantics for streaming. See 
> the PR[1] with the cap matrix changes.
>
> I agree it makes sense to add a pipeline check for this. I think I 
> would like to receive comments / agreement on the definition and the 
> changes to the documentation - and then follow up with the pipeline 
> check. Is that reasonable for everyone?
>
> I have added a section of motivating use cases to the document, Jan. 
> Let me know if those make sense.
>
> [1] https://github.com/apache/beam/pull/15378 
> <https://github.com/apache/beam/pull/15378>
>
> Thanks
> -P
>
> On Sat, Sep 25, 2021 at 1:06 PM Jan Lukavský <je.ik@seznam.cz 
> <ma...@seznam.cz>> wrote:
>
>     +1 to adding a Pipeline requirement for this, if business logic
>     relies
>     on a specific feature runner might/might not have, then Pipeline
>     should
>     be rejected on runners that do not support it. Do we have a list
>     runners
>     that have or lack this semantics? Just for clarification - sorry my
>     ignorance, if this has been already described - do we have a
>     description
>     of the use-cases that drive this effort?
>
>     Thanks,
>
>       Jan
>
>     On 9/24/21 10:58 PM, Robert Bradshaw wrote:
>     > Thanks for writing this up. Rather than just documenting it,
>     should we
>     > have a way of asserting/requesting it (like time sorted inputs) such
>     > that a pipeline author that needs to rely on this property can be
>     > rejected on runners that don't provide it?
>     >
>     > On Fri, Sep 24, 2021 at 12:25 PM Kenneth Knowles
>     <kenn@apache.org <ma...@apache.org>> wrote:
>     >> Took a look. I definitely agree that something like this is
>     useful, and well-motivated by the use cases you raise.
>     >>
>     >> Kenn
>     >>
>     >> On Thu, Sep 23, 2021 at 4:30 PM Pablo Estrada
>     <pabloem@google.com <ma...@google.com>> wrote:
>     >>> Hi all,
>     >>> I've been spending some time thinking about CDC use cases on
>     Beam. One valuable piece to enable these use cases is to define
>     how Beam deals with ordering of elements in streaming pipelines.
>     >>> With that in mind, I wrote a document[1] that proposes a
>     definition of the ordering semantics supported by most Beam
>     runners, and a pull request [2] with ValidatesRunner tests and
>     documentation updates.
>     >>>
>     >>> Would you please review these, add your comments and thoughts,
>     and let me know if they make sense?
>     >>>
>     >>> Thanks!
>     >>> -P.
>     >>>
>     >>> [1]
>     https://docs.google.com/document/d/1_7WRJznXlOtWuVaHl_dpy8OZcx_M8BUmeWVA4G0-wEc/edit#
>     <https://docs.google.com/document/d/1_7WRJznXlOtWuVaHl_dpy8OZcx_M8BUmeWVA4G0-wEc/edit#>
>     >>> [2] https://github.com/apache/beam/pull/15378
>     <https://github.com/apache/beam/pull/15378>
>

Re: Documenting per-key delivery semantics for various runners

Posted by Pablo Estrada <pa...@google.com>.

Hello all,
thanks for your comments.

All runners that I've tested have these semantics for streaming. See the
PR[1] with the cap matrix changes.

I agree it makes sense to add a pipeline check for this. I think I would
like to receive comments / agreement on the definition and the changes to
the documentation - and then follow up with the pipeline check. Is that
reasonable for everyone?

I have added a section of motivating use cases to the document, Jan. Let me
know if those make sense.

[1] https://github.com/apache/beam/pull/15378

Thanks
-P

On Sat, Sep 25, 2021 at 1:06 PM Jan Lukavský <je...@seznam.cz> wrote:

> +1 to adding a Pipeline requirement for this, if business logic relies
> on a specific feature runner might/might not have, then Pipeline should
> be rejected on runners that do not support it. Do we have a list runners
> that have or lack this semantics? Just for clarification - sorry my
> ignorance, if this has been already described - do we have a description
> of the use-cases that drive this effort?
>
> Thanks,
>
>   Jan
>
> On 9/24/21 10:58 PM, Robert Bradshaw wrote:
> > Thanks for writing this up. Rather than just documenting it, should we
> > have a way of asserting/requesting it (like time sorted inputs) such
> > that a pipeline author that needs to rely on this property can be
> > rejected on runners that don't provide it?
> >
> > On Fri, Sep 24, 2021 at 12:25 PM Kenneth Knowles <ke...@apache.org>
> wrote:
> >> Took a look. I definitely agree that something like this is useful, and
> well-motivated by the use cases you raise.
> >>
> >> Kenn
> >>
> >> On Thu, Sep 23, 2021 at 4:30 PM Pablo Estrada <pa...@google.com>
> wrote:
> >>> Hi all,
> >>> I've been spending some time thinking about CDC use cases on Beam. One
> valuable piece to enable these use cases is to define how Beam deals with
> ordering of elements in streaming pipelines.
> >>> With that in mind, I wrote a document[1] that proposes a definition of
> the ordering semantics supported by most Beam runners, and a pull request
> [2] with ValidatesRunner tests and documentation updates.
> >>>
> >>> Would you please review these, add your comments and thoughts, and let
> me know if they make sense?
> >>>
> >>> Thanks!
> >>> -P.
> >>>
> >>> [1]
> https://docs.google.com/document/d/1_7WRJznXlOtWuVaHl_dpy8OZcx_M8BUmeWVA4G0-wEc/edit#
> >>> [2] https://github.com/apache/beam/pull/15378
>

Re: Documenting per-key delivery semantics for various runners

Posted by Jan Lukavský <je...@seznam.cz>.

+1 to adding a Pipeline requirement for this, if business logic relies 
on a specific feature runner might/might not have, then Pipeline should 
be rejected on runners that do not support it. Do we have a list runners 
that have or lack this semantics? Just for clarification - sorry my 
ignorance, if this has been already described - do we have a description 
of the use-cases that drive this effort?

Thanks,

  Jan

On 9/24/21 10:58 PM, Robert Bradshaw wrote:
> Thanks for writing this up. Rather than just documenting it, should we
> have a way of asserting/requesting it (like time sorted inputs) such
> that a pipeline author that needs to rely on this property can be
> rejected on runners that don't provide it?
>
> On Fri, Sep 24, 2021 at 12:25 PM Kenneth Knowles <ke...@apache.org> wrote:
>> Took a look. I definitely agree that something like this is useful, and well-motivated by the use cases you raise.
>>
>> Kenn
>>
>> On Thu, Sep 23, 2021 at 4:30 PM Pablo Estrada <pa...@google.com> wrote:
>>> Hi all,
>>> I've been spending some time thinking about CDC use cases on Beam. One valuable piece to enable these use cases is to define how Beam deals with ordering of elements in streaming pipelines.
>>> With that in mind, I wrote a document[1] that proposes a definition of the ordering semantics supported by most Beam runners, and a pull request [2] with ValidatesRunner tests and documentation updates.
>>>
>>> Would you please review these, add your comments and thoughts, and let me know if they make sense?
>>>
>>> Thanks!
>>> -P.
>>>
>>> [1] https://docs.google.com/document/d/1_7WRJznXlOtWuVaHl_dpy8OZcx_M8BUmeWVA4G0-wEc/edit#
>>> [2] https://github.com/apache/beam/pull/15378

Re: Documenting per-key delivery semantics for various runners

Posted by Robert Bradshaw <ro...@google.com>.

Thanks for writing this up. Rather than just documenting it, should we
have a way of asserting/requesting it (like time sorted inputs) such
that a pipeline author that needs to rely on this property can be
rejected on runners that don't provide it?

On Fri, Sep 24, 2021 at 12:25 PM Kenneth Knowles <ke...@apache.org> wrote:
>
> Took a look. I definitely agree that something like this is useful, and well-motivated by the use cases you raise.
>
> Kenn
>
> On Thu, Sep 23, 2021 at 4:30 PM Pablo Estrada <pa...@google.com> wrote:
>>
>> Hi all,
>> I've been spending some time thinking about CDC use cases on Beam. One valuable piece to enable these use cases is to define how Beam deals with ordering of elements in streaming pipelines.
>> With that in mind, I wrote a document[1] that proposes a definition of the ordering semantics supported by most Beam runners, and a pull request [2] with ValidatesRunner tests and documentation updates.
>>
>> Would you please review these, add your comments and thoughts, and let me know if they make sense?
>>
>> Thanks!
>> -P.
>>
>> [1] https://docs.google.com/document/d/1_7WRJznXlOtWuVaHl_dpy8OZcx_M8BUmeWVA4G0-wEc/edit#
>> [2] https://github.com/apache/beam/pull/15378

Re: Documenting per-key delivery semantics for various runners

Posted by Kenneth Knowles <ke...@apache.org>.

Took a look. I definitely agree that something like this is useful, and
well-motivated by the use cases you raise.

Kenn

On Thu, Sep 23, 2021 at 4:30 PM Pablo Estrada <pa...@google.com> wrote:

> Hi all,
> I've been spending some time thinking about CDC use cases on Beam. One
> valuable piece to enable these use cases is to define how Beam deals with
> ordering of elements in streaming pipelines.
> With that in mind, I wrote a document[1] that proposes a definition of the
> ordering semantics supported by most Beam runners, and a pull request [2]
> with ValidatesRunner tests and documentation updates.
>
> Would you please review these, add your comments and thoughts, and let me
> know if they make sense?
>
> Thanks!
> -P.
>
> [1]
> https://docs.google.com/document/d/1_7WRJznXlOtWuVaHl_dpy8OZcx_M8BUmeWVA4G0-wEc/edit#
>
> [2] https://github.com/apache/beam/pull/15378
>