Posted to user@beam.apache.org by gaurav mishra <ga...@gmail.com> on 2021/11/16 23:00:22 UTC

Stateful DoFn, Global Window

Hi,
I have a pipeline which looks like this -
Input -> Convert_to_KV_pairsDoFn -> SomeStatefulDofn -> output
As you can see, there is no explicit "shuffle" transform here.

My understanding and observation so far has been that SomeStatefulDofn
will never be executed in parallel on two workers for any given key. Is
my understanding correct?
If yes, then a second question: is there an implicit GroupByKey-like step
introduced by Dataflow here to ensure that all messages with the same key
go to the same worker?

Re: Stateful DoFn, Global Window

Posted by gaurav mishra <ga...@gmail.com>.
Thanks for the clarifications Robert.

On Tue, Nov 16, 2021 at 3:27 PM Robert Bradshaw <ro...@google.com> wrote:

> On Tue, Nov 16, 2021 at 3:00 PM gaurav mishra
> <ga...@gmail.com> wrote:
> >
> > Hi,
> > I have a pipeline which looks like this -
> > Input -> Convert_to_KV_pairsDoFn -> SomeStatefulDofn -> output
> > As you can see there is no explicit "shuffle" transform here
> >
> > My understanding and observation so far has been that
> > SomeStatefulDofn will never be executed in parallel on two workers for
> > any given key. Is my understanding correct?
>
> That is correct.
>
> > If yes, then a second question: is there an implicit GroupByKey-like
> > step introduced by Dataflow here to ensure that all messages with the
> > same key go to the same worker?
>
> Exactly. (It doesn't technically group things, in the sense that it
> doesn't wait for all values per key-window to be available as a
> barrier, but it shuffles all values for a given key to the same worker,
> the same as a GBK.)
>

Re: Stateful DoFn, Global Window

Posted by Robert Bradshaw <ro...@google.com>.
On Tue, Nov 16, 2021 at 3:00 PM gaurav mishra
<ga...@gmail.com> wrote:
>
> Hi,
> I have a pipeline which looks like this -
> Input -> Convert_to_KV_pairsDoFn -> SomeStatefulDofn -> output
> As you can see there is no explicit "shuffle" transform here
>
> My understanding and observation so far has been that SomeStatefulDofn will never be executed in parallel on two workers for any given key. Is my understanding correct?

That is correct.

> If yes, then a second question: is there an implicit GroupByKey-like step introduced by Dataflow here to ensure that all messages with the same key go to the same worker?

Exactly. (It doesn't technically group things, in the sense that it
doesn't wait for all values per key-window to be available as a
barrier, but it shuffles all values for a given key to the same worker,
the same as a GBK.)
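
To make the distinction concrete, here is a toy sketch in plain Python. It
is not Beam code and not how Dataflow is implemented; the worker count, the
CRC32-based routing, and all names (worker_for, Worker, run_stateful,
run_group_by_key) are assumptions made up for illustration. It only models
the two properties described above: every element for a key lands on the
same worker, and the stateful step emits output per element instead of
waiting for the whole group. In the global window, "per key-window" reduces
to "per key", which is what the sketch assumes.

```python
import zlib
from collections import defaultdict

NUM_WORKERS = 3  # hypothetical worker count, chosen only for the example


def worker_for(key: str) -> int:
    """Deterministically map a key to a worker (stand-in for the shuffle)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_WORKERS


class Worker:
    """Owns the state for the keys routed to it; handles one element at a time."""

    def __init__(self, worker_id: int):
        self.worker_id = worker_id
        self.state = defaultdict(int)  # per-key running count

    def process(self, key, value):
        self.state[key] += 1
        return (key, value, self.state[key], self.worker_id)


def run_stateful(elements):
    """Route each element by key and process it right away (no barrier)."""
    workers = [Worker(i) for i in range(NUM_WORKERS)]
    outputs = []
    for key, value in elements:
        # Output is produced as soon as the element arrives at its worker.
        outputs.append(workers[worker_for(key)].process(key, value))
    return outputs


def run_group_by_key(elements):
    """Same routing idea, but nothing is emitted until all input is consumed."""
    groups = defaultdict(list)
    for key, value in elements:
        groups[key].append(value)
    return dict(groups)  # the barrier: results appear only at the end


elements = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
stateful_out = run_stateful(elements)
grouped_out = run_group_by_key(elements)
```

In stateful_out, every tuple for a given key carries the same worker id and
a running count that increases by one per element, while grouped_out only
exists once the whole input has been read.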