You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Lukasz Cwik <lc...@google.com.INVALID> on 2016/07/20 17:38:43 UTC

[PROPOSAL] CoGBK as primitive transform instead of GBK

I would like to propose a change to Beam to make CoGBK the basis for
grouping instead of GBK. The idea behind this proposal is that CoGBK is a
more powerful operator then GBK allowing for two key benefits:

1) SDKs are simplified: transforming a CoGBK into a GBK is trivial while
the reverse is not.
2) It will be easier for runners to provide more efficient implementations
of CoGBK as they will be responsible for the logic which takes their own
internal grouping implementation and maps it onto a CoGBK.

This requires the following modifications to the Beam code base:

1) Make GBK a composite transform in terms of CoGBK.
2) Move the CoGBK from contrib to runners-core as an adapter*. Runners that
more naturally support GBK can just use this and everything executes
exactly as before.

*just like GroupByKeyViaGroupByKeyOnly and UnboundedReadFromBoundedSource

Re: [PROPOSAL] CoGBK as primitive transform instead of GBK

Posted by Lukasz Cwik <lc...@google.com.INVALID>.
I created https://issues.apache.org/jira/browse/BEAM-490 which will track
the work for swapping the primitive from being GBK to CoGBK.

On Thu, Jul 21, 2016 at 9:24 AM, Lukasz Cwik <lc...@google.com> wrote:

> As of today, Cloud Dataflow will also be executed as a GBK.
>
> On Thu, Jul 21, 2016 at 2:56 AM, Aljoscha Krettek <al...@apache.org>
> wrote:
>
>> +1
>>
>> Out of curiosity, does Cloud Dataflow have a CoGBK primitive or will it
>> also be executed as a GBK there?
>>
>> On Thu, 21 Jul 2016 at 02:29 Kam Kasravi <ka...@gmail.com> wrote:
>>
>> > +1 - awesome Manu.
>> >
>> >     On Wednesday, July 20, 2016 1:53 PM, Kenneth Knowles
>> > <kl...@google.com.INVALID> wrote:
>> >
>> >
>> >  +1
>> >
>> > I assume that the intent is for the semantics of both GBK and CoGBK to
>> be
>> > unchanged, just swapping their status as primitives.
>> >
>> > This seems like a good change, with strictly positive impact on users
>> and
>> > SDK authors, with only an extremely minor burden (doing an insertion of
>> the
>> > provided implementation in the worst case) on runner authors.
>> >
>> > Kenn
>> >
>> >
>> > On Wed, Jul 20, 2016 at 10:38 AM, Lukasz Cwik <lcwik@google.com.invalid
>> >
>> > wrote:
>> >
>> > > I would like to propose a change to Beam to make CoGBK the basis for
>> > > grouping instead of GBK. The idea behind this proposal is that CoGBK
>> is a
>> > > more powerful operator then GBK allowing for two key benefits:
>> > >
>> > > 1) SDKs are simplified: transforming a CoGBK into a GBK is trivial
>> while
>> > > the reverse is not.
>> > > 2) It will be easier for runners to provide more efficient
>> > implementations
>> > > of CoGBK as they will be responsible for the logic which takes their
>> own
>> > > internal grouping implementation and maps it onto a CoGBK.
>> > >
>> > > This requires the following modifications to the Beam code base:
>> > >
>> > > 1) Make GBK a composite transform in terms of CoGBK.
>> > > 2) Move the CoGBK from contrib to runners-core as an adapter*. Runners
>> > that
>> > > more naturally support GBK can just use this and everything executes
>> > > exactly as before.
>> > >
>> > > *just like GroupByKeyViaGroupByKeyOnly and
>> UnboundedReadFromBoundedSource
>> > >
>> >
>> >
>> >
>>
>
>

Re: [PROPOSAL] CoGBK as primitive transform instead of GBK

Posted by Lukasz Cwik <lc...@google.com.INVALID>.
As of today, Cloud Dataflow will also be executed as a GBK.

On Thu, Jul 21, 2016 at 2:56 AM, Aljoscha Krettek <al...@apache.org>
wrote:

> +1
>
> Out of curiosity, does Cloud Dataflow have a CoGBK primitive or will it
> also be executed as a GBK there?
>
> On Thu, 21 Jul 2016 at 02:29 Kam Kasravi <ka...@gmail.com> wrote:
>
> > +1 - awesome Manu.
> >
> >     On Wednesday, July 20, 2016 1:53 PM, Kenneth Knowles
> > <kl...@google.com.INVALID> wrote:
> >
> >
> >  +1
> >
> > I assume that the intent is for the semantics of both GBK and CoGBK to be
> > unchanged, just swapping their status as primitives.
> >
> > This seems like a good change, with strictly positive impact on users and
> > SDK authors, with only an extremely minor burden (doing an insertion of
> the
> > provided implementation in the worst case) on runner authors.
> >
> > Kenn
> >
> >
> > On Wed, Jul 20, 2016 at 10:38 AM, Lukasz Cwik <lc...@google.com.invalid>
> > wrote:
> >
> > > I would like to propose a change to Beam to make CoGBK the basis for
> > > grouping instead of GBK. The idea behind this proposal is that CoGBK
> is a
> > > more powerful operator then GBK allowing for two key benefits:
> > >
> > > 1) SDKs are simplified: transforming a CoGBK into a GBK is trivial
> while
> > > the reverse is not.
> > > 2) It will be easier for runners to provide more efficient
> > implementations
> > > of CoGBK as they will be responsible for the logic which takes their
> own
> > > internal grouping implementation and maps it onto a CoGBK.
> > >
> > > This requires the following modifications to the Beam code base:
> > >
> > > 1) Make GBK a composite transform in terms of CoGBK.
> > > 2) Move the CoGBK from contrib to runners-core as an adapter*. Runners
> > that
> > > more naturally support GBK can just use this and everything executes
> > > exactly as before.
> > >
> > > *just like GroupByKeyViaGroupByKeyOnly and
> UnboundedReadFromBoundedSource
> > >
> >
> >
> >
>

Re: [PROPOSAL] CoGBK as primitive transform instead of GBK

Posted by Aljoscha Krettek <al...@apache.org>.
+1

Out of curiosity, does Cloud Dataflow have a CoGBK primitive or will it
also be executed as a GBK there?

On Thu, 21 Jul 2016 at 02:29 Kam Kasravi <ka...@gmail.com> wrote:

> +1 - awesome Manu.
>
>     On Wednesday, July 20, 2016 1:53 PM, Kenneth Knowles
> <kl...@google.com.INVALID> wrote:
>
>
>  +1
>
> I assume that the intent is for the semantics of both GBK and CoGBK to be
> unchanged, just swapping their status as primitives.
>
> This seems like a good change, with strictly positive impact on users and
> SDK authors, with only an extremely minor burden (doing an insertion of the
> provided implementation in the worst case) on runner authors.
>
> Kenn
>
>
> On Wed, Jul 20, 2016 at 10:38 AM, Lukasz Cwik <lc...@google.com.invalid>
> wrote:
>
> > I would like to propose a change to Beam to make CoGBK the basis for
> > grouping instead of GBK. The idea behind this proposal is that CoGBK is a
> > more powerful operator then GBK allowing for two key benefits:
> >
> > 1) SDKs are simplified: transforming a CoGBK into a GBK is trivial while
> > the reverse is not.
> > 2) It will be easier for runners to provide more efficient
> implementations
> > of CoGBK as they will be responsible for the logic which takes their own
> > internal grouping implementation and maps it onto a CoGBK.
> >
> > This requires the following modifications to the Beam code base:
> >
> > 1) Make GBK a composite transform in terms of CoGBK.
> > 2) Move the CoGBK from contrib to runners-core as an adapter*. Runners
> that
> > more naturally support GBK can just use this and everything executes
> > exactly as before.
> >
> > *just like GroupByKeyViaGroupByKeyOnly and UnboundedReadFromBoundedSource
> >
>
>
>

Re: [PROPOSAL] CoGBK as primitive transform instead of GBK

Posted by Kam Kasravi <ka...@gmail.com>.
+1 - awesome Manu. 

    On Wednesday, July 20, 2016 1:53 PM, Kenneth Knowles <kl...@google.com.INVALID> wrote:
 

 +1

I assume that the intent is for the semantics of both GBK and CoGBK to be
unchanged, just swapping their status as primitives.

This seems like a good change, with strictly positive impact on users and
SDK authors, with only an extremely minor burden (doing an insertion of the
provided implementation in the worst case) on runner authors.

Kenn


On Wed, Jul 20, 2016 at 10:38 AM, Lukasz Cwik <lc...@google.com.invalid>
wrote:

> I would like to propose a change to Beam to make CoGBK the basis for
> grouping instead of GBK. The idea behind this proposal is that CoGBK is a
> more powerful operator then GBK allowing for two key benefits:
>
> 1) SDKs are simplified: transforming a CoGBK into a GBK is trivial while
> the reverse is not.
> 2) It will be easier for runners to provide more efficient implementations
> of CoGBK as they will be responsible for the logic which takes their own
> internal grouping implementation and maps it onto a CoGBK.
>
> This requires the following modifications to the Beam code base:
>
> 1) Make GBK a composite transform in terms of CoGBK.
> 2) Move the CoGBK from contrib to runners-core as an adapter*. Runners that
> more naturally support GBK can just use this and everything executes
> exactly as before.
>
> *just like GroupByKeyViaGroupByKeyOnly and UnboundedReadFromBoundedSource
>


   

Re: [PROPOSAL] CoGBK as primitive transform instead of GBK

Posted by Kenneth Knowles <kl...@google.com.INVALID>.
+1

I assume that the intent is for the semantics of both GBK and CoGBK to be
unchanged, just swapping their status as primitives.

This seems like a good change, with strictly positive impact on users and
SDK authors, with only an extremely minor burden (doing an insertion of the
provided implementation in the worst case) on runner authors.

Kenn


On Wed, Jul 20, 2016 at 10:38 AM, Lukasz Cwik <lc...@google.com.invalid>
wrote:

> I would like to propose a change to Beam to make CoGBK the basis for
> grouping instead of GBK. The idea behind this proposal is that CoGBK is a
> more powerful operator then GBK allowing for two key benefits:
>
> 1) SDKs are simplified: transforming a CoGBK into a GBK is trivial while
> the reverse is not.
> 2) It will be easier for runners to provide more efficient implementations
> of CoGBK as they will be responsible for the logic which takes their own
> internal grouping implementation and maps it onto a CoGBK.
>
> This requires the following modifications to the Beam code base:
>
> 1) Make GBK a composite transform in terms of CoGBK.
> 2) Move the CoGBK from contrib to runners-core as an adapter*. Runners that
> more naturally support GBK can just use this and everything executes
> exactly as before.
>
> *just like GroupByKeyViaGroupByKeyOnly and UnboundedReadFromBoundedSource
>