Posted to user@beam.apache.org by Mohamed Haseeb <m...@mohaseeb.com> on 2019/01/02 20:08:48 UTC

GroupByKey and number of workers

Hi,

As per the Authoring I/O Transforms guide
<https://beam.apache.org/documentation/io/authoring-overview/>, the
recommended way to implement a Read transform (from a source that can be
read in parallel) has these steps:
- Splitting the data into parts to be read in parallel (ParDo)
- Reading from each of those parts (ParDo)
- With a GroupByKey between the two ParDos
The stated motivation for the GroupByKey is that "it allows the runner to use
different numbers of workers" for the splitting and reading parts. Can
someone elaborate (or point to some relevant docs) on how the GroupByKey
enables using different numbers of workers for the two ParDo steps?
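To make the shape concrete, here is a minimal sketch of the pattern as I
understand it, using the Python SDK. split_source and read_shard are
hypothetical placeholder helpers for whatever a real IO would do (compute
byte ranges, list files, partitions, etc.):

    import apache_beam as beam

    class SplitFn(beam.DoFn):
        # Split a source description into shards that can be read
        # independently. split_source is a hypothetical helper.
        def process(self, source_desc):
            for i, shard in enumerate(split_source(source_desc)):
                yield (i, shard)  # key each shard so GroupByKey can redistribute it

    class ReadFn(beam.DoFn):
        # Read the records of the shards grouped under one key.
        # read_shard is likewise hypothetical.
        def process(self, keyed_shards):
            _, shards = keyed_shards
            for shard in shards:
                for record in read_shard(shard):
                    yield record

    with beam.Pipeline() as pipeline:
        records = (
            pipeline
            | 'SourceDesc' >> beam.Create(['my-source'])
            | 'Split' >> beam.ParDo(SplitFn())
            | 'Redistribute' >> beam.GroupByKey()
            | 'Read' >> beam.ParDo(ReadFn())
        )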

Thanks,
Mohamed

Re: GroupByKey and number of workers

Posted by Mohamed Haseeb <m...@mohaseeb.com>.
This explains it. Thanks Reza!


Re: GroupByKey and number of workers

Posted by Reza Ardeshir Rokni <ra...@gmail.com>.
Hi Mohamed,

I believe this is related to fusion, an optimization performed by some of the
runners: without the GroupByKey, the two ParDos would typically be fused into
a single stage and executed together on the same workers, whereas the shuffle
introduced by the GroupByKey breaks that fusion and lets the runner scale the
two stages independently. You can find more information on fusion at:

https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization
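As an aside, the Beam SDKs also ship a ready-made fusion break, Reshuffle,
which wraps this same key/group/ungroup trick. A minimal sketch in the Python
SDK (split_source and read_shard are hypothetical helpers for splitting a
source description and reading one shard):

    import apache_beam as beam

    # Reshuffle redistributes elements via an internal group-by-key, so the
    # steps before and after it can run as separate stages, each scaled to
    # a different number of workers.
    with beam.Pipeline() as pipeline:
        records = (
            pipeline
            | beam.Create(['my-source'])
            | 'Split' >> beam.FlatMap(split_source)  # hypothetical helper
            | 'FusionBreak' >> beam.Reshuffle()
            | 'Read' >> beam.FlatMap(read_shard)     # hypothetical helper
        )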

Cheers

Reza
