Posted to dev@beam.apache.org by Shen Li <cs...@gmail.com> on 2016/06/24 19:13:27 UTC

How to control the parallelism when run ParDo on PCollection?

Hi,

The document says "when a ParDo transform is executed, the elements of the
input PCollection are first divided up into some number of bundles".

How do users control the number of bundles/parallelism? Or is it completely
up to the runner?

Thanks,

Shen

Re: How to control the parallelism when run ParDo on PCollection?

Posted by Shen Li <cs...@gmail.com>.
Hi Thomas,

Thanks for the follow-up.

Shen

On Fri, Jun 24, 2016 at 4:49 PM, Thomas Groh <tg...@google.com.invalid>
wrote:

> We do also have an active JIRA issue to support limiting parallelism on a
> per-step basis, BEAM-68
>
> https://issues.apache.org/jira/browse/BEAM-68
>
> As Kenn noted, this is not equivalent to controls over bundling, which is
> entirely determined by the runner.

Re: How to control the parallelism when run ParDo on PCollection?

Posted by Thomas Groh <tg...@google.com.INVALID>.
We also have an active JIRA issue, BEAM-68, to support limiting parallelism on a
per-step basis:

https://issues.apache.org/jira/browse/BEAM-68

As Kenn noted, this is not equivalent to control over bundling, which is
entirely determined by the runner.

On Fri, Jun 24, 2016 at 1:25 PM, Shen Li <cs...@gmail.com> wrote:

> Hi Kenn,
>
> Thanks for the explanation.
>
> Regards,
>
> Shen

Re: How to control the parallelism when run ParDo on PCollection?

Posted by Shen Li <cs...@gmail.com>.
Hi Kenn,

Thanks for the explanation.

Regards,

Shen

On Fri, Jun 24, 2016 at 4:09 PM, Kenneth Knowles <kl...@google.com.invalid>
wrote:

> Hi Shen,
>
> It is completely up to the runner how to divide things into bundles: it is
> one item of work that should fail or succeed atomically. Bundling limits
> parallelism, but does not determine it. For example, a streaming execution
> may have many bundles over time as elements arrive, regardless of
> parallelism.
>
> Kenn

Re: How to control the parallelism when run ParDo on PCollection?

Posted by Kenneth Knowles <kl...@google.com.INVALID>.
Hi Shen,

How to divide elements into bundles is completely up to the runner. A bundle is
one item of work that should fail or succeed atomically. Bundling limits
parallelism, but does not determine it. For example, a streaming execution
may have many bundles over time as elements arrive, regardless of
parallelism.

Kenn
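
Editorial illustration (not part of the original message): the distinction
between bundling and parallelism can be sketched with a toy model in plain
Python. This is not Beam code and does not reflect how any particular runner
works. The names make_bundles, run_step, bundle_size and max_workers are
hypothetical, chosen only for this sketch. It shows that the number of bundles
caps how many pieces of work can be in flight at once, while the worker count
is a separate knob that decides how many actually run together.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


def make_bundles(elements, bundle_size):
    """Split elements into consecutive bundles of at most bundle_size items.

    In this toy model the 'runner' picks bundle_size; the user does not.
    """
    return [elements[i:i + bundle_size]
            for i in range(0, len(elements), bundle_size)]


def run_step(bundles, fn, max_workers):
    """Apply fn to every element, one bundle per task.

    Returns (outputs_per_bundle, peak), where peak is the largest number
    of bundles that were being processed at the same moment.
    """
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def process(bundle):
        nonlocal in_flight, peak
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        try:
            # A bundle is handled as one unit of work.
            return [fn(x) for x in bundle]
        finally:
            with lock:
                in_flight -= 1

    # max_workers is the parallelism limit, independent of the bundling.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = list(pool.map(process, bundles))
    return outputs, peak


if __name__ == "__main__":
    bundles = make_bundles(list(range(10)), bundle_size=3)
    outputs, peak = run_step(bundles, lambda x: x * x, max_workers=2)
    print(len(bundles), "bundles; peak bundles in flight:", peak)
```

In this sketch the peak can never exceed min(len(bundles), max_workers), so
the bundling bounds the achievable parallelism without fixing it. Atomic
failure and retry of a bundle are not modeled here.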

On Fri, Jun 24, 2016 at 12:13 PM, Shen Li <cs...@gmail.com> wrote:

> Hi,
>
> The document says "when a ParDo transform is executed, the elements of the
> input PCollection are first divided up into some number of bundles".
>
> How do users control the number of bundles/parallelism? Or is it completely
> up to the runner?
>
> Thanks,
>
> Shen