You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Lukasz Cwik <lc...@google.com.INVALID> on 2016/03/31 17:42:50 UTC

Re: [jira] [Commented] (BEAM-68) Support for limiting parallelism of a step

I believe you can guarantee the number of shards (with a more complex set
of transforms). You just need to figure out which shards are empty, and
force the write operation. We can have two implementations of write, one
which doesn't write when zero elements (the default), and one which does go
through the motions of doing the write for zero elements.

num shards is a parallel limit control, it doesn't scale already. The thing
we lose most is the ability to dynamically rebalance work if there is a
straggler.

overly restrictive implementation, this is one of those cases where you
have a composite ptransform which has a basic implementation using GBK
underneath the hood which runners can override if they can force the
parallelism constraint in a better way.


On Wed, Mar 30, 2016 at 10:28 PM, Daniel Halperin (JIRA) <ji...@apache.org>
wrote:

>
>     [
> https://issues.apache.org/jira/browse/BEAM-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219376#comment-15219376
> ]
>
> Daniel Halperin commented on BEAM-68:
> -------------------------------------
>
> Okay, I think I'm partially wrong.
>
> KV<K, Iterable<V>> -> ParDo(process all elements in a single DoFn with
> per-K startBundle/endBundle/etc) is doable as a solution to BEAM-92.
>    -It won't of course work with empty K, so you can't in fact guarantee
> numShards is matched.
>    -It won't scale.
>    -It overly restricts implementation.
> but I think it works, in essence, without a model change.
>
> Would you prefer to dupe 169 against 92? I don't see a need for more bug
> bloat here tho. Have suggested edits to the text of either bug that will
> fix?
>
> > Support for limiting parallelism of a step
> > ------------------------------------------
> >
> >                 Key: BEAM-68
> >                 URL: https://issues.apache.org/jira/browse/BEAM-68
> >             Project: Beam
> >          Issue Type: New Feature
> >          Components: beam-model
> >            Reporter: Daniel Halperin
> >
> > Users may want to limit the parallelism of a step. Two classic uses
> cases are:
> > - User wants to produce at most k files, so sets
> TextIO.Write.withNumShards(k).
> > - External API only supports k QPS, so user sets a limit of k/(expected
> QPS/step) on the ParDo that makes the API call.
> > Unfortunately, there is no way to do this effectively within the Beam
> model. A GroupByKey with exactly k keys will guarantee that only k elements
> are produced, but runners are free to break fusion in ways that each
> element may be processed in parallel later.
> > To implement this functionaltiy, I believe we need to add this support
> to the Beam Model.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>