Posted to user@beam.apache.org by "hsy541@gmail.com" <hs...@gmail.com> on 2023/12/28 01:43:55 UTC

ParDo(DoFn) with multiple context.output vs FlatMapElements

Hello

I have a question. Suppose I have a transform that emits one or more
outputs (to the same collection) for each input. I can implement it with
ParDo + DoFn, calling context.output multiple times in the processElement
method for each input, or with FlatMapElements. Is there any difference?
Does Dataflow fuse the downstream transform automatically? Eventually I
want more workers for the downstream transform because it needs to handle
more data. How am I supposed to do that?

Regards,
Siyuan

Re: ParDo(DoFn) with multiple context.output vs FlatMapElements

Posted by Robert Bradshaw via user <us...@beam.apache.org>.
There is no difference; FlatMapElements is implemented in terms of a
DoFn that invokes context.output multiple times. And, yes, Dataflow
will fuse consecutive operations automatically. So if you have
something like

... -> DoFnA -> DoFnB -> GBK -> DoFnC -> ...

Dataflow will fuse DoFnA and DoFnB together, and if DoFnA produces a
lot of data for DoFnB to consume then more workers will be allocated
to handle the (DoFnA + DoFnB) combination. If the fanout is so huge
that a single worker would not be expected to handle the output DoFnA
produces from a single input, you could look into making DoFnA into a
SplittableDoFn (see https://beam.apache.org/blog/splittable-do-fn-is-available).
If DoFnB is just really expensive, you can also decouple the
parallelism between the two with a Reshuffle. Most of the time neither
of these is needed.
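
To illustrate the first point (a flat-map style transform is just a DoFn that calls output once per returned element), here is a minimal plain-Java sketch. It is a model only and does not use the Beam SDK: SimpleDoFn, flatMapAsDoFn, run, and the two SPLIT_WORDS constants are hypothetical names invented for this example, and the "runner" is a single-threaded loop, so it says nothing about fusion or worker allocation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Plain-Java model only: none of these types are Beam classes.
class FlatMapSketch {

    // Hypothetical stand-in for a DoFn: handles one input and emits
    // zero or more outputs through the `output` callback.
    interface SimpleDoFn<I, O> {
        void processElement(I element, Consumer<O> output);
    }

    // Models how a flat-map style transform can be expressed as a DoFn:
    // call the user function, then emit each returned element separately.
    static <I, O> SimpleDoFn<I, O> flatMapAsDoFn(Function<I, ? extends Iterable<O>> fn) {
        return (element, output) -> {
            for (O value : fn.apply(element)) {
                output.accept(value);
            }
        };
    }

    // Toy single-threaded "runner": feeds every input to the DoFn and
    // collects whatever it emits, in order.
    static <I, O> List<O> run(List<I> inputs, SimpleDoFn<I, O> fn) {
        List<O> outputs = new ArrayList<>();
        for (I input : inputs) {
            fn.processElement(input, outputs::add);
        }
        return outputs;
    }

    // Hand-written DoFn that calls output once per word.
    static final SimpleDoFn<String, String> SPLIT_WORDS_DOFN = (line, output) -> {
        for (String word : line.split(" ")) {
            output.accept(word);
        }
    };

    // The same logic written as a function returning a collection.
    static final Function<String, List<String>> SPLIT_WORDS_FN =
        line -> Arrays.asList(line.split(" "));

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b", "c", "d e f");
        System.out.println(run(lines, SPLIT_WORDS_DOFN));
        System.out.println(run(lines, flatMapAsDoFn(SPLIT_WORDS_FN)));
    }
}
```

Both calls in main go through the same per-element output path, so in this model the two styles produce the same elements in the same order. In a real pipeline the choice between them is mostly a matter of readability.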
