Posted to user@beam.apache.org by Andrew Jones <an...@andrew-jones.com> on 2018/02/28 16:58:21 UTC
Running code before pipeline starts
Hi,
What is the best way to run code before the pipeline starts? Anything in the `main` function doesn't get called when the pipeline is run on Dataflow via a template - only the pipeline itself runs. If you're familiar with Spark, I'm thinking of the kind of code that would be run in the driver.
Alternatively, is there a way I can run part of a pipeline first, then run another part once it's completed? Not sure that makes sense, so to illustrate with a poor attempt at an ascii diagram, if I have something like this:
events
   /\
  /  \
 |    group by key
 |    |
 |    do some action
 |   /
 |  /
once action is complete,
process all original elements
I can presumably achieve this by having `do some action` generate either an empty side input or an empty PCollection, which I can then use to create a PCollectionList along with the original and pass to Flatten.pCollections() before continuing. Not sure if that's the best way to do it though.
Thanks,
Andrew
Re: Running code before pipeline starts
Posted by Andrew Jones <an...@andrew-jones.com>.
Thanks Lukasz, went with the side input approach and it worked
perfectly!
Re: Running code before pipeline starts
Posted by Lukasz Cwik <lc...@google.com>.
You should use a side input and not an empty PCollection that you flatten.
Since
ReadA --> Flatten --> ParDo
ReadB -/
can be equivalently executed as:
ReadA --> ParDo
ReadB --> ParDo
Make sure you access the side input in case a runner evaluates the side
input lazily.
So your pipeline would look like:
Create --> ParDo(DoAction) --> View.asSingleton() named X
... --> ParDo(ProcessElements).withSideInput(X) --> ...
An alternative would be to use CoGroupByKey to join the two streams since
it is not possible to split the execution like I showed with Flatten. It is
wasteful to add the CoGroupByKey but it is a lot less wasteful if you
convert a preceding GroupByKey in your pipeline into a CoGroupByKey joining
the two streams.