Posted to dev@beam.apache.org by Sam McVeety <sg...@google.com.INVALID> on 2016/07/29 19:14:32 UTC

Proposal: Dynamic PipelineOptions

During the graph construction phase, the given SDK generates an initial
execution graph for the program.  At execution time, this graph is
executed, either locally or by a service.  Currently, Beam only supports
parameterization at graph construction time.  Both Flink and Spark supply
functionality that allows a pre-compiled job to be run with updated
runtime parameters, without further SDK interaction.

In its current incarnation, Dataflow can read values of PipelineOptions at
job submission time, but this requires the presence of an SDK to properly
encode these values into the job.  We would like to build a common layer
into the Beam model so that these dynamic options can be properly provided
to jobs.
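
For context, a minimal sketch of what construction-time parameterization
looks like in the Java SDK today (option, path, and interface names are
illustrative):

public interface MyOptions extends PipelineOptions {
  String getInputPath();
  void setInputPath(String value);
}

MyOptions options = PipelineOptionsFactory.fromArgs(args).as(MyOptions.class);
Pipeline p = Pipeline.create(options);
// The concrete value of --inputPath is baked into the graph here; changing
// it later means re-running this construction code with the SDK present.
p.apply(TextIO.Read.from(options.getInputPath()));
p.run();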

Please see
https://docs.google.com/document/d/1I-iIgWDYasb7ZmXbGBHdok_IK1r1YAJ90JG5Fz0_28o/edit
for the high-level model, and
https://docs.google.com/document/d/17I7HeNQmiIfOJi0aI70tgGMMkOSgGi8ZUH-MOnFatZ8/edit
for the specific API proposal.

Cheers,
Sam

Re: Proposal: Dynamic PipelineOptions

Posted by Sam McVeety <sg...@google.com.INVALID>.
Thank you for the feedback.  I'm not hearing objections, so I'll go ahead
and draft a PR for ValueSupplier (possibly renamed to ValueProvider).

Cheers,
Sam

Re: Proposal: Dynamic PipelineOptions

Posted by Sam McVeety <sg...@google.com.INVALID>.
We can probably build a "real" case around the TextIO boilerplate -- say
that a user wants to regularly run a Beam job with an input path that
changes from day to day.  TextIO would be modified to support a dynamic value:

TextIO.Read.withFilepattern(ValueSupplier<String> filepattern);

... and then the pipeline author would supply this via their own option:

interface MyPipelineOptions extends PipelineOptions {
  @Default.RuntimeValueSupplier("gs://bar")
  RuntimeValueSupplier<String> getInputPath();
  void setInputPath(RuntimeValueSupplier<String> value);
}

At this point, the same job graph could be reused with different values for
--inputPath.
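
To make that concrete, a hypothetical sketch of how the two pieces above
would fit together end to end (assuming RuntimeValueSupplier is a
ValueSupplier and that the TextIO change sketched above exists):

MyPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).as(MyPipelineOptions.class);
Pipeline p = Pipeline.create(options);
// The graph records the supplier rather than a concrete string; the actual
// path is only resolved when the job executes, so the same graph can be
// resubmitted with a different --inputPath.
p.apply(TextIO.Read.withFilepattern(options.getInputPath()));
p.run();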


Cheers,
Sam

Re: Proposal: Dynamic PipelineOptions

Posted by Ismaël Mejía <ie...@gmail.com>.
+1, it sounds really nice; (4) is by far the most consistent with the
current Options implementation.
One extra thing: maybe it is a good idea to sketch a 'real' use case to
make the concepts/need more evident.

Ismaël

Re: Proposal: Dynamic PipelineOptions

Posted by Sam McVeety <sg...@google.com.INVALID>.
Thanks, Amit and JB.  Amit, to your question: the intention in making these
values available to PTransforms is to provide the ValueProvider abstraction
(which may be implemented on top of PipelineOptions) so that transforms do
not take a direct dependency on PipelineOptions.
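
One possible shape for such an abstraction, purely as an illustration of the
decoupling (the actual interface is the subject of the linked API proposal):

public interface ValueProvider<T> {
  // Resolves the value; for a runtime-provided value this is only legal
  // once the pipeline is executing.
  T get();

  // Whether get() may be called yet; a statically-bound value is always
  // accessible, a runtime value only at execution time.
  boolean isAccessible();
}

// A trivial implementation for values known at construction time, so a
// PTransform can accept a ValueProvider without knowing whether the value
// comes from PipelineOptions or anywhere else.
class StaticValueProvider<T> implements ValueProvider<T> {
  private final T value;
  StaticValueProvider(T value) { this.value = value; }
  public T get() { return value; }
  public boolean isAccessible() { return true; }
}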

Cheers,
Sam

Re: Proposal: Dynamic PipelineOptions

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
+1

Thanks Sam, it sounds interesting.

Regards
JB

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Proposal: Dynamic PipelineOptions

Posted by Amit Sela <am...@gmail.com>.
+1 sounds like a good idea.

Spark's driver actually takes all dynamic parameters starting with "spark."
and puts them into SparkConf, which is then shipped to the Executors and
made available there via the environment's SparkEnv.
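
Roughly the following, for anyone not familiar with the Spark side (the
parameter name is made up):

// On the driver, spark-submit --conf spark.myapp.inputDate=2016-08-10
// surfaces the value through SparkConf without recompiling the job:
SparkConf conf = new SparkConf();
String inputDate = conf.get("spark.myapp.inputDate", "1970-01-01");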

I'm wondering, does this mean that PipelineOptions will be available to the
PTransform, or only the ValueSupplier (yes, (4) for me too, please)?

Thanks,
Amit

Re: Proposal: Dynamic PipelineOptions

Posted by Aljoscha Krettek <al...@apache.org>.
+1

It's true that Flink provides a way to pass dynamic parameters to operator
instances. That's not used in any of the built-in sources and operators,
however. They are instantiated with their parameters when the graph is
constructed. So what you are suggesting for Beam would actually provide
more functionality than what we currently have in Flink. :-)
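
For readers outside Flink, the mechanism referred to is presumably something
like the global job parameters facility (an assumption; names in the sketch
below are illustrative):

// At graph construction time, register parameters with the job:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
ParameterTool params = ParameterTool.fromArgs(args);
env.getConfig().setGlobalJobParameters(params);

// Inside an operator instance, read them back at execution time:
public class PrefixWithPath extends RichMapFunction<String, String> {
  @Override
  public String map(String value) throws Exception {
    ParameterTool p = (ParameterTool)
        getRuntimeContext().getExecutionConfig().getGlobalJobParameters();
    return p.get("inputPath", "/default/path") + "/" + value;
  }
}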

Out of the options I think (4) would be the best. (1) and (2) are not
type-safe, correct? And (3) seems very boilerplate-y.

Cheers,
Aljoscha

Re: Proposal: Dynamic PipelineOptions

Posted by Frances Perry <fj...@google.com.INVALID>.
+Amit, Aljoscha, Manu

Any comments from folks on the Flink, Spark, or Gearpump runners?

Re: Proposal: Dynamic PipelineOptions

Posted by Robert Bradshaw <ro...@google.com.INVALID>.
Being able to "late-bind" parameters like input paths to a
pre-constructed program would be a very useful feature, and I think is
worth adding to Beam.

Of the four API proposals, I have a strong preference for (4).
Further, it seems that these need not be bound to the PipelineOptions
object itself (i.e., a named RuntimeValueSupplier could be constructed
off of a pipeline object).  The Python API makes less heavy use of
PipelineOptions (encouraging the user to use familiar, standard libraries
for argument parsing), though of course such integration is useful to
provide for convenience.
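
Purely as an illustration of what "constructed off of a pipeline object"
might look like (the method and default value below are invented, not part
of any proposal):

Pipeline p = Pipeline.create(options);
// Hypothetical: a named runtime value registered on the pipeline itself,
// independent of any PipelineOptions interface.
RuntimeValueSupplier<String> inputPath =
    p.newRuntimeValue("inputPath", String.class, "gs://default-bucket/input");
p.apply(TextIO.Read.withFilepattern(inputPath));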

- Robert
