
Posted to dev@beam.apache.org by Ning Kang <ni...@google.com> on 2019/08/07 19:36:40 UTC

Brief of interactive Beam

To whom it may concern,

This is Ning from Google. We are currently working to build on the
interactive runner in the Beam Python SDK.

There is already an interactive Beam (iBeam for short) runner that works
with Jupyter notebooks in the repo:
<https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive>
Following the instructions on that page, one can set up an interactive
environment to develop and execute Beam pipelines interactively.

However, the existing iBeam has many issues. One is that it relies on the
concept of a leaf PCollection to cache and materialize intermediate
PCollections. If the user wants to reuse or introspect a non-leaf
PCollection, the interactive runner runs into errors.

Our initial effort will be to fix the existing issues. We also want to make
iBeam easy to use. Since iBeam uses the same model as Beam, there isn't
really any difference for users between creating a pipeline with the
interactive runner and with other runners.
So we want to minimize the interfaces a user needs to learn while still
giving the user some ability to interact with the interactive environment.

See this initial PR <https://github.com/apache/beam/pull/9278>. The
interactive_beam module will provide four main interfaces:

   - For advanced users who define a pipeline outside __main__, let them tell
   the current interactive environment where they define their pipeline:
   watch()
      - This is very useful for tests, where a pipeline can be defined in
      test methods.
      - If the user simply creates a pipeline in a Jupyter notebook or a
      plain Python script, they don't have to know or use this feature at
      all.
   - Let users create an interactive pipeline: create_pipeline()
      - By invoking create_pipeline(), the user gets a Pipeline object that
      works like any other Pipeline object created from apache_beam.Pipeline().
      - However, the pipeline object p does some extra interactive magic
      when p.run() is invoked.
      - For now, we'll support interactive execution for DirectRunner.
   - Let users run the interactive pipeline as a normal pipeline:
   run_pipeline()
      - In an interactive environment, a user only needs to add and execute
      one line of code, run_pipeline(pipeline), to execute any existing
      interactive pipeline object as a normal pipeline on a selected platform.
      - We'll probably support Dataflow only. Other implementations can be
      added though.
   - Let users introspect any intermediate PCollection they have a handle
   to: visualize()
      - If a user ever writes pcoll = p | "Some Transform" >>
      some_transform() ..., they can call visualize(pcoll) once the pipeline
      p has been executed.
      - p can be batch or streaming.
      - The visualization will be a plot of the data in the given
      PCollection as if it were materialized. If the PCollection is
      unbounded, the plot is dynamic.

The PR will implement 1 and 2.

We'll use https://issues.apache.org/jira/browse/BEAM-7923 as the top level
JIRA and add blocking JIRAs as development goes.

External Beam users will not need to worry about any of the underlying
implementation details.
Apart from the 4 interfaces above, they learn and write normal Beam code and
can execute the pipeline immediately when they are done with prototyping.

Ning.

Re: Brief of interactive Beam

Posted by Robert Bradshaw <ro...@google.com>.

Thanks for the note. Are there any associated documents worth sharing as
well? More below.

On Wed, Aug 7, 2019 at 9:39 PM Ning Kang <ni...@google.com> wrote:

> To whom may concern,
>
> This is Ning from Google. We are currently making efforts to leverage an
> interactive runner under python beam sdk.
>
> There is already an interactive Beam (iBeam for short) runner with jupyter
> notebook in the repo
> <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive>
> .
> Following the instructions on that page, one can set up an interactive
> environment to develop and execute Beam pipeline interactively.
>
> However, there are many issues with existing iBeam. One issue is that it
> uses a concept of leaf PCollection to cache and materialize intermediate
> PCollection. If the user wants to reuse/introspect a non-leaf PCollection,
> the interactive runner will run into errors.
>
> Our initial effort will be fixing the existing issues. And we also want to
> make iBeam easy to use. Since iBeam uses the same model Beam uses, there
> isn't really any difference for users between creating a pipeline with
> interactive runner and other runners.
> So we want to minimize the interfaces a user needs to learn while giving
> the user some capability to interact with the interactive environment.
>
> See this initial PR <https://github.com/apache/beam/pull/9278>, the
> interactive_beam module will provide mainly 4 interfaces:
>
>    - For advanced users who define pipeline outside __main__, let them
>    tell current interactive environment where they define their pipeline:
>    watch()
>       - This is very useful for tests where pipeline can be defined in
>       test methods.
>       - If the user simply creates pipeline in a Jupyter notebook or a
>       plain Python script, they don't have to know/use this feature at all.
>
>
This is for using visualize() below, or building further on the pipeline,
right?


>
>    - Let users create an interactive pipeline: create_pipeline()
>       - invoking create_pipeline(), the user gets a Pipeline object that
>       works as any other Pipeline object created from apache_beam.Pipeline()
>       - However, the pipeline object p, when invoking p.run(), does some
>       extra interactive magic.
>       - We'll support interactive execution for DirectRunner at this
>       moment.
>
How is this different than creating a pipeline with the interactive
runner? It'd be nice to reduce the number of new concepts a user needs to
know (and also reduce the number of changes needed to move from interactive
to non-interactive). Is there any need to limit this to the Direct runner?

>
>    - Let users run the interactive pipeline as a normal pipeline:
>    run_pipeline()
>       - In an interactive environment, a user only needs to add and
>       execute 1 line of code run_pipeline(pipeline) to execute any existing
>       interactive pipeline object as normal pipeline in any selected platform.
>       - We'll probably support Dataflow only. Other implementations can
>       be added though.
>
Again, how is this different than pipeline.run()? What features require
limiting this to only certain runners?

>
>    - Let users introspect any intermediate PCollection they have handler
>    to: visualize()
>       - If a user ever writes pcoll = p | "Some Transform" >>
>       some_transform() ..., they can visualize(pcoll) once the pipeline p is
>       executed.
>       - p can be batch or streaming
>       - The visualization will be some plot graph of data for the given
>       PCollection as if it's materialized. If the PCollection is unbounded, the
>       graph is dynamic.
>
> The PR will implement 1 and 2.
>
> We'll use https://issues.apache.org/jira/browse/BEAM-7923 as the top
> level JIRA and add blocking JIRAs as development goes.
>
> External Beam users will not worry about any of the underlying
> implementation details.
> Except the 4 interfaces above, they learn and write normal Beam code and
> can execute the pipeline immediately when they are done with prototyping.
>
> Ning.
>

Re: Brief of interactive Beam

Posted by Ning Kang <ni...@google.com>.

Hi Pablo,

Thanks for reviewing the doc.

> I think I can grasp some of the concepts, but it is not yet 100% clear to
> me why it's necessary to define a new abstraction to have interactivity.
> Could you elaborate?
>
It's not clear to me what the "new abstraction" you are mentioning is. But
if it means the new interactive_beam module, yes, I can elaborate.
Currently the interactive_runner module holds everything that makes a
pipeline interactive:

   - A cache manager instance that is shared within an interactive session.
   This is how past pipeline executions preserve hidden state for future
   pipelines until either
      - the pipeline object itself is re-evaluated (e.g., the user reruns
      or creates a new "p = beam.Pipeline(...)"), or
      - a new interactive session is started (e.g., the user restarts the
      IPython kernel in a notebook)
   - A pipeline graph renderer that uses graphviz to render the DOT
   representation of the pipeline

The disadvantages of doing this are:

   - The interactive_runner module, which an interactive Beam user needs to
   learn and create pipelines with, contains all the implementation details
   of the runner that a Beam user shouldn't have to care about.
      - If the user just wants to create a Beam pipeline with
      interactivity, we should simply provide a factory to them, instead of
      throwing an arbitrary new runner with 200+ lines of code at them and
      asking them to initialize it with a constructor.
   - The runner module contains too many components irrelevant to a runner,
   such as the cache manager or the pipeline graph renderer.
      - These components have nothing to do with Beam. They should be
      decoupled from the pipeline runner logic into utilities.
      - These components need a scope that spans multiple pipeline
      runs. So they are better suited to be components of the interactive
      session rather than of a runner attached to a single pipeline instance.
      - E.g., if I run a pipeline object with a different interactive
      runner instance, the interactivity should still work.
         - The relationship should look like: session --> interactivity +
         pipeline --> runner
         - But the current design heavily relies on: pipeline is session -->
         runner --> interactivity.
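
As a rough illustration of the "session --> interactivity + pipeline -->
runner" relationship, here is a minimal, self-contained sketch. SessionCache
and SketchRunner are hypothetical names made up for this example; they are
not the classes in the interactive runner or in the PR, and no Beam API is
used. The point is only that the cache is owned by the session, so a second
runner instance can reuse what the first one produced.

```python
class SessionCache:
    """Hypothetical cache that lives as long as the interactive session."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, key, compute):
        # Only compute when the session has not seen this key before.
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]


class SketchRunner:
    """Hypothetical runner: it executes work but holds no session state."""

    def __init__(self, name):
        self.name = name
        self.executions = 0

    def run(self, cache, key, compute):
        def tracked():
            # Count how often this runner really had to do the work.
            self.executions += 1
            return compute()

        return cache.get_or_compute(key, tracked)


session_cache = SessionCache()  # owned by the session, not by a runner
first = SketchRunner("first")
second = SketchRunner("second")

squares = first.run(session_cache, "squares", lambda: [n * n for n in range(4)])
again = second.run(session_cache, "squares", lambda: [n * n for n in range(4)])
```

Here the second runner returns the cached result without recomputing it,
which is the behavior the "different interactive runner instance" bullet
above asks for.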

And we want to add more features:

   - Visualization of PCollection
   - Run pipeline with other runners

If we put them into the interactive runner:

   - They don't really belong to a Beam runner's functionality
   - They don't fit into a runner's scope in an interactive environment
   - They add complexity to the implementation details of the interactive
   runner
   - Interactive or normal Beam users don't care about the details but we
   are throwing the code to them anyway

What we want to achieve:

   - Ease of use.
      - Interactive Beam users can immediately learn the features by
      reading the interactive_beam module code and pydoc, even if they opt
      out of reading any user guide or examples, which users always do.
      - Users only need to learn this one module. The interactive_beam
      module hides all implementation details from end users. Users will see
      several static methods with tons of documentation but only a few lines
      of code.
   - We only maintain the usability of these exposed features, no matter how
   the underlying implementation evolves.

What we are doing will reduce the things a user needs to learn when using
Interactive Beam.

> I can see what's the need for the watch.

As explained in the document, code and unit tests, if you define a pipeline
in __main__, you don't need to use it. So if a user writes a pipeline
directly in notebook cells, they don't need to call watch().
However, we need a way for users to tell us where their pipelines are if
they define pipelines in other places.
Suppose I'm a user who defines a function and runs it in __main__:

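import apache_beam as beam

# watch() comes from the proposed interactive_beam module; tx() and the
# "..." lines are placeholders for user code.
def run_pipeline():
    # A user could put this line anywhere within this function to apply
    # interactivity even in a local scope.
    watch(locals())
    p = beam.Pipeline()
    ...
    x = p | "TX" >> tx()
    ...
    p.run()
    ...

run_pipeline()
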
There is no way for us to make it interactive automatically because the
pipeline is defined in a local scope. (It's even worse with the original
interactive runner implementation, because the scope is local all the way
down into the p object.)
However, in an interactive environment, users can add code anywhere and
execute an arbitrary subset of the code from anywhere.
If users add more statements to transform or examine data in the "..." above
and only re-execute those parts, watch() is the least amount of code a user
needs to tell Interactive Beam to support their pipeline, because all the
hidden states are then stored within the global scope of the interactive
session.
It's already unintuitive to define something in a function and expect
interactivity in the outer scope: in interactive Python, you don't
really have access to p or x there. But with watch(), we can support
interactive pipelines defined anywhere in user code.
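
To make the idea of watching a local scope a bit more concrete, here is a
small sketch of how a session-level registry could discover pipelines from a
watched namespace. Everything in it is an assumption for illustration:
InteractiveEnvironment, find_pipelines and the stand-in Pipeline class are
hypothetical and are not the code in the PR. The sketch calls watch() after
p has been assigned and copies the mapping, because what locals() returns
inside a function differs between Python versions.

```python
class Pipeline:
    """Stand-in for beam.Pipeline so the sketch runs without Beam installed."""


class InteractiveEnvironment:
    """Hypothetical session-level registry of watched namespaces."""

    def __init__(self):
        self._watched = []

    def watch(self, namespace):
        # Copy the mapping so we keep exactly what exists at call time.
        self._watched.append(dict(namespace))

    def find_pipelines(self):
        # Collect {variable name: pipeline} across every watched namespace.
        found = {}
        for namespace in self._watched:
            for name, value in namespace.items():
                if isinstance(value, Pipeline):
                    found[name] = value
        return found


env = InteractiveEnvironment()


def build_pipeline():
    p = Pipeline()
    env.watch(locals())  # called after p exists so the snapshot contains it
    return p


pipeline = build_pipeline()
watched = env.find_pipelines()
```

After build_pipeline() returns, the session can still reach the pipeline that
was bound to the local name p, even though that name is no longer in scope.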

> Can you also tell us more about how a user would use visualize? Do they
> pass the kind of plot to have?

There will be a separate design doc for that. I can give you an internal
link to it, and an external link will be available once it has been
reviewed internally.
But we do want to support the minimal-argument usage:
  visualize(pcoll)
This will automagically plot something for the PCollection object pcoll.
Additionally, we can have a set of optional arguments specifying:

   1. Which columns of data from pcoll a user wants to render
   2. What the dimensions of the rendered data will be, so multiple columns
   can be aggregated into a single dimension
   3. Schema and metadata, which can be supplied to help render legends,
   axes, titles, etc.
   4. The plotting type, selected from an enum
   5. By default, columns, dimensions, schema and plotting type (which can be
   a composite of scatter, line and bar charts) can be deduced automatically
   from the pcoll element type.
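
Purely as an illustration of item 5, the default deduction could look
something like the sketch below. This is not taken from the design doc: the
function name deduce_plot_spec, the assumption that elements are dicts with
a shared set of keys, the rule for picking a plot type, and the "table"
fallback are all made up for this example.

```python
from numbers import Number


def deduce_plot_spec(elements):
    """Guess columns and a plot type from a sample of materialized elements.

    Hypothetical heuristic: elements are assumed to be dicts sharing the
    same keys. Two or more numeric columns suggest a scatter plot, one
    numeric column a bar chart, and none falls back to a plain table.
    """
    if not elements:
        return {"columns": [], "numeric": [], "plot_type": None}

    columns = sorted(elements[0].keys())

    def is_numeric(value):
        # bool is a Number subclass in Python, so exclude it explicitly.
        return isinstance(value, Number) and not isinstance(value, bool)

    numeric = [c for c in columns if all(is_numeric(e[c]) for e in elements)]

    if len(numeric) >= 2:
        plot_type = "scatter"
    elif len(numeric) == 1:
        plot_type = "bar"
    else:
        plot_type = "table"
    return {"columns": columns, "numeric": numeric, "plot_type": plot_type}


sample = [{"word": "a", "count": 3}, {"word": "b", "count": 5}]
spec = deduce_plot_spec(sample)
```

The optional arguments in items 1 to 4 would then simply override whatever
such a default deduction produces.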


Ning.

On Wed, Aug 14, 2019 at 7:10 PM Pablo Estrada <pa...@google.com> wrote:

> Hi Ning!
> Thanks for the design doc and the explanations.
>
> I think I can grasp some of the concepts, but it is not yet 100% clear to
> me why it's necessary to define a new abstraction to have interactivity.
> Could you elaborate? Perhaps as a section in the  doc? : )
>
> A lot of the motivation for this doc seems related to how we decide which
> PCollections to cache - so as to avoid rerunning parts of a pipeline
> whenever a user decides to visualize specific parts. I think that makes
> sense (and probably helps to have interactivity on streaming).
>
> I agree that it's a little odd that InteractiveRunner receives an
> underlying runner. That certainly suggests that the functionality is
> orthogonal.
>
> So, in short: I think my feedback is similar to others: Can you justify
> further (or reconsider) why pipeline creation and execution need to be
> different?
>
> I can see what's the need for the watch. Can you also tell us more about
> how a user would use visualize? Do they pass the kind of plot to have?
>
> Thanks!
> -P.
>

Re: Brief of interactive Beam

Posted by Pablo Estrada <pa...@google.com>.

Hi Ning!
Thanks for the design doc and the explanations.

I think I can grasp some of the concepts, but it is not yet 100% clear to
me why it's necessary to define a new abstraction to have interactivity.
Could you elaborate? Perhaps as a section in the doc? : )

A lot of the motivation for this doc seems related to how we decide which
PCollections to cache - so as to avoid rerunning parts of a pipeline
whenever a user decides to visualize specific parts. I think that makes
sense (and probably helps to have interactivity on streaming).

I agree that it's a little odd that InteractiveRunner receives an
underlying runner. That certainly suggests that the functionality is
orthogonal.

So, in short: I think my feedback is similar to others: Can you justify
further (or reconsider) why pipeline creation and execution need to be
different?

I can see what's the need for the watch. Can you also tell us more about
how a user would use visualize? Do they pass the kind of plot to have?

Thanks!
-P.

On Wed, Aug 14, 2019 at 12:03 PM Ning Kang <ni...@google.com> wrote:

> Q1:
> The document is shared (
> https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing).
> If inside Google, short link (go/ibeam-external
> <https://goto.google.com/ibeam-external>). I cannot share internal
> documents, but you can reach out if you need internal engineering plan.
>
> Q2:
> Yes, watch() is optimization used for using visualization() and building
> further on the pipeline. And the user doesn't need to call it if they
> simply define the pipeline in the notebook.
>
> Q3 and Q4:
> I'm only focusing on direct runner as underlying runner. We'll get rid of
> many of existing interactive Beam implementation. We can't provide
> portability for interactivity. Users can run the pipeline with other
> runners though due to the pipeline portability.
> Our work is to reduce the new concepts a user needs to know when they want
> to run interactive Beam. The implementation could be arbitrarily
> complicated and open sourced though. Currently, the interactive runner
> looks like as if it's supporting all kinds of underlying runners. We want
> to rid of it too.
>

Re: Brief of interactive Beam

Posted by Ning Kang <ni...@google.com>.

Q1:
The document is shared (https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing). If you are inside Google, use the short link (go/ibeam-external). I cannot share internal documents, but you can reach out if you need the internal engineering plan.

Q2:
Yes, watch() is an optimization used for visualize() and for building further on the pipeline. The user doesn't need to call it if they simply define the pipeline in the notebook.

Q3 and Q4:
I'm only focusing on the direct runner as the underlying runner. We'll get rid of much of the existing interactive Beam implementation. We can't provide portability for interactivity. Users can still run the pipeline with other runners, though, thanks to pipeline portability.
Our work is to reduce the number of new concepts a user needs to know when they want to run interactive Beam. The implementation could be arbitrarily complicated and open sourced, though. Currently, the interactive runner looks as if it supports all kinds of underlying runners. We want to get rid of that too.

On 2019/08/08 00:01:06, Ahmet Altay <al...@google.com> wrote: 
> Ning, thank you for the heads up.
> 
> All, this is a proposed work for improving interactive Beam experience. As
> mentioned in Ning's email, new concepts are being introduced. And in
> addition iBeam as a name is used as a new reference. I hope that bringing
> the discussion to the mailing list will give it the additional
> visibility and more people could share their feedback.
> 
> (cc'ing a few folks that might be interested +Robert Bradshaw
> <ro...@google.com> +Valentyn Tymofieiev <va...@google.com> +Sindy Li
> <qi...@google.com> +Brian Hulette <bh...@google.com> )
> 
> Ahmet
> 
> 
> On Wed, Aug 7, 2019 at 12:36 PM Ning Kang <ni...@google.com> wrote:
> 
> > To whom may concern,
> >
> > This is Ning from Google. We are currently making efforts to leverage an
> > interactive runner under python beam sdk.
> >
> > There is already an interactive Beam (iBeam for short) runner with jupyter
> > notebook in the repo
> > <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive>
> > .
> > Following the instructions on that page, one can set up an interactive
> > environment to develop and execute Beam pipeline interactively.
> >
> > However, there are many issues with existing iBeam. One issue is that it
> > uses a concept of leaf PCollection to cache and materialize intermediate
> > PCollection. If the user wants to reuse/introspect a non-leaf PCollection,
> > the interactive runner will run into errors.
> >
> > Our initial effort will be fixing the existing issues. And we also want to
> > make iBeam easy to use. Since iBeam uses the same model Beam uses, there
> > isn't really any difference for users between creating a pipeline with
> > interactive runner and other runners.
> > So we want to minimize the interfaces a user needs to learn while giving
> > the user some capability to interact with the interactive environment.
> >
> > See this initial PR <https://github.com/apache/beam/pull/9278>, the
> > interactive_beam module will provide mainly 4 interfaces:
> >
> >    - For advanced users who define pipeline outside __main__, let them
> >    tell current interactive environment where they define their pipeline:
> >    watch()
> >       - This is very useful for tests where pipeline can be defined in
> >       test methods.
> >       - If the user simply creates pipeline in a Jupyter notebook or a
> >       plain Python script, they don't have to know/use this feature at all.
> >    - Let users create an interactive pipeline: create_pipeline()
> >       - invoking create_pipeline(), the user gets a Pipeline object that
> >       works as any other Pipeline object created from apache_beam.Pipeline()
> >       - However, the pipeline object p, when invoking p.run(), does some
> >       extra interactive magic.
> >       - We'll support interactive execution for DirectRunner at this
> >       moment.
> >    - Let users run the interactive pipeline as a normal pipeline:
> >    run_pipeline()
> >       - In an interactive environment, a user only needs to add and
> >       execute 1 line of code run_pipeline(pipeline) to execute any existing
> >       interactive pipeline object as normal pipeline in any selected platform.
> >       - We'll probably support Dataflow only. Other implementations can
> >       be added though.
> >    - Let users introspect any intermediate PCollection they have a handle
> >    to: visualize()
> >       - If a user ever writes pcoll = p | "Some Transform" >>
> >       some_transform() ..., they can visualize(pcoll) once the pipeline p is
> >       executed.
> >       - p can be batch or streaming
> >       - The visualization will be some plot graph of data for the given
> >       PCollection as if it's materialized. If the PCollection is unbounded, the
> >       graph is dynamic.
> >
> > The PR will implement 1 and 2.
> >
> > We'll use https://issues.apache.org/jira/browse/BEAM-7923 as the top
> > level JIRA and add blocking JIRAs as development goes.
> >
> > External Beam users will not worry about any of the underlying
> > implementation details.
> > Except the 4 interfaces above, they learn and write normal Beam code and
> > can execute the pipeline immediately when they are done with prototyping.
> >
> > Ning.
> >
>

Re: Brief of interactive Beam

Posted by Robert Bradshaw <ro...@google.com>.

On Fri, Aug 23, 2019 at 4:25 PM Ning Kang <ni...@google.com> wrote:

> On Aug 23, 2019, at 3:09 PM, Robert Bradshaw <ro...@google.com> wrote:
>
> Cool, sounds like we're getting closer to the same page. Some more replies
> below.
>
> On Fri, Aug 23, 2019 at 1:47 PM Ning Kang <ni...@google.com> wrote:
>
>> Thanks for the feedback, Robert! I think I got your idea.
>> Let me summarize it to see if it’s correct:
>> 1. You want everything about
>>
>> standard Beam concepts
>>
>>  to follow existing pattern: so we can shoot down create_pipeline() and
>> keep the InteractiveRunner notion when constructing pipeline, I agree with
>> it. A runner can delegate another runner, also agreed. Let’s keep it that
>> way.
>>
>
> Despite everything I've written, I'm not convinced that exposing this as a
> Runner is the most intuitive way to get interactivity either. Given that
> the "magic" of interactivity is being able to watch PCollections (for
> inspection and further construction), and if no PCollections are watched
> execution proceeds as normal, what are your thoughts about making all
> pipelines "interactive" and just doing the magic iff there are PCollections
> to watch? (The opt-in incantation here would be ibeam.watch(globals()) or
> similar.)
>
> FWIW, Flume has something similar (called marking collections as to be
> materialized). It has its pros and cons.
>
> By default __main__ is watched, similar to the watch(globals()). If no
> PCollection variable is being watched, it’s not doing any magic.
> I’m not sure about making all pipelines “interactive” such as by adding an
> “interactive=True/False” option when constructing pipeline.
>

My point was that watch(globals()) (or anything else) would be the explicit
opt-in to interactivity, instead of doing interactive=True or manually
constructing an InteractiveRunner or anything else.


> Since we couldn’t decide which one is more intuitive, I would stick to the
> existing InteractiveRunner constructor that is open sourced.
> And we try to avoid changing any code outside …/runners/interactive/.
>
Yes, we can stick with what's already there for now to avoid blocking any
implementation work.

> 2. watch() and visualize() can be in the independent interactive beam
>> module since they are
>>
>> concepts that are unique to being interactive
>>
>> 3. I'll add some example for the run_pipeline() in design doc. The short
>> answer is run_pipeline() != p.run(). Thanks for sharing the doc (
>> https://s.apache.org/no-beam-pipeline).
>> As described in the doc, when constructing the pipeline, we still want to
>> bundle a runner and options to the constructed pipeline even in the future.
>> So if the runner is InteractiveRunner, the interactivity instrument
>> (implicitly applied read/write cache PTransform and input/output wiring) is
>> only applied when "run_pipeline()" of the runner implementation is invoked.
>> p.run() will apply the instrument. However, this static function
>> run_pipeline() takes in a new runner and options,
>> invoking “run_pipeline()” implementation of the new runner and wouldn’t
>> have the instrument, thus no interactivity.
>> Because you cannot (don’t want to, as seen in the doc, users cannot
>> access the bundled pipeline/options in the future) change the runner easily
>> without re-executing all the notebook cells, this shorthand function allows
>> a user to run pipeline without interactivity immediately anywhere in a
>> notebook. In the meantime, the pipeline is still bundled with the original
>> Interactive Runner. The users can keep developing further pipelines.
>> The usage of this function is not intuitive until you put it in a
>> notebook user scenario where users develop, test in prod-like env and
>> develop further. And it’s equivalent to users writing
>> "from_runner_api(to_runner_api(pipeline))” in their notebook. It’s just a
>> shorthand.
>>
>
> What you're trying to work around here is the flaw in the existing API
> that a user binds the choice of Runner before pipeline construction, rather
> than at the point of execution. I propose we look at fixing this in Beam
> itself.
>
> Then I would propose not exposing this. If late runner binding is
> supported, we wouldn’t even need this. We can write it in an example
> notebook rather than exposing it.
>

Sounds good.
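
To make the "late runner binding" point concrete, here is a minimal sketch in plain Python. It is not Beam's API: ToyPipeline, DirectLikeRunner, RecordingRunner and run() are hypothetical names, and the pipeline is just a list of element-wise functions. The point it illustrates is that the pipeline holds no runner, so the same object can be handed to different runners at execution time without being rebuilt.

```python
class ToyPipeline:
    """Toy pipeline: an ordered list of (label, function) steps, no runner bound."""

    def __init__(self):
        self.steps = []

    def apply(self, label, fn):
        self.steps.append((label, fn))
        return self


class DirectLikeRunner:
    """Hypothetical runner that applies each step to every element in order."""

    def run_pipeline(self, pipeline, data):
        for _, fn in pipeline.steps:
            data = [fn(x) for x in data]
        return data


class RecordingRunner(DirectLikeRunner):
    """Hypothetical second runner; records step labels, then runs the same way."""

    def run_pipeline(self, pipeline, data):
        self.seen_labels = [label for label, _ in pipeline.steps]
        return super().run_pipeline(pipeline, data)


def run(pipeline, runner, data):
    """The runner is chosen here, at the point of execution."""
    return runner.run_pipeline(pipeline, data)


p = ToyPipeline().apply("double", lambda x: x * 2).apply("increment", lambda x: x + 1)
first = run(p, DirectLikeRunner(), [1, 2])
recorder = RecordingRunner()
second = run(p, recorder, [1, 2])  # same pipeline object, different runner
```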


> 4. And we both agree that implicit cache is palatable and should be the
>> only thing we use to support interactivity. Cache and watched pipeline
>> definition (which tells us what to cache) are the main “hidden state” I
>> meant. Because the cache mechanism is totally implicit and hidden from the
>> user. A cache is either read or written in a p.run(). If an existing cache
>> is not used in a p.run(), it expires. If the user restarts the IPython
>> kernel, all cache should expire too.
>>
>
> Depending on how we label items in the cache, they could survive kernel
> restarts as well. This relates to another useful feature in Beam where if a
> batch pipeline fails towards the end, one may want to resume/rerun it from
> there after fixing the bug without redoing all the work.
>
> I would suggest kernel restart resetting everything. The lifespan of the
> cache or any interactivity shouldn’t exceed the kernel session. And the
> lifespan of a PCollection cache shouldn’t even exceed 2 consecutive
> pipeline runs if the second run doesn’t use or produce it. After all,
> the resources of a running notebook are limited. We might even need cache
> eviction or full pipeline re-execution when memory or disk space runs
> out.
>

There are pros and cons, but generally the user experience will be better
the more that is cached (even across sessions--history preserved across
sessions is one of the big wins of iPython vs. the built in prompt) up to
the point where resource constraints become prohibitive.
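
For comparison, the stricter expiry rule proposed in the quoted text (an entry expires once a run neither reads nor writes it) could look roughly like the sketch below. RunScopedCache and finish_run() are hypothetical names for illustration only, not part of the existing InteractiveRunner.

```python
class RunScopedCache:
    """Toy cache whose entries survive only while consecutive runs touch them."""

    def __init__(self):
        self._entries = {}

    def finish_run(self, read, written):
        """Store what the run wrote, then expire everything the run did not touch."""
        self._entries.update(written)
        touched = set(read) | set(written)
        for key in list(self._entries):
            if key not in touched:
                del self._entries[key]

    def keys(self):
        return sorted(self._entries)


cache = RunScopedCache()
cache.finish_run(read=[], written={"a": [1], "b": [2]})   # run 1 produces a and b
after_first_run = cache.keys()
cache.finish_run(read=["a"], written={"c": [3]})          # run 2 reuses a, adds c
after_second_run = cache.keys()                           # b was untouched and expired
```

Retaining more, as argued above, would amount to dropping or relaxing the eviction loop and bounding the cache by available resources instead.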


>    Existing InteractiveRunner has the following portability
>> <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive#portability>.
>> That’s why I said the interactivity (implementation) needs to be tailored
>> for different underlying runners.
>>    If we allow users to pass in all kinds of underlying runners (even
>> their in-house ones), we have to support the interactivity for all of them
>> which we probably don't. That’s why we wanted a create_pipeline() wrapper
>> so that in notebook, when building a pipeline, bundle to DirectRunner by
>> default.
>>    The focus on the Direct Runner is also related to our objective: we
>> want to provide easy-to-use notebook and some notebook environment where
>> users can interactively execute pipelines without worrying about setup
>> (especially when the setup is not Beam but Interactive Beam related).
>> 6. We don’t fix typo for user defined transforms
>>
>> I'm talking about pruning like having a cell with
>>
>>     pcoll | beam.Map(lambda x: expression_with_typo)
>>
>> and then fixing it (and re-evaluating) with
>>
>>     pcoll | beam.Map(lambda x: expression_with_typo_fixed)
>>
>> where the former Map would *always* fail and never get removed from the
>> pipeline.
>>
>> We never change the pipeline defined by the user. Interactivity is
>> applied to a copy of user defined pipeline.
>>
>
> Sure. But does the executed (copy) of the pipeline contain the bad Map
> operation in it? If so, it in essence "poisons" the entire pipeline,
> forcing a user to re-create and re-define it from the start to make forward
> progress (which results in quite a poor user experience--the errors in cell
> N manifest in cell M, but worse fixing and re-executing cell N doesn't fix
> cell M). If not, how is it intelligently excluded (and in a way that is not
> too dis-similar from non-interactive mode, and doesn't cause surprises with
> the p.run vs. run_pipeline difference)?
>
>
> The copy is always one-shot. When p.run() is called, a copy is instrumented
> with additional PTransforms and some re-wiring. It’s executed by the
> underlying runner. And then it’s gone.
> I think I know what you are saying here. It’s similar to re-executing a
> cell with anonymous PTransform. Should the user expect transforms in
> parallel (no pruning) or the original transforms replaced (with pruning)?
> If the bad transform is in the middle of several statements in a cell,
> there is no rollback. The user could re-execute the cell but cannot replace
> the failed/bad PTransform or remove succeeded/good PTransforms that are
> duplicated.
> Re-executing such cells after fixing the typo would “append” several
> PTransforms again, but those not needed anymore are still dangling branches
> in the pipeline.
> It might not affect the correctness of the pipeline, but does change the
> shape of the pipeline from the user’s expectation if they expect a
> replacement.
> It could be a great feature. +mnazir@google.com. I would propose
>   1. Either not implementing it for now because the behavior is consistent
> with building pipeline in non-interactive mode, and such replacement might
> not always be valid (see next answer).
>   2. Or if we decide that the interactive mode in this situation will have
> different behavior than non-interactive mode, we can implement it, try
> replacing the PTransforms (by appending new and pruning old) if the
> input/output wiring is compatible and print some warning (succeed or fail)
> saying it’s interactive mode.
> I prefer 1. We should not try too hard to implicitly change the user’s
> pipeline: users familiar with Beam and IPython expect no pruning, users
> unfamiliar with them may expect pruning, and we cannot guarantee that
> pruning will happen when the new transform is not compatible with the one
> with the typo. A user could re-execute a cell 5 times, but we cannot tell
> whether the user wants the same transform applied 5 times to the same
> input (possibly useless, but valid) or has made a valid change and changed
> their mind 5 times. I still like the behavior where “executing a cell” is
> equivalent to “appending the code for execution” in the IPython session.
> What’s executed is executed. We shouldn’t take the metadata of source code
> changes in an executed cell into the pipeline construction process.
> But I’m open to 2, taking the best-effort route and notifying users.
>

I don't see any good ways to resolve the ambiguity with (2), so I think
we're stuck with (1), at least for v1.
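
The consequence of option (1) can be shown with a toy model in plain Python. ToyPipeline below is hypothetical and only mimics the behavior described in this thread (named transforms cannot be re-applied, anonymous ones are appended); it is not the apache_beam Pipeline class, and the error message is illustrative.

```python
class ToyPipeline:
    """Append-only toy pipeline; every transform branches off the same input."""

    def __init__(self):
        self.transforms = []  # list of (label, fn), never pruned
        self._anonymous_count = 0

    def apply(self, fn, label=None):
        if label is None:
            # Anonymous transforms always get a fresh label, so they pile up.
            self._anonymous_count += 1
            label = f"anonymous_{self._anonymous_count}"
        elif any(existing == label for existing, _ in self.transforms):
            raise RuntimeError(f"A transform with label {label!r} already exists.")
        self.transforms.append((label, fn))

    def run(self, element):
        # One failing branch fails the whole run.
        return {label: fn(element) for label, fn in self.transforms}


p = ToyPipeline()
p.apply(lambda x: x / 0)   # cell N, first execution: contains a typo
p.apply(lambda x: x / 2)   # cell N, fixed and re-executed: appended, not replaced

try:
    p.run(10)
    poisoned = False
except ZeroDivisionError:
    poisoned = True  # the stale transform still fails every run
```

With no pruning, the only way forward in this model is to construct a new pipeline object and re-apply the good transforms.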


>
>
> 7.
>>
>> One approach was that the pipeline construction is re-executed every time
>> (i.e the "pipeline" object to run is really a callback, like a callable or
>> a PTransform) and then there's no ambiguity here.
>>
>> I didn’t quite get it. Pipeline construction only happens when a user
>> executes a cell with the pipeline construction code.
>> Are you suggesting changing the logic in pipeline.apply() to always
>> reapply/replace a NamedPTransform? I don’t think we (Interactive Beam) can
>> decide that because it changes the behavior of Beam.
>> We had some thought of subclassing pipeline and use the create_pipeline()
>> method to create the subclassed pipeline object. Then intercept the
>> pipeline.apply() to always replace PTransform with existing full label and
>> apply logic of parent pipeline’s apply() logic.
>> It seems to be a no go to me now.
>>
>
> Sorry I wasn't clear. I'm referring to the style in the above doc about
> getting rid of the Pipeline object (as a long-lived thing at least). In
> this case the actual execution of pipeline construction never spans
> multiple cells (though its implementation might via function calls) so one
> never has out-of-date transforms dangling off the pipeline object.
>
> I think the desired experience is allowing users to define their pipelines
> across multiple cells or even multiple notebooks. Because the users can
> build pipeline, run, add more PTransforms and run. And the pipeline object
> is long lived like any other variables in notebook until it’s re-evaluated.
>

I'm not convinced a long-lived pipeline is going to be a good experience
due to the append-only, poisoning issues described above, and would be
interested in an experience that avoids this.

> The scenario we both feel is not intuitive is “re-execute some
> previous cells with modified PTransforms and run”. It isn’t always valid to
> replace a PTransform in a fully grown pipeline (so I think the existing
> behavior, raising an error for a named PTransform and appending more for an
> anonymous PTransform, is still the best option). Re-executing cells is going
> to append more PTransforms. In that case, users have to re-execute all cells
> from the beginning to construct a new pipeline object. It’s like defining, in
> a notebook, a class Foo with a string field x and other fields lazily
> initialized from x. Then in a cell, the user sets x to an integer.
> Syntactically, the code still works (in Python). But when constructing the
> Foo object, lazily initialized fields cannot be evaluated.
>

 Yep, and in this case a good design would be to not make x a public(ly
settable) member, and definitely not encourage patterns that require
setting x after construction. I'm trying to similarly avoid error-prone
patterns here. But it's possible that this issue won't be resolved until we
have enough of the infrastructure to play around with it and get even more
hands-on and third-party experience.
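
The Foo analogy and the suggested alternative can be written out as follows. This is a sketch only; the field name upper and the class SaferFoo are made up for illustration.

```python
class Foo:
    """x is publicly settable; `upper` is lazily derived from it."""

    def __init__(self, x):
        self.x = x
        self._upper = None

    @property
    def upper(self):
        if self._upper is None:
            self._upper = self.x.upper()  # assumes x is still a string
        return self._upper


class SaferFoo:
    """x is validated once and read-only afterwards."""

    def __init__(self, x):
        if not isinstance(x, str):
            raise TypeError("x must be a string")
        self._x = x

    @property
    def x(self):
        return self._x  # no setter, so assignment raises AttributeError

    @property
    def upper(self):
        return self._x.upper()


foo = Foo("beam")
foo.x = 42  # syntactically fine, like re-executing an earlier cell
try:
    foo.upper
    lazy_field_failed = False
except AttributeError:
    lazy_field_failed = True  # the failure only shows up later, on first use

safe = SaferFoo("beam")
try:
    safe.x = 42
    assignment_rejected = False
except AttributeError:
    assignment_rejected = True  # the error surfaces where the mistake is made
```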

Re: Brief of interactive Beam

Posted by Ning Kang <ni...@google.com>.

Great! I’ve added some proposals. And to clarify, we want to use explicit name “Interactive Beam” instead of the short name “iBeam”. Sorry for the confusion.

> On Aug 23, 2019, at 3:09 PM, Robert Bradshaw <ro...@google.com> wrote:
> 
> Cool, sounds like we're getting closer to the same page. Some more replies below. 
> 
> On Fri, Aug 23, 2019 at 1:47 PM Ning Kang <ningk@google.com <ma...@google.com>> wrote:
> Thanks for the feedback, Robert! I think I got your idea. 
> Let me summarize it to see if it’s correct: 
> 1. You want everything about 
>> 
>>> standard Beam concepts
>  to follow existing pattern: so we can shoot down create_pipeline() and keep the InteractiveRunner notion when constructing pipeline, I agree with it. A runner can delegate another runner, also agreed. Let’s keep it that way.
> 
> Despite everything I've written, I'm not convinced that exposing this as a Runner is the most intuitive way to get interactivity either. Given that the "magic" of interactivity is being able to watch PCollections (for inspection and further construction), and if no PCollections are watched execution proceeds as normal, what are your thoughts about making all pipelines "interactive" and just doing the magic iff there are PCollections to watch? (The opt-in incantation here would be ibeam.watch(globals()) or similar.)
> 
> FWIW, Flume has something similar (called marking collections as to be materialized). It has its pros and cons. 
By default __main__ is watched, similar to the watch(globals()). If no PCollection variable is being watched, it’s not doing any magic.
I’m not sure about making all pipelines “interactive” such as by adding an “interactive=True/False” option when constructing pipeline. Since we couldn’t decide which one is more intuitive, I would stick to the existing InteractiveRunner constructor that is open sourced.
And we try to avoid changing any code outside …/runners/interactive/.

>  
> 2. watch() and visualize() can be in the independent interactive beam module since they are 
>>> concepts that are unique to being interactive
> 3. I'll add some example for the run_pipeline() in design doc. The short answer is run_pipeline() != p.run(). Thanks for sharing the doc (https://s.apache.org/no-beam-pipeline).
> As described in the doc, when constructing the pipeline, we still want to bundle a runner and options to the constructed pipeline even in the future. So if the runner is InteractiveRunner, the interactivity instrument (implicitly applied read/write cache PTransform and input/output wiring) is only applied when "run_pipeline()" of the runner implementation is invoked. p.run() will apply the instrument. However, this static function run_pipeline() takes in a new runner and options, invoking “run_pipeline()” implementation of the new runner and wouldn’t have the instrument, thus no interactivity.
> Because you cannot (don’t want to, as seen in the doc, users cannot access the bundled pipeline/options in the future) change the runner easily without re-executing all the notebook cells, this shorthand function allows a user to run pipeline without interactivity immediately anywhere in a notebook. In the meantime, the pipeline is still bundled with the original Interactive Runner. The users can keep developing further pipelines.
> The usage of this function is not intuitive until you put it in a notebook user scenario where users develop, test in prod-like env and develop further. And it’s equivalent to users writing "from_runner_api(to_runner_api(pipeline))” in their notebook. It’s just a shorthand.
> 
> What you're trying to work around here is the flaw in the existing API that a user binds the choice of Runner before pipeline construction, rather than at the point of execution. I propose we look at fixing this in Beam itself. 
Then I would propose not exposing this. If late runner binding is supported, we wouldn’t even need this. We can write it in an example notebook rather than exposing it. 

> 
> 4. And we both agree that implicit cache is palatable and should be the only thing we use to support interactivity. Cache and watched pipeline definition (which tells us what to cache) are the main “hidden state” I meant. Because the cache mechanism is totally implicit and hidden from the user. A cache is either read or written in a p.run(). If an existing cache is not used in a p.run(), it expires. If the user restarts the IPython kernel, all cache should expire too.
> 
> Depending on how we label items in the cache, they could survive kernel restarts as well. This relates to another useful feature in Beam where if a batch pipeline fails towards the end, one may want to resume/rerun it from there after fixing the bug without redoing all the work. 
I would suggest kernel restart resetting everything. The lifespan of the cache or any interactivity shouldn’t exceed the kernel session. And the lifespan of a PCollection cache shouldn’t even exceed 2 consecutive pipeline runs if the second run doesn’t use or produce it.
After all, resource of a running notebook is limited. We might even need cache eviction or full pipeline re-execution when memory or disk space is not enough.

>  
> 5. I think my explanation of focusing on Direct Runner is misleading.
>   5.1 My implementation of instrumenting pipeline with cache and wiring input/output is runner agnostic. I never access the underlying runner or options when instrumenting the pipeline. Underlying runner is only used when run_pipeline(). It's just I'm not appending PTransform (implicit read/write cache) and re-wiring input/output by directly modifying a portable proto. I’m doing it directly to a pipeline object (which is bundled with InteractiveRunner with some underlying runner) and AppliedPTransform nodes.
>   5.2 Caching is more like a way to optimize pipeline execution and materialize PCollection data for visualization. Re-evaluating the whole pipeline every time can also give users interactive experience.
>   5.3 So yes, I 
>>> don't think the "interactive experience should be tailored for different underlying runners.”
>    The caching mechanism / the magic that helps the interactivity instrumenting process might need different implementation for different underlying runners. Because the runner can be anywhere deployed in any architecture, the notebook is just a process on a machine. They need to work together.
>    Currently, we have the local file based cache. If we run a pipeline with underlying_runner as DataflowRunner, we’ll need something like GCS based cache. An in-memory cache might be runner agnostic, but it might explode with big data source.
> 
> Yep, we need filesystem/directory to use as a cache. We have an existing temp_location flag that we can use for this (and is required for distributed runners). If unset we can default to a local temp dir (which works for the direct runner). 
Yes, I think existing InteractiveRunner portability <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive#portability> has been able to support this. It’s just the user has to set the options up in their notebooks.
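
The fallback described above (use temp_location when set, otherwise a local temp dir) can be sketched as below. resolve_cache_dir() and the bucket path are hypothetical; the real option handling in Beam is not shown here.

```python
import os
import tempfile


def resolve_cache_dir(temp_location=None):
    """Return the configured location, or create a local temp dir as the default."""
    if temp_location:
        # Could be a distributed filesystem path; it is returned untouched.
        return temp_location
    # Works for a local, direct-runner style setup.
    return tempfile.mkdtemp(prefix="interactive-beam-cache-")


remote_dir = resolve_cache_dir("gs://example-bucket/tmp")  # hypothetical bucket
local_dir = resolve_cache_dir()
```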

>  
>    Existing InteractiveRunner has the following portability <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive#portability>. That’s why I said the interactivity (implementation) needs to be tailored for different underlying runners.
>    If we allow users to pass in all kinds of underlying runners (even their in-house ones), we have to support the interactivity for all of them which we probably don't. That’s why we wanted a create_pipeline() wrapper so that in notebook, when building a pipeline, bundle to DirectRunner by default.
>    The focus on the Direct Runner is also related to our objective: we want to provide easy-to-use notebook and some notebook environment where users can interactively execute pipelines without worrying about setup (especially when the setup is not Beam but Interactive Beam related).
> 6. We don’t fix typo for user defined transforms
>>> I'm talking about pruning like having a cell with
>>> 
>>>     pcoll | beam.Map(lambda x: expression_with_typo)
>>> 
>>> and then fixing it (and re-evaluating) with
>>> 
>>>     pcoll | beam.Map(lambda x: expression_with_typo_fixed)
>>> 
>>> where the former Map would *always* fail and never get removed from the pipeline. 
> We never change the pipeline defined by the user. Interactivity is applied to a copy of user defined pipeline.
> 
> Sure. But does the executed (copy) of the pipeline contain the bad Map operation in it? If so, it in essence "poisons" the entire pipeline, forcing a user to re-create and re-define it from the start to make forward progress (which results in quite a poor user experience--the errors in cell N manifest in cell M, but worse fixing and re-executing cell N doesn't fix cell M). If not, how is it intelligently excluded (and in a way that is not too dis-similar from non-interactive mode, and doesn't cause surprises with the p.run vs. run_pipeline difference)? 
>  
The copy is always one-shot. When p.run() is called, a copy is instrumented with additional PTransforms and some re-wiring. It’s executed by the underlying runner. And then it’s gone.
I think I know what you are saying here. It’s similar to re-executing a cell with anonymous PTransform. Should the user expect transforms in parallel (no pruning) or the original transforms replaced (with pruning)?
If the bad transform is in the middle of several statements in a cell, there is no rollback. The user could re-execute the cell but cannot replace the failed/bad PTransform or remove succeeded/good PTransforms that are duplicated.
Re-executing such cells after fixing the typo would “append” several PTransforms again, but those not needed anymore are still dangling branches in the pipeline.
It might not affect the correctness of the pipeline, but does change the shape of the pipeline from the user’s expectation if they expect a replacement.
It could be a great feature. +mnazir@google.com <ma...@google.com>. I would propose 
  1. Either not implementing it for now because the behavior is consistent with building pipeline in non-interactive mode, and such replacement might not always be valid (see next answer).
  2. Or if we decide that the interactive mode in this situation will have different behavior than non-interactive mode, we can implement it, try replacing the PTransforms (by appending new and pruning old) if the input/output wiring is compatible and print some warning (succeed or fail) saying it’s interactive mode.
I prefer 1. We should not try too hard to implicitly change the user’s pipeline: users familiar with Beam and IPython expect no pruning, users unfamiliar with them may expect pruning, and we cannot guarantee that pruning will happen when the new transform is not compatible with the one with the typo. A user could re-execute a cell 5 times, but we cannot tell whether the user wants the same transform applied 5 times to the same input (possibly useless, but valid) or has made a valid change and changed their mind 5 times. I still like the behavior where “executing a cell” is equivalent to “appending the code for execution” in the IPython session. What’s executed is executed. We shouldn’t take the metadata of source code changes in an executed cell into the pipeline construction process.
But I’m open to 2, taking the best-effort route and notifying users.


> 7. 
>>> One approach was that the pipeline construction is re-executed every time (i.e the "pipeline" object to run is really a callback, like a callable or a PTransform) and then there's no ambiguity here. 
> I didn’t quite get it. Pipeline construction only happens when a user executes a cell with the pipeline construction code.
> Are you suggesting changing the logic in pipeline.apply() to always reapply/replace a NamedPTransform? I don’t think we (Interactive Beam) can decide that because it changes the behavior of Beam.
> We had some thought of subclassing pipeline and use the create_pipeline() method to create the subclassed pipeline object. Then intercept the pipeline.apply() to always replace PTransform with existing full label and apply logic of parent pipeline’s apply() logic.
> It seems to be a no go to me now.
> 
> Sorry I wasn't clear. I'm referring to the style in the above doc about getting rid of the Pipeline object (as a long-lived thing at least). In this case the actual execution of pipeline construction never spans multiple cells (though its implementation might via function calls) so one never has out-of-date transforms dangling off the pipeline object. 
I think the desired experience is allowing users to define their pipelines across multiple cells or even multiple notebooks. Because the users can build pipeline, run, add more PTransforms and run. And the pipeline object is long lived like any other variables in notebook until it’s re-evaluated.
The scenario we both feel is not intuitive is “re-execute some previous cells with modified PTransforms and run”. It isn’t always valid to replace a PTransform in a fully grown pipeline (so I think the existing behavior, raising an error for a named PTransform and appending more for an anonymous PTransform, is still the best option). Re-executing cells is going to append more PTransforms. In that case, users have to re-execute all cells from the beginning to construct a new pipeline object. It’s like defining, in a notebook, a class Foo with a string field x and other fields lazily initialized from x. Then in a cell, the user sets x to an integer. Syntactically, the code still works (in Python). But when constructing the Foo object, lazily initialized fields cannot be evaluated.


>  
>>> This has the downside of recreating the PCollection objects which are being used as handles (though perhaps they could be re-identified).
> If a user re-executes a cell with PCollection = p | PTransform, the PCollection object will be a new instance. That is not a downside.
> We can keep the existing behavior of Beam to always raise an error when the cell with named PTransform is re-executed.
> 
> Thanks!
> 
> Ning.
> 
> 
> 
>> On Aug 23, 2019, at 11:36 AM, Robert Bradshaw <robertwb@google.com <ma...@google.com>> wrote:
>> 
>> On Wed, Aug 21, 2019 at 3:33 PM GMAIL <kawaigin@gmail.com <ma...@gmail.com>> wrote:
>> Thanks for the input, Robert!
>>> On Aug 21, 2019, at 11:49 AM, Robert Bradshaw <robertwb@google.com <ma...@google.com>> wrote:
>>> 
>>> On Wed, Aug 14, 2019 at 11:29 AM Ning Kang <ningk@google.com <ma...@google.com>> wrote:
>>> Ahmet, thanks for forwarding!
>>>  
>>> My main concern at this point is the introduction of new concepts, even though these are not changing other parts of the Beam SDKs. It would be good to see at least an alternative option covered in the design document. The reason is each additional concept adds to the mental load of users. And also concepts from interactive Beam will shift user's expectations of Beam even though there are not direct SDK modifications.
>>> 
>>> Hi Robert. About the concern, I think I have a few points:
>>> Interactive Beam (or Interactive Runner) is already an existing "new concept" that normal Beam user could opt-in if they want an interactive Beam experience. They need to do lots of setup steps and learn new things such as Jupyter notebook and at least interactive_runner module to make it work and make use of it.
>>> I think we should start with the perspective that most users interested in using Beam interactively already know about Jupyter notebooks, or at least ipython, and would want to use it to learn (and more effectively use) Beam. 
>> Yes, I agree with the perspective for users who are familiar with notebook. Yet it doesn’t prevent us from creating ready-to-use containers (such as binder <https://github.com/jupyterhub/binderhub>) for users who want to try Beam interactively without setting up an environment with all the dependencies interactive Beam introduces. I agree that experienced users understand how to set up additional dependencies and read examples, it’s just we are also targeting other entry level audiences.
>> But back to the original topic, the design is not trying to add new concept, but fixing some rough edges of existing Interactive Beam features. We can discuss whether a factory of create_pipeline() is really desired and decide whether to expose it later. We hope the interactive_beam module to be the only module an Interactive Beam user would directly invoke in their notebook.
>> 
>> My goal would be that one uses a special interactive module for those concepts that are unique to being interactive, and standard Beam concepts (rather than replacements or wrappers) otherwise. 
>>> The behavior of existing interactive Beam is different from normal Beam because of the interactive nature and the users would expect that. And the users wouldn't shift their expectation of normal Beam. Just like running Python scripts might result in different behavior than running all of them in an interactive Python session. 
>>> I'm not quite following this. One of the strengths of Python is the lack of difference between interactive and non-interactive behavior. (The fact that a script's execution is always in top to bottom order, unlike a notebook, is the primary difference.)
>> Sorry for the confusion. What I’m saying is about the hidden states. Running several Python scripts from top to bottom in an IPython session might generate different effects than running them in the same order normally. Say you have an in-memory global configuration that is shared among all the scripts and, if it’s missing, a script initializes one. Running the scripts in IPython will pass the initialization and modification of configuration along the scripts, while running the scripts one by one will initialize different configurations. Running cells in a notebook is equivalent to appending the cells into a script and running it. The interactivity is not about the order, but about whether there are hidden states preserved between each statement or each script execution. And the users should expect that there might be hidden states when they are in an interactive environment because that is exactly the interactivity they expect. However, they don’t hold the hidden states, the session does it for them. A user wouldn’t need to explicitly say “preserve the variable x I’ve defined in this cell because I want to reuse it in some other cells I’m going to execute”. The user can directly access variable x once the cell defining x is executed. And even if the user deletes the cell defining x, x still exists. At that stage, no one would know there is a variable x in memory by just looking at the notebook. One would see a missing execution sequence (on top left of each executed cell) and wonder where the piece of code executed goes.
>> Above is just for interactivity, not Beam.
>> 
>> You can have interactivity without hidden state (e.g. the IPython console), but notebooks are very common (and useful). IMHO, we should try to avoid leveraging hidden state when possible as it makes things difficult to reason about (even for those used to it). Things like implicit caches are more palatable because (generally, there are issues with side effects and non-determinism) the behavior remains the same (but performance, and consequently user experience, is augmented). 
>> 
>> The crux of interactive Beam seems instead to be that one can inspect the contents of PCollections after they have run, and apply further operations to these PCollections which, when executed, will (generally) not re-execute previously completed operations (i.e. incremental, interactive pipeline construction). These two properties can be leveraged in a script just as well as a notebook and have nothing to do with hidden state (in the notebook sense). 
>> 
>> Another interesting thing is that a named PTransform cannot be applied to the same pipeline more than once. It means a cell with named PTransform:  p | “Name” >> NamedPTransform() cannot be re-executed. We might want to support such re-execution if the pipeline is in an interactive mode. In that case, the Beam behavior might be different from non-interactive Beam. 
>> 
>>  Yes, this could be a critical difference. If we go down this route, an "InteractivePipeline" that differs from the standard Pipeline may make sense. More below. 
>>> Or if a user runs a Beam pipeline with the direct runner, they should expect the behavior to differ from running it on Dataflow, where a user needs a GCP account. I think users are aware of the difference when they choose to use Interactive Beam.
>>>  The central, defining tenet of Beam is that behavior should be consistent across different runners. Of course there are operational details that are difficult, or perhaps even undesirable, to align (like, as you mention, needing a GCP account for running on Dataflow, or providing the location of the master when running Flink). But even these should be minimized (see the recent efforts to make the temp location a standard rather than dataflow-specific option). 
>>> 
>>> We should, however, attempt to minimize gratuitous differences. In particular, we should make it as easy as possible to transition (in terms of code, docs, and developers) between interactive and non-interactive. 
>> Yes, I agree that runners should be consistent. The interactive runner we currently have is a wrapper of its underlying runner. Thus we don’t intend to brand it as a standalone runner. And that’s why we want to create an interactive_beam module for end users to avoid confusing them with a new runner called InteractiveRunner.
>> 
>> *This* seems to be the crux of the argument, that Interactive should not be a Runner, but something else. Exactly what that something else has not yet been well defined. 
>>  
>> And I want to focus on direct runner as underlying runner for now.
>> 
>> It's fine to only use the direct runner for now; it's certainly a lot easier. But can you confirm there's nothing about the Direct runner itself that you're depending on to implement this (or if there is something, clearly document it here). 
>>  
>> Because with the portability of Beam, when applying interactivity to original pipeline, we can always do: pipeline_with_runner_A -> portable proto -> pipeline_with_direct_runner -> apply interactivity and get a new pipeline -> new portable proto -> pipeline_with_runner_A with interactivity.
>> 
>> I think there's some conflation of concepts here (which are reflected in the rest of this conversation). 
>> 
>> In the Beam model, the Runner is not a property of a Pipeline. Rather a Pipeline is passed to a Runner for execution. (Along with some options.)
>> 
>> On the other hand, Python's Pipeline object currently holds an instance of a Runner (for historical reasons). This is something we'd like to remove (as we did in Java, or at least we removed the ability to reference and query the Pipeline's Runner). This mismatch in the model vs the API was one of the motivations for https://s.apache.org/no-beam-pipeline .
>> 
>> From an implementation perspective, I would strongly discourage having the intermediate representation be "pipeline_with_direct_runner" that's derived from a portable proto. The implementation details of that class are ill-defined, subject to change, and I hope will in the not-too-distant future be removed (or at least significantly cleaned up), and don't work well for things like cross-language pipelines. 
>>  
>> And the run_pipeline() feature is exactly the transition from interactive to non-interactive.
>> With p.run(), one has to rebuild/re-execute the pipeline p from the beginning, whereas they can do p -> portable proto -> pipeline_with_their_desired_runner_no_interactivity in a subsequent cell.
>> The run_pipeline() is a shorthand for the process. It’s about the same thing we want to decide for create_pipeline(): do we leave the boilerplate one-liner to the users and what would the level of the users be for their first time playing with Beam?
>> We never mutate the p defined by the user so p by itself is always non-interactive. Only when p.run() is invoked, interactivity is applied.
>> 
>> I am not following this at all. interactive_beam.run_pipeline(p) is not p.run()? But it is the latter that is interactive, not the former? Maybe some examples would help--there aren't any in the doc. 
>>  
>> 
>>> Our design actually reduces the mental load of interactive Beam users with intuitive interactive features: create pipeline, visualize intermediate PCollection, run pipeline at some point with other runners and etc. For example, right now, the user needs to use a more complicated set of libraries, like creating a Beam pipeline with interactive runner that needs an underlying runner fed in.  We are getting rid of it. An interactive Beam user shouldn't be concerned about the underlying interactive magic. 
>>> I agree a user shouldn't be concerned about the implementation details, but I fail to see how
>>> 
>>>     p = interactive_module.create_pipeline()
>>> 
>>> is significantly simpler than, or preferable to, 
>>> 
>>>    p = Pipeline(interactive_module.InteractiveRunner())
>>> 
>>> especially as the latter is in line with non-interactive pipelines (and all our examples, docs, etc.) Now, perhaps an argument could be made that interactivity is not a property of the runner, but something orthogonal to that, e.g. one should write
>>> 
>>>     p = InteractivePipeline()
>>> 
>>> or
>>> 
>>>     p = Pipeline(options, interactive=True)
>>> 
>>> or similar. (It was introduced as a runner because, conceptually, a runner is something that takes a pipeline and executes it.)
>> Yeah, this is debatable. I think the difference is, we want to introduce something that is Interactive Beam but not Interactive Runner.
>> 
>> Gotcha. 
>>  
>> I really wish that we have a 4runner standard thingy that can be implemented to intercept the pipeline for any runner before the execution.
>> 
>> Any runner that does delegation to another runner *is* a standard thingy that can intercept the pipeline for any runner before the execution. (One may want translation of results on the other end as well.) But it can be argued that this is not the most intuitive API. 
>> 
>> In that case, the wrapped interactive logic can be just a function that is applicable to existing runner in some interactive mode with “interactive=True”. But we don’t intend to introduce the new concept.
>>>> p = interactive_module.create_pipeline()
>> is saying the pipeline is created with interactivity and runnable in notebook. You can still convert the pipeline for other runners without interactivity.
>> There is no guarantee that the interactivity is provided by (direct) runner nor any of the underlying implementation is backward compatible.
>> No implementation details are exposed in the API. You can even think of it this way: using this library, one can create a pipeline with the direct runner but with interactivity as an additional feature.
>> I would favor the composition way.
>>>> p = Pipeline(interactive_module.InteractiveRunner())
>> is saying you can create a pipeline with this new runner thingy and we’ll maintain the runner forever like other runners. 
>> Additionally it can even take in an underlying runner. But does it make sense for a runner to take another runner? Is this applicable to all runners?
>> 
>> This is similar to Guava's filtering and transforming collection wrappers. 
>>  
>> The implementation details are exposed in the API itself, even though users don’t care about them and we don’t want to maintain them.
>>>> p = Pipeline(options, interactive=True)
>> is saying Pipeline can have an interactive mode. As discussed above, it’ll be a new concept attached to the concept “pipeline”. It’s basically saying any pipeline can have an interactive mode. That’s the inheritance way.
>> 
>> Yep. If you already know about Pipelines, it's easy to learn about interactive Pipelines. Perhaps more importantly, if you already know about Interactive Pipelines, you now know a bunch about non-interactive ones. 
>> 
>> An even more radical concept would be that an interactive Pipeline is just one that has any watched PCollections. (In other words, interactive vs. non-interactive is not a mode or a setting, just something that happens when you try to use things interactively.)
>>  
>> Does this apply to pipeline created by other runner?
>> 
>> Runners do not create pipelines. Runners execute pipelines. Any Runner should be able to execute any pipeline. 
>>  
>>> The interactive experience should be tailored for different underlying runners. There is no portability of interactivity and users opt-in interactive Beam using notebook would naturally expect something similar to the direct runner.
>>> This concerns me a lot. Again, the core tenet of Beam is that one can choose the execution environment (runner) completely independently of how one writes the pipeline. We should figure out what, if anything, needs to be supported by a runner to support interactivity (the only thing that comes to mind is a place to read and write temporary/cached data), but I very strongly feel we should not go down the road of having different interactive APIs/experiences for different runners. In particular, the many references to the DirectRunner are worrisome--what's special about the DirectRunner that other runners cannot provide that's needed for interactive? If we can't come up with a good answer to that, we should not impose this restriction. 
>> We don’t intend to provide different interactive experience for different runners.
>> 
>> Good, so to clarify, you don't think the "interactive experience should be tailored for different underlying runners."
>>  
>>> Interactive Beam is solving an orthogonal set of problems than Beam. You can think of it as a wrapper of Beam that enables interactivity and it's not even a real runner. It doesn't change the Beam model such as how you build a pipeline. And with the Beam portability, you get the capability to run the pipeline built from interactive runner with other runners for free. It adds the interactive behavior that a user expects.
>>> We want to open source it though we can iterate faster without doing it. The whole project can be encapsulated in a completely irrelevant repository and from a developer's perspective, I want to hide all the implementation details from the interactive Beam user. However, as there is more and more desire for interactive Beam (+Mehran Nazir <ma...@google.com> for more details), we want to share the implementation with others who want to contribute and explore the interactive world.
>>> I would much rather see interactivity as part of the Beam project. With good APIs the implementations don't have to be tightly coupled (e.g. the underlying runner delegation) but I think it will be a better user experience if interactive was a mode rather than a wrapper with different entry points. 
>>> 
>> I agree with it when the interactive mode is acknowledged as a must-have/common/standard for all runners or for Beam pipeline itself.
>> But this will be a new concept that is distributed top-down from Beam to its contributors.
>> Our current objective is to enable interactivity of Beam in notebook and introduce experimental usages/features (and we want to limit the features and make them easy to use).
>> I like the idea that anyone can write a runner or a transform. I would wish that anyone can add interactivity (before, during, or after pipeline execution) when it’s a mature concept.
>> Maybe in the future, interactivity will be something similar to accessibility, internationalization, testability, etc. (Like if you write a class, you need not only a test but also a notebook to demo it)
>> 
>> The discussions about how to behave with fundamental concepts such as applying transforms imply that this is something that can't just be thrown on top, but may require deeper changes to the API itself. 
>>  
>> But we are not doing it right now until we have more feedback from the community and concrete CUJs.
>> 
>> 
>>> 
>>> I think watch() is a really good solution to knowing which collections to cache, and visualize() will be very useful. 
>>> 
>>> One thing I don't see tackled at all yet is the fact that pipelines are only ever mutated by appending on new operations, so some design needs to be done in terms of how to remove (possibly error-causing) operations or replace bad ones with fixed ones. This is where most of the unsolved problems lie. 
>> Thanks! And we haven’t really decided if plotting plain data is a good idea for PCollections. Plotting the metadata/analytics/insight might be a better option for users. I’m looking into facets <https://pair-code.github.io/facets/> that TFX notebook uses.
>> 
>> Yes, there are many different ways one might want to "look at" a PCollection. 
>>  
>> Yes, existing PR <https://github.com/apache/beam/pull/9278> is appending operations only.
>> We would have pruning when the appended operations sever the whole pipeline DAG into a set of DAGs, so that only the part of the pipeline that needs to be re-executed is executed (optimization). This is included in the design.
>> 
>> I'm talking about pruning like having a cell with
>> 
>>     pcoll | beam.Map(lambda x: expression_with_typo)
>> 
>> and then fixing it (and re-evaluating) with
>> 
>>     pcoll | beam.Map(lambda x: expression_with_typo_fixed)
>> 
>> where the former Map would *always* fail and never get removed from the pipeline. 
>>  
>> We could also have the replacing logic that allows users to re-execute a cell with named PTransform by replacing the PTransform.
>> This needs your precious feedback and is a bug-fix of existing interactive beam rather than the design around user-defined Collection variables.
>> This is actually another debatable topic. Like I said, executing cells in notebook is equivalent to appending the code into a script.
>> Users can re-execute cells with only anonymous PTransforms. It’s equivalent to applying the PTransform many times "in parallel".
>> However, users cannot re-execute cells with any named PTransform. The inconsistency comes.
>> 
>> Yes, this is a *major* pain point I've seen every time people try to use Beam in a notebook. On the other hand, differing behaviors here (adding operations in parallel vs. replacement) could have surprising (non-local!) effects when moving from interactive to non-interactive mode. 
>>  
>> Should the re-execution of named PTransform be supported? When supported, should we replace PTransform or append PTransforms "in parallel"?
>> I think we’ll eventually pick a route, go through it and see what feedback the interactive Beam users provide.
>> 
>> One approach was that the pipeline construction is re-executed every time (i.e. the "pipeline" object to run is really a callback, like a callable or a PTransform) and then there's no ambiguity here. This has the downside of recreating the PCollection objects which are being used as handles (though perhaps they could be re-identified). 
>>  
>> 
>>> 
>>> Also +David Yan <ma...@google.com>  for more opinions.
>>> 
>>> Thanks!
>>> 
>>> Ning.
>>> 
>>> On Tue, Aug 13, 2019 at 6:00 PM Ahmet Altay <altay@google.com <ma...@google.com>> wrote:
>>> Ning, I believe Robert's questions from his email have not been answered yet.
>>> 
>>> On Tue, Aug 13, 2019 at 5:00 PM Ning Kang <ningk@google.com <ma...@google.com>> wrote:
>>> Hi all, I'll leave another 3 days for design <https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing> review. Then we can have a vote session if there is no objection.
>>> 
>>> Thanks!
>>> 
>>> On Fri, Aug 9, 2019 at 12:14 PM Ning Kang <ningk@google.com <ma...@google.com>> wrote:
>>> Thanks Ahmet for the introduction!
>>> 
>>> I've composed a design overview <https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing> describing changes we are making to components around interactive runner. I'll share the document in our email thread too.
>>> 
>>> The truth is since interactive runner is not yet a recognized runner as part of the Beam SDK (and it's fundamentally a wrapper around direct runner), we are not touching any Beam SDK components.
>>> We'll not change any behavior of existing Beam SDK and we'll try our best to keep it that way in the future.
>>> 
>>> My main concern at this point is the introduction of new concepts, even though these are not changing other parts of the Beam SDKs. It would be good to see at least an alternative option covered in the design document. The reason is each additional concept adds to the mental load of users. And also concepts from interactive Beam will shift user's expectations of Beam even though there are not direct SDK modifications.
>>>  
>>> 
>>> In the meantime, I'll work on other components orthogonal to Beam such as Pipeline Display and Data Visualization I mentioned in the design overview.
>>> 
>>> If you have any questions, please feel free to contact me through this email address!
>>> 
>>> Thanks!
>>> 
>>> Regards,
>>> Ning.
>>> 
>>> On Wed, Aug 7, 2019 at 5:01 PM Ahmet Altay <altay@google.com <ma...@google.com>> wrote:
>>> Ning, thank you for the heads up.
>>> 
>>> All, this is a proposed work for improving interactive Beam experience. As mentioned in Ning's email, new concepts are being introduced. And in addition iBeam as a name is used as a new reference. I hope that bringing the discussion to the mailing list will give it the additional visibility and more people could share their feedback.
>>> 
>>> (cc'ing a few folks that might be interested +Robert Bradshaw <ma...@google.com> +Valentyn Tymofieiev <ma...@google.com> +Sindy Li <ma...@google.com> +Brian Hulette <ma...@google.com> )
>>> 
>>> Ahmet
>>> 
>>> 
>>> On Wed, Aug 7, 2019 at 12:36 PM Ning Kang <ningk@google.com <ma...@google.com>> wrote:
>>> To whom it may concern,
>>> 
>>> This is Ning from Google. We are currently making efforts to leverage an interactive runner under python beam sdk.
>>> 
>>> There is already an interactive Beam (iBeam for short) runner with jupyter notebook in the repo <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive>.
>>> Following the instructions on that page, one can set up an interactive environment to develop and execute Beam pipeline interactively.
>>> 
>>> However, there are many issues with existing iBeam. One issue is that it uses a concept of leaf PCollection to cache and materialize intermediate PCollection. If the user wants to reuse/introspect a non-leaf PCollection, the interactive runner will run into errors.
>>> 
>>> Our initial effort will be fixing the existing issues. And we also want to make iBeam easy to use. Since iBeam uses the same model Beam uses, there isn't really any difference for users between creating a pipeline with interactive runner and other runners. 
>>> So we want to minimize the interfaces a user needs to learn while giving the user some capability to interact with the interactive environment.
>>> 
>>> See this initial PR <https://github.com/apache/beam/pull/9278>, the interactive_beam module will provide mainly 4 interfaces:
>>>    - For advanced users who define pipeline outside __main__, let them tell current interactive environment where they define their pipeline: watch()
>>>       - This is very useful for tests where pipeline can be defined in test methods.
>>>       - If the user simply creates pipeline in a Jupyter notebook or a plain Python script, they don't have to know/use this feature at all.
>>>    - Let users create an interactive pipeline: create_pipeline()
>>>       - invoking create_pipeline(), the user gets a Pipeline object that works as any other Pipeline object created from apache_beam.Pipeline()
>>>       - However, the pipeline object p, when invoking p.run(), does some extra interactive magic.
>>>       - We'll support interactive execution for DirectRunner at this moment.
>>>    - Let users run the interactive pipeline as a normal pipeline: run_pipeline()
>>>       - In an interactive environment, a user only needs to add and execute 1 line of code run_pipeline(pipeline) to execute any existing interactive pipeline object as normal pipeline in any selected platform.
>>>       - We'll probably support Dataflow only. Other implementations can be added though.
>>>    - Let users introspect any intermediate PCollection they have handler to: visualize()
>>>       - If a user ever writes pcoll = p | "Some Transform" >> some_transform() ..., they can visualize(pcoll) once the pipeline p is executed.
>>>       - p can be batch or streaming
>>>       - The visualization will be some plot graph of data for the given PCollection as if it's materialized. If the PCollection is unbounded, the graph is dynamic.
>>> 
>>> The PR will implement 1 and 2.
>>> 
>>> We'll use https://issues.apache.org/jira/browse/BEAM-7923 as the top level JIRA and add blocking JIRAs as development goes.
>>> 
>>> External Beam users will not worry about any of the underlying implementation details.
>>> Except the 4 interfaces above, they learn and write normal Beam code and can execute the pipeline immediately when they are done with prototyping.
>>> 
>>> Ning.
>

Re: Brief of interactive Beam

Posted by Robert Bradshaw <ro...@google.com>.

Cool, sounds like we're getting closer to the same page. Some more replies
below.

On Fri, Aug 23, 2019 at 1:47 PM Ning Kang <ni...@google.com> wrote:

> Thanks for the feedback, Robert! I think I got your idea.
> Let me summarize it to see if it’s correct:
> 1. You want everything about
>
> standard Beam concepts
>
>  to follow existing pattern: so we can shoot down create_pipeline() and
> keep the InteractiveRunner notion when constructing pipeline, I agree with
> it. A runner can delegate another runner, also agreed. Let’s keep it that
> way.
>

Despite everything I've written, I'm not convinced that exposing this as a
Runner is the most intuitive way to get interactivity either. Given that
the "magic" of interactivity is being able to watch PCollections (for
inspection and further construction), and if no PCollections are watched
execution proceeds as normal, what are your thoughts about making all
pipelines "interactive" and just doing the magic iff there are PCollections
to watch? (The opt-in incantation here would be ibeam.watch(globals()) or
similar.)
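
To make the idea above concrete, here is a minimal sketch in plain Python.
It does not use the real Beam API: FakePCollection, watch, is_interactive
and run are hypothetical stand-ins invented for illustration. The only point
shown is that "interactive" becomes a consequence of having watched
collections, rather than a mode, a runner, or a special pipeline type.

```python
# Toy model only: none of these names exist in apache_beam.

class FakePCollection:
    """Stand-in for a PCollection handle."""

    def __init__(self, label):
        self.label = label


_watched = {}  # variable name -> FakePCollection


def watch(namespace):
    """Record every FakePCollection bound to a name in `namespace`."""
    for name, value in namespace.items():
        if isinstance(value, FakePCollection):
            _watched[name] = value


def is_interactive():
    """A run is 'interactive' iff something is being watched."""
    return bool(_watched)


def run(pipeline_labels):
    """Return the labels whose output would be cached for inspection."""
    if not is_interactive():
        return []  # nothing watched: behave exactly like a normal run
    watched_labels = {pc.label for pc in _watched.values()}
    return [label for label in pipeline_labels if label in watched_labels]
```

In a notebook, the opt-in would then be a single call along the lines of
watch(globals()); without it, run() takes the non-interactive path.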

FWIW, Flume has something similar (called marking collections as to be
materialized). It has its pros and cons.


> 2. watch() and visualize() can be in the independent interactive beam
> module since they are
>
> concepts that are unique to being interactive
>
> 3. I'll add some example for the run_pipeline() in design doc. The short
> answer is run_pipeline() != p.run(). Thanks for sharing the doc (
> https://s.apache.org/no-beam-pipeline).
> As described in the doc, when constructing the pipeline, we still want to
> bundle a runner and options to the constructed pipeline even in the future.
> So if the runner is InteractiveRunner, the interactivity instrument
> (implicitly applied read/write cache PTransform and input/output wiring) is
> only applied when "run_pipeline()" of the runner implementation is invoked.
> p.run() will apply the instrument. However, this static function
> run_pipeline() takes in a new runner and options,
> invoking “run_pipeline()” implementation of the new runner and wouldn’t
> have the instrument, thus no interactivity.
> Because you cannot (don’t want to, as seen in the doc, users cannot access
> the bundled pipeline/options in the future) change the runner easily
> without re-executing all the notebook cells, this shorthand function allows
> a user to run pipeline without interactivity immediately anywhere in a
> notebook. In the meantime, the pipeline is still bundled with the original
> Interactive Runner. The users can keep developing further pipelines.
> The usage of this function is not intuitive until you put it in a notebook
> user scenario where users develop, test in prod-like env and develop
> further. And it’s equivalent to users writing
> "from_runner_api(to_runner_api(pipeline))” in their notebook. It’s just a
> shorthand.
>

What you're trying to work around here is the flaw in the existing API that
a user binds the choice of Runner before pipeline construction, rather than
at the point of execution. I propose we look at fixing this in Beam itself.
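
As a rough illustration of what binding the Runner at the point of execution
could look like, here is a toy model in plain Python. Pipeline, LocalRunner
and CachingRunner are hypothetical classes made up for this sketch, not Beam
classes. The pipeline holds only its steps; any runner can be handed the same
pipeline, and a delegating runner can add behavior (here, remembering the
result) without the pipeline knowing.

```python
# Toy model only: these classes are hypothetical and are not the Beam API.

class Pipeline:
    """Holds only the steps; it has no reference to any runner."""

    def __init__(self):
        self.steps = []

    def apply(self, fn):
        self.steps.append(fn)
        return self


class LocalRunner:
    name = "local"

    def run(self, pipeline, data):
        for fn in pipeline.steps:
            data = [fn(x) for x in data]
        return self.name, data


class CachingRunner(LocalRunner):
    """Delegates execution to LocalRunner and remembers the last result."""
    name = "caching"

    def __init__(self):
        self.cache = None

    def run(self, pipeline, data):
        _, result = super().run(pipeline, data)
        self.cache = result
        return self.name, result


# The same pipeline object can be given to either runner at execution time.
p = Pipeline().apply(lambda x: x + 1).apply(lambda x: x * 2)
```

With this shape there is nothing to work around: switching from the caching
runner to another one is just a different call site, with no need to rebuild
the pipeline or round-trip it through a proto.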

> 4. And we both agree that implicit cache is palatable and should be the
> only thing we use to support interactivity. Cache and watched pipeline
> definition (which tells us what to cache) are the main “hidden state” I
> meant. Because the cache mechanism is totally implicit and hidden from the
> user. A cache is either read or written in a p.run(). If an existing cache
> is not used in a p.run(), it expires. If the user restarts the IPython
> kernel, all cache should expire too.
>

Depending on how we label items in the cache, they could survive kernel
restarts as well. This relates to another useful feature in Beam where if a
batch pipeline fails towards the end, one may want to resume/rerun it from
there after fixing the bug without redoing all the work.
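
One way to picture cache labels that survive a kernel restart is to derive
the key from something stable across sessions instead of from an in-memory
object identity. The sketch below is only an assumption about how that could
look: cache_key is a hypothetical helper, and using the chain of transform
labels as the stable input is my own simplification (it would not notice a
change to the body of a transform, which a real scheme would have to handle).

```python
import hashlib


def cache_key(transform_labels):
    """Derive a cache key from the labels that produced a PCollection.

    Assumption: the label chain is the same in every session, unlike
    id(obj), which changes whenever the kernel restarts.
    """
    joined = "/".join(transform_labels)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]
```

Two sessions that build the same chain of labels compute the same key, so
the second session could find what the first one wrote.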


> 5. I think my explanation of focusing on Direct Runner is misleading.
>   5.1 My implementation of instrumenting pipeline with cache and wiring
> input/output is runner agnostic. I never access the underlying runner or
> options when instrumenting the pipeline. Underlying runner is only used
> when run_pipeline(). It's just I'm not appending PTransform (implicit
> read/write cache) and re-wiring input/output by directly modifying a
> portable proto. I’m doing it directly to a pipeline object (which is
> bundled with InteractiveRunner with some underlying runner) and
> AppliedPTransform nodes.
>   5.2 Caching is more like a way to optimize pipeline execution and
> materialize PCollection data for visualization. Re-evaluating the whole
> pipeline every time can also give users interactive experience.
>   5.3 So yes, I
>
> don't think the "interactive experience should be tailored for different
> underlying runners.”
>
>    The caching mechanism / the magic that helps the interactivity
> instrumenting process might need different implementation for different
> underlying runners. Because the runner can be anywhere deployed in any
> architecture, the notebook is just a process on a machine. They need to
> work together.
>    Currently, we have the local file based cache. If we run a pipeline
> with underlying_runner as DataflowRunner, we’ll need something like GCS
> based cache. An in-memory cache might be runner agnostic, but it might
> explode with big data source.
>

Yep, we need a filesystem/directory to use as a cache. We have an existing
temp_location flag that we can use for this (and which is required for
distributed runners). If unset, we can default to a local temp dir (which
works for the direct runner).
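
The fallback described above could be sketched roughly as follows.
resolve_cache_dir and the "interactive-cache" subdirectory name are
hypothetical, chosen only for illustration; the sketch takes the
temp_location value as a plain string and does not touch Beam's option
classes.

```python
import tempfile


def resolve_cache_dir(temp_location=None):
    """Pick where cached PCollection data would live.

    If a temp_location is given (local path or a URL such as gs://...),
    use a subdirectory of it; otherwise fall back to a fresh local
    temporary directory, which is enough for a local runner.
    """
    if temp_location:
        # Plain string handling so that URL-style locations stay intact.
        return temp_location.rstrip("/") + "/interactive-cache"
    return tempfile.mkdtemp(prefix="interactive-cache-")
```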


>    Existing InteractiveRunner has the following portability
> <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive#portability>.
> That’s why I said the interactivity (implementation) needs to be tailored
> for different underlying runners.
>    If we allow users to pass in all kinds of underlying runners (even
> their in-house ones), we have to support the interactivity for all of them
> which we probably don't. That’s why we wanted a create_pipeline() wrapper
> so that in notebook, when building a pipeline, bundle to DirectRunner by
> default.
>    The focus on the Direct Runner is also related to our objective: we
> want to provide easy-to-use notebook and some notebook environment where
> users can interactively execute pipelines without worrying about setup
> (especially when the setup is not Beam but Interactive Beam related).
> 6. We don’t fix typo for user defined transforms
>
> I'm talking about pruning like having a cell with
>
>     pcoll | beam.Map(lambda x: expression_with_typo)
>
> and then fixing it (and re-evaluating) with
>
>     pcoll | beam.Map(lambda x: expression_with_typo_fixed)
>
> where the former Map would *always* fail and never get removed from the
> pipeline.
>
> We never change the pipeline defined by the user. Interactivity is applied
> to a copy of user defined pipeline.
>

Sure. But does the executed copy of the pipeline contain the bad Map
operation in it? If so, it in essence "poisons" the entire pipeline,
forcing a user to re-create and re-define it from the start to make forward
progress (which results in quite a poor user experience--the errors in cell
N manifest in cell M, but worse, fixing and re-executing cell N doesn't fix
cell M). If not, how is it intelligently excluded (and in a way that is not
too dissimilar from non-interactive mode, and doesn't cause surprises with
the p.run vs. run_pipeline difference)?
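
The "poisoning" can be reproduced with a toy append-only pipeline in plain
Python. AppendOnlyPipeline is a made-up class, not Beam; it only mimics the
property that applying a transform adds a branch and nothing ever removes
one.

```python
# Toy model only: an append-only "pipeline" in plain Python, not the Beam API.

class AppendOnlyPipeline:
    def __init__(self, source):
        self.source = list(source)
        self.branches = []  # every applied function stays here forever

    def apply(self, fn):
        self.branches.append(fn)

    def run(self):
        # Each branch consumes the same source, like transforms applied
        # "in parallel" to one PCollection.
        return [[fn(x) for x in self.source] for fn in self.branches]


p = AppendOnlyPipeline([1, 2, 3])
p.apply(lambda x: x / 0)   # cell N, first attempt: contains a bug
p.apply(lambda x: x / 2)   # cell N, re-executed with the fix

try:
    p.run()
    outcome = "ok"
except ZeroDivisionError:
    outcome = "still failing"  # the stale branch is still in the pipeline
```

Re-running the fixed cell adds a second branch but leaves the first in
place, so every later run keeps failing until the pipeline is rebuilt.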


> 7.
>
> One approach was that the pipeline construction is re-executed every time
> (i.e. the "pipeline" object to run is really a callback, like a callable or
> a PTransform) and then there's no ambiguity here.
>
> I didn’t quite get it. Pipeline construction only happens when a user
> executes a cell with the pipeline construction code.
> Are you suggesting changing the logic in pipeline.apply() to always
> reapply/replace a NamedPTransform? I don’t think we (Interactive Beam) can
> decide that because it changes the behavior of Beam.
> We had some thought of subclassing pipeline and use the create_pipeline()
> method to create the subclassed pipeline object. Then intercept the
> pipeline.apply() to always replace PTransform with existing full label and
> apply logic of parent pipeline’s apply() logic.
> It seems to be a no go to me now.
>

Sorry I wasn't clear. I'm referring to the style in the above doc about
getting rid of the Pipeline object (as a long-lived thing at least). In
this case the actual execution of pipeline construction never spans
multiple cells (though its implementation might via function calls) so one
never has out-of-date transforms dangling off the pipeline object.
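
A toy version of that style, again in plain Python with hypothetical names
(run, build_v1, build_v2) rather than anything from Beam: the thing handed
to run is a function that constructs the steps, and construction is redone
from scratch on every run, so a corrected definition simply replaces the
old one.

```python
# Toy model only: hypothetical names, not the Beam API.

def run(construct, source):
    """Rebuild the step list from scratch, then execute it."""
    steps = []
    construct(steps.append)      # construction happens here, on every run
    data = list(source)
    for fn in steps:
        data = [fn(x) for x in data]
    return data


def build_v1(apply):
    apply(lambda x: x + 1)
    apply(lambda x: x / 0)       # bug


def build_v2(apply):
    apply(lambda x: x + 1)
    apply(lambda x: x / 2)       # fixed


# Only the construction function passed in contributes steps to this run;
# nothing from build_v1 lingers anywhere.
result = run(build_v2, [1, 3])
```

The cost, as noted in the thread, is that handles to intermediate
collections are recreated on each run and would need to be re-identified.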


> This has the downside of recreating the PCollection objects which are
> being used as handles (though perhaps they could be re-identified).
>
> If a user re-executes a cell with PCollection = p | PTransform, the
> PCollection object will be a new instance. That is not a downside.
> We can keep the existing behavior of Beam to always raise an error when
> the cell with named PTransform is re-executed.
>
> Thanks!
>
> Ning.
>
>
>
> On Aug 23, 2019, at 11:36 AM, Robert Bradshaw <ro...@google.com> wrote:
>
> On Wed, Aug 21, 2019 at 3:33 PM GMAIL <ka...@gmail.com> wrote:
>
>> Thanks for the input, Robert!
>>
>> On Aug 21, 2019, at 11:49 AM, Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>> On Wed, Aug 14, 2019 at 11:29 AM Ning Kang <ni...@google.com> wrote:
>>
>>> Ahmet, thanks for forwarding!
>>>
>>>
>>>> My main concern at this point is the introduction of new concepts, even
>>>> though these are not changing other parts of the Beam SDKs. It would be
>>>> good to see at least an alternative option covered in the design document.
>>>> The reason is each additional concept adds to the mental load of users. And
>>>> also concepts from interactive Beam will shift user's expectations of Beam
>>>> even though there are not direct SDK modifications.
>>>
>>>
>>> Hi Robert. About the concern, I think I have a few points:
>>>
>>>    1. *Interactive Beam (or Interactive Runner) is already an existing
>>>    "new concept" that normal Beam user could opt-in if they want an
>>>    interactive Beam experience.* They need to do lots of setup steps
>>>    and learn new things such as Jupyter notebook and at least
>>>    interactive_runner module to make it work and make use of it.
>>>
>>> I think we should start with the perspective that most users interested
>> in using Beam interactively already know about Jupyter notebooks, or at
>> least ipython, and would want to use it to learn (and more effectively use)
>> Beam.
>>
>> Yes, I agree with the perspective for users who are familiar with
>> notebook. Yet it doesn’t prevent us from creating ready-to-use containers
>> (such as binder <https://github.com/jupyterhub/binderhub>)  for users
>> who want to try Beam interactively without setting up an environment with
>> all the dependencies interactive Beam introduces. I agree that experienced
>> users understand how to set up additional dependencies and read examples,
>> it’s just we are also targeting other entry level audiences.
>> But back to the original topic, the design is not trying to add new
>> concept, but fixing some rough edges of existing Interactive Beam features.
>> We can discuss whether a factory of create_pipeline() is really desired and
>> decide whether to expose it later. We hope the interactive_beam module to
>> be the only module an Interactive Beam user would directly invoke in their
>> notebook.
>>
>
> My goal would be that one uses a special interactive module for those
> concepts that are unique to being interactive, and standard Beam concepts
> (rather than replacements or wrappers) otherwise.
>
>>
>>>    1. *The behavior of existing interactive Beam is different from
>>>    normal Beam because of the interactive nature and the users would expect
>>>    that.* And the users wouldn't shift their expectation of normal
>>>    Beam. Just like running Python scripts might result in different behavior
>>>    than running all of them in an interactive Python session.
>>>
>>> I'm not quite following this. One of the strengths of Python
>> is the lack of difference between the interactive vs. non-interactive
>> behavior. (The fact that a script's execution is always in top to bottom
>> order, unlike a notebook, is the primary difference.)
>>
>> Sorry for the confusion. What I’m saying is about the hidden states.
>> Running several Python scripts from top to bottom in an IPython session
>> might generate different effects than running them in the same order
>> normally. Say if you have an in-memory global configuration that is shared
>> among all the scripts and if it’s missing, a script initializes one.
>> Running the scripts in IPython will pass the initialization and
>> modification of configuration along the scripts. While running the scripts
>> one by one will initialize different configurations. Running cells in a
>> notebook is equivalent to appending the cells into a script and run it. The
>> interactivity is not about the order, but if there is hidden states
>> preserved between each statement or each script execution. And the users
>> should expect that there might be hidden states when they are in an
>> interactive environment because that is exactly the interactivity they
>> expect. However, they don’t hold the hidden states, the session does it for
>> them. A user wouldn’t need to explicitly say “preserve the variable x I’ve
>> defined in this cell because I want to reuse it in some other cells I’m
>> going to execute”. The user can directly access variable x once the cell
>> defining x is executed. And even if the user deletes the cell defining x, x
>> still exists. At that stage, no one would know there is a variable x in
>> memory by just looking at the notebook. One would see a missing execution
>> sequence (on top left of each executed cell) and wonder where the piece of
>> code executed goes.
>> Above is just for interactivity, not Beam.
>>
>
> You can have interactive without hidden state (e.g. the iPython console)
> notebooks are very common (and useful). IMHO, we should try to avoid
> leveraging hidden state when possible as it makes things difficult to
> reason about (even for those used to it). Things like implicit caches are
> more palatable because (generally, there's issues with side effects and
> non-determinism) the behavior remains the same (but performance, and
> consequently user experience, is augmented).
>
> The crux of interactive Beam seems instead to be that one can inspect the
> contents of PCollections after they have run, and apply further operations
> to these PCollections which, when executed, will (generally) not re-execute
> previously completed operations (i.e. incremental, interactive pipeline
> construction). These two properties can be leveraged in a script just as
> well as a notebook and have nothing to do with hidden state (in the
> notebook sense).
>
> Another interesting thing is that a named PTransform cannot be applied to
>> the same pipeline more than once. It means a cell with named PTransform:  p
>> | “Name” >> NamedPTransform() cannot be re-executed. We might want to
>> support such re-execution if the pipeline is in an interactive mode. In
>> that case, the Beam behavior might be different from non-interactive Beam.
>>
>
>  Yes, this could be a critical difference, which makes me think that an
> "InteractivePipeline" which differs from the standard Pipeline may make
> sense if we go down this route. More below.
>
>>
>>>    1. Or if a user runs a Beam pipeline with direct runner, they should
>>>    expect the behavior be different from running it on Dataflow while a user
>>>    needs GCP account. I think the users are aware of the difference when they
>>>    choose to use Interactive Beam.
>>>
>>>  The central, defining tenet of Beam is that behavior should be
>> consistent across different runners. Of course there are operational
>> details that are difficult, or perhaps even undesirable, to align (like, as
>> you mention, needing a GCP account for running on Dataflow, or providing
>> the location of the master when running Flink). But even these should be
>> minimized (see the recent efforts to make the temp location a standard
>> rather than dataflow-specific option).
>>
>> We should, however, attempt to minimize gratuitous differences. In
>> particular, we should make it as easy as possible to transition (in terms
>> of code, docs, and developers) between interactive and non-interactive.
>>
>> Yes, I agree that runners should be consistent. The interactive runner we
>> currently have is a wrapper of its underlying runner. Thus we don’t intend
>> to brand it as a standalone runner. And that’s why we want to create an
>> interactive_beam module for end users to avoid confusing them with a new
>> runner called InteractiveRunner.
>>
>
> *This* seems to be the crux of the argument, that Interactive should not
> be a Runner, but something else. Exactly what that something else is has
> not yet been well defined.
>
>
>> And I want to focus on direct runner as underlying runner for now.
>>
>
> It's fine to only use the direct runner for now; it's certainly a lot
> easier. But can you confirm there's nothing about the Direct runner itself
> that you're depending on to implement this (or if there is something,
> clearly document it here).
>
>
>> Because with the portability of Beam, when applying interactivity to
>> original pipeline, we can always do: pipeline_with_runner_A -> portable
>> proto -> pipeline_with_direct_runner -> apply interactivity and get a new
>> pipeline -> new portable proto -> pipeline_with_runner_A with interactivity.
>>
>
> I think there's some conflation of concepts here (which are reflected in
> the rest of this conversation).
>
> In the Beam model, the Runner is not a property of a Pipeline. Rather a
> Pipeline is passed to a Runner for execution. (Along with some options.)
>
> On the other hand, Python's Pipeline object currently holds an instance of
> a Runner (for historical reasons). This is something we'd like to remove
> (as we did in Java, or at least we removed the ability to reference and
> query the Pipeline's Runner). This mismatch in the model vs the API was one
> of the motivations for https://s.apache.org/no-beam-pipeline .
>
> From an implementation perspective, I would strongly discourage having the
> intermediate representation be "pipeline_with_direct_runner" that's derived
> from a portable proto. The implementation details of that class are
> ill-defined, subject to change, and I hope will in the not-too-distant future
> be removed (or at least significantly cleaned up), and don't work well for
> things like cross-language pipelines.
>
>
>> And the run_pipeline() feature is exactly the transition from interactive
>> to non-interactive.
>> With p.run(), one has to re-build/re-execute the pipeline p from the
>> beginning. While they can do p -> portable proto ->
>> pipeline_with_their_desired_runner_no_interactivity in a subsequent cell.
>> The run_pipeline() is a shorthand for the process. It’s about the same
>> thing we want to decide for create_pipeline(): do we leave the boilerplate
>> one-liner to the users and what would the level of the users be for their
>> first time playing with Beam?
>> We never mutate the p defined by the user so p by itself is always
>> non-interactive. Only when p.run() is invoked, interactivity is applied.
>>
>
> I am not following this at all. interactive_beam.run_pipeline(p) is not
> p.run()? But it is the latter that is interactive, not the former? Maybe
> some examples would help--there aren't any in the doc.
>
>
>>
>>
>>>    1. *Our design actually reduces the mental load of interactive Beam
>>>    users with intuitive interactive features*: create pipeline,
>>>    visualize intermediate PCollection, run pipeline at some point with other
>>>    runners and etc. For example, right now, the user needs to use a more
>>>    complicated set of libraries, like creating a Beam pipeline with
>>>    interactive runner that needs an underlying runner fed in.  We are getting
>>>    rid of it. An interactive Beam user shouldn't be concerned about the
>>>    underlying interactive magic.
>>>
>>> I agree a user shouldn't be concerned about the implementation details,
>> but I fail to see how
>>
>>     p = interactive_module.create_pipeline()
>>
>> is significantly simpler than, or preferable to,
>>
>>    p = Pipeline(interactive_module.InteractiveRunner())
>>
>> especially as the latter is in line with non-interactive pipelines (and
>> all our examples, docs, etc.) Now, perhaps an argument could be made that
>> interactivity is not a property of the runner, but something orthogonal to
>> that, e.g. one should write
>>
>>     p = InteractivePipeline()
>>
>> or
>>
>>     p = Pipeline(options, interactive=True)
>>
>> or similar. (It was introduced as a runner because, conceptually, a
>> runner is something that takes a pipeline and executes it.)
>>
>> Yeah, this is debatable. I think the difference is, we want to introduce
>> something that is Interactive Beam but not Interactive Runner.
>>
>
> Gotcha.
>
>
>> I really wish that we have a 4runner standard thingy that can be
>> implemented to intercept the pipeline for any runner before the execution.
>>
>
> Any runner that does delegation to another runner *is* a standard thingy
> that can intercept the pipeline for any runner before the execution. (One
> may want translation of results on the other end as well.) But it can be
> argued that this is not the most intuitive API.
>
> In that case, the wrapped interactive logic can be just a function that is
>> applicable to existing runner in some interactive mode with
>> “interactive=True”. But we don’t intend to introduce the new concept.
>>
>> p = interactive_module.create_pipeline()
>>
>> is saying the pipeline is created with interactivity and runnable in
>> notebook. You can still convert the pipeline for other runners without
>> interactivity.
>> There is no guarantee that the interactivity is provided by (direct)
>> runner nor any of the underlying implementation is backward compatible.
>> There is no implementation details exposed in the API. You can even treat
>> it as using this library, one can create a pipeline with direct runner but
>> with interactivity as additional feature.
>> I would favor the composition way.
>>
>> p = Pipeline(interactive_module.InteractiveRunner())
>>
>> is saying you can create a pipeline with this new runner thingy and we’ll
>> maintain the runner forever like other runners.
>> Additionally it can even take in an underlying runner. But does it make
>> sense for a runner to take another runner? Is this applicable to all
>> runners?
>>
>
> This is similar to Guava's filtering and transforming collection wrappers.
>
>
>> The implementation details is exposed in the API itself while users don’t
>> care about and we don’t want to maintain.
>>
>> p = Pipeline(options, interactive=True)
>>
>> is saying Pipeline can have an interactive mode. Like discussed above,
>> it’ll be a new concept for the concept “pipeline”. It’s basically saying
>> anything pipeline can have an interactive mode. That’s the inheritance way.
>>
>
> Yep. If you already know about Pipelines, it's easy to learn about
> interactive Pipelines. Perhaps more importantly, if you already know about
> Interactive Pipelines, you now know a bunch about non-interactive ones.
>
> An even more radical concept would be that an interactive Pipeline is just
> one that has any watched PCollections. (In other words, interactive vs.
> non-interactive is not a mode or a setting, just something that happens
> when you try to use things interactively.)
>
>
>> Does this apply to pipeline created by other runner?
>>
>
> Runners do not create pipelines. Runners execute pipelines. Any Runner
> should be able to execute any pipeline.
>
>
>>
>>>    1. The interactive experience should be tailored for different
>>>    underlying runners. There is no portability of interactivity and users
>>>    opt-in interactive Beam using notebook would naturally expect something
>>>    similar to the direct runner.
>>>
>>> This concerns me a lot. Again, the core tenet of Beam is that one can
>> choose the execution environment (runner) completely independently with how
>> one writes the pipeline. We should figure out what, if anything, needs to
>> be supported by a runner to support interactivity (the only thing that
>> comes to mind is a place to read and write temporary/cached data), but I
>> very strongly feel we should not go down the road of having different
>> interactive apis/experiences for different runners. In particular, the many
>> instances to the DirectRunner are worrisome--what's special about the
>> DirectRunner that other runners cannot provide that's needed for
>> interactive? If we can't come up with a good answer to that, we should not
>> impose this restriction.
>>
>> We don’t intend to provide different interactive experience for different
>> runners.
>>
>
> Good, so to clarify, you don't think the "interactive experience should be
> tailored for different underlying runners."
>
>
>>
>>>    1. *Interactive Beam is solving an orthogonal set of problems than
>>>    Beam*. You can think of it as a wrapper of Beam that enables
>>>    interactivity and it's not even a real runner. It doesn't change the Beam
>>>    model such as how you build a pipeline. And with the Beam portability, you
>>>    get the capability to run the pipeline built from interactive runner with
>>>    other runners for free. It adds the interactive behavior that a user
>>>    expects.
>>>    2. *We want to open source it though we can iterate faster without
>>>    doing it*. The whole project can be encapsulated in a completely
>>>    irrelevant repository and from a developer's perspective, I want to hide
>>>    all the implementation details from the interactive Beam user. However, as
>>>    there is more and more desire for interactive Beam (+Mehran Nazir
>>>    <mn...@google.com> for more details), we want to share the
>>>    implementation with others who want to contribute and explore the
>>>    interactive world.
>>>
>>> I would much rather see interactivity as part of the Beam project. With
>> good APIs the implementations don't have to be tightly coupled (e.g. the
>> underlying runner delegation) but I think it will be a better user
>> experience if interactive was a mode rather than a wrapper with different
>> entry points.
>>
>> I agree with it when the interactive mode is acknowledged as a
>> must-have/common/standard for all runners or for Beam pipeline itself.
>> But this will be a new concept that is distributed top-down from Beam to
>> its contributors.
>> Our current objective is to enable interactivity of Beam in notebook and
>> introduce experimental usages/features (and we want to limit the features
>> and make them easy to use).
>> I like the idea that anyone can write a runner or a transform. I would
>> wish that anyone can add interactivity (before, during, or after pipeline
>> execution) when it’s a mature concept.
>> Maybe in the future, interactivity will be something that is similar to
>> accessibility, internationalization, testability and etc. (Like if you
>> write a class, you need not only a test but also a notebook to demo it)
>>
>
> The discussions about how to behave with such fundamental concepts such as
> applying transforms imply that this is something that can't just be thrown
> on top, but may require deeper changes to the API itself.
>
>
>> But we are not doing it right now until we have more feedback from the
>> community and concrete CUJs.
>>
>>
>>
>> I think watch() is a really good solution to knowing which collections to
>> cache, and visualize() will be very useful.
>>
>> One thing I don't see tackled at all yet is the fact that pipelines are
>> only ever mutated by appending on new operations, so some design needs to
>> be done in terms of how to remove (possibly error-causing) operations or
>> replace bad ones with fixed ones. This is where most of the unsolved
>> problems lie.
>>
>> Thanks! And we haven’t really decided if plotting plain data is a good
>> idea for PCollections. Plotting the metadata/analytics/insight might be a
>> better option for users. I’m looking into facets
>> <https://pair-code.github.io/facets/> that TFX notebook uses.
>>
>
> Yes, there are many different ways one might want to "look at" a
> PCollection.
>
>
>> Yes, existing PR <https://github.com/apache/beam/pull/9278> is appending
>> operations only.
>> We would have pruning when the appended operations sever the whole
>> pipeline DAG into a set of DAGs to only execute the part of pipeline that
>> needs to be re-executed (optimization). This is included in the design.
>>
>
> I'm talking about pruning like having a cell with
>
>     pcoll | beam.Map(lambda x: expression_with_typo)
>
> and then fixing it (and re-evaluating) with
>
>     pcoll | beam.Map(lambda x: expression_with_typo_fixed)
>
> where the former Map would *always* fail and never get removed from the
> pipeline.
>
>
>> We could also have the replacing logic that allows users to re-execute a
>> cell with named PTransform by replacing the PTransform.
>> This needs your precious feedback and is a bug-fix of existing
>> interactive beam rather than the design around user-defined Collection
>> variables.
>> This is actually another debatable topic. Like I said, executing cells in
>> notebook is equivalent to appending the code into a script.
>> Users can re-execute cells with only anonymous PTransforms. It’s
>> equivalent to applying the PTransform many times "in parallel".
>> However, users cannot re-execute cells with any named PTransform. The
>> inconsistency comes.
>>
>
> Yes, this is a *major* pain point I've seen every time people try to use
> Beam in a notebook. On the other hand, differing behaviors here (adding
> operations in parallel vs. replacement) could have surprising (non-local!)
> effects when moving from interactive to non-interactive mode.
>
>
>> Should the re-execution of named PTransform be supported? When supported,
>> should we replace PTransform or append PTransforms "in parallel"?
>> I think we’ll eventually pick a route, go through it and see what the
>> interactive Beam users provide as feedback.
>>
>
> One approach was that the pipeline construction is re-executed every time
> (i.e the "pipeline" object to run is really a callback, like a callable or
> a PTransform) and then there's no ambiguity here. This has the downsides of
> recreating the PCollection objects which are being used as handles (though
> perhaps they could be re-identified).
>
>
>>
>>
>> Also +David Yan <da...@google.com>  for more opinions.
>>>
>>> Thanks!
>>>
>>> Ning.
>>>
>>> On Tue, Aug 13, 2019 at 6:00 PM Ahmet Altay <al...@google.com> wrote:
>>>
>>>> Ning, I believe Robert's questions from his email has not been answered
>>>> yet.
>>>>
>>>> On Tue, Aug 13, 2019 at 5:00 PM Ning Kang <ni...@google.com> wrote:
>>>>
>>>>> Hi all, I'll leave another 3 days for design
>>>>> <https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing> review.
>>>>> Then we can have a vote session if there is no objection.
>>>>> Thanks!
>>>>>
>>>>> On Fri, Aug 9, 2019 at 12:14 PM Ning Kang <ni...@google.com> wrote:
>>>>>
>>>>>> Thanks Ahmet for the introduction!
>>>>>>
>>>>>> I've composed a design overview
>>>>>> <https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing>
>>>>>>  describing changes we are making to components around interactive
>>>>>> runner. I'll share the document in our email thread too.
>>>>>>
>>>>>> The truth is since interactive runner is not yet a recognized runner
>>>>>> as part of the Beam SDK (and it's fundamentally a wrapper around direct
>>>>>> runner), we are not touching any Beam SDK components.
>>>>>> We'll not change any behavior of existing Beam SDK and we'll try our
>>>>>> best to keep it that way in the future.
>>>>>>
>>>>>
>>>> My main concern at this point is the introduction of new concepts, even
>>>> though these are not changing other parts of the Beam SDKs. It would be
>>>> good to see at least an alternative option covered in the design document.
>>>> The reason is each additional concept adds to the mental load of users. And
>>>> also concepts from interactive Beam will shift user's expectations of Beam
>>>> even though there are not direct SDK modifications.
>>>>
>>>>
>>>>>
>>>>>> In the meantime, I'll work on other components orthogonal to Beam
>>>>>> such as Pipeline Display and Data Visualization I mentioned in the design
>>>>>> overview.
>>>>>>
>>>>>> If you have any questions, please feel free to contact me through
>>>>>> this email address!
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Regards,
>>>>>> Ning.
>>>>>>
>>>>>> On Wed, Aug 7, 2019 at 5:01 PM Ahmet Altay <al...@google.com> wrote:
>>>>>>
>>>>>>> Ning, thank you for the heads up.
>>>>>>>
>>>>>>> All, this is a proposed work for improving interactive Beam
>>>>>>> experience. As mentioned in Ning's email, new concepts are being
>>>>>>> introduced. And in addition iBeam as a name is used as a new reference. I
>>>>>>> hope that bringing the discussion to the mailing list will give it the
>>>>>>> additional visibility and more people could share their feedback.
>>>>>>>
>>>>>>> (cc'ing a few folks that might be interested +Robert Bradshaw
>>>>>>> <ro...@google.com> +Valentyn Tymofieiev <va...@google.com> +Sindy
>>>>>>> Li <qi...@google.com> +Brian Hulette <bh...@google.com> )
>>>>>>>
>>>>>>> Ahmet
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 7, 2019 at 12:36 PM Ning Kang <ni...@google.com> wrote:
>>>>>>>
>>>>>>>> To whom it may concern,
>>>>>>>>
>>>>>>>> This is Ning from Google. We are currently making efforts to
>>>>>>>> leverage an interactive runner under python beam sdk.
>>>>>>>>
>>>>>>>> There is already an interactive Beam (iBeam for short) runner with
>>>>>>>> jupyter notebook in the repo
>>>>>>>> <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive>
>>>>>>>> .
>>>>>>>> Following the instructions on that page, one can set up an
>>>>>>>> interactive environment to develop and execute Beam pipeline interactively.
>>>>>>>>
>>>>>>>> However, there are many issues with existing iBeam. One issue is
>>>>>>>> that it uses a concept of leaf PCollection to cache and materialize
>>>>>>>> intermediate PCollection. If the user wants to reuse/introspect a non-leaf
>>>>>>>> PCollection, the interactive runner will run into errors.
>>>>>>>>
>>>>>>>> Our initial effort will be fixing the existing issues. And we also
>>>>>>>> want to make iBeam easy to use. Since iBeam uses the same model Beam uses,
>>>>>>>> there isn't really any difference for users between creating a pipeline
>>>>>>>> with interactive runner and other runners.
>>>>>>>> So we want to minimize the interfaces a user needs to learn while
>>>>>>>> giving the user some capability to interact with the interactive
>>>>>>>> environment.
>>>>>>>>
>>>>>>>> See this initial PR <https://github.com/apache/beam/pull/9278>,
>>>>>>>> the interactive_beam module will provide mainly 4 interfaces:
>>>>>>>>
>>>>>>>>    - For advanced users who define pipeline outside __main__, let
>>>>>>>>    them tell current interactive environment where they define their pipeline:
>>>>>>>>    watch()
>>>>>>>>       - This is very useful for tests where pipeline can be
>>>>>>>>       defined in test methods.
>>>>>>>>       - If the user simply creates pipeline in a Jupyter notebook
>>>>>>>>       or a plain Python script, they don't have to know/use this feature at all.
>>>>>>>>    - Let users create an interactive pipeline: create_pipeline()
>>>>>>>>       - invoking create_pipeline(), the user gets a Pipeline
>>>>>>>>       object that works as any other Pipeline object created from
>>>>>>>>       apache_beam.Pipeline()
>>>>>>>>       - However, the pipeline object p, when invoking p.run(),
>>>>>>>>       does some extra interactive magic.
>>>>>>>>       - We'll support interactive execution for DirectRunner at
>>>>>>>>       this moment.
>>>>>>>>    - Let users run the interactive pipeline as a normal pipeline:
>>>>>>>>    run_pipeline()
>>>>>>>>       - In an interactive environment, a user only needs to add
>>>>>>>>       and execute 1 line of code run_pipeline(pipeline) to execute any existing
>>>>>>>>       interactive pipeline object as normal pipeline in any selected platform.
>>>>>>>>       - We'll probably support Dataflow only. Other
>>>>>>>>       implementations can be added though.
>>>>>>>>    - Let users introspect any intermediate PCollection they have
>>>>>>>>    handler to: visualize()
>>>>>>>>       - If a user ever writes pcoll = p | "Some Transform" >>
>>>>>>>>       some_transform() ..., they can visualize(pcoll) once the pipeline p is
>>>>>>>>       executed.
>>>>>>>>       - p can be batch or streaming
>>>>>>>>       - The visualization will be some plot graph of data for the
>>>>>>>>       given PCollection as if it's materialized. If the PCollection is unbounded,
>>>>>>>>       the graph is dynamic.
>>>>>>>>
>>>>>>>> The PR will implement 1 and 2.
>>>>>>>>
>>>>>>>> We'll use https://issues.apache.org/jira/browse/BEAM-7923 as the
>>>>>>>> top level JIRA and add blocking JIRAs as development goes.
>>>>>>>>
>>>>>>>> External Beam users will not worry about any of the underlying
>>>>>>>> implementation details.
>>>>>>>> Except the 4 interfaces above, they learn and write normal Beam
>>>>>>>> code and can execute the pipeline immediately when they are done with
>>>>>>>> prototyping.
>>>>>>>>
>>>>>>>> Ning.
>>>>>>>>
>>>>>>>
>

Re: Brief of interactive Beam

Posted by Ning Kang <ni...@google.com>.

Thanks for the feedback, Robert! I think I got your idea. 
Let me summarize it to see if it’s correct: 
1. You want everything about 
> 
>> standard Beam concepts
 to follow the existing pattern, so we can shoot down create_pipeline() and keep the InteractiveRunner notion when constructing a pipeline. I agree with that. A runner can delegate to another runner; also agreed. Let’s keep it that way.
2. watch() and visualize() can be in the independent interactive beam module since they are 
>> concepts that are unique to being interactive
3. I'll add some examples of run_pipeline() to the design doc. The short answer is run_pipeline() != p.run(). Thanks for sharing the doc (https://s.apache.org/no-beam-pipeline).
As described in the doc, when constructing the pipeline, we still want to bundle a runner and options with the constructed pipeline, even in the future. So if the runner is InteractiveRunner, the interactivity instrumentation (implicitly applied read/write cache PTransforms and input/output wiring) is only applied when the runner's "run_pipeline()" implementation is invoked. p.run() will apply the instrumentation. However, the static function run_pipeline() takes in a new runner and options and invokes the new runner's “run_pipeline()” implementation, which doesn't apply the instrumentation; thus there is no interactivity.
Because you cannot (or don’t want to; as seen in the doc, users cannot access the bundled pipeline/options in the future) change the runner easily without re-executing all the notebook cells, this shorthand function allows a user to run the pipeline without interactivity immediately, anywhere in a notebook. In the meantime, the pipeline is still bundled with the original Interactive Runner, so users can keep developing the pipeline further.
The usage of this function is not intuitive until you put it in a notebook user scenario where users develop, test in a prod-like env, and develop further. It’s equivalent to users writing "from_runner_api(to_runner_api(pipeline))” in their notebook. It’s just a shorthand.
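
A minimal sketch of the difference between p.run() and run_pipeline(), in plain Python rather than the Beam API. All class names here are hypothetical toys, copy.deepcopy() stands in for the from_runner_api(to_runner_api(pipeline)) round trip, and "WriteCache" is only a placeholder for the implicit cache transforms.

```python
import copy


class ToyPipeline:
    """Toy pipeline bundled with a runner at construction time."""

    def __init__(self, runner):
        self.runner = runner
        self.transforms = []

    def apply(self, label):
        self.transforms.append(label)
        return self

    def run(self):
        return self.runner.run_pipeline(self)


class ToyDirectRunner:
    def run_pipeline(self, pipeline):
        # Just report which transforms would be executed.
        return list(pipeline.transforms)


class ToyInteractiveRunner:
    """Delegates to an underlying runner after instrumenting a copy."""

    def __init__(self, underlying_runner):
        self.underlying_runner = underlying_runner

    def run_pipeline(self, pipeline):
        # Instrument a copy, so the user-defined pipeline is never mutated.
        instrumented = copy.deepcopy(pipeline)
        instrumented.apply("WriteCache")  # placeholder for implicit caching
        return self.underlying_runner.run_pipeline(instrumented)


def run_pipeline(pipeline, runner):
    """Shorthand: run a copy with another runner, without instrumentation."""
    return runner.run_pipeline(copy.deepcopy(pipeline))


p = ToyPipeline(ToyInteractiveRunner(ToyDirectRunner()))
p.apply("Read").apply("Map")

interactive_result = p.run()                       # instrumented execution
plain_result = run_pipeline(p, ToyDirectRunner())  # no interactivity
```

After both calls, p still holds only the transforms the user applied and is still bundled with the interactive runner, so development can continue in later cells.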
4. And we both agree that an implicit cache is palatable and should be the only thing we use to support interactivity. The cache and the watched pipeline definition (which tells us what to cache) are the main “hidden state” I meant, because the cache mechanism is totally implicit and hidden from the user. A cache is either read or written in a p.run(). If an existing cache is not used in a p.run(), it expires. If the user restarts the IPython kernel, all caches should expire too.
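
The expiry rule can be sketched as follows. This is an illustrative toy, not the cache implementation in the PR: ToyCacheManager and its methods are hypothetical, and the cache lives in memory only, so it would also disappear on a kernel restart.

```python
class ToyCacheManager:
    """Toy cache keyed by PCollection label; entries unused in a run expire."""

    def __init__(self):
        self._entries = {}
        self._touched = set()

    def read(self, key):
        self._touched.add(key)
        return self._entries[key]

    def write(self, key, value):
        self._touched.add(key)
        self._entries[key] = value

    def end_run(self):
        # Expire everything that was neither read nor written during this run.
        for key in list(self._entries):
            if key not in self._touched:
                del self._entries[key]
        self._touched.clear()

    def keys(self):
        return sorted(self._entries)


cache = ToyCacheManager()

# Run 1 materializes two PCollections.
cache.write("words", ["a", "b"])
cache.write("counts", [("a", 1), ("b", 1)])
cache.end_run()
after_run_1 = cache.keys()

# Run 2 only reads "words", so "counts" expires at the end of the run.
words = cache.read("words")
cache.end_run()
after_run_2 = cache.keys()
```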
5. I think my explanation of focusing on Direct Runner was misleading.
  5.1 My implementation of instrumenting the pipeline with cache and wiring input/output is runner agnostic. I never access the underlying runner or options when instrumenting the pipeline. The underlying runner is only used in run_pipeline(). It's just that I'm not appending PTransforms (implicit read/write cache) and re-wiring input/output by directly modifying a portable proto; I’m doing it directly on a pipeline object (which is bundled with an InteractiveRunner with some underlying runner) and its AppliedPTransform nodes.
  5.2 Caching is more like a way to optimize pipeline execution and materialize PCollection data for visualization. Re-evaluating the whole pipeline every time can also give users an interactive experience.
  5.3 So yes, I 
>> don't think the "interactive experience should be tailored for different underlying runners.”
   The caching mechanism / the magic that helps the interactivity instrumenting process might need different implementations for different underlying runners. The runner can be deployed anywhere, in any architecture, while the notebook is just a process on a machine. They need to work together.
   Currently, we have the local file based cache. If we run a pipeline with underlying_runner as DataflowRunner, we’ll need something like GCS based cache. An in-memory cache might be runner agnostic, but it might explode with big data source.
   Existing InteractiveRunner has the following portability <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive#portability>. That’s why I said the interactivity (implementation) needs to be tailored for different underlying runners.
   If we allow users to pass in all kinds of underlying runners (even their in-house ones), we have to support the interactivity for all of them, which we probably can't. That’s why we wanted a create_pipeline() wrapper, so that a pipeline built in a notebook is bundled with DirectRunner by default.
   The focus on the Direct Runner is also related to our objective: we want to provide easy-to-use notebook and some notebook environment where users can interactively execute pipelines without worrying about setup (especially when the setup is not Beam but Interactive Beam related).
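
A minimal sketch of the point about cache implementations, in plain Python rather than Beam: the interactive experience stays the same, but where the cache lives depends on where the underlying runner executes. CacheBackend, InMemoryCache, LocalFileCache and cache_for_runner() are hypothetical names for illustration only; a GCS-based cache is deliberately left unimplemented.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod


class CacheBackend(ABC):
    """Hypothetical interface: materialize and re-read a PCollection by label."""

    @abstractmethod
    def write(self, label, elements):
        ...

    @abstractmethod
    def read(self, label):
        ...


class InMemoryCache(CacheBackend):
    """Works wherever the notebook process runs, but may not fit big data."""

    def __init__(self):
        self._data = {}

    def write(self, label, elements):
        self._data[label] = list(elements)

    def read(self, label):
        return list(self._data[label])


class LocalFileCache(CacheBackend):
    """Only works when runner and notebook share a file system (simple labels only)."""

    def __init__(self, directory=None):
        self._dir = directory or tempfile.mkdtemp(prefix="interactive-cache-")

    def _path(self, label):
        return os.path.join(self._dir, label + ".json")

    def write(self, label, elements):
        with open(self._path(label), "w") as f:
            json.dump(list(elements), f)

    def read(self, label):
        with open(self._path(label)) as f:
            return json.load(f)


def cache_for_runner(runner_name):
    """Pick a backend by underlying runner name (an assumption of this sketch)."""
    if runner_name == "DirectRunner":
        return LocalFileCache()
    # A remote runner such as DataflowRunner would need something like a
    # GCS-based cache, which this sketch does not implement.
    raise NotImplementedError("no cache backend for %s" % runner_name)
```

With this shape, supporting another underlying runner means adding a backend, not changing what the user sees in the notebook.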
6. We don’t fix typos for user defined transforms
>> I'm talking about pruning like having a cell with
>> 
>>     pcoll | beam.Map(lambda x: expression_with_typo)
>> 
>> and then fixing it (and re-evaluating) with
>> 
>>     pcoll | beam.Map(lambda x: expression_with_typo_fixed)
>> 
>> where the former Map would *always* fail and never get removed from the pipeline. 
We never change the pipeline defined by the user. Interactivity is applied to a copy of user defined pipeline.
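
To make "applied to a copy" concrete, here is a toy sketch in plain Python. The pipeline is modeled as a list of step dicts and instrument() is a hypothetical helper; the real instrumentation works on a Beam pipeline object, which this does not attempt to reproduce.

```python
import copy


def instrument(user_steps, watched):
    """Return an instrumented copy of the steps; user_steps is left untouched."""
    steps = copy.deepcopy(user_steps)
    instrumented = []
    for step in steps:
        instrumented.append(step)
        if step["output"] in watched:
            # Implicit cache write appended after a watched output.
            instrumented.append({
                "name": "WriteCache[%s]" % step["output"],
                "input": step["output"],
                "output": None,
            })
    return instrumented


user_pipeline = [
    {"name": "Read", "input": None, "output": "lines"},
    {"name": "Split", "input": "lines", "output": "words"},
]
snapshot = copy.deepcopy(user_pipeline)
to_run = instrument(user_pipeline, watched={"words"})
```

Only to_run carries the extra cache step; user_pipeline still equals snapshot afterwards.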
7. 
>> One approach was that the pipeline construction is re-executed every time (i.e the "pipeline" object to run is really a callback, like a callable or a PTransform) and then there's no ambiguity here. 
I didn’t quite get it. Pipeline construction only happens when a user executes a cell with the pipeline construction code.
Are you suggesting changing the logic in pipeline.apply() to always reapply/replace a NamedPTransform? I don’t think we (Interactive Beam) can decide that because it changes the behavior of Beam.
We had some thought of subclassing Pipeline and using the create_pipeline() method to create the subclassed pipeline object, then intercepting pipeline.apply() to always replace a PTransform that has an existing full label before applying the parent pipeline’s apply() logic.
It seems to be a no go to me now.
>> This has the downsides of recreating the PCollection objects which are being used as handles (though perhaps they could be re-identified).
If a user re-executes a cell with PCollection = p | PTransform, the PCollection object will be a new instance. That is not a downside.
We can keep the existing behavior of Beam, which always raises an error when a cell with a named PTransform is re-executed.
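
The re-execution behavior discussed in point 7 can be sketched with a toy model in plain Python. ToyPipeline is hypothetical and only mimics the label-uniqueness rule as described in this thread: a named transform can be applied once, while each anonymous application is assumed to get a fresh label. The error type and label scheme are assumptions of the sketch, not a statement about Beam internals.

```python
import itertools


class ToyPipeline:
    """Hypothetical stand-in for a pipeline that enforces unique labels."""

    def __init__(self):
        self.applied = []  # labels in application order
        self._anon = itertools.count(1)

    def apply(self, transform_name, label=None):
        if label is None:
            # Assumption: every anonymous application gets a fresh label.
            label = "%s_%d" % (transform_name, next(self._anon))
        if label in self.applied:
            raise RuntimeError("A transform with label %r already exists" % label)
        self.applied.append(label)
        return label


p = ToyPipeline()


def anonymous_cell():
    return p.apply("Map")


def named_cell():
    return p.apply("Map", label="Name")


# Re-running the anonymous cell appends a second transform "in parallel".
anonymous_cell()
anonymous_cell()

# Re-running the named cell fails on the second execution.
named_cell()
try:
    named_cell()
    rerun_failed = False
except RuntimeError:
    rerun_failed = True
```

Keeping the existing behavior means the second named_cell() call keeps failing; the replace-on-reapply alternative would instead overwrite the "Name" entry.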

Thanks!

Ning.



> On Aug 23, 2019, at 11:36 AM, Robert Bradshaw <ro...@google.com> wrote:
> 
> On Wed, Aug 21, 2019 at 3:33 PM GMAIL <kawaigin@gmail.com> wrote:
> Thanks for the input, Robert!
>> On Aug 21, 2019, at 11:49 AM, Robert Bradshaw <robertwb@google.com> wrote:
>> 
>> On Wed, Aug 14, 2019 at 11:29 AM Ning Kang <ningk@google.com> wrote:
>> Ahmet, thanks for forwarding!
>>  
>> My main concern at this point is the introduction of new concepts, even though these are not changing other parts of the Beam SDKs. It would be good to see at least an alternative option covered in the design document. The reason is each additional concept adds to the mental load of users. And also concepts from interactive Beam will shift user's expectations of Beam even though there are not direct SDK modifications.
>> 
>> Hi Robert. About the concern, I think I have a few points:
>> Interactive Beam (or Interactive Runner) is already an existing "new concept" that normal Beam user could opt-in if they want an interactive Beam experience. They need to do lots of setup steps and learn new things such as Jupyter notebook and at least interactive_runner module to make it work and make use of it.
>> I think we should start with the perspective that most users interested in using Beam interactively already know about Jupyter notebooks, or at least ipython, and would want to use it to learn (and more effectively use) Beam. 
> Yes, I agree with the perspective for users who are familiar with notebooks. Yet it doesn’t prevent us from creating ready-to-use containers (such as binder <https://github.com/jupyterhub/binderhub>) for users who want to try Beam interactively without setting up an environment with all the dependencies interactive Beam introduces. I agree that experienced users understand how to set up additional dependencies and read examples; it’s just that we are also targeting other entry-level audiences.
> But back to the original topic, the design is not trying to add new concept, but fixing some rough edges of existing Interactive Beam features. We can discuss whether a factory of create_pipeline() is really desired and decide whether to expose it later. We hope the interactive_beam module to be the only module an Interactive Beam user would directly invoke in their notebook.
> 
> My goal would be that one uses a special interactive module for those concepts that are unique to being interactive, and standard Beam concepts (rather than replacements or wrappers) otherwise. 
>> The behavior of existing interactive Beam is different from normal Beam because of the interactive nature and the users would expect that. And the users wouldn't shift their expectation of normal Beam. Just like running Python scripts might result in different behavior than running all of them in an interactive Python session. 
>> I'm not quite following this. One of the strengths of Python is the lack of difference between interactive and non-interactive behavior. (The fact that a script's execution is always in top to bottom order, unlike a notebook, is the primary difference.)
> Sorry for the confusion. What I’m talking about is hidden state. Running several Python scripts from top to bottom in an IPython session might have different effects than running them in the same order normally. Say you have an in-memory global configuration that is shared among all the scripts, and if it’s missing, a script initializes one. Running the scripts in IPython will pass the initialization and modification of the configuration along the scripts, while running the scripts one by one will initialize different configurations. Running cells in a notebook is equivalent to appending the cells into a script and running it. The interactivity is not about the order, but about whether hidden state is preserved between each statement or each script execution. And the users should expect that there might be hidden state when they are in an interactive environment, because that is exactly the interactivity they expect. However, they don’t hold the hidden state; the session does it for them. A user wouldn’t need to explicitly say “preserve the variable x I’ve defined in this cell because I want to reuse it in some other cells I’m going to execute”. The user can directly access variable x once the cell defining x is executed. And even if the user deletes the cell defining x, x still exists. At that stage, no one would know there is a variable x in memory by just looking at the notebook. One would see a missing execution sequence number (on the top left of each executed cell) and wonder where the executed piece of code went.
> Above is just for interactivity, not Beam.
> 
> You can have interactivity without hidden state (e.g. the IPython console), though notebooks are very common (and useful). IMHO, we should try to avoid leveraging hidden state when possible as it makes things difficult to reason about (even for those used to it). Things like implicit caches are more palatable because (generally, there are issues with side effects and non-determinism) the behavior remains the same (but performance, and consequently user experience, is augmented).
> 
> The crux of interactive Beam seems instead to be that one can inspect the contents of PCollections after they have run, and apply further operations to these PCollections which, when executed, will (generally) not re-execute previously completed operations (i.e. incremental, interactive pipeline construction). These two properties can be leveraged in a script just as well as a notebook and have nothing to do with hidden state (in the notebook sense). 
> 
> Another interesting thing is that a named PTransform cannot be applied to the same pipeline more than once. It means a cell with named PTransform:  p | “Name” >> NamedPTransform() cannot be re-executed. We might want to support such re-execution if the pipeline is in an interactive mode. In that case, the Beam behavior might be different from non-interactive Beam. 
> 
> Yes, this could be a critical difference. If we go down this route, an "InteractivePipeline" that differs from the standard Pipeline may make sense. More below.
>> Or if a user runs a Beam pipeline with direct runner, they should expect the behavior be different from running it on Dataflow while a user needs GCP account. I think the users are aware of the difference when they choose to use Interactive Beam.
>> The central, defining tenet of Beam is that behavior should be consistent across different runners. Of course there are operational details that are difficult, or perhaps even undesirable, to align (like, as you mention, needing a GCP account for running on Dataflow, or providing the location of the master when running Flink). But even these should be minimized (see the recent efforts to make the temp location a standard rather than dataflow-specific option).
>> 
>> We should, however, attempt to minimize gratuitous differences. In particular, we should make it as easy as possible to transition (in terms of code, docs, and developers) between interactive and non-interactive. 
> Yes, I agree that runners should be consistent. The interactive runner we currently have is a wrapper of its underlying runner. Thus we don’t intend to brand it as a standalone runner. And that’s why we want to create an interactive_beam module for end users to avoid confusing them with a new runner called InteractiveRunner.
> 
> *This* seems to be the crux of the argument, that Interactive should not be a Runner, but something else. Exactly what that something else is has not yet been well defined.
>  
> And I want to focus on direct runner as underlying runner for now.
> 
> It's fine to only use the direct runner for now; it's certainly a lot easier. But can you confirm there's nothing about the Direct runner itself that you're depending on to implement this (or if there is something, clearly document it here). 
>  
> Because with the portability of Beam, when applying interactivity to original pipeline, we can always do: pipeline_with_runner_A -> portable proto -> pipeline_with_direct_runner -> apply interactivity and get a new pipeline -> new portable proto -> pipeline_with_runner_A with interactivity.
> 
> I think there's some conflation of concepts here (which are reflected in the rest of this conversation). 
> 
> In the Beam model, the Runner is not a property of a Pipeline. Rather a Pipeline is passed to a Runner for execution. (Along with some options.)
> 
> On the other hand, Python's Pipeline object currently holds an instance of a Runner (for historical reasons). This is something we'd like to remove (as we did in Java, or at least we removed the ability to reference and query the Pipeline's Runner). This mismatch in the model vs the API was one of the motivations for https://s.apache.org/no-beam-pipeline .
> 
> From an implementation perspective, I would strongly discourage having the intermediate representation be "pipeline_with_direct_runner" that's derived from a portable proto. The implementation details of that class are ill-defined, subject to change, and I hope will in the not-too-distant future be removed (or at least significantly cleaned up), and don't work well for things like cross-language pipelines.
>  
> And the run_pipeline() feature is exactly the transition from interactive to non-interactive.
> With p.run(), one has to rebuild/re-execute the pipeline p from the beginning, whereas they can do p -> portable proto -> pipeline_with_their_desired_runner_no_interactivity in a subsequent cell.
> The run_pipeline() is a shorthand for the process. It’s about the same thing we want to decide for create_pipeline(): do we leave the boilerplate one-liner to the users and what would the level of the users be for their first time playing with Beam?
> We never mutate the p defined by the user so p by itself is always non-interactive. Only when p.run() is invoked, interactivity is applied.
> 
> I am not following this at all. interactive_beam.run_pipeline(p) is not p.run()? But it is the latter that is interactive, not the former? Maybe some examples would help--there aren't any in the doc. 
>  
> 
>> Our design actually reduces the mental load of interactive Beam users with intuitive interactive features: create pipeline, visualize intermediate PCollection, run pipeline at some point with other runners and etc. For example, right now, the user needs to use a more complicated set of libraries, like creating a Beam pipeline with interactive runner that needs an underlying runner fed in.  We are getting rid of it. An interactive Beam user shouldn't be concerned about the underlying interactive magic. 
>> I agree a user shouldn't be concerned about the implementation details, but I fail to see how
>> 
>>     p = interactive_module.create_pipeline()
>> 
>> is significantly simpler than, or preferable to, 
>> 
>>    p = Pipeline(interactive_module.InteractiveRunner())
>> 
>> especially as the latter is in line with non-interactive pipelines (and all our examples, docs, etc.) Now, perhaps an argument could be made that interactivity is not a property of the runner, but something orthogonal to that, e.g. one should write
>> 
>>     p = InteractivePipeline()
>> 
>> or
>> 
>>     p = Pipeline(options, interactive=True)
>> 
>> or similar. (It was introduced as a runner because, conceptually, a runner is something that takes a pipeline and executes it.)
> Yeah, this is debatable. I think the difference is, we want to introduce something that is Interactive Beam but not Interactive Runner.
> 
> Gotcha. 
>  
> I really wish that we have a 4runner standard thingy that can be implemented to intercept the pipeline for any runner before the execution.
> 
> Any runner that does delegation to another runner *is* a standard thingy that can intercept the pipeline for any runner before the execution. (One may want translation of results on the other end as well.) But it can be argued that this is not the most intuitive API. 
> 
> In that case, the wrapped interactive logic can be just a function that is applicable to existing runner in some interactive mode with “interactive=True”. But we don’t intend to introduce the new concept.
>>> p = interactive_module.create_pipeline()
> is saying the pipeline is created with interactivity and runnable in notebook. You can still convert the pipeline for other runners without interactivity.
> There is no guarantee that the interactivity is provided by the (direct) runner, nor that any of the underlying implementation is backward compatible.
> There are no implementation details exposed in the API. You can even think of it this way: using this library, one can create a pipeline with the direct runner, with interactivity as an additional feature.
> I would favor the composition way.
>>> p = Pipeline(interactive_module.InteractiveRunner())
> is saying you can create a pipeline with this new runner thingy and we’ll maintain the runner forever like other runners. 
> Additionally it can even take in an underlying runner. But does it make sense for a runner to take another runner? Is this applicable to all runners?
> 
> This is similar to Guava's filtering and transforming collection wrappers. 
>  
> The implementation details are exposed in the API itself, even though users don’t care about them and we don’t want to maintain them.
>>> p = Pipeline(options, interactive=True)
> is saying Pipeline can have an interactive mode. Like discussed above, it’ll be a new concept for the concept “pipeline”. It’s basically saying any pipeline can have an interactive mode. That’s the inheritance way.
> 
> Yep. If you already know about Pipelines, it's easy to learn about interactive Pipelines. Perhaps more importantly, if you already know about Interactive Pipelines, you now know a bunch about non-interactive ones. 
> 
> An even more radical concept would be that an interactive Pipeline is just one that has any watched PCollections. (In other words, interactive vs. non-interactive is not a mode or a setting, just something that happens when you try to use things interactively.)
>  
> Does this apply to pipeline created by other runner?
> 
> Runners do not create pipelines. Runners execute pipelines. Any Runner should be able to execute any pipeline.
>  
>> The interactive experience should be tailored for different underlying runners. There is no portability of interactivity and users opt-in interactive Beam using notebook would naturally expect something similar to the direct runner.
>> This concerns me a lot. Again, the core tenet of Beam is that one can choose the execution environment (runner) completely independently of how one writes the pipeline. We should figure out what, if anything, needs to be supported by a runner to support interactivity (the only thing that comes to mind is a place to read and write temporary/cached data), but I very strongly feel we should not go down the road of having different interactive apis/experiences for different runners. In particular, the many instances to the DirectRunner are worrisome--what's special about the DirectRunner that other runners cannot provide that's needed for interactive? If we can't come up with a good answer to that, we should not impose this restriction.
> We don’t intend to provide different interactive experience for different runners.
> 
> Good, so to clarify, you don't think the "interactive experience should be tailored for different underlying runners."
>  
>> Interactive Beam is solving an orthogonal set of problems than Beam. You can think of it as a wrapper of Beam that enables interactivity and it's not even a real runner. It doesn't change the Beam model such as how you build a pipeline. And with the Beam portability, you get the capability to run the pipeline built from interactive runner with other runners for free. It adds the interactive behavior that a user expects.
>> We want to open source it though we can iterate faster without doing it. The whole project can be encapsulated in a completely irrelevant repository and from a developer's perspective, I want to hide all the implementation details from the interactive Beam user. However, as there is more and more desire for interactive Beam (+Mehran Nazir <ma...@google.com> for more details), we want to share the implementation with others who want to contribute and explore the interactive world.
>> I would much rather see interactivity as part of the Beam project. With good APIs the implementations don't have to be tightly coupled (e.g. the underlying runner delegation) but I think it will be a better user experience if interactive was a mode rather than a wrapper with different entry points. 
>> 
> I agree with it when the interactive mode is acknowledged as a must-have/common/standard for all runners or for Beam pipeline itself.
> But this will be a new concept that is distributed top-down from Beam to its contributors.
> Our current objective is to enable interactivity of Beam in notebook and introduce experimental usages/features (and we want to limit the features and make them easy to use).
> I like the idea that anyone can write a runner or a transform. I would wish that anyone can add interactivity (before, during, or after pipeline execution) when it’s a mature concept.
> Maybe in the future, interactivity will be something that is similar to accessibility, internationalization, testability and etc. (Like if you write a class, you need not only a test but also a notebook to demo it)
> 
> The discussions about how to behave with such fundamental concepts such as applying transforms imply that this is something that can't just be thrown on top, but may require deeper changes to the API itself. 
>  
> But we are not doing it right now until we have more feedback from the community and concrete CUJs.
> 
> 
>> 
>> I think watch() is a really good solution to knowing which collections to cache, and visualize() will be very useful. 
>> 
>> One thing I don't see tackled at all yet is the fact that pipelines are only ever mutated by appending on new operations, so some design needs to be done in terms of how to remove (possibly error-causing) operations or replace bad ones with fixed ones. This is where most of the unsolved problems lie. 
> Thanks! And we haven’t really decided if plotting plain data is a good idea for PCollections. Plotting the metadata/analytics/insight might be a better option for users. I’m looking into facets <https://pair-code.github.io/facets/> that TFX notebook uses.
> 
> Yes, there are many different ways one might want to "look at" a PCollection. 
>  
> Yes, existing PR <https://github.com/apache/beam/pull/9278> is appending operations only.
> We would have pruning when the appended operations sever the whole pipeline DAG into a set of DAGs, so that we only execute the part of the pipeline that needs to be re-executed (optimization). This is included in the design.
> 
> I'm talking about pruning like having a cell with
> 
>     pcoll | beam.Map(lambda x: expression_with_typo)
> 
> and then fixing it (and re-evaluating) with
> 
>     pcoll | beam.Map(lambda x: expression_with_typo_fixed)
> 
> where the former Map would *always* fail and never get removed from the pipeline. 
>  
> We could also have the replacing logic that allows users to re-execute a cell with named PTransform by replacing the PTransform.
> This needs your precious feedback and is a bug-fix of existing interactive beam rather than the design around user-defined Collection variables.
> This is actually another debatable topic. Like I said, executing cells in notebook is equivalent to appending the code into a script.
> Users can re-execute cells with only anonymous PTransforms. It’s equivalent to applying the PTransform many times "in parallel".
> However, users cannot re-execute cells with any named PTransform. That is where the inconsistency comes from.
> 
> Yes, this is a *major* pain point I've seen every time people try to use Beam in a notebook. On the other hand, differing behaviors here (adding operations in parallel vs. replacement) could have surprising (non-local!) effects when moving from interactive to non-interactive mode. 
>  
> Should the re-execution of named PTransform be supported? When supported, should we replace PTransform or append PTransforms "in parallel"?
> I think we’ll eventually pick a route, go through it and see what the interactive Beam users provide as feedback.
> 
> One approach was that the pipeline construction is re-executed every time (i.e the "pipeline" object to run is really a callback, like a callable or a PTransform) and then there's no ambiguity here. This has the downsides of recreating the PCollection objects which are being used as handles (though perhaps they could be re-identified).
>  
> 
>> 
>> Also +David Yan <ma...@google.com>  for more opinions.
>> 
>> Thanks!
>> 
>> Ning.
>> 
>> On Tue, Aug 13, 2019 at 6:00 PM Ahmet Altay <altay@google.com> wrote:
>> Ning, I believe Robert's questions from his email have not been answered yet.
>> 
>> On Tue, Aug 13, 2019 at 5:00 PM Ning Kang <ningk@google.com> wrote:
>> Hi all, I'll leave another 3 days for design <https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing> review. Then we can have a vote session if there is no objection.
>> 
>> Thanks!
>> 
>> On Fri, Aug 9, 2019 at 12:14 PM Ning Kang <ningk@google.com> wrote:
>> Thanks Ahmet for the introduction!
>> 
>> I've composed a design overview <https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing> describing changes we are making to components around interactive runner. I'll share the document in our email thread too.
>> 
>> The truth is since interactive runner is not yet a recognized runner as part of the Beam SDK (and it's fundamentally a wrapper around direct runner), we are not touching any Beam SDK components.
>> We'll not change any behavior of existing Beam SDK and we'll try our best to keep it that way in the future.
>> 
>> My main concern at this point is the introduction of new concepts, even though these are not changing other parts of the Beam SDKs. It would be good to see at least an alternative option covered in the design document. The reason is each additional concept adds to the mental load of users. And also concepts from interactive Beam will shift user's expectations of Beam even though there are not direct SDK modifications.
>>  
>> 
>> In the meantime, I'll work on other components orthogonal to Beam such as Pipeline Display and Data Visualization I mentioned in the design overview.
>> 
>> If you have any questions, please feel free to contact me through this email address!
>> 
>> Thanks!
>> 
>> Regards,
>> Ning.
>> 
>> On Wed, Aug 7, 2019 at 5:01 PM Ahmet Altay <altay@google.com> wrote:
>> Ning, thank you for the heads up.
>> 
>> All, this is a proposed work for improving interactive Beam experience. As mentioned in Ning's email, new concepts are being introduced. And in addition iBeam as a name is used as a new reference. I hope that bringing the discussion to the mailing list will give it the additional visibility and more people could share their feedback.
>> 
>> (cc'ing a few folks that might be interested +Robert Bradshaw <ma...@google.com> +Valentyn Tymofieiev <ma...@google.com> +Sindy Li <ma...@google.com> +Brian Hulette <ma...@google.com> )
>> 
>> Ahmet
>> 
>> 
>> On Wed, Aug 7, 2019 at 12:36 PM Ning Kang <ningk@google.com> wrote:
>> To whom may concern,
>> 
>> This is Ning from Google. We are currently making efforts to leverage an interactive runner under python beam sdk.
>> 
>> There is already an interactive Beam (iBeam for short) runner with jupyter notebook in the repo <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive>.
>> Following the instructions on that page, one can set up an interactive environment to develop and execute Beam pipeline interactively.
>> 
>> However, there are many issues with existing iBeam. One issue is that it uses a concept of leaf PCollection to cache and materialize intermediate PCollection. If the user wants to reuse/introspect a non-leaf PCollection, the interactive runner will run into errors.
>> 
>> Our initial effort will be fixing the existing issues. And we also want to make iBeam easy to use. Since iBeam uses the same model Beam uses, there isn't really any difference for users between creating a pipeline with interactive runner and other runners. 
>> So we want to minimize the interfaces a user needs to learn while giving the user some capability to interact with the interactive environment.
>> 
>> See this initial PR <https://github.com/apache/beam/pull/9278>, the interactive_beam module will provide mainly 4 interfaces:
>> For advanced users who define pipeline outside __main__, let them tell current interactive environment where they define their pipeline: watch()
>> This is very useful for tests where pipeline can be defined in test methods.
>> If the user simply creates pipeline in a Jupyter notebook or a plain Python script, they don't have to know/use this feature at all.
>> Let users create an interactive pipeline: create_pipeline()
>> invoking create_pipeline(), the user gets a Pipeline object that works as any other Pipeline object created from apache_beam.Pipeline()
>> However, the pipeline object p, when invoking p.run(), does some extra interactive magic.
>> We'll support interactive execution for DirectRunner at this moment.
>> Let users run the interactive pipeline as a normal pipeline: run_pipeline()
>> In an interactive environment, a user only needs to add and execute 1 line of code run_pipeline(pipeline) to execute any existing interactive pipeline object as normal pipeline in any selected platform.
>> We'll probably support Dataflow only. Other implementations can be added though.
>> Let users introspect any intermediate PCollection they have handler to: visualize()
>> If a user ever writes pcoll = p | "Some Transform" >> some_transform() ..., they can visualize(pcoll) once the pipeline p is executed.
>> p can be batch or streaming
>> The visualization will be some plot graph of data for the given PCollection as if it's materialized. If the PCollection is unbounded, the graph is dynamic. 
>> The PR will implement 1 and 2.
>> 
>> We'll use https://issues.apache.org/jira/browse/BEAM-7923 as the top level JIRA and add blocking JIRAs as development goes.
>> 
>> External Beam users will not worry about any of the underlying implementation details.
>> Except the 4 interfaces above, they learn and write normal Beam code and can execute the pipeline immediately when they are done with prototyping.
>> 
>> Ning.

Re: Brief of interactive Beam

Posted by Robert Bradshaw <ro...@google.com>.

On Wed, Aug 21, 2019 at 3:33 PM GMAIL <ka...@gmail.com> wrote:

> Thanks for the input, Robert!
>
> On Aug 21, 2019, at 11:49 AM, Robert Bradshaw <ro...@google.com> wrote:
>
> On Wed, Aug 14, 2019 at 11:29 AM Ning Kang <ni...@google.com> wrote:
>
>> Ahmet, thanks for forwarding!
>>
>>
>>> My main concern at this point is the introduction of new concepts, even
>>> though these are not changing other parts of the Beam SDKs. It would be
>>> good to see at least an alternative option covered in the design document.
>>> The reason is each additional concept adds to the mental load of users. And
>>> also concepts from interactive Beam will shift user's expectations of Beam
>>> even though there are not direct SDK modifications.
>>
>>
>> Hi Robert. About the concern, I think I have a few points:
>>
>>    1. *Interactive Beam (or Interactive Runner) is already an existing
>>    "new concept" that normal Beam user could opt-in if they want an
>>    interactive Beam experience.* They need to do lots of setup steps and
>>    learn new things such as Jupyter notebook and at least interactive_runner
>>    module to make it work and make use of it.
>>
>> I think we should start with the perspective that most users interested
> in using Beam interactively already know about Jupyter notebooks, or at
> least ipython, and would want to use it to learn (and more effectively use)
> Beam.
>
> Yes, I agree with the perspective for users who are familiar with
> notebook. Yet it doesn’t prevent us from creating ready-to-use containers
> (such as binder <https://github.com/jupyterhub/binderhub>)  for users who
> want to try Beam interactively without setting up an environment with all
> the dependencies interactive Beam introduces. I agree that experienced
> users understand how to set up additional dependencies and read examples,
> it’s just we are also targeting other entry level audiences.
> But back to the original topic: the design is not trying to add a new
> concept, but to fix some rough edges of existing Interactive Beam features.
> We can discuss whether a create_pipeline() factory is really desired and
> decide whether to expose it later. We hope the interactive_beam module will
> be the only module an Interactive Beam user directly invokes in their
> notebook.
>

My goal would be that one uses a special interactive module for those
concepts that are unique to being interactive, and standard Beam concepts
(rather than replacements or wrappers) otherwise.

>
>>    1. *The behavior of existing interactive Beam is different from
>>    normal Beam because of the interactive nature and the users would expect
>>    that.* And the users wouldn't shift their expectation of normal Beam.
>>    Just like running Python scripts might result in different behavior than
>>    running all of them in an interactive Python session.
>>
>> I'm not quite following this. One of the strengths of Python is the
> lack of difference between interactive and non-interactive behavior.
> (The fact that a script's execution is always in top to bottom
> order, unlike a notebook, is the primary difference.)
>
> Sorry for the confusion. What I’m saying is about the hidden states.
> Running several Python scripts from top to bottom in an IPython session
> might generate different effects than running them in the same order
> normally. Say if you have an in-memory global configuration that is shared
> among all the scripts and if it’s missing, a script initializes one.
> Running the scripts in IPython will pass the initialization and
> modification of configuration along the scripts. While running the scripts
> one by one will initialize different configurations. Running cells in a
> notebook is equivalent to appending the cells into a script and running it.
> The interactivity is not about the order, but about whether hidden state is
> preserved between each statement or each script execution. And the users
> should expect that there might be hidden states when they are in an
> interactive environment because that is exactly the interactivity they
> expect. However, they don’t hold the hidden states, the session does it for
> them. A user wouldn’t need to explicitly say “preserve the variable x I’ve
> defined in this cell because I want to reuse it in some other cells I’m
> going to execute”. The user can directly access variable x once the cell
> defining x is executed. And even if the user deletes the cell defining x, x
> still exists. At that stage, no one would know there is a variable x in
> memory by just looking at the notebook. One would see a missing execution
> sequence (on top left of each executed cell) and wonder where the piece of
> code executed goes.
> Above is just for interactivity, not Beam.
>

You can have interactive without hidden state (e.g. the iPython console),
but notebooks are very common (and useful). IMHO, we should try to avoid
leveraging hidden state when possible as it makes things difficult to
reason about (even for those used to it). Things like implicit caches are
more palatable because (generally, there's issues with side effects and
non-determinism) the behavior remains the same (but performance, and
consequently user experience, is augmented).

The crux of interactive Beam seems instead to be that one can inspect the
contents of PCollections after they have run, and apply further operations
to these PCollections which, when executed, will (generally) not re-execute
previously completed operations (i.e. incremental, interactive pipeline
construction). These two properties can be leveraged in a script just as
well as a notebook and have nothing to do with hidden state (in the
notebook sense).
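
As a rough illustration of those two properties (inspecting results after a
run, and not re-executing completed work when more operations are applied),
here is a toy sketch in plain Python. It is not Beam code; Node and
CachingExecutor are made-up names, and the cache is keyed by label purely
for simplicity.

```python
# Toy model, not Beam: each node stands in for a PCollection and knows
# how to compute itself from its parent.
class Node:
    def __init__(self, label, fn, parent=None):
        self.label = label
        self.fn = fn          # function applied to the parent's elements
        self.parent = parent


class CachingExecutor:
    """Materializes nodes on demand and remembers the results."""

    def __init__(self):
        self.cache = {}       # label -> list of elements
        self.executed = []    # labels in the order they were actually computed

    def materialize(self, node):
        if node.label in self.cache:
            return self.cache[node.label]
        inputs = self.materialize(node.parent) if node.parent else None
        result = node.fn(inputs)
        self.cache[node.label] = result
        self.executed.append(node.label)
        return result


source = Node("read", lambda _: [1, 2, 3])
squares = Node("square", lambda xs: [x * x for x in xs], parent=source)

executor = CachingExecutor()
first = executor.materialize(squares)    # computes "read", then "square"

# Later, a further operation is applied to an already computed collection.
total = Node("sum", lambda xs: [sum(xs)], parent=squares)
second = executor.materialize(total)     # only "sum" is computed now
```

Nothing here depends on a notebook: the same calls behave identically in a
script, which is the point being made above.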

Another interesting thing is that a named PTransform cannot be applied to
> the same pipeline more than once. It means a cell with named PTransform:  p
> | “Name” >> NamedPTransform() cannot be re-executed. We might want to
> support such re-execution if the pipeline is in an interactive mode. In
> that case, the Beam behavior might be different from non-interactive Beam.
>

 Yes, this could be a critical difference. If we go down this route, an
"InteractivePipeline" that differs from the standard Pipeline may make
sense. More below.
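
To make the re-execution problem concrete, here is a toy sketch in plain
Python (not Beam; ToyPipeline and its replace flag are hypothetical). It
only models the rule that a label may be applied once per pipeline, and what
a "replace on re-run" mode could look like.

```python
class ToyPipeline:
    """Toy stand-in for a pipeline that enforces unique transform labels."""

    def __init__(self):
        self.applied = {}   # label -> transform

    def apply(self, label, transform, replace=False):
        # Default behaviour: a label can only be used once.
        if label in self.applied and not replace:
            raise RuntimeError('A transform with label "%s" already exists' % label)
        self.applied[label] = transform


p = ToyPipeline()
p.apply("Name", "NamedPTransform v1")

# Re-running the same "cell" applies the same label again and fails.
try:
    p.apply("Name", "NamedPTransform v1")
    rerun_error = None
except RuntimeError as e:
    rerun_error = str(e)

# A hypothetical interactive mode could replace instead of rejecting.
p.apply("Name", "NamedPTransform v2", replace=True)
```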

>
>>    1. Or if a user runs a Beam pipeline with direct runner, they should
>>    expect the behavior be different from running it on Dataflow while a user
>>    needs GCP account. I think the users are aware of the difference when they
>>    choose to use Interactive Beam.
>>
>>  The central, defining tenet of Beam is that behavior should be
> consistent across different runners. Of course there are operational
> details that are difficult, or perhaps even undesirable, to align (like, as
> you mention, needing a GCP account for running on Dataflow, or providing
> the location of the master when running Flink). But even these should be
> minimized (see the recent efforts to make the temp location a standard
> rather than dataflow-specific option).
>
> We should, however, attempt to minimize gratuitous differences. In
> particular, we should make it as easy as possible to transition (in terms
> of code, docs, and developers) between interactive and non-interactive.
>
> Yes, I agree that runners should be consistent. The interactive runner we
> currently have is a wrapper of its underlying runner. Thus we don’t intend
> to brand it as a standalone runner. And that’s why we want to create an
> interactive_beam module for end users to avoid confusing them with a new
> runner called InteractiveRunner.
>

*This* seems to be the crux of the argument, that Interactive should not be
a Runner, but something else. Exactly what that something else is has not
yet been well defined.


> And I want to focus on direct runner as underlying runner for now.
>

It's fine to only use the direct runner for now; it's certainly a lot
easier. But can you confirm there's nothing about the Direct runner itself
that you're depending on to implement this? (If there is something, please
clearly document it here.)


> Because with the portability of Beam, when applying interactivity to
> original pipeline, we can always do: pipeline_with_runner_A -> portable
> proto -> pipeline_with_direct_runner -> apply interactivity and get a new
> pipeline -> new portable proto -> pipeline_with_runner_A with interactivity.
>

I think there's some conflation of concepts here (which are reflected in
the rest of this conversation).

In the Beam model, the Runner is not a property of a Pipeline. Rather a
Pipeline is passed to a Runner for execution. (Along with some options.)

On the other hand, Python's Pipeline object currently holds an instance of
a Runner (for historical reasons). This is something we'd like to remove
(as we did in Java, or at least we removed the ability to reference and
query the Pipeline's Runner). This mismatch in the model vs the API was one
of the motivations for https://s.apache.org/no-beam-pipeline .

From an implementation perspective, I would strongly discourage having the
intermediate representation be "pipeline_with_direct_runner" that's derived
from a portable proto. The implementation details of that class are
ill-defined, subject to change, and I hope will in the not-too-distant
future be removed (or at least significantly cleaned up), and don't work
well for things like cross-language pipelines.


> And the run_pipeline() feature is exactly the transition from interactive
> to non-interactive.
> With p.run(), one has to rebuild/re-execute the pipeline p from the
> beginning. While they can do p -> portable proto ->
> pipeline_with_their_desired_runner_no_interactivity in a subsequent cell.
> The run_pipeline() is a shorthand for the process. It’s about the same
> thing we want to decide for create_pipeline(): do we leave the boilerplate
> one-liner to the users and what would the level of the users be for their
> first time playing with Beam?
> We never mutate the p defined by the user so p by itself is always
> non-interactive. Only when p.run() is invoked, interactivity is applied.
>

I am not following this at all. interactive_beam.run_pipeline(p) is not
p.run()? But it is the latter that is interactive, not the former? Maybe
some examples would help--there aren't any in the doc.


>
>
>>    1. *Our design actually reduces the mental load of interactive Beam
>>    users with intuitive interactive features*: create pipeline,
>>    visualize intermediate PCollection, run pipeline at some point with other
>>    runners and etc. For example, right now, the user needs to use a more
>>    complicated set of libraries, like creating a Beam pipeline with
>>    interactive runner that needs an underlying runner fed in.  We are getting
>>    rid of it. An interactive Beam user shouldn't be concerned about the
>>    underlying interactive magic.
>>
>> I agree a user shouldn't be concerned about the implementation details,
> but I fail to see how
>
>     p = interactive_module.create_pipeline()
>
> is significantly simpler than, or preferable to,
>
>    p = Pipeline(interactive_module.InteractiveRunner())
>
> especially as the latter is in line with non-interactive pipelines (and
> all our examples, docs, etc.) Now, perhaps an argument could be made that
> interactivity is not a property of the runner, but something orthogonal to
> that, e.g. one should write
>
>     p = InteractivePipeline()
>
> or
>
>     p = Pipeline(options, interactive=True)
>
> or similar. (It was introduced as a runner because, conceptually, a runner
> is something that takes a pipeline and executes it.)
>
> Yeah, this is debatable. I think the difference is, we want to introduce
> something that is Interactive Beam but not Interactive Runner.
>

Gotcha.


> I really wish that we have a 4runner standard thingy that can be
> implemented to intercept the pipeline for any runner before the execution.
>

Any runner that does delegation to another runner *is* a standard thingy
that can intercept the pipeline for any runner before the execution. (One
may want translation of results on the other end as well.) But it can be
argued that this is not the most intuitive API.
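
A minimal sketch of what "a runner that delegates to another runner" means,
in plain Python rather than Beam. All class names are hypothetical; the only
assumptions are that a runner is handed a pipeline to execute and that the
wrapper copies the pipeline instead of mutating it.

```python
class ToyPipeline:
    def __init__(self, steps=()):
        self.steps = list(steps)   # list of (label, fn) applied in order


class ToyDirectRunner:
    """Executes a pipeline; the pipeline itself knows nothing about runners."""

    def run_pipeline(self, pipeline, options=None):
        data = None
        for _, fn in pipeline.steps:
            data = fn(data)
        return data


class DelegatingRunner:
    """Intercepts the pipeline, then hands a copy to the underlying runner."""

    def __init__(self, underlying):
        self.underlying = underlying

    def run_pipeline(self, pipeline, options=None):
        intercepted = ToyPipeline(pipeline.steps)           # never mutate the user's pipeline
        labels = [label for label, _ in intercepted.steps]  # "before execution" hook
        result = self.underlying.run_pipeline(intercepted, options)
        # Translate the result on the way back out as well.
        return {"result": result, "steps": labels}


pipeline = ToyPipeline([
    ("create", lambda _: [1, 2, 3]),
    ("double", lambda xs: [2 * x for x in xs]),
])

plain = ToyDirectRunner().run_pipeline(pipeline)
wrapped = DelegatingRunner(ToyDirectRunner()).run_pipeline(pipeline)
```

The same pipeline object is passed to either runner unchanged, so the
wrapper works with any underlying runner that accepts it.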

In that case, the wrapped interactive logic can be just a function that is
> applicable to existing runner in some interactive mode with
> “interactive=True”. But we don’t intend to introduce the new concept.
>
> p = interactive_module.create_pipeline()
>
> is saying the pipeline is created with interactivity and runnable in
> notebook. You can still convert the pipeline for other runners without
> interactivity.
> There is no guarantee that the interactivity is provided by (direct)
> runner nor any of the underlying implementation is backward compatible.
> There is no implementation details exposed in the API. You can even treat
> it as using this library, one can create a pipeline with direct runner but
> with interactivity as additional feature.
> I would favor the composition way.
>
> p = Pipeline(interactive_module.InteractiveRunner())
>
> is saying you can create a pipeline with this new runner thingy and we’ll
> maintain the runner forever like other runners.
> Additionally it can even take in an underlying runner. But does it make
> sense for a runner to take another runner? Is this applicable to all
> runners?
>

This is similar to Guava's filtering and transforming collection wrappers.


> The implementation details are exposed in the API itself, though users
> don’t care about them and we don’t want to maintain them.
>
> p = Pipeline(options, interactive=True)
>
> is saying Pipeline can have an interactive mode. Like discussed above,
> it’ll be a new concept for the concept “pipeline”. It’s basically saying
> any pipeline can have an interactive mode. That’s the inheritance way.
>

Yep. If you already know about Pipelines, it's easy to learn about
interactive Pipelines. Perhaps more importantly, if you already know about
Interactive Pipelines, you now know a bunch about non-interactive ones.

An even more radical concept would be that an interactive Pipeline is just
one that has any watched PCollections. (In other words, interactive vs.
non-interactive is not a mode or a setting, just something that happens
when you try to use things interactively.)
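
One way to read that idea, sketched in plain Python (not Beam; the watch
method and is_interactive property here are hypothetical and are not the
interactive_beam API): interactivity is derived from usage rather than set
as a flag.

```python
class ToyPCollection:
    def __init__(self, label):
        self.label = label


class ToyPipeline:
    def __init__(self):
        self.watched = set()   # labels of collections the user asked to inspect

    def watch(self, pcoll):
        self.watched.add(pcoll.label)

    @property
    def is_interactive(self):
        # No mode or setting: the pipeline is interactive only because
        # something in it is being watched.
        return bool(self.watched)


p = ToyPipeline()
before = p.is_interactive
p.watch(ToyPCollection("words"))
after = p.is_interactive
```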


> Does this apply to pipeline created by other runner?
>

Runners do not create pipelines. Runners execute pipelines. Any Runner
should be able to execute any pipeline.


>
>>    1. The interactive experience should be tailored for different
>>    underlying runners. There is no portability of interactivity and users
>>    opt-in interactive Beam using notebook would naturally expect something
>>    similar to the direct runner.
>>
>> This concerns me a lot. Again, the core tenet of Beam is that one can
> choose the execution environment (runner) completely independently with how
> one writes the pipeline. We should figure out what, if anything, needs to
> be supported by a runner to support interactivity (the only thing that
> comes to mind is a place to read and write temporary/cached data), but I
> very strongly feel we should not go down the road of having different
> interactive apis/experiences for different runners. In particular, the many
> references to the DirectRunner are worrisome--what's special about the
> DirectRunner that other runners cannot provide that's needed for
> interactive? If we can't come up with a good answer to that, we should not
> impose this restriction.
>
> We don’t intend to provide different interactive experience for different
> runners.
>

Good, so to clarify, you don't think the "interactive experience should be
tailored for different underlying runners."


>
>>    1. *Interactive Beam is solving an orthogonal set of problems than
>>    Beam*. You can think of it as a wrapper of Beam that enables
>>    interactivity and it's not even a real runner. It doesn't change the Beam
>>    model such as how you build a pipeline. And with the Beam portability, you
>>    get the capability to run the pipeline built from interactive runner with
>>    other runners for free. It adds the interactive behavior that a user
>>    expects.
>>    2. *We want to open source it though we can iterate faster without
>>    doing it*. The whole project can be encapsulated in a completely
>>    irrelevant repository and from a developer's perspective, I want to hide
>>    all the implementation details from the interactive Beam user. However, as
>>    there is more and more desire for interactive Beam (+Mehran Nazir
>>    <mn...@google.com> for more details), we want to share the
>>    implementation with others who want to contribute and explore the
>>    interactive world.
>>
>> I would much rather see interactivity as part of the Beam project. With
> good APIs the implementations don't have to be tightly coupled (e.g. the
> underlying runner delegation) but I think it will be a better user
> experience if interactive was a mode rather than a wrapper with different
> entry points.
>
> I agree with it when the interactive mode is acknowledged as a
> must-have/common/standard for all runners or for Beam pipeline itself.
> But this will be a new concept that is distributed top-down from Beam to
> its contributors.
> Our current objective is to enable interactivity of Beam in notebook and
> introduce experimental usages/features (and we want to limit the features
> and make them easy to use).
> I like the idea that anyone can write a runner or a transform. I would
> wish that anyone can add interactivity (before, during, or after pipeline
> execution) when it’s a mature concept.
> Maybe in the future, interactivity will be something that is similar to
> accessibility, internationalization, testability and etc. (Like if you
> write a class, you need not only a test but also a notebook to demo it)
>

The discussions about how to behave with such fundamental concepts such as
applying transforms imply that this is something that can't just be thrown
on top, but may require deeper changes to the API itself.


> But we are not doing it right now until we have more feedback from the
> community and concrete CUJs.
>
>
>
> I think watch() is a really good solution to knowing which collections to
> cache, and visualize() will be very useful.
>
> One thing I don't see tackled at all yet is the fact that pipelines are
> only ever mutated by appending on new operations, so some design needs to
> be done in terms of how to remove (possibly error-causing) operations or
> replace bad ones with fixed ones. This is where most of the unsolved
> problems lie.
>
> Thanks! And we haven’t really decided if plotting plain data is a good
> idea for PCollections. Plotting the metadata/analytics/insight might be a
> better option for users. I’m looking into facets
> <https://pair-code.github.io/facets/> that TFX notebook uses.
>

Yes, there are many different ways one might want to "look at" a
PCollection.


> Yes, existing PR <https://github.com/apache/beam/pull/9278> is appending
> operations only.
> We would have pruning when the appended operations sever the whole
> pipeline DAG into a set of DAGs to only execute the part of pipeline that
> needs to be re-executed (optimization). This is included in the design.
>

I'm talking about pruning like having a cell with

    pcoll | beam.Map(lambda x: expression_with_typo)

and then fixing it (and re-evaluating) with

    pcoll | beam.Map(lambda x: expression_with_typo_fixed)

where the former Map would *always* fail and never get removed from the
pipeline.
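
A toy sketch of that failure mode in plain Python (not Beam). ToyPipeline,
the cell ids, and prune() are all hypothetical; division by zero stands in
for expression_with_typo.

```python
class ToyPipeline:
    def __init__(self, source):
        self.source = source
        self.branches = []     # append-only list of (cell_id, fn)

    def apply(self, cell_id, fn):
        self.branches.append((cell_id, fn))

    def prune(self, cell_id):
        # Hypothetical: drop every branch added by earlier runs of this cell.
        self.branches = [(c, f) for c, f in self.branches if c != cell_id]

    def run(self):
        return [[fn(x) for x in self.source] for _, fn in self.branches]


p = ToyPipeline([1, 2, 3])
p.apply("cell-1", lambda x: x / 0)   # stands in for expression_with_typo
p.apply("cell-1", lambda x: x / 2)   # cell re-evaluated with the fix

# Append-only: the broken Map is still part of the pipeline, so run() fails.
try:
    p.run()
    append_only_ok = True
except ZeroDivisionError:
    append_only_ok = False

# With pruning, re-evaluating the cell replaces what it added before.
p.prune("cell-1")
p.apply("cell-1", lambda x: x / 2)
fixed = p.run()
```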


> We could also have the replacing logic that allows users to re-execute a
> cell with named PTransform by replacing the PTransform.
> This needs your precious feedback and is a bug-fix of existing interactive
> beam rather than the design around user-defined Collection variables.
> This is actually another debatable topic. Like I said, executing cells in
> notebook is equivalent to appending the code into a script.
> Users can re-execute cells with only anonymous PTransforms. It’s
> equivalent to applying the PTransform many times "in parallel".
> However, users cannot re-execute cells with any named PTransform. The
> inconsistency comes.
>

Yes, this is a *major* pain point I've seen every time people try to use
Beam in a notebook. On the other hand, differing behaviors here (adding
operations in parallel vs. replacement) could have surprising (non-local!)
effects when moving from interactive to non-interactive mode.


> Should the re-execution of named PTransform be supported? When supported,
> should we replace PTransform or append PTransforms "in parallel"?
> I think we’ll eventually pick a route, go through it and see what the
> interactive Beam users provide as feedback.
>

One approach was that the pipeline construction is re-executed every time
(i.e. the "pipeline" object to run is really a callback, like a callable or
a PTransform) and then there's no ambiguity here. This has the downside of
recreating the PCollection objects which are being used as handles (though
perhaps they could be re-identified).
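
A sketch of the callback approach in plain Python (not Beam). Handle,
ToyPipeline and run() are made-up names, and re-identifying handles by label
is just one assumed way of doing it.

```python
class Handle:
    """Stand-in for a PCollection object the user holds on to."""

    def __init__(self, label):
        self.label = label


class ToyPipeline:
    def __init__(self):
        self.steps = {}   # label -> (fn, upstream label or None), in apply order

    def apply(self, label, fn, upstream=None):
        if label in self.steps:
            raise RuntimeError("duplicate label: " + label)
        self.steps[label] = (fn, upstream.label if upstream else None)
        return Handle(label)


def run(build):
    pipeline = ToyPipeline()   # a fresh pipeline on every run
    build(pipeline)            # construction is re-executed from scratch
    results = {}
    for label, (fn, upstream) in pipeline.steps.items():
        results[label] = fn(results.get(upstream))
    return results


def build_v1(p):
    words = p.apply("Create", lambda _: ["a", "b"])
    return p.apply("Upper", lambda xs: [x.upper() for x in xs], words)


def build_v2(p):   # the same "cell" after an edit; labels are reused freely
    words = p.apply("Create", lambda _: ["a", "b"])
    return p.apply("Upper", lambda xs: [x.upper() + "!" for x in xs], words)


old_handle = build_v1(ToyPipeline())   # a handle from an earlier construction
first = run(build_v1)
second = run(build_v2)
relinked = second[old_handle.label]    # re-identified by label, not by object
```

Because each run starts from an empty pipeline, named transforms never
collide and stale operations never accumulate; the cost is that object
identity of the handles is lost between runs.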


>
>
> Also +David Yan <da...@google.com>  for more opinions.
>>
>> Thanks!
>>
>> Ning.
>>
>> On Tue, Aug 13, 2019 at 6:00 PM Ahmet Altay <al...@google.com> wrote:
>>
>>> Ning, I believe Robert's questions from his email have not been answered
>>> yet.
>>>
>>> On Tue, Aug 13, 2019 at 5:00 PM Ning Kang <ni...@google.com> wrote:
>>>
>>>> Hi all, I'll leave another 3 days for design
>>>> <https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing> review.
>>>> Then we can have a vote session if there is no objection.
>>>> Thanks!
>>>>
>>>> On Fri, Aug 9, 2019 at 12:14 PM Ning Kang <ni...@google.com> wrote:
>>>>
>>>>> Thanks Ahmet for the introduction!
>>>>>
>>>>> I've composed a design overview
>>>>> <https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing>
>>>>> describing changes we are making to components around interactive runner.
>>>>> I'll share the document in our email thread too.
>>>>>
>>>>> The truth is since interactive runner is not yet a recognized runner
>>>>> as part of the Beam SDK (and it's fundamentally a wrapper around direct
>>>>> runner), we are not touching any Beam SDK components.
>>>>> We'll not change any behavior of existing Beam SDK and we'll try our
>>>>> best to keep it that way in the future.
>>>>>
>>>>
>>> My main concern at this point is the introduction of new concepts, even
>>> though these are not changing other parts of the Beam SDKs. It would be
>>> good to see at least an alternative option covered in the design document.
>>> The reason is each additional concept adds to the mental load of users. And
>>> also concepts from interactive Beam will shift user's expectations of Beam
>>> even though there are not direct SDK modifications.
>>>
>>>
>>>>
>>>>> In the meantime, I'll work on other components orthogonal to Beam such
>>>>> as Pipeline Display and Data Visualization I mentioned in the design
>>>>> overview.
>>>>>
>>>>> If you have any questions, please feel free to contact me through this
>>>>> email address!
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Regards,
>>>>> Ning.
>>>>>
>>>>> On Wed, Aug 7, 2019 at 5:01 PM Ahmet Altay <al...@google.com> wrote:
>>>>>
>>>>>> Ning, thank you for the heads up.
>>>>>>
>>>>>> All, this is a proposed work for improving interactive Beam
>>>>>> experience. As mentioned in Ning's email, new concepts are being
>>>>>> introduced. And in addition iBeam as a name is used as a new reference. I
>>>>>> hope that bringing the discussion to the mailing list will give it the
>>>>>> additional visibility and more people could share their feedback.
>>>>>>
>>>>>> (cc'ing a few folks that might be interested +Robert Bradshaw
>>>>>> <ro...@google.com> +Valentyn Tymofieiev <va...@google.com> +Sindy
>>>>>> Li <qi...@google.com> +Brian Hulette <bh...@google.com> )
>>>>>>
>>>>>> Ahmet
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 7, 2019 at 12:36 PM Ning Kang <ni...@google.com> wrote:
>>>>>>
>>>>>>> To whom it may concern,
>>>>>>>
>>>>>>> This is Ning from Google. We are currently making efforts to
>>>>>>> leverage an interactive runner under python beam sdk.
>>>>>>>
>>>>>>> There is already an interactive Beam (iBeam for short) runner with
>>>>>>> jupyter notebook in the repo
>>>>>>> <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive>
>>>>>>> .
>>>>>>> Following the instructions on that page, one can set up an
>>>>>>> interactive environment to develop and execute Beam pipeline interactively.
>>>>>>>
>>>>>>> However, there are many issues with existing iBeam. One issue is
>>>>>>> that it uses a concept of leaf PCollection to cache and materialize
>>>>>>> intermediate PCollection. If the user wants to reuse/introspect a non-leaf
>>>>>>> PCollection, the interactive runner will run into errors.
>>>>>>>
>>>>>>> Our initial effort will be fixing the existing issues. And we also
>>>>>>> want to make iBeam easy to use. Since iBeam uses the same model Beam uses,
>>>>>>> there isn't really any difference for users between creating a pipeline
>>>>>>> with interactive runner and other runners.
>>>>>>> So we want to minimize the interfaces a user needs to learn while
>>>>>>> giving the user some capability to interact with the interactive
>>>>>>> environment.
>>>>>>>
>>>>>>> See this initial PR <https://github.com/apache/beam/pull/9278>, the
>>>>>>> interactive_beam module will provide mainly 4 interfaces:
>>>>>>>
>>>>>>>    - For advanced users who define pipeline outside __main__, let
>>>>>>>    them tell current interactive environment where they define their pipeline:
>>>>>>>    watch()
>>>>>>>       - This is very useful for tests where pipeline can be defined
>>>>>>>       in test methods.
>>>>>>>       - If the user simply creates pipeline in a Jupyter notebook
>>>>>>>       or a plain Python script, they don't have to know/use this feature at all.
>>>>>>>    - Let users create an interactive pipeline: create_pipeline()
>>>>>>>       - invoking create_pipeline(), the user gets a Pipeline object
>>>>>>>       that works as any other Pipeline object created from apache_beam.Pipeline()
>>>>>>>       - However, the pipeline object p, when invoking p.run(), does
>>>>>>>       some extra interactive magic.
>>>>>>>       - We'll support interactive execution for DirectRunner at
>>>>>>>       this moment.
>>>>>>>    - Let users run the interactive pipeline as a normal pipeline:
>>>>>>>    run_pipeline()
>>>>>>>       - In an interactive environment, a user only needs to add and
>>>>>>>       execute 1 line of code run_pipeline(pipeline) to execute any existing
>>>>>>>       interactive pipeline object as normal pipeline in any selected platform.
>>>>>>>       - We'll probably support Dataflow only. Other implementations
>>>>>>>       can be added though.
>>>>>>>    - Let users introspect any intermediate PCollection they have
>>>>>>>    handler to: visualize()
>>>>>>>       - If a user ever writes pcoll = p | "Some Transform" >>
>>>>>>>       some_transform() ..., they can visualize(pcoll) once the pipeline p is
>>>>>>>       executed.
>>>>>>>       - p can be batch or streaming
>>>>>>>       - The visualization will be some plot graph of data for the
>>>>>>>       given PCollection as if it's materialized. If the PCollection is unbounded,
>>>>>>>       the graph is dynamic.
>>>>>>>
>>>>>>> The PR will implement 1 and 2.
>>>>>>>
>>>>>>> We'll use https://issues.apache.org/jira/browse/BEAM-7923 as the
>>>>>>> top level JIRA and add blocking JIRAs as development goes.
>>>>>>>
>>>>>>> External Beam users will not worry about any of the underlying
>>>>>>> implementation details.
>>>>>>> Except the 4 interfaces above, they learn and write normal Beam code
>>>>>>> and can execute the pipeline immediately when they are done with
>>>>>>> prototyping.
>>>>>>>
>>>>>>> Ning.
>>>>>>>
>>>>>>
>

Re: Brief of interactive Beam

Posted by GMAIL <ka...@gmail.com>.

Thanks for the input, Robert!

> On Aug 21, 2019, at 11:49 AM, Robert Bradshaw <ro...@google.com> wrote:
> 
> On Wed, Aug 14, 2019 at 11:29 AM Ning Kang <ningk@google.com> wrote:
> Ahmet, thanks for forwarding!
>  
> My main concern at this point is the introduction of new concepts, even though these are not changing other parts of the Beam SDKs. It would be good to see at least an alternative option covered in the design document. The reason is each additional concept adds to the mental load of users. And also concepts from interactive Beam will shift user's expectations of Beam even though there are not direct SDK modifications.
> 
> Hi Robert. About the concern, I think I have a few points:
> Interactive Beam (or Interactive Runner) is already an existing "new concept" that normal Beam user could opt-in if they want an interactive Beam experience. They need to do lots of setup steps and learn new things such as Jupyter notebook and at least interactive_runner module to make it work and make use of it.
> I think we should start with the perspective that most users interested in using Beam interactively already know about Jupyter notebooks, or at least ipython, and would want to use it to learn (and more effectively use) Beam. 
Yes, I agree with the perspective for users who are familiar with notebooks. Yet it doesn’t prevent us from creating ready-to-use containers (such as binder <https://github.com/jupyterhub/binderhub>) for users who want to try Beam interactively without setting up an environment with all the dependencies interactive Beam introduces. I agree that experienced users understand how to set up additional dependencies and read examples; it’s just that we are also targeting other entry-level audiences.
But back to the original topic: the design is not trying to add a new concept, but to fix some rough edges of existing Interactive Beam features. We can discuss whether a create_pipeline() factory is really desired and decide whether to expose it later. We hope the interactive_beam module will be the only module an Interactive Beam user directly invokes in their notebook.
> The behavior of existing interactive Beam is different from normal Beam because of the interactive nature and the users would expect that. And the users wouldn't shift their expectation of normal Beam. Just like running Python scripts might result in different behavior than running all of them in an interactive Python session.
> I'm not quite following this. One of the strengths of Python is the lack of difference between interactive and non-interactive behavior. (The fact that a script's execution is always in top to bottom order, unlike a notebook, is the primary difference.)
Sorry for the confusion. What I’m talking about is hidden state. Running several Python scripts from top to bottom in an IPython session might produce different effects than running them in the same order normally. Say you have an in-memory global configuration that is shared among all the scripts, and if it’s missing, a script initializes one. Running the scripts in IPython will pass the initialization and modification of the configuration along from script to script, while running the scripts one by one will initialize a different configuration each time. Running cells in a notebook is equivalent to appending the cells into a script and running it. The interactivity is not about the order, but about whether hidden state is preserved between each statement or each script execution. And the users should expect that there might be hidden state when they are in an interactive environment because that is exactly the interactivity they expect. However, they don’t hold the hidden state; the session does it for them. A user wouldn’t need to explicitly say “preserve the variable x I’ve defined in this cell because I want to reuse it in some other cells I’m going to execute”. The user can directly access variable x once the cell defining x is executed. And even if the user deletes the cell defining x, x still exists. At that stage, no one would know there is a variable x in memory by just looking at the notebook. One would see a gap in the execution sequence numbers (on the top left of each executed cell) and wonder where the executed piece of code went.
Above is just for interactivity, not Beam.
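
The shared-configuration example can be reproduced with nothing but the
standard library. In this sketch (an illustration only; SCRIPT and the
config dict are made up), a dict passed to exec() plays the role of the
session namespace.

```python
# A "script" that initializes a global configuration only if it is missing.
SCRIPT = """
if "config" not in globals():
    config = {"runs": 0}
config["runs"] += 1
"""

# One interactive session: both executions share the same namespace, so the
# second run sees the state left behind by the first.
session = {}
exec(SCRIPT, session)
exec(SCRIPT, session)
session_runs = session["config"]["runs"]

# Two separate runs: each gets a fresh namespace and initializes its own config.
run_a, run_b = {}, {}
exec(SCRIPT, run_a)
exec(SCRIPT, run_b)
separate_runs = (run_a["config"]["runs"], run_b["config"]["runs"])
```

The script text is identical in both cases; only the namespace that
outlives each execution differs.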

Another interesting thing is that a named PTransform cannot be applied to the same pipeline more than once. It means a cell with named PTransform:  p | “Name” >> NamedPTransform() cannot be re-executed. We might want to support such re-execution if the pipeline is in an interactive mode. In that case, the Beam behavior might be different from non-interactive Beam. 
> Or if a user runs a Beam pipeline with direct runner, they should expect the behavior be different from running it on Dataflow while a user needs GCP account. I think the users are aware of the difference when they choose to use Interactive Beam.
>  The central, defining tenet of Beam is that behavior should be consistent across different runners. Of course there are operational details that are difficult, or perhaps even undesirable, to align (like, as you mention, needing a GCP account for running on Dataflow, or providing the location of the master when running Flink). But even these should be minimized (see the recent efforts to make the temp location a standard rather than dataflow-specific option).
> 
> We should, however, attempt to minimize gratuitous differences. In particular, we should make it as easy as possible to transition (in terms of code, docs, and developers) between interactive and non-interactive. 
Yes, I agree that runners should be consistent. The interactive runner we currently have is a wrapper of its underlying runner. Thus we don’t intend to brand it as a standalone runner. And that’s why we want to create an interactive_beam module for end users to avoid confusing them with a new runner called InteractiveRunner. And I want to focus on direct runner as underlying runner for now. Because with the portability of Beam, when applying interactivity to original pipeline, we can always do: pipeline_with_runner_A -> portable proto -> pipeline_with_direct_runner -> apply interactivity and get a new pipeline -> new portable proto -> pipeline_with_runner_A with interactivity.

And the run_pipeline() feature is exactly the transition from interactive to non-interactive.
With p.run(), one has to rebuild and re-execute the pipeline p from the beginning, whereas in a subsequent cell one can do p -> portable proto -> pipeline_with_their_desired_runner_no_interactivity.
run_pipeline() is a shorthand for that process. It’s the same question we need to decide for create_pipeline(): do we leave the boilerplate one-liner to the users, and what level of experience do we assume for users playing with Beam for the first time?
We never mutate the p defined by the user, so p by itself is always non-interactive. Interactivity is applied only when p.run() is invoked.

> Our design actually reduces the mental load of interactive Beam users with intuitive interactive features: create pipeline, visualize intermediate PCollection, run pipeline at some point with other runners and etc. For example, right now, the user needs to use a more complicated set of libraries, like creating a Beam pipeline with interactive runner that needs an underlying runner fed in.  We are getting rid of it. An interactive Beam user shouldn't be concerned about the underlying interactive magic.
> I agree a user shouldn't be concerned about the implementation details, but I fail to see how
> 
>     p = interactive_module.create_pipeline()
> 
> is significantly simpler than, or preferable to, 
> 
>    p = Pipeline(interactive_module.InteractiveRunner())
> 
> especially as the latter is in line with non-interactive pipelines (and all our examples, docs, etc.) Now, perhaps an argument could be made that interactivity is not a property of the runner, but something orthogonal to that, e.g. one should write
> 
>     p = InteractivePipeline()
> 
> or
> 
>     p = Pipeline(options, interactive=True)
> 
> or similar. (It was introduced as a runner because, conceptually, a runner is something that takes a pipeline and executes it.)
Yeah, this is debatable. I think the difference is that we want to introduce something that is Interactive Beam, not an Interactive Runner.
I really wish we had a standard, runner-level hook that could be implemented to intercept the pipeline for any runner before execution. In that case, the wrapped interactive logic could be just a function applicable to any existing runner in some interactive mode with “interactive=True”. But we don’t intend to introduce that new concept.
>> p = interactive_module.create_pipeline()
says the pipeline is created with interactivity and is runnable in a notebook. You can still convert the pipeline for other runners without interactivity.
There is no guarantee that the interactivity is provided by the (direct) runner, nor that any of the underlying implementation is backward compatible.
No implementation details are exposed in the API. You can even think of it this way: using this library, one can create a pipeline with the direct runner, with interactivity as an additional feature.
I would favor the composition way.
>> p = Pipeline(interactive_module.InteractiveRunner())
says you can create a pipeline with this new runner thingy, and we’ll maintain the runner forever like other runners. 
Additionally, it can even take in an underlying runner. But does it make sense for a runner to take another runner? Is this applicable to all runners?
The implementation details are exposed in the API itself, even though users don’t care about them and we don’t want to maintain them.
>> p = Pipeline(options, interactive=True)
says Pipeline can have an interactive mode. As discussed above, that would be a new concept attached to the concept of “pipeline”: it basically says any pipeline can have an interactive mode. That’s the inheritance way.
Does this apply to pipelines created for other runners? Is it needed when interactivity can be made portable by implementing it just for the direct runner and converting to pipelines for all other runners through Beam’s portability?

I would argue that anyone can use composition to achieve pipeline + interactivity without expecting interactivity to be a standard defined by Beam.
Anyone can then use any interactivity component they find online or implement themselves,
rather than Beam defining pipeline.interactivity as a thing that everyone should implement and maintain.
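As a rough illustration of the composition argument, the sketch below wraps a pipeline object without the pipeline class knowing anything about interactivity. ToyPipeline and InteractiveWrapper are hypothetical names for this example only; neither is part of Beam or of the proposed interactive_beam module.

```python
class ToyPipeline:
    """Toy pipeline that knows nothing about interactivity."""

    def __init__(self, steps):
        self.steps = list(steps)

    def run(self):
        # Pretend to execute every step and return the names that ran.
        return list(self.steps)


class InteractiveWrapper:
    """Composition: adds caching around any object that has run()."""

    def __init__(self, pipeline):
        self._pipeline = pipeline
        self.cache = {}

    def run(self):
        executed = self._pipeline.run()
        for step in executed:
            # Record a placeholder "materialized" result per step.
            self.cache[step] = f"materialized({step})"
        return executed


plain = ToyPipeline(["read", "count"])
interactive = InteractiveWrapper(plain)
interactive.run()
```

The wrapped pipeline stays usable on its own, and a different interactivity component could wrap the same object without any change to ToyPipeline.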

> 
> Similarly, why introduce a special run_pipeline(p) rather than p.run()? 
> 
p.run() is interactive, while run_pipeline(p) is the transition to non-interactive without any boilerplate from the user.
Switching the runner for p.run() in a notebook requires modifying and re-executing existing cells.
Adding a line of run_pipeline(p) requires neither. It’s ad hoc and a feature for interactivity.
> The interactive experience should be tailored for different underlying runners. There is no portability of interactivity and users opt-in interactive Beam using notebook would naturally expect something similar to the direct runner.
> This concerns me a lot. Again, the core tenet of Beam is that one can choose the execution environment (runner) completely independently of how one writes the pipeline. We should figure out what, if anything, needs to be supported by a runner to support interactivity (the only thing that comes to mind is a place to read and write temporary/cached data), but I very strongly feel we should not go down the road of having different interactive apis/experiences for different runners. In particular, the many instances to the DirectRunner are worrisome--what's special about the DirectRunner that other runners cannot provide that's needed for interactive? If we can't come up with a good answer to that, we should not impose this restriction. 
We don’t intend to provide a different interactive experience for different runners.
We want to focus on the direct runner for now. In a notebook environment, the direct runner is very intuitive, especially for users who have never built a Beam pipeline or run one with any runner. No excessive pipeline options are required either.
We apply interactivity by creating a new pipeline, with hidden states in the interactive environment, from the user-defined pipeline.
Runners run the new pipeline. It doesn’t matter what the runner is: (pipeline_with_runner_A -> portable proto ->) pipeline_with_direct_runner -> apply interactivity and get a new pipeline -> new portable proto -> pipeline_with_runner_A with interactivity.
At the same time, the user-defined pipeline is never touched by the interactivity logic. Users can further develop pipelines from it.
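The conversion chain above can be sketched roughly as follows. This is a toy model under stated assumptions: ToyPipeline, to_proto, from_proto and apply_interactivity are hypothetical names, JSON stands in for the portable proto, and the "cache" steps are placeholders for whatever the interactive logic would actually insert.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToyPipeline:
    """Toy pipeline: a runner name plus an ordered list of step names."""
    runner: str
    steps: list = field(default_factory=list)

    def to_proto(self):
        # Stand-in for a runner-independent serialized form.
        return json.dumps({"steps": self.steps})

    @classmethod
    def from_proto(cls, proto, runner):
        return cls(runner=runner, steps=list(json.loads(proto)["steps"]))


def apply_interactivity(user_pipeline):
    """Return a NEW pipeline with a cache step after every user step."""
    direct = ToyPipeline.from_proto(user_pipeline.to_proto(), runner="direct")
    instrumented = []
    for step in direct.steps:
        instrumented += [step, f"cache({step})"]
    direct.steps = instrumented
    # Convert back so the original runner executes the instrumented copy.
    return ToyPipeline.from_proto(direct.to_proto(), runner=user_pipeline.runner)


user_p = ToyPipeline(runner="runner_A", steps=["read", "count"])
interactive_p = apply_interactivity(user_p)
```

Since every hop goes through a serialized copy, user_p keeps its original two steps and can be developed further or handed to another runner unchanged.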
> When users run pipeline built from interactive runner in a non-interactive environment, it's direct runner like any other Beam tutorial demonstrates. It's even easier because the user doesn't need to specify the runner nor pass in options.
>  So is the idea to have code like
> 
>     if is_ipython() or is_jupyter() or is ...:
>       do_something()
>     else:
>       do_another_thing()
> 
> I'd really like to avoid this as it means one will (quite surprisingly, and possibly for subtle reasons) not be able to copy code from a notebook elsewhere. Or did you mean something else here? 
It’s still the same discussion about interactivity. We don’t have any such specification.
Assuming you use the interactive_beam module to write a piece of code:
  - When a user writes and executes code in a notebook or an IPython session, there will be hidden states. Until the kernel is restarted, hidden states are preserved across all statement executions.
  - When a user writes code in a script and executes it, each execution is equivalent to a distinct IPython session.
Interactivity is only applicable when hidden states are accessed from within the same interactive session/environment/context.
Thus the code in a script is automatically non-interactive and equivalent to running the pipeline for the first time in a notebook.
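A small sketch of the hidden-state idea, assuming the hidden state is simply a per-session cache of step results. InteractiveEnvironment here is a hypothetical class written for this example, not the environment implemented in the PR.

```python
class InteractiveEnvironment:
    """Toy session state: remembers which steps already have results."""

    def __init__(self):
        self._cache = {}

    def run(self, steps):
        """Run the steps, returning only the ones actually computed."""
        computed = []
        for step in steps:
            if step not in self._cache:
                self._cache[step] = f"result({step})"
                computed.append(step)
        return computed


# Same kernel: state survives between cell executions.
session = InteractiveEnvironment()
first = session.run(["read", "count"])
second = session.run(["read", "count", "format"])

# A script run starts from an empty environment every time.
fresh = InteractiveEnvironment().run(["read", "count", "format"])
```

In the long-lived session the second run only computes the new step, while the fresh environment computes everything, which is the same as a first run in a notebook.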
> Interactive Beam is solving an orthogonal set of problems than Beam. You can think of it as a wrapper of Beam that enables interactivity and it's not even a real runner. It doesn't change the Beam model such as how you build a pipeline. And with the Beam portability, you get the capability to run the pipeline built from interactive runner with other runners for free. It adds the interactive behavior that a user expects.
> We want to open source it though we can iterate faster without doing it. The whole project can be encapsulated in a completely irrelevant repository and from a developer's perspective, I want to hide all the implementation details from the interactive Beam user. However, as there is more and more desire for interactive Beam (+Mehran Nazir <ma...@google.com> for more details), we want to share the implementation with others who want to contribute and explore the interactive world.
> I would much rather see interactivity as part of the Beam project. With good APIs the implementations don't have to be tightly coupled (e.g. the underlying runner delegation) but I think it will be a better user experience if interactive was a mode rather than a wrapper with different entry points. 
> 
I agree with that, provided the interactive mode is acknowledged as a must-have/common/standard for all runners or for the Beam pipeline itself.
But this would be a new concept distributed top-down from Beam to its contributors.
Our current objective is to enable interactivity of Beam in notebooks and introduce experimental usages/features (and we want to limit the features and make them easy to use).
I like the idea that anyone can write a runner or a transform. I wish that anyone could add interactivity (before, during, or after pipeline execution) once it’s a mature concept.
Maybe in the future, interactivity will be something similar to accessibility, internationalization, testability, etc. (for example, if you write a class, you need not only a test but also a notebook to demo it).
But we are not doing that right now, not until we have more feedback from the community and concrete CUJs.


> 
> I think watch() is a really good solution to knowing which collections to cache, and visualize() will be very useful. 
> 
> One thing I don't see tackled at all yet is the fact that pipelines are only ever mutated by appending on new operations, so some design needs to be done in terms of how to remove (possibly error-causing) operations or replace bad ones with fixed ones. This is where most of the unsolved problems lie. 
Thanks! And we haven’t really decided if plotting plain data is a good idea for PCollections. Plotting the metadata/analytics/insight might be a better option for users. I’m looking into facets <https://pair-code.github.io/facets/> that TFX notebook uses.

Yes, the existing PR <https://github.com/apache/beam/pull/9278> only appends operations.
We would prune when the appended operations sever the whole pipeline DAG into a set of DAGs, so that only the part of the pipeline that needs to be re-executed is executed (an optimization). This is included in the design.
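One way to picture the pruning is a walk from the requested node back through its inputs, stopping at anything already cached. The sketch below is only an assumption about how such pruning could work; nodes_to_execute and the example graph are hypothetical and not taken from the design document.

```python
def nodes_to_execute(parents, target, cached):
    """Return the nodes that must run to produce `target`.

    parents maps each node to the list of nodes it reads from.
    The walk stops at cached nodes, so their ancestors are skipped.
    """
    needed = []
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node in seen or node in cached:
            continue
        seen.add(node)
        needed.append(node)
        stack.extend(parents.get(node, []))
    return sorted(needed)  # sorted only to make the result deterministic


# Two disconnected sub-DAGs in one pipeline.
parents = {
    "read": [],
    "parse": ["read"],
    "count": ["parse"],
    "format": ["count"],
    "other_read": [],
    "other_sum": ["other_read"],
}
to_run = nodes_to_execute(parents, target="format", cached={"parse"})
```

Here neither the unrelated sub-DAG nor anything upstream of the cached "parse" node is scheduled.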

We could also add replacing logic that allows users to re-execute a cell containing a named PTransform by replacing that PTransform.
This needs your feedback; it is a bug fix for the existing interactive Beam rather than part of the design around user-defined PCollection variables.
This is actually another debatable topic. As I said, executing cells in a notebook is equivalent to appending the code to a script.
Users can re-execute cells that contain only anonymous PTransforms. It’s equivalent to applying the PTransform many times "in parallel".
However, users cannot re-execute cells containing any named PTransform. That is where the inconsistency arises.
Should re-execution of a named PTransform be supported? If so, should we replace the PTransform or append PTransforms "in parallel"?
I think we’ll eventually pick a route, go down it, and see what feedback interactive Beam users provide.
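The two options for re-executing a named PTransform can be compared with a small sketch. Everything here is hypothetical: rerun_cell, the "replace" and "append" policy names, and the "_2" suffix scheme are illustration only and not a decided behavior.

```python
def rerun_cell(applied, label, transform, policy):
    """Return the new list of (label, transform) pairs after a re-run.

    applied is the list of pairs already in the pipeline.
    """
    if policy == "replace":
        if any(existing == label for existing, _ in applied):
            # Swap the transform in place, keeping its position.
            return [(l, transform if l == label else t) for l, t in applied]
        return applied + [(label, transform)]
    if policy == "append":
        existing = {l for l, _ in applied}
        new_label, n = label, 1
        while new_label in existing:
            n += 1
            new_label = f"{label}_{n}"
        # Keep the old transform and add the new one "in parallel".
        return applied + [(new_label, transform)]
    raise ValueError(f"unknown policy: {policy}")


applied = [("Name", "v1")]
replaced = rerun_cell(applied, "Name", "v2", policy="replace")
appended = rerun_cell(applied, "Name", "v2", policy="append")
```

Replacing keeps the pipeline the same size but discards the old transform; appending keeps both and has to invent a new label, which users would then see in the pipeline graph.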

> 
> Also +David Yan <ma...@google.com>  for more opinions.
> 
> Thanks!
> 
> Ning.
> 
> On Tue, Aug 13, 2019 at 6:00 PM Ahmet Altay <altay@google.com <ma...@google.com>> wrote:
> Ning, I believe Robert's questions from his email have not been answered yet.
> 
> On Tue, Aug 13, 2019 at 5:00 PM Ning Kang <ningk@google.com <ma...@google.com>> wrote:
> Hi all, I'll leave another 3 days for design <https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing> review. Then we can have a vote session if there is no objection.
> 
> Thanks!
> 
> On Fri, Aug 9, 2019 at 12:14 PM Ning Kang <ningk@google.com <ma...@google.com>> wrote:
> Thanks Ahmet for the introduction!
> 
> I've composed a design overview <https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing> describing changes we are making to components around interactive runner. I'll share the document in our email thread too.
> 
> The truth is since interactive runner is not yet a recognized runner as part of the Beam SDK (and it's fundamentally a wrapper around direct runner), we are not touching any Beam SDK components.
> We'll not change any behavior of existing Beam SDK and we'll try our best to keep it that way in the future.
> 
> My main concern at this point is the introduction of new concepts, even though these are not changing other parts of the Beam SDKs. It would be good to see at least an alternative option covered in the design document. The reason is each additional concept adds to the mental load of users. And also concepts from interactive Beam will shift user's expectations of Beam even though there are not direct SDK modifications.
>  
> 
> In the meantime, I'll work on other components orthogonal to Beam such as Pipeline Display and Data Visualization I mentioned in the design overview.
> 
> If you have any questions, please feel free to contact me through this email address!
> 
> Thanks!
> 
> Regards,
> Ning.
> 
> On Wed, Aug 7, 2019 at 5:01 PM Ahmet Altay <altay@google.com <ma...@google.com>> wrote:
> Ning, thank you for the heads up.
> 
> All, this is a proposed work for improving interactive Beam experience. As mentioned in Ning's email, new concepts are being introduced. And in addition iBeam as a name is used as a new reference. I hope that bringing the discussion to the mailing list will give it the additional visibility and more people could share their feedback.
> 
> (cc'ing a few folks that might be interested +Robert Bradshaw <ma...@google.com> +Valentyn Tymofieiev <ma...@google.com> +Sindy Li <ma...@google.com> +Brian Hulette <ma...@google.com> )
> 
> Ahmet
> 
> 
> On Wed, Aug 7, 2019 at 12:36 PM Ning Kang <ningk@google.com <ma...@google.com>> wrote:
> To whom may concern,
> 
> This is Ning from Google. We are currently making efforts to leverage an interactive runner under python beam sdk.
> 
> There is already an interactive Beam (iBeam for short) runner with jupyter notebook in the repo <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive>.
> Following the instructions on that page, one can set up an interactive environment to develop and execute Beam pipeline interactively.
> 
> However, there are many issues with existing iBeam. One issue is that it uses a concept of leaf PCollection to cache and materialize intermediate PCollection. If the user wants to reuse/introspect a non-leaf PCollection, the interactive runner will run into errors.
> 
> Our initial effort will be fixing the existing issues. And we also want to make iBeam easy to use. Since iBeam uses the same model Beam uses, there isn't really any difference for users between creating a pipeline with interactive runner and other runners. 
> So we want to minimize the interfaces a user needs to learn while giving the user some capability to interact with the interactive environment.
> 
> See this initial PR <https://github.com/apache/beam/pull/9278>, the interactive_beam module will provide mainly 4 interfaces:
> For advanced users who define pipeline outside __main__, let them tell current interactive environment where they define their pipeline: watch()
> This is very useful for tests where pipeline can be defined in test methods.
> If the user simply creates pipeline in a Jupyter notebook or a plain Python script, they don't have to know/use this feature at all.
> Let users create an interactive pipeline: create_pipeline()
> invoking create_pipeline(), the user gets a Pipeline object that works as any other Pipeline object created from apache_beam.Pipeline()
> However, the pipeline object p, when invoking p.run(), does some extra interactive magic.
> We'll support interactive execution for DirectRunner at this moment.
> Let users run the interactive pipeline as a normal pipeline: run_pipeline()
> In an interactive environment, a user only needs to add and execute 1 line of code run_pipeline(pipeline) to execute any existing interactive pipeline object as normal pipeline in any selected platform.
> We'll probably support Dataflow only. Other implementations can be added though.
> Let users introspect any intermediate PCollection they have handler to: visualize()
> If a user ever writes pcoll = p | "Some Transform" >> some_transform() ..., they can visualize(pcoll) once the pipeline p is executed.
> p can be batch or streaming
> The visualization will be some plot graph of data for the given PCollection as if it's materialized. If the PCollection is unbounded, the graph is dynamic. 
> The PR will implement 1 and 2.
> 
> We'll use https://issues.apache.org/jira/browse/BEAM-7923 <https://issues.apache.org/jira/browse/BEAM-7923> as the top level JIRA and add blocking JIRAs as development goes.
> 
> External Beam users will not worry about any of the underlying implementation details.
> Except the 4 interfaces above, they learn and write normal Beam code and can execute the pipeline immediately when they are done with prototyping.
> 
> Ning.

Re: Brief of interactive Beam

Posted by Robert Bradshaw <ro...@google.com>.

On Wed, Aug 14, 2019 at 11:29 AM Ning Kang <ni...@google.com> wrote:

> Ahmet, thanks for forwarding!
>
>
>> My main concern at this point is the introduction of new concepts, even
>> though these are not changing other parts of the Beam SDKs. It would be
>> good to see at least an alternative option covered in the design document.
>> The reason is each additional concept adds to the mental load of users. And
>> also concepts from interactive Beam will shift user's expectations of Beam
>> even though there are not direct SDK modifications.
>
>
> Hi Robert. About the concern, I think I have a few points:
>
>    1. *Interactive Beam (or Interactive Runner) is already an existing
>    "new concept" that normal Beam user could opt-in if they want an
>    interactive Beam experience.* They need to do lots of setup steps and
>    learn new things such as Jupyter notebook and at least interactive_runner
>    module to make it work and make use of it.
>
> I think we should start with the perspective that most users interested in
using Beam interactively already know about Jupyter notebooks, or at least
ipython, and would want to use it to learn (and more effectively use) Beam.

>
>    1. *The behavior of existing interactive Beam is different from normal
>    Beam because of the interactive nature and the users would expect that.* And
>    the users wouldn't shift their expectation of normal Beam. Just like
>    running Python scripts might result in different behavior than running all
>    of them in an interactive Python session.
>
> I'm not quite following this. One of the strengths of Python is
the lack of difference between interactive and non-interactive
behavior. (The fact that a script's execution is always in top to bottom
order, unlike a notebook, is the primary difference.)

>
>    1. Or if a user runs a Beam pipeline with direct runner, they should
>    expect the behavior be different from running it on Dataflow while a user
>    needs GCP account. I think the users are aware of the difference when they
>    choose to use Interactive Beam.
>
>  The central, defining tenet of Beam is that behavior should be
consistent across different runners. Of course there are operational
details that are difficult, or perhaps even undesirable, to align (like, as
you mention, needing a GCP account for running on Dataflow, or providing
the location of the master when running Flink). But even these should be
minimized (see the recent efforts to make the temp location a standard
rather than dataflow-specific option).

We should, however, attempt to minimize gratuitous differences. In
particular, we should make it as easy as possible to transition (in terms
of code, docs, and developers) between interactive and non-interactive.

>
>    1. *Our design actually reduces the mental load of interactive Beam
>    users with intuitive interactive features*: create pipeline, visualize
>    intermediate PCollection, run pipeline at some point with other runners and
>    etc. For example, right now, the user needs to use a more complicated set
>    of libraries, like creating a Beam pipeline with interactive runner that
>    needs an underlying runner fed in.  We are getting rid of it. An
>    interactive Beam user shouldn't be concerned about the underlying
>    interactive magic.
>
> I agree a user shouldn't be concerned about the implementation details,
but I fail to see how

    p = interactive_module.create_pipeline()

is significantly simpler than, or preferable to,

   p = Pipeline(interactive_module.InteractiveRunner())

especially as the latter is in line with non-interactive pipelines (and all
our examples, docs, etc.) Now, perhaps an argument could be made that
interactivity is not a property of the runner, but something orthogonal to
that, e.g. one should write

    p = InteractivePipeline()

or

    p = Pipeline(options, interactive=True)

or similar. (It was introduced as a runner because, conceptually, a runner
is something that takes a pipeline and executes it.)

Similarly, why introduce a special run_pipeline(p) rather than p.run()?


>    1. The interactive experience should be tailored for different
>    underlying runners. There is no portability of interactivity and users
>    opt-in interactive Beam using notebook would naturally expect something
>    similar to the direct runner.
>
> This concerns me a lot. Again, the core tenet of Beam is that one can
choose the execution environment (runner) completely independently of how
one writes the pipeline. We should figure out what, if anything, needs to
be supported by a runner to support interactivity (the only thing that
comes to mind is a place to read and write temporary/cached data), but I
very strongly feel we should not go down the road of having different
interactive apis/experiences for different runners. In particular, the many
instances to the DirectRunner are worrisome--what's special about the
DirectRunner that other runners cannot provide that's needed for
interactive? If we can't come up with a good answer to that, we should not
impose this restriction.

>
>    1. *When users run pipeline built from interactive runner in a
>    non-interactive environment, it's direct runner like any other Beam
>    tutorial demonstrates*. It's even easier because the user doesn't need
>    to specify the runner nor pass in options.
>
>  So is the idea to have code like

    if is_ipython() or is_jupyter() or is ...:
      do_something()
    else:
      do_another_thing()

I'd really like to avoid this as it means one will (quite surprisingly, and
possibly for subtle reasons) not be able to copy code from a notebook elsewhere. Or
did you mean something else here?

>
>    1. *Interactive Beam is solving an orthogonal set of problems than
>    Beam*. You can think of it as a wrapper of Beam that enables
>    interactivity and it's not even a real runner. It doesn't change the Beam
>    model such as how you build a pipeline. And with the Beam portability, you
>    get the capability to run the pipeline built from interactive runner with
>    other runners for free. It adds the interactive behavior that a user
>    expects.
>    2. *We want to open source it though we can iterate faster without
>    doing it*. The whole project can be encapsulated in a completely
>    irrelevant repository and from a developer's perspective, I want to hide
>    all the implementation details from the interactive Beam user. However, as
>    there is more and more desire for interactive Beam (+Mehran Nazir
>    <mn...@google.com> for more details), we want to share the
>    implementation with others who want to contribute and explore the
>    interactive world.
>
> I would much rather see interactivity as part of the Beam project. With
good APIs the implementations don't have to be tightly coupled (e.g. the
underlying runner delegation) but I think it will be a better user
experience if interactive was a mode rather than a wrapper with different
entry points.


I think watch() is a really good solution to knowing which collections to
cache, and visualize() will be very useful.

One thing I don't see tackled at all yet is the fact that pipelines are
only ever mutated by appending on new operations, so some design needs to
be done in terms of how to remove (possibly error-causing) operations or
replace bad ones with fixed ones. This is where most of the unsolved
problems lie.

Also +David Yan <da...@google.com>  for more opinions.
>
> Thanks!
>
> Ning.
>
> On Tue, Aug 13, 2019 at 6:00 PM Ahmet Altay <al...@google.com> wrote:
>
>> Ning, I believe Robert's questions from his email have not been answered
>> yet.
>>
>> On Tue, Aug 13, 2019 at 5:00 PM Ning Kang <ni...@google.com> wrote:
>>
>>> Hi all, I'll leave another 3 days for design
>>> <https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing> review.
>>> Then we can have a vote session if there is no objection.
>>>
>>> Thanks!
>>>
>>> On Fri, Aug 9, 2019 at 12:14 PM Ning Kang <ni...@google.com> wrote:
>>>
>>>> Thanks Ahmet for the introduction!
>>>>
>>>> I've composed a design overview
>>>> <https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing>
>>>> describing changes we are making to components around interactive runner.
>>>> I'll share the document in our email thread too.
>>>>
>>>> The truth is since interactive runner is not yet a recognized runner as
>>>> part of the Beam SDK (and it's fundamentally a wrapper around direct
>>>> runner), we are not touching any Beam SDK components.
>>>> We'll not change any behavior of existing Beam SDK and we'll try our
>>>> best to keep it that way in the future.
>>>>
>>>
>> My main concern at this point is the introduction of new concepts, even
>> though these are not changing other parts of the Beam SDKs. It would be
>> good to see at least an alternative option covered in the design document.
>> The reason is each additional concept adds to the mental load of users. And
>> also concepts from interactive Beam will shift user's expectations of Beam
>> even though there are not direct SDK modifications.
>>
>>
>>>
>>>> In the meantime, I'll work on other components orthogonal to Beam such
>>>> as Pipeline Display and Data Visualization I mentioned in the design
>>>> overview.
>>>>
>>>> If you have any questions, please feel free to contact me through this
>>>> email address!
>>>>
>>>> Thanks!
>>>>
>>>> Regards,
>>>> Ning.
>>>>
>>>> On Wed, Aug 7, 2019 at 5:01 PM Ahmet Altay <al...@google.com> wrote:
>>>>
>>>>> Ning, thank you for the heads up.
>>>>>
>>>>> All, this is a proposed work for improving interactive Beam
>>>>> experience. As mentioned in Ning's email, new concepts are being
>>>>> introduced. And in addition iBeam as a name is used as a new reference. I
>>>>> hope that bringing the discussion to the mailing list will give it the
>>>>> additional visibility and more people could share their feedback.
>>>>>
>>>>> (cc'ing a few folks that might be interested +Robert Bradshaw
>>>>> <ro...@google.com> +Valentyn Tymofieiev <va...@google.com> +Sindy
>>>>> Li <qi...@google.com> +Brian Hulette <bh...@google.com> )
>>>>>
>>>>> Ahmet
>>>>>
>>>>>
>>>>> On Wed, Aug 7, 2019 at 12:36 PM Ning Kang <ni...@google.com> wrote:
>>>>>
>>>>>> To whom may concern,
>>>>>>
>>>>>> This is Ning from Google. We are currently making efforts to leverage
>>>>>> an interactive runner under python beam sdk.
>>>>>>
>>>>>> There is already an interactive Beam (iBeam for short) runner with
>>>>>> jupyter notebook in the repo
>>>>>> <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive>
>>>>>> .
>>>>>> Following the instructions on that page, one can set up an
>>>>>> interactive environment to develop and execute Beam pipeline interactively.
>>>>>>
>>>>>> However, there are many issues with existing iBeam. One issue is that
>>>>>> it uses a concept of leaf PCollection to cache and materialize intermediate
>>>>>> PCollection. If the user wants to reuse/introspect a non-leaf PCollection,
>>>>>> the interactive runner will run into errors.
>>>>>>
>>>>>> Our initial effort will be fixing the existing issues. And we also
>>>>>> want to make iBeam easy to use. Since iBeam uses the same model Beam uses,
>>>>>> there isn't really any difference for users between creating a pipeline
>>>>>> with interactive runner and other runners.
>>>>>> So we want to minimize the interfaces a user needs to learn while
>>>>>> giving the user some capability to interact with the interactive
>>>>>> environment.
>>>>>>
>>>>>> See this initial PR <https://github.com/apache/beam/pull/9278>, the
>>>>>> interactive_beam module will provide mainly 4 interfaces:
>>>>>>
>>>>>>    - For advanced users who define pipeline outside __main__, let
>>>>>>    them tell current interactive environment where they define their pipeline:
>>>>>>    watch()
>>>>>>       - This is very useful for tests where pipeline can be defined
>>>>>>       in test methods.
>>>>>>       - If the user simply creates pipeline in a Jupyter notebook or
>>>>>>       a plain Python script, they don't have to know/use this feature at all.
>>>>>>    - Let users create an interactive pipeline: create_pipeline()
>>>>>>       - invoking create_pipeline(), the user gets a Pipeline object
>>>>>>       that works as any other Pipeline object created from apache_beam.Pipeline()
>>>>>>       - However, the pipeline object p, when invoking p.run(), does
>>>>>>       some extra interactive magic.
>>>>>>       - We'll support interactive execution for DirectRunner at this
>>>>>>       moment.
>>>>>>    - Let users run the interactive pipeline as a normal pipeline:
>>>>>>    run_pipeline()
>>>>>>       - In an interactive environment, a user only needs to add and
>>>>>>       execute 1 line of code run_pipeline(pipeline) to execute any existing
>>>>>>       interactive pipeline object as normal pipeline in any selected platform.
>>>>>>       - We'll probably support Dataflow only. Other implementations
>>>>>>       can be added though.
>>>>>>    - Let users introspect any intermediate PCollection they have
>>>>>>    handler to: visualize()
>>>>>>       - If a user ever writes pcoll = p | "Some Transform" >>
>>>>>>       some_transform() ..., they can visualize(pcoll) once the pipeline p is
>>>>>>       executed.
>>>>>>       - p can be batch or streaming
>>>>>>       - The visualization will be some plot graph of data for the
>>>>>>       given PCollection as if it's materialized. If the PCollection is unbounded,
>>>>>>       the graph is dynamic.
>>>>>>
>>>>>> The PR will implement 1 and 2.
>>>>>>
>>>>>> We'll use https://issues.apache.org/jira/browse/BEAM-7923 as the top
>>>>>> level JIRA and add blocking JIRAs as development goes.
>>>>>>
>>>>>> External Beam users will not worry about any of the underlying
>>>>>> implementation details.
>>>>>> Except the 4 interfaces above, they learn and write normal Beam code
>>>>>> and can execute the pipeline immediately when they are done with
>>>>>> prototyping.
>>>>>>
>>>>>> Ning.
>>>>>>
>>>>>

Re: Brief of interactive Beam

Posted by Ning Kang <ni...@google.com>.

Ahmet, thanks for forwarding!

> My main concern at this point is the introduction of new concepts, even
> though these are not changing other parts of the Beam SDKs. It would be
> good to see at least one alternative option covered in the design document.
> The reason is that each additional concept adds to the mental load of
> users. Also, concepts from interactive Beam will shift users' expectations
> of Beam even though there are no direct SDK modifications.

Hi Robert. Regarding this concern, I have a few points:

   1. *Interactive Beam (or Interactive Runner) is already an existing "new
   concept" that normal Beam users can opt in to if they want an interactive
   Beam experience.* They need to go through many setup steps and learn new
   things, such as Jupyter notebooks and at least the interactive_runner
   module, to make it work and make use of it.
   2. *The behavior of existing interactive Beam differs from normal Beam
   because of its interactive nature, and users would expect that.* It
   wouldn't shift their expectations of normal Beam, just as running Python
   scripts might behave differently from running the same code in an
   interactive Python session. Likewise, a user who runs a Beam pipeline
   with the direct runner should expect different behavior from running it
   on Dataflow, which also requires a GCP account. I think users are aware
   of the difference when they choose to use Interactive Beam.
   3. *Our design actually reduces the mental load of interactive Beam
   users with intuitive interactive features*: create a pipeline, visualize
   an intermediate PCollection, run the pipeline at some point with other
   runners, etc. For example, right now the user needs a more complicated
   set of libraries, such as creating a Beam pipeline with an interactive
   runner that needs an underlying runner fed in. We are getting rid of that.
   An interactive Beam user shouldn't be concerned about the underlying
   interactive magic. The interactive experience should be tailored to
   different underlying runners. Interactivity is not portable, and users
   who opt in to interactive Beam through a notebook would naturally expect
   something similar to the direct runner.
   4. *When users run a pipeline built with the interactive runner in a
   non-interactive environment, it runs on the direct runner, as any other
   Beam tutorial demonstrates*. It's even easier because the user doesn't
   need to specify the runner or pass in options.
   5. *Interactive Beam solves a set of problems orthogonal to Beam's*.
   You can think of it as a wrapper around Beam that enables interactivity;
   it's not even a real runner. It doesn't change the Beam model, such as
   how you build a pipeline. And with Beam portability, you get the
   capability to run a pipeline built with the interactive runner on other
   runners for free. It adds the interactive behavior that a user expects.
   6. *We want to open source it even though we could iterate faster without
   doing so*. The whole project could be encapsulated in a completely
   unrelated repository, and from a developer's perspective I want to hide
   all the implementation details from the interactive Beam user. However,
   as there is more and more desire for interactive Beam (+Mehran Nazir
   <mn...@google.com> for more details), we want to share the
   implementation with others who want to contribute and explore the
   interactive world.
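
To make points 3 and 5 concrete, here is a toy sketch of the "wrapper
around a runner" idea in plain Python. It is not Beam code and not the
implementation in the PR: ToyDirectRunner, ToyInteractiveWrapper and the
list-of-steps "pipeline" are hypothetical names made up for illustration.
It only assumes that the wrapper picks a default underlying runner for the
user and keeps every intermediate result so it can be inspected later.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# A toy "pipeline" is an ordered list of (label, function) steps.
Step = Tuple[str, Callable[[List[int]], List[int]]]
Observer = Callable[[str, List[int]], None]


class ToyDirectRunner:
    """Hypothetical stand-in for an underlying runner."""

    def run(self, source: Iterable[int], steps: List[Step],
            observer: Optional[Observer] = None) -> List[int]:
        data = list(source)
        for label, fn in steps:
            data = fn(data)
            if observer is not None:
                observer(label, data)  # report each intermediate result
        return data


class ToyInteractiveWrapper:
    """Hypothetical wrapper: delegates execution and caches each step."""

    def __init__(self, underlying: Optional[ToyDirectRunner] = None) -> None:
        # The user does not have to feed in a runner; a default is chosen.
        self._underlying = underlying or ToyDirectRunner()
        self.cache: Dict[str, List[int]] = {}

    def run(self, source: Iterable[int], steps: List[Step]) -> List[int]:
        return self._underlying.run(source, steps, observer=self._remember)

    def _remember(self, label: str, data: List[int]) -> None:
        # Store a copy so later steps cannot mutate the cached result.
        self.cache[label] = list(data)


steps: List[Step] = [
    ("Square", lambda xs: [x * x for x in xs]),
    ("KeepEven", lambda xs: [x for x in xs if x % 2 == 0]),
]
wrapper = ToyInteractiveWrapper()
result = wrapper.run(range(5), steps)
print(result)                   # final output of the last step
print(wrapper.cache["Square"])  # an intermediate (non-leaf) result
```

The wrapper adds behavior around execution without changing how the steps
are defined, which is the sense in which it is "not even a real runner".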

Also +David Yan <da...@google.com>  for more opinions.

Thanks!

Ning.


Re: Brief of interactive Beam

Posted by Ahmet Altay <al...@google.com>.

Ning, I believe Robert's questions from his email have not been answered yet.

On Tue, Aug 13, 2019 at 5:00 PM Ning Kang <ni...@google.com> wrote:

> Hi all, I'll leave another 3 days for design
> <https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing> review.
> Then we can have a vote session if there is no objection.
>
> Thanks!
>
> On Fri, Aug 9, 2019 at 12:14 PM Ning Kang <ni...@google.com> wrote:
>
>> Thanks Ahmet for the introduction!
>>
>> I've composed a design overview
>> <https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing>
>> describing changes we are making to components around interactive runner.
>> I'll share the document in our email thread too.
>>
>> The truth is since interactive runner is not yet a recognized runner as
>> part of the Beam SDK (and it's fundamentally a wrapper around direct
>> runner), we are not touching any Beam SDK components.
>> We'll not change any behavior of existing Beam SDK and we'll try our best
>> to keep it that way in the future.
>>
>
My main concern at this point is the introduction of new concepts, even
though these are not changing other parts of the Beam SDKs. It would be
good to see at least one alternative option covered in the design document.
The reason is that each additional concept adds to the mental load of
users. Also, concepts from interactive Beam will shift users' expectations
of Beam even though there are no direct SDK modifications.



Re: Brief of interactive Beam

Posted by Ning Kang <ni...@google.com>.

Hi all, I'll leave another 3 days for design
<https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing>
review.
Then we can have a vote session if there is no objection.

Thanks!

On Fri, Aug 9, 2019 at 12:14 PM Ning Kang <ni...@google.com> wrote:

> Thanks Ahmet for the introduction!
>
> I've composed a design overview
> <https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing>
> describing changes we are making to components around interactive runner.
> I'll share the document in our email thread too.
>
> The truth is since interactive runner is not yet a recognized runner as
> part of the Beam SDK (and it's fundamentally a wrapper around direct
> runner), we are not touching any Beam SDK components.
> We'll not change any behavior of existing Beam SDK and we'll try our best
> to keep it that way in the future.
>
> In the meantime, I'll work on other components orthogonal to Beam such as
> Pipeline Display and Data Visualization I mentioned in the design overview.
>
> If you have any questions, please feel free to contact me through this
> email address!
>
> Thanks!
>
> Regards,
> Ning.
>

Re: Brief of interactive Beam

Posted by Ahmet Altay <al...@google.com>.

Ning, thank you for the heads up.

All, this is proposed work to improve the interactive Beam experience. As
mentioned in Ning's email, new concepts are being introduced. In addition,
iBeam is being used as a new name. I hope that bringing the discussion to
the mailing list will give it additional visibility and that more people
can share their feedback.

(cc'ing a few folks that might be interested +Robert Bradshaw
<ro...@google.com> +Valentyn Tymofieiev <va...@google.com> +Sindy Li
<qi...@google.com> +Brian Hulette <bh...@google.com> )

Ahmet


On Wed, Aug 7, 2019 at 12:36 PM Ning Kang <ni...@google.com> wrote:

> To whom may concern,
>
> This is Ning from Google. We are currently making efforts to leverage an
> interactive runner under python beam sdk.
>
> There is already an interactive Beam (iBeam for short) runner with jupyter
> notebook in the repo
> <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive>
> .
> Following the instructions on that page, one can set up an interactive
> environment to develop and execute Beam pipeline interactively.
>
> However, there are many issues with existing iBeam. One issue is that it
> uses a concept of leaf PCollection to cache and materialize intermediate
> PCollection. If the user wants to reuse/introspect a non-leaf PCollection,
> the interactive runner will run into errors.
>
> Our initial effort will be fixing the existing issues. And we also want to
> make iBeam easy to use. Since iBeam uses the same model Beam uses, there
> isn't really any difference for users between creating a pipeline with
> interactive runner and other runners.
> So we want to minimize the interfaces a user needs to learn while giving
> the user some capability to interact with the interactive environment.
>
> See this initial PR <https://github.com/apache/beam/pull/9278>, the
> interactive_beam module will provide mainly 4 interfaces:
>
>    - For advanced users who define pipeline outside __main__, let them
>    tell current interactive environment where they define their pipeline:
>    watch()
>       - This is very useful for tests where pipeline can be defined in
>       test methods.
>       - If the user simply creates pipeline in a Jupyter notebook or a
>       plain Python script, they don't have to know/use this feature at all.
>    - Let users create an interactive pipeline: create_pipeline()
>       - invoking create_pipeline(), the user gets a Pipeline object that
>       works as any other Pipeline object created from apache_beam.Pipeline()
>       - However, the pipeline object p, when invoking p.run(), does some
>       extra interactive magic.
>       - We'll support interactive execution for DirectRunner at this
>       moment.
>    - Let users run the interactive pipeline as a normal pipeline:
>    run_pipeline()
>       - In an interactive environment, a user only needs to add and
>       execute 1 line of code run_pipeline(pipeline) to execute any existing
>       interactive pipeline object as normal pipeline in any selected platform.
>       - We'll probably support Dataflow only. Other implementations can
>       be added though.
>    - Let users introspect any intermediate PCollection they have handler
>    to: visualize()
>       - If a user ever writes pcoll = p | "Some Transform" >>
>       some_transform() ..., they can visualize(pcoll) once the pipeline p is
>       executed.
>       - p can be batch or streaming
>       - The visualization will be some plot graph of data for the given
>       PCollection as if it's materialized. If the PCollection is unbounded, the
>       graph is dynamic.
>
> The PR will implement 1 and 2.
>
> We'll use https://issues.apache.org/jira/browse/BEAM-7923 as the top
> level JIRA and add blocking JIRAs as development goes.
>
> External Beam users will not worry about any of the underlying
> implementation details.
> Except the 4 interfaces above, they learn and write normal Beam code and
> can execute the pipeline immediately when they are done with prototyping.
>
> Ning.
>
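
As a reading aid for the quoted proposal above: one way to picture watch()
is as a registry of namespaces that the interactive environment can later
scan for pipeline objects defined outside __main__. The sketch below is a
toy model under that assumption. ToyPipeline, ToyEnvironment and
find_pipelines are hypothetical names, and nothing here is taken from the
PR or from the Beam SDK.

```python
from typing import Any, Dict, List


class ToyPipeline:
    """Hypothetical stand-in for a Beam Pipeline object."""

    def __init__(self, name: str) -> None:
        self.name = name


class ToyEnvironment:
    """Toy interactive environment that remembers watched namespaces."""

    def __init__(self) -> None:
        self._watched: List[Dict[str, Any]] = []

    def watch(self, namespace: Dict[str, Any]) -> None:
        # Keep a snapshot of the caller's variables, e.g. locals() taken
        # inside a function or test method.
        self._watched.append(dict(namespace))

    def find_pipelines(self) -> Dict[str, ToyPipeline]:
        # Map variable names to the pipeline objects found in any watched
        # namespace.
        found: Dict[str, ToyPipeline] = {}
        for namespace in self._watched:
            for var_name, value in namespace.items():
                if isinstance(value, ToyPipeline):
                    found[var_name] = value
        return found


env = ToyEnvironment()


def build_pipeline_in_function() -> None:
    # A pipeline defined outside __main__, as in the test-method case.
    p = ToyPipeline("defined inside a function")
    env.watch(locals())


build_pipeline_in_function()
pipelines = env.find_pipelines()
print(sorted(pipelines))  # variable names of the discovered pipelines
```

A user who defines pipelines at the top level of a notebook or script would
never call watch() in this model, which matches the note in the proposal
that most users do not need to know about it.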