Posted to dev@beam.apache.org by Ning Kang <ni...@google.com> on 2021/09/22 18:04:16 UTC

[VOTE] naming/usage when running Beam SQL on different runners in notebooks

Hi,

*TL;DR*: If you are not a notebook/IPython user or a Beam SQL
user/developer, you may ignore this email. If you are interested in the
topic, please read on.

Recently, we merged an example
<https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/examples/Run%20Beam%20SQL%20with%20beam_sql%20magic.ipynb>
notebook utilizing a new `beam_sql` IPython magic
<https://ipython.readthedocs.io/en/stable/config/custommagics.html> (if
you're not familiar with magics, here are some built-in examples
<https://ipython.readthedocs.io/en/stable/interactive/magics.html>) to run
Beam SQL <https://beam.apache.org/documentation/dsls/sql/overview/> in
notebooks with the InteractiveRunner (DirectRunner). For simplicity, we
didn't expose the pipeline option to switch dialects, so only Calcite SQL
is supported for now.
Basically, its usage in notebooks is:
[image: help.png]
And an example query joining 2 PCollections:
[image: join.png]
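
Since the screenshots don't render in this plain text archive, here is a
rough sketch of the flow they show. The schemas and data below are made up
for illustration, and the exact flags may differ; run the magic with
--help in a notebook to see them:

    # Cell 1: define schema-aware PCollections as notebook variables.
    import typing

    import apache_beam as beam
    from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

    class Person(typing.NamedTuple):
        id: int
        name: str

    class Order(typing.NamedTuple):
        person_id: int
        amount: int

    p = beam.Pipeline(InteractiveRunner())
    persons = p | 'persons' >> beam.Create(
        [Person(1, 'ann'), Person(2, 'bob')]).with_output_types(Person)
    orders = p | 'orders' >> beam.Create(
        [Order(1, 10), Order(1, 5)]).with_output_types(Order)

    # Cell 2: the magic looks PCollections up by their variable names and
    # binds the query result to the name given by -o.
    # %%beam_sql -o joined
    # SELECT persons.name, orders.amount
    # FROM persons JOIN orders ON persons.id = orders.person_id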

By default, it currently runs locally in the notebook instance on the
InteractiveRunner (DirectRunner). We want to allow users to run a pipeline
built from it on other runners too.
Here are a few proposals:

   1. Add an option to the beam_sql magic: -r, --runner, to specify the
   runner name (case/space insensitive), such as DataflowRunner or
   FlinkRunner (a sketch of (1) and (3) follows this list).
   2. Add a separate magic for each runner: dataflow_sql, flink_sql, etc.
   3. Create other runner instances to run the
   pipeline: SomeRunner(options).run_pipeline(bridge_function(sql_output_pcoll.pipeline))
   4. Any other suggestions?
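
To make the options concrete, here is a hypothetical sketch of (1) and (3).
None of this exists yet; the -r flag, bridge_function, and the joined
variable are placeholders for illustration only:

    # Option (1): a new flag on the magic (hypothetical syntax).
    # %%beam_sql -o joined -r DataflowRunner
    # SELECT ...

    # Option (3): construct another runner and hand it the pipeline that
    # the magic built. bridge_function stands in for whatever would
    # translate the interactively built pipeline into a runnable one.
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.runners.dataflow.dataflow_runner import DataflowRunner

    options = PipelineOptions()  # plus --project, --region, etc. for Dataflow
    DataflowRunner().run_pipeline(bridge_function(joined.pipeline), options)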

I personally prefer the 1st option at this moment. Please share your
thoughts, thanks!

Ning.

Re: [VOTE] naming/usage when running Beam SQL on different runners in notebooks

Posted by Ning Kang <ni...@google.com>.
To Alexey,

> Are there any plans to add support for the Spark runner?

We don't have plans for that at the moment. But if the effort to support
it turns out to be trivial (Spark could be supported natively
<https://beam.apache.org/documentation/runners/spark/> through the
PortableRunner), we will add an example for it too.
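
If it did turn out to be trivial, per the Spark runner page linked above it
would roughly take the shape below; the job endpoint and environment type
are assumptions that depend on how the Spark job server is set up:

    from apache_beam.options.pipeline_options import PipelineOptions

    # Point the PortableRunner at a running Spark job server.
    options = PipelineOptions([
        '--runner=PortableRunner',
        '--job_endpoint=localhost:8099',
        '--environment_type=LOOPBACK',
    ])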

Re: [VOTE] naming/usage when running Beam SQL on different runners in notebooks

Posted by Alexey Romanenko <ar...@gmail.com>.
+1 for option (1)

Are there any plans to add support for the Spark runner?

Re: [VOTE] naming/usage when running Beam SQL on different runners in notebooks

Posted by Ning Kang <ni...@google.com>.
To Kyle,

Yes, we have examples
<https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive/examples>
of running notebooks with the DirectRunner and FlinkRunner locally and
with the DataflowRunner remotely.
Theoretically, as long as it has access to the PCollection cache, any
runner can be used as the underlying runner of the InteractiveRunner,
since the InteractiveRunner is only a thin wrapper.
We plan to add more support in the future, such as running notebooks
remotely on FlinkRunner-on-Dataproc.

To Brian,

> Does interactive Beam have a way to set the runner once and have that
> apply to future invocations?

Yes, you can set the underlying runner of the InteractiveRunner when you
create the pipeline object.
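
For example, a minimal sketch (FlinkRunner stands in for whichever
underlying runner you choose):

    import apache_beam as beam
    from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
    from apache_beam.runners.portability.flink_runner import FlinkRunner

    # Every execution the notebook triggers for this pipeline then goes
    # through the wrapped FlinkRunner.
    p = beam.Pipeline(InteractiveRunner(underlying_runner=FlinkRunner()))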

You can also build a pipeline with one runner and run it with another, as
long as the pipeline is portable. For example, if option (1) is
implemented, the "-r, --runner" option lets you run the pipeline you have
built on a selected runner. If you have chained a few beam_sql magics,
ideally Interactive Beam remembers them all and can run them as a single
pipeline on a different runner.
Usually you prototype your SQL queries locally against a known dataset and
inspect the results quickly, then switch to a production dataset and run
the pipeline remotely on your production cluster/service.

Thanks for the vote!

Re: [VOTE] naming/usage when running Beam SQL on different runners in notebooks

Posted by Brian Hulette <bh...@google.com>.
+1 for option (1).

Does interactive Beam have a way to set the runner once and have that apply
to future invocations?

I don't like (2), since it could lead to confusion with separate SQL
solutions offered by runners (e.g. Dataflow SQL [1], Spark SQL [2], Flink
SQL [3]).

[1] https://cloud.google.com/dataflow/docs/guides/sql/dataflow-sql-intro
[2] https://spark.apache.org/sql/
[3]
https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/table/sql/overview/

Re: [VOTE] naming/usage when running Beam SQL on different runners in notebooks

Posted by Kyle Weaver <kc...@google.com>.
Hi Ning, for context, does Interactive Beam support other runners already,
or are there any plans to add this support in the future?
