You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Rahul Iyer <ra...@twitter.com> on 2022/02/24 18:10:26 UTC

Support for beam-plugins in Python Runner v2

Good Morning/Afternoon/Evening folks,

The current support for beam-plugins is experimental and we would like to
have it as a first class member of the beam library for Python Runner v2.
This helps us load plugins into the runtime before starting the SdkHarness.
https://github.com/apache/beam/pull/16920 is a PR I created for this
purpose. Wanted to gather some thoughts around the approach here and have
it standardized. The current implementation of beam plugins allows users to
extend a class from BeamPlugin and it gets automatically populated in the
--beam_plugin PipelineOption, e.g.: FileSystem
<https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystem.py#L475>.
This creates the pipeline option as,

--beam_plugin=[

  'apache_beam.io.aws.s3filesystem.S3FileSystem',

  'apache_beam.io.filesystem.FileSystem',

  'apache_beam.io.hadoopfilesystem.HadoopFileSystem',

  'apache_beam.io.localfilesystem.LocalFileSystem',

  'apache_beam.io.gcp.gcsfilesystem.GCSFileSystem',

  'apache_beam.io.azure.blobstoragefilesystem.BlobStorageFileSystem'

]

Another way is to provide a module via the --beam_plugin PipelineOption,
e.g.:

--beam_plugin='twitter.beam.rule_the_world'

The current implementation in the PR supports both these approaches but
would love to have a standardized way forward and have it documented. Would
love to hear your thoughts about this.

Thanks & Regards,
Rahul Iyer

Re: Support for beam-plugins in Python Runner v2

Posted by Robert Bradshaw <ro...@google.com>.
Awesome, thanks!

On Wed, Mar 2, 2022 at 6:45 PM Rahul Iyer <ra...@twitter.com> wrote:

> I have created a new PR (https://github.com/apache/beam/pull/16998)
> because the previous one closed for some reason and I cannot re-open it
> (apologies about that). Our use-case for this is to import some Twitter
> specific plugins on the worker. Please let me know if there is anything
> else needed on this.
>
> Thanks & Regards,
> Rahul Iyer
>
>
> On Tue, Mar 1, 2022 at 4:25 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> On Tue, Mar 1, 2022 at 12:36 PM Valentyn Tymofieiev <va...@google.com>
>> wrote:
>>
>>> +1 as well.
>>>
>>> One Beam user mentioned a while ago they needed to run smth like `import
>>> tensorflow_addons; tensorflow_addons.register.register_all()` as a very
>>> first operation once the Python process starts. Looks like importing a
>>> plugin as a module would allow exactly that.
>>>
>>> Does anyone know why FileSystems use the Plugin interface? Does this
>>> mean that some aspects of the FS functionality didn't work for portable
>>> pipelines all this time but we didn't notice it?
>>>
>>
>> It would be broken if someone tried to use a FileSystem that was not
>> imported by default.
>>
>>
>>>
>>>
>>>
>>> On Tue, Mar 1, 2022 at 10:16 AM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> +1 to this change which looks like an omission in runner v2.
>>>>
>>>> The intent of beam plugins is to enforce that modules imported during
>>>> pipeline construction will also get imported on the workers (e.g. if
>>>> certain setup or registration needs to be performed). (Python doesn't have
>>>> service registries like Java.) The other piece of this is populating the
>>>> pipeline option with BeamPlugin.get_all_plugin_paths() which looks like
>>>> it's only done on the Dataflow runner:
>>>> https://github.com/apache/beam/blob/release-2.36.0/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py#L511
>>>>
>>>> I have no idea why SdkContainerImageBuilder extends beam.Plugin
>>>> though...
>>>>
>>>>
>>>> On Tue, Mar 1, 2022 at 8:51 AM Ahmet Altay <al...@google.com> wrote:
>>>>
>>>>> Hi Rahul,
>>>>>
>>>>> Your change looks reasonable to me. And it brings dataflow python
>>>>> runners v1 and v2 closer.
>>>>>
>>>>> Related to removing the experimental, I have some questions: What is
>>>>> the use case for beam plugins? I do not see them being used often. IIRC the
>>>>> main use was importing modules before starting the worker. There is also a
>>>>> beam plugin clearly intended for runner v2 (custom containers) [1], how
>>>>> does that work on V2?
>>>>>
>>>>> Ahmet
>>>>>
>>>>> /cc @Yichi Zhang <zy...@google.com> @Valentyn Tymofieiev
>>>>> <va...@google.com> @Ryan Thompson <ry...@google.com>
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/beam/blob/1636a3ace2c039a433cf45c51e4cee1151054730/sdks/python/apache_beam/runners/portability/sdk_container_builder.py#L68
>>>>>
>>>>> On Thu, Feb 24, 2022 at 10:14 AM Rahul Iyer <ra...@twitter.com>
>>>>> wrote:
>>>>>
>>>>>> Good Morning/Afternoon/Evening folks,
>>>>>>
>>>>>> The current support for beam-plugins is experimental and we would
>>>>>> like to have it as a first class member of the beam library for Python
>>>>>> Runner v2. This helps us load plugins into the runtime before starting the
>>>>>> SdkHarness. https://github.com/apache/beam/pull/16920 is a PR I
>>>>>> created for this purpose. Wanted to gather some thoughts around the
>>>>>> approach here and have it standardized. The current implementation of beam
>>>>>> plugins allows users to extend a class from BeamPlugin and it gets
>>>>>> automatically populated in the --beam_plugin PipelineOption, e.g.:
>>>>>> FileSystem
>>>>>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystem.py#L475>.
>>>>>> This creates the pipeline option as,
>>>>>>
>>>>>> --beam_plugin=[
>>>>>>
>>>>>>   'apache_beam.io.aws.s3filesystem.S3FileSystem',
>>>>>>
>>>>>>   'apache_beam.io.filesystem.FileSystem',
>>>>>>
>>>>>>   'apache_beam.io.hadoopfilesystem.HadoopFileSystem',
>>>>>>
>>>>>>   'apache_beam.io.localfilesystem.LocalFileSystem',
>>>>>>
>>>>>>   'apache_beam.io.gcp.gcsfilesystem.GCSFileSystem',
>>>>>>
>>>>>>   'apache_beam.io.azure.blobstoragefilesystem.BlobStorageFileSystem'
>>>>>>
>>>>>> ]
>>>>>>
>>>>>> Another way is to provide a module via the --beam_plugin
>>>>>> PipelineOption, e.g.:
>>>>>>
>>>>>> --beam_plugin='twitter.beam.rule_the_world'
>>>>>>
>>>>>> The current implementation in the PR supports both these approaches
>>>>>> but would love to have a standardized way forward and have it documented.
>>>>>> Would love to hear your thoughts about this.
>>>>>>
>>>>>> Thanks & Regards,
>>>>>> Rahul Iyer
>>>>>>
>>>>>

Re: Support for beam-plugins in Python Runner v2

Posted by Rahul Iyer <ra...@twitter.com>.
I have created a new PR (https://github.com/apache/beam/pull/16998) because
the previous one closed for some reason and I cannot re-open it (apologies
about that). Our use-case for this is to import some Twitter specific
plugins on the worker. Please let me know if there is anything else needed
on this.

Thanks & Regards,
Rahul Iyer


On Tue, Mar 1, 2022 at 4:25 PM Robert Bradshaw <ro...@google.com> wrote:

> On Tue, Mar 1, 2022 at 12:36 PM Valentyn Tymofieiev <va...@google.com>
> wrote:
>
>> +1 as well.
>>
>> One Beam user mentioned a while ago they needed to run smth like `import
>> tensorflow_addons; tensorflow_addons.register.register_all()` as a very
>> first operation once the Python process starts. Looks like importing a
>> plugin as a module would allow exactly that.
>>
>> Does anyone know why FileSystems use the Plugin interface? Does this mean
>> that some aspects of the FS functionality didn't work for portable
>> pipelines all this time but we didn't notice it?
>>
>
> It would be broken if someone tried to use a FileSystem that was not
> imported by default.
>
>
>>
>>
>>
>> On Tue, Mar 1, 2022 at 10:16 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> +1 to this change which looks like an omission in runner v2.
>>>
>>> The intent of beam plugins is to enforce that modules imported during
>>> pipeline construction will also get imported on the workers (e.g. if
>>> certain setup or registration needs to be performed). (Python doesn't have
>>> service registries like Java.) The other piece of this is populating the
>>> pipeline option with BeamPlugin.get_all_plugin_paths() which looks like
>>> it's only done on the Dataflow runner:
>>> https://github.com/apache/beam/blob/release-2.36.0/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py#L511
>>>
>>> I have no idea why SdkContainerImageBuilder extends beam.Plugin though...
>>>
>>>
>>> On Tue, Mar 1, 2022 at 8:51 AM Ahmet Altay <al...@google.com> wrote:
>>>
>>>> Hi Rahul,
>>>>
>>>> Your change looks reasonable to me. And it brings dataflow python
>>>> runners v1 and v2 closer.
>>>>
>>>> Related to removing the experimental, I have some questions: What is
>>>> the use case for beam plugins? I do not see them being used often. IIRC the
>>>> main use was importing modules before starting the worker. There is also a
>>>> beam plugin clearly intended for runner v2 (custom containers) [1], how
>>>> does that work on V2?
>>>>
>>>> Ahmet
>>>>
>>>> /cc @Yichi Zhang <zy...@google.com> @Valentyn Tymofieiev
>>>> <va...@google.com> @Ryan Thompson <ry...@google.com>
>>>>
>>>> [1]
>>>> https://github.com/apache/beam/blob/1636a3ace2c039a433cf45c51e4cee1151054730/sdks/python/apache_beam/runners/portability/sdk_container_builder.py#L68
>>>>
>>>> On Thu, Feb 24, 2022 at 10:14 AM Rahul Iyer <ra...@twitter.com> wrote:
>>>>
>>>>> Good Morning/Afternoon/Evening folks,
>>>>>
>>>>> The current support for beam-plugins is experimental and we would like
>>>>> to have it as a first class member of the beam library for Python Runner
>>>>> v2. This helps us load plugins into the runtime before starting the
>>>>> SdkHarness. https://github.com/apache/beam/pull/16920 is a PR I
>>>>> created for this purpose. Wanted to gather some thoughts around the
>>>>> approach here and have it standardized. The current implementation of beam
>>>>> plugins allows users to extend a class from BeamPlugin and it gets
>>>>> automatically populated in the --beam_plugin PipelineOption, e.g.:
>>>>> FileSystem
>>>>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystem.py#L475>.
>>>>> This creates the pipeline option as,
>>>>>
>>>>> --beam_plugin=[
>>>>>
>>>>>   'apache_beam.io.aws.s3filesystem.S3FileSystem',
>>>>>
>>>>>   'apache_beam.io.filesystem.FileSystem',
>>>>>
>>>>>   'apache_beam.io.hadoopfilesystem.HadoopFileSystem',
>>>>>
>>>>>   'apache_beam.io.localfilesystem.LocalFileSystem',
>>>>>
>>>>>   'apache_beam.io.gcp.gcsfilesystem.GCSFileSystem',
>>>>>
>>>>>   'apache_beam.io.azure.blobstoragefilesystem.BlobStorageFileSystem'
>>>>>
>>>>> ]
>>>>>
>>>>> Another way is to provide a module via the --beam_plugin
>>>>> PipelineOption, e.g.:
>>>>>
>>>>> --beam_plugin='twitter.beam.rule_the_world'
>>>>>
>>>>> The current implementation in the PR supports both these approaches
>>>>> but would love to have a standardized way forward and have it documented.
>>>>> Would love to hear your thoughts about this.
>>>>>
>>>>> Thanks & Regards,
>>>>> Rahul Iyer
>>>>>
>>>>

Re: Support for beam-plugins in Python Runner v2

Posted by Robert Bradshaw <ro...@google.com>.
On Tue, Mar 1, 2022 at 12:36 PM Valentyn Tymofieiev <va...@google.com>
wrote:

> +1 as well.
>
> One Beam user mentioned a while ago they needed to run smth like `import
> tensorflow_addons; tensorflow_addons.register.register_all()` as a very
> first operation once the Python process starts. Looks like importing a
> plugin as a module would allow exactly that.
>
> Does anyone know why FileSystems use the Plugin interface? Does this mean
> that some aspects of the FS functionality didn't work for portable
> pipelines all this time but we didn't notice it?
>

It would be broken if someone tried to use a FileSystem that was not
imported by default.


>
>
>
> On Tue, Mar 1, 2022 at 10:16 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> +1 to this change which looks like an omission in runner v2.
>>
>> The intent of beam plugins is to enforce that modules imported during
>> pipeline construction will also get imported on the workers (e.g. if
>> certain setup or registration needs to be performed). (Python doesn't have
>> service registries like Java.) The other piece of this is populating the
>> pipeline option with BeamPlugin.get_all_plugin_paths() which looks like
>> it's only done on the Dataflow runner:
>> https://github.com/apache/beam/blob/release-2.36.0/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py#L511
>>
>> I have no idea why SdkContainerImageBuilder extends beam.Plugin though...
>>
>>
>> On Tue, Mar 1, 2022 at 8:51 AM Ahmet Altay <al...@google.com> wrote:
>>
>>> Hi Rahul,
>>>
>>> Your change looks reasonable to me. And it brings dataflow python
>>> runners v1 and v2 closer.
>>>
>>> Related to removing the experimental, I have some questions: What is the
>>> use case for beam plugins? I do not see them being used often. IIRC the
>>> main use was importing modules before starting the worker. There is also a
>>> beam plugin clearly intended for runner v2 (custom containers) [1], how
>>> does that work on V2?
>>>
>>> Ahmet
>>>
>>> /cc @Yichi Zhang <zy...@google.com> @Valentyn Tymofieiev
>>> <va...@google.com> @Ryan Thompson <ry...@google.com>
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/1636a3ace2c039a433cf45c51e4cee1151054730/sdks/python/apache_beam/runners/portability/sdk_container_builder.py#L68
>>>
>>> On Thu, Feb 24, 2022 at 10:14 AM Rahul Iyer <ra...@twitter.com> wrote:
>>>
>>>> Good Morning/Afternoon/Evening folks,
>>>>
>>>> The current support for beam-plugins is experimental and we would like
>>>> to have it as a first class member of the beam library for Python Runner
>>>> v2. This helps us load plugins into the runtime before starting the
>>>> SdkHarness. https://github.com/apache/beam/pull/16920 is a PR I
>>>> created for this purpose. Wanted to gather some thoughts around the
>>>> approach here and have it standardized. The current implementation of beam
>>>> plugins allows users to extend a class from BeamPlugin and it gets
>>>> automatically populated in the --beam_plugin PipelineOption, e.g.:
>>>> FileSystem
>>>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystem.py#L475>.
>>>> This creates the pipeline option as,
>>>>
>>>> --beam_plugin=[
>>>>
>>>>   'apache_beam.io.aws.s3filesystem.S3FileSystem',
>>>>
>>>>   'apache_beam.io.filesystem.FileSystem',
>>>>
>>>>   'apache_beam.io.hadoopfilesystem.HadoopFileSystem',
>>>>
>>>>   'apache_beam.io.localfilesystem.LocalFileSystem',
>>>>
>>>>   'apache_beam.io.gcp.gcsfilesystem.GCSFileSystem',
>>>>
>>>>   'apache_beam.io.azure.blobstoragefilesystem.BlobStorageFileSystem'
>>>>
>>>> ]
>>>>
>>>> Another way is to provide a module via the --beam_plugin
>>>> PipelineOption, e.g.:
>>>>
>>>> --beam_plugin='twitter.beam.rule_the_world'
>>>>
>>>> The current implementation in the PR supports both these approaches but
>>>> would love to have a standardized way forward and have it documented. Would
>>>> love to hear your thoughts about this.
>>>>
>>>> Thanks & Regards,
>>>> Rahul Iyer
>>>>
>>>

Re: Support for beam-plugins in Python Runner v2

Posted by Valentyn Tymofieiev <va...@google.com>.
+1 as well.

One Beam user mentioned a while ago they needed to run smth like `import
tensorflow_addons; tensorflow_addons.register.register_all()` as a very
first operation once the Python process starts. Looks like importing a
plugin as a module would allow exactly that.

Does anyone know why FileSystems use the Plugin interface? Does this mean
that some aspects of the FS functionality didn't work for portable
pipelines all this time but we didn't notice it?



On Tue, Mar 1, 2022 at 10:16 AM Robert Bradshaw <ro...@google.com> wrote:

> +1 to this change which looks like an omission in runner v2.
>
> The intent of beam plugins is to enforce that modules imported during
> pipeline construction will also get imported on the workers (e.g. if
> certain setup or registration needs to be performed). (Python doesn't have
> service registries like Java.) The other piece of this is populating the
> pipeline option with BeamPlugin.get_all_plugin_paths() which looks like
> it's only done on the Dataflow runner:
> https://github.com/apache/beam/blob/release-2.36.0/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py#L511
>
> I have no idea why SdkContainerImageBuilder extends beam.Plugin though...
>
>
> On Tue, Mar 1, 2022 at 8:51 AM Ahmet Altay <al...@google.com> wrote:
>
>> Hi Rahul,
>>
>> Your change looks reasonable to me. And it brings dataflow python runners
>> v1 and v2 closer.
>>
>> Related to removing the experimental, I have some questions: What is the
>> use case for beam plugins? I do not see them being used often. IIRC the
>> main use was importing modules before starting the worker. There is also a
>> beam plugin clearly intended for runner v2 (custom containers) [1], how
>> does that work on V2?
>>
>> Ahmet
>>
>> /cc @Yichi Zhang <zy...@google.com> @Valentyn Tymofieiev
>> <va...@google.com> @Ryan Thompson <ry...@google.com>
>>
>> [1]
>> https://github.com/apache/beam/blob/1636a3ace2c039a433cf45c51e4cee1151054730/sdks/python/apache_beam/runners/portability/sdk_container_builder.py#L68
>>
>> On Thu, Feb 24, 2022 at 10:14 AM Rahul Iyer <ra...@twitter.com> wrote:
>>
>>> Good Morning/Afternoon/Evening folks,
>>>
>>> The current support for beam-plugins is experimental and we would like
>>> to have it as a first class member of the beam library for Python Runner
>>> v2. This helps us load plugins into the runtime before starting the
>>> SdkHarness. https://github.com/apache/beam/pull/16920 is a PR I created
>>> for this purpose. Wanted to gather some thoughts around the approach here
>>> and have it standardized. The current implementation of beam plugins allows
>>> users to extend a class from BeamPlugin and it gets automatically populated
>>> in the --beam_plugin PipelineOption, e.g.: FileSystem
>>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystem.py#L475>.
>>> This creates the pipeline option as,
>>>
>>> --beam_plugin=[
>>>
>>>   'apache_beam.io.aws.s3filesystem.S3FileSystem',
>>>
>>>   'apache_beam.io.filesystem.FileSystem',
>>>
>>>   'apache_beam.io.hadoopfilesystem.HadoopFileSystem',
>>>
>>>   'apache_beam.io.localfilesystem.LocalFileSystem',
>>>
>>>   'apache_beam.io.gcp.gcsfilesystem.GCSFileSystem',
>>>
>>>   'apache_beam.io.azure.blobstoragefilesystem.BlobStorageFileSystem'
>>>
>>> ]
>>>
>>> Another way is to provide a module via the --beam_plugin
>>> PipelineOption, e.g.:
>>>
>>> --beam_plugin='twitter.beam.rule_the_world'
>>>
>>> The current implementation in the PR supports both these approaches but
>>> would love to have a standardized way forward and have it documented. Would
>>> love to hear your thoughts about this.
>>>
>>> Thanks & Regards,
>>> Rahul Iyer
>>>
>>

Re: Support for beam-plugins in Python Runner v2

Posted by Robert Bradshaw <ro...@google.com>.
+1 to this change which looks like an omission in runner v2.

The intent of beam plugins is to enforce that modules imported during
pipeline construction will also get imported on the workers (e.g. if
certain setup or registration needs to be performed). (Python doesn't have
service registries like Java.) The other piece of this is populating the
pipeline option with BeamPlugin.get_all_plugin_paths() which looks like
it's only done on the Dataflow runner:
https://github.com/apache/beam/blob/release-2.36.0/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py#L511

I have no idea why SdkContainerImageBuilder extends beam.Plugin though...


On Tue, Mar 1, 2022 at 8:51 AM Ahmet Altay <al...@google.com> wrote:

> Hi Rahul,
>
> Your change looks reasonable to me. And it brings dataflow python runners
> v1 and v2 closer.
>
> Related to removing the experimental, I have some questions: What is the
> use case for beam plugins? I do not see them being used often. IIRC the
> main use was importing modules before starting the worker. There is also a
> beam plugin clearly intended for runner v2 (custom containers) [1], how
> does that work on V2?
>
> Ahmet
>
> /cc @Yichi Zhang <zy...@google.com> @Valentyn Tymofieiev
> <va...@google.com> @Ryan Thompson <ry...@google.com>
>
> [1]
> https://github.com/apache/beam/blob/1636a3ace2c039a433cf45c51e4cee1151054730/sdks/python/apache_beam/runners/portability/sdk_container_builder.py#L68
>
> On Thu, Feb 24, 2022 at 10:14 AM Rahul Iyer <ra...@twitter.com> wrote:
>
>> Good Morning/Afternoon/Evening folks,
>>
>> The current support for beam-plugins is experimental and we would like to
>> have it as a first class member of the beam library for Python Runner v2.
>> This helps us load plugins into the runtime before starting the SdkHarness.
>> https://github.com/apache/beam/pull/16920 is a PR I created for this
>> purpose. Wanted to gather some thoughts around the approach here and have
>> it standardized. The current implementation of beam plugins allows users to
>> extend a class from BeamPlugin and it gets automatically populated in the
>> --beam_plugin PipelineOption, e.g.: FileSystem
>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystem.py#L475>.
>> This creates the pipeline option as,
>>
>> --beam_plugin=[
>>
>>   'apache_beam.io.aws.s3filesystem.S3FileSystem',
>>
>>   'apache_beam.io.filesystem.FileSystem',
>>
>>   'apache_beam.io.hadoopfilesystem.HadoopFileSystem',
>>
>>   'apache_beam.io.localfilesystem.LocalFileSystem',
>>
>>   'apache_beam.io.gcp.gcsfilesystem.GCSFileSystem',
>>
>>   'apache_beam.io.azure.blobstoragefilesystem.BlobStorageFileSystem'
>>
>> ]
>>
>> Another way is to provide a module via the --beam_plugin PipelineOption,
>> e.g.:
>>
>> --beam_plugin='twitter.beam.rule_the_world'
>>
>> The current implementation in the PR supports both these approaches but
>> would love to have a standardized way forward and have it documented. Would
>> love to hear your thoughts about this.
>>
>> Thanks & Regards,
>> Rahul Iyer
>>
>

Re: Support for beam-plugins in Python Runner v2

Posted by Ahmet Altay <al...@google.com>.
Hi Rahul,

Your change looks reasonable to me. And it brings dataflow python runners
v1 and v2 closer.

Related to removing the experimental, I have some questions: What is the
use case for beam plugins? I do not see them being used often. IIRC the
main use was importing modules before starting the worker. There is also a
beam plugin clearly intended for runner v2 (custom containers) [1], how
does that work on V2?

Ahmet

/cc @Yichi Zhang <zy...@google.com> @Valentyn Tymofieiev
<va...@google.com> @Ryan Thompson <ry...@google.com>

[1]
https://github.com/apache/beam/blob/1636a3ace2c039a433cf45c51e4cee1151054730/sdks/python/apache_beam/runners/portability/sdk_container_builder.py#L68

On Thu, Feb 24, 2022 at 10:14 AM Rahul Iyer <ra...@twitter.com> wrote:

> Good Morning/Afternoon/Evening folks,
>
> The current support for beam-plugins is experimental and we would like to
> have it as a first class member of the beam library for Python Runner v2.
> This helps us load plugins into the runtime before starting the SdkHarness.
> https://github.com/apache/beam/pull/16920 is a PR I created for this
> purpose. Wanted to gather some thoughts around the approach here and have
> it standardized. The current implementation of beam plugins allows users to
> extend a class from BeamPlugin and it gets automatically populated in the
> --beam_plugin PipelineOption, e.g.: FileSystem
> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystem.py#L475>.
> This creates the pipeline option as,
>
> --beam_plugin=[
>
>   'apache_beam.io.aws.s3filesystem.S3FileSystem',
>
>   'apache_beam.io.filesystem.FileSystem',
>
>   'apache_beam.io.hadoopfilesystem.HadoopFileSystem',
>
>   'apache_beam.io.localfilesystem.LocalFileSystem',
>
>   'apache_beam.io.gcp.gcsfilesystem.GCSFileSystem',
>
>   'apache_beam.io.azure.blobstoragefilesystem.BlobStorageFileSystem'
>
> ]
>
> Another way is to provide a module via the --beam_plugin PipelineOption,
> e.g.:
>
> --beam_plugin='twitter.beam.rule_the_world'
>
> The current implementation in the PR supports both these approaches but
> would love to have a standardized way forward and have it documented. Would
> love to hear your thoughts about this.
>
> Thanks & Regards,
> Rahul Iyer
>