Posted to user@beam.apache.org by Sumit Desai via user <us...@beam.apache.org> on 2023/12/18 07:35:37 UTC

Specifying dataflow template location with Apache beam Python SDK

I am creating an Apache Beam pipeline using the Python SDK. I want to use
one of the standard Dataflow templates (this one
<https://console.cloud.google.com/gcr/images/dataflow-templates-base/global/python310-template-launcher-base?tab=info>).
But when I specify it using the 'template_location' key while creating the
pipeline_options object, I get the error `FileNotFoundError: [Errno 2] No
such file or directory:
'gcr.io/dataflow-templates-base/python310-template-launcher-base'`

I also tried specifying the full version,
`gcr.io/dataflow-templates-base/python310-template-launcher-base::flex_templates_base_image_release_20231127_RC00`,
but got the same error. Can someone suggest what I might be doing wrong?
The code snippet that creates pipeline_options is as follows:

import logging
import os

# GCP_PROJECT_ID, TAS_GCS_BUCKET_NAME_PREFIX and service_account are
# module-level constants defined elsewhere in the module.
logger = logging.getLogger(__name__)


def __create_pipeline_options_dataflow(job_name):
    # Set up the Dataflow runner options
    gcp_project_id = os.environ.get(GCP_PROJECT_ID)
    # TODO: Move to environment variables
    pipeline_options = {
        'project': gcp_project_id,
        'region': 'us-east1',
        'job_name': job_name,  # Provide a unique job name
        'temp_location': f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/temp',
        'staging_location': f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/staging',
        'runner': 'DataflowRunner',
        'save_main_session': True,
        'service_account_email': service_account,
        # 'network': f'projects/{gcp_project_id}/global/networks/default',
        # 'subnetwork': f'projects/{gcp_project_id}/regions/us-east1/subnetworks/default',
        'template_location': 'gcr.io/dataflow-templates-base/python310-template-launcher-base'
    }
    logger.debug(f"pipeline_options created as {pipeline_options}")
    return pipeline_options

Re: Specifying dataflow template location with Apache beam Python SDK

Posted by Sumit Desai via user <us...@beam.apache.org>.
Thanks, all. Yes, I was under the mistaken impression that one of these
base images could be used directly as a template, without creating a
custom template. Thanks for clarifying it for me.

Regards,
Sumit Desai


Re: Specifying dataflow template location with Apache beam Python SDK

Posted by Bruno Volpato via user <us...@beam.apache.org>.
Right, there's some misunderstanding here; Bartosz's and XQ's inputs are
correct.

Just want to add that the template_location parameter is the GCS path
where you want your template to be stored, not the image reference of the
base image.
The GCR path you are trying to use belongs in the Dockerfile, in case you
are building a Flex Template (see here:
https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates#use_custom_container_images
).
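
For illustration, a minimal sketch of the corrected options; the project,
bucket and template names here are hypothetical:

    # Sketch only: template_location points at a GCS path where the
    # generated template spec will be written, not at a container image.
    pipeline_options = {
        'project': 'my-gcp-project',
        'region': 'us-east1',
        'runner': 'DataflowRunner',
        'temp_location': 'gs://my-bucket/temp',
        'staging_location': 'gs://my-bucket/staging',
        'template_location': 'gs://my-bucket/templates/my-template',
    }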

Best,
Bruno





Re: Specifying dataflow template location with Apache beam Python SDK

Posted by XQ Hu via user <us...@beam.apache.org>.
https://github.com/google/dataflow-ml-starter/tree/main?tab=readme-ov-file#run-the-beam-pipeline-with-dataflow-flex-templates
has a full example of how to create your own Flex Template. FYI.


Re: Specifying dataflow template location with Apache beam Python SDK

Posted by Bartosz Zabłocki via user <us...@beam.apache.org>.
Hi Sumit,
could you elaborate a little bit more on what you are trying to achieve
with the templates?

As far as I know, these base Docker images serve as base images for your
own custom templates.
If you want to use an existing template, you can pick one of these:
https://cloud.google.com/dataflow/docs/guides/templates/provided-templates.
To run it, you just need to invoke `gcloud dataflow jobs run ...` or an
equivalent command (see
https://cloud.google.com/dataflow/docs/guides/templates/provided/pubsub-to-pubsub#gcloud
and the sketch below), or launch it from the UI (Cloud Console -> Dataflow
-> Jobs -> Create Job From Template).
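
For example, a sketch of running the provided Pub/Sub-to-Pub/Sub template;
the job, project, subscription and topic names here are hypothetical, and
the parameter names follow the guide linked above:

    # Sketch only: launch a provided (classic) template with gcloud.
    gcloud dataflow jobs run my-pubsub-copy-job \
        --gcs-location gs://dataflow-templates-us-east1/latest/Cloud_PubSub_to_Cloud_PubSub \
        --region us-east1 \
        --parameters inputSubscription=projects/my-project/subscriptions/my-sub,outputTopic=projects/my-project/topics/my-topic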

If you want to create your own template (i.e. a reusable Dataflow
pipeline), take a look at this page:
https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#create_a_flex_template.
This will let you package your own pipeline as a template, which you can
then launch with the `gcloud dataflow flex-template run ...` command (a
sketch follows below).
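
A sketch with hypothetical bucket, repository and file names, assuming a
Python pipeline whose entry point is main.py:

    # Sketch only: package the pipeline as a Flex Template...
    gcloud dataflow flex-template build gs://my-bucket/templates/my-template.json \
        --image-gcr-path us-east1-docker.pkg.dev/my-project/my-repo/my-template:latest \
        --sdk-language PYTHON \
        --flex-template-base-image PYTHON3 \
        --py-path . \
        --env FLEX_TEMPLATE_PYTHON_PY_FILE=main.py \
        --env FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE=requirements.txt

    # ...then launch it.
    gcloud dataflow flex-template run my-job \
        --template-file-gcs-location gs://my-bucket/templates/my-template.json \
        --region us-east1
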
If you want more control over the environment and dependencies, you can
build your own custom Docker image instead. That's where you'll use the
base image you mentioned. See this page for an example (and the Dockerfile
sketch below):
https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates#use_a_custom_container_for_dependencies.
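
A minimal Dockerfile sketch, assuming the pipeline lives in main.py (file
names are hypothetical); this is where the base image you found is
actually used:

    # Sketch only: build a custom Flex Template launcher image on top of
    # the Python 3.10 template launcher base image.
    FROM gcr.io/dataflow-templates-base/python310-template-launcher-base

    ARG WORKDIR=/template
    WORKDIR ${WORKDIR}
    COPY main.py requirements.txt ${WORKDIR}/

    # Tell the launcher which pipeline file and requirements to use.
    ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
    ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="${WORKDIR}/requirements.txt"

    RUN pip install --no-cache-dir -r requirements.txt

    ENTRYPOINT ["/opt/google/dataflow/python_template_launcher"]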

I hope this helps. Let me know if you have any other questions.

Cheers,
Bartosz Zablocki
