Posted to user@beam.apache.org by Sumit Desai via user <us...@beam.apache.org> on 2023/12/19 17:25:35 UTC

Dataflow not able to find a module specified using extra_package

Hi all,
I have created a Dataflow pipeline in batch mode using the Apache Beam Python
SDK. I am using one non-public dependency, 'uplight-telemetry', which I have
specified via the extra_package parameter while creating the pipeline_options
object. However, the pipeline fails to load with the error *No module
named 'uplight_telemetry'*.
The code that creates pipeline_options is as follows:

def __create_pipeline_options_dataflow(job_name):
    # Set up the Dataflow runner options
    gcp_project_id = os.environ.get(GCP_PROJECT_ID)
    current_dir = os.path.dirname(os.path.abspath(__file__))
    print("current_dir=", current_dir)
    setup_file_path = os.path.join(current_dir, '..', '..', 'setup.py')
    print("Set-up file path=", setup_file_path)
    # TODO: Move file to proper location
    uplight_telemetry_tar_file_path = os.path.join(
        current_dir, '..', '..', '..', 'non-public-dependencies',
        'uplight-telemetry-1.0.0.tar.gz')
    # TODO: Move to environment variables
    pipeline_options = {
        'project': gcp_project_id,
        'region': "us-east1",
        'job_name': job_name,  # Provide a unique job name
        'temp_location': f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/temp',
        'staging_location': f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/staging',
        'runner': 'DataflowRunner',
        'save_main_session': True,
        'service_account_email': os.environ.get(SERVICE_ACCOUNT),
        # 'network': f'projects/{gcp_project_id}/global/networks/default',
        'subnetwork': os.environ.get(SUBNETWORK_URL),
        'setup_file': setup_file_path,
        'extra_package': uplight_telemetry_tar_file_path,
        # 'template_location': 'gcr.io/dataflow-templates-base/python310-template-launcher-base'
    }
    print("Pipeline created for job-name", job_name)
    logger.debug(f"pipeline_options created as {pipeline_options}")
    return pipeline_options

Why is it not trying to install this package from extra_package?

Re: Dataflow not able to find a module specified using extra_package

Posted by Sumit Desai via user <us...@beam.apache.org>.
Thanks Anand and Robert. Using extra_packages and specifying it as a list
worked.
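
For reference, here is a minimal sketch of the working configuration (same
constants as in my original snippet; converting the dict with PipelineOptions
at the end is just one way to consume it):

from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = {
    # ... all other options unchanged from my original snippet ...
    'setup_file': setup_file_path,
    # Plural key with a list of local package paths; Dataflow stages each
    # file and pip-installs it on the workers.
    'extra_packages': [uplight_telemetry_tar_file_path],
}
options = PipelineOptions(flags=[], **pipeline_options)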

Regards,
Sumit Desai


Re: Dataflow not able to find a module specified using extra_package

Posted by Robert Bradshaw via user <us...@beam.apache.org>.
And should it be a list of strings, rather than a string?

On Tue, Dec 19, 2023 at 10:10 AM Anand Inguva via user <us...@beam.apache.org>
wrote:

> Can you try passing `extra_packages` instead of `extra_package` when
> passing pipeline options as a dict?

Re: Dataflow not able to find a module specified using extra_package

Posted by Anand Inguva via user <us...@beam.apache.org>.
Can you try passing `extra_packages` instead of `extra_package` when
passing pipeline options as a dict?
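
If I remember correctly, --extra_package is only the command-line spelling of
the flag; the underlying option, and therefore the key to use when the options
are built as a dict, is the plural extra_packages, and it takes a list of
paths. Roughly (with uplight_telemetry_tar_file_path as in your snippet):

pipeline_options = {
    # ... other options unchanged ...
    'extra_packages': [uplight_telemetry_tar_file_path],  # plural key, list value
}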
