Posted to user@beam.apache.org by Sayak Paul <sp...@gmail.com> on 2021/09/13 09:44:39 UTC

Local and global Python dependencies for DataflowRunner

Hi folks,

Have you ever faced an issue with local and global dependencies inside an
Apache Beam Pipeline while executing it on Dataflow?

My pipeline involves a few components from the other modules of the project
and I have set up a setup.py following the guidelines from [1]. What is
surprising to me is that four out of the five modules are working as
expected and for only one, Beam is complaining that it's not defined. Note
that it only happens when I run it using the DataflowRunner.

The error goes away when I include the module import inside the method that
starts my Beam pipeline. This is a hacky workaround IMO, all the more so
because I am not sure why the other module imports work without it.

Could anyone provide some hints?

[1] https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/

Sayak Paul | sayak.dev

Re: Local and global Python dependencies for DataflowRunner

Posted by Sayak Paul <sp...@gmail.com>.
Just wanted to bump this up again.
Sayak Paul | sayak.dev



On Tue, Sep 14, 2021 at 7:41 AM Sayak Paul <sp...@gmail.com> wrote:

> Thank you! It worked.
>
> > you should be able to restructure the pipeline package, so that the
> > imports are not in the main module, similar to
> > https://stackoverflow.com/a/58845832/5153670
>
> Could you expand a bit more on what you mean by "restructure"? Let me
> provide more context on how my project is structured.
>
> This is the high-level structure:
>
> |- my_package/
>     |- __init__.py
>     |- utils/
>         |- __init__.py
>         |- html.py
> |- scripts/
>     |- dataflow_runner.py
> |- setup.py
>
> dataflow_runner.py is the script that contains my Apache Beam pipeline
> inside main() and I execute dataflow_runner.py to start my pipeline run
> on Dataflow. html is the module that does not get identified when the
> pipeline is run. This is how the module is imported inside
> dataflow_runner:
>
> import sys
> sys.path.append("..")
> from my_package.utils import html
>
> I also make sure to specify the path to setup.py as ../setup.py from
> dataflow_runner.py.
>
> Please let me know if anything is unclear.
>
> Sayak Paul | sayak.dev

Re: Local and global Python dependencies for DataflowRunner

Posted by Sayak Paul <sp...@gmail.com>.
Thank you! It worked.

> you should be able to restructure the pipeline package, so that the
> imports are not in the main module, similar to
> https://stackoverflow.com/a/58845832/5153670

Could you expand a bit more on what you mean by "restructure"? Let me
provide more context on how my project is structured.

This is the high-level structure:

|- my_package/
    |- __init__.py
    |- utils/
        |- __init__.py
        |- html.py
|- scripts/
    |- dataflow_runner.py
|- setup.py

dataflow_runner.py is the script that contains my Apache Beam pipeline
inside main() and I execute dataflow_runner.py to start my pipeline run on
Dataflow. html is the module that does not get identified when the pipeline
is run. This is how the module is imported inside dataflow_runner:

import sys
sys.path.append("..")
from my_package.utils import html

I also make sure to specify the path to setup.py as ../setup.py from
dataflow_runner.py.
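Both the sys.path entry and the ../setup.py path above depend on the directory the script is launched from. A minimal sketch, assuming the project layout shown earlier, that resolves them relative to the script's own location instead. The helpers project_paths, add_project_root_to_sys_path and build_pipeline_args are hypothetical names for illustration and are not part of Beam; --runner and --setup_file are the standard pipeline options.

```python
import sys
from pathlib import Path


def project_paths(script_file):
    """Return (project_root, setup_file) for a script stored in <root>/scripts/."""
    script = Path(script_file).resolve()
    project_root = script.parent.parent  # scripts/ -> project root
    return project_root, project_root / "setup.py"


def add_project_root_to_sys_path(script_file):
    """Make `from my_package.utils import html` importable regardless of cwd."""
    project_root, _ = project_paths(script_file)
    root = str(project_root)
    if root not in sys.path:
        sys.path.insert(0, root)
    return root


def build_pipeline_args(script_file):
    """Build the runner and setup-file flags with an absolute setup.py path."""
    _, setup_file = project_paths(script_file)
    return ["--runner=DataflowRunner", f"--setup_file={setup_file}"]
```

Inside dataflow_runner.py this would be called as add_project_root_to_sys_path(__file__) before the my_package import, and build_pipeline_args(__file__) would be appended to the other pipeline arguments.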

Please let me know if anything is unclear.

Sayak Paul | sayak.dev



On Tue, Sep 14, 2021 at 12:00 AM Valentyn Tymofieiev <va...@google.com>
wrote:

> Hi,
> Try setting --save_main_session=True when you launch the pipeline. If that
> works, you should be able to restructure the pipeline package so that the
> imports are not in the main module, similar to
> https://stackoverflow.com/a/58845832/5153670

Re: Local and global Python dependencies for DataflowRunner

Posted by Valentyn Tymofieiev <va...@google.com>.
Hi,
Try setting --save_main_session=True when you launch the pipeline. If that
works, you should be able to restructure the pipeline package so that the
imports are not in the main module, similar to
https://stackoverflow.com/a/58845832/5153670
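
A rough illustration of why the flag and the import-inside-the-function workaround can both help. It assumes that functions defined in the main module are pickled by value and that the worker's main module does not contain the launcher script's imports unless the main session is saved. The sketch uses only the standard library: rebuild_without_globals is a hypothetical stand-in for shipping a function to a worker, not a Beam API, and json stands in for the project module.

```python
import builtins
import json  # stands in for a module imported at the top of the main script
import types


def uses_global_import(record):
    # Relies on the module-level `json` name being present in globals.
    return json.dumps(record)


def uses_local_import(record):
    # Imports inside the function body, so it has no global dependency.
    import json
    return json.dumps(record)


def rebuild_without_globals(fn):
    """Recreate fn from its code object with a namespace holding only builtins."""
    fresh_globals = {"__builtins__": builtins}
    return types.FunctionType(fn.__code__, fresh_globals, fn.__name__)


def call_on_fake_worker(fn, record):
    """Run the rebuilt function and report a NameError instead of raising it."""
    try:
        return rebuild_without_globals(fn)(record)
    except NameError as err:
        return f"NameError: {err}"
```

Calling call_on_fake_worker(uses_global_import, {"a": 1}) reports a NameError for json, while the uses_local_import variant still returns the serialized record. If the assumption holds, keeping such functions in a module of the installed package rather than in the main script should have a similar effect, since the worker would then resolve them by importing that module.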

