Posted to user@beam.apache.org by Marco Mistroni <mm...@gmail.com> on 2020/01/16 22:00:14 UTC

beam main file with dependencies

Hello all,
I have written an Apache Beam workflow which I have split across two files:
- main_file.py contains the pipeline
- utils.py contains a few functions used in the pipeline

I have created a template for this using the command below:

python -m main_file.py --runner=dataflow --project=myproject
--template_location=gs://mybucket/my_template
--temp_location=gs://mybucket/temp --staging_location=gs://mybucket/staging

I then attempted to create a job from this template.
However, when I kick off the job I get exceptions such as:


Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/apache_beam/internal/pickler.py", line 261, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 317, in loads
    return load(file, ignore)
  File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 305, in load
    obj = pik.load()
  File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 474, in find_class
    return StockUnpickler.find_class(self, module, name)
ImportError: No module named 'utils'

I am guessing I am missing some steps in packaging the application, or
perhaps some extra options to specify dependencies?
I would not imagine writing a whole workflow in one file, so this looks
like a standard use case?

kind regards

Re: beam main file with dependencies

Posted by André Rocha Silva <a....@portaltelemedicina.com.br>.
Marco

To add to the other answers, there are two ways I add dependencies to my
jobs. In both cases, you need a setup.py like this:
from setuptools import setup, find_packages

setup(
    name="dependencies",
    version="0.0.1",
    packages=find_packages(),
    install_requires=[
        'pymssql==2.1.4', 'google-cloud-storage==1.22.0'],
)

A setup file containing just this is enough to add dependencies.

1) Add a setup file:
When you run your job, you have to add --setup_file, so it would look like
this:

python -m main_file --runner=dataflow --project=myproject \
  --template_location=gs://mybucket/my_template \
  --temp_location=gs://mybucket/temp \
  --staging_location=gs://mybucket/staging \
  --setup_file home/path/to/setup.py

2) Extra package:
From your setup file, you can build a package and add it to your job. To do
so, run:
python setup.py sdist
Then add the file it creates to your job with the --extra_package
parameter:

python -m main_file --runner=dataflow --project=myproject \
  --template_location=gs://mybucket/my_template \
  --temp_location=gs://mybucket/temp \
  --staging_location=gs://mybucket/staging \
  --extra_package dist/dependencies-0.0.1.tar.gz
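
One thing worth checking before uploading: find_packages() only collects
directories that contain an __init__.py, so a standalone utils.py sitting
next to main_file.py may not end up in the archive unless setup.py also
lists it (for example via py_modules=["utils"]). Below is a minimal sketch
for inspecting the sdist, assuming the archive name used above.
sdist_has_module is a hypothetical helper, not part of Beam or setuptools.

```python
import tarfile


def sdist_has_module(sdist_path, module_filename):
    """Return True if the .tar.gz sdist contains a file named module_filename.

    Hypothetical helper for illustration; it only compares the last path
    component of each archive member against module_filename.
    """
    with tarfile.open(sdist_path, "r:gz") as archive:
        return any(
            name.split("/")[-1] == module_filename
            for name in archive.getnames()
        )
```

For example, sdist_has_module("dist/dependencies-0.0.1.tar.gz", "utils.py")
should be true before the archive is passed to --extra_package.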

Good luck!

André Rocha
Data Engineer

Re: beam main file with dependencies

Posted by Chris Swart <ch...@healx.io>.
Hey Marco, you will need to package your application in a module. The
Juliaset example shows how you could go about it:
https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/complete/juliaset

Best wishes, Chris

Re: beam main file with dependencies

Posted by Robert Bradshaw <ro...@google.com>.
Yes, you'll need to bundle up these dependencies in a way that they
can be shipped to the workers. See
https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
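
To illustrate why the worker raises the ImportError: functions defined in
an imported module are pickled by reference (module name plus function
name), so the unpickling side must be able to import that module itself.
Here is a rough sketch using the standard library's pickle as a stand-in
for dill (Beam's pickler wraps dill, and details there may differ), with a
hypothetical demo_utils module playing the role of utils.py:

```python
import importlib
import os
import pickle
import sys
import tempfile


def unpickle_without_module():
    """Pickle a function where its module is importable, then unpickle where it is not.

    Returns the ImportError message, or None if unpickling unexpectedly succeeds.
    """
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical helper module standing in for utils.py.
        with open(os.path.join(tmp, "demo_utils.py"), "w") as handle:
            handle.write("def double(x):\n    return 2 * x\n")

        # "Launcher" side: the module is on sys.path, so pickling works.
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        demo_utils = importlib.import_module("demo_utils")
        payload = pickle.dumps(demo_utils.double)

        # "Worker" side: same bytes, but the module is no longer importable.
        sys.path.remove(tmp)
        del sys.modules["demo_utils"]
        importlib.invalidate_caches()

        try:
            pickle.loads(payload)
        except ImportError as exc:
            return str(exc)
    return None
```

The payload carries only the names, not the function body, which is why
the module has to be shipped to the workers separately.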
