Posted to user@beam.apache.org by OrielResearch Eila Arich-Landkof <ei...@orielresearch.org> on 2018/07/03 21:09:18 UTC

Help with adding Python package dependencies when executing a Python pipeline

Hello all,


I am using Python code to run my pipeline, similar to the following:

options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = 'my-project-id'
google_cloud_options.job_name = 'myjob'
google_cloud_options.staging_location = 'gs://your-bucket-name-here/staging'
google_cloud_options.temp_location = 'gs://your-bucket-name-here/temp'
options.view_as(StandardOptions).runner = 'DataflowRunner'



I would like to add the *pandas-gbq* package to my workers. What would
be the recommended way to do so? Can I add it to the PipelineOptions()?
I remember there are a few options; one of them was creating a
requirements text file, but I cannot remember where I saw it, or whether
it is the simplest way when running the pipeline from Datalab.
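If memory serves, it looked roughly like the sketch below (written from
memory and not verified; requirements.txt would simply list pandas-gbq):

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions()
# point the workers at a requirements file that lists pandas-gbq
options.view_as(SetupOptions).requirements_file = 'requirements.txt'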

Thank you for any reference!

-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/

Re: Help with adding Python package dependencies when executing a Python pipeline

Posted by OrielResearch Eila Arich-Landkof <ei...@orielresearch.org>.
Based on
https://stackoverflow.com/questions/44423769/how-to-use-google-cloud-storage-in-dataflow-pipeline-run-from-datalab
I tried this:
options = PipelineOptions(flags=["--requirements_file", "./requirements.txt"])
The requirements file was generated by:
pip freeze > requirements.txt

But it fails with the following error:

CalledProcessError: Command '['/usr/local/envs/py2env/bin/python',
'-m', 'pip', 'install', '--download',
'/tmp/dataflow-requirements-cache', '-r', 'requirements.txt',
'--no-binary', ':all:']' returned non-zero exit status 1
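My guess is that the freeze output includes packages that are only
available as binary wheels (or are not on PyPI at all), which cannot be
fetched with --no-binary ':all:'. Perhaps a hand-trimmed requirements
file containing only what the workers actually need would behave
differently, e.g.:

# requirements.txt, trimmed by hand instead of using pip freeze
pandas-gbq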


Any suggestions?
Thanks,
Eila

On Tue, Jul 3, 2018 at 5:25 PM, OrielResearch Eila Arich-Landkof <
eila@orielresearch.org> wrote:

> Thank you. Where do I add the reference to requirements.txt? Can I do it
> from the pipeline options code?
>
> On Tue, Jul 3, 2018 at 5:13 PM, Lukasz Cwik <lc...@google.com> wrote:
>
>> Take a look at https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
>>
>> On Tue, Jul 3, 2018 at 2:09 PM OrielResearch Eila Arich-Landkof <
>> eila@orielresearch.org> wrote:
>>
>>> Hello all,
>>>
>>>
>>> I am using Python code to run my pipeline, similar to the following:
>>>
>>> options = PipelineOptions()
>>> google_cloud_options = options.view_as(GoogleCloudOptions)
>>> google_cloud_options.project = 'my-project-id'
>>> google_cloud_options.job_name = 'myjob'
>>> google_cloud_options.staging_location = 'gs://your-bucket-name-here/staging'
>>> google_cloud_options.temp_location = 'gs://your-bucket-name-here/temp'
>>> options.view_as(StandardOptions).runner = 'DataflowRunner'
>>>
>>>
>>>
>>> I would like to add the *pandas-gbq* package to my workers. What would
>>> be the recommended way to do so? Can I add it to the PipelineOptions()?
>>> I remember there are a few options; one of them was creating a
>>> requirements text file, but I cannot remember where I saw it, or whether
>>> it is the simplest way when running the pipeline from Datalab.
>>>
>>> Thank you for any reference!
>>>
>>> --
>>> Eila
>>> www.orielresearch.org
>>> https://www.meetup.com/Deep-Learning-In-Production/
>>>
>>>
>>>
>
>
> --
> Eila
> www.orielresearch.org
> https://www.meetup.com/Deep-Learning-In-Production/
>
>
>


-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/

Re: Help with adding Python package dependencies when executing a Python pipeline

Posted by OrielResearch Eila Arich-Landkof <ei...@orielresearch.org>.
Thank you. Where do I add the reference to requirements.txt? Can I do it
from the pipeline options code?

On Tue, Jul 3, 2018 at 5:13 PM, Lukasz Cwik <lc...@google.com> wrote:

> Take a look at https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
>
> On Tue, Jul 3, 2018 at 2:09 PM OrielResearch Eila Arich-Landkof <
> eila@orielresearch.org> wrote:
>
>> Hello all,
>>
>>
>> I am using Python code to run my pipeline, similar to the following:
>>
>> options = PipelineOptions()
>> google_cloud_options = options.view_as(GoogleCloudOptions)
>> google_cloud_options.project = 'my-project-id'
>> google_cloud_options.job_name = 'myjob'
>> google_cloud_options.staging_location = 'gs://your-bucket-name-here/staging'
>> google_cloud_options.temp_location = 'gs://your-bucket-name-here/temp'
>> options.view_as(StandardOptions).runner = 'DataflowRunner'
>>
>>
>>
>> I would like to add the *pandas-gbq* package to my workers. What would
>> be the recommended way to do so? Can I add it to the PipelineOptions()?
>> I remember there are a few options; one of them was creating a
>> requirements text file, but I cannot remember where I saw it, or whether
>> it is the simplest way when running the pipeline from Datalab.
>>
>> Thank you for any reference!
>>
>> --
>> Eila
>> www.orielresearch.org
>> https://www.meetup.com/Deep-Learning-In-Production/
>>
>>
>>


-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/

Re: Help with adding Python package dependencies when executing a Python pipeline

Posted by Lukasz Cwik <lc...@google.com>.
Take a look at
https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
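For a package that is available on PyPI, like pandas-gbq, the relevant
mechanism described there is the requirements file; a rough sketch, to
be adapted as needed:

from apache_beam.options.pipeline_options import PipelineOptions

# stage a requirements file with the job so the Dataflow workers install
# the listed packages (here, pandas-gbq) at startup
options = PipelineOptions(flags=['--requirements_file', 'requirements.txt'])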

On Tue, Jul 3, 2018 at 2:09 PM OrielResearch Eila Arich-Landkof <
eila@orielresearch.org> wrote:

> Hello all,
>
>
> I am using Python code to run my pipeline, similar to the following:
>
> options = PipelineOptions()
> google_cloud_options = options.view_as(GoogleCloudOptions)
> google_cloud_options.project = 'my-project-id'
> google_cloud_options.job_name = 'myjob'
> google_cloud_options.staging_location = 'gs://your-bucket-name-here/staging'
> google_cloud_options.temp_location = 'gs://your-bucket-name-here/temp'
> options.view_as(StandardOptions).runner = 'DataflowRunner'
>
>
>
> I would like to add the *pandas-gbq* package to my workers. What would
> be the recommended way to do so? Can I add it to the PipelineOptions()?
> I remember there are a few options; one of them was creating a
> requirements text file, but I cannot remember where I saw it, or whether
> it is the simplest way when running the pipeline from Datalab.
>
> Thank you for any reference!
>
> --
> Eila
> www.orielresearch.org
> https://www.meetup.com/Deep-Learning-In-Production/
>
>
>