Posted to user@beam.apache.org by Dmitry Demeshchuk <dm...@postmates.com> on 2017/06/05 20:56:06 UTC

Practices for running Python projects on Dataflow

Hi list,

Suppose you have a private Python package that contains some code people
want to share when writing their pipelines.

So, typically, the installation process of the package would be either

pip install git+ssh://git@github.com/mycompany/mypackage#egg=mypackage

or

git clone git://git@github.com/mycompany/mypackage
python mypackage/setup.py install

Now, the problem starts when we want to get that package into Dataflow.
Right now, to my understanding, DataflowRunner supports 3 approaches:

   1. Specifying a requirements_file parameter in the pipeline options. This
      basically must be a requirements.txt file.
   2. Specifying an extra_packages parameter in the pipeline options. This
      must be a list of tarballs, each of which contains a Python package
      packaged using distutils.
   3. Specifying a setup_file parameter in the pipeline options. This will
      just run the python path/to/my/setup.py sdist command and then send
      the files over the wire.
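
As a rough illustration of how the three approaches map onto pipeline
arguments, here is a minimal sketch. It assumes the Beam Python SDK flag
names --requirements_file, --extra_package (repeated once per tarball) and
--setup_file; the helper name dependency_flags and the file names are
hypothetical and only serve the example.

```python
def dependency_flags(requirements_file=None, extra_packages=(), setup_file=None):
    """Return command-line flags for the three dependency mechanisms.

    The flag names are assumed to match the Beam Python SDK; the helper
    itself is illustrative and not part of any library.
    """
    flags = []
    if requirements_file:
        # Approach 1: dependencies listed in a requirements file.
        flags.append('--requirements_file={}'.format(requirements_file))
    for tarball in extra_packages:
        # Approach 2: one flag per locally built source tarball.
        flags.append('--extra_package={}'.format(tarball))
    if setup_file:
        # Approach 3: a setup.py describing the pipeline's own package.
        flags.append('--setup_file={}'.format(setup_file))
    return flags


if __name__ == '__main__':
    # Hypothetical file names, for illustration only.
    print(dependency_flags(
        requirements_file='requirements.txt',
        extra_packages=['dist/mypackage-1.0.tar.gz'],
        setup_file='./setup.py'))
```

The resulting list can be concatenated with the rest of the pipeline
arguments (runner, project, staging locations and so on).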

The best approach I could come up with was including an *additional*
setup.py in the package itself, so that when we install that package, the
setup.py file gets installed along with it. Then I’d point the
setup_file option to that file.

This gist
<https://gist.github.com/doubleyou/be01226352372491babda7602022c506> shows
the basic approach in code. Both setup.py and options.py are supposed to be
present in the installed package.

It kind of works for me, with some caveats, but I just wanted to find out
if there’s a more decent way to handle my situation. I’m not keen on
specifying that private package as a git dependency, because of having to
worry about git credentials, but maybe there are other ways?

Thanks!
-- 
Best regards,
Dmitry Demeshchuk.

Re: Practices for running Python projects on Dataflow

Posted by "Morand, Sebastien" <se...@veolia.com>.
OK, I have made so many changes that I no longer have the problem.

Thanks!

*Sébastien MORAND*
Team Lead Solution Architect
Technology & Operations / Digital Factory
Veolia - Group Information Systems & Technology (IS&T)
Cell.: +33 7 52 66 20 81 / Direct: +33 1 85 57 71 08
Bureau 0144C (Ouest)
30, rue Madeleine-Vionnet - 93300 Aubervilliers, France
*www.veolia.com <http://www.veolia.com>*
<http://www.veolia.com>
<https://www.facebook.com/veoliaenvironment/>
<https://www.youtube.com/user/veoliaenvironnement>
<https://www.linkedin.com/company/veolia-environnement>
<https://twitter.com/veolia>

On 6 June 2017 at 01:54, Ahmet Altay <al...@google.com> wrote:

> Sébastien, what kind of issue did you have with using setup.py with
> install_requires?

-- 

--------------------------------------------------------------------------------------------
This e-mail transmission (message and any attached files) may contain 
information that is proprietary, privileged and/or confidential to Veolia 
Environnement and/or its affiliates and is intended exclusively for the 
person(s) to whom it is addressed. If you are not the intended recipient, 
please notify the sender by return e-mail and delete all copies of this 
e-mail, including all attachments. Unless expressly authorized, any use, 
disclosure, publication, retransmission or dissemination of this e-mail 
and/or of its attachments is strictly prohibited. 

Ce message electronique et ses fichiers attaches sont strictement 
confidentiels et peuvent contenir des elements dont Veolia Environnement 
et/ou l'une de ses entites affiliees sont proprietaires. Ils sont donc 
destines a l'usage de leurs seuls destinataires. Si vous avez recu ce 
message par erreur, merci de le retourner a son emetteur et de le detruire 
ainsi que toutes les pieces attachees. L'utilisation, la divulgation, la 
publication, la distribution, ou la reproduction non expressement 
autorisees de ce message et de ses pieces attachees sont interdites.
--------------------------------------------------------------------------------------------

Re: Practices for running Python projects on Dataflow

Posted by Ahmet Altay <al...@google.com>.
Sébastien, what kind of issue did you have with using setup.py with
install_requires?

On Mon, Jun 5, 2017 at 4:44 PM, Morand, Sebastien <
sebastien.morand@veolia.com> wrote:

> Hi,
>
> I ran into trouble when using setup.py with install_requires. So I
> basically ended up with a setup.py with no install requirements inside,
> plus a requirements.txt.

Re: Practices for running Python projects on Dataflow

Posted by "Morand, Sebastien" <se...@veolia.com>.
Hi,

I ran into trouble when using setup.py with install_requires. So I
basically ended up with a setup.py with no install requirements inside,
plus a requirements.txt:

PIPELINE_OPTIONS = [
    '--project={}'.format(projectname),
    '--runner=DataflowRunner',
    '--temp_location=gs://dataflow-run/temp',
    '--staging_location=gs://dataflow-run/staging',
    '--requirements_file=requirements.txt',
    '--save_main_session',
    '--setup_file=./setup.py'
]

with setup.py:
from setuptools import setup

setup(
    name='MyProject',
    version='1.0',
    description='My Description',
    author='myself',
    author_email='me@whatever.com',
    url='http://myurl.whatever.com',
    package_dir={'': 'src'},
    packages=[
        'package1',
        'package1.subpackage1',
        'package1.subpackage2',
        'package2'
    ]
)

Regards


Re: Practices for running Python projects on Dataflow

Posted by Dmitry Demeshchuk <dm...@postmates.com>.
That doesn't leave me an option to install the package from pip, though.
So, I can't put it into a requirements.txt file, for example.

On the other hand, it's supposed to be a one-of-a-kind thing, so maybe
giving it special treatment is not too bad.

At least, your approach, unlike mine, has slightly less magic in it (we'd
only need to pass the tarball location, but that's pretty trivial to do in
a clean way).


On Mon, Jun 5, 2017 at 2:17 PM, Robert Bradshaw <ro...@google.com> wrote:

> Probably option 2 would be the cleanest approach in your case, e.g. run
>
> git clone git://git@github.com/mycompany/mypackage
> python mypackage/setup.py sdist
>
> and then specifying extra_packages=dist/mypackage.tar.gz



-- 
Best regards,
Dmitry Demeshchuk.

Re: Practices for running Python projects on Dataflow

Posted by Robert Bradshaw <ro...@google.com>.
Probably option 2 would be the cleanest approach in your case, e.g. run

git clone git://git@github.com/mycompany/mypackage
python mypackage/setup.py sdist

and then specify extra_packages=dist/mypackage.tar.gz
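
The two steps above could be scripted roughly as follows. This is only a
sketch under a few assumptions: the package has already been cloned
locally, the pipeline accepts one --extra_package flag per tarball, and the
helper names build_sdist and extra_package_flags are made up for this
example. Note that sdist normally puts the version into the tarball name
(for instance mypackage-1.0.tar.gz), so the sketch looks the file up
instead of hard-coding it.

```python
import glob
import os
import subprocess
import sys


def build_sdist(package_dir, dist_dir):
    """Run `python setup.py sdist` inside package_dir, writing to dist_dir."""
    subprocess.check_call(
        [sys.executable, 'setup.py', 'sdist',
         '--dist-dir', os.path.abspath(dist_dir)],
        cwd=package_dir)


def extra_package_flags(dist_dir):
    """Return one --extra_package flag per source tarball found in dist_dir."""
    tarballs = sorted(glob.glob(os.path.join(dist_dir, '*.tar.gz')))
    if not tarballs:
        raise ValueError('no sdist tarball found in {}'.format(dist_dir))
    return ['--extra_package={}'.format(path) for path in tarballs]
```

A launcher script would call build_sdist('mypackage', 'dist') once and then
append extra_package_flags('dist') to its pipeline arguments.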
