Posted to user@beam.apache.org by Dmitry Demeshchuk <dm...@postmates.com> on 2017/06/06 06:01:39 UTC

Installing non-native Python dependencies in Dataflow

Hi again, folks,

How should I go about installing Python packages that need to be built
and/or require native dependencies such as shared libraries?

I guess I could potentially build the C-based modules against the same
kernel and glibc versions that Dataflow is running, but it doesn't seem like
there's any way to install shared libraries on these boxes, right?

Thanks!

-- 
Best regards,
Dmitry Demeshchuk.

Re: Installing non-native Python dependencies in Dataflow

Posted by Dmitry Demeshchuk <dm...@postmates.com>.
Thanks for all your help, Ahmet!

Comments inline.

On Thu, Jun 8, 2017 at 6:32 PM, Ahmet Altay <al...@google.com> wrote:

> Thank you for the update, some questions inline.
>
> On Thu, Jun 8, 2017 at 6:21 PM, Dmitry Demeshchuk <dm...@postmates.com>
> wrote:
>
>> FYI, I tried to install a psycopg2 wheel from a file using the
>> "extra_packages" argument (although wheel installation is apparently
>> still an experimental feature), but this led to UCS-2 vs UCS-4
>> compatibility issues (it looks like the Dataflow version of Python is
>> using UCS-2, while wheels for Linux generally use UCS-4).
>>
>
> What is the UCS-2 vs UCS-4 problem, and what is the compatibility issue?
>

Basically, whenever I tried to import psycopg2 inside a module, the
pipeline died with the following error:

/usr/local/lib/python2.7/dist-packages/psycopg2/_psycopg.so: undefined symbol: PyUnicodeUCS2_DecodeUTF8

This issue is explained in the official Python FAQ:
https://docs.python.org/2.7/faq/extending.html#when-importing-module-x-why-do-i-get-undefined-symbol-pyunicodeucs2

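During Python compilation, there's a ./configure option that specifies how
many bytes are used per Unicode character. My initial guess was that
Dataflow's Python uses 2, but according to that FAQ entry, an undefined
PyUnicodeUCS2 symbol means the reverse: the interpreter was built with
4-byte Unicode and the extension was compiled against a 2-byte build.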
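
One way to check which Unicode width a given interpreter was built with is
to look at sys.maxunicode. The sketch below is only an illustration: the
helper name unicode_build is hypothetical, and it assumes you can run a
snippet both where the extension was compiled and on a worker.

```python
import sys


def unicode_build(maxunicode=None):
    """Classify a CPython build by the largest code point it can store."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # Narrow (UCS-2) builds of Python 2 report 0xFFFF; wide (UCS-4) builds
    # and every Python 3.3+ interpreter report 0x10FFFF.
    return "UCS-4" if maxunicode > 0xFFFF else "UCS-2"
```

Calling unicode_build() with no argument reports the running interpreter.
If the two environments disagree, a compiled extension built in one will
most likely fail to import in the other with an undefined PyUnicodeUCS2 or
PyUnicodeUCS4 symbol. For Python 2.7 wheels, the ABI tag in the file name
(cp27m for narrow, cp27mu for wide) carries the same information.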



>
>
>
>>
>> What ended up working for me ultimately, though, is an approach similar
>> to juliaset, with a few small differences:
>> https://gist.github.com/doubleyou/27bf3abb0fc77a2bc9257e6adc5cfe8f
>>
>> Note two things here:
>>
>> 1. We import the "install" class from setuptools, not from distutils.
>> This, in fact, has been the core problem for me. I haven't yet checked
>> whether the juliaset example works for me at all, but I strongly suspect
>> that it may not, precisely because of this issue.
>>
>
> Please let us know if juliaset does not work for you as is.
>

Will do! I'll try to find some time to test it out tomorrow.


>
>
>>
>> 2. We handle commands in a simpler fashion, by just using one single
>> class.
>>
>> I'll make a Jira ticket later today or tomorrow to reflect my findings,
>> maybe make a pull request if I confirm that juliaset is not universally
>> working either, if that's fine.
>>
>
> It would be great if you can share this information in a JIRA issue.
> Juliaset is only an example of running commands at setup time, it does not
> globally solve all possible issues.
>

Sounds good. I'll create a JIRA issue once I have enough information to
provide and a clean reproduction case.


>
> Ahmet
>
>


-- 
Best regards,
Dmitry Demeshchuk.

Re: Installing non-native Python dependencies in Dataflow

Posted by Ahmet Altay <al...@google.com>.
Thank you for the update, some questions inline.

On Thu, Jun 8, 2017 at 6:21 PM, Dmitry Demeshchuk <dm...@postmates.com>
wrote:

> FYI, I tried to install a psycopg2 wheel from a file using the
> "extra_packages" argument (although wheel installation is apparently
> still an experimental feature), but this led to UCS-2 vs UCS-4
> compatibility issues (it looks like the Dataflow version of Python is
> using UCS-2, while wheels for Linux generally use UCS-4).
>

What is the UCS-2 vs UCS-4 problem, and what is the compatibility issue?



>
> What ended up working for me ultimately, though, is an approach similar to
> juliaset, with a few small differences:
> https://gist.github.com/doubleyou/27bf3abb0fc77a2bc9257e6adc5cfe8f
>
> Note two things here:
>
> 1. We import the "install" class from setuptools, not from distutils.
> This, in fact, has been the core problem for me. I haven't yet checked
> whether the juliaset example works for me at all, but I strongly suspect
> that it may not, precisely because of this issue.
>

Please let us know if juliaset does not work for you as is.


>
> 2. We handle commands in a simpler fashion, by just using one single class.
>
> I'll make a Jira ticket later today or tomorrow to reflect my findings,
> maybe make a pull request if I confirm that juliaset is not universally
> working either, if that's fine.
>

It would be great if you could share this information in a JIRA issue.
Juliaset is only an example of running commands at setup time; it does not
solve all possible issues.

Ahmet



Re: Installing non-native Python dependencies in Dataflow

Posted by Dmitry Demeshchuk <dm...@postmates.com>.
FYI, I tried to install a psycopg2 wheel from a file using the
"extra_packages" argument (although wheel installation is apparently
still an experimental feature), but this led to UCS-2 vs UCS-4
compatibility issues (it looks like the Dataflow version of Python is
using UCS-2, while wheels for Linux generally use UCS-4).
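
For reference, a rough sketch of how dependency-staging options might be
assembled before being handed to a pipeline. The helper dependency_args and
the paths in the usage note are hypothetical. The flag names
(--extra_package, --setup_file, --requirements_file) are the Beam Python SDK
pipeline options as far as I know, so please check them against the SDK
version you are running.

```python
def dependency_args(extra_packages=(), setup_file=None, requirements_file=None):
    """Build command-line flags that stage Python dependencies for a pipeline."""
    args = []
    for path in extra_packages:
        # One flag per locally built package file.
        args.append("--extra_package=" + path)
    if setup_file:
        # A setup.py can run custom commands while it is installed.
        args.append("--setup_file=" + setup_file)
    if requirements_file:
        # Plain PyPI dependencies listed in a requirements file.
        args.append("--requirements_file=" + requirements_file)
    return args
```

For example, dependency_args(extra_packages=["dist/mypkg-0.1.tar.gz"])
yields a single --extra_package flag that can be appended to the rest of the
pipeline arguments.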

What ended up working for me ultimately, though, is an approach similar to
juliaset, with a few small differences:
https://gist.github.com/doubleyou/27bf3abb0fc77a2bc9257e6adc5cfe8f

Note two things here:

1. We import the "install" class from setuptools, not from distutils. This,
in fact, has been the core problem for me. I haven't yet checked whether the
juliaset example works for me at all, but I strongly suspect that it may
not, precisely because of this issue.

2. We handle commands in a simpler fashion, by just using one single class.
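
The two points above can be sketched roughly as follows. This is not the
contents of the linked gist, only an illustration under assumptions: the
names CUSTOM_COMMANDS, run_custom_commands, make_install_command and
InstallWithNativeDeps are hypothetical, and the apt-get commands assume a
Debian-based worker image with root access. The distutils "install" command
does not accept --single-version-externally-managed, which is a setuptools
option, and that matches the pip error quoted further down in this thread.

```python
import subprocess

# Hypothetical commands; libpq-dev is the package mentioned in this thread.
CUSTOM_COMMANDS = [
    ["apt-get", "update"],
    ["apt-get", "--assume-yes", "install", "libpq-dev"],
]


def run_custom_commands(commands, runner=subprocess.check_call):
    """Run each command in order; check_call raises on a non-zero exit."""
    for command in commands:
        runner(command)


def make_install_command():
    """Build the command class lazily, so setuptools is only needed when packaging."""
    # Import from setuptools, not distutils.
    from setuptools.command.install import install

    class InstallWithNativeDeps(install):
        """Single command class: native dependencies first, then the normal install."""

        def run(self):
            run_custom_commands(CUSTOM_COMMANDS)
            install.run(self)

    return InstallWithNativeDeps


# In setup.py this would be wired up roughly as:
#   setuptools.setup(name="my_pipeline", version="0.0.1",
#                    packages=setuptools.find_packages(),
#                    cmdclass={"install": make_install_command()})
```

Hooking "install" rather than "build" means the commands run where the
package is installed, which is the worker rather than the machine that
submits the job.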

I'll file a Jira ticket later today or tomorrow to reflect my findings, and
maybe make a pull request if I confirm that juliaset doesn't work
universally either, if that's fine.

On Tue, Jun 6, 2017 at 8:46 PM, Dmitry Demeshchuk <dm...@postmates.com>
wrote:

> Yeah, I wasn't really pinning it myself, it's one of the dependency
> packages that depends on that specific version.
>
> Thanks for the information, I'll try to explicitly install 33.1.1 and see
> if it changes anything.
>
> On Tue, Jun 6, 2017 at 7:13 PM, Ahmet Altay <al...@google.com> wrote:
>
>> Pinning setuptools is generally not a good practice. The reason is that at
>> installation time it might cause removal of the setuptools that is
>> being used to install packages.
>>
>> FWIW, Dataflow workers should have setuptools 33.1.1, which was released
>> on 2017/01/16.
>>
>> Ahmet
>>
>> On Tue, Jun 6, 2017 at 6:53 PM, Dmitry Demeshchuk <dm...@postmates.com>
>> wrote:
>>
>>> Thanks, Ahmet, it really turned out that Stackdriver had more logs than
>>> just the Dataflow logs section.
>>>
>>> So, I ended up seeing this output from the step that fails consistently:
>>>
>>> I    Running setup.py install for dataflow: started
>>> I      Running setup.py install for dataflow: finished with status 'error'
>>> I      Complete output from command /usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-bXyST4-build/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-sHw6oI-record/install-record.txt --single-version-externally-managed --compile:
>>> I      usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
>>> I         or: -c --help [cmd1 cmd2 ...]
>>> I         or: -c --help-commands
>>> I         or: -c cmd --help
>>> I
>>> I      error: option --single-version-externally-managed not recognized
>>> I
>>> I      ----------------------------------------
>>> I  Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-bXyST4-build/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-sHw6oI-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-bXyST4-build/
>>> I  /usr/local/bin/pip failed with exit status 1
>>>
>>>
>>> This seems to mean that the natively installed setuptools is too old,
>>> and the command was generated by a newer version of setuptools
>>> (specifically, my project has setuptools==36.0.1 as a dependency of some
>>> package). I'm still digging through the Stackdriver logs, but so far I
>>> couldn't find the exact reason for the failure.
>>>
>>> Also talking to the Dataflow folks, maybe they'll have a better idea.
>>> I'll also try to compare this to the output of successful pipelines and see
>>> if it gives me any ideas.
>>>
>>> Thank you.
>>>
>>> On Tue, Jun 6, 2017 at 4:40 PM, Ahmet Altay <al...@google.com> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Jun 6, 2017 at 2:07 PM, Dmitry Demeshchuk <dmitry@postmates.com
>>>> > wrote:
>>>>
>>>>> Hi Ahmet,
>>>>>
>>>>> Thanks a lot for pointing out that doc, I somehow missed it from the
>>>>> official Python SDK page!
>>>>>
>>>>> One thing that comes to mind is that one should generally use the
>>>>> 'install' command in setuptools, not 'build' as is done in the
>>>>> juliaset example:
>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/juliaset/setup.py#L113
>>>>> The reason is that the 'build' step seems to be executed on the original
>>>>> machine, not inside the runner's containers, while 'install' is triggered
>>>>> inside them. If I run a pipeline that uses setup.py with a "build" step,
>>>>> it fails because it cannot run "apt-get install libpq-dev" on a Mac.
>>>>>
>>>>
>>>> Thank you. I believe this example should work similarly with install
>>>> commands. Also, if possible, please file a JIRA issue with your ideas and
>>>> we can work on improving things.
>>>>
>>>>
>>>>>
>>>>> I'm still trying to make it work with either build or install steps,
>>>>> talking to the Dataflow folks in parallel to get more understanding of what
>>>>> I'm doing wrong (Dataflow doesn't send out installation failure logs to
>>>>> Stackdriver, only runtime logs, so it seems).
>>>>>
>>>>
>>>> Have you tried looking at the worker-startup logs? All of the logs should
>>>> be in Stackdriver.
>>>>
>>>>
>>>>>
>>>>> On Tue, Jun 6, 2017 at 9:21 AM, Ahmet Altay <al...@google.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Please see Managing Python Pipeline Dependencies [1] for various ways
>>>>>> of installing additional dependencies. The section on non-Python
>>>>>> dependencies is relevant to your question.
>>>>>>
>>>>>> Thank you,
>>>>>> Ahmet
>>>>>>
>>>>>> [1] https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
>>>>>>
>>>>>> On Mon, Jun 5, 2017 at 11:52 PM, Morand, Sebastien <
>>>>>> sebastien.morand@veolia.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Interested too. It would be nice, for instance, to add an SFTP
>>>>>>> BoundedSource, but that requires compiling paramiko against the SSL
>>>>>>> library (and so installing ssl-dev).
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> *Sébastien MORAND*
>>>>>>>
>>>>>>> On 6 June 2017 at 08:01, Dmitry Demeshchuk <dm...@postmates.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi again, folks,
>>>>>>>>
>>>>>>>> How should I go about installing Python packages that require to be
>>>>>>>> built and/or require native dependencies like shared libraries or such?
>>>>>>>>
>>>>>>>> I guess, I could potentially build the C-based modules using the
>>>>>>>> same version of kernel and glibc that Dataflow is running, but doesn't seem
>>>>>>>> like there's any way to install shared libraries at these boxes, right?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>> Dmitry Demeshchuk.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Dmitry Demeshchuk.
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Dmitry Demeshchuk.
>>>
>>
>>
>
>
> --
> Best regards,
> Dmitry Demeshchuk.
>



-- 
Best regards,
Dmitry Demeshchuk.

Re: Installing non-native Python dependencies in Dataflow

Posted by Dmitry Demeshchuk <dm...@postmates.com>.
Yeah, I wasn't pinning it myself; one of my dependencies requires that
specific version.

Thanks for the information, I'll try to explicitly install 33.1.1 and see
if it changes anything.
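
To find out which installed package carries such a pin, something along
these lines may help. It is a sketch under assumptions: it relies on
importlib.metadata (Python 3.8+, so it would not run on the 2.7 workers
discussed here), and the function names are hypothetical.

```python
import re
from importlib import metadata

# Matches "setuptools==X" and the older "setuptools (==X)" spelling, but not
# other projects such as "setuptools-scm".
_PIN = re.compile(r"^\s*setuptools\s*(?:\(\s*)?==\s*([^\s;,)]+)", re.IGNORECASE)


def pinned_setuptools_version(requirement):
    """Return the pinned version if the requirement is setuptools==X, else None."""
    match = _PIN.match(requirement)
    return match.group(1) if match else None


def distributions_pinning_setuptools():
    """Map each installed distribution name to the setuptools version it pins."""
    pins = {}
    for dist in metadata.distributions():
        # dist.requires is None when a distribution declares no requirements.
        for requirement in dist.requires or []:
            version = pinned_setuptools_version(requirement)
            if version is not None:
                pins[dist.metadata["Name"]] = version
    return pins
```

Running distributions_pinning_setuptools() in the environment used to build
the pipeline lists the packages that would force a particular setuptools
version onto the workers.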

On Tue, Jun 6, 2017 at 7:13 PM, Ahmet Altay <al...@google.com> wrote:

> Pinning setuptools is generally not a good practice. The reason is that at
> installation time it might cause removal of the setuptools that is
> being used to install packages.
>
> FWIW, Dataflow workers should have setuptools 33.1.1, which was released
> on 2017/01/16.
>
> Ahmet


-- 
Best regards,
Dmitry Demeshchuk.

Re: Installing non-native Python dependencies in Dataflow

Posted by Ahmet Altay <al...@google.com>.
Pinning setuptools is generally not a good practice. The reason is that, at
installation time, it might remove the setuptools that is being used to
install packages.

FWIW, Dataflow workers should have setuptools 33.1.1, which was released on
2017/01/16.
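
To spot such a pin before submitting a job, something like the following could be run over the output of `pip freeze` or a requirements file. This is only a sketch: the helper name `find_setuptools_pins` is made up for illustration, and it only recognises exact `==` pins.

```python
import re

# Matches requirement lines such as "setuptools==36.0.1" (an exact pin).
# Other packages whose names merely start with "setuptools" do not match.
_PIN = re.compile(r"^\s*setuptools\s*==\s*([^\s;#]+)", re.IGNORECASE)


def find_setuptools_pins(requirement_lines):
    """Return the versions to which setuptools is pinned with '=='."""
    pins = []
    for line in requirement_lines:
        match = _PIN.match(line)
        if match:
            pins.append(match.group(1))
    return pins


# Hypothetical requirement lines, using the version mentioned in this thread.
example = ["psycopg2", "setuptools==36.0.1", "six>=1.9"]
print(find_setuptools_pins(example))
```

If the list is non-empty, the pin probably comes from one of the project's dependencies and is worth relaxing or removing there.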

Ahmet

On Tue, Jun 6, 2017 at 6:53 PM, Dmitry Demeshchuk <dm...@postmates.com>
wrote:

> Thanks, Ahmet, it really turned out that Stackdriver had more logs than
> just the Dataflow logs section.
>
> So, I ended up seeing this code that fails constantly:
>
> I    Running setup.py install for dataflow: started
> I      Running setup.py install for dataflow: finished with status 'error'
> I      Complete output from command /usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-bXyST4-build/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-sHw6oI-record/install-record.txt --single-version-externally-managed --compile:
> I      usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
> I         or: -c --help [cmd1 cmd2 ...]
> I         or: -c --help-commands
> I         or: -c cmd --help
> I
> I      error: option --single-version-externally-managed not recognized
> I
> I      ----------------------------------------
> I  Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-bXyST4-build/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-sHw6oI-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-bXyST4-build/
> I  /usr/local/bin/pip failed with exit status 1
>
>
> This seems to mean that the natively installed setuptools are too old, and
> the new command has been generated with a newer version of setuptools
> (specifically, my project has setuptools==36.0.1 as a dependency of some
> package). I'm still digging more through the Stackdriver logs but so far
> couldn't find out the exact reason of the failure.
>
> Also talking to the Dataflow folks, maybe they'll have a better idea. I'll
> also try to compare this to the output of successful pipelines and see if
> it gives me any ideas.
>
> Thank you.

Re: Installing non-native Python dependencies in Dataflow

Posted by Dmitry Demeshchuk <dm...@postmates.com>.
Thanks, Ahmet. It turned out that Stackdriver had more logs than just the
Dataflow logs section.

I ended up finding this output from an installation step that fails every time:

I    Running setup.py install for dataflow: started
I      Running setup.py install for dataflow: finished with status 'error'
I      Complete output from command /usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-bXyST4-build/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-sHw6oI-record/install-record.txt --single-version-externally-managed --compile:
I      usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
I         or: -c --help [cmd1 cmd2 ...]
I         or: -c --help-commands
I         or: -c cmd --help
I
I      error: option --single-version-externally-managed not recognized
I
I      ----------------------------------------
I  Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-bXyST4-build/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-sHw6oI-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-bXyST4-build/
I  /usr/local/bin/pip failed with exit status 1


This seems to mean that the setuptools preinstalled on the workers is too
old, and that the install command was generated by a newer version of
setuptools (specifically, one of my project's dependencies requires
setuptools==36.0.1). I'm still digging through the Stackdriver logs, but so
far I couldn't find the exact reason for the failure.
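
One way to test that guess would be to log the setuptools version that is actually installed and compare it with the one the project asks for. The sketch below assumes Python 3.8+ (for `importlib.metadata`), which is newer than the Python 2.7 workers in this thread; the helper names are made up for illustration, and the comparison only handles plain dotted numeric versions.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn '33.1.1' into (33, 1, 1); non-numeric parts are ignored."""
    parts = []
    for piece in version.split("."):
        if piece.isdigit():
            parts.append(int(piece))
    return tuple(parts)


def is_older(installed, required):
    """True if the installed version sorts before the required one."""
    return version_tuple(installed) < version_tuple(required)


# Report what this environment has; the result depends on the machine.
print("setuptools installed:", installed_version("setuptools"))
```

Calling `is_older(found, "36.0.1")` with the logged value would then confirm or rule out the version mismatch.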

Also talking to the Dataflow folks, maybe they'll have a better idea. I'll
also try to compare this to the output of successful pipelines and see if
it gives me any ideas.

Thank you.

On Tue, Jun 6, 2017 at 4:40 PM, Ahmet Altay <al...@google.com> wrote:

>
>
> On Tue, Jun 6, 2017 at 2:07 PM, Dmitry Demeshchuk <dm...@postmates.com>
> wrote:
>
>> Hi Ahmet,
>>
>> Thanks a lot for pointing out that doc, I somehow missed it from the
>> official Python SDK page!
>>
>> One thing that comes to my mind is that generally one should probably use
>> the 'install' command in setuptools, not 'build', like it's done in
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/juliaset/setup.py#L113.
>> Reason being, the
>> 'build' step seems to be executed on the original machine, not inside the
>> runner's containers, while 'install' will be triggered inside of them. If I
>> run a pipeline that uses setup.py with a "build" step, it fails due to
>> being unable to "apt-get install libpq-dev" on a mac.
>>
>
> Thank you. This example should similarly work in install commands I
> believe. Also, if possible please file a JIRA issue with your ideas and we
> can work on improving things.
>
>
>>
>> I'm still trying to make it work with either build or install steps,
>> talking to the Dataflow folks in parallel to get more understanding of what
>> I'm doing wrong (Dataflow doesn't send out installation failure logs to
>> Stackdriver, only runtime logs, so it seems).
>>
>
> Have you tried looking worker-startup logs? All of the logs should be in
> stackdriver.
>


-- 
Best regards,
Dmitry Demeshchuk.

Re: Installing non-native Python dependencies in Dataflow

Posted by Ahmet Altay <al...@google.com>.
On Tue, Jun 6, 2017 at 2:07 PM, Dmitry Demeshchuk <dm...@postmates.com>
wrote:

> Hi Ahmet,
>
> Thanks a lot for pointing out that doc, I somehow missed it from the
> official Python SDK page!
>
> One thing that comes to my mind is that generally one should probably use
> the 'install' command in setuptools, not 'build', like it's done in
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/juliaset/setup.py#L113.
> Reason being, the
> 'build' step seems to be executed on the original machine, not inside the
> runner's containers, while 'install' will be triggered inside of them. If I
> run a pipeline that uses setup.py with a "build" step, it fails due to
> being unable to "apt-get install libpq-dev" on a mac.
>

Thank you. I believe this example should work similarly with the install
command. Also, if possible, please file a JIRA issue with your ideas and we
can work on improving things.


>
> I'm still trying to make it work with either build or install steps,
> talking to the Dataflow folks in parallel to get more understanding of what
> I'm doing wrong (Dataflow doesn't send out installation failure logs to
> Stackdriver, only runtime logs, so it seems).
>

Have you tried looking at the worker-startup logs? All of the logs should be
in Stackdriver.



Re: Installing non-native Python dependencies in Dataflow

Posted by Dmitry Demeshchuk <dm...@postmates.com>.
Hi Ahmet,

Thanks a lot for pointing out that doc, I somehow missed it on the
official Python SDK page!

One thing that comes to mind is that one should probably use the 'install'
command in setuptools, not 'build' as is done in
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/juliaset/setup.py#L113.
The reason is that the 'build' step seems to be executed on the original
machine, not inside the runner's containers, while 'install' will be
triggered inside of them. If I run a pipeline that uses a setup.py with a
"build" step, it fails because it cannot "apt-get install libpq-dev" on a Mac.

I'm still trying to make it work with either the build or the install step,
and I'm talking to the Dataflow folks in parallel to better understand what
I'm doing wrong (it seems that Dataflow doesn't send installation failure
logs to Stackdriver, only runtime logs).
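
For reference, the part of such a setup.py that runs the system-level commands could look roughly like the sketch below. It is an assumption-laden illustration, not the juliaset code: `CUSTOM_COMMANDS` and `run_custom_commands` are hypothetical names, and the `run()` method of a custom setuptools command (whether hooked into 'build' or 'install') would be what calls the function.

```python
import subprocess

# Hypothetical command list; libpq-dev is the package named in this message.
# These only make sense on a Debian-like worker image, not on a Mac.
CUSTOM_COMMANDS = [
    ["apt-get", "update"],
    ["apt-get", "--assume-yes", "install", "libpq-dev"],
]


def run_custom_commands(commands, runner=subprocess.check_call):
    """Run each command in order; the first failure raises and stops the rest."""
    for command in commands:
        print("Running: %s" % " ".join(command))
        runner(command)
```

The `runner` parameter exists only so the function can be exercised without touching the system, for example by passing a function that records the commands instead of executing them.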

On Tue, Jun 6, 2017 at 9:21 AM, Ahmet Altay <al...@google.com> wrote:

> Hi,
>
> Please see Managing Python Pipeline Dependencies [1] for various ways on
> installing additional dependencies. The section on non-python dependencies
> is relevant to your question.
>
> Thank you,
> Ahmet
>
> [1] https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/


-- 
Best regards,
Dmitry Demeshchuk.

Re: Installing non-native Python dependencies in Dataflow

Posted by Ahmet Altay <al...@google.com>.
Hi,

Please see Managing Python Pipeline Dependencies [1] for various ways of
installing additional dependencies. The section on non-Python dependencies
is relevant to your question.

Thank you,
Ahmet

[1] https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/

On Mon, Jun 5, 2017 at 11:52 PM, Morand, Sebastien <
sebastien.morand@veolia.com> wrote:

> Hi,
>
> Interested too. For instance, it would be nice to add an SFTP
> BoundedSource, but that requires compiling paramiko against the SSL library
> (and therefore installing ssl-dev).
>
> Regards,
>
> *Sébastien MORAND*
> Team Lead Solution Architect
> Technology & Operations / Digital Factory
> Veolia - Group Information Systems & Technology (IS&T)
> Cell.: +33 7 52 66 20 81 / Direct: +33 1 85 57 71 08
> <+33%201%2085%2057%2071%2008>
> Bureau 0144C (Ouest)
> 30, rue Madeleine-Vionnet - 93300 Aubervilliers, France
> *www.veolia.com <http://www.veolia.com>*
> <http://www.veolia.com>
> <https://www.facebook.com/veoliaenvironment/>
> <https://www.youtube.com/user/veoliaenvironnement>
> <https://www.linkedin.com/company/veolia-environnement>
> <https://twitter.com/veolia>
>
> On 6 June 2017 at 08:01, Dmitry Demeshchuk <dm...@postmates.com> wrote:
>
>> Hi again, folks,
>>
>> How should I go about installing Python packages that need to be built
>> and/or require native dependencies such as shared libraries?
>>
>> I guess I could build the C-based modules against the same kernel and
>> glibc versions that Dataflow is running, but there doesn't seem to be
>> any way to install shared libraries on these boxes, right?
>>
>> Thanks!
>>
>> --
>> Best regards,
>> Dmitry Demeshchuk.
>>
>

Re: Installing non-native Python dependencies in Dataflow

Posted by "Morand, Sebastien" <se...@veolia.com>.
Hi,

I'm interested too. For instance, it would be nice to add an SFTP
BoundedSource, but that requires compiling paramiko against the SSL library
(and therefore installing ssl-dev).

Regards,

*Sébastien MORAND*
Team Lead Solution Architect
Technology & Operations / Digital Factory
Veolia - Group Information Systems & Technology (IS&T)
Cell.: +33 7 52 66 20 81 / Direct: +33 1 85 57 71 08
Bureau 0144C (Ouest)
30, rue Madeleine-Vionnet - 93300 Aubervilliers, France
*www.veolia.com <http://www.veolia.com>*

On 6 June 2017 at 08:01, Dmitry Demeshchuk <dm...@postmates.com> wrote:

> Hi again, folks,
>
> How should I go about installing Python packages that need to be built
> and/or require native dependencies such as shared libraries?
>
> I guess I could build the C-based modules against the same kernel and
> glibc versions that Dataflow is running, but there doesn't seem to be
> any way to install shared libraries on these boxes, right?
>
> Thanks!
>
> --
> Best regards,
> Dmitry Demeshchuk.
>

-- 

--------------------------------------------------------------------------------------------
This e-mail transmission (message and any attached files) may contain 
information that is proprietary, privileged and/or confidential to Veolia 
Environnement and/or its affiliates and is intended exclusively for the 
person(s) to whom it is addressed. If you are not the intended recipient, 
please notify the sender by return e-mail and delete all copies of this 
e-mail, including all attachments. Unless expressly authorized, any use, 
disclosure, publication, retransmission or dissemination of this e-mail 
and/or of its attachments is strictly prohibited. 

--------------------------------------------------------------------------------------------