Posted to user@spark.apache.org by Anwar AliKhan <an...@gmail.com> on 2020/06/06 20:16:07 UTC

Add python library

 " > Have you looked into this article?
https://medium.com/@SSKahani/pyspark-applications-dependencies-99415e0df987
 "

This is weird!
I was browsing https://machinelearningmastery.com/start-here/ when I came
across this post.

The weird part is that I was just wondering how I can take one of the
projects (the OpenAI Gym taxi-vt2 project, in Python), a project I want to
develop further.

I want to run it on Spark, using Spark's parallelism features and GPU
capabilities when I am working with bigger datasets, with the workers
(slaves) that do the sliced-dataset computations installed on the new 8GB
RAM Raspberry Pi (Linux).

Are there any documents on the official website, or in any other location,
which show how to do that, preferably with full, self-contained examples?
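
A minimal, untested sketch of one way to drive a standalone cluster from
PySpark and fan work out to its workers; the spark://pi-master:7077 address,
the run_slice() body, and the partition count are made-up placeholders, and
GPU scheduling is not covered here:

from pyspark.sql import SparkSession

# Hypothetical standalone master URL; replace with your own cluster's address.
spark = (
    SparkSession.builder
    .master("spark://pi-master:7077")
    .appName("taxi-v2-experiments")
    .getOrCreate()
)
sc = spark.sparkContext

def run_slice(seed):
    # Placeholder for one slice of the computation (e.g. one training run).
    return seed * seed

# Spread 100 slices across 8 partitions so the workers share the load.
results = sc.parallelize(range(100), numSlices=8).map(run_slice).collect()
print(results[:5])
spark.stop()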



On Fri, 5 Jun 2020, 09:02 Dark Crusader, <re...@gmail.com>
wrote:

> Hi Stone,
>
>
> I haven't tried it with .so files; however, I did use the approach he
> recommends to install my other dependencies.
> I hope it helps.
>
> On Fri, Jun 5, 2020 at 1:12 PM Stone Zhong <st...@gmail.com> wrote:
>
>> Hi,
>>
>> So my PySpark app depends on some Python libraries. That is not a problem:
>> I pack all the dependencies into a file libs.zip and then call
>> *sc.addPyFile("libs.zip")*, and it worked pretty well for a while.
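
A minimal sketch of that zip-based approach, with a made-up package name
standing in for whatever pure-Python dependencies are bundled inside libs.zip:

from pyspark import SparkContext

sc = SparkContext(appName="libs-zip-example")
sc.addPyFile("libs.zip")   # ships libs.zip to the executors and adds it to their PYTHONPATH

def uses_dependency(x):
    import my_pure_python_dep            # hypothetical pure-Python package inside libs.zip
    return my_pure_python_dep.run(x)     # hypothetical function

print(sc.parallelize(range(3)).map(uses_dependency).collect())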
>>
>> Then I encountered a problem: if any of my libraries has a binary file
>> dependency (like .so files), this approach does not work, mainly because
>> when you put a zip file on PYTHONPATH, Python does not look up a needed
>> binary library (e.g. a .so file) inside the zip file; this is a Python
>> *limitation*. So I have a workaround:
>>
>> 1) Do not call sc.addPyFile; instead, extract libs.zip into the current
>> directory
>> 2) When my Python code starts, manually call *sys.path.insert(0,
>> f"{os.getcwd()}/libs")* to extend the Python path (a sketch of both steps
>> follows below)
>>
>> This workaround works well for me. Then I hit another problem: what if my
>> code running in an executor needs a Python library that has binary code?
>> Below is an example:
>>
>> def do_something(p):
>>     ...
>>
>> rdd = sc.parallelize([
>>     {"x": 1, "y": 2},
>>     {"x": 2, "y": 3},
>>     {"x": 3, "y": 4},
>> ])
>> a = rdd.map(do_something)
>>
>> What if the function "do_something" needs a Python library that has
>> binary code? My current solution is to extract libs.zip into an NFS share
>> (or an SMB share) and manually do *sys.path.insert(0,
>> f"share_mount_dir/libs")* in my "do_something" function, but adding such
>> code to each function looks ugly. Is there a better/more elegant solution?
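
For concreteness, a rough sketch of the workaround being described; the
mount point, the native_dep package, and its transform() function are
made-up placeholders:

from pyspark import SparkContext

sc = SparkContext(appName="nfs-libs-example")

def do_something(p):
    import sys
    # The path fix is repeated inside every function that runs on an executor,
    # which is the part that looks ugly.
    libs_dir = "/mnt/shared/libs"      # hypothetical NFS/SMB mount holding the extracted libs.zip
    if libs_dir not in sys.path:
        sys.path.insert(0, libs_dir)
    import native_dep                  # hypothetical package that ships .so files
    return native_dep.transform(p)     # hypothetical function from that package

rdd = sc.parallelize([{"x": 1, "y": 2}, {"x": 2, "y": 3}, {"x": 3, "y": 4}])
a = rdd.map(do_something)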
>>
>> Thanks,
>> Stone
>>
>>

Re: Add python library

Posted by Patrick McCarthy <pm...@dstillery.com.INVALID>.
I've found that Anaconda encapsulates modules and their dependencies nicely,
and you can ship all the needed .so files by deploying a whole conda
environment.

I've used this method with success:
https://community.cloudera.com/t5/Community-Articles/Running-PySpark-with-Conda-Env/ta-p/247551
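
For reference, one common way to do this is conda-pack plus spark-submit
--archives; the sketch below is untested, the environment and archive names
are made up, and the exact steps in the linked article may differ:

# Packed outside of Python, e.g.:
#   conda pack -n my_env -o my_env.tar.gz
# then submitted with something like:
#   PYSPARK_PYTHON=./environment/bin/python \
#   spark-submit --archives my_env.tar.gz#environment app.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("conda-env-demo").getOrCreate()

def uses_numpy(x):
    import numpy as np      # resolves on the executors from the unpacked environment
    return float(np.sqrt(x))

print(spark.sparkContext.parallelize(range(4)).map(uses_numpy).collect())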


-- 


*Patrick McCarthy  *

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016