Posted to issues@spark.apache.org by "Semet (JIRA)" <ji...@apache.org> on 2017/10/24 16:47:00 UTC

[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

    [ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217221#comment-16217221 ] 

Semet edited comment on SPARK-13587 at 10/24/17 4:46 PM:
---------------------------------------------------------

Hello. For me this solution is equivalent to the "Wheelhouse" proposal I made (SPARK-16367), even without having to modify PySpark at all. I even think you can package a wheelhouse using this {{--archives}} argument.
The drawback is indeed that your spark-submit has to send this package to each node (1 to n). If PySpark supported the {{requirements.txt}}/{{Pipfile}} dependency description formats, each node would download its dependencies by itself...
The strong argument for a wheelhouse is that it only packages the libraries the project actually uses, not the complete environment. The drawback is that it may not work well with Anaconda.
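For illustration, a minimal sketch of what this could look like, under some assumptions: the file names are hypothetical, {{--archives}} is the YARN-mode spark-submit flag, and the final installation step on the executors is exactly the part that is still missing today:

{code}
# Build wheels for the project and all of its transitive dependencies
# into a local wheelhouse directory, driven by requirements.txt
pip wheel --wheel-dir ./wheelhouse -r requirements.txt

# Flatten the wheels into a single archive (-j junks directory paths)
zip -j wheelhouse.zip wheelhouse/*.whl

# Ship the archive to every executor's working directory; the
# "#wheelhouse" suffix is the directory name it is extracted under
spark-submit \
  --master yarn \
  --archives wheelhouse.zip#wheelhouse \
  my_job.py

# Each executor would then still need something like
#   pip install --no-index --find-links=wheelhouse -r requirements.txt
# which is the step a native requirements.txt/Pipfile mechanism
# would automate.
{code}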



> Support virtualenv in PySpark
> -----------------------------
>
>                 Key: SPARK-13587
>                 URL: https://issues.apache.org/jira/browse/SPARK-13587
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>            Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not for complicated ones, especially those with transitive dependencies; sketched below)
> * Another way is to install packages manually on each node (time-consuming, and hard to switch between different environments)
> Python now has two different virtualenv implementations: native virtualenv and conda. This JIRA is about bringing these two tools to a distributed environment.
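For context, a minimal sketch of the --py-files workaround mentioned in the description above (package and file names are illustrative):

{code}
# Assemble pure-Python dependencies into an archive by hand; every
# transitive dependency must be remembered, and compiled extensions
# will not work this way
pip install --target deps/ some_package
cd deps && zip -r ../deps.zip . && cd ..

# Ship the archive alongside the application
spark-submit --py-files deps.zip my_job.py
{code}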


