Posted to issues@spark.apache.org by "Ruslan Dautkhanov (JIRA)" <ji...@apache.org> on 2018/10/22 18:53:00 UTC

[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

    [ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659374#comment-16659374 ] 

Ruslan Dautkhanov commented on SPARK-13587:
-------------------------------------------

We're using conda environments shared across worker nodes through NFS. Has anyone used something like this?

Another option that's closer to this jira's description is to pack the environment with `conda-pack` and distribute it on YARN with spark-submit's `--archives` option:

{code:bash}
$ PYSPARK_DRIVER_PYTHON=`which python` \
PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--master yarn \
--deploy-mode client \
--archives environment.tar.gz#environment \
script.py
{code}

More details - https://conda.github.io/conda-pack/spark.html
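
One detail in the command above that is easy to miss: the name after `#` in `--archives` is the directory YARN unpacks the archive into, so the `PYSPARK_PYTHON` path has to start with that same name. Below is a minimal sketch that makes the coupling explicit by building the same command from one alias. The helper `build_submit` and its parameters are hypothetical names for illustration, not part of Spark or conda-pack, and `sys.executable` stands in for `which python` as an assumption.

```python
import shlex
import sys


def build_submit(archive="environment.tar.gz", alias="environment", script="script.py"):
    """Return (env, argv) for a spark-submit call that ships a packed conda env via YARN.

    Hypothetical helper: it only assembles the command shown above, it does not run it.
    """
    # YARN unpacks the archive into a directory named after the part following '#',
    # so the worker-side interpreter path must start with that alias.
    worker_python = f"./{alias}/bin/python"
    env = {
        # Assumption: the driver uses the interpreter running this helper,
        # in place of `which python` from the original command.
        "PYSPARK_DRIVER_PYTHON": sys.executable,
        "PYSPARK_PYTHON": worker_python,
    }
    argv = [
        "spark-submit",
        "--conf", f"spark.yarn.appMasterEnv.PYSPARK_PYTHON={worker_python}",
        "--master", "yarn",
        "--deploy-mode", "client",
        "--archives", f"{archive}#{alias}",
        script,
    ]
    return env, argv


env, argv = build_submit()
# Print the command in a copy-pastable, shell-quoted form.
print(" ".join(f"{k}={shlex.quote(v)}" for k, v in env.items()), shlex.join(argv))
```

To actually launch the job, the result could be passed to `subprocess.run(argv, env={**os.environ, **env})`. Changing `alias` in one place keeps the archive name and the interpreter path consistent.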



> Support virtualenv in PySpark
> -----------------------------
>
>                 Key: SPARK-13587
>                 URL: https://issues.apache.org/jira/browse/SPARK-13587
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>    Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>            Priority: Major
>
> Currently, it's not easy for users to add third-party Python packages in PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not for complicated ones, especially those with transitive dependencies).
> * Another way is to install packages manually on each node (time-consuming, and not easy to switch between environments).
> Python now has 2 different virtualenv implementations: one is native virtualenv, the other is through conda. This jira is trying to migrate these 2 tools to a distributed environment.



