Posted to issues@spark.apache.org by "Fabian Höring (Jira)" <ji...@apache.org> on 2020/07/03 08:08:00 UTC

[jira] [Comment Edited] (SPARK-25433) Add support for PEX in PySpark

    [ https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150827#comment-17150827 ] 

Fabian Höring edited comment on SPARK-25433 at 7/3/20, 8:07 AM:
----------------------------------------------------------------

Yes, you can put me in the tickets. I can also contribute if you give me some guidance on what detail is expected and where to add this.

It all depends on how much detail you want to include in the documentation.

Trimmed down to one sentence, it could be enough to say that PEX is supported by tweaking the PYSPARK_PYTHON env variable, and to show some sample code:

{code}
$ pex numpy -o myarchive.pex
$ export PYSPARK_PYTHON=./myarchive.pex
{code}
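
For this to work on a real cluster the archive also has to be shipped to the executors. A minimal sketch in PySpark, assuming the pex file sits in the current working directory (using spark.files is just one way to do it):

{code}
import os
from pyspark.sql import SparkSession

# Driver and executors both run the interpreter embedded in the pex archive;
# executors resolve ./myarchive.pex in the working directory where shipped
# files are materialized.
os.environ["PYSPARK_PYTHON"] = "./myarchive.pex"

spark = (
    SparkSession.builder
    .master("yarn")
    .config("spark.files", "myarchive.pex")  # ship the archive to every executor
    .getOrCreate()
)
{code}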

On my side I iterated on this idea and wrapped all this into my own tool, which is reused by our distributed TensorFlow and PySpark jobs. I also did a simple end-to-end Docker standalone Spark example with S3 storage (via minio) for integration tests: https://github.com/criteo/cluster-pack/blob/master/examples/spark-with-S3/README.md

It could also make sense to go beyond documentation and upstream some parts of this code:
https://github.com/criteo/cluster-pack/blob/master/cluster_pack/spark/spark_config_builder.py

It is currently a hack, as it uses the private _options attribute. I admit I was lazy about getting into the Spark code again, but an easy fix would be to expose the private _options attribute to get more flexibility on SparkSession.Builder.

One could also directly expose a method add_pex_support on SparkSession.Builder, but personally I think it would clutter the code too much. All this should stay application-specific; it indeed makes more sense to include it in the doc.
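
To make this concrete, here is a rough sketch of what such a helper could look like (add_pex_support is a hypothetical name, and the conf keys shown are only one possible choice):

{code}
import os
from pyspark.sql import SparkSession

def add_pex_support(builder, pex_file):
    # Hypothetical helper, not an existing PySpark API: configure a
    # SparkSession.Builder so driver and executors run from a pex archive.
    pex_name = os.path.basename(pex_file)
    os.environ["PYSPARK_PYTHON"] = "./" + pex_name
    return (
        builder
        .config("spark.files", pex_file)                 # ship the archive
        .config("spark.executorEnv.PEX_ROOT", "./.pex")  # writable pex cache on executors
    )

spark = add_pex_support(SparkSession.builder, "myarchive.pex").getOrCreate()
{code}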






> Add support for PEX in PySpark
> ------------------------------
>
>                 Key: SPARK-25433
>                 URL: https://issues.apache.org/jira/browse/SPARK-25433
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.2.2
>            Reporter: Fabian Höring
>            Priority: Minor
>
> The goal of this ticket is to ship and use custom code inside the Spark executors using [PEX|https://github.com/pantsbuild/pex].
> This currently works fine with [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html] (the disadvantages are that you need a separate conda package repo and that you ship the Python interpreter every time).
> Basically the workflow is:
>  * zip the local conda environment ([conda pack|https://github.com/conda/conda-pack] also works)
>  * ship it to each executor as an archive
>  * point PYSPARK_PYTHON at the interpreter inside the shipped environment (see the sketch after this list)
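> A minimal sketch of this workflow in PySpark (the archive name and the "environment" alias are illustrative):
> {code}
> import os
> from pyspark.sql import SparkSession
>
> # The archive is extracted on each executor under the alias given after '#';
> # PYSPARK_PYTHON then points at the interpreter inside it.
> os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"
>
> spark = (
>     SparkSession.builder
>     .master("yarn")
>     .config("spark.yarn.dist.archives", "environment.tar.gz#environment")
>     .getOrCreate()
> )
> {code}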
> I think it can work the same way with virtualenv. There is the SPARK-13587 ticket to provide nice entry points to spark-submit and SparkContext, but zipping your local virtual env and then just changing the PYSPARK_PYTHON env variable should already work.
> I have also seen this [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html]. But recreating the virtual env each time doesn't seem to be a very scalable solution: if you have hundreds of executors, it will retrieve the packages and recreate the virtual environment on each executor every time. Same problem with the proposal in SPARK-16367, from what I understood.
> Another problem with virtual env is that your local environment is not easily shippable to another machine. In particular, there is the relocatable option (see [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable], [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work]), which makes it very complicated for the user to ship the virtual env and be sure it works.
> And here is where pex comes in. It is a nice way to create a single executable zip file with all dependencies included. You have the pex command line tool to build your package, and once it is built you can be sure it works. This is, in my opinion, the most elegant way to ship Python code (better than virtual env and conda).
> The reason it doesn't work out of the box is that a pex file has one single entry point, so just shipping the pex files and setting PYSPARK_PYTHON to the pex file doesn't work. You can nevertheless tune the env variable [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables] at runtime to provide different entry points, as sketched below.
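> For illustration, a minimal sketch of redirecting the entry point at runtime (myapp.main is just an example module, not part of PySpark):
> {code}
> import os
>
> # PEX_MODULE overrides the archive's default entry point at runtime,
> # in the form "module" or "module:symbol".
> os.environ["PEX_MODULE"] = "myapp.main"
> os.environ["PYSPARK_PYTHON"] = "./myarchive.pex"
> {code}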
> PR: [https://github.com/apache/spark/pull/22422/files]


