Posted to issues@spark.apache.org by "Fabian Höring (Jira)" <ji...@apache.org> on 2020/08/14 13:09:00 UTC

[jira] [Comment Edited] (SPARK-32187) User Guide - Shipping Python Package

    [ https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177760#comment-17177760 ] 

Fabian Höring edited comment on SPARK-32187 at 8/14/20, 1:08 PM:
-----------------------------------------------------------------

[~hyukjin.kwon]
 I started working on it. The new doc looks pretty nice! Thanks for the effort on this.
 I think I can also write about --py-files and zipped envs.

Here is a first (in progress) draft. I will make it consistent across the examples. All links target the current doc.
 [https://github.com/fhoering/spark/commit/843b1caa27594bc4bc3cb9637da6f8695db66fbe]
 I will be on holiday for 2 weeks, so there will be no progress during that time. It would be nice if you had time to have a look and give some feedback on the comments below.

Some considerations:

It is structured around the vectorized UDF example (see the sketch after this list):
 - Using PEX
 - Using a zipped virtual environment
 - Using .py files
 - What about the Spark jars?
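
For context, here is a minimal sketch of the kind of vectorized UDF the examples are built around (the column and function names are illustrative, not taken from the draft):
{code:python}
# Minimal vectorized (pandas) UDF sketch; names are illustrative only.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1

df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["v"])
df.select(plus_one("v")).show()
{code}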

I referenced those external tools. I don't have any affiliation with them:
 - [https://github.com/pantsbuild/pex] => see the usage sketch after this list
 - [https://conda.github.io/conda-pack/spark.html] => seems to be the only option for conda for now, AFAIK
 - [https://jcristharif.com/venv-pack/spark.html] => it handles zipped venvs; personally I would recommend PEX because it is self-contained, but I added it for completeness
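
To make the PEX case concrete, here is a minimal sketch of the usage I have in mind (the file name `pyspark_env.pex` is hypothetical; the PEX would be built beforehand, e.g. with `pex pandas pyarrow -o pyspark_env.pex`):
{code:python}
# Sketch: use a pre-built, self-contained PEX file as the Python executable
# on the executors. The file name and path are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.files", "/path/to/pyspark_env.pex")    # ship the PEX to the executors
    .config("spark.pyspark.python", "./pyspark_env.pex")  # run it as the executors' Python
    .getOrCreate()
)
{code}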

I also referenced my Docker Spark standalone e2e example. I don't really want to promote my own stuff here, but I think it could be helpful for people to have something that runs directly, since the examples always strip some code. If you think it should not be there, we can remove it. I also don't mind moving it to the Spark repo.

Some stuff I'm not sure about:
{quote}The unzip will be done by Spark when using the ``--archives`` option in spark-submit 
 or setting the ``spark.yarn.dist.archives`` configuration.
{quote}
It seems like there is no way to set the archives as a config param when not running on YARN. I checked the doc and the Spark code, so it seems inconsistent. Can you check or confirm?
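
For reference, the YARN-only pattern I mean would look something like this (a sketch; `environment.tar.gz` is assumed to have been built beforehand with conda-pack):
{code:python}
# Sketch of the YARN-only variant: ship a conda-pack'ed environment as an
# archive, e.g. one created with `conda pack -o environment.tar.gz`.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    # "#environment" unpacks the archive into a folder named "environment"
    .config("spark.yarn.dist.archives", "environment.tar.gz#environment")
    # point the executors at the Python inside the unpacked environment
    .config("spark.pyspark.python", "./environment/bin/python")
    .getOrCreate()
)
{code}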
{quote}It doesn't allow adding packages built as `Wheels <https://www.python.org/dev/peps/pep-0427/>`_ and therefore doesn't allow including dependencies with native code.
{quote}
I think this is the case, but we need to check to be sure the doc doesn't say something wrong. I can try adding a wheel and see if it works; a sketch of such a test is below.
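
Something like this (the wheel path and package name are placeholders) should show whether a shipped wheel is importable on the executors:
{code:python}
# Sketch of the wheel test: ship a wheel with addPyFile and try to import
# it inside a task. The wheel path and package name are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
sc.addPyFile("/path/to/some_package-1.0-py3-none-any.whl")

def try_import(_):
    import some_package  # succeeds only if the wheel is importable from sys.path
    return some_package.__file__

print(sc.parallelize([0], 1).map(try_import).collect())
{code}
A pure-Python wheel is zip-importable, but a wheel containing native extensions generally is not, which is what the quoted sentence claims.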

Maybe there is also one sentence to say about Docker. Basically, what is described here is the lightweight Python way to do it.



> User Guide - Shipping Python Package
> ------------------------------------
>
>                 Key: SPARK-32187
>                 URL: https://issues.apache.org/jira/browse/SPARK-32187
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Documentation, PySpark
>    Affects Versions: 3.1.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>
> - Zipped file
> - Python files
> - PEX (?) (see also SPARK-25433)


