Posted to issues@spark.apache.org by "Punya Biswal (JIRA)" <ji...@apache.org> on 2015/06/27 01:34:05 UTC

[jira] [Commented] (SPARK-6764) Add wheel package support for PySpark

    [ https://issues.apache.org/jira/browse/SPARK-6764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603785#comment-14603785 ] 

Punya Biswal commented on SPARK-6764:
-------------------------------------

Some packages need to be installed on the workers; it is not enough just to put archived versions on the PYTHONPATH. Is there a reason to avoid using pip on the workers?
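For illustration, here is a minimal sketch (not an existing Spark feature; the wheelhouse and target paths, and the helper name, are made up) of what pip-installing shipped wheels on a worker could look like, instead of only extending sys.path:

{noformat}
# Hypothetical worker-side bootstrap: install shipped wheels with pip
# so that packages with C extensions and data files work normally.
import glob
import subprocess
import sys

def install_wheels(wheelhouse="/tmp/wheelhouse", target="/tmp/pyspark-site"):
    wheels = glob.glob(wheelhouse + "/*.whl")
    # --no-index keeps pip offline; --target installs into a private directory.
    subprocess.check_call([sys.executable, "-m", "pip", "install",
                           "--no-index", "--target", target] + wheels)
    sys.path.insert(0, target)
{noformat}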

> Add wheel package support for PySpark
> -------------------------------------
>
>                 Key: SPARK-6764
>                 URL: https://issues.apache.org/jira/browse/SPARK-6764
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy, PySpark
>            Reporter: Takao Magoori
>            Priority: Minor
>              Labels: newbie
>
> We can run _spark-submit_ with one or more Python packages (.egg, .zip and .jar) via the *--py-files* option.
> h4. zip packaging
> Spark puts the zip file in its working directory and adds its absolute path to Python's sys.path. When the user program imports from it, [zipimport|https://docs.python.org/2.7/library/zipimport.html] is invoked automatically under the hood. As a result, data files and dynamic modules (.pyd, .so) cannot be used, since zipimport supports only .py, .pyc and .pyo files.
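> To make the limitation concrete, here is a small sketch (module names are illustrative): a pure-Python module imports fine from the zip, while a compiled extension bundled in the same archive cannot be loaded.
> {noformat}
> import sys
> sys.path.insert(0, "/tmp/deps.zip")  # effectively what --py-files does
>
> import pure_py_module    # works: zipimport handles .py/.pyc/.pyo
> import c_extension_mod   # ImportError: zipimport cannot load .so/.pyd from a zip
> {noformat}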
> h4. egg packaging
> Spark puts the egg file in its working directory and adds its absolute path to Python's sys.path. Unlike zipimport, an egg can handle data files and dynamic modules as long as the package author uses the [pkg_resources API|https://pythonhosted.org/setuptools/formats.html#other-technical-considerations] properly. But many Python packages do not use the pkg_resources API, which causes "ImportError" or "No such file" errors. Moreover, creating eggs for dependencies and their transitive dependencies is a troublesome job.
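> As an illustration (the package and file names are made up), data files inside an egg must be read through pkg_resources; a plain open() relative to __file__ fails once the package lives inside the archive.
> {noformat}
> import pkg_resources
>
> # Works from an egg: pkg_resources extracts the resource for us.
> raw = pkg_resources.resource_string("mypkg", "data/config.json")
>
> # Fails from an egg: __file__ points inside the archive, so open() cannot find it.
> # import os, mypkg
> # open(os.path.join(os.path.dirname(mypkg.__file__), "data/config.json"))
> {noformat}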
> h4. wheel packaging
> Supporting the new standard Python package format "[wheel|https://wheel.readthedocs.org/en/latest/]" would be nice. With wheel, we can run spark-submit with complex dependencies as simply as follows.
> 1. Write a requirements.txt file.
> {noformat}
> SQLAlchemy
> MySQL-python
> requests
> simplejson>=3.6.0,<=3.6.5
> pydoop
> {noformat}
> 2. Do the wheel packaging with a single command; all dependencies are wheel-ed.
> {noformat}
> $ your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse --requirement requirements.txt
> {noformat}
> 3. Run spark-submit:
> {noformat}
> your_spark_home/bin/spark-submit --master local[4] --py-files $(find /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') your_driver.py
> {noformat}
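> Under the proposed wheel support, the driver script could then import the wheel-ed dependencies directly on the driver and the workers. A sketch of what your_driver.py might contain (the JSON-summing job is made up):
> {noformat}
> # your_driver.py (sketch)
> import simplejson
> from pyspark import SparkContext
>
> sc = SparkContext(appName="wheel-demo")
> lines = sc.parallelize(['{"n": 1}', '{"n": 2}', '{"n": 3}'])
> # simplejson must be importable on every worker for this lambda to run.
> total = lines.map(lambda s: simplejson.loads(s)["n"]).sum()
> print(total)
> sc.stop()
> {noformat}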
> If your PySpark driver is a package consisting of many modules:
> 1. Write a setup.py for your PySpark driver package.
> {noformat}
> from setuptools import (
>     find_packages,
>     setup,
> )
> setup(
>     name='yourpkg',
>     version='0.0.1',
>     packages=find_packages(),
>     install_requires=[
>         'SQLAlchemy',
>         'MySQL-python',
>         'requests',
>         'simplejson>=3.6.0,<=3.6.5',
>         'pydoop',
>     ],
> )
> {noformat}
> 2. Do the wheel packaging with a single command; your driver package and all of its dependencies are wheel-ed.
> {noformat}
> your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse your_driver_package/.
> {noformat}
> 3. Run spark-submit:
> {noformat}
> your_spark_home/bin/spark-submit --master local[4] --py-files $(find /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') your_driver_bootstrap.py
> {noformat}
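> The bootstrap script itself then only needs to import the packaged driver and call its entry point. A sketch, matching the yourpkg package from the setup.py above (the "main" module and "run" function are illustrative):
> {noformat}
> # your_driver_bootstrap.py (sketch)
> from yourpkg.main import run   # illustrative entry point inside your package
>
> if __name__ == "__main__":
>     run()
> {noformat}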


