Posted to dev@spark.apache.org by Davies Liu <da...@databricks.com> on 2015/08/07 00:14:51 UTC

Re: PySpark on PyPi

We could do that after 1.5 is released; it will then have the same release cycle
as Spark going forward.

On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot
<o....@lateral-thoughts.com> wrote:
> +1 (once again :) )
>
> 2015-07-28 14:51 GMT+02:00 Justin Uang <ju...@gmail.com>:
>>
>> // ping
>>
>> do we have any signoff from the pyspark devs to submit a PR to publish to
>> PyPI?
>>
>> On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman <fr...@gmail.com>
>> wrote:
>>>
>>> Hey all, great discussion, just wanted to +1 that I see a lot of value in
>>> steps that make it easier to use PySpark as an ordinary python library.
>>>
>>> You might want to check out findspark (https://github.com/minrk/findspark),
>>> started by Jupyter project devs, which offers one way to facilitate this.
>>> I’ve also cc'ed them here to join the conversation.
>>>
>>> Also, @Jey, I can confirm that at least in some scenarios (I’ve done
>>> it in an EC2 cluster in standalone mode) it’s possible to run PySpark jobs
>>> just using `from pyspark import SparkContext; sc = SparkContext(master="X")`
>>> so long as the environment variables (PYTHONPATH and PYSPARK_PYTHON) are
>>> set correctly on *both* workers and driver. That said, there’s definitely
>>> additional configuration / functionality that would require going through
>>> the proper submit scripts.
>>>
>>> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal <pu...@gmail.com>
>>> wrote:
>>>
>>> I agree with everything Justin just said. An additional advantage of
>>> publishing PySpark's Python code in a standards-compliant way is the fact
>>> that we'll be able to declare transitive dependencies (Pandas, Py4J) in a
>>> way that pip can use. Contrast this with the current situation, where
>>> df.toPandas() exists in the Spark API but doesn't actually work until you
>>> install Pandas.
>>>
>>> Punya
>>> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang <ju...@gmail.com>
>>> wrote:
>>>>
>>>> // + Davies for his comments
>>>> // + Punya for SA
>>>>
>>>> For development and CI, like Olivier mentioned, I think it would be
>>>> hugely beneficial to publish pyspark (only code in the python/ dir) on PyPI.
>>>> If anyone wants to develop against PySpark APIs, they need to download the
>>>> distribution and do a lot of PYTHONPATH munging for all the tools (pylint,
>>>> pytest, IDE code completion). Right now that involves adding python/ and
>>>> python/lib/py4j-0.8.2.1-src.zip. If pyspark ever adds more
>>>> dependencies, we would have to manually mirror all the PYTHONPATH munging in
>>>> the ./pyspark script. With a proper pyspark setup.py that declares its
>>>> dependencies, and a published distribution, depending on pyspark would just
>>>> mean adding pyspark to my setup.py dependencies.
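As a rough sketch of that last point (the version pin and the optional pandas
extra below are illustrative, not the project's actual packaging metadata), such
a setup.py might look like:

    # setup.py -- illustrative sketch only
    from setuptools import setup, find_packages

    setup(
        name="pyspark",
        version="1.5.0",                     # would have to match the Spark distribution exactly
        packages=find_packages(),
        install_requires=["py4j==0.8.2.1"],  # the Py4J version currently bundled in python/lib/
        extras_require={"sql": ["pandas"]},  # so df.toPandas() works out of the box
    )

With something like that published, `pip install pyspark` (or listing pyspark in
another project's setup.py) would pull in Py4J automatically.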
>>>>
>>>> Of course, if we actually want to run the parts of pyspark that are backed by
>>>> Py4J calls, then we need the full spark distribution with either ./pyspark
>>>> or ./spark-submit, but for things like linting and development, the
>>>> PYTHONPATH munging is very annoying.
>>>>
>>>> I don't think the version-mismatch issues are a compelling reason to not
>>>> go ahead with PyPI publishing. At runtime, we should definitely enforce that
>>>> the version has to be exact, which means there is no backcompat nightmare as
>>>> suggested by Davies in https://issues.apache.org/jira/browse/SPARK-1267.
>>>> This would mean that even if the user's pip-installed pyspark somehow got
>>>> loaded before the pyspark shipped with the Spark distribution, the user
>>>> would be alerted immediately.
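A sketch of what that runtime enforcement could look like (the helper name and
where it gets called from are hypothetical; sc.version reports the version of
the Spark jars actually loaded):

    # Hypothetical strict version check, run once the SparkContext is up.
    # pyspark_version would come from the pip-installed package's own metadata.
    def ensure_matching_versions(sc, pyspark_version):
        if sc.version != pyspark_version:
            raise RuntimeError(
                "pyspark %s cannot be used with a Spark %s distribution; "
                "the versions must match exactly" % (pyspark_version, sc.version))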
>>>>
>>>> Davies, if you buy this, should I or someone on my team pick up
>>>> https://issues.apache.org/jira/browse/SPARK-1267 and
>>>> https://github.com/apache/spark/pull/464?
>>>>
>>>> On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot
>>>> <o....@lateral-thoughts.com> wrote:
>>>>>
>>>>> Ok, I get it. Now, what can we do to improve the current situation?
>>>>> Because right now, if I want to set up a CI env for PySpark, I have to:
>>>>> 1- download a pre-built version of pyspark and unzip it somewhere on
>>>>> every agent
>>>>> 2- define the SPARK_HOME env
>>>>> 3- symlink this distribution pyspark dir inside the python install dir
>>>>> site-packages/ directory
>>>>> and if I rely on additional packages (like databricks' Spark-CSV
>>>>> project), I have to (except if I'm mistaken)
>>>>> 4- compile/assembly spark-csv, deploy the jar in a specific directory
>>>>> on every agent
>>>>> 5- add this jar-filled directory to the Spark distribution's additional
>>>>> classpath using the conf/spark-defaults.conf file
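For step 5, the relevant lines in conf/spark-defaults.conf would look something
like this (the jar directory is of course site-specific):

    # conf/spark-defaults.conf -- illustrative paths
    spark.driver.extraClassPath    /opt/ci/extra-jars/*
    spark.executor.extraClassPath  /opt/ci/extra-jars/*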
>>>>>
>>>>> Then finally we can launch our unit/integration-tests.
>>>>> Some issues are related to spark-packages, some to the lack of
>>>>> python-based dependency, and some to the way SparkContext are launched when
>>>>> using pyspark.
>>>>> I think steps 1 and 2 are fair enough.
>>>>> Steps 4 and 5 may already have solutions; I didn't check, and considering
>>>>> that spark-shell downloads such dependencies automatically, I assume this
>>>>> will be handled eventually if it isn't already.
>>>>>
>>>>> For step 3, maybe just adding a setup.py to the distribution would be
>>>>> enough. I'm not exactly advocating distributing a full 300 MB Spark
>>>>> distribution on PyPI; maybe there's a better compromise?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Olivier.
>>>>>
>>>>> On Fri, Jun 5, 2015 at 22:12, Jey Kottalam <je...@cs.berkeley.edu> wrote:
>>>>>>
>>>>>> Couldn't we have a pip installable "pyspark" package that just serves
>>>>>> as a shim to an existing Spark installation? Or it could even download the
>>>>>> latest Spark binary if SPARK_HOME isn't set during installation. Right now,
>>>>>> Spark doesn't play very well with the usual Python ecosystem. For example,
>>>>>> why do I need to use a strange incantation when booting up IPython if I want
>>>>>> to use PySpark in a notebook with MASTER="local[4]"? It would be much nicer
>>>>>> to just type `from pyspark import SparkContext; sc =
>>>>>> SparkContext("local[4]")` in my notebook.
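A minimal sketch of such a shim, assuming SPARK_HOME points at an existing
installation (this is essentially what findspark automates; the glob for the
Py4J zip is illustrative):

    # shim sketch: put an existing Spark install's Python sources on sys.path
    import glob
    import os
    import sys

    spark_home = os.environ.get("SPARK_HOME")
    if not spark_home:
        raise RuntimeError("SPARK_HOME is not set; cannot locate a Spark install")
    sys.path.insert(0, os.path.join(spark_home, "python"))
    sys.path.extend(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")))

    from pyspark import SparkContext
    sc = SparkContext("local[4]")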
>>>>>>
>>>>>> I did a test and it seems like PySpark's basic unit-tests do pass when
>>>>>> SPARK_HOME is set and Py4J is on the PYTHONPATH:
>>>>>>
>>>>>>
>>>>>> PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
>>>>>> python $SPARK_HOME/python/pyspark/rdd.py
>>>>>>
>>>>>> -Jey
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen <ro...@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> This has been proposed before:
>>>>>>> https://issues.apache.org/jira/browse/SPARK-1267
>>>>>>>
>>>>>>> There's currently tighter coupling between the Python and Java halves
>>>>>>> of PySpark than just requiring SPARK_HOME to be set; if we did this, I bet
>>>>>>> we'd run into tons of issues when users try to run a newer version of the
>>>>>>> Python half of PySpark against an older set of Java components or
>>>>>>> vice-versa.
>>>>>>>
>>>>>>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot
>>>>>>> <o....@lateral-thoughts.com> wrote:
>>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>> Considering that the Python API is just a front end that needs SPARK_HOME
>>>>>>>> defined anyway, I think it would be interesting to deploy the Python part of
>>>>>>>> Spark on PyPI so that a Python project needing PySpark can handle the
>>>>>>>> dependency via pip.
>>>>>>>>
>>>>>>>> For now I just symlink python/pyspark into my Python install's
>>>>>>>> site-packages/ directory so that PyCharm and other lint tools work properly.
>>>>>>>> I can do the setup.py work or anything else that's needed.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Olivier.
>>>>>>>
>>>>>>>
>>>>>>
>>>
>



Re: PySpark on PyPi

Posted by westurner <we...@gmail.com>.
On Aug 20, 2015 4:57 PM, "Justin Uang [via Apache Spark Developers List]"
<ml-node+s1001551n13766h41@n3.nabble.com> wrote:
>
> One other question: Do we have consensus on publishing the
> pip-installable source distribution to PyPI? If so, is that something that
> the maintainers need to add to the process that they use to publish
> releases?

A setup.py, .travis.yml, tox.ini (e.g. from cookiecutter)?
https://github.com/audreyr/cookiecutter-pypackage

https://wrdrd.com/docs/tools/#python-packages

* scripts=[]
* package_data / MANIFEST.in
* entry_points
   * console_scripts
   * https://pythonhosted.org/setuptools/setuptools.html#eggsecutable-scripts

https://github.com/audreyr/cookiecutter-pypackage

... https://wrdrd.com/docs/consulting/knowledge-engineering#spark
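For the package_data / MANIFEST.in item, the source distribution would also need
to ship the bundled Py4J zip; a sketch, assuming setup.py lives in python/ (paths
illustrative):

    # MANIFEST.in -- illustrative
    include lib/py4j-*-src.zip
    recursive-include pyspark *.py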


Re: PySpark on PyPi

Posted by quasiben <qu...@gmail.com>.
I've helped build conda-installable Spark packages in the past. You can find
an older recipe here:
https://github.com/conda/conda-recipes/tree/master/spark

And I've been updating packages here: 
https://anaconda.org/anaconda-cluster/spark

`conda install -c anaconda-cluster spark` 

The above should work for OSX/Linux-64 and py27/py34 

--Ben 






Re: PySpark on PyPi

Posted by westurner <we...@gmail.com>.
westurner wrote
> 
> Matt Goodman wrote
>> I would tentatively suggest also conda packaging.
>> 
>> http://conda.pydata.org/docs/
> $ conda skeleton pypi pyspark
> # update git_tag and git_uri
> # add test commands (import pyspark; import pyspark.[...])
> 
> Docs for building conda packages for multiple operating systems and
> interpreters from PyPi packages:
> 
> *
> http://www.pydanny.com/building-conda-packages-for-multiple-operating-systems.html
> * https://github.com/audreyr/cookiecutter/issues/232

* conda meta.yaml can specify e.g. a test.sh script(s) that should return 0
 
  Docs: http://conda.pydata.org/docs/building/meta-yaml.html#test-section
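A sketch of such a test section in a conda meta.yaml (using the imports/commands
form; the package and source sections are omitted here):

    test:
      imports:
        - pyspark
        - pyspark.sql
      commands:
        - python -c "import pyspark"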




Re: PySpark on PyPi

Posted by westurner <we...@gmail.com>.
Matt Goodman wrote
> I would tentatively suggest also conda packaging.
> 
> http://conda.pydata.org/docs/

$ conda skeleton pypi pyspark
# update git_tag and git_uri
# add test commands (import pyspark; import pyspark.[...])

Docs for building conda packages for multiple operating systems and
interpreters from PyPi packages:

*
http://www.pydanny.com/building-conda-packages-for-multiple-operating-systems.html
* https://github.com/audreyr/cookiecutter/issues/232




Re: PySpark on PyPi

Posted by Justin Uang <ju...@gmail.com>.
One other question: Do we have consensus on publishing the pip-installable
source distribution to PyPI? If so, is that something that the maintainers
need to add to the process that they use to publish releases?

On Thu, Aug 20, 2015 at 5:44 PM Justin Uang <ju...@gmail.com> wrote:

> I would prefer to just do it without the jar first as well. My hunch is
> that to run spark the way it is intended, we need the wrapper scripts, like
> spark-submit. Does anyone know authoritatively if that is the case?
>
> On Thu, Aug 20, 2015 at 4:54 PM Olivier Girardot <
> o.girardot@lateral-thoughts.com> wrote:
>
>> +1
>> But just to improve the error logging,
>> would it be possible to add some warn logging in pyspark when the
>> SPARK_HOME env variable is pointing to a Spark distribution with a
>> different version from the pyspark package ?
>>
>> Regards,
>>
>> Olivier.
>>
>> 2015-08-20 22:43 GMT+02:00 Brian Granger <el...@gmail.com>:
>>
>>> I would start with just the plain python package without the JAR and
>>> then see if it makes sense to add the JAR over time.
>>>
>>> On Thu, Aug 20, 2015 at 12:27 PM, Auberon Lopez <au...@gmail.com>
>>> wrote:
>>> > Hi all,
>>> >
>>> > I wanted to bubble up a conversation from the PR to this discussion to see
>>> > if there is support for the idea of including a Spark assembly JAR in a PyPI
>>> > release of pyspark. @holdenk recommended this as she already does so in the
>>> > Sparkling Pandas package. Is this something people are interested in
>>> > pursuing?
>>> >
>>> > -Auberon
>>> >
>>> >> On Thu, Aug 20, 2015 at 10:03 AM, Brian Granger <el...@gmail.com> wrote:
>>> >>
>>> >> Auberon, can you also post this to the Jupyter Google Group?
>>> >>
>>> >> On Wed, Aug 19, 2015 at 12:23 PM, Auberon Lopez <auberon.lopez@gmail.com>
>>> >> wrote:
>>> >> > Hi all,
>>> >> >
>>> >> > I've created an updated PR for this based off of the previous work of
>>> >> > @prabinb:
>>> >> > https://github.com/apache/spark/pull/8318
>>> >> >
>>> >> > I am not very familiar with python packaging; feedback is appreciated.
>>> >> >
>>> >> > -Auberon
>>> >> >
>>> >> > On Mon, Aug 10, 2015 at 12:45 PM, MinRK <be...@gmail.com>
>>> wrote:
>>> >> >>
>>> >> >>
>>> >> >> On Mon, Aug 10, 2015 at 12:28 PM, Matt Goodman <meawoppl@gmail.com>
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> I would tentatively suggest also conda packaging.
>>> >> >>
>>> >> >>
>>> >> >> A conda package has the advantage that it can be set up without
>>> >> >> 'installing' the pyspark files, while the PyPI packaging is still being
>>> >> >> worked out. It can just add a pyspark.pth file pointing to the pyspark and
>>> >> >> py4j locations. But I think it's a really good idea to package with conda.
>>> >> >>
>>> >> >> -MinRK
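A sketch of what such a pyspark.pth could contain -- each non-comment line is
simply appended to sys.path at interpreter start-up (the install prefix is
illustrative; the Py4J version is the one currently bundled):

    # site-packages/pyspark.pth -- illustrative paths
    /opt/spark-1.5.0/python
    /opt/spark-1.5.0/python/lib/py4j-0.8.2.1-src.zip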
>>> >> >>
>>> >> >>>
>>> >> >>>
>>> >> >>> http://conda.pydata.org/docs/
>>> >> >>>
>>> >> >>> --Matthew Goodman
>>> >> >>>
>>> >> >>> =====================
>>> >> >>> Check Out My Website: http://craneium.net
>>> >> >>> Find me on LinkedIn: http://tinyurl.com/d6wlch
>>> >> >>>
>>> >> >>> On Mon, Aug 10, 2015 at 11:23 AM, Davies Liu <davies@databricks.com>
>>> >> >>> wrote:
>>> >> >>>>
>>> >> >>>> I think so, any contributions on this are welcome.
>>> >> >>>>
>>> >> >>>> On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger <ellisonbg@gmail.com>
>>> >> >>>> wrote:
>>> >> >>>> > Sorry, trying to follow the context here. Does it look like
>>> there
>>> >> >>>> > is
>>> >> >>>> > support for the idea of creating a setup.py file and pypi
>>> package
>>> >> >>>> > for
>>> >> >>>> > pyspark?
>>> >> >>>> >
>>> >> >>>> > Cheers,
>>> >> >>>> >
>>> >> >>>> > Brian
>>> >> >>>> >
>>> >> >>>> >>>>>>> enough, I'm not exactly advocating to distribute a full
>>> 300Mb
>>> >> >>>> >>>>>>> spark
>>> >> >>>> >>>>>>> distribution in PyPi, maybe there's a better compromise ?
>>> >> >>>> >>>>>>>
>>> >> >>>> >>>>>>> Regards,
>>> >> >>>> >>>>>>>
>>> >> >>>> >>>>>>> Olivier.
>>> >> >>>> >>>>>>>
>>> >> >>>> >>>>>>> Le ven. 5 juin 2015 à 22:12, Jey Kottalam
>>> >> >>>> >>>>>>> <je...@cs.berkeley.edu>
>>> >> >>>> >>>>>>> a écrit
>>> >> >>>> >>>>>>> :
>>> >> >>>> >>>>>>>>
>>> >> >>>> >>>>>>>> Couldn't we have a pip installable "pyspark" package
>>> that
>>> >> >>>> >>>>>>>> just
>>> >> >>>> >>>>>>>> serves
>>> >> >>>> >>>>>>>> as a shim to an existing Spark installation? Or it could
>>> >> >>>> >>>>>>>> even
>>> >> >>>> >>>>>>>> download the
>>> >> >>>> >>>>>>>> latest Spark binary if SPARK_HOME isn't set during
>>> >> >>>> >>>>>>>> installation. Right now,
>>> >> >>>> >>>>>>>> Spark doesn't play very well with the usual Python
>>> >> >>>> >>>>>>>> ecosystem.
>>> >> >>>> >>>>>>>> For example,
>>> >> >>>> >>>>>>>> why do I need to use a strange incantation when booting
>>> up
>>> >> >>>> >>>>>>>> IPython if I want
>>> >> >>>> >>>>>>>> to use PySpark in a notebook with MASTER="local[4]"? It
>>> >> >>>> >>>>>>>> would
>>> >> >>>> >>>>>>>> be much nicer
>>> >> >>>> >>>>>>>> to just type `from pyspark import SparkContext; sc =
>>> >> >>>> >>>>>>>> SparkContext("local[4]")` in my notebook.
>>> >> >>>> >>>>>>>>
>>> >> >>>> >>>>>>>> I did a test and it seems like PySpark's basic
>>> unit-tests do
>>> >> >>>> >>>>>>>> pass when
>>> >> >>>> >>>>>>>> SPARK_HOME is set and Py4J is on the PYTHONPATH:
>>> >> >>>> >>>>>>>>
>>> >> >>>> >>>>>>>>
>>> >> >>>> >>>>>>>>
>>> >> >>>> >>>>>>>>
>>> >> >>>> >>>>>>>>
>>> PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
>>> >> >>>> >>>>>>>> python $SPARK_HOME/python/pyspark/rdd.py
>>> >> >>>> >>>>>>>>
>>> >> >>>> >>>>>>>> -Jey
>>> >> >>>> >>>>>>>>
>>> >> >>>> >>>>>>>>
>>> >> >>>> >>>>>>>> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen
>>> >> >>>> >>>>>>>> <ro...@gmail.com>
>>> >> >>>> >>>>>>>> wrote:
>>> >> >>>> >>>>>>>>>
>>> >> >>>> >>>>>>>>> This has been proposed before:
>>> >> >>>> >>>>>>>>> https://issues.apache.org/jira/browse/SPARK-1267
>>> >> >>>> >>>>>>>>>
>>> >> >>>> >>>>>>>>> There's currently tighter coupling between the Python
>>> and
>>> >> >>>> >>>>>>>>> Java
>>> >> >>>> >>>>>>>>> halves
>>> >> >>>> >>>>>>>>> of PySpark than just requiring SPARK_HOME to be set;
>>> if we
>>> >> >>>> >>>>>>>>> did
>>> >> >>>> >>>>>>>>> this, I bet
>>> >> >>>> >>>>>>>>> we'd run into tons of issues when users try to run a
>>> newer
>>> >> >>>> >>>>>>>>> version of the
>>> >> >>>> >>>>>>>>> Python half of PySpark against an older set of Java
>>> >> >>>> >>>>>>>>> components
>>> >> >>>> >>>>>>>>> or
>>> >> >>>> >>>>>>>>> vice-versa.
>>> >> >>>> >>>>>>>>>
>>> >> >>>> >>>>>>>>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot
>>> >> >>>> >>>>>>>>> <o....@lateral-thoughts.com> wrote:
>>> >> >>>> >>>>>>>>>>
>>> >> >>>> >>>>>>>>>> Hi everyone,
>>> >> >>>> >>>>>>>>>> Considering the python API as just a front needing the
>>> >> >>>> >>>>>>>>>> SPARK_HOME
>>> >> >>>> >>>>>>>>>> defined anyway, I think it would be interesting to
>>> deploy
>>> >> >>>> >>>>>>>>>> the
>>> >> >>>> >>>>>>>>>> Python part of
>>> >> >>>> >>>>>>>>>> Spark on PyPi in order to handle the dependencies in a
>>> >> >>>> >>>>>>>>>> Python
>>> >> >>>> >>>>>>>>>> project
>>> >> >>>> >>>>>>>>>> needing PySpark via pip.
>>> >> >>>> >>>>>>>>>>
>>> >> >>>> >>>>>>>>>> For now I just symlink the python/pyspark in my python
>>> >> >>>> >>>>>>>>>> install dir
>>> >> >>>> >>>>>>>>>> site-packages/ in order for PyCharm or other lint
>>> tools to
>>> >> >>>> >>>>>>>>>> work properly.
>>> >> >>>> >>>>>>>>>> I can do the setup.py work or anything.
>>> >> >>>> >>>>>>>>>>
>>> >> >>>> >>>>>>>>>> What do you think ?
>>> >> >>>> >>>>>>>>>>
>>> >> >>>> >>>>>>>>>> Regards,
>>> >> >>>> >>>>>>>>>>
>>> >> >>>> >>>>>>>>>> Olivier.
>>> >> >>>> >>>>>>>>>
>>> >> >>>> >>>>>>>>>
>>> >> >>>> >>>>>>>>
>>> >> >>>> >>>>>
>>> >> >>>> >>>
>>> >> >>>> >
>>> >> >>>> >
>>> >> >>>> >
>>> >> >>>> > --
>>> >> >>>> > Brian E. Granger
>>> >> >>>> > Cal Poly State University, San Luis Obispo
>>> >> >>>> > @ellisonbg on Twitter and GitHub
>>> >> >>>> > bgranger@calpoly.edu and ellisonbg@gmail.com
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>
>>> >> >>
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Brian E. Granger
>>> >> Associate Professor of Physics and Data Science
>>> >> Cal Poly State University, San Luis Obispo
>>> >> @ellisonbg on Twitter and GitHub
>>> >> bgranger@calpoly.edu and ellisonbg@gmail.com
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Brian E. Granger
>>> Associate Professor of Physics and Data Science
>>> Cal Poly State University, San Luis Obispo
>>> @ellisonbg on Twitter and GitHub
>>> bgranger@calpoly.edu and ellisonbg@gmail.com
>>>
>>
>>
>>
>> --
>> *Olivier Girardot* | Associé
>> o.girardot@lateral-thoughts.com
>> +33 6 24 09 17 94
>>
>

Re: PySpark on PyPi

Posted by Justin Uang <ju...@gmail.com>.
I would prefer to just do it without the JAR first as well. My hunch is
that to run Spark the way it is intended, we need the wrapper scripts, like
spark-submit. Does anyone know authoritatively whether that is the case?
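
For concreteness, a minimal sketch of what running without the wrapper
scripts currently involves: the caller reproduces by hand the environment
setup that bin/pyspark and bin/spark-submit normally do. The paths and the
py4j archive name below are assumptions tied to a 1.x binary distribution,
not a recommendation.

    import glob
    import os
    import sys

    # Assumes SPARK_HOME points at an unpacked binary distribution.
    spark_home = os.environ["SPARK_HOME"]
    sys.path.insert(0, os.path.join(spark_home, "python"))
    # The Py4J zip ships under python/lib; the exact version can differ.
    sys.path.insert(0, glob.glob(
        os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))[0])

    from pyspark import SparkContext

    sc = SparkContext(master="local[2]", appName="no-wrapper-check")
    print(sc.parallelize(range(100)).sum())
    sc.stop()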

On Thu, Aug 20, 2015 at 4:54 PM Olivier Girardot <
o.girardot@lateral-thoughts.com> wrote:

> +1
> But just to improve the error logging,
> would it be possible to add some warn logging in pyspark when the
> SPARK_HOME env variable is pointing to a Spark distribution with a
> different version from the pyspark package?
>
> Regards,
>
> Olivier.
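
A rough sketch of the kind of warning being suggested here, assuming the
installed package knows its own version (e.g. via a pyspark/version.py,
which is an assumption) and that a binary Spark distribution ships a
RELEASE file under SPARK_HOME (also an assumption):

    import os
    import re
    import warnings

    def warn_on_version_mismatch(package_version):
        # package_version would come from the pip-installed pyspark itself.
        spark_home = os.environ.get("SPARK_HOME")
        if not spark_home:
            return
        release_file = os.path.join(spark_home, "RELEASE")
        if not os.path.isfile(release_file):
            return
        with open(release_file) as f:
            match = re.search(r"Spark (\S+)", f.read())
        if match and match.group(1) != package_version:
            warnings.warn(
                "SPARK_HOME points at Spark %s but the installed pyspark "
                "package is version %s" % (match.group(1), package_version))
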
>
> 2015-08-20 22:43 GMT+02:00 Brian Granger <el...@gmail.com>:
>
>> I would start with just the plain python package without the JAR and
>> then see if it makes sense to add the JAR over time.
>>
>> On Thu, Aug 20, 2015 at 12:27 PM, Auberon Lopez <au...@gmail.com>
>> wrote:
>> > Hi all,
>> >
>> > I wanted to bubble up a conversation from the PR to this discussion to
>> see
>> > if there is support for the idea of including a Spark assembly JAR in a PyPI
>> > release of pyspark. @holdenk recommended this as she already does so in
>> the
>> > Sparkling Pandas package. Is this something people are interested in
>> > pursuing?
>> >
>> > -Auberon
>> >
>> > On Thu, Aug 20, 2015 at 10:03 AM, Brian Granger <el...@gmail.com>
>> wrote:
>> >>
>> >> Auberon, can you also post this to the Jupyter Google Group?
>> >>
>> >> On Wed, Aug 19, 2015 at 12:23 PM, Auberon Lopez <
>> auberon.lopez@gmail.com>
>> >> wrote:
>> >> > Hi all,
>> >> >
>> >> > I've created an updated PR for this based off of the previous work of
>> >> > @prabinb:
>> >> > https://github.com/apache/spark/pull/8318
>> >> >
>> >> > I am not very familiar with python packaging; feedback is
>> appreciated.
>> >> >
>> >> > -Auberon
>> >> >
>> >> > On Mon, Aug 10, 2015 at 12:45 PM, MinRK <be...@gmail.com>
>> wrote:
>> >> >>
>> >> >>
>> >> >> On Mon, Aug 10, 2015 at 12:28 PM, Matt Goodman <me...@gmail.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> I would tentatively suggest also conda packaging.
>> >> >>
>> >> >>
>> >> >> A conda package has the advantage that it can be set up without
>> >> >> 'installing' the pyspark files, while the PyPI packaging is still
>> being
>> >> >> worked out. It can just add a pyspark.pth file pointing to pyspark,
>> >> >> py4j
>> >> >> locations. But I think it's a really good idea to package with
>> conda.
>> >> >>
>> >> >> -MinRK
>> >> >>
>> >> >>>
>> >> >>>
>> >> >>> http://conda.pydata.org/docs/
>> >> >>>
>> >> >>> --Matthew Goodman
>> >> >>>
>> >> >>> =====================
>> >> >>> Check Out My Website: http://craneium.net
>> >> >>> Find me on LinkedIn: http://tinyurl.com/d6wlch
>> >> >>>
>> >> >>> On Mon, Aug 10, 2015 at 11:23 AM, Davies Liu <
>> davies@databricks.com>
>> >> >>> wrote:
>> >> >>>>
>> >> >>>> I think so, any contributions on this are welcome.
>> >> >>>>
>> >> >>>> On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger <
>> ellisonbg@gmail.com>
>> >> >>>> wrote:
>> >> >>>> > Sorry, trying to follow the context here. Does it look like
>> there
>> >> >>>> > is
>> >> >>>> > support for the idea of creating a setup.py file and pypi
>> package
>> >> >>>> > for
>> >> >>>> > pyspark?
>> >> >>>> >
>> >> >>>> > Cheers,
>> >> >>>> >
>> >> >>>> > Brian
>> >> >>>> >
>> >> >>>> > On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu <
>> davies@databricks.com>
>> >> >>>> > wrote:
>> >> >>>> >> We could do that after 1.5 released, it will have same release
>> >> >>>> >> cycle
>> >> >>>> >> as Spark in the future.
>> >> >>>> >>
>> >> >>>> >> On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot
>> >> >>>> >> <o....@lateral-thoughts.com> wrote:
>> >> >>>> >>> +1 (once again :) )
>> >> >>>> >>>
>> >> >>>> >>> 2015-07-28 14:51 GMT+02:00 Justin Uang <justin.uang@gmail.com
>> >:
>> >> >>>> >>>>
>> >> >>>> >>>> // ping
>> >> >>>> >>>>
>> >> >>>> >>>> do we have any signoff from the pyspark devs to submit a PR
>> to
>> >> >>>> >>>> publish to
>> >> >>>> >>>> PyPI?
>> >> >>>> >>>>
>> >> >>>> >>>> On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman
>> >> >>>> >>>> <fr...@gmail.com>
>> >> >>>> >>>> wrote:
>> >> >>>> >>>>>
>> >> >>>> >>>>> Hey all, great discussion, just wanted to +1 that I see a
>> lot
>> >> >>>> >>>>> of
>> >> >>>> >>>>> value in
>> >> >>>> >>>>> steps that make it easier to use PySpark as an ordinary
>> python
>> >> >>>> >>>>> library.
>> >> >>>> >>>>>
>> >> >>>> >>>>> You might want to check out this
>> >> >>>> >>>>> (https://github.com/minrk/findspark),
>> >> >>>> >>>>> started by Jupyter project devs, that offers one way to
>> >> >>>> >>>>> facilitate
>> >> >>>> >>>>> this
>> >> >>>> >>>>> stuff. I’ve also cced them here to join the conversation.
>> >> >>>> >>>>>
>> >> >>>> >>>>> Also, @Jey, I can also confirm that at least in some
>> scenarios
>> >> >>>> >>>>> (I’ve done
>> >> >>>> >>>>> it in an EC2 cluster in standalone mode) it’s possible to
>> run
>> >> >>>> >>>>> PySpark jobs
>> >> >>>> >>>>> just using `from pyspark import SparkContext; sc =
>> >> >>>> >>>>> SparkContext(master=“X”)`
>> >> >>>> >>>>> so long as the environmental variables (PYTHONPATH and
>> >> >>>> >>>>> PYSPARK_PYTHON) are
>> >> >>>> >>>>> set correctly on *both* workers and driver. That said,
>> there’s
>> >> >>>> >>>>> definitely
>> >> >>>> >>>>> additional configuration / functionality that would require
>> >> >>>> >>>>> going
>> >> >>>> >>>>> through
>> >> >>>> >>>>> the proper submit scripts.
>> >> >>>> >>>>>
>> >> >>>> >>>>> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal
>> >> >>>> >>>>> <pu...@gmail.com>
>> >> >>>> >>>>> wrote:
>> >> >>>> >>>>>
>> >> >>>> >>>>> I agree with everything Justin just said. An additional
>> >> >>>> >>>>> advantage
>> >> >>>> >>>>> of
>> >> >>>> >>>>> publishing PySpark's Python code in a standards-compliant
>> way
>> >> >>>> >>>>> is
>> >> >>>> >>>>> the fact
>> >> >>>> >>>>> that we'll be able to declare transitive dependencies
>> (Pandas,
>> >> >>>> >>>>> Py4J) in a
>> >> >>>> >>>>> way that pip can use. Contrast this with the current
>> situation,
>> >> >>>> >>>>> where
>> >> >>>> >>>>> df.toPandas() exists in the Spark API but doesn't actually
>> work
>> >> >>>> >>>>> until you
>> >> >>>> >>>>> install Pandas.
>> >> >>>> >>>>>
>> >> >>>> >>>>> Punya
>> >> >>>> >>>>> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang
>> >> >>>> >>>>> <ju...@gmail.com>
>> >> >>>> >>>>> wrote:
>> >> >>>> >>>>>>
>> >> >>>> >>>>>> // + Davies for his comments
>> >> >>>> >>>>>> // + Punya for SA
>> >> >>>> >>>>>>
>> >> >>>> >>>>>> For development and CI, like Olivier mentioned, I think it
>> >> >>>> >>>>>> would
>> >> >>>> >>>>>> be
>> >> >>>> >>>>>> hugely beneficial to publish pyspark (only code in the
>> python/
>> >> >>>> >>>>>> dir) on PyPI.
>> >> >>>> >>>>>> If anyone wants to develop against PySpark APIs, they need
>> to
>> >> >>>> >>>>>> download the
>> >> >>>> >>>>>> distribution and do a lot of PYTHONPATH munging for all the
>> >> >>>> >>>>>> tools
>> >> >>>> >>>>>> (pylint,
>> >> >>>> >>>>>> pytest, IDE code completion). Right now that involves
>> adding
>> >> >>>> >>>>>> python/ and
>> >> >>>> >>>>>> python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever
>> wants to
>> >> >>>> >>>>>> add more
>> >> >>>> >>>>>> dependencies, we would have to manually mirror all the
>> >> >>>> >>>>>> PYTHONPATH
>> >> >>>> >>>>>> munging in
>> >> >>>> >>>>>> the ./pyspark script. With a proper pyspark setup.py which
>> >> >>>> >>>>>> declares its
>> >> >>>> >>>>>> dependencies, and a published distribution, depending on
>> >> >>>> >>>>>> pyspark
>> >> >>>> >>>>>> will just
>> >> >>>> >>>>>> be adding pyspark to my setup.py dependencies.
>> >> >>>> >>>>>>
>> >> >>>> >>>>>> Of course, if we actually want to run parts of pyspark
>> that is
>> >> >>>> >>>>>> backed by
>> >> >>>> >>>>>> Py4J calls, then we need the full spark distribution with
>> >> >>>> >>>>>> either
>> >> >>>> >>>>>> ./pyspark
>> >> >>>> >>>>>> or ./spark-submit, but for things like linting and
>> >> >>>> >>>>>> development,
>> >> >>>> >>>>>> the
>> >> >>>> >>>>>> PYTHONPATH munging is very annoying.
>> >> >>>> >>>>>>
>> >> >>>> >>>>>> I don't think the version-mismatch issues are a compelling
>> >> >>>> >>>>>> reason
>> >> >>>> >>>>>> to not
>> >> >>>> >>>>>> go ahead with PyPI publishing. At runtime, we should
>> >> >>>> >>>>>> definitely
>> >> >>>> >>>>>> enforce that
>> >> >>>> >>>>>> the version has to be exact, which means there is no
>> >> >>>> >>>>>> backcompat
>> >> >>>> >>>>>> nightmare as
>> >> >>>> >>>>>> suggested by Davies in
>> >> >>>> >>>>>> https://issues.apache.org/jira/browse/SPARK-1267.
>> >> >>>> >>>>>> This would mean that even if the user got his pip installed
>> >> >>>> >>>>>> pyspark to
>> >> >>>> >>>>>> somehow get loaded before the spark distribution provided
>> >> >>>> >>>>>> pyspark, then the
>> >> >>>> >>>>>> user would be alerted immediately.
>> >> >>>> >>>>>>
>> >> >>>> >>>>>> Davies, if you buy this, should me or someone on my team
>> pick
>> >> >>>> >>>>>> up
>> >> >>>> >>>>>> https://issues.apache.org/jira/browse/SPARK-1267 and
>> >> >>>> >>>>>> https://github.com/apache/spark/pull/464?
>> >> >>>> >>>>>>
>> >> >>>> >>>>>> On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot
>> >> >>>> >>>>>> <o....@lateral-thoughts.com> wrote:
>> >> >>>> >>>>>>>
>> >> >>>> >>>>>>> Ok, I get it. Now what can we do to improve the current
>> >> >>>> >>>>>>> situation,
>> >> >>>> >>>>>>> because right now if I want to set-up a CI env for
>> PySpark, I
>> >> >>>> >>>>>>> have to :
>> >> >>>> >>>>>>> 1- download a pre-built version of pyspark and unzip it
>> >> >>>> >>>>>>> somewhere on
>> >> >>>> >>>>>>> every agent
>> >> >>>> >>>>>>> 2- define the SPARK_HOME env
>> >> >>>> >>>>>>> 3- symlink this distribution pyspark dir inside the python
>> >> >>>> >>>>>>> install dir
>> >> >>>> >>>>>>> site-packages/ directory
>> >> >>>> >>>>>>> and if I rely on additional packages (like databricks'
>> >> >>>> >>>>>>> Spark-CSV
>> >> >>>> >>>>>>> project), I have to (except if I'm mistaken)
>> >> >>>> >>>>>>> 4- compile/assembly spark-csv, deploy the jar in a
>> specific
>> >> >>>> >>>>>>> directory
>> >> >>>> >>>>>>> on every agent
>> >> >>>> >>>>>>> 5- add this jar-filled directory to the Spark
>> distribution's
>> >> >>>> >>>>>>> additional
>> >> >>>> >>>>>>> classpath using the conf/spark-default file
>> >> >>>> >>>>>>>
>> >> >>>> >>>>>>> Then finally we can launch our unit/integration-tests.
>> >> >>>> >>>>>>> Some issues are related to spark-packages, some to the
>> lack
>> >> >>>> >>>>>>> of
>> >> >>>> >>>>>>> python-based dependency, and some to the way SparkContext
>> are
>> >> >>>> >>>>>>> launched when
>> >> >>>> >>>>>>> using pyspark.
>> >> >>>> >>>>>>> I think step 1 and 2 are fair enough
>> >> >>>> >>>>>>> 4 and 5 may already have solutions, I didn't check and
>> >> >>>> >>>>>>> considering
>> >> >>>> >>>>>>> spark-shell is downloading such dependencies
>> automatically, I
>> >> >>>> >>>>>>> think if
>> >> >>>> >>>>>>> nothing's done yet it will (I guess ?).
>> >> >>>> >>>>>>>
>> >> >>>> >>>>>>> For step 3, maybe just adding a setup.py to the
>> distribution
>> >> >>>> >>>>>>> would be
>> >> >>>> >>>>>>> enough, I'm not exactly advocating to distribute a full
>> 300Mb
>> >> >>>> >>>>>>> spark
>> >> >>>> >>>>>>> distribution in PyPi, maybe there's a better compromise ?
>> >> >>>> >>>>>>>
>> >> >>>> >>>>>>> Regards,
>> >> >>>> >>>>>>>
>> >> >>>> >>>>>>> Olivier.
>> >> >>>> >>>>>>>
>> >> >>>> >>>>>>> Le ven. 5 juin 2015 à 22:12, Jey Kottalam
>> >> >>>> >>>>>>> <je...@cs.berkeley.edu>
>> >> >>>> >>>>>>> a écrit
>> >> >>>> >>>>>>> :
>> >> >>>> >>>>>>>>
>> >> >>>> >>>>>>>> Couldn't we have a pip installable "pyspark" package that
>> >> >>>> >>>>>>>> just
>> >> >>>> >>>>>>>> serves
>> >> >>>> >>>>>>>> as a shim to an existing Spark installation? Or it could
>> >> >>>> >>>>>>>> even
>> >> >>>> >>>>>>>> download the
>> >> >>>> >>>>>>>> latest Spark binary if SPARK_HOME isn't set during
>> >> >>>> >>>>>>>> installation. Right now,
>> >> >>>> >>>>>>>> Spark doesn't play very well with the usual Python
>> >> >>>> >>>>>>>> ecosystem.
>> >> >>>> >>>>>>>> For example,
>> >> >>>> >>>>>>>> why do I need to use a strange incantation when booting
>> up
>> >> >>>> >>>>>>>> IPython if I want
>> >> >>>> >>>>>>>> to use PySpark in a notebook with MASTER="local[4]"? It
>> >> >>>> >>>>>>>> would
>> >> >>>> >>>>>>>> be much nicer
>> >> >>>> >>>>>>>> to just type `from pyspark import SparkContext; sc =
>> >> >>>> >>>>>>>> SparkContext("local[4]")` in my notebook.
>> >> >>>> >>>>>>>>
>> >> >>>> >>>>>>>> I did a test and it seems like PySpark's basic
>> unit-tests do
>> >> >>>> >>>>>>>> pass when
>> >> >>>> >>>>>>>> SPARK_HOME is set and Py4J is on the PYTHONPATH:
>> >> >>>> >>>>>>>>
>> >> >>>> >>>>>>>>
>> >> >>>> >>>>>>>>
>> >> >>>> >>>>>>>>
>> >> >>>> >>>>>>>>
>> PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
>> >> >>>> >>>>>>>> python $SPARK_HOME/python/pyspark/rdd.py
>> >> >>>> >>>>>>>>
>> >> >>>> >>>>>>>> -Jey
>> >> >>>> >>>>>>>>
>> >> >>>> >>>>>>>>
>> >> >>>> >>>>>>>> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen
>> >> >>>> >>>>>>>> <ro...@gmail.com>
>> >> >>>> >>>>>>>> wrote:
>> >> >>>> >>>>>>>>>
>> >> >>>> >>>>>>>>> This has been proposed before:
>> >> >>>> >>>>>>>>> https://issues.apache.org/jira/browse/SPARK-1267
>> >> >>>> >>>>>>>>>
>> >> >>>> >>>>>>>>> There's currently tighter coupling between the Python
>> and
>> >> >>>> >>>>>>>>> Java
>> >> >>>> >>>>>>>>> halves
>> >> >>>> >>>>>>>>> of PySpark than just requiring SPARK_HOME to be set; if
>> we
>> >> >>>> >>>>>>>>> did
>> >> >>>> >>>>>>>>> this, I bet
>> >> >>>> >>>>>>>>> we'd run into tons of issues when users try to run a
>> newer
>> >> >>>> >>>>>>>>> version of the
>> >> >>>> >>>>>>>>> Python half of PySpark against an older set of Java
>> >> >>>> >>>>>>>>> components
>> >> >>>> >>>>>>>>> or
>> >> >>>> >>>>>>>>> vice-versa.
>> >> >>>> >>>>>>>>>
>> >> >>>> >>>>>>>>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot
>> >> >>>> >>>>>>>>> <o....@lateral-thoughts.com> wrote:
>> >> >>>> >>>>>>>>>>
>> >> >>>> >>>>>>>>>> Hi everyone,
>> >> >>>> >>>>>>>>>> Considering the python API as just a front needing the
>> >> >>>> >>>>>>>>>> SPARK_HOME
>> >> >>>> >>>>>>>>>> defined anyway, I think it would be interesting to
>> deploy
>> >> >>>> >>>>>>>>>> the
>> >> >>>> >>>>>>>>>> Python part of
>> >> >>>> >>>>>>>>>> Spark on PyPi in order to handle the dependencies in a
>> >> >>>> >>>>>>>>>> Python
>> >> >>>> >>>>>>>>>> project
>> >> >>>> >>>>>>>>>> needing PySpark via pip.
>> >> >>>> >>>>>>>>>>
>> >> >>>> >>>>>>>>>> For now I just symlink the python/pyspark in my python
>> >> >>>> >>>>>>>>>> install dir
>> >> >>>> >>>>>>>>>> site-packages/ in order for PyCharm or other lint
>> tools to
>> >> >>>> >>>>>>>>>> work properly.
>> >> >>>> >>>>>>>>>> I can do the setup.py work or anything.
>> >> >>>> >>>>>>>>>>
>> >> >>>> >>>>>>>>>> What do you think ?
>> >> >>>> >>>>>>>>>>
>> >> >>>> >>>>>>>>>> Regards,
>> >> >>>> >>>>>>>>>>
>> >> >>>> >>>>>>>>>> Olivier.
>> >> >>>> >>>>>>>>>
>> >> >>>> >>>>>>>>>
>> >> >>>> >>>>>>>>
>> >> >>>> >>>>>
>> >> >>>> >>>
>> >> >>>> >
>> >> >>>> >
>> >> >>>> >
>> >> >>>> > --
>> >> >>>> > Brian E. Granger
>> >> >>>> > Cal Poly State University, San Luis Obispo
>> >> >>>> > @ellisonbg on Twitter and GitHub
>> >> >>>> > bgranger@calpoly.edu and ellisonbg@gmail.com
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Brian E. Granger
>> >> Associate Professor of Physics and Data Science
>> >> Cal Poly State University, San Luis Obispo
>> >> @ellisonbg on Twitter and GitHub
>> >> bgranger@calpoly.edu and ellisonbg@gmail.com
>> >
>> >
>>
>>
>>
>> --
>> Brian E. Granger
>> Associate Professor of Physics and Data Science
>> Cal Poly State University, San Luis Obispo
>> @ellisonbg on Twitter and GitHub
>> bgranger@calpoly.edu and ellisonbg@gmail.com
>>
>
>
>
> --
> *Olivier Girardot* | Associé
> o.girardot@lateral-thoughts.com
> +33 6 24 09 17 94
>

Re: PySpark on PyPi

Posted by Brian Granger <el...@gmail.com>.
I would start with just the plain python package without the JAR and
then see if it makes sense to add the JAR over time.
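
To make "just the plain python package" concrete, a minimal setup.py along
these lines would probably be enough; the name, version and py4j pin are
placeholders, nothing here has been decided:

    from setuptools import setup, find_packages

    setup(
        name="pyspark",
        version="1.5.0",  # would track the Spark release it belongs to
        description="Python bindings for Apache Spark",
        packages=find_packages(exclude=["*.tests", "*.tests.*"]),
        install_requires=["py4j==0.8.2.1"],  # pinned to match python/lib
    )

If bundling the assembly JAR (discussed below) is ever wanted, it could be
added through package_data, at the cost of a very large upload.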

On Thu, Aug 20, 2015 at 12:27 PM, Auberon Lopez <au...@gmail.com> wrote:
> Hi all,
>
> I wanted to bubble up a conversation from the PR to this discussion to see
> if there is support for the idea of including a Spark assembly JAR in a PyPI
> release of pyspark. @holdenk recommended this as she already does so in the
> Sparkling Pandas package. Is this something people are interested in
> pursuing?
>
> -Auberon
>
> On Thu, Aug 20, 2015 at 10:03 AM, Brian Granger <el...@gmail.com> wrote:
>>
>> Auberon, can you also post this to the Jupyter Google Group?
>>
>> On Wed, Aug 19, 2015 at 12:23 PM, Auberon Lopez <au...@gmail.com>
>> wrote:
>> > Hi all,
>> >
>> > I've created an updated PR for this based off of the previous work of
>> > @prabinb:
>> > https://github.com/apache/spark/pull/8318
>> >
>> > I am not very familiar with python packaging; feedback is appreciated.
>> >
>> > -Auberon
>> >
>> > On Mon, Aug 10, 2015 at 12:45 PM, MinRK <be...@gmail.com> wrote:
>> >>
>> >>
>> >> On Mon, Aug 10, 2015 at 12:28 PM, Matt Goodman <me...@gmail.com>
>> >> wrote:
>> >>>
>> >>> I would tentatively suggest also conda packaging.
>> >>
>> >>
>> >> A conda package has the advantage that it can be set up without
>> >> 'installing' the pyspark files, while the PyPI packaging is still being
>> >> worked out. It can just add a pyspark.pth file pointing to pyspark,
>> >> py4j
>> >> locations. But I think it's a really good idea to package with conda.
>> >>
>> >> -MinRK
>> >>
>> >>>
>> >>>
>> >>> http://conda.pydata.org/docs/
>> >>>
>> >>> --Matthew Goodman
>> >>>
>> >>> =====================
>> >>> Check Out My Website: http://craneium.net
>> >>> Find me on LinkedIn: http://tinyurl.com/d6wlch
>> >>>
>> >>> On Mon, Aug 10, 2015 at 11:23 AM, Davies Liu <da...@databricks.com>
>> >>> wrote:
>> >>>>
>> >>>> I think so, any contributions on this are welcome.
>> >>>>
>> >>>> On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger <el...@gmail.com>
>> >>>> wrote:
>> >>>> > Sorry, trying to follow the context here. Does it look like there
>> >>>> > is
>> >>>> > support for the idea of creating a setup.py file and pypi package
>> >>>> > for
>> >>>> > pyspark?
>> >>>> >
>> >>>> > Cheers,
>> >>>> >
>> >>>> > Brian
>> >>>> >
>> >>>> > On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu <da...@databricks.com>
>> >>>> > wrote:
>> >>>> >> We could do that after 1.5 released, it will have same release
>> >>>> >> cycle
>> >>>> >> as Spark in the future.
>> >>>> >>
>> >>>> >> On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot
>> >>>> >> <o....@lateral-thoughts.com> wrote:
>> >>>> >>> +1 (once again :) )
>> >>>> >>>
>> >>>> >>> 2015-07-28 14:51 GMT+02:00 Justin Uang <ju...@gmail.com>:
>> >>>> >>>>
>> >>>> >>>> // ping
>> >>>> >>>>
>> >>>> >>>> do we have any signoff from the pyspark devs to submit a PR to
>> >>>> >>>> publish to
>> >>>> >>>> PyPI?
>> >>>> >>>>
>> >>>> >>>> On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman
>> >>>> >>>> <fr...@gmail.com>
>> >>>> >>>> wrote:
>> >>>> >>>>>
>> >>>> >>>>> Hey all, great discussion, just wanted to +1 that I see a lot
>> >>>> >>>>> of
>> >>>> >>>>> value in
>> >>>> >>>>> steps that make it easier to use PySpark as an ordinary python
>> >>>> >>>>> library.
>> >>>> >>>>>
>> >>>> >>>>> You might want to check out this
>> >>>> >>>>> (https://github.com/minrk/findspark),
>> >>>> >>>>> started by Jupyter project devs, that offers one way to
>> >>>> >>>>> facilitate
>> >>>> >>>>> this
>> >>>> >>>>> stuff. I’ve also cced them here to join the conversation.
>> >>>> >>>>>
>> >>>> >>>>> Also, @Jey, I can also confirm that at least in some scenarios
>> >>>> >>>>> (I’ve done
>> >>>> >>>>> it in an EC2 cluster in standalone mode) it’s possible to run
>> >>>> >>>>> PySpark jobs
>> >>>> >>>>> just using `from pyspark import SparkContext; sc =
>> >>>> >>>>> SparkContext(master=“X”)`
>> >>>> >>>>> so long as the environmental variables (PYTHONPATH and
>> >>>> >>>>> PYSPARK_PYTHON) are
>> >>>> >>>>> set correctly on *both* workers and driver. That said, there’s
>> >>>> >>>>> definitely
>> >>>> >>>>> additional configuration / functionality that would require
>> >>>> >>>>> going
>> >>>> >>>>> through
>> >>>> >>>>> the proper submit scripts.
>> >>>> >>>>>
>> >>>> >>>>> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal
>> >>>> >>>>> <pu...@gmail.com>
>> >>>> >>>>> wrote:
>> >>>> >>>>>
>> >>>> >>>>> I agree with everything Justin just said. An additional
>> >>>> >>>>> advantage
>> >>>> >>>>> of
>> >>>> >>>>> publishing PySpark's Python code in a standards-compliant way
>> >>>> >>>>> is
>> >>>> >>>>> the fact
>> >>>> >>>>> that we'll be able to declare transitive dependencies (Pandas,
>> >>>> >>>>> Py4J) in a
>> >>>> >>>>> way that pip can use. Contrast this with the current situation,
>> >>>> >>>>> where
>> >>>> >>>>> df.toPandas() exists in the Spark API but doesn't actually work
>> >>>> >>>>> until you
>> >>>> >>>>> install Pandas.
>> >>>> >>>>>
>> >>>> >>>>> Punya
>> >>>> >>>>> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang
>> >>>> >>>>> <ju...@gmail.com>
>> >>>> >>>>> wrote:
>> >>>> >>>>>>
>> >>>> >>>>>> // + Davies for his comments
>> >>>> >>>>>> // + Punya for SA
>> >>>> >>>>>>
>> >>>> >>>>>> For development and CI, like Olivier mentioned, I think it
>> >>>> >>>>>> would
>> >>>> >>>>>> be
>> >>>> >>>>>> hugely beneficial to publish pyspark (only code in the python/
>> >>>> >>>>>> dir) on PyPI.
>> >>>> >>>>>> If anyone wants to develop against PySpark APIs, they need to
>> >>>> >>>>>> download the
>> >>>> >>>>>> distribution and do a lot of PYTHONPATH munging for all the
>> >>>> >>>>>> tools
>> >>>> >>>>>> (pylint,
>> >>>> >>>>>> pytest, IDE code completion). Right now that involves adding
>> >>>> >>>>>> python/ and
>> >>>> >>>>>> python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to
>> >>>> >>>>>> add more
>> >>>> >>>>>> dependencies, we would have to manually mirror all the
>> >>>> >>>>>> PYTHONPATH
>> >>>> >>>>>> munging in
>> >>>> >>>>>> the ./pyspark script. With a proper pyspark setup.py which
>> >>>> >>>>>> declares its
>> >>>> >>>>>> dependencies, and a published distribution, depending on
>> >>>> >>>>>> pyspark
>> >>>> >>>>>> will just
>> >>>> >>>>>> be adding pyspark to my setup.py dependencies.
>> >>>> >>>>>>
>> >>>> >>>>>> Of course, if we actually want to run parts of pyspark that is
>> >>>> >>>>>> backed by
>> >>>> >>>>>> Py4J calls, then we need the full spark distribution with
>> >>>> >>>>>> either
>> >>>> >>>>>> ./pyspark
>> >>>> >>>>>> or ./spark-submit, but for things like linting and
>> >>>> >>>>>> development,
>> >>>> >>>>>> the
>> >>>> >>>>>> PYTHONPATH munging is very annoying.
>> >>>> >>>>>>
>> >>>> >>>>>> I don't think the version-mismatch issues are a compelling
>> >>>> >>>>>> reason
>> >>>> >>>>>> to not
>> >>>> >>>>>> go ahead with PyPI publishing. At runtime, we should
>> >>>> >>>>>> definitely
>> >>>> >>>>>> enforce that
>> >>>> >>>>>> the version has to be exact, which means there is no
>> >>>> >>>>>> backcompat
>> >>>> >>>>>> nightmare as
>> >>>> >>>>>> suggested by Davies in
>> >>>> >>>>>> https://issues.apache.org/jira/browse/SPARK-1267.
>> >>>> >>>>>> This would mean that even if the user got his pip installed
>> >>>> >>>>>> pyspark to
>> >>>> >>>>>> somehow get loaded before the spark distribution provided
>> >>>> >>>>>> pyspark, then the
>> >>>> >>>>>> user would be alerted immediately.
>> >>>> >>>>>>
>> >>>> >>>>>> Davies, if you buy this, should me or someone on my team pick
>> >>>> >>>>>> up
>> >>>> >>>>>> https://issues.apache.org/jira/browse/SPARK-1267 and
>> >>>> >>>>>> https://github.com/apache/spark/pull/464?
>> >>>> >>>>>>
>> >>>> >>>>>> On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot
>> >>>> >>>>>> <o....@lateral-thoughts.com> wrote:
>> >>>> >>>>>>>
>> >>>> >>>>>>> Ok, I get it. Now what can we do to improve the current
>> >>>> >>>>>>> situation,
>> >>>> >>>>>>> because right now if I want to set-up a CI env for PySpark, I
>> >>>> >>>>>>> have to :
>> >>>> >>>>>>> 1- download a pre-built version of pyspark and unzip it
>> >>>> >>>>>>> somewhere on
>> >>>> >>>>>>> every agent
>> >>>> >>>>>>> 2- define the SPARK_HOME env
>> >>>> >>>>>>> 3- symlink this distribution pyspark dir inside the python
>> >>>> >>>>>>> install dir
>> >>>> >>>>>>> site-packages/ directory
>> >>>> >>>>>>> and if I rely on additional packages (like databricks'
>> >>>> >>>>>>> Spark-CSV
>> >>>> >>>>>>> project), I have to (except if I'm mistaken)
>> >>>> >>>>>>> 4- compile/assembly spark-csv, deploy the jar in a specific
>> >>>> >>>>>>> directory
>> >>>> >>>>>>> on every agent
>> >>>> >>>>>>> 5- add this jar-filled directory to the Spark distribution's
>> >>>> >>>>>>> additional
>> >>>> >>>>>>> classpath using the conf/spark-default file
>> >>>> >>>>>>>
>> >>>> >>>>>>> Then finally we can launch our unit/integration-tests.
>> >>>> >>>>>>> Some issues are related to spark-packages, some to the lack
>> >>>> >>>>>>> of
>> >>>> >>>>>>> python-based dependency, and some to the way SparkContext are
>> >>>> >>>>>>> launched when
>> >>>> >>>>>>> using pyspark.
>> >>>> >>>>>>> I think step 1 and 2 are fair enough
>> >>>> >>>>>>> 4 and 5 may already have solutions, I didn't check and
>> >>>> >>>>>>> considering
>> >>>> >>>>>>> spark-shell is downloading such dependencies automatically, I
>> >>>> >>>>>>> think if
>> >>>> >>>>>>> nothing's done yet it will (I guess ?).
>> >>>> >>>>>>>
>> >>>> >>>>>>> For step 3, maybe just adding a setup.py to the distribution
>> >>>> >>>>>>> would be
>> >>>> >>>>>>> enough, I'm not exactly advocating to distribute a full 300Mb
>> >>>> >>>>>>> spark
>> >>>> >>>>>>> distribution in PyPi, maybe there's a better compromise ?
>> >>>> >>>>>>>
>> >>>> >>>>>>> Regards,
>> >>>> >>>>>>>
>> >>>> >>>>>>> Olivier.
>> >>>> >>>>>>>
>> >>>> >>>>>>> Le ven. 5 juin 2015 à 22:12, Jey Kottalam
>> >>>> >>>>>>> <je...@cs.berkeley.edu>
>> >>>> >>>>>>> a écrit
>> >>>> >>>>>>> :
>> >>>> >>>>>>>>
>> >>>> >>>>>>>> Couldn't we have a pip installable "pyspark" package that
>> >>>> >>>>>>>> just
>> >>>> >>>>>>>> serves
>> >>>> >>>>>>>> as a shim to an existing Spark installation? Or it could
>> >>>> >>>>>>>> even
>> >>>> >>>>>>>> download the
>> >>>> >>>>>>>> latest Spark binary if SPARK_HOME isn't set during
>> >>>> >>>>>>>> installation. Right now,
>> >>>> >>>>>>>> Spark doesn't play very well with the usual Python
>> >>>> >>>>>>>> ecosystem.
>> >>>> >>>>>>>> For example,
>> >>>> >>>>>>>> why do I need to use a strange incantation when booting up
>> >>>> >>>>>>>> IPython if I want
>> >>>> >>>>>>>> to use PySpark in a notebook with MASTER="local[4]"? It
>> >>>> >>>>>>>> would
>> >>>> >>>>>>>> be much nicer
>> >>>> >>>>>>>> to just type `from pyspark import SparkContext; sc =
>> >>>> >>>>>>>> SparkContext("local[4]")` in my notebook.
>> >>>> >>>>>>>>
>> >>>> >>>>>>>> I did a test and it seems like PySpark's basic unit-tests do
>> >>>> >>>>>>>> pass when
>> >>>> >>>>>>>> SPARK_HOME is set and Py4J is on the PYTHONPATH:
>> >>>> >>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>>>> PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
>> >>>> >>>>>>>> python $SPARK_HOME/python/pyspark/rdd.py
>> >>>> >>>>>>>>
>> >>>> >>>>>>>> -Jey
>> >>>> >>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>>>> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen
>> >>>> >>>>>>>> <ro...@gmail.com>
>> >>>> >>>>>>>> wrote:
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>> This has been proposed before:
>> >>>> >>>>>>>>> https://issues.apache.org/jira/browse/SPARK-1267
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>> There's currently tighter coupling between the Python and
>> >>>> >>>>>>>>> Java
>> >>>> >>>>>>>>> halves
>> >>>> >>>>>>>>> of PySpark than just requiring SPARK_HOME to be set; if we
>> >>>> >>>>>>>>> did
>> >>>> >>>>>>>>> this, I bet
>> >>>> >>>>>>>>> we'd run into tons of issues when users try to run a newer
>> >>>> >>>>>>>>> version of the
>> >>>> >>>>>>>>> Python half of PySpark against an older set of Java
>> >>>> >>>>>>>>> components
>> >>>> >>>>>>>>> or
>> >>>> >>>>>>>>> vice-versa.
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot
>> >>>> >>>>>>>>> <o....@lateral-thoughts.com> wrote:
>> >>>> >>>>>>>>>>
>> >>>> >>>>>>>>>> Hi everyone,
>> >>>> >>>>>>>>>> Considering the python API as just a front needing the
>> >>>> >>>>>>>>>> SPARK_HOME
>> >>>> >>>>>>>>>> defined anyway, I think it would be interesting to deploy
>> >>>> >>>>>>>>>> the
>> >>>> >>>>>>>>>> Python part of
>> >>>> >>>>>>>>>> Spark on PyPi in order to handle the dependencies in a
>> >>>> >>>>>>>>>> Python
>> >>>> >>>>>>>>>> project
>> >>>> >>>>>>>>>> needing PySpark via pip.
>> >>>> >>>>>>>>>>
>> >>>> >>>>>>>>>> For now I just symlink the python/pyspark in my python
>> >>>> >>>>>>>>>> install dir
>> >>>> >>>>>>>>>> site-packages/ in order for PyCharm or other lint tools to
>> >>>> >>>>>>>>>> work properly.
>> >>>> >>>>>>>>>> I can do the setup.py work or anything.
>> >>>> >>>>>>>>>>
>> >>>> >>>>>>>>>> What do you think ?
>> >>>> >>>>>>>>>>
>> >>>> >>>>>>>>>> Regards,
>> >>>> >>>>>>>>>>
>> >>>> >>>>>>>>>> Olivier.
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>
>> >>>> >>>
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > --
>> >>>> > Brian E. Granger
>> >>>> > Cal Poly State University, San Luis Obispo
>> >>>> > @ellisonbg on Twitter and GitHub
>> >>>> > bgranger@calpoly.edu and ellisonbg@gmail.com
>> >>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>>
>>
>> --
>> Brian E. Granger
>> Associate Professor of Physics and Data Science
>> Cal Poly State University, San Luis Obispo
>> @ellisonbg on Twitter and GitHub
>> bgranger@calpoly.edu and ellisonbg@gmail.com
>
>



-- 
Brian E. Granger
Associate Professor of Physics and Data Science
Cal Poly State University, San Luis Obispo
@ellisonbg on Twitter and GitHub
bgranger@calpoly.edu and ellisonbg@gmail.com



Re: PySpark on PyPi

Posted by Brian Granger <el...@gmail.com>.
Auberon, can you also post this to the Jupyter Google Group?

On Wed, Aug 19, 2015 at 12:23 PM, Auberon Lopez <au...@gmail.com> wrote:
> Hi all,
>
> I've created an updated PR for this based off of the previous work of
> @prabinb:
> https://github.com/apache/spark/pull/8318
>
> I am not very familiar with python packaging; feedback is appreciated.
>
> -Auberon
>
> On Mon, Aug 10, 2015 at 12:45 PM, MinRK <be...@gmail.com> wrote:
>>
>>
>> On Mon, Aug 10, 2015 at 12:28 PM, Matt Goodman <me...@gmail.com> wrote:
>>>
>>> I would tentatively suggest also conda packaging.
>>
>>
>> A conda package has the advantage that it can be set up without
>> 'installing' the pyspark files, while the PyPI packaging is still being
>> worked out. It can just add a pyspark.pth file pointing to pyspark, py4j
>> locations. But I think it's a really good idea to package with conda.
>>
>> -MinRK
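
As a concrete illustration of the .pth idea: a conda recipe (or any
installer) could drop a one-line-per-path pyspark.pth into site-packages so
that `import pyspark` resolves against an existing distribution. The
generator below is only a sketch; the py4j file name and the use of
site.getsitepackages() are assumptions.

    import os
    import site

    # Lines of a .pth file are appended to sys.path at interpreter startup.
    spark_home = os.environ["SPARK_HOME"]
    entries = [
        os.path.join(spark_home, "python"),
        os.path.join(spark_home, "python", "lib", "py4j-0.8.2.1-src.zip"),
    ]
    target = os.path.join(site.getsitepackages()[0], "pyspark.pth")
    with open(target, "w") as f:
        f.write("\n".join(entries) + "\n")
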
>>
>>>
>>>
>>> http://conda.pydata.org/docs/
>>>
>>> --Matthew Goodman
>>>
>>> =====================
>>> Check Out My Website: http://craneium.net
>>> Find me on LinkedIn: http://tinyurl.com/d6wlch
>>>
>>> On Mon, Aug 10, 2015 at 11:23 AM, Davies Liu <da...@databricks.com>
>>> wrote:
>>>>
>>>> I think so, any contributions on this are welcome.
>>>>
>>>> On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger <el...@gmail.com>
>>>> wrote:
>>>> > Sorry, trying to follow the context here. Does it look like there is
>>>> > support for the idea of creating a setup.py file and pypi package for
>>>> > pyspark?
>>>> >
>>>> > Cheers,
>>>> >
>>>> > Brian
>>>> >
>>>> > On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu <da...@databricks.com>
>>>> > wrote:
>>>> >> We could do that after 1.5 released, it will have same release cycle
>>>> >> as Spark in the future.
>>>> >>
>>>> >> On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot
>>>> >> <o....@lateral-thoughts.com> wrote:
>>>> >>> +1 (once again :) )
>>>> >>>
>>>> >>> 2015-07-28 14:51 GMT+02:00 Justin Uang <ju...@gmail.com>:
>>>> >>>>
>>>> >>>> // ping
>>>> >>>>
>>>> >>>> do we have any signoff from the pyspark devs to submit a PR to
>>>> >>>> publish to
>>>> >>>> PyPI?
>>>> >>>>
>>>> >>>> On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman
>>>> >>>> <fr...@gmail.com>
>>>> >>>> wrote:
>>>> >>>>>
>>>> >>>>> Hey all, great discussion, just wanted to +1 that I see a lot of
>>>> >>>>> value in
>>>> >>>>> steps that make it easier to use PySpark as an ordinary python
>>>> >>>>> library.
>>>> >>>>>
>>>> >>>>> You might want to check out this
>>>> >>>>> (https://github.com/minrk/findspark),
>>>> >>>>> started by Jupyter project devs, that offers one way to facilitate
>>>> >>>>> this
>>>> >>>>> stuff. I’ve also cced them here to join the conversation.
>>>> >>>>>
>>>> >>>>> Also, @Jey, I can also confirm that at least in some scenarios
>>>> >>>>> (I’ve done
>>>> >>>>> it in an EC2 cluster in standalone mode) it’s possible to run
>>>> >>>>> PySpark jobs
>>>> >>>>> just using `from pyspark import SparkContext; sc =
>>>> >>>>> SparkContext(master=“X”)`
>>>> >>>>> so long as the environmental variables (PYTHONPATH and
>>>> >>>>> PYSPARK_PYTHON) are
>>>> >>>>> set correctly on *both* workers and driver. That said, there’s
>>>> >>>>> definitely
>>>> >>>>> additional configuration / functionality that would require going
>>>> >>>>> through
>>>> >>>>> the proper submit scripts.
>>>> >>>>>
>>>> >>>>> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal
>>>> >>>>> <pu...@gmail.com>
>>>> >>>>> wrote:
>>>> >>>>>
>>>> >>>>> I agree with everything Justin just said. An additional advantage
>>>> >>>>> of
>>>> >>>>> publishing PySpark's Python code in a standards-compliant way is
>>>> >>>>> the fact
>>>> >>>>> that we'll be able to declare transitive dependencies (Pandas,
>>>> >>>>> Py4J) in a
>>>> >>>>> way that pip can use. Contrast this with the current situation,
>>>> >>>>> where
>>>> >>>>> df.toPandas() exists in the Spark API but doesn't actually work
>>>> >>>>> until you
>>>> >>>>> install Pandas.
>>>> >>>>>
>>>> >>>>> Punya
>>>> >>>>> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang
>>>> >>>>> <ju...@gmail.com>
>>>> >>>>> wrote:
>>>> >>>>>>
>>>> >>>>>> // + Davies for his comments
>>>> >>>>>> // + Punya for SA
>>>> >>>>>>
>>>> >>>>>> For development and CI, like Olivier mentioned, I think it would
>>>> >>>>>> be
>>>> >>>>>> hugely beneficial to publish pyspark (only code in the python/
>>>> >>>>>> dir) on PyPI.
>>>> >>>>>> If anyone wants to develop against PySpark APIs, they need to
>>>> >>>>>> download the
>>>> >>>>>> distribution and do a lot of PYTHONPATH munging for all the tools
>>>> >>>>>> (pylint,
>>>> >>>>>> pytest, IDE code completion). Right now that involves adding
>>>> >>>>>> python/ and
>>>> >>>>>> python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to
>>>> >>>>>> add more
>>>> >>>>>> dependencies, we would have to manually mirror all the PYTHONPATH
>>>> >>>>>> munging in
>>>> >>>>>> the ./pyspark script. With a proper pyspark setup.py which
>>>> >>>>>> declares its
>>>> >>>>>> dependencies, and a published distribution, depending on pyspark
>>>> >>>>>> will just
>>>> >>>>>> be adding pyspark to my setup.py dependencies.
>>>> >>>>>>
>>>> >>>>>> Of course, if we actually want to run parts of pyspark that is
>>>> >>>>>> backed by
>>>> >>>>>> Py4J calls, then we need the full spark distribution with either
>>>> >>>>>> ./pyspark
>>>> >>>>>> or ./spark-submit, but for things like linting and development,
>>>> >>>>>> the
>>>> >>>>>> PYTHONPATH munging is very annoying.
>>>> >>>>>>
>>>> >>>>>> I don't think the version-mismatch issues are a compelling reason
>>>> >>>>>> to not
>>>> >>>>>> go ahead with PyPI publishing. At runtime, we should definitely
>>>> >>>>>> enforce that
>>>> >>>>>> the version has to be exact, which means there is no backcompat
>>>> >>>>>> nightmare as
>>>> >>>>>> suggested by Davies in
>>>> >>>>>> https://issues.apache.org/jira/browse/SPARK-1267.
>>>> >>>>>> This would mean that even if the user got his pip installed
>>>> >>>>>> pyspark to
>>>> >>>>>> somehow get loaded before the spark distribution provided
>>>> >>>>>> pyspark, then the
>>>> >>>>>> user would be alerted immediately.
>>>> >>>>>>
>>>> >>>>>> Davies, if you buy this, should me or someone on my team pick up
>>>> >>>>>> https://issues.apache.org/jira/browse/SPARK-1267 and
>>>> >>>>>> https://github.com/apache/spark/pull/464?
>>>> >>>>>>
>>>> >>>>>> On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot
>>>> >>>>>> <o....@lateral-thoughts.com> wrote:
>>>> >>>>>>>
>>>> >>>>>>> Ok, I get it. Now what can we do to improve the current
>>>> >>>>>>> situation,
>>>> >>>>>>> because right now if I want to set-up a CI env for PySpark, I
>>>> >>>>>>> have to :
>>>> >>>>>>> 1- download a pre-built version of pyspark and unzip it
>>>> >>>>>>> somewhere on
>>>> >>>>>>> every agent
>>>> >>>>>>> 2- define the SPARK_HOME env
>>>> >>>>>>> 3- symlink this distribution pyspark dir inside the python
>>>> >>>>>>> install dir
>>>> >>>>>>> site-packages/ directory
>>>> >>>>>>> and if I rely on additional packages (like databricks' Spark-CSV
>>>> >>>>>>> project), I have to (except if I'm mistaken)
>>>> >>>>>>> 4- compile/assembly spark-csv, deploy the jar in a specific
>>>> >>>>>>> directory
>>>> >>>>>>> on every agent
>>>> >>>>>>> 5- add this jar-filled directory to the Spark distribution's
>>>> >>>>>>> additional
>>>> >>>>>>> classpath using the conf/spark-default file
>>>> >>>>>>>
>>>> >>>>>>> Then finally we can launch our unit/integration-tests.
>>>> >>>>>>> Some issues are related to spark-packages, some to the lack of
>>>> >>>>>>> python-based dependency, and some to the way SparkContext are
>>>> >>>>>>> launched when
>>>> >>>>>>> using pyspark.
>>>> >>>>>>> I think step 1 and 2 are fair enough
>>>> >>>>>>> 4 and 5 may already have solutions, I didn't check and
>>>> >>>>>>> considering
>>>> >>>>>>> spark-shell is downloading such dependencies automatically, I
>>>> >>>>>>> think if
>>>> >>>>>>> nothing's done yet it will (I guess ?).
>>>> >>>>>>>
>>>> >>>>>>> For step 3, maybe just adding a setup.py to the distribution
>>>> >>>>>>> would be
>>>> >>>>>>> enough, I'm not exactly advocating to distribute a full 300Mb
>>>> >>>>>>> spark
>>>> >>>>>>> distribution in PyPi, maybe there's a better compromise ?
>>>> >>>>>>>
>>>> >>>>>>> Regards,
>>>> >>>>>>>
>>>> >>>>>>> Olivier.
>>>> >>>>>>>
>>>> >>>>>>> Le ven. 5 juin 2015 à 22:12, Jey Kottalam <je...@cs.berkeley.edu>
>>>> >>>>>>> a écrit
>>>> >>>>>>> :
>>>> >>>>>>>>
>>>> >>>>>>>> Couldn't we have a pip installable "pyspark" package that just
>>>> >>>>>>>> serves
>>>> >>>>>>>> as a shim to an existing Spark installation? Or it could even
>>>> >>>>>>>> download the
>>>> >>>>>>>> latest Spark binary if SPARK_HOME isn't set during
>>>> >>>>>>>> installation. Right now,
>>>> >>>>>>>> Spark doesn't play very well with the usual Python ecosystem.
>>>> >>>>>>>> For example,
>>>> >>>>>>>> why do I need to use a strange incantation when booting up
>>>> >>>>>>>> IPython if I want
>>>> >>>>>>>> to use PySpark in a notebook with MASTER="local[4]"? It would
>>>> >>>>>>>> be much nicer
>>>> >>>>>>>> to just type `from pyspark import SparkContext; sc =
>>>> >>>>>>>> SparkContext("local[4]")` in my notebook.
>>>> >>>>>>>>
>>>> >>>>>>>> I did a test and it seems like PySpark's basic unit-tests do
>>>> >>>>>>>> pass when
>>>> >>>>>>>> SPARK_HOME is set and Py4J is on the PYTHONPATH:
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>> PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
>>>> >>>>>>>> python $SPARK_HOME/python/pyspark/rdd.py
>>>> >>>>>>>>
>>>> >>>>>>>> -Jey
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen
>>>> >>>>>>>> <ro...@gmail.com>
>>>> >>>>>>>> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>> This has been proposed before:
>>>> >>>>>>>>> https://issues.apache.org/jira/browse/SPARK-1267
>>>> >>>>>>>>>
>>>> >>>>>>>>> There's currently tighter coupling between the Python and Java
>>>> >>>>>>>>> halves
>>>> >>>>>>>>> of PySpark than just requiring SPARK_HOME to be set; if we did
>>>> >>>>>>>>> this, I bet
>>>> >>>>>>>>> we'd run into tons of issues when users try to run a newer
>>>> >>>>>>>>> version of the
>>>> >>>>>>>>> Python half of PySpark against an older set of Java components
>>>> >>>>>>>>> or
>>>> >>>>>>>>> vice-versa.
>>>> >>>>>>>>>
>>>> >>>>>>>>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot
>>>> >>>>>>>>> <o....@lateral-thoughts.com> wrote:
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Hi everyone,
>>>> >>>>>>>>>> Considering the python API as just a front needing the
>>>> >>>>>>>>>> SPARK_HOME
>>>> >>>>>>>>>> defined anyway, I think it would be interesting to deploy the
>>>> >>>>>>>>>> Python part of
>>>> >>>>>>>>>> Spark on PyPi in order to handle the dependencies in a Python
>>>> >>>>>>>>>> project
>>>> >>>>>>>>>> needing PySpark via pip.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> For now I just symlink the python/pyspark in my python
>>>> >>>>>>>>>> install dir
>>>> >>>>>>>>>> site-packages/ in order for PyCharm or other lint tools to
>>>> >>>>>>>>>> work properly.
>>>> >>>>>>>>>> I can do the setup.py work or anything.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> What do you think ?
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Regards,
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Olivier.
>>>> >>>>>>>>>
>>>> >>>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>
>>>> >>>
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Brian E. Granger
>>>> > Cal Poly State University, San Luis Obispo
>>>> > @ellisonbg on Twitter and GitHub
>>>> > bgranger@calpoly.edu and ellisonbg@gmail.com
>>>>
>>>>
>>>
>>
>



-- 
Brian E. Granger
Associate Professor of Physics and Data Science
Cal Poly State University, San Luis Obispo
@ellisonbg on Twitter and GitHub
bgranger@calpoly.edu and ellisonbg@gmail.com



Re: PySpark on PyPi

Posted by Matt Goodman <me...@gmail.com>.
I would tentatively suggest also conda packaging.

http://conda.pydata.org/docs/

--Matthew Goodman

=====================
Check Out My Website: http://craneium.net
Find me on LinkedIn: http://tinyurl.com/d6wlch

On Mon, Aug 10, 2015 at 11:23 AM, Davies Liu <da...@databricks.com> wrote:

> I think so, any contributions on this are welcome.
>
> On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger <el...@gmail.com>
> wrote:
> > Sorry, trying to follow the context here. Does it look like there is
> > support for the idea of creating a setup.py file and pypi package for
> > pyspark?
> >
> > Cheers,
> >
> > Brian
> >
> > On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu <da...@databricks.com>
> wrote:
> >> We could do that after 1.5 released, it will have same release cycle
> >> as Spark in the future.
> >>
> >> On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot
> >> <o....@lateral-thoughts.com> wrote:
> >>> +1 (once again :) )
> >>>
> >>> 2015-07-28 14:51 GMT+02:00 Justin Uang <ju...@gmail.com>:
> >>>>
> >>>> // ping
> >>>>
> >>>> do we have any signoff from the pyspark devs to submit a PR to
> publish to
> >>>> PyPI?
> >>>>
> >>>> On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman <
> freeman.jeremy@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> Hey all, great discussion, just wanted to +1 that I see a lot of
> value in
> >>>>> steps that make it easier to use PySpark as an ordinary python
> library.
> >>>>>
> >>>>> You might want to check out this (https://github.com/minrk/findspark
> ),
> >>>>> started by Jupyter project devs, that offers one way to facilitate
> this
> >>>>> stuff. I’ve also cced them here to join the conversation.
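
For reference, the findspark helper linked above is used roughly like this
(the master URL is arbitrary, and `findspark.init()` can also be pointed at
an explicit Spark location):

    import findspark

    # Locates a Spark install (via SPARK_HOME or common locations) and puts
    # its python/ and py4j paths on sys.path before pyspark is imported.
    findspark.init()  # or findspark.init("/path/to/spark")

    from pyspark import SparkContext

    sc = SparkContext(master="local[2]", appName="findspark-demo")
    print(sc.parallelize([1, 2, 3]).count())
    sc.stop()
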
> >>>>>
> >>>>> Also, @Jey, I can also confirm that at least in some scenarios (I’ve
> done
> >>>>> it in an EC2 cluster in standalone mode) it’s possible to run PySpark jobs
> >>>>> just using `from pyspark import SparkContext; sc = SparkContext(master="X")`
> >>>>> so long as the environmental variables (PYTHONPATH and PYSPARK_PYTHON) are
> >>>>> set correctly on *both* workers and driver. That said, there’s definitely
> >>>>> additional configuration / functionality that would require going through
> >>>>> the proper submit scripts.
> >>>>>
> >>>>> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal <punya.biswal@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>> I agree with everything Justin just said. An additional advantage of
> >>>>> publishing PySpark's Python code in a standards-compliant way is the fact
> >>>>> that we'll be able to declare transitive dependencies (Pandas, Py4J) in a
> >>>>> way that pip can use. Contrast this with the current situation, where
> >>>>> df.toPandas() exists in the Spark API but doesn't actually work until you
> >>>>> install Pandas.
> >>>>>
> >>>>> Punya
> >>>>> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang <ju...@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> // + Davies for his comments
> >>>>>> // + Punya for SA
> >>>>>>
> >>>>>> For development and CI, like Olivier mentioned, I think it would be
> >>>>>> hugely beneficial to publish pyspark (only code in the python/ dir) on PyPI.
> >>>>>> If anyone wants to develop against PySpark APIs, they need to download the
> >>>>>> distribution and do a lot of PYTHONPATH munging for all the tools (pylint,
> >>>>>> pytest, IDE code completion). Right now that involves adding python/ and
> >>>>>> python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to add more
> >>>>>> dependencies, we would have to manually mirror all the PYTHONPATH munging in
> >>>>>> the ./pyspark script. With a proper pyspark setup.py which declares its
> >>>>>> dependencies, and a published distribution, depending on pyspark will just
> >>>>>> be adding pyspark to my setup.py dependencies.
> >>>>>>
> >>>>>> Of course, if we actually want to run parts of pyspark that are backed by
> >>>>>> Py4J calls, then we need the full spark distribution with either ./pyspark
> >>>>>> or ./spark-submit, but for things like linting and development, the
> >>>>>> PYTHONPATH munging is very annoying.
> >>>>>>
> >>>>>> I don't think the version-mismatch issues are a compelling reason to not
> >>>>>> go ahead with PyPI publishing. At runtime, we should definitely enforce that
> >>>>>> the version has to be exact, which means there is no backcompat nightmare as
> >>>>>> suggested by Davies in https://issues.apache.org/jira/browse/SPARK-1267.
> >>>>>> This would mean that even if the user got his pip installed pyspark to
> >>>>>> somehow get loaded before the spark distribution provided pyspark, then the
> >>>>>> user would be alerted immediately.
> >>>>>>
> >>>>>> Davies, if you buy this, should I or someone on my team pick up
> >>>>>> https://issues.apache.org/jira/browse/SPARK-1267 and
> >>>>>> https://github.com/apache/spark/pull/464?
> >>>>>>
> >>>>>> On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot
> >>>>>> <o....@lateral-thoughts.com> wrote:
> >>>>>>>
> >>>>>>> Ok, I get it. Now what can we do to improve the current situation,
> >>>>>>> because right now if I want to set up a CI env for PySpark, I have to:
> >>>>>>> 1- download a pre-built version of pyspark and unzip it somewhere on
> >>>>>>> every agent
> >>>>>>> 2- define the SPARK_HOME env
> >>>>>>> 3- symlink this distribution's pyspark dir into the Python install's
> >>>>>>> site-packages/ directory
> >>>>>>> and if I rely on additional packages (like Databricks' Spark-CSV
> >>>>>>> project), I have to (unless I'm mistaken)
> >>>>>>> 4- compile/assembly spark-csv, deploy the jar in a specific directory
> >>>>>>> on every agent
> >>>>>>> 5- add this jar-filled directory to the Spark distribution's additional
> >>>>>>> classpath using the conf/spark-default file
> >>>>>>>
> >>>>>>> Then finally we can launch our unit/integration-tests.
> >>>>>>> Some issues are related to spark-packages, some to the lack of
> >>>>>>> Python-based dependency management, and some to the way SparkContexts are
> >>>>>>> launched when using pyspark.
> >>>>>>> I think steps 1 and 2 are fair enough.
> >>>>>>> Steps 4 and 5 may already have solutions; I didn't check, and considering
> >>>>>>> spark-shell downloads such dependencies automatically, I think if nothing's
> >>>>>>> done yet it will be (I guess?).
> >>>>>>>
> >>>>>>> For step 3, maybe just adding a setup.py to the distribution would be
> >>>>>>> enough; I'm not exactly advocating distributing a full 300 MB Spark
> >>>>>>> distribution on PyPI, maybe there's a better compromise?
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>>
> >>>>>>> Olivier.
> >>>>>>>
> >>>>>>> On Fri, Jun 5, 2015 at 22:12, Jey Kottalam <je...@cs.berkeley.edu> wrote:
> >>>>>>>>
> >>>>>>>> Couldn't we have a pip installable "pyspark" package that just serves
> >>>>>>>> as a shim to an existing Spark installation? Or it could even download the
> >>>>>>>> latest Spark binary if SPARK_HOME isn't set during installation. Right now,
> >>>>>>>> Spark doesn't play very well with the usual Python ecosystem. For example,
> >>>>>>>> why do I need to use a strange incantation when booting up IPython if I want
> >>>>>>>> to use PySpark in a notebook with MASTER="local[4]"? It would be much nicer
> >>>>>>>> to just type `from pyspark import SparkContext; sc =
> >>>>>>>> SparkContext("local[4]")` in my notebook.
> >>>>>>>>
> >>>>>>>> I did a test and it seems like PySpark's basic unit-tests do pass when
> >>>>>>>> SPARK_HOME is set and Py4J is on the PYTHONPATH:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
> >>>>>>>> python $SPARK_HOME/python/pyspark/rdd.py
> >>>>>>>>
> >>>>>>>> -Jey
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen <rosenville@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> This has been proposed before:
> >>>>>>>>> https://issues.apache.org/jira/browse/SPARK-1267
> >>>>>>>>>
> >>>>>>>>> There's currently tighter coupling between the Python and Java halves
> >>>>>>>>> of PySpark than just requiring SPARK_HOME to be set; if we did this, I bet
> >>>>>>>>> we'd run into tons of issues when users try to run a newer version of the
> >>>>>>>>> Python half of PySpark against an older set of Java components or
> >>>>>>>>> vice-versa.
> >>>>>>>>>
> >>>>>>>>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot
> >>>>>>>>> <o....@lateral-thoughts.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi everyone,
> >>>>>>>>>> Considering the Python API is just a front end needing SPARK_HOME
> >>>>>>>>>> defined anyway, I think it would be interesting to deploy the Python part of
> >>>>>>>>>> Spark on PyPI in order to handle the dependencies in a Python project
> >>>>>>>>>> needing PySpark via pip.
> >>>>>>>>>>
> >>>>>>>>>> For now I just symlink python/pyspark into my Python install's
> >>>>>>>>>> site-packages/ directory so that PyCharm and other lint tools work properly.
> >>>>>>>>>> I can do the setup.py work or anything.
> >>>>>>>>>>
> >>>>>>>>>> What do you think ?
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>>
> >>>>>>>>>> Olivier.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> >>>
> >
> >
> >
> > --
> > Brian E. Granger
> > Cal Poly State University, San Luis Obispo
> > @ellisonbg on Twitter and GitHub
> > bgranger@calpoly.edu and ellisonbg@gmail.com
>

Re: PySpark on PyPi

Posted by Davies Liu <da...@databricks.com>.
I think so; any contributions on this are welcome.

On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger <el...@gmail.com> wrote:
> Sorry, trying to follow the context here. Does it look like there is
> support for the idea of creating a setup.py file and pypi package for
> pyspark?
>
> Cheers,
>
> Brian
>
> On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu <da...@databricks.com> wrote:
>> We could do that after 1.5 is released; it will have the same release
>> cycle as Spark in the future.
>>
>> On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot
>> <o....@lateral-thoughts.com> wrote:
>>> +1 (once again :) )
>>>
>>> 2015-07-28 14:51 GMT+02:00 Justin Uang <ju...@gmail.com>:
>>>>
>>>> // ping
>>>>
>>>> do we have any signoff from the pyspark devs to submit a PR to publish to
>>>> PyPI?
>>>>
>>>> On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman <fr...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Hey all, great discussion, just wanted to +1 that I see a lot of value in
>>>>> steps that make it easier to use PySpark as an ordinary python library.
>>>>>
>>>>> You might want to check out this project (https://github.com/minrk/findspark),
>>>>> started by Jupyter project devs, which offers one way to facilitate this
>>>>> stuff. I’ve also cced them here to join the conversation.
>>>>>
>>>>> Also, @Jey, I can also confirm that at least in some scenarios (I’ve done
>>>>> it in an EC2 cluster in standalone mode) it’s possible to run PySpark jobs
>>>>> just using `from pyspark import SparkContext; sc = SparkContext(master="X")`
>>>>> so long as the environmental variables (PYTHONPATH and PYSPARK_PYTHON) are
>>>>> set correctly on *both* workers and driver. That said, there’s definitely
>>>>> additional configuration / functionality that would require going through
>>>>> the proper submit scripts.
>>>>>
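
For illustration, a minimal Python sketch of this "library style" usage. The install path, py4j zip name, and master URL are assumptions for a hypothetical setup, and the same PYTHONPATH / PYSPARK_PYTHON settings would also have to exist in the workers' environment, as noted above.

    # Sketch only: assumes Spark is unpacked at /opt/spark on driver and workers.
    import os
    import sys

    spark_home = os.environ.get("SPARK_HOME", "/opt/spark")
    paths = [
        os.path.join(spark_home, "python"),
        os.path.join(spark_home, "python", "lib", "py4j-0.8.2.1-src.zip"),
    ]
    # Executors need the same interpreter and PYTHONPATH configured on the workers.
    os.environ.setdefault("PYSPARK_PYTHON", sys.executable)
    os.environ["PYTHONPATH"] = os.pathsep.join(paths)
    sys.path[:0] = paths  # make `import pyspark` work in this process

    from pyspark import SparkContext
    sc = SparkContext(master="spark://master-host:7077",  # hypothetical standalone master
                      appName="pyspark-as-a-library")
    print(sc.parallelize(range(100)).count())
    sc.stop()
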
>>>>> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal <pu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> I agree with everything Justin just said. An additional advantage of
>>>>> publishing PySpark's Python code in a standards-compliant way is the fact
>>>>> that we'll be able to declare transitive dependencies (Pandas, Py4J) in a
>>>>> way that pip can use. Contrast this with the current situation, where
>>>>> df.toPandas() exists in the Spark API but doesn't actually work until you
>>>>> install Pandas.
>>>>>
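
As a sketch of what such standards-compliant packaging could look like: a hypothetical setup.py for the python/ directory that declares its dependencies for pip. The version, pins, and the extras split below are illustrative assumptions, not the project's actual packaging.

    # Hypothetical packaging sketch -- not the actual Spark build configuration.
    from setuptools import setup, find_packages

    setup(
        name="pyspark",
        version="1.5.0",                      # would track the Spark release exactly
        packages=find_packages(exclude=["*.tests", "*.tests.*"]),
        install_requires=[
            "py4j==0.8.2.1",                  # the JVM bridge PySpark already bundles
        ],
        extras_require={
            # optional extra so `pip install pyspark[pandas]` makes df.toPandas() usable
            "pandas": ["pandas"],
        },
    )
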
>>>>> Punya
>>>>> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang <ju...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> // + Davies for his comments
>>>>>> // + Punya for SA
>>>>>>
>>>>>> For development and CI, like Olivier mentioned, I think it would be
>>>>>> hugely beneficial to publish pyspark (only code in the python/ dir) on PyPI.
>>>>>> If anyone wants to develop against PySpark APIs, they need to download the
>>>>>> distribution and do a lot of PYTHONPATH munging for all the tools (pylint,
>>>>>> pytest, IDE code completion). Right now that involves adding python/ and
>>>>>> python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to add more
>>>>>> dependencies, we would have to manually mirror all the PYTHONPATH munging in
>>>>>> the ./pyspark script. With a proper pyspark setup.py which declares its
>>>>>> dependencies, and a published distribution, depending on pyspark will just
>>>>>> be adding pyspark to my setup.py dependencies.
>>>>>>
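
The consumer side of that workflow might then look like the following, assuming a published pyspark distribution existed (it did not at the time of this thread); the project name and version pin are illustrative.

    # Hypothetical downstream project depending on a published pyspark.
    from setuptools import setup

    setup(
        name="my-spark-job",
        version="0.1.0",
        py_modules=["my_job"],
        install_requires=[
            "pyspark==1.5.0",   # exact pin, matching the cluster's Spark version
        ],
    )
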
>>>>>> Of course, if we actually want to run parts of pyspark that are backed by
>>>>>> Py4J calls, then we need the full spark distribution with either ./pyspark
>>>>>> or ./spark-submit, but for things like linting and development, the
>>>>>> PYTHONPATH munging is very annoying.
>>>>>>
>>>>>> I don't think the version-mismatch issues are a compelling reason to not
>>>>>> go ahead with PyPI publishing. At runtime, we should definitely enforce that
>>>>>> the version has to be exact, which means there is no backcompat nightmare as
>>>>>> suggested by Davies in https://issues.apache.org/jira/browse/SPARK-1267.
>>>>>> This would mean that even if the user got his pip installed pyspark to
>>>>>> somehow get loaded before the spark distribution provided pyspark, then the
>>>>>> user would be alerted immediately.
>>>>>>
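
A rough sketch of the kind of runtime guard being proposed; the pyspark.__version__ attribute is an assumption here, and such a check would live inside pyspark itself rather than in user code.

    # Illustrative version guard -- attribute names are assumptions, not existing APIs.
    import pyspark
    from pyspark import SparkContext

    sc = SparkContext("local[1]", "version-check-demo")
    jvm_version = sc.version                                 # reported by the JVM half
    python_version = getattr(pyspark, "__version__", None)   # hypothetical pip-side version

    if python_version is not None and python_version != jvm_version:
        sc.stop()
        raise RuntimeError(
            "pip-installed pyspark %s does not match Spark %s; "
            "install the matching release" % (python_version, jvm_version))
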
>>>>>> Davies, if you buy this, should I or someone on my team pick up
>>>>>> https://issues.apache.org/jira/browse/SPARK-1267 and
>>>>>> https://github.com/apache/spark/pull/464?
>>>>>>
>>>>>> On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot
>>>>>> <o....@lateral-thoughts.com> wrote:
>>>>>>>
>>>>>>> Ok, I get it. Now what can we do to improve the current situation,
>>>>>>> because right now if I want to set up a CI env for PySpark, I have to:
>>>>>>> 1- download a pre-built version of pyspark and unzip it somewhere on
>>>>>>> every agent
>>>>>>> 2- define the SPARK_HOME env
>>>>>>> 3- symlink this distribution's pyspark dir into the Python install's
>>>>>>> site-packages/ directory
>>>>>>> and if I rely on additional packages (like Databricks' Spark-CSV
>>>>>>> project), I have to (unless I'm mistaken)
>>>>>>> 4- compile/assembly spark-csv, deploy the jar in a specific directory
>>>>>>> on every agent
>>>>>>> 5- add this jar-filled directory to the Spark distribution's additional
>>>>>>> classpath using the conf/spark-default file
>>>>>>>
>>>>>>> Then finally we can launch our unit/integration-tests.
>>>>>>> Some issues are related to spark-packages, some to the lack of
>>>>>>> Python-based dependency management, and some to the way SparkContexts are
>>>>>>> launched when using pyspark.
>>>>>>> I think steps 1 and 2 are fair enough.
>>>>>>> Steps 4 and 5 may already have solutions; I didn't check, and considering
>>>>>>> spark-shell downloads such dependencies automatically, I think if nothing's
>>>>>>> done yet it will be (I guess?).
>>>>>>>
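
On steps 4 and 5, one sketch of how the jar juggling could be avoided by reusing the dependency resolution spark-shell performs; PYSPARK_SUBMIT_ARGS is a real hook, but the spark-csv coordinates, version, and fixture file name below are illustrative.

    # Sketch: pull spark-csv via --packages instead of hand-deploying jars on each agent.
    import os

    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages com.databricks:spark-csv_2.10:1.2.0 pyspark-shell"
    )  # must be set before the JVM gateway starts, i.e. before creating a context

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local[2]", "csv-ci-demo")
    sqlContext = SQLContext(sc)
    df = (sqlContext.read.format("com.databricks.spark.csv")
          .option("header", "true")
          .load("test-data.csv"))           # hypothetical fixture file
    df.show()
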
>>>>>>> For step 3, maybe just adding a setup.py to the distribution would be
>>>>>>> enough; I'm not exactly advocating distributing a full 300 MB Spark
>>>>>>> distribution on PyPI, maybe there's a better compromise?
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Olivier.
>>>>>>>
>>>>>>> On Fri, Jun 5, 2015 at 22:12, Jey Kottalam <je...@cs.berkeley.edu> wrote:
>>>>>>>>
>>>>>>>> Couldn't we have a pip installable "pyspark" package that just serves
>>>>>>>> as a shim to an existing Spark installation? Or it could even download the
>>>>>>>> latest Spark binary if SPARK_HOME isn't set during installation. Right now,
>>>>>>>> Spark doesn't play very well with the usual Python ecosystem. For example,
>>>>>>>> why do I need to use a strange incantation when booting up IPython if I want
>>>>>>>> to use PySpark in a notebook with MASTER="local[4]"? It would be much nicer
>>>>>>>> to just type `from pyspark import SparkContext; sc =
>>>>>>>> SparkContext("local[4]")` in my notebook.
>>>>>>>>
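
A minimal sketch of the shim idea, under the assumption that the package only locates an existing installation rather than downloading one; the module layout and function name are hypothetical.

    # Hypothetical shim module, e.g. shipped as the only content of a pip package.
    import glob
    import os
    import sys

    def init_spark(spark_home=None):
        """Expose the pyspark bundled with an existing Spark install to this interpreter."""
        spark_home = spark_home or os.environ.get("SPARK_HOME")
        if not spark_home:
            raise RuntimeError("SPARK_HOME is not set; point it at a Spark installation")
        python_dir = os.path.join(spark_home, "python")
        py4j_zips = glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip"))
        for path in [python_dir] + py4j_zips:
            if path not in sys.path:
                sys.path.insert(0, path)

    # Usage in a notebook would then be:
    #   init_spark()
    #   from pyspark import SparkContext
    #   sc = SparkContext("local[4]", "notebook")
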
>>>>>>>> I did a test and it seems like PySpark's basic unit-tests do pass when
>>>>>>>> SPARK_HOME is set and Py4J is on the PYTHONPATH:
>>>>>>>>
>>>>>>>>
>>>>>>>> PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
>>>>>>>> python $SPARK_HOME/python/pyspark/rdd.py
>>>>>>>>
>>>>>>>> -Jey
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen <ro...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> This has been proposed before:
>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-1267
>>>>>>>>>
>>>>>>>>> There's currently tighter coupling between the Python and Java halves
>>>>>>>>> of PySpark than just requiring SPARK_HOME to be set; if we did this, I bet
>>>>>>>>> we'd run into tons of issues when users try to run a newer version of the
>>>>>>>>> Python half of PySpark against an older set of Java components or
>>>>>>>>> vice-versa.
>>>>>>>>>
>>>>>>>>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot
>>>>>>>>> <o....@lateral-thoughts.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>> Considering the Python API is just a front end needing SPARK_HOME
>>>>>>>>>> defined anyway, I think it would be interesting to deploy the Python part of
>>>>>>>>>> Spark on PyPI in order to handle the dependencies in a Python project
>>>>>>>>>> needing PySpark via pip.
>>>>>>>>>>
>>>>>>>>>> For now I just symlink python/pyspark into my Python install's
>>>>>>>>>> site-packages/ directory so that PyCharm and other lint tools work properly.
>>>>>>>>>> I can do the setup.py work or anything.
>>>>>>>>>>
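
For reference, a small sketch of that stop-gap done with a .pth file instead of a symlink (same effect for editors and linters); the Spark path and file name are assumptions, and writing to site-packages may require appropriate permissions.

    # Sketch: make `import pyspark` resolvable for PyCharm / pylint without packaging.
    import os
    import sysconfig

    spark_python = "/opt/spark/python"                     # assumed Spark distribution path
    site_packages = sysconfig.get_paths()["purelib"]

    with open(os.path.join(site_packages, "pyspark-dev.pth"), "w") as f:
        f.write(spark_python + "\n")
        f.write(os.path.join(spark_python, "lib", "py4j-0.8.2.1-src.zip") + "\n")
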
>>>>>>>>>> What do you think ?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Olivier.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>
>>>
>
>
>
> --
> Brian E. Granger
> Cal Poly State University, San Luis Obispo
> @ellisonbg on Twitter and GitHub
> bgranger@calpoly.edu and ellisonbg@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org