Posted to user@spark.apache.org by Holden Karau <ho...@pigscanfly.ca> on 2016/10/12 19:49:48 UTC

Python Spark Improvements (forked from Spark Improvement Proposals)

Hi Spark Devs & Users,


Forking off from Cody’s original thread
<http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Improvement-Proposals-td19268.html#none>
of Spark Improvements, and Matei's follow up on asking what issues the
Python community was facing with Spark, I think it would be useful for us
to discuss some of the motivations behind some of the Python community
looking at different technologies to replace Apache Spark with. My
viewpoints are those of a developer who works on Apache Spark
day-to-day <http://bit.ly/hkspmg>, but who also gives a fair number of talks at
Python conferences
<https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=holden+karau+pydata&start=0>
and I feel many (but not all) of the same challenges as the Python
community does trying to use Spark. I’ve included both the user@ and dev@
lists on this one since I think the user community can probably provide
more reasons why they have difficulty with PySpark. I should also point out
- the solution for all of these things may not live inside of the Spark
project itself, but it still impacts our usability as a whole.


   -

   Lack of pip installability

This is one of the points that Matei mentioned, and it is something several
people have tried to provide for Spark in one way or another. It seems
getting reviewer time for this issue is rather challenging, and I’ve been
hesitant to ask the contributors to keep updating their PRs (as much as I
want to see some progress) because I just don't know if we have the time or
interest in this. I’m happy to pick up the latest from Juliet and try and
carry it over the finish line if we can find some committer time to work on
this since it now sounds like there is consensus we should do this.
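
For concreteness, a rough sketch of what pip installability would enable (the
package name "pyspark" is an assumption here, since nothing has been published
to PyPI yet):

    # pip install pyspark   <- hypothetical, once the packaging work lands
    from pyspark.sql import SparkSession

    # With the package on sys.path, a local session can be created from any
    # plain Python script or notebook, with no SPARK_HOME juggling required.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("pip-installed-example")
             .getOrCreate())
    print(spark.range(5).count())
    spark.stop()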

   -

   Difficulty using PySpark from outside of spark-submit / pyspark shell

The FindSpark <https://pypi.python.org/pypi/findspark> package needing to
exist is one of the clearest examples of this challenge. There is also a PR
to make it easier for other shells to extend the Spark shell, and we ran
into some similar challenges while working on Sparkling Pandas. This could
be solved by making Spark pip installable, so I won't say too much about
this point.
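
To illustrate, here is roughly what the FindSpark workaround looks like today
when launching from a plain Python interpreter instead of spark-submit
(assumes a local Spark install that findspark can locate):

    import findspark
    findspark.init()  # locates SPARK_HOME and puts pyspark on sys.path

    from pyspark import SparkContext

    # Once findspark has patched sys.path, the normal PySpark API works.
    sc = SparkContext(appName="outside-spark-submit")
    print(sc.parallelize(range(10)).sum())
    sc.stop()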

   -

   Minimal integration with IPython/IJupyter

This one is awkward since one of the areas that some of the larger
commercial players work in effectively “competes” (in a very loose term)
with any features introduced around here. I’m not really super sure what
the best path forward is here, but I think collaborating with the IJupyter
people to enable more features found in the commercial offerings in open
source could be beneficial to everyone in the community, and maybe even
reduce the maintenance cost for some of the commercial entities. I
understand this is a tricky issue but having good progress indicators or
something similar could make a huge difference. (Note that Apache Toree
<https://toree.incubator.apache.org/> [Incubating] exists for Scala users
but hopefully the PySpark IJupyter integration could be achieved without a
new kernel).
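
As one example of the kind of integration I mean, PySpark already exposes a
status tracker that a notebook extension could poll to drive a progress bar;
a rough sketch (assumes an existing SparkContext named sc):

    def log_spark_progress(sc):
        # One-shot poll of active jobs; call repeatedly (e.g. from a
        # background thread) to drive a notebook progress indicator.
        tracker = sc.statusTracker()
        for job_id in tracker.getActiveJobsIds():
            job = tracker.getJobInfo(job_id)
            if job is None:
                continue
            for stage_id in job.stageIds:
                stage = tracker.getStageInfo(stage_id)
                if stage is not None:
                    print("job %d stage %d: %d/%d tasks complete"
                          % (job_id, stage_id,
                             stage.numCompletedTasks, stage.numTasks))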

   -

   Lack of virtualenv and/or Python package distribution support

This one is also tricky since many commercial providers have their own
“solution” to this, but there isn’t a good story around supporting custom
virtual envs or user-required Python packages. While spark-packages _can_
be Python, this requires the Python package developer to go through a fair
amount of extra work to make their package available, and realistically that
won't happen for most Python packages people want to use. And to be fair, the
addPyFile/addFile mechanism does support Python eggs, which works for some packages.
There are some outstanding PRs around this issue (and I understand these
are perhaps large issues which might require large changes to the current
suggested implementations - I’ve had difficulty keeping the current set of
open PRs around this straight in my own head) but there seems to be no
committer bandwidth or interest in working with the contributors who have
suggested these things. Is this an intentional decision or is this
something we as a community are willing to work on/tackle?
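
For the cases that do work today, the pattern is roughly this (the package
name and path are made up for illustration; assumes an existing SparkContext
sc):

    # Ship a pre-built egg (or .zip / .py file) to every executor.
    sc.addPyFile("/path/to/mypackage-0.1-py2.7.egg")

    def use_package(x):
        # The import happens on the worker, resolved from the shipped egg.
        import mypackage
        return mypackage.transform(x)

    result = sc.parallelize(range(100)).map(use_package).collect()

This covers pure-Python eggs, but it does nothing for packages with native
extensions (numpy, pandas, etc.), which is where the virtualenv/conda support
in the outstanding PRs would really help.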

   -

   Speed/performance

This is often a complaint I hear from more “data engineering” profile users
who are working in Python. These problems come mostly in places involving
the interaction of Python and the JVM (so UDFs, transformations with
arbitrary lambdas, collect() and toPandas()). This is an area I’m working
on (see https://github.com/apache/spark/pull/13571 ) and hopefully we can
start investigating Apache Arrow <https://arrow.apache.org/> to speed up
the bridge (or something similar) once it’s a bit more ready (currently
Arrow just released 0.1 which is exciting). We also probably need to start
measuring these things more closely since otherwise random regressions will
continue to be introduced (like the challenge with unbalanced partitions
and block serialization together - see SPARK-17817
<https://issues.apache.org/jira/browse/SPARK-17817> which fixed this).
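
A tiny example of the kind of boundary cost I mean: the two selects below
compute the same thing, but the first round-trips every row through a Python
worker while the second stays entirely on the JVM (assumes an existing
SparkSession named spark):

    from pyspark.sql import functions as F
    from pyspark.sql.types import LongType

    plus_one = F.udf(lambda x: x + 1, LongType())  # crosses the Python/JVM bridge

    df = spark.range(1000000)
    slow = df.select(plus_one(df["id"]).alias("x"))   # rows serialized to Python and back
    fast = df.select((df["id"] + 1).alias("x"))       # pure JVM expression

    slow.agg(F.sum("x")).show()
    fast.agg(F.sum("x")).show()

Measuring pairs like this regularly is the sort of benchmarking that would
catch regressions such as SPARK-17817 earlier.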

   -

   Configuration difficulties (especially related to OOMs)

This is a general challenge many people face working in Spark, but PySpark
users are also asked to somehow figure out what the correct amount of
memory is to give to the Python process versus the Scala/JVM processes.
This was maybe an acceptable solution at the start, but when combined with
the difficult-to-understand error messages it can become quite the time
sink. A quick workaround would be picking a different default overhead for
applications using Python, but more generally hopefully some shared off-JVM
heap solution could also help reduce this challenge in the future.
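
Right now the usual fix looks roughly like this guess-and-check dance, where
the overhead value is per-job trial and error (the 1024 below is just an
illustrative number, not a recommendation, and it assumes a YARN deployment):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("python-heavy-job")
            # Extra off-heap room (in MB) for the Python worker processes on
            # YARN; the default is sized for JVM-only workloads.
            .set("spark.yarn.executor.memoryOverhead", "1024"))
    sc = SparkContext(conf=conf)

A Python-aware default (or a separate, documented knob for Python worker
memory) would remove a lot of this guesswork.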

   -

   API difficulties

One complaint some people have is that the Spark API doesn't “feel” very
Pythonic, but I think we've done some excellent work in the DataFrame/Dataset
API here. At the same time we've made some really frustrating choices with the
DataFrame API (e.g. pre-emptively removing map from DataFrames even though we
have no concrete plans to bring the Dataset API to PySpark).
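
The workaround people end up with is dropping back down to the RDD of Rows,
which is exactly the kind of thing that doesn't feel great; a quick sketch
(assumes a DataFrame df with an id column):

    # DataFrame.map is gone from the Python API in 2.0, so this is the usual fallback.
    doubled = df.rdd.map(lambda row: (row.id, row.id * 2))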

A lot of users wish that our DataFrame API was more like the Pandas API
(and Wes has pointed out on some JIRAs where we have differences) as well
as covered more of the functionality of Pandas. This is a hard problem, and
the solution might not belong inside of PySpark itself (Juliet and I did
some proof-of-concept work back in the day on Sparkling Pandas
<https://github.com/sparklingpandas/sparklingpandas>) - but since one of my
personal goals has been trying to become a committer I’ve been more focused
on contributing to Spark itself rather than libraries and very few people
seem to be interested in working on this project [although I still have
potential users ask if they can use it]. (Of course if there is sufficient
interest to reboot Sparkling Pandas or something similar that would be an
interesting area of work - but it’s also a huge area of work - if you look
at Dask <http://dask.pydata.org/>, a good portion of the work is dedicated
just to supporting pandas-like operations).

   -

   Incomprehensible error messages

I often have people ask me how to debug PySpark and they often have a
certain haunted look in their eyes while they ask me this (slightly
joking). More seriously, we really need to provide more guidance around how
to understand PySpark error messages and look at figuring out if there are
places where we can improve the messaging so users aren’t hunting through
stack overflow trying to figure out where the Java exception they are
getting is related to their Python code. In one talk I gave recently
someone mentioned PySpark was the motivation behind finding the hide error
messages plugin/settings for IJupyter.
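
Even a small amount of guidance would help; today the best advice I can give
people is something like the following, which at least surfaces the Java-side
root cause instead of a wall of Py4J framing (rough sketch, assumes an
existing DataFrame df):

    from py4j.protocol import Py4JJavaError

    try:
        df.collect()
    except Py4JJavaError as e:
        # The interesting part is usually the Java exception (and its cause
        # chain), not the Python traceback that wraps it.
        print(str(e.java_exception))
        raise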

   -

   Lack of useful ML model & pipeline export/import

This is something we've made great progress on: many of the PySpark models
are now able to use the underlying export mechanisms from Java. However I
often hear challenges with using these models in the rest of the Python
space once they have been exported from Spark. I’ve got a PR to add basic
PMML export in Scala to ML (which we can then bring to Python), but I think
the Python community is open to other formats if the Spark community
doesn’t want to go the PMML route.
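
For reference, the part that now works well is the Spark-to-Spark round trip,
along the lines of (paths and the input DataFrame are illustrative only):

    from pyspark.ml import Pipeline, PipelineModel
    from pyspark.ml.classification import LogisticRegression

    lr = LogisticRegression(maxIter=10)
    model = Pipeline(stages=[lr]).fit(training_df)  # assumes an existing training DataFrame
    model.save("/tmp/example-pipeline-model")
    reloaded = PipelineModel.load("/tmp/example-pipeline-model")

The gap is that the saved format is Spark's own Parquet/JSON layout, so it
isn't something scikit-learn or other Python tooling can consume directly -
hence the interest in PMML or another interchange format.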

Now I don’t think we will see the same challenges we’ve seen develop in the
R community, but I suspect purely Python approaches to distributed systems
will continue to eat the “low end” of Spark (e.g. medium sized data
problems requiring parallelism). This isn’t necessarily a bad thing, but if
there is anything I’ve learnt it's that the "low end" solution often
quickly eats the "high end" within a few years - and I’d rather see Spark
continue to thrive outside of the pure JVM space.

These are just the biggest issues that I hear come up commonly and could
remember on my flight back - it’s quite possible I’ve missed important
things. I know contributing on a mailing list can be scary or intimidating
for new users (and even experienced developers may wish to stay out of
discussions they view as heated) - but I strongly encourage everyone to
participate (respectfully) in this thread and we can all work together to
help Spark continue to be the place where people from different languages
and backgrounds continue to come together to collaborate.


I want to be clear as well, while I feel these are all very important
issues (and being someone who has worked on PySpark & Spark for years
<http://bit.ly/hkspmg> without being a committer I may sometimes come off
as frustrated when I talk about these) I think PySpark as a whole is a
really excellent application and we do some really awesome stuff with it.
There are also things that I will be blind to as a result of having worked
on Spark for so long (for example yesterday I caught myself using the _
syntax in a Scala example without explaining it because it seems “normal”
to me but often trips up newcomers.) If we can address even some of these
issues I believe it will be a huge win for Spark adoption outside of the
traditional JVM space (and as the various community surveys continue to
indicate PySpark usage is already quite high).

Normally I’d bring in a box of timbits
<https://en.wikipedia.org/wiki/Timbits>/doughnuts or something if we were
having an in-person meeting for this but all I can do for the mailing list
is attach a cute picture and offer future doughnuts/coffee if people want
to chat IRL. So, in closing, I’ve included a picture of two of my stuffed
animals working on Spark on my flight back from a Python Data conference &
Spark meetup just to remind everyone that this is just a software project
and we can be friendly, nice people if we try, and things will be much more
awesome if we do :)
[image: Inline image 1]

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

Posted by mariusvniekerk <ma...@gmail.com>.
So for the jupyter integration pieces.

I've made a simple library (https://github.com/MaxPoint/spylon) which allows
a simpler way of creating a SparkContext (with all the parameters available to
spark-submit), as well as some usability enhancements: progress bars, tab
completion for spark configuration properties, and easier loading of scala
objects via py4j.



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Python-Spark-Improvements-forked-from-Spark-Improvement-Proposals-tp19422p19449.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: Python Spark Improvements (forked from Spark Improvement Proposals)

Posted by Holden Karau <ho...@pigscanfly.ca>.
Awesome, good points everyone. The ranking of the issues is super useful
and I'd also completely forgotten about the lack of built-in UDAF support,
which is rather important. There is a PR to make it easier to call/register
JVM UDFs from Python which will hopefully help a bit there too. I'm getting
on a flight to London for OSCON but I want to continue to encourage users to
chime in with their experiences (to that end I'm trying to re-include user@
since it doesn't seem to have been posted there despite my initial attempt
to do so.)
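
For anyone blocked on the UDAF gap right now, one pure-Python workaround is
collecting each group into an array and aggregating it with an ordinary UDF -
not pretty, and it buffers the whole group in memory, but it avoids the trip
to Scala (rough sketch; assumes a DataFrame df with columns k and v):

    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    def mean_of(values):
        # Plain Python aggregate over one group's collected values.
        return float(sum(values)) / len(values) if values else None

    mean_udf = F.udf(mean_of, DoubleType())

    result = (df.groupBy("k")
                .agg(F.collect_list("v").alias("vs"))
                .select("k", mean_udf("vs").alias("v_mean")))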

On Thursday, October 13, 2016, assaf.mendelson <as...@rsa.com>
wrote:

> Hi,
>
> We are actually using pyspark heavily.
>
> I agree with all of your points,  for me I see the following as the main
> hurdles:
>
> 1.       Pyspark does not have support for UDAF. We have had multiple
> needs for UDAF and needed to go to java/scala to support these. Having
> python UDAF would have made life much easier (especially at earlier stages
> when we prototype).
>
> 2.       Performance. I cannot stress this enough. Currently we have
> engineers who take python UDFs and convert them to scala UDFs for
> performance. We are currently even looking at writing UDFs and UDAFs in a
> more native way (e.g. using expressions) to improve performance but working
> with pyspark can be really problematic.
>
>
>
> BTW, other than using jython or arrow, I believe there are a couple of
> other ways to get improve performance:
>
> 1.       Python provides tool to generate AST for python code (
> https://docs.python.org/2/library/ast.html). This means we can use the
> AST to construct scala code very similar to how expressions are build for
> native spark functions in scala. Of course doing full conversion is very
> hard but at least handling simple cases should be simple.
>
> 2.       The above would of course be limited if we use python packages
> but over time it is possible to add some “translation” tools (i.e. take
> python packages and find the appropriate scala equivalent. We can even
> provide this to the user to supply their own conversions thereby looking as
> a regular python code but being converted to scala code behind the scenes).
>
> 3.       In scala, it is possible to use codegen to actually generate
> code from a string. There is no reason why we can’t write the expression in
> python and provide a scala string. This would mean learning some scala but
> would mean we do not have to create a separate code tree.
>
>
>
> BTW, the fact that all of the tools to access java are marked as private
> has me a little worried. Nearly all of our UDFs (and all of our UDAFs) are
> written in scala for performance. The wrapping to provide them in python
> uses way too many private elements for my taste.
>
>
>
>
>
> *From:* msukmanowsky [via Apache Spark Developers List] [mailto: hidden email]
> *Sent:* Thursday, October 13, 2016 3:51 AM
> *To:* Mendelson, Assaf
> *Subject:* Re: Python Spark Improvements (forked from Spark Improvement
> Proposals)
>
>
>
> As very heavy Spark users at Parse.ly, I just wanted to give a +1 to all
> of the issues raised by Holden and Ricardo. I'm also giving a talk at PyCon
> Canada on PySpark https://2016.pycon.ca/en/schedule/096-mike-sukmanowsky/.
>
>
> Being a Python shop, we were extremely pleased to learn about PySpark a
> few years ago as our main ETL pipeline used Apache Pig at the time. I was
> one of the only folks who understood Pig and Java so collaborating on this
> as a team was difficult.
>
> Spark provided a means for the entire team to collaborate, but we've hit
> our fair share of issues all of which are enumerated in this thread.
>
> Besides giving a +1 here, I think if I were to force rank these items for
> us, it'd be:
>
> 1. Configuration difficulties: we've lost literally weeks to
> troubleshooting memory issues for larger jobs. It took a long time to even
> understand *why* certain jobs were failing since Spark would just report
> executors being lost. Finally we tracked things down to understanding that
> spark.yarn.executor.memoryOverhead controls the portion of memory
> reserved for Python processes, but none of this is documented anywhere as
> far as I can tell. We discovered this via trial and error. Both
> documentation and better defaults for this setting when running a PySpark
> application are probably sufficient. We've also had a number of troubles
> with saving Parquet output as part of an ETL flow, but perhaps we'll save
> that for a blog post of its own.
>
> 2. Dependency management: I've tried to help move the conversation on
> https://issues.apache.org/jira/browse/SPARK-13587 but it seems we're a
> bit stalled. Installing the required dependencies for a PySpark application
> is a really messy ordeal right now.
>
> 3. Development workflow: I'd combine both "incomprehensible error
> messages" and "
> difficulty using PySpark from outside of spark-submit / pyspark shell"
> here. When teaching PySpark to new users, I'm reminded of how much inside
> knowledge is needed to overcome esoteric errors. As one example is hitting
> "PicklingError: Could not pickle object as excessively deep recursion
> required." errors. New users often do something innocent like try to pickle
> a global logging object and hit this and begin the Google -> stackoverflow
> search to try to comprehend what's going on. You can lose days to errors
> like these and they completely kill the productivity flow and send you
> hunting for alternatives.
>
> 4. Speed/performance: we are trying to use DataFrame/DataSets where we can
> and do as much in Java as possible but when we do move to Python, we're
> well aware that we're about to take a hit on performance. We're very keen
> to see what Apache Arrow does for things here.
>
> 5. API difficulties: I agree that when coming from Python, you'd expect
> that you can do the same kinds of operations on DataFrames in Spark that
> you can with Pandas, but I personally haven't been too bothered by this.
> Maybe I'm more used to this situation from using other frameworks that have
> similar concepts but incompatible implementations.
>
> We're big fans of PySpark and are happy to provide feedback and contribute
> wherever we can.
> ------------------------------
>
> *If you reply to this email, your message will be added to the discussion
> below:*
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-Spark-
> Improvements-forked-from-Spark-Improvement-Proposals-tp19422p19426.html
>
> To start a new topic under Apache Spark Developers List, email [hidden
> email] <http:///user/SendEmail.jtp?type=node&node=19431&i=1>
> To unsubscribe from Apache Spark Developers List, click here.
> NAML
> <http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>
> ------------------------------
> View this message in context: RE: Python Spark Improvements (forked from
> Spark Improvement Proposals)
> <http://apache-spark-developers-list.1001551.n3.nabble.com/Python-Spark-Improvements-forked-from-Spark-Improvement-Proposals-tp19422p19431.html>
> Sent from the Apache Spark Developers List mailing list archive
> <http://apache-spark-developers-list.1001551.n3.nabble.com/> at
> Nabble.com.
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


RE: Python Spark Improvements (forked from Spark Improvement Proposals)

Posted by "assaf.mendelson" <as...@rsa.com>.
Hi,
We are actually using pyspark heavily.
I agree with all of your points,  for me I see the following as the main hurdles:

1.       Pyspark does not have support for UDAF. We have had multiple needs for UDAF and needed to go to java/scala to support these. Having python UDAF would have made life much easier (especially at earlier stages when we prototype).

2.       Performance. I cannot stress this enough. Currently we have engineers who take python UDFs and convert them to scala UDFs for performance. We are currently even looking at writing UDFs and UDAFs in a more native way (e.g. using expressions) to improve performance but working with pyspark can be really problematic.

BTW, other than using jython or arrow, I believe there are a couple of other ways to improve performance:

1.       Python provides tools to generate an AST for python code (https://docs.python.org/2/library/ast.html). This means we can use the AST to construct scala code very similar to how expressions are built for native spark functions in scala. Of course doing a full conversion is very hard but at least handling simple cases should be simple (see the sketch after this list).

2.       The above would of course be limited if we use python packages, but over time it is possible to add some "translation" tools (i.e. take python packages and find the appropriate scala equivalent). We can even allow the user to supply their own conversions, so the code looks like regular python but is converted to scala code behind the scenes.

3.       In scala, it is possible to use codegen to actually generate code from a string. There is no reason why we can't write the expression in python and provide a scala string. This would mean learning some scala but would mean we do not have to create a separate code tree.
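
A rough sketch of what point 1 could start from, using only the standard
library (this just prints the parsed tree; the actual translation to Scala /
Catalyst expressions is the hard part left out here, and the expression string
is made up for illustration):

    import ast

    # Parse a simple expression a user might otherwise write as a python UDF.
    tree = ast.parse("price * quantity + 1", mode="eval")

    # A translator would walk nodes like BinOp / Name / Num here and emit the
    # equivalent column expressions; for now we just dump the structure.
    print(ast.dump(tree))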

BTW, the fact that all of the tools to access java are marked as private has me a little worried. Nearly all of our UDFs (and all of our UDAFs) are written in scala for performance. The wrapping to provide them in python uses way too many private elements for my taste.


From: msukmanowsky [via Apache Spark Developers List] [mailto:ml-node+s1001551n19426h95@n3.nabble.com]
Sent: Thursday, October 13, 2016 3:51 AM
To: Mendelson, Assaf
Subject: Re: Python Spark Improvements (forked from Spark Improvement Proposals)

As very heavy Spark users at Parse.ly, I just wanted to give a +1 to all of the issues raised by Holden and Ricardo. I'm also giving a talk at PyCon Canada on PySpark https://2016.pycon.ca/en/schedule/096-mike-sukmanowsky/.

Being a Python shop, we were extremely pleased to learn about PySpark a few years ago as our main ETL pipeline used Apache Pig at the time. I was one of the only folks who understood Pig and Java so collaborating on this as a team was difficult.

Spark provided a means for the entire team to collaborate, but we've hit our fair share of issues all of which are enumerated in this thread.

Besides giving a +1 here, I think if I were to force rank these items for us, it'd be:

1. Configuration difficulties: we've lost literally weeks to troubleshooting memory issues for larger jobs. It took a long time to even understand *why* certain jobs were failing since Spark would just report executors being lost. Finally we tracked things down to understanding that spark.yarn.executor.memoryOverhead controls the portion of memory reserved for Python processes, but none of this is documented anywhere as far as I can tell. We discovered this via trial and error. Both documentation and better defaults for this setting when running a PySpark application are probably sufficient. We've also had a number of troubles with saving Parquet output as part of an ETL flow, but perhaps we'll save that for a blog post of its own.

2. Dependency management: I've tried to help move the conversation on https://issues.apache.org/jira/browse/SPARK-13587 but it seems we're a bit stalled. Installing the required dependencies for a PySpark application is a really messy ordeal right now.

3. Development workflow: I'd combine both "incomprehensible error messages" and "
difficulty using PySpark from outside of spark-submit / pyspark shell" here. When teaching PySpark to new users, I'm reminded of how much inside knowledge is needed to overcome esoteric errors. As one example is hitting "PicklingError: Could not pickle object as excessively deep recursion required." errors. New users often do something innocent like try to pickle a global logging object and hit this and begin the Google -> stackoverflow search to try to comprehend what's going on. You can lose days to errors like these and they completely kill the productivity flow and send you hunting for alternatives.

4. Speed/performance: we are trying to use DataFrame/DataSets where we can and do as much in Java as possible but when we do move to Python, we're well aware that we're about to take a hit on performance. We're very keen to see what Apache Arrow does for things here.

5. API difficulties: I agree that when coming from Python, you'd expect that you can do the same kinds of operations on DataFrames in Spark that you can with Pandas, but I personally haven't been too bothered by this. Maybe I'm more used to this situation from using other frameworks that have similar concepts but incompatible implementations.

We're big fans of PySpark and are happy to provide feedback and contribute wherever we can.
________________________________
If you reply to this email, your message will be added to the discussion below:
http://apache-spark-developers-list.1001551.n3.nabble.com/Python-Spark-Improvements-forked-from-Spark-Improvement-Proposals-tp19422p19426.html




--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Python-Spark-Improvements-forked-from-Spark-Improvement-Proposals-tp19422p19431.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

Posted by msukmanowsky <mi...@gmail.com>.
As very heavy Spark users at Parse.ly, I just wanted to give a +1 to all of
the issues raised by Holden and Ricardo. I'm also giving a talk at PyCon
Canada on PySpark https://2016.pycon.ca/en/schedule/096-mike-sukmanowsky/.

Being a Python shop, we were extremely pleased to learn about PySpark a few
years ago as our main ETL pipeline used Apache Pig at the time. I was one of
the only folks who understood Pig and Java so collaborating on this as a
team was difficult.

Spark provided a means for the entire team to collaborate, but we've hit our
fair share of issues all of which are enumerated in this thread.

Besides giving a +1 here, I think if I were to force rank these items for
us, it'd be:

1. Configuration difficulties: we've lost literally weeks to troubleshooting
memory issues for larger jobs. It took a long time to even understand *why*
certain jobs were failing since Spark would just report executors being
lost. Finally we tracked things down to understanding that
spark.yarn.executor.memoryOverhead controls the portion of memory reserved
for Python processes, but none of this is documented anywhere as far as I
can tell. We discovered this via trial and error. Both documentation and
better defaults for this setting when running a PySpark application are
probably sufficient. We've also had a number of troubles with saving Parquet
output as part of an ETL flow, but perhaps we'll save that for a blog post
of its own.

2. Dependency management: I've tried to help move the conversation on
https://issues.apache.org/jira/browse/SPARK-13587 but it seems we're a bit
stalled. Installing the required dependencies for a PySpark application is a
really messy ordeal right now.

3. Development workflow: I'd combine both "incomprehensible error messages" and
"difficulty using PySpark from outside of spark-submit / pyspark shell" here.
When teaching PySpark to new users, I'm reminded of how much inside
knowledge is needed to overcome esoteric errors. One example is hitting
"PicklingError: Could not pickle object as excessively deep recursion
required." errors. New users often do something innocent like try to pickle
a global logging object and hit this and begin the Google -> stackoverflow
search to try to comprehend what's going on. You can lose days to errors
like these and they completely kill the productivity flow and send you
hunting for alternatives.

4. Speed/performance: we are trying to use DataFrame/DataSets where we can
and do as much in Java as possible but when we do move to Python, we're well
aware that we're about to take a hit on performance. We're very keen to see
what Apache Arrow does for things here.

5. API difficulties: I agree that when coming from Python, you'd expect that
you can do the same kinds of operations on DataFrames in Spark that you can
with Pandas, but I personally haven't been too bothered by this. Maybe I'm
more used to this situation from using other frameworks that have similar
concepts but incompatible implementations.

We're big fans of PySpark and are happy to provide feedback and contribute
wherever we can.



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Python-Spark-Improvements-forked-from-Spark-Improvement-Proposals-tp19422p19426.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: Python Spark Improvements (forked from Spark Improvement Proposals)

Posted by Holden Karau <ho...@pigscanfly.ca>.
On that note there is some discussion on the Jira -
https://issues.apache.org/jira/browse/SPARK-13534 :)

On Mon, Oct 31, 2016 at 8:32 PM, Holden Karau <ho...@pigscanfly.ca> wrote:

> I believe Bryan is also working on this a little - and I'm a little busy
> with the other stuff but would love to stay in the loop on Arrow progress :)
>
>
> On Monday, October 31, 2016, mariusvniekerk <ma...@gmail.com>
> wrote:
>
>> So i've been working on some very very early stage apache arrow
>> integration.
>> My current plan it to emulate some of how the R function execution works.
>> If there are any other people working on similar efforts it would be good
>> idea to combine efforts.
>>
>> I can see how much effort is involved in converting that PR to a spark
>> package so that people can try to use it.  I think this is something that
>> we
>> want some more community iteration on maybe?
>>
>>
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-developers
>> -list.1001551.n3.nabble.com/Python-Spark-Improvements-
>> forked-from-Spark-Improvement-Proposals-tp19422p19670.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

Posted by Holden Karau <ho...@pigscanfly.ca>.
I believe Bryan is also working on this a little - and I'm a little busy
with the other stuff but would love to stay in the loop on Arrow progress :)

On Monday, October 31, 2016, mariusvniekerk <ma...@gmail.com>
wrote:

> So i've been working on some very very early stage apache arrow
> integration.
> My current plan it to emulate some of how the R function execution works.
> If there are any other people working on similar efforts it would be good
> idea to combine efforts.
>
> I can see how much effort is involved in converting that PR to a spark
> package so that people can try to use it.  I think this is something that
> we
> want some more community iteration on maybe?
>
>
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Python-Spark-
> Improvements-forked-from-Spark-Improvement-Proposals-tp19422p19670.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org <javascript:;>
>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

Posted by mariusvniekerk <ma...@gmail.com>.
So I've been working on some very, very early stage Apache Arrow integration.
My current plan is to emulate some of how the R function execution works.
If there are any other people working on similar efforts it would be a good
idea to combine efforts.

I can see how much effort is involved in converting that PR to a spark
package so that people can try to use it.  I think this is something that we
want some more community iteration on maybe?





--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Python-Spark-Improvements-forked-from-Spark-Improvement-Proposals-tp19422p19670.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: Python Spark Improvements (forked from Spark Improvement Proposals)

Posted by Holden Karau <ho...@pigscanfly.ca>.
I've been working on some of the issues and I have a PR to help provide
pip installable PySpark - https://github.com/apache/spark/pull/15659 .

Also, if anyone is interested in PySpark UDF performance, there is a closed
PR; if we could figure out a way to make it a little more lightweight I'd
love to hear people's ideas (
https://github.com/apache/spark/pull/13571 ).

On Sun, Oct 16, 2016 at 8:52 PM, Tobi Bosede <an...@gmail.com> wrote:

> Right Jeff. Holden actually brought this thread up today on the user list
> in response to my email about UDAFs here https://www.mail-archive.
> com/user@spark.apache.org/msg58125.html. I have yet to hear what I can do
> as a work around besides switching to scala/java though...
>
> On Sun, Oct 16, 2016 at 10:07 PM, Jeff Zhang <zj...@gmail.com> wrote:
>
>> Thanks Holden to start this thread. I agree that spark devs should put
>> more efforts on pyspark but the reality is that we are a little slow on
>> this perspective . Since pyspark is to integrate spark into python, so I
>> think the focus is on the usability of pyspark. We should hear more
>> feedback from python community. And lots of data scientists love python, if
>> we want to more adoption of spark, then we should spend more time on
>> pyspark.
>>
>> On Thu, Oct 13, 2016 at 9:59 AM, Nicholas Chammas <
>> nicholas.chammas@gmail.com> wrote:
>>
>> I'd add one item to this list: The lack of Python 3 support in Spark
>> Packages <https://github.com/databricks/sbt-spark-package/issues/26>.
>> This means that great packages like GraphFrames cannot be used with
>> Python 3 <https://github.com/graphframes/graphframes/issues/85>.
>>
>> This is quite disappointing since Spark itself supports Python 3 and
>> since -- at least in my circles -- Python 3 adoption is reaching a tipping
>> point. All new Python projects at my company and at my friends' companies
>> are being written in Python 3.
>>
>> Nick
>>
>>
>> On Wed, Oct 12, 2016 at 3:52 PM Holden Karau <ho...@pigscanfly.ca>
>> wrote:
>>
>> Hi Spark Devs & Users,
>>
>>
>> Forking off from Cody’s original thread
>> <http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Improvement-Proposals-td19268.html#none>
>> of Spark Improvements, and Matei's follow up on asking what issues the
>> Python community was facing with Spark, I think it would be useful for us
>> to discuss some of the motivations behind some of the Python community
>> looking at different technologies to replace Apache Spark with. My
>> viewpoints are based that of a developer who works on Apache Spark
>> day-to-day <http://bit.ly/hkspmg>, but also gives a fair number of talks
>> at Python conferences
>> <https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=holden+karau+pydata&start=0>
>> and I feel many (but not all) of the same challenges as the Python
>> community does trying to use Spark. I’ve included both the user@ and dev@
>> lists on this one since I think the user community can probably provide
>> more reasons why they have difficulty with PySpark. I should also point out
>> - the solution for all of these things may not live inside of the Spark
>> project itself, but it still impacts our usability as a whole.
>>
>>
>>    -
>>
>>    Lack of pip installability
>>
>> This is one of the points that Matei mentioned, and it something several
>> people have tried to provide for Spark in one way or another. It seems
>> getting reviewer time for this issue is rather challenging, and I’ve been
>> hesitant to ask the contributors to keep updating their PRs (as much as I
>> want to see some progress) because I just don't know if we have the time or
>> interest in this. I’m happy to pick up the latest from Juliet and try and
>> carry it over the finish line if we can find some committer time to work on
>> this since it now sounds like there is consensus we should do this.
>>
>>    -
>>
>>    Difficulty using PySpark from outside of spark-submit / pyspark shell
>>
>> The FindSpark <https://pypi.python.org/pypi/findspark> package needing
>> to exist is one of the clearest examples of this challenge. There is also a
>> PR to make it easier for other shells to extend the Spark shell, and we ran
>> into some similar challenges while working on Sparkling Pandas. This could
>> be solved by making Spark pip installable so I won't’ say too much about
>> this point.
>>
>>    -
>>
>>    Minimal integration with IPython/IJupyter
>>
>> This one is awkward since one of the areas that some of the larger
>> commercial players work in effectively “competes” (in a very loose term)
>> with any features introduced around here. I’m not really super sure what
>> the best path forward is here, but I think collaborating with the IJupyter
>> people to enable more features found in the commercial offerings in open
>> source could be beneficial to everyone in the community, and maybe even
>> reduce the maintenance cost for some of the commercial entities. I
>> understand this is a tricky issue but having good progress indicators or
>> something similar could make a huge difference. (Note that Apache Toree
>> <https://toree.incubator.apache.org/> [Incubating] exists for Scala
>> users but hopefully the PySpark IJupyter integration could be achieved
>> without a new kernel).
>>
>>    -
>>
>>    Lack of virtualenv and or Python package distribution support
>>
>> This one is also tricky since many commercial providers have their own
>> “solution” to this, but there isn’t a good story around supporting custom
>> virtual envs or user required Python packages. While spark-packages _can_
>> be Python this requires that the Python package developer go through rather
>> a lot of work to make their package available and realistically won’t
>> happen for most Python packages people want to use. And to be fair, the
>> addFiles mechanism does support Python eggs which works for some packages.
>> There are some outstanding PRs around this issue (and I understand these
>> are perhaps large issues which might require large changes to the current
>> suggested implementations - I’ve had difficulty keeping the current set of
>> open PRs around this straight in my own head) but there seems to be no
>> committer bandwidth or interest on working with the contributors who have
>> suggested these things. Is this an intentional decision or is this
>> something we as a community are willing to work on/tackle?
>>
>>    -
>>
>>    Speed/performance
>>
>> This is often a complaint I hear from more “data engineering” profile
>> users who are working in Python. These problems come mostly in places
>> involving the interaction of Python and the JVM (so UDFs, transformations
>> with arbitrary lambdas, collect() and toPandas()). This is an area I’m
>> working on (see https://github.com/apache/spark/pull/13571 ) and
>> hopefully we can start investigating Apache Arrow
>> <https://arrow.apache.org/> to speed up the bridge (or something
>> similar) once it’s a bit more ready (currently Arrow just released 0.1
>> which is exciting). We also probably need to start measuring these things
>> more closely since otherwise random regressions will continue to be
>> introduced (like the challenge with unbalanced partitions and block
>> serialization together - see SPARK-17817
>> <https://issues.apache.org/jira/browse/SPARK-17817> which fixed this)
>>
>>    -
>>
>>    Configuration difficulties (especially related to OOMs)
>>
>> This is a general challenge many people face working in Spark, but
>> PySpark users are also asked to somehow figure out what the correct amount
>> of memory is to give to the Python process versus the Scala/JVM processes.
>> This was maybe an acceptable solution at the start, but when combined with
>> the difficult to understand error messages it can become quite the time
>> sink. A quick work around would be picking a different default overhead for
>> applications using Python, but more generally hopefully some shared off-JVM
>> heap solution could also help reduce this challenge in the future.
>>
>>    -
>>
>>    API difficulties
>>
>> The Spark API doesn’t “feel” very Pythony is a complaint some people
>> have, but I think we’ve done some excellent work in the DataFrame/Dataset
>> API here. At the same time we’ve made some really frustrating choices with
>> the DataFrame API (e.g. removing map from DataFrames pre-emptively even
>> when we have no concrete plans to bring the Dataset API to PySpark).
>>
>> A lot of users wish that our DataFrame API was more like the Pandas API
>> (and Wes has pointed out on some JIRAs where we have differences) as well
>> as covered more of the functionality of Pandas. This is a hard problem, and
>> it the solution might not belong inside of PySpark itself (Juliet and I did
>> some proof-of-concept work back in the day on Sparkling Pandas
>> <https://github.com/sparklingpandas/sparklingpandas>) - but since one of
>> my personal goals has been trying to become a committer I’ve been more
>> focused on contributing to Spark itself rather than libraries and very few
>> people seem to be interested in working on this project [although I still
>> have potential users ask if they can use it]. (Of course if there is
>> sufficient interest to reboot Sparkling Pandas or something similar that
>> would be an interesting area of work - but it’s also a huge area of work -
>> if you look at Dask <http://dask.pydata.org/>, a good portion of the
>> work is dedicated just to supporting pandas like operations).
>>
>>    -
>>
>>    Incomprehensible error messages
>>
>> I often have people ask me how to debug PySpark and they often have a
>> certain haunted look in their eyes while they ask me this (slightly
>> joking). More seriously, we really need to provide more guidance around how
>> to understand PySpark error messages and look at figuring out if there are
>> places where we can improve the messaging so users aren’t hunting through
>> stack overflow trying to figure out where the Java exception they are
>> getting is related to their Python code. In one talk I gave recently
>> someone mentioned PySpark was the motivation behind finding the hide error
>> messages plugin/settings for IJupyter.
>>
>>    -
>>
>>    Lack of useful ML model & pipeline export/import
>>
>> This is something we’ve made great progress on, many of the PySpark
>> models are now able to use the underlying export mechanisms from Java.
>> However I often hear challenges with using these models in the rest of the
>> Python space once they have been exported from Spark. I’ve got a PR to add
>> basic PMML export in Scala to ML (which we can then bring to Python), but I
>> think the Python community is open to other formats if the Spark community
>> doesn’t want to go the PMML route.
>>
>> Now I don’t think we will see the same challenges we’ve seen develop in
>> the R community, but I suspect purely Python approaches to distributed
>> systems will continue to eat the “low end” of Spark (e.g. medium sized data
>> problems requiring parallelism). This isn’t necessarily a bad thing, but if
>> there is anything I’ve learnt it that's the "low end" solution often
>> quickly eats the "high end" within a few years - and I’d rather see Spark
>> continue to thrive outside of the pure JVM space.
>>
>> These are just the biggest issues that I hear come up commonly and
>> remembered on my flight back - it’s quite possible I’ve missed important
>> things. I know contributing on a mailing list can be scary or intimidating
>> for new users (and even experienced developers may wish to stay out of
>> discussions they view as heated) - but I strongly encourage everyone to
>> participate (respectfully) in this thread and we can all work together to
>> help Spark continue to be the place where people from different languages
>> and backgrounds continue to come together to collaborate.
>>
>>
>> I want to be clear as well, while I feel these are all very important
>> issues (and being someone who has worked on PySpark & Spark for years
>> <http://bit.ly/hkspmg> without being a committer I may sometimes come
>> off as frustrated when I talk about these) I think PySpark as a whole is a
>> really excellent application and we do some really awesome stuff with it.
>> There are also things that I will be blind to as a result of having worked
>> on Spark for so long (for example yesterday I caught myself using the _
>> syntax in a Scala example without explaining it because it seems “normal”
>> to me but often trips up new comers.) If we can address even some of these
>> issues I believe it will be a huge win for Spark adoption outside of the
>> traditional JVM space (and as the various community surveys continue to
>> indicate PySpark usage is already quite high).
>>
>> Normally I’d bring in a box of timbits
>> <https://en.wikipedia.org/wiki/Timbits>/doughnuts or something if we
>> were having an in-person meeting for this but all I can do for the mailing
>> list is attach a cute picture and offer future doughnuts/coffee if people
>> want to chat IRL. So, in closing, I’ve included a picture of two of my
>> stuffed animals working on Spark on my flight back from a Python Data
>> conference & Spark meetup just to remind everyone that this is just a
>> software project and we can be friendly nice people if we try and things
>> will be much more awesome if we do :)
>> [image: image.png]
>>
>> --
>> Cell : 425-233-8271 <(425)%20233-8271>
>> Twitter: https://twitter.com/holdenkarau
>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

Posted by Holden Karau <ho...@pigscanfly.ca>.
I've been working on some of the issues and I have an PR to help provide
pip installable PySpark - https://github.com/apache/spark/pull/15659 .

Also if anyone is interested in PySpark UDF performance there is a closed
PR that if we could maybe figure out a way to make it a little more light
weight I'd love to hear people's ideas (
https://github.com/apache/spark/pull/13571 ).

On Sun, Oct 16, 2016 at 8:52 PM, Tobi Bosede <an...@gmail.com> wrote:

> Right Jeff. Holden actually brought this thread up today on the user list
> in response to my email about UDAFs here https://www.mail-archive.
> com/user@spark.apache.org/msg58125.html. I have yet to hear what I can do
> as a work around besides switching to scala/java though...
>
> On Sun, Oct 16, 2016 at 10:07 PM, Jeff Zhang <zj...@gmail.com> wrote:
>
>> Thanks Holden for starting this thread. I agree that Spark devs should put
>> more effort into PySpark, but the reality is that we have been a little
>> slow on this front. Since PySpark exists to integrate Spark into Python, I
>> think the focus should be on the usability of PySpark, and we should hear
>> more feedback from the Python community. Lots of data scientists love
>> Python, so if we want more adoption of Spark we should spend more time on
>> PySpark.
>>
>> On Thu, Oct 13, 2016 at 9:59 AM, Nicholas Chammas <
>> nicholas.chammas@gmail.com> wrote:
>>
>> I'd add one item to this list: The lack of Python 3 support in Spark
>> Packages <https://github.com/databricks/sbt-spark-package/issues/26>.
>> This means that great packages like GraphFrames cannot be used with
>> Python 3 <https://github.com/graphframes/graphframes/issues/85>.
>>
>> This is quite disappointing since Spark itself supports Python 3 and
>> since -- at least in my circles -- Python 3 adoption is reaching a tipping
>> point. All new Python projects at my company and at my friends' companies
>> are being written in Python 3.
>>
>> Nick
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

Posted by Tobi Bosede <an...@gmail.com>.
Right Jeff. Holden actually brought this thread up today on the user list
in response to my email about UDAFs here
https://www.mail-archive.com/user@spark.apache.org/msg58125.html. I have
yet to hear what I can do as a workaround besides switching to Scala/Java
though...
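
One rough Python-only stopgap for the missing UDAF support (a sketch only -
it pulls each whole group onto a single executor, so it is only reasonable
for modestly sized groups, and the data below is made up for the example) is
to collapse each group into an array with a built-in aggregate and then
post-process it with a plain UDF:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame([("a", 1.0), ("a", 3.0), ("b", 2.0)],
                               ["key", "value"])

    # Collapse each group into an array with a built-in aggregate...
    grouped = df.groupBy("key").agg(F.collect_list("value").alias("values"))

    # ...then run arbitrary Python aggregation logic over the collected list.
    mean_udf = F.udf(lambda xs: float(sum(xs)) / len(xs) if xs else None,
                     DoubleType())
    grouped.withColumn("value_mean", mean_udf(F.col("values"))).show()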

On Sun, Oct 16, 2016 at 10:07 PM, Jeff Zhang <zj...@gmail.com> wrote:

> Thanks Holden for starting this thread. I agree that Spark devs should put
> more effort into PySpark, but the reality is that we have been a little
> slow on this front. Since PySpark exists to integrate Spark into Python, I
> think the focus should be on the usability of PySpark, and we should hear
> more feedback from the Python community. Lots of data scientists love
> Python, so if we want more adoption of Spark we should spend more time on
> PySpark.
>
> On Thu, Oct 13, 2016 at 9:59 AM, Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
>
> I'd add one item to this list: The lack of Python 3 support in Spark
> Packages <https://github.com/databricks/sbt-spark-package/issues/26>.
> This means that great packages like GraphFrames cannot be used with
> Python 3 <https://github.com/graphframes/graphframes/issues/85>.
>
> This is quite disappointing since Spark itself supports Python 3 and since
> -- at least in my circles -- Python 3 adoption is reaching a tipping point.
> All new Python projects at my company and at my friends' companies are
> being written in Python 3.
>
> Nick
>
>
> --
> Best Regards
>
> Jeff Zhang
>

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

Posted by Jeff Zhang <zj...@gmail.com>.
Thanks Holden for starting this thread. I agree that Spark devs should put
more effort into PySpark, but the reality is that we have been a little slow
on this front. Since PySpark exists to integrate Spark into Python, I think
the focus should be on the usability of PySpark, and we should hear more
feedback from the Python community. Lots of data scientists love Python, so
if we want more adoption of Spark we should spend more time on PySpark.

On Thu, Oct 13, 2016 at 9:59 AM, Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:

I'd add one item to this list: The lack of Python 3 support in Spark
Packages <https://github.com/databricks/sbt-spark-package/issues/26>. This
means that great packages like GraphFrames cannot be used with Python 3
<https://github.com/graphframes/graphframes/issues/85>.

This is quite disappointing since Spark itself supports Python 3 and since
-- at least in my circles -- Python 3 adoption is reaching a tipping point.
All new Python projects at my company and at my friends' companies are
being written in Python 3.

Nick


-- 
Best Regards

Jeff Zhang

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

Posted by Nicholas Chammas <ni...@gmail.com>.
I'd add one item to this list: The lack of Python 3 support in Spark
Packages <https://github.com/databricks/sbt-spark-package/issues/26>. This
means that great packages like GraphFrames cannot be used with Python 3
<https://github.com/graphframes/graphframes/issues/85>.

This is quite disappointing since Spark itself supports Python 3 and since
-- at least in my circles -- Python 3 adoption is reaching a tipping point.
All new Python projects at my company and at my friends' companies are
being written in Python 3.

Nick
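
For context, Spark Packages like GraphFrames are normally pulled in at
launch time and then imported from Python, and per the issues linked above
it is exactly this path that only works under Python 2 today. A minimal
sketch of that usage (the package coordinates are illustrative and the tiny
graph is made up for the example):

    # Launched with the Spark Packages mechanism, e.g.:
    #   pyspark --packages graphframes:graphframes:<version>
    from pyspark.sql import SparkSession
    from graphframes import GraphFrame  # importable once the package is on the path

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob")],
                                     ["id", "name"])
    edges = spark.createDataFrame([("a", "b", "follows")],
                                  ["src", "dst", "relationship"])
    g = GraphFrame(vertices, edges)
    print(g.inDegrees.count())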


On Wed, Oct 12, 2016 at 3:52 PM Holden Karau <ho...@pigscanfly.ca> wrote:

> Hi Spark Devs & Users,
>
>
> Forking off from Cody’s original thread
> <http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Improvement-Proposals-td19268.html#none>
> of Spark Improvements, and Matei's follow up on asking what issues the
> Python community was facing with Spark, I think it would be useful for us
> to discuss some of the motivations behind some of the Python community
> looking at different technologies to replace Apache Spark with. My
> viewpoints are based that of a developer who works on Apache Spark
> day-to-day <http://bit.ly/hkspmg>, but also gives a fair number of talks
> at Python conferences
> <https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=holden+karau+pydata&start=0>
> and I feel many (but not all) of the same challenges as the Python
> community does trying to use Spark. I’ve included both the user@ and dev@
> lists on this one since I think the user community can probably provide
> more reasons why they have difficulty with PySpark. I should also point out
> - the solution for all of these things may not live inside of the Spark
> project itself, but it still impacts our usability as a whole.
>
>
>    -
>
>    Lack of pip installability
>
> This is one of the points that Matei mentioned, and it something several
> people have tried to provide for Spark in one way or another. It seems
> getting reviewer time for this issue is rather challenging, and I’ve been
> hesitant to ask the contributors to keep updating their PRs (as much as I
> want to see some progress) because I just don't know if we have the time or
> interest in this. I’m happy to pick up the latest from Juliet and try and
> carry it over the finish line if we can find some committer time to work on
> this since it now sounds like there is consensus we should do this.
>
>    -
>
>    Difficulty using PySpark from outside of spark-submit / pyspark shell
>
> The FindSpark <https://pypi.python.org/pypi/findspark> package needing to
> exist is one of the clearest examples of this challenge. There is also a PR
> to make it easier for other shells to extend the Spark shell, and we ran
> into some similar challenges while working on Sparkling Pandas. This could
> be solved by making Spark pip installable so I won't’ say too much about
> this point.
>
>    -
>
>    Minimal integration with IPython/IJupyter
>
> This one is awkward since one of the areas that some of the larger
> commercial players work in effectively “competes” (in a very loose term)
> with any features introduced around here. I’m not really super sure what
> the best path forward is here, but I think collaborating with the IJupyter
> people to enable more features found in the commercial offerings in open
> source could be beneficial to everyone in the community, and maybe even
> reduce the maintenance cost for some of the commercial entities. I
> understand this is a tricky issue but having good progress indicators or
> something similar could make a huge difference. (Note that Apache Toree
> <https://toree.incubator.apache.org/> [Incubating] exists for Scala users
> but hopefully the PySpark IJupyter integration could be achieved without a
> new kernel).
>
>    -
>
>    Lack of virtualenv and or Python package distribution support
>
> This one is also tricky since many commercial providers have their own
> “solution” to this, but there isn’t a good story around supporting custom
> virtual envs or user required Python packages. While spark-packages _can_
> be Python this requires that the Python package developer go through rather
> a lot of work to make their package available and realistically won’t
> happen for most Python packages people want to use. And to be fair, the
> addFiles mechanism does support Python eggs which works for some packages.
> There are some outstanding PRs around this issue (and I understand these
> are perhaps large issues which might require large changes to the current
> suggested implementations - I’ve had difficulty keeping the current set of
> open PRs around this straight in my own head) but there seems to be no
> committer bandwidth or interest on working with the contributors who have
> suggested these things. Is this an intentional decision or is this
> something we as a community are willing to work on/tackle?
>
>    -
>
>    Speed/performance
>
> This is often a complaint I hear from more “data engineering” profile
> users who are working in Python. These problems come mostly in places
> involving the interaction of Python and the JVM (so UDFs, transformations
> with arbitrary lambdas, collect() and toPandas()). This is an area I’m
> working on (see https://github.com/apache/spark/pull/13571 ) and
> hopefully we can start investigating Apache Arrow
> <https://arrow.apache.org/> to speed up the bridge (or something similar)
> once it’s a bit more ready (currently Arrow just released 0.1 which is
> exciting). We also probably need to start measuring these things more
> closely since otherwise random regressions will continue to be introduced
> (like the challenge with unbalanced partitions and block serialization
> together - see SPARK-17817
> <https://issues.apache.org/jira/browse/SPARK-17817> which fixed this)
>
>    -
>
>    Configuration difficulties (especially related to OOMs)
>
> This is a general challenge many people face working in Spark, but PySpark
> users are also asked to somehow figure out what the correct amount of
> memory is to give to the Python process versus the Scala/JVM processes.
> This was maybe an acceptable solution at the start, but when combined with
> the difficult to understand error messages it can become quite the time
> sink. A quick work around would be picking a different default overhead for
> applications using Python, but more generally hopefully some shared off-JVM
> heap solution could also help reduce this challenge in the future.
>
>    -
>
>    API difficulties
>
> The Spark API doesn’t “feel” very Pythony is a complaint some people have,
> but I think we’ve done some excellent work in the DataFrame/Dataset API
> here. At the same time we’ve made some really frustrating choices with the
> DataFrame API (e.g. removing map from DataFrames pre-emptively even when we
> have no concrete plans to bring the Dataset API to PySpark).
>
> A lot of users wish that our DataFrame API was more like the Pandas API
> (and Wes has pointed out on some JIRAs where we have differences) and
> covered more of the functionality of Pandas. This is a hard problem, and
> the solution might not belong inside of PySpark itself (Juliet and I did
> some proof-of-concept work back in the day on Sparkling Pandas
> <https://github.com/sparklingpandas/sparklingpandas>) - but since one of
> my personal goals has been trying to become a committer I’ve been more
> focused on contributing to Spark itself rather than libraries, and very few
> people seem to be interested in working on this project [although I still
> have potential users ask if they can use it]. (Of course, if there is
> sufficient interest to reboot Sparkling Pandas or something similar, that
> would be an interesting area of work - but it’s also a huge area of work -
> if you look at Dask <http://dask.pydata.org/>, a good portion of the work
> is dedicated just to supporting pandas-like operations.)
>
>    -
>
>    Incomprehensible error messages
>
> I often have people ask me how to debug PySpark, and they often have a
> certain haunted look in their eyes while they ask me this (only slightly
> joking). More seriously, we really need to provide more guidance around how
> to understand PySpark error messages, and look at figuring out if there are
> places where we can improve the messaging so users aren’t hunting through
> Stack Overflow trying to figure out how the Java exception they are
> getting relates to their Python code. In one talk I gave recently,
> someone mentioned PySpark was the motivation behind finding the
> hide-error-messages plugin/settings for IJupyter.
>
>    -
>
>    Lack of useful ML model & pipeline export/import
>
> This is something we’ve made great progress on: many of the PySpark models
> are now able to use the underlying export mechanisms from Java. However, I
> often hear about challenges with using these models in the rest of the Python
> space once they have been exported from Spark. I’ve got a PR to add basic
> PMML export in Scala to ML (which we can then bring to Python), but I think
> the Python community is open to other formats if the Spark community
> doesn’t want to go the PMML route.
>
> Now I don’t think we will see the same challenges we’ve seen develop in
> the R community, but I suspect purely Python approaches to distributed
> systems will continue to eat the “low end” of Spark (e.g. medium sized data
> problems requiring parallelism). This isn’t necessarily a bad thing, but if
> there is anything I’ve learnt, it’s that the "low end" solution often
> quickly eats the "high end" within a few years - and I’d rather see Spark
> continue to thrive outside of the pure JVM space.
>
> These are just the biggest issues that I hear come up commonly and could
> remember on my flight back - it’s quite possible I’ve missed important
> things. I know contributing on a mailing list can be scary or intimidating
> for new users (and even experienced developers may wish to stay out of
> discussions they view as heated) - but I strongly encourage everyone to
> participate (respectfully) in this thread and we can all work together to
> help Spark continue to be the place where people from different languages
> and backgrounds continue to come together to collaborate.
>
>
> I want to be clear as well: while I feel these are all very important
> issues (and being someone who has worked on PySpark & Spark for years
> <http://bit.ly/hkspmg> without being a committer I may sometimes come off
> as frustrated when I talk about these) I think PySpark as a whole is a
> really excellent application and we do some really awesome stuff with it.
> There are also things that I will be blind to as a result of having worked
> on Spark for so long (for example, yesterday I caught myself using the _
> syntax in a Scala example without explaining it, because it seems “normal”
> to me but often trips up newcomers). If we can address even some of these
> issues I believe it will be a huge win for Spark adoption outside of the
> traditional JVM space (and as the various community surveys continue to
> indicate PySpark usage is already quite high).
>
> Normally I’d bring in a box of timbits
> <https://en.wikipedia.org/wiki/Timbits>/doughnuts or something if we were
> having an in-person meeting for this but all I can do for the mailing list
> is attach a cute picture and offer future doughnuts/coffee if people want
> to chat IRL. So, in closing, I’ve included a picture of two of my stuffed
> animals working on Spark on my flight back from a Python Data conference &
> Spark meetup just to remind everyone that this is just a software project
> and we can be friendly nice people if we try and things will be much more
> awesome if we do :)
> [image: image.png]
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

Posted by Ricardo Almeida <ri...@actnowib.com>.
I would add to the list the lag between Scala and Python for
newly released features. Some features/functions get implemented later for
PySpark; others never become available at all. Think GraphX (maybe not the
best example), usually mentioned as one of the main libraries, which didn't
make it to the Python API (and never will - fortunately GraphFrames came to
the rescue in this particular case).

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

Posted by Nicholas Chammas <ni...@gmail.com>.
I'd add one item to this list: The lack of Python 3 support in Spark
Packages <https://github.com/databricks/sbt-spark-package/issues/26>. This
means that great packages like GraphFrames cannot be used with Python 3
<https://github.com/graphframes/graphframes/issues/85>.

This is quite disappointing since Spark itself supports Python 3 and since
-- at least in my circles -- Python 3 adoption is reaching a tipping point.
All new Python projects at my company and at my friends' companies are
being written in Python 3.

Nick

