Posted to user@spark.apache.org by Holden Karau <ho...@pigscanfly.ca> on 2016/10/05 21:05:59 UTC

PySpark UDF Performance Exploration w/Jython (Early/rough 2~3X improvement*) [SPARK-15369]

Hi Python Spark Developers & Users,

As Datasets/DataFrames are becoming the core building block of Spark, and
as someone who cares about Python Spark performance, I've been looking more
at PySpark UDF performance.
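For readers less familiar with the current mechanism, a regular PySpark UDF is just an ordinary Python function wrapped via `pyspark.sql.functions.udf`. A minimal sketch (the function body is plain Python and runs on its own; the commented registration lines show the standard pyspark.sql usage and aren't executed here):

```python
# The per-row logic of a UDF is an ordinary Python function.
def tokenize(text):
    """Split a line into lowercase words."""
    return text.lower().split()

# With a live SparkSession this would be attached to a DataFrame
# roughly like so (sketch of the standard pyspark.sql API):
#
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import ArrayType, StringType
#   tokenize_udf = udf(tokenize, ArrayType(StringType()))
#   df.select(tokenize_udf(df["line"]))
#
# Today each row round-trips through a separate CPython worker
# process - that serialization overhead is what evaluating the UDF
# in Jython, inside the JVM, aims to avoid.

print(tokenize("Hello World"))
```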

I've got an early WIP/request for comments pull request open
<https://github.com/apache/spark/pull/13571> with a corresponding design
document
<https://docs.google.com/document/d/1L-F12nVWSLEOW72sqOn6Mt1C0bcPFP9ck7gEMH2_IXE/edit>
and
JIRA (SPARK-15369) <https://issues.apache.org/jira/browse/SPARK-15369> that
allows for selective UDF evaluation in Jython <http://www.jython.org/>. Now
that Spark 2.0.1 is out I'd really love people's input or feedback on this
proposal so I can circle back with a more complete PR :) I'd love to hear
from PySpark users (as well as the PySpark developers) whether this looks
interesting, and what you think about some of the open questions :)

For users: If you have simple Python UDFs (or even better, UDFs together
with datasets) that you can share for benchmarking, it would be really
useful to add them to the benchmarking I've been doing in the design doc.
It would also be useful to know if some, many, or none of your UDFs can be
evaluated by Jython. If you have UDFs you aren't comfortable sharing
on-list, feel free to reach out to me directly.
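As a rough illustration of what "can be evaluated by Jython" means (my framing, not part of the proposal itself): Jython runs pure-Python code on the JVM but cannot load CPython C extensions, so a stdlib-only UDF like the one below should be a candidate, while one that imports something like numpy would not be without help from JyNI:

```python
# Pure-Python UDF: uses only the language and stdlib, so Jython
# should be able to evaluate it directly on the JVM.
def clean_word(w):
    return w.strip().lower().rstrip(".,!?")

# By contrast, a UDF like this depends on a CPython C extension and
# would NOT run under Jython (shown as a comment only):
#
#   import numpy as np
#   def vector_norm(xs):
#       return float(np.linalg.norm(xs))

print(clean_word("  Spark! "))
```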

Some general open questions:

1) The draft PR does some magic** to allow plain functions to be passed in,
at least some of the time - is that something people are interested in, or
would it be better to leave the magic out and just require a string
representing the lambda to be passed in?

2) Would it be useful to provide easy steps to use JyNI <http://jyni.org/>?
(It's LGPL licensed <https://www.gnu.org/licenses/lgpl-3.0.en.html>, so I
don't think we can include it out of the box
<https://www.apache.org/legal/resolved.html#category-x> - but we could try
to make it easy for users to link with if it's important.)

3) While we have a 2x speedup for tokenization/wordcount (getting close to
native Scala perf) - what is performance like for other workloads? (Please
share your desired UDFs/workloads for my evil benchmarking plans.)

4) What does the eventual Dataset API look like for Python? (This could
partially influence #1.)

5) How important is it to avoid adding the Jython dependencies to the
weight for non-Python users (and, if desired, which workaround to choose -
maybe something like spark-hive)?

6) Do you often chain PySpark UDF operations and is that something we
should try and optimize for in Jython as well?

7) How many of your Python UDFs can or cannot be evaluated in Jython, for
one reason or another?

8) Do your UDFs depend on Spark accumulators or broadcast values?

9) What am I forgetting in my coffee fueled happiness?
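To make question 1 concrete, here is a small sketch of the two calling styles. The name `from_lambda_string` and the `compile`/`eval` mechanism are mine, purely illustrative of how a string form could be turned into a callable on the Jython side - this is not the API in the draft PR, whose "magic" path instead accepts a real function object and captures it with dill:

```python
# The "no magic" option: the user hands over the lambda as a string,
# which can be shipped to the JVM and compiled there by Jython.
def from_lambda_string(src):
    # Compile the source in 'eval' mode; evaluating the resulting
    # code object yields the lambda itself.
    fn = eval(compile(src, "<jython-udf>", "eval"))
    if not callable(fn):
        raise TypeError("expected source for a lambda, got %r" % src)
    return fn

upper = from_lambda_string("lambda s: s.upper()")
print(upper("wordcount"))

# The "magic" option instead takes an actual Python function and
# relies on dill to serialize it - more convenient, but it only
# works "at least some of the time", hence the question.
```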

Cheers,

Holden :)

*Benchmarking has been very limited; the 2~3X improvement is likely
different for "real" workloads (unless you really like doing wordcount :p :))
** Note: the magic depends on dill <https://pypi.python.org/pypi/dill>.

P.S.

I leave you with this optimistic 80s style intro screen
<https://twitter.com/holdenkarau/status/783762213408497670> :)
Also if anyone happens to be going to PyData DC <http://pydata.org/dc2016/>
this weekend I'd love to chat with you in person about this (and of course
circle it back to the mailing list).
-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau

Re: PySpark UDF Performance Exploration w/Jython (Early/rough 2~3X improvement*) [SPARK-15369]

Posted by ayan guha <gu...@gmail.com>.
+Adding back the mailing list.....sorry for missing it....

Python 2.7 works for a large population, so I think it is definitely a good
start.....



On Thu, Oct 6, 2016 at 10:30 AM, Holden Karau <ho...@pigscanfly.ca> wrote:

> Awesome, thanks :)
>
> So Jython is now up to 2.7, which while not 3.X seems like it's probably
> usable for a reasonable number of people. If this is something you would
> like to see, commenting on the public mailing list thread/JIRA would of
> course be useful :)
>
> Cheers,
>
> Holden :)
>
> On Wed, Oct 5, 2016 at 4:08 PM, ayan guha <gu...@gmail.com> wrote:
>
>> Hi Holden
>>
>> This is great news for pyspark users :)
>>
>> One concern: earlier I faced issues with Jython because the Python
>> version it used was really old. I hit this with Pig 12 on an EMR
>> cluster, where Jython used Python 2.5 (or earlier). Has there been any
>> significant change there? Does Jython support Python 2.6, 2.7, or 3.X
>> now?
>>
>> Best of luck for your work, which I am a fan of....
>>
>> Ayan
>>
>> On Thu, Oct 6, 2016 at 8:05 AM, Holden Karau <ho...@pigscanfly.ca>
>> wrote:
>>
>>> [snip - original message quoted in full above]
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>



-- 
Best Regards,
Ayan Guha