Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2016/10/05 21:26:21 UTC

[jira] [Commented] (SPARK-15369) Investigate selectively using Jython for parts of PySpark

    [ https://issues.apache.org/jira/browse/SPARK-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15549990#comment-15549990 ] 

Reynold Xin commented on SPARK-15369:
-------------------------------------

So while I'm sure you can improve performance for some UDFs, the limitations of Jython are pretty severe, and I worry we would be building on a shaky foundation with this approach. A better approach may be to speed up serialization for Python, e.g. by introducing block-oriented UDFs that return NumPy arrays or pandas DataFrames.
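
To illustrate the block-oriented idea, here is a minimal sketch that runs outside Spark. It is not PySpark's API: the names row_udf, block_udf, run_row_at_a_time, and run_block are hypothetical, pickle stands in for the JVM-to-Python transfer, and a standard-library array of doubles stands in for a NumPy array. It only shows that batching values into blocks cuts the number of serialized messages and the bytes moved while producing the same results.

```python
import pickle
from array import array


def row_udf(x):
    # Hypothetical per-row UDF: invoked once per value.
    return x * 2.0 + 1.0


def block_udf(block):
    # Hypothetical block-oriented UDF: takes a whole batch, returns a whole batch.
    return array("d", (x * 2.0 + 1.0 for x in block))


def run_row_at_a_time(values):
    """Simulate one serialize/deserialize round trip per value."""
    results, total_bytes, messages = [], 0, 0
    for v in values:
        request = pickle.dumps(v)                 # stand-in for JVM -> Python
        reply = pickle.dumps(row_udf(pickle.loads(request)))  # Python -> JVM
        total_bytes += len(request) + len(reply)
        messages += 2
        results.append(pickle.loads(reply))
    return results, total_bytes, messages


def run_block(values, block_size=1024):
    """Simulate one serialize/deserialize round trip per block of values."""
    results, total_bytes, messages = [], 0, 0
    for start in range(0, len(values), block_size):
        block = array("d", values[start:start + block_size])
        request = pickle.dumps(block)
        reply = pickle.dumps(block_udf(pickle.loads(request)))
        total_bytes += len(request) + len(reply)
        messages += 2
        results.extend(pickle.loads(reply))
    return results, total_bytes, messages


if __name__ == "__main__":
    data = [float(i) for i in range(10000)]
    row_out, row_bytes, row_msgs = run_row_at_a_time(data)
    blk_out, blk_bytes, blk_msgs = run_block(data)
    print("same results:", row_out == blk_out)
    print("messages: per-row", row_msgs, "vs per-block", blk_msgs)
    print("bytes:    per-row", row_bytes, "vs per-block", blk_bytes)
```

The per-message overhead is paid once per block instead of once per row, which is the saving a block-oriented UDF interface would aim for; the actual gains inside Spark would depend on its own serializers and are not measured here.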

> Investigate selectively using Jython for parts of PySpark
> ---------------------------------------------------------
>
>                 Key: SPARK-15369
>                 URL: https://issues.apache.org/jira/browse/SPARK-15369
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>            Reporter: holdenk
>            Priority: Minor
>
> Transferring data from the JVM to the Python executor can be a substantial bottleneck. While Jython is not suitable for all UDFs or map functions, it may be suitable for some simple ones. We should investigate the option of using Jython to accelerate these small functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
