Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2017/06/23 07:50:00 UTC

[jira] [Updated] (SPARK-21190) SPIP: Vectorized UDFs for Python

     [ https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-21190:
--------------------------------
    Description: 
*Background and Motivation*
 
Python is one of the most popular programming languages among Spark users. Spark currently exposes a row-at-a-time interface for defining and executing user-defined functions (UDFs). This introduces high serialization and deserialization overhead, and also makes it difficult to leverage Python libraries that are written in native code. This proposal advocates introducing new APIs to support vectorized UDFs in Python, in which a block of data is transferred to Python in a columnar format for execution.
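
To make the contrast concrete, here is a minimal pure-Python sketch of the two execution styles. It does not use Spark and is not the proposed API (which is still to be defined in this SPIP). All function names are hypothetical, and pickle merely stands in for whatever serialization format is actually used between the JVM and the Python worker.

```python
import pickle


def plus_one_row(x):
    """Row-at-a-time UDF (hypothetical): receives one value per call."""
    return x + 1


def plus_one_block(column):
    """Block-oriented UDF (hypothetical): receives a whole column per call."""
    return [x + 1 for x in column]


def run_row_at_a_time(udf, values):
    """Simulate one serialize/deserialize round trip per value."""
    out = []
    round_trips = 0
    for v in values:
        payload = pickle.dumps(v)               # serialize a single value
        result = udf(pickle.loads(payload))     # deserialize, then call the UDF
        out.append(pickle.loads(pickle.dumps(result)))  # ship the result back
        round_trips += 1
    return out, round_trips


def run_vectorized(udf, values, block_size):
    """Simulate one serialize/deserialize round trip per block of values."""
    out = []
    round_trips = 0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        payload = pickle.dumps(block)           # serialize the whole block once
        result = udf(pickle.loads(payload))     # one UDF call per block
        out.extend(pickle.loads(pickle.dumps(result)))
        round_trips += 1
    return out, round_trips


values = list(range(10))
row_out, row_trips = run_row_at_a_time(plus_one_row, values)
vec_out, vec_trips = run_vectorized(plus_one_block, values, block_size=4)
```

Both runs produce the same results, but the block-oriented version pays the per-call and per-transfer cost once per block rather than once per row. A real block-oriented UDF could also hand the column to a native library instead of looping in Python.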
 
 
*Target Personas*

Data scientists, data engineers, library developers.
 

*Goals*

... todo ...
 

*Non-Goals*

- Define block-oriented UDFs in languages other than Python.
- Define aggregate UDFs.
 
 
*Proposed API Changes*
 
... todo ...
 
 
 
*Optional Design Sketch*
The implementation should be pretty straightforward and is not a huge concern at this point. I’m more concerned about getting proper feedback for API design.
 
 
*Optional Rejected Designs*
See above.
 
 
 
 


> SPIP: Vectorized UDFs for Python
> --------------------------------
>
>                 Key: SPARK-21190
>                 URL: https://issues.apache.org/jira/browse/SPARK-21190
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark, SQL
>    Affects Versions: 2.2.0
>            Reporter: Reynold Xin
>              Labels: SPIP


