Posted to issues@spark.apache.org by "Xinrong Meng (Jira)" <ji...@apache.org> on 2023/01/04 07:20:00 UTC

[jira] [Updated] (SPARK-40307) Optimize (De)Serialization of Python UDFs by Arrow

     [ https://issues.apache.org/jira/browse/SPARK-40307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-40307:
---------------------------------
    Summary: Optimize (De)Serialization of Python UDFs by Arrow  (was: Optimize (De)Serialization of Python UDF)

> Optimize (De)Serialization of Python UDFs by Arrow
> --------------------------------------------------
>
>                 Key: SPARK-40307
>                 URL: https://issues.apache.org/jira/browse/SPARK-40307
>             Project: Spark
>          Issue Type: Umbrella
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: Xinrong Meng
>            Priority: Major
>
> Python user-defined functions (UDFs) enable users to run arbitrary code against PySpark columns. They use Pickle for (de)serialization and execute row by row.
> One major performance bottleneck of Python UDFs is (de)serialization, that is, the data interchange between the worker JVM and the spawned Python subprocess that actually executes the UDF. We should seek an alternative to handle the (de)serialization: Arrow, which is already used for (de)serialization of Pandas UDFs.
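
To illustrate the contrast the description draws, here is a minimal, standard-library-only sketch of row-at-a-time pickling versus shipping one columnar buffer per batch. It is not Spark's or Arrow's actual code path: the function names are hypothetical, and the `array` module is used only as a stand-in for an Arrow-style contiguous column, on the assumption that the column holds 64-bit integers.

```python
import pickle
from array import array


def serialize_row_by_row(values):
    """Pickle each value on its own: one payload per row."""
    return [pickle.dumps(v) for v in values]


def deserialize_row_by_row(payloads):
    """Unpickle one payload at a time, mirroring row-by-row execution."""
    return [pickle.loads(p) for p in payloads]


def serialize_columnar(values):
    """Pack the whole batch into one contiguous buffer of signed 64-bit ints.

    Stand-in for an Arrow column; real Arrow also carries a schema,
    validity bitmaps, and more.
    """
    return array("q", values).tobytes()


def deserialize_columnar(buf):
    """Rebuild the whole column from the single buffer."""
    col = array("q")
    col.frombytes(buf)
    return col.tolist()


values = list(range(1000))

row_payloads = serialize_row_by_row(values)   # 1000 separate payloads
column_payload = serialize_columnar(values)   # 1 payload for the batch

assert deserialize_row_by_row(row_payloads) == values
assert deserialize_columnar(column_payload) == values
print(len(row_payloads), "payloads vs 1 columnar buffer")
```

The sketch only shows the difference in shape: the row-based path makes one serialization call and one payload per value, while the columnar path moves the batch as a single typed buffer. It makes no claim about the speedup Spark would actually see.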



--
This message was sent by Atlassian Jira
(v8.20.10#820010)