Posted to reviews@spark.apache.org by "itholic (via GitHub)" <gi...@apache.org> on 2023/07/17 03:33:31 UTC

[GitHub] [spark] itholic commented on a diff in pull request #41974: [SPARK-44401][PYTHON][DOCS] Arrow Python UDF Use Guide

itholic commented on code in PR #41974:
URL: https://github.com/apache/spark/pull/41974#discussion_r1264809851


##########
python/docs/source/user_guide/sql/arrow_pandas.rst:
##########
@@ -333,6 +333,32 @@ The following example shows how to use ``DataFrame.groupby().cogroup().applyInPa
 
 For detailed usage, please see :meth:`PandasCogroupedOps.applyInPandas`
 
+Arrow Python UDFs
+-----------------
+
+Arrow Python UDFs are user defined functions that are executed row-by-row, utilizing Arrow for efficient batch data
+transfer and serialization. To define an Arrow Python UDF, you can use the :meth:`udf` decorator or wrap the function
+with the :meth:`udf` method, ensuring the ``useArrow`` parameter is set to True. Additionally, you can enable Arrow
+optimization for Python UDFs throughout the entire SparkSession by setting the Spark configuration ``spark.sql
+.execution.pythonUDF.arrow.enabled`` to true. It's important to note that the Spark configuration takes effect only
+when ``useArrow`` is either not set or set to None.
+
+The type hints for Arrow Python UDFs should be specified in the same way as for default, pickled Python UDFs.
+
+Here's an example that demonstrates the usage of both a default, pickled Python UDF and an Arrow Python UDF:
+
+.. literalinclude:: ../../../../../examples/src/main/python/sql/arrow.py
+    :language: python
+    :lines: 279-297
+    :dedent: 4
+
+Compared to the default, pickled Python UDF, Arrow Python UDF provides a more coherent type coercion mechanism. UDF

Review Comment:
   qq: Is the term "pickled Python UDF" commonly used in PySpark? Just to confirm.



##########
examples/src/main/python/sql/arrow.py:
##########
@@ -275,6 +275,28 @@ def merge_ordered(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
     # +--------+---+---+----+
 
 
+def arrow_python_udf_example(spark: SparkSession) -> None:
+    from pyspark.sql.functions import udf
+
+    @udf(returnType='int')  # A default, pickled Python UDF
+    def slen(s):  # type: ignore[no-untyped-def]
+        return len(s)
+
+    @udf(returnType='int', useArrow=True)  # An Arrow Python UDF
+    def add_one(x):  # type: ignore[no-untyped-def]
+        if x is not None:
+            return x + 1

Review Comment:
   Why don't we use the same function as the example for both cases, to avoid confusion? Or is it intentional to use different examples for some reason?
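
   To illustrate the idea of reusing one function for both cases, here is a rough sketch rather than a tested change for this PR. It assumes a PySpark version where `udf` accepts the `useArrow` keyword, as the diff above uses it. `make_slen_udfs` is a hypothetical helper name introduced only for illustration, and the null check in `slen` is my own addition.

```python
from typing import Optional


def slen(s: Optional[str]) -> Optional[int]:
    """Return the length of s, or None when the input is null."""
    return len(s) if s is not None else None


def make_slen_udfs():
    """Hypothetical helper: wrap the same Python function both ways.

    pyspark is imported lazily so that the plain function above can be
    exercised without a Spark installation.
    """
    from pyspark.sql.functions import udf

    # Default, pickled Python UDF
    pickled_slen = udf(slen, returnType="int")
    # Arrow Python UDF: same function, only the useArrow flag differs
    arrow_slen = udf(slen, returnType="int", useArrow=True)
    return pickled_slen, arrow_slen
```

   Assuming an active SparkSession and a DataFrame `df` with a string column `name`, something like `df.select(pickled_slen("name"), arrow_slen("name"))` would then show both variants side by side, so the only visible difference is the `useArrow` flag.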



##########
python/docs/source/user_guide/sql/arrow_pandas.rst:
##########
@@ -333,6 +333,32 @@ The following example shows how to use ``DataFrame.groupby().cogroup().applyInPa
 
 For detailed usage, please see :meth:`PandasCogroupedOps.applyInPandas`
 
+Arrow Python UDFs
+-----------------
+
+Arrow Python UDFs are user defined functions that are executed row-by-row, utilizing Arrow for efficient batch data
+transfer and serialization. To define an Arrow Python UDF, you can use the :meth:`udf` decorator or wrap the function
+with the :meth:`udf` method, ensuring the ``useArrow`` parameter is set to True. Additionally, you can enable Arrow
+optimization for Python UDFs throughout the entire SparkSession by setting the Spark configuration ``spark.sql
+.execution.pythonUDF.arrow.enabled`` to true. It's important to note that the Spark configuration takes effect only
+when ``useArrow`` is either not set or set to None.
+
+The type hints for Arrow Python UDFs should be specified in the same way as for default, pickled Python UDFs.
+
+Here's an example that demonstrates the usage of both a default, pickled Python UDF and an Arrow Python UDF:
+
+.. literalinclude:: ../../../../../examples/src/main/python/sql/arrow.py
+    :language: python
+    :lines: 279-297
+    :dedent: 4
+
+Compared to the default, pickled Python UDF, Arrow Python UDF provides a more coherent type coercion mechanism. UDF

Review Comment:
   nit: I think it would be great if we could pick one term, "Python UDF" or "Python UDFs", and use it consistently.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

