You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by BryanCutler <gi...@git.apache.org> on 2018/01/25 20:17:48 UTC
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/19575#discussion_r163956524
--- Diff: docs/sql-programming-guide.md ---
@@ -1693,70 +1693,70 @@ Using the above optimizations with Arrow will produce the same results as when A
enabled. Not all Spark data types are currently supported and an error will be raised if a column
has an unsupported type, see [Supported Types](#supported-types).
-## How to Write Vectorized UDFs
+## Pandas UDFs (a.k.a Vectorized UDFs)
-A vectorized UDF is similar to a standard UDF in Spark except the inputs and output will be
-Pandas Series, which allow the function to be composed with vectorized operations. This function
-can then be run very efficiently in Spark where data is sent in batches to Python and then
-is executed using Pandas Series as the inputs. The exected output of the function is also a Pandas
-Series of the same length as the inputs. A vectorized UDF is declared using the `pandas_udf`
-keyword, no additional configuration is required.
+With Arrow, we introduce a new type of UDF - pandas UDF. Pandas UDF is defined with a new function
+`pyspark.sql.functions.pandas_udf` and allows user to use functions that operate on `pandas.Series`
+and `pandas.DataFrame` with Spark. Currently, there are two types of pandas UDF: Scalar and Group Map.
-The following example shows how to create a vectorized UDF that computes the product of 2 columns.
+### Scalar
+
+Scalar pandas UDFs are used for vectorizing scalar operations. They can used with functions such as `select`
+and `withColumn`. To define a scalar pandas UDF, use `pandas_udf` to annotate a Python function. The Python
+should takes `pandas.Series` and returns a `pandas.Series` of the same size. Internally, Spark will
+split a column into multiple `pandas.Series` and invoke the Python function with each `pandas.Series`, and
+concat the results together to be a new column.
+
+The following example shows how to create a scalar pandas UDF that computes the product of 2 columns.
<div class="codetabs">
<div data-lang="python" markdown="1">
{% highlight python %}
import pandas as pd
-from pyspark.sql.functions import col, pandas_udf
-from pyspark.sql.types import LongType
+from pyspark.sql.functions import pandas_udf, PandasUDFTypr
--- End diff --
@icexelloss I think there is a typo -> `PandasUDFTypr`
Please make sure the example run
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org