Posted to commits@spark.apache.org by we...@apache.org on 2017/11/01 12:09:38 UTC

spark git commit: [SPARK-22347][PYSPARK][DOC] Add document to notice users for using udfs with conditional expressions

Repository: spark
Updated Branches:
  refs/heads/master 96798d14f -> 07f390a27


[SPARK-22347][PYSPARK][DOC] Add document to notice users for using udfs with conditional expressions

## What changes were proposed in this pull request?

Under the current execution mode of Python UDFs, we do not fully support Python UDFs as branch values or the else value in a CaseWhen expression.

Since a proper fix would require a non-trivial change (e.g., #19592) and this issue has a simpler workaround, we should just notify users about it in the documentation.

## How was this patch tested?

Documentation change only.

Author: Liang-Chi Hsieh <vi...@gmail.com>

Closes #19617 from viirya/SPARK-22347-3.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/07f390a2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/07f390a2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/07f390a2

Branch: refs/heads/master
Commit: 07f390a27d7b793291c352a643d4bbd5f47294a6
Parents: 96798d1
Author: Liang-Chi Hsieh <vi...@gmail.com>
Authored: Wed Nov 1 13:09:35 2017 +0100
Committer: Wenchen Fan <we...@databricks.com>
Committed: Wed Nov 1 13:09:35 2017 +0100

----------------------------------------------------------------------
 python/pyspark/sql/functions.py | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/07f390a2/python/pyspark/sql/functions.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 0d40368..3981549 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -2185,6 +2185,13 @@ def udf(f=None, returnType=StringType()):
         duplicate invocations may be eliminated or the function may even be invoked more times than
         it is present in the query.
 
+    .. note:: The user-defined functions do not support conditional execution when used with
+        SQL conditional expressions such as `when` or `if`. The functions are still applied to
+        all rows, regardless of whether the conditions are met, so the output is correct only
+        if the functions can run on all rows without failure. If a function can fail at runtime
+        on rows that do not satisfy the conditions, the suggested workaround is to incorporate
+        the conditional logic into the function itself.
+
     :param f: python function if used as a standalone function
     :param returnType: a :class:`pyspark.sql.types.DataType` object
 
@@ -2278,6 +2285,13 @@ def pandas_udf(f=None, returnType=StringType()):
        .. seealso:: :meth:`pyspark.sql.GroupedData.apply`
 
     .. note:: The user-defined function must be deterministic.
+
+    .. note:: The user-defined functions do not support conditional execution when used with
+        SQL conditional expressions such as `when` or `if`. The functions are still applied to
+        all rows, regardless of whether the conditions are met, so the output is correct only
+        if the functions can run on all rows without failure. If a function can fail at runtime
+        on rows that do not satisfy the conditions, the suggested workaround is to incorporate
+        the conditional logic into the function itself.
     """
     return _create_udf(f, returnType=returnType, pythonUdfType=PythonUdfType.PANDAS_UDF)
 


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org