Posted to issues@spark.apache.org by "holdenk (JIRA)" <ji...@apache.org> on 2018/08/13 17:56:00 UTC

[jira] [Commented] (SPARK-24735) Improve exception when mixing up pandas_udf types

    [ https://issues.apache.org/jira/browse/SPARK-24735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578710#comment-16578710 ] 

holdenk commented on SPARK-24735:
---------------------------------

I think we could do better than just improving the exception. If we look at the other aggregates in PySpark, calling them with select does the grouping for us:

 
{code:python}
>>> df.select(sumDistinct(df._1)).show()
+----------------+
|sum(DISTINCT _1)|
+----------------+
|            4950|
+----------------+
{code}
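
For reference, calling an aggregate through select is just shorthand for a global {{groupBy().agg()}}, which is the implicit grouping described above. A minimal sketch of the equivalence (reusing the {{df}} and {{_1}} column from the example above):

{code:python}
from pyspark.sql.functions import sumDistinct

# select() on an aggregate implicitly applies an empty (global) grouping,
# so both of these produce the same single-row result.
df.select(sumDistinct(df._1)).show()
df.groupBy().agg(sumDistinct(df._1)).show()
{code}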

> Improve exception when mixing up pandas_udf types
> -------------------------------------------------
>
>                 Key: SPARK-24735
>                 URL: https://issues.apache.org/jira/browse/SPARK-24735
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: Bryan Cutler
>            Priority: Major
>
> From the discussion at https://github.com/apache/spark/pull/21650#discussion_r199203674: mixing up Pandas UDF types, such as using a GROUPED_MAP UDF as a SCALAR one, e.g. {{foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)}}, produces an exception that is hard to understand. It should tell the user that the UDF type is wrong. This is the full output:
> {code}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> >>> df.select(foo(df['v'])).show()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/dataframe.py", line 353, in show
>     print(self._jdf.showString(n, 20, vertical))
>   File "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
>   File "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/utils.py", line 63, in deco
>     return f(*a, **kw)
>   File "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o257.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: <lambda>(input[0, bigint, false])
> 	at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:261)
> 	at org.apache.spark.sql.catalyst.expressions.PythonUDF.doGenCode(PythonUDF.scala:50)
> 	at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
> 	at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
> 	at scala.Option.getOrElse(Option.scala:121)
>         ...
> {code}
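> For contrast, the intended GROUPED_MAP usage goes through {{groupby().apply()}} rather than {{select()}}; a minimal sketch (assuming a hypothetical {{df}} with columns {{id}} and {{v}}):
> {code}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
>
> # A GROUPED_MAP UDF receives a pandas DataFrame per group
> # and must return a pandas DataFrame matching the declared schema.
> @pandas_udf('id long, v int', PandasUDFType.GROUPED_MAP)
> def foo(pdf):
>     return pdf
>
> df.groupby('id').apply(foo).show()
> {code}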


