Posted to issues@spark.apache.org by "Xiao Li (JIRA)" <ji...@apache.org> on 2017/05/10 23:53:04 UTC

[jira] [Resolved] (SPARK-20685) BatchPythonEvaluation UDF evaluator fails for case of single UDF with repeated argument

     [ https://issues.apache.org/jira/browse/SPARK-20685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-20685.
-----------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0
                   2.2.1
                   2.1.2

> BatchPythonEvaluation UDF evaluator fails for case of single UDF with repeated argument
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-20685
>                 URL: https://issues.apache.org/jira/browse/SPARK-20685
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>             Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> There's a latent corner-case bug in PySpark UDF evaluation where executing a stage with a single UDF that takes more than one argument, _with the same argument repeated_, will crash at execution time with a confusing error.
> Here's a repro:
> {code}
> from pyspark.sql.types import *
> spark.catalog.registerFunction("add", lambda x, y: x + y, IntegerType())
> spark.sql("SELECT add(1, 1)").first()
> {code}
> This fails with
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main
>     process()
>   File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line 107, in <lambda>
>     func = lambda _, it: map(mapper, it)
>   File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line 93, in <lambda>
>     mapper = lambda a: udf(*a)
>   File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line 71, in <lambda>
>     return lambda *a: f(*a)
> TypeError: <lambda>() takes exactly 2 arguments (1 given)
> {code}
> The problem was introduced by SPARK-14267: the code there has a fast path for handling batch UDF evaluation consisting of a single Python UDF, but that branch incorrectly assumes that a single UDF won't have repeated arguments and therefore skips the code for unpacking arguments from the input row (whose schema may not necessarily match the UDF's inputs).
> I have a simple fix for this which I'll submit now.
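The failure mode can be illustrated in plain Python. This is a hypothetical sketch of the two evaluation paths, not Spark's actual evaluator code: the evaluator deduplicates UDF argument expressions, so add(c, c) ships only one column plus a list of argument offsets mapping UDF parameters back to columns, and the buggy fast path skips that remapping.

```python
# Hypothetical sketch (assumed names; not Spark's actual implementation).
udf = lambda x, y: x + y    # the registered "add" UDF

row = (1,)                  # deduplicated input row: a single column
arg_offsets = [0, 0]        # both UDF parameters read from column 0

# Buggy fast path: calls udf(*row), so the UDF receives only one argument,
# producing "TypeError: <lambda>() takes exactly 2 arguments (1 given)".

# Correct path: expand the row through the offsets before calling the UDF.
args = tuple(row[i] for i in arg_offsets)
result = udf(*args)
print(result)
```

Under this sketch, the correct path reconstructs both arguments from the single shipped column, so the call succeeds.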



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org