Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2017/01/12 02:22:16 UTC
[jira] [Resolved] (SPARK-15251) Cannot apply PythonUDF to aggregated column
[ https://issues.apache.org/jira/browse/SPARK-15251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-15251.
----------------------------------
Resolution: Cannot Reproduce
{code}
>>> def timesTwo(x):
...     return x * 2
...
>>> sqlContext.udf.register("timesTwo", timesTwo)
>>> data = [(1, 'a'), (2, 'b')]
>>> rdd = sc.parallelize(data)
>>> df = sqlContext.createDataFrame(rdd, ["x", "y"])
>>> df.registerTempTable("my_data")
>>> sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show()
+---+
|  t|
+---+
|  6|
+---+
{code}
It seems this has been fixed at some point, as I can't reproduce it in the current master. It would be great to backport the fix if anyone can point out the PR.
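For reference, the query in the transcript above reduces to applying the UDF to the column sum. The following plain-Python sketch mirrors that arithmetic to show why {{6}} is the expected value; the Spark calls are left as comments since {{sc}} and {{sqlContext}} assume a running PySpark shell:

```python
# Plain-Python mirror of "SELECT timesTwo(Sum(x)) t FROM my_data":
# over the rows (1, 'a') and (2, 'b'), Sum(x) is 1 + 2 = 3, so the
# UDF yields 2 * 3 = 6, matching the table shown in the comment.
def timesTwo(x):
    return x * 2

data = [(1, 'a'), (2, 'b')]
total = sum(x for x, _ in data)   # what Sum(x) computes
result = timesTwo(total)
print(result)  # 6

# In a PySpark shell the equivalent is:
# sqlContext.udf.register("timesTwo", timesTwo)
# sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show()
```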
> Cannot apply PythonUDF to aggregated column
> -------------------------------------------
>
> Key: SPARK-15251
> URL: https://issues.apache.org/jira/browse/SPARK-15251
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.6.1
> Reporter: Matthew Livesey
>
> In Scala it is possible to define a UDF and apply it to an aggregated value in an expression, for example:
> {code}
> def timesTwo(x: Int): Int = x * 2
> sqlContext.udf.register("timesTwo", timesTwo _)
> case class Data(x: Int, y: String)
> val data = List(Data(1, "a"), Data(2, "b"))
> val rdd = sc.parallelize(data)
> val df = rdd.toDF
> df.registerTempTable("my_data")
> sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show()
> +---+
> |  t|
> +---+
> |  6|
> +---+
> {code}
> Performing the same computation in pyspark:
> {code}
> def timesTwo(x):
>     return x * 2
> sqlContext.udf.register("timesTwo", timesTwo)
> data = [(1, 'a'), (2, 'b')]
> rdd = sc.parallelize(data)
> df = sqlContext.createDataFrame(rdd, ["x", "y"])
> df.registerTempTable("my_data")
> sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show()
> {code}
> Gives the following:
> {code}
> AnalysisException: u"expression 'pythonUDF' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;"
> {code}
> Using a lambda rather than a named function gives the same error.
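
On versions that hit this error (such as 1.6.1), a common workaround for "neither present in the group by" failures of this kind is to move the aggregation into a subquery, so the Python UDF is applied to an ordinary column instead of an aggregate expression. A minimal sketch of that rewrite (the SQL below is an assumed workaround, not from the original report, and running it requires a live {{sqlContext}}):

```python
# Build the query in two layers: the inner query performs the
# aggregation, the outer query applies the Python UDF to the
# already-aggregated column "s".
inner = "SELECT Sum(x) AS s FROM my_data"
outer = "SELECT timesTwo(s) AS t FROM ({}) agg".format(inner)
print(outer)

# On an affected Spark 1.6 shell one would then run:
# sqlContext.sql(outer).show()
```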
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org