You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Herman van Hovell (JIRA)" <ji...@apache.org> on 2015/12/25 10:32:49 UTC
[jira] [Commented] (SPARK-12491) UDAF result differs in SQL if
alias is used
[ https://issues.apache.org/jira/browse/SPARK-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071454#comment-15071454 ]
Herman van Hovell commented on SPARK-12491:
-------------------------------------------
I cannot reproduce this issue on 1.5.2 and the latest master in local mode.
I used the following code:
{noformat}
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
class GeometricMean extends UserDefinedAggregateFunction {
def inputSchema: org.apache.spark.sql.types.StructType =
StructType(StructField("value", DoubleType) :: Nil)
def bufferSchema: StructType = StructType(
StructField("count", LongType) ::
StructField("product", DoubleType) :: Nil
)
def dataType: DataType = DoubleType
def deterministic: Boolean = true
def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = 0L
buffer(1) = 1.0
}
def update(buffer: MutableAggregationBuffer,input: Row): Unit = {
buffer(0) = buffer.getAs[Long](0) + 1
buffer(1) = buffer.getAs[Double](1) * input.getAs[Double](0)
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
buffer1(0) = buffer1.getAs[Long](0) + buffer2.getAs[Long](0)
buffer1(1) = buffer1.getAs[Double](1) * buffer2.getAs[Double](1)
}
def evaluate(buffer: Row): Any = {
math.pow(buffer.getDouble(1), 1.toDouble / buffer.getLong(0))
}
}
// Create an instance of UDAF GeometricMean.
val gm = new GeometricMean
// Register the udaf
sqlContext.udf.register("gm", gm)
// Create a simple DataFrame with a single column called "id"
// containing number 1 to 10.
val df = sqlContext.range(1, 11).select(($"id" / 3).cast("int").as("group_id"), $"id")
df.registerTempTable("simple")
// Without alias
sqlContext.sql("select group_id, gm(id) from simple group by group_id").show()
// With alias
sqlContext.sql("select group_id, gm(id) as GeometricMean from simple group by group_id").show()
{noformat}
Could you share the logical plans for both queries? You can get a logical plan by doing this:
{noformat}
// Query without alias
val q = sqlContext.sql("select group_id, gm(id) from simple group by group_id")
q.explain(true)
{noformat}
> UDAF result differs in SQL if alias is used
> -------------------------------------------
>
> Key: SPARK-12491
> URL: https://issues.apache.org/jira/browse/SPARK-12491
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.2
> Reporter: Tristan
>
> Using the GeometricMean UDAF example (https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html), I found the following discrepancy in results:
> scala> sqlContext.sql("select group_id, gm(id) from simple group by group_id").show()
> +--------+---+
> |group_id|_c1|
> +--------+---+
> | 0|0.0|
> | 1|0.0|
> | 2|0.0|
> +--------+---+
> scala> sqlContext.sql("select group_id, gm(id) as GeometricMean from simple group by group_id").show()
> +--------+-----------------+
> |group_id| GeometricMean|
> +--------+-----------------+
> | 0|8.981385496571725|
> | 1|7.301716979342118|
> | 2|7.706253151292568|
> +--------+-----------------+
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org