You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Herman van Hovell (JIRA)" <ji...@apache.org> on 2015/12/25 10:32:49 UTC
[jira] [Commented] (SPARK-12491) UDAF result differs in SQL if alias is used

    [ https://issues.apache.org/jira/browse/SPARK-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071454#comment-15071454 ] 

Herman van Hovell commented on SPARK-12491:
-------------------------------------------

I cannot reproduce this issue on 1.5.2 and the latest master in local mode.

I used the following code:

{noformat}
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

class GeometricMean extends UserDefinedAggregateFunction {
  def inputSchema: org.apache.spark.sql.types.StructType =
    StructType(StructField("value", DoubleType) :: Nil)

  def bufferSchema: StructType = StructType(
    StructField("count", LongType) ::
    StructField("product", DoubleType) :: Nil
  )

  def dataType: DataType = DoubleType

  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 1.0
  }

  def update(buffer: MutableAggregationBuffer,input: Row): Unit = {
    buffer(0) = buffer.getAs[Long](0) + 1
    buffer(1) = buffer.getAs[Double](1) * input.getAs[Double](0)
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getAs[Long](0) + buffer2.getAs[Long](0)
    buffer1(1) = buffer1.getAs[Double](1) * buffer2.getAs[Double](1)
  }

  def evaluate(buffer: Row): Any = {
    math.pow(buffer.getDouble(1), 1.toDouble / buffer.getLong(0))
  }
}

// Create an instance of UDAF GeometricMean.
val gm = new GeometricMean

// Register the udaf
sqlContext.udf.register("gm", gm)

// Create a simple DataFrame with a single column called "id"
// containing number 1 to 10.
val df = sqlContext.range(1, 11).select(($"id" / 3).cast("int").as("group_id"), $"id")
df.registerTempTable("simple")

// Without alias
sqlContext.sql("select group_id, gm(id) from simple group by group_id").show()

// With alias
sqlContext.sql("select group_id, gm(id) as GeometricMean from simple group by group_id").show()
{noformat}

Could you share the logical plans for both queries? You can get a logical plan by doing this:

{noformat}
// Query without alias
val q = sqlContext.sql("select group_id, gm(id) from simple group by group_id")
q.explain(true)
{noformat}


> UDAF result differs in SQL if alias is used
> -------------------------------------------
>
>                 Key: SPARK-12491
>                 URL: https://issues.apache.org/jira/browse/SPARK-12491
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: Tristan
>
> Using the GeometricMean UDAF example (https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html), I found the following discrepancy in results:
> scala> sqlContext.sql("select group_id, gm(id) from simple group by group_id").show()
> +--------+---+
> |group_id|_c1|
> +--------+---+
> |       0|0.0|
> |       1|0.0|
> |       2|0.0|
> +--------+---+
> scala> sqlContext.sql("select group_id, gm(id) as GeometricMean from simple group by group_id").show()
> +--------+-----------------+
> |group_id|    GeometricMean|
> +--------+-----------------+
> |       0|8.981385496571725|
> |       1|7.301716979342118|
> |       2|7.706253151292568|
> +--------+-----------------+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org