Posted to issues@spark.apache.org by "Cheng Lian (JIRA)" <ji...@apache.org> on 2014/05/28 02:00:07 UTC

[jira] [Updated] (SPARK-1915) AverageFunction should not count if the evaluated value is null.

     [ https://issues.apache.org/jira/browse/SPARK-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-1915:
------------------------------

    Description: 
Average values differ depending on whether the calculation is done partially or not.

This is because {{AverageFunction}} (used in the non-partial calculation) counts a row even if the evaluated value is null.

To reproduce this bug, run the following in {{sbt/sbt hive/console}}:

{code}
scala> sql("SELECT AVG(key) FROM src1").collect().foreach(println)
...
== Query Plan ==
Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS PartialSum#648]
   HiveTableScan [key#646], (MetastoreRelation default, src1, None), None), which is now runnable
14/05/28 07:04:33 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 8 (SchemaRDD[45] at RDD at SchemaRDD.scala:98
== Query Plan ==
Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS PartialSum#648]
   HiveTableScan [key#646], (MetastoreRelation default, src1, None), None)
...
[237.06666666666666]

scala> sql("SELECT AVG(key), COUNT(DISTINCT key) FROM src1").collect().foreach(println)
...
== Query Plan ==
Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669]
 Exchange SinglePartition
  HiveTableScan [key#672], (MetastoreRelation default, src1, None), None), which is now runnable
14/05/28 07:21:31 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 12 (SchemaRDD[67] at RDD at SchemaRDD.scala:98
== Query Plan ==
Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669]
 Exchange SinglePartition
  HiveTableScan [key#672], (MetastoreRelation default, src1, None), None)
...
[142.24,15]
{code}

In the first query, {{AVG}} is broken into partial aggregations and gives the right answer (null values are ignored). In the second query, since {{COUNT(DISTINCT key)}} can't be turned into a partial aggregation, {{AVG}} isn't either, and the bug is triggered.
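
The difference between the two paths can be illustrated with a small standalone sketch. This is a simplified model under stated assumptions, not the actual {{AverageFunction}} code: the object {{AvgNullSketch}}, its method names, and the use of {{Option[Int]}} to stand for a nullable column are all hypothetical. As a side note, the two reported results are arithmetically consistent with the same sum divided by two different counts (3556 / 15 = 237.0666... and 3556 / 25 = 142.24); that is an inference from the numbers above, not something stated in the report.

```scala
// Hypothetical, simplified model of the two evaluation paths; not Spark's actual code.
object AvgNullSketch {
  // A nullable column value is modeled as Option[Int]; None stands for SQL NULL.
  type Cell = Option[Int]

  // Models the buggy non-partial path: the counter advances for every row,
  // including rows whose evaluated value is NULL.
  def avgCountingNulls(rows: Seq[Cell]): Option[Double] = {
    var sum = 0L
    var count = 0L
    rows.foreach { cell =>
      count += 1                     // bug: NULL rows are counted too
      cell.foreach(v => sum += v)    // only non-NULL values are added
    }
    if (count == 0) None else Some(sum.toDouble / count)
  }

  // Models the expected behavior: only non-NULL values contribute to sum and count.
  def avgIgnoringNulls(rows: Seq[Cell]): Option[Double] = {
    val values = rows.flatten
    if (values.isEmpty) None else Some(values.map(_.toLong).sum.toDouble / values.size)
  }

  // Models the partial path: each partition yields (PartialSum, PartialCount)
  // over non-NULL values, and the final step divides the merged sum by the merged count.
  def avgPartial(partitions: Seq[Seq[Cell]]): Option[Double] = {
    val partials = partitions.map { p =>
      val values = p.flatten
      (values.map(_.toLong).sum, values.size.toLong)
    }
    val (sum, count) = partials.foldLeft((0L, 0L)) {
      case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2)
    }
    if (count == 0) None else Some(sum.toDouble / count)
  }
}
```

With the toy input {{Seq(Some(10), None, Some(20), None)}}, {{avgIgnoringNulls}} and {{avgPartial}} (over any split of those rows) agree with each other, while {{avgCountingNulls}} returns a smaller value because the two null rows inflate the denominator.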

  was:
Average values differ depending on whether the calculation is done partially or not.

This is because {{AverageFunction}} (used in the non-partial calculation) counts a row even if the evaluated value is null.


> AverageFunction should not count if the evaluated value is null.
> ----------------------------------------------------------------
>
>                 Key: SPARK-1915
>                 URL: https://issues.apache.org/jira/browse/SPARK-1915
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Takuya Ueshin
>            Assignee: Takuya Ueshin
>             Fix For: 1.1.0, 1.0.1


