You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by ueshin <gi...@git.apache.org> on 2014/05/23 13:27:29 UTC

[GitHub] spark pull request: [SPARK-1915] [SQL] AverageFunction should not ...

GitHub user ueshin opened a pull request:

    https://github.com/apache/spark/pull/862

    [SPARK-1915] [SQL] AverageFunction should not count if the evaluated value is null.

    Average values are difference between the calculation is done partially or not partially.
    Because `AverageFunction` (in not-partially calculation) counts even if the evaluated value is null.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ueshin/apache-spark issues/SPARK-1915

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/862.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #862
    
----
commit b1ff3c08a2ec77a829b46611fbe3aa68fef5e4e1
Author: Takuya UESHIN <ue...@happy-camper.st>
Date:   2014-05-23T11:16:54Z

    Modify AverageFunction not to count if the evaluated value is null.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1915] [SQL] AverageFunction should not ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/862#issuecomment-43997697
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1915] [SQL] AverageFunction should not ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/862#issuecomment-43997644
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1915] [SQL] AverageFunction should not ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/862#issuecomment-44004051
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15165/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1915] [SQL] AverageFunction should not ...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/862#issuecomment-44348205
  
    Ah, realized what's wrong, I need at lease 1 non-partial aggregation:
    
    ```
    scala> sql("SELECT AVG(key), COUNT(DISTINCT key) FROM src1").collect().foreach(println)
    ...
    == Query Plan ==
    Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669]
     Exchange SinglePartition
      HiveTableScan [key#672], (MetastoreRelation default, src1, None), None), which is now runnable
    14/05/28 07:21:31 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 12 (SchemaRDD[67] at RDD at SchemaRDD.scala:98
    == Query Plan ==
    Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669]
     Exchange SinglePartition
      HiveTableScan [key#672], (MetastoreRelation default, src1, None), None)
    ...
    [142.24,15]
    ```
    
    And it does lead to the wrong answer.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1915] [SQL] AverageFunction should not ...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/862#issuecomment-44347248
  
    @ueshin Sorry, my bad, misunderstood your PR description. And I think you are right.
    
    On the other hand, it seems that `AverageFunction` is not used in query plan unless we use it in DSL queries explicitly (maybe I missed something related to aggregation functions here):
    
    ```
    scala> sql("SELECT AVG(key) FROM src1").collect().foreach(println)
    == Query Plan ==
    Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
     Exchange SinglePartition
      Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS PartialSum#648]
       HiveTableScan [key#646], (MetastoreRelation default, src1, None), None), which is now runnable
    14/05/28 07:04:33 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 8 (SchemaRDD[45] at RDD at SchemaRDD.scala:98
    == Query Plan ==
    Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
     Exchange SinglePartition
      Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS PartialSum#648]
       HiveTableScan [key#646], (MetastoreRelation default, src1, None), None)
    ```
    
    It seems that currently `AVG` is always turned into partial sum divided by partial count, and leads us to the right answer (null values are ignored). But I think your fix still makes sense and should be merged.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1915] [SQL] AverageFunction should not ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/862


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1915] [SQL] AverageFunction should not ...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/862#issuecomment-44340725
  
    Thanks. I've merged this into master & branch-1.0


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1915] [SQL] AverageFunction should not ...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/862#issuecomment-44191684
  
    Hi @ueshin, would you please give some pointer about the semantics of `avg` function you described? I investigated official documentation and code base of Hive, and also have run several experiments with Hive 0.12.0, but didn't find any clue that `avg` should count null values.
    
    For a quick proof, here is a sample session I ran under Hive 0.12.0:
    
    ```
    hive> create table src(key int, value string);             
    hive> load data local inpath '/tmp/kv3.txt' into table src;
    hive> select avg(key) from src;                  
    ...
    OK
    237.06666666666666
    hive> select avg(key) from src where key is not null;
    ...
    OK
    237.06666666666666
    ```
    
    The `kv3.txt` is copied from the Hive test data files, with 15 non-null keys and 10 null keys:
    
    ```
    hive> select key, value from src;
    hive> select * from src;                                   
    ...
    OK
    238     val_238
    NULL
    311     val_311
    NULL    val_27
    NULL    val_165
    NULL    val_409
    255     val_255
    278     val_278
    98      val_98
    NULL    val_484
    NULL    val_265
    NULL    val_193
    401     val_401
    150     val_150
    273     val_273
    224
    369
    66      val_66
    128
    213     val_213
    146     val_146
    406     val_406
    NULL
    NULL
    NULL
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1915] [SQL] AverageFunction should not ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/862#issuecomment-44004047
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1915] [SQL] AverageFunction should not ...

Posted by ueshin <gi...@git.apache.org>.

Github user ueshin commented on the pull request:

    https://github.com/apache/spark/pull/862#issuecomment-44201454
  
    @liancheng Thank you for your comment.
    You mean the `avg` should not count null values, right?
    That is what I meant.
    Current `AverageFunction` counts null values, so I think my patch is needed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---