You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2017/05/12 14:31:26 UTC

[GitHub] spark pull request #17964: [SPARK-20725][SQL] partial aggregate should behav...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/17964

    [SPARK-20725][SQL] partial aggregate should behave correctly for sameResult

    ## What changes were proposed in this pull request?
    
    In the final aggregate of a 2-phase aggregation, we have missing attributes in aggregate expressions, which is not handled properly by `QueryPlan.canonicalized`, and cause a bug for `sameResult`. This PR fixes it.
    
    ## How was this patch tested?
    
    a new regression test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark tmp

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17964.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17964
    
----
commit a2ebfdaf7dd1f2ccce2d0ded3ff22c2c93fb454b
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-05-12T14:27:57Z

    partial aggregate should behave correctly for sameResult

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17964
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76891/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17964
  
    **[Test build #76901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76901/testReport)** for PR 17964 at commit [`49da955`](https://github.com/apache/spark/commit/49da955dce260260325708d07becbc692cd3a005).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17964
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76893/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17964: [SPARK-20725][SQL] partial aggregate should behav...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17964#discussion_r116352803
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/SameResultSuite.scala ---
    @@ -46,4 +48,10 @@ class SameResultSuite extends QueryTest with SharedSQLContext {
         df.queryExecution.sparkPlan.find(_.isInstanceOf[FileSourceScanExec]).get
           .asInstanceOf[FileSourceScanExec]
       }
    +
    +  test("SPARK-20725: partial aggregate should behave correctly for sameResult") {
    +    val df1 = spark.range(10).agg(sum($"id"))
    +    val df2 = spark.range(10).agg(sum($"id"))
    --- End diff --
    
    ```Scala
        val df1 = spark.range(10).agg(sumDistinct($"id"))
        val df2 = spark.range(10).agg(sumDistinct($"id"))
    ```
    They will not match?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17964
  
    **[Test build #76871 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76871/testReport)** for PR 17964 at commit [`a2ebfda`](https://github.com/apache/spark/commit/a2ebfdaf7dd1f2ccce2d0ded3ff22c2c93fb454b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17964
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17964: [SPARK-20725][SQL] partial aggregate should behav...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17964


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17964
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76871/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17964
  
    **[Test build #76871 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76871/testReport)** for PR 17964 at commit [`a2ebfda`](https://github.com/apache/spark/commit/a2ebfdaf7dd1f2ccce2d0ded3ff22c2c93fb454b).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17964
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76901/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/17964
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/17964
  
    @cloud-fan can you backport this to 2.1?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/17964
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17964
  
    **[Test build #76891 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76891/testReport)** for PR 17964 at commit [`557298e`](https://github.com/apache/spark/commit/557298e3d88c04910ebff9cdb1ae77a1537c83af).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17964
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17964
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17964
  
    **[Test build #76901 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76901/testReport)** for PR 17964 at commit [`49da955`](https://github.com/apache/spark/commit/49da955dce260260325708d07becbc692cd3a005).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17964
  
    **[Test build #76893 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76893/testReport)** for PR 17964 at commit [`557298e`](https://github.com/apache/spark/commit/557298e3d88c04910ebff9cdb1ae77a1537c83af).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/17964
  
    LGTM - merging to master/2.2/2.1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17964: [SPARK-20725][SQL] partial aggregate should behav...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17964#discussion_r116306324
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala ---
    @@ -429,17 +429,13 @@ object QueryPlan {
        * with its referenced ordinal from input attributes. It's similar to `BindReferences` but we
        * do not use `BindReferences` here as the plan may take the expression as a parameter with type
        * `Attribute`, and replace it with `BoundReference` will cause error.
    +   * Note that, we may have missing attributes, e.g. in the final aggregate of 2-phase aggregation,
    +   * we should normalize missing attributes too, with expr id -1.
        */
       def normalizeExprId[T <: Expression](e: T, input: AttributeSeq): T = {
         e.transformUp {
           case s: SubqueryExpression => s.canonicalize(input)
    -      case ar: AttributeReference =>
    -        val ordinal = input.indexOf(ar.exprId)
    -        if (ordinal == -1) {
    -          ar
    -        } else {
    -          ar.withExprId(ExprId(ordinal))
    -        }
    +      case ar: AttributeReference => ar.withExprId(ExprId(input.indexOf(ar.exprId)))
    --- End diff --
    
    Ok, so this works because of the way we plan aggregates and I am totally fine with this.
    
    I am slightly worried about non-complete aggregate expression that cannot be resolved and wreak havok further down the line because `sameResult` falsely evaluated to `true`. Can we special case non-complete aggregate expressions?
    
    From an architectural point of view it might be better to add this as a normalize function to Expression.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17964: [SPARK-20725][SQL] partial aggregate should behav...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17964#discussion_r116357380
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/SameResultSuite.scala ---
    @@ -46,4 +48,10 @@ class SameResultSuite extends QueryTest with SharedSQLContext {
         df.queryExecution.sparkPlan.find(_.isInstanceOf[FileSourceScanExec]).get
           .asInstanceOf[FileSourceScanExec]
       }
    +
    +  test("SPARK-20725: partial aggregate should behave correctly for sameResult") {
    +    val df1 = spark.range(10).agg(sum($"id"))
    +    val df2 = spark.range(10).agg(sum($"id"))
    --- End diff --
    
    Good catch! The reason is, `HashAggregateExec.requiredChildDistributionExpressions` is a `Option[Seq[Expression]]`, which is not treated as expressions of `HashAggregateExec`, and thus not touched by `QueryPlan.mapExpressions`.
    
    I have fixed it in `QueryPlan`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/17964
  
    cc @hvanhovell  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17964
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17964
  
    **[Test build #76893 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76893/testReport)** for PR 17964 at commit [`557298e`](https://github.com/apache/spark/commit/557298e3d88c04910ebff9cdb1ae77a1537c83af).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org