Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2016/02/19 00:12:06 UTC

[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/11256

    [SPARK-13376] [SQL] improve column pruning

    ## What changes were proposed in this pull request?
    
    This PR mostly rewrites the ColumnPruning rule to support most of the SQL logical plans (except those for Dataset).
    
    ## How was this patch tested?
    
    This is tested by unit tests and was also manually tested with TPCDS Q78, where it pruned all unused columns successfully and improved the performance by 78% (from 22s to 12s).


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark fix_column_pruning

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11256.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11256
    
----
commit 1145f1975481fc5dadc0efe876a7b746ce26cdc1
Author: Davies Liu <da...@databricks.com>
Date:   2016-02-18T23:00:35Z

    improve column pruning

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11256#discussion_r53748029
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -300,97 +300,71 @@ object SetOperationPushDown extends Rule[LogicalPlan] with PredicateHelper {
      */
     object ColumnPruning extends Rule[LogicalPlan] {
       def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    -    case a @ Aggregate(_, _, e @ Expand(projects, output, child))
    -      if (e.outputSet -- a.references).nonEmpty =>
    -      val newOutput = output.filter(a.references.contains(_))
    -      val newProjects = projects.map { proj =>
    -        proj.zip(output).filter { case (e, a) =>
    +    // Prunes the unused columns from project list of Project/Aggregate/Window/Expand
    +    case p @ Project(_, p2: Project) if (p2.outputSet -- p.references).nonEmpty =>
    --- End diff --
    
    I think `CollapseProject` can also cover this case: it merges the inner project list into the outer one, so unused columns in the inner project list are removed as well.
    
    But this rule is useful when `CollapseProject` can't be applied, i.e. when there are overlapping non-deterministic expressions. BTW, I think the order doesn't matter here: this rule can run before or after `CollapseProject`, since it handles cases `CollapseProject` can't handle.
    
    So this LGTM
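
The `Project`-over-`Project` case discussed above can be sketched with a toy model. This is only an illustration, not Spark's code: `Plan`, `Relation`, `NamedExpr`, `Project`, and `prune` below are hypothetical stand-ins defined here, with columns represented as plain strings. The sketch drops inner project-list entries that the outer `Project` never references, which is the shape of the new case in the diff.

```scala
object ToyColumnPruning {
  // Hypothetical miniature plan nodes; these are NOT Catalyst's classes.
  sealed trait Plan { def output: Seq[String] }
  case class Relation(output: Seq[String]) extends Plan

  // A named expression: its output name and the columns it reads.
  case class NamedExpr(name: String, references: Set[String])

  case class Project(projectList: Seq[NamedExpr], child: Plan) extends Plan {
    def output: Seq[String] = projectList.map(_.name)
    def references: Set[String] = projectList.flatMap(_.references).toSet
  }

  // If the inner Project produces columns the outer Project never reads,
  // filter them out of the inner project list. Other plans are returned as-is.
  def prune(plan: Plan): Plan = plan match {
    case p @ Project(_, p2: Project) if (p2.output.toSet -- p.references).nonEmpty =>
      p.copy(child = p2.copy(
        projectList = p2.projectList.filter(e => p.references.contains(e.name))))
    case other => other
  }

  def main(args: Array[String]): Unit = {
    val table = Relation(Seq("a", "b", "c"))
    // "r" stands for an alias of a non-deterministic expression such as rand().
    val inner = Project(
      Seq(NamedExpr("a", Set("a")), NamedExpr("r", Set.empty), NamedExpr("c", Set("c"))),
      table)
    // The outer Project reads only "a" and "r", so "c" is unused.
    val outer = Project(Seq(NamedExpr("x", Set("a", "r"))), inner)
    prune(outer) match {
      case Project(_, child) => println(child.output) // List(a, r)
      case other             => println(other)
    }
  }
}
```

In this toy setting the inner node keeps `a` and `r` and loses `c`, without the two `Project` nodes being merged.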



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11256#discussion_r53663946
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -300,97 +300,71 @@ object SetOperationPushDown extends Rule[LogicalPlan] with PredicateHelper {
      */
     object ColumnPruning extends Rule[LogicalPlan] {
       def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    -    case a @ Aggregate(_, _, e @ Expand(projects, output, child))
    -      if (e.outputSet -- a.references).nonEmpty =>
    -      val newOutput = output.filter(a.references.contains(_))
    -      val newProjects = projects.map { proj =>
    -        proj.zip(output).filter { case (e, a) =>
    +    // Prunes the unused columns from project list of Project/Aggregate/Window/Expand
    +    case p @ Project(_, p2: Project) if (p2.outputSet -- p.references).nonEmpty =>
    --- End diff --
    
    CollapseProject can't remove the unused columns.



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-185984746
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51503/
    Test FAILed.



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-186369230
  
    **[Test build #2548 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2548/consoleFull)** for PR 11256 at commit [`1145f19`](https://github.com/apache/spark/commit/1145f1975481fc5dadc0efe876a7b746ce26cdc1).



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-187396600
  
    **[Test build #51674 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51674/consoleFull)** for PR 11256 at commit [`11104c3`](https://github.com/apache/spark/commit/11104c36367f8b10c7d545782e336def7b6844c9).



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11256#discussion_r53737200
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -300,97 +300,71 @@ object SetOperationPushDown extends Rule[LogicalPlan] with PredicateHelper {
      */
     object ColumnPruning extends Rule[LogicalPlan] {
       def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    -    case a @ Aggregate(_, _, e @ Expand(projects, output, child))
    -      if (e.outputSet -- a.references).nonEmpty =>
    -      val newOutput = output.filter(a.references.contains(_))
    -      val newProjects = projects.map { proj =>
    -        proj.zip(output).filter { case (e, a) =>
    +    // Prunes the unused columns from project list of Project/Aggregate/Window/Expand
    +    case p @ Project(_, p2: Project) if (p2.outputSet -- p.references).nonEmpty =>
    --- End diff --
    
    Should we guarantee this rule runs before `CollapseProject`?



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-186519466
  
    cc @cloud-fan 




[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-187448958
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-187563170
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-188432509
  
    FYI I had to revert this because it broke the lateral view test.
    




[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-186376123
  
    **[Test build #2548 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2548/consoleFull)** for PR 11256 at commit [`1145f19`](https://github.com/apache/spark/commit/1145f1975481fc5dadc0efe876a7b746ce26cdc1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-186657939
  
    **[Test build #2551 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2551/consoleFull)** for PR 11256 at commit [`1145f19`](https://github.com/apache/spark/commit/1145f1975481fc5dadc0efe876a7b746ce26cdc1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-188016320
  
    Merged into master.



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-187500547
  
    **[Test build #51709 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51709/consoleFull)** for PR 11256 at commit [`face1c7`](https://github.com/apache/spark/commit/face1c7a0e14ce6ac2caba15f36f3e0fe41264d7).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11256#discussion_r53738719
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -300,97 +300,71 @@ object SetOperationPushDown extends Rule[LogicalPlan] with PredicateHelper {
      */
     object ColumnPruning extends Rule[LogicalPlan] {
       def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    -    case a @ Aggregate(_, _, e @ Expand(projects, output, child))
    -      if (e.outputSet -- a.references).nonEmpty =>
    -      val newOutput = output.filter(a.references.contains(_))
    -      val newProjects = projects.map { proj =>
    -        proj.zip(output).filter { case (e, a) =>
    +    // Prunes the unused columns from project list of Project/Aggregate/Window/Expand
    +    case p @ Project(_, p2: Project) if (p2.outputSet -- p.references).nonEmpty =>
    --- End diff --
    
    `p` should be created within this rule, then this case should be applied before CollapseProject.



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-185984711
  
    **[Test build #51503 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51503/consoleFull)** for PR 11256 at commit [`1145f19`](https://github.com/apache/spark/commit/1145f1975481fc5dadc0efe876a7b746ce26cdc1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-187448961
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51696/
    Test FAILed.



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-187986296
  
    LGTM



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-187525269
  
    **[Test build #51728 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51728/consoleFull)** for PR 11256 at commit [`c1f155c`](https://github.com/apache/spark/commit/c1f155c163481c5866f46fa1be7781e269df3ea7).



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-187983748
  
    cc @marmbrus 



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-187413282
  
    **[Test build #51674 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51674/consoleFull)** for PR 11256 at commit [`11104c3`](https://github.com/apache/spark/commit/11104c36367f8b10c7d545782e336def7b6844c9).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-187413356
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-187563177
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51728/
    Test PASSed.



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11256#discussion_r53875740
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -313,97 +313,85 @@ object SetOperationPushDown extends Rule[LogicalPlan] with PredicateHelper {
      */
     object ColumnPruning extends Rule[LogicalPlan] {
       def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    -    case a @ Aggregate(_, _, e @ Expand(projects, output, child))
    -      if (e.outputSet -- a.references).nonEmpty =>
    -      val newOutput = output.filter(a.references.contains(_))
    -      val newProjects = projects.map { proj =>
    -        proj.zip(output).filter { case (e, a) =>
    +    // Prunes the unused columns from project list of Project/Aggregate/Window/Expand
    +    case p @ Project(_, p2: Project) if (p2.outputSet -- p.references).nonEmpty =>
    +      p.copy(child = p2.copy(projectList = p2.projectList.filter(p.references.contains)))
    +    case p @ Project(_, a: Aggregate) if (a.outputSet -- p.references).nonEmpty =>
    +      p.copy(
    +        child = a.copy(aggregateExpressions = a.aggregateExpressions.filter(p.references.contains)))
    +    case p @ Project(_, w: Window) if (w.outputSet -- p.references).nonEmpty =>
    +      p.copy(child = w.copy(
    +        projectList = w.projectList.filter(p.references.contains),
    +        windowExpressions = w.windowExpressions.filter(p.references.contains)))
    +    case a @ Project(_, e @ Expand(_, _, grandChild)) if (e.outputSet -- a.references).nonEmpty =>
    +      val newOutput = e.output.filter(a.references.contains(_))
    +      val newProjects = e.projections.map { proj =>
    +        proj.zip(e.output).filter { case (e, a) =>
               newOutput.contains(a)
             }.unzip._1
           }
    -      a.copy(child = Expand(newProjects, newOutput, child))
    +      a.copy(child = Expand(newProjects, newOutput, grandChild))
    +    // TODO: support some logical plan for Dataset
    --- End diff --
    
    Maybe create a JIRA for this.  Should be pretty easy.
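
The `Expand` case in the diff above prunes by position: it filters the output attributes, then removes the same positions from every projection using `zip`, `filter`, and `unzip`. A minimal standalone sketch of that step follows, assuming expressions and attributes are plain strings. `ExpandPruningSketch` and `pruneExpand` are hypothetical names used only for this illustration and are not part of Spark.

```scala
object ExpandPruningSketch {
  // Keep only the output columns the parent references, and drop the
  // matching positions from each projection (one projection per output row).
  def pruneExpand(
      projections: Seq[Seq[String]],
      output: Seq[String],
      referenced: Set[String]): (Seq[Seq[String]], Seq[String]) = {
    val newOutput = output.filter(referenced.contains)
    val newProjections = projections.map { proj =>
      // Pair each expression with the attribute it produces, keep the pairs
      // whose attribute survives, then recover the expressions.
      proj.zip(output).filter { case (_, attr) => newOutput.contains(attr) }.unzip._1
    }
    (newProjections, newOutput)
  }

  def main(args: Array[String]): Unit = {
    val output = Seq("a", "b", "gid")
    val projections = Seq(Seq("a", "null", "0"), Seq("null", "b", "1"))
    // Suppose the parent only reads "a" and "gid".
    val (projs, out) = pruneExpand(projections, output, Set("a", "gid"))
    println(out)   // List(a, gid)
    println(projs) // List(List(a, 0), List(null, 1))
  }
}
```

Because the filter is positional, every projection stays the same length as the pruned output, which is what lets the pruned `Expand` be rebuilt from `newProjects` and `newOutput`.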



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-186656042
  
    **[Test build #2551 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2551/consoleFull)** for PR 11256 at commit [`1145f19`](https://github.com/apache/spark/commit/1145f1975481fc5dadc0efe876a7b746ce26cdc1).



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-187448600
  
    **[Test build #51696 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51696/consoleFull)** for PR 11256 at commit [`e31dec7`](https://github.com/apache/spark/commit/e31dec7d1ede3547862e78d9fc4959f773ccbb72).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/11256



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11256#discussion_r53561584
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -300,97 +300,71 @@ object SetOperationPushDown extends Rule[LogicalPlan] with PredicateHelper {
      */
     object ColumnPruning extends Rule[LogicalPlan] {
       def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    -    case a @ Aggregate(_, _, e @ Expand(projects, output, child))
    -      if (e.outputSet -- a.references).nonEmpty =>
    -      val newOutput = output.filter(a.references.contains(_))
    -      val newProjects = projects.map { proj =>
    -        proj.zip(output).filter { case (e, a) =>
    +    // Prunes the unused columns from project list of Project/Aggregate/Window/Expand
    +    case p @ Project(_, p2: Project) if (p2.outputSet -- p.references).nonEmpty =>
    +      p.copy(child = p2.copy(projectList = p2.projectList.filter(p.references.contains)))
    +    case p @ Project(_, a: Aggregate) if (a.outputSet -- p.references).nonEmpty =>
    +      p.copy(
    +        child = a.copy(aggregateExpressions = a.aggregateExpressions.filter(p.references.contains)))
    +    case p @ Project(_, w: Window) if (w.outputSet -- p.references).nonEmpty =>
    +      p.copy(child = w.copy(
    +        projectList = w.projectList.filter(p.references.contains),
    +        windowExpressions = w.windowExpressions.filter(p.references.contains)))
    +    case a @ Project(_, e @ Expand(_, _, grandChild)) if (e.outputSet -- a.references).nonEmpty =>
    +      val newOutput = e.output.filter(a.references.contains(_))
    +      val newProjects = e.projections.map { proj =>
    +        proj.zip(e.output).filter { case (e, a) =>
               newOutput.contains(a)
             }.unzip._1
           }
    -      a.copy(child = Expand(newProjects, newOutput, child))
    +      a.copy(child = Expand(newProjects, newOutput, grandChild))
    +    // TODO: support some logical plan for Dataset
     
    -    case a @ Aggregate(_, _, e @ Expand(_, _, child))
    -      if (child.outputSet -- e.references -- a.references).nonEmpty =>
    -      a.copy(child = e.copy(child = prunedChild(child, e.references ++ a.references)))
    -
    -    // Eliminate attributes that are not needed to calculate the specified aggregates.
    +    // Prunes the unused columns from child of Aggregate/Window/Expand/Generate
         case a @ Aggregate(_, _, child) if (child.outputSet -- a.references).nonEmpty =>
    -      a.copy(child = Project(a.references.toSeq, child))
    -
    -    // Eliminate attributes that are not needed to calculate the Generate.
    +      a.copy(child = prunedChild(child, a.references))
    +    case w @ Window(_, _, _, _, child) if (child.outputSet -- w.references).nonEmpty =>
    +      w.copy(child = prunedChild(child, w.references))
    +    case e @ Expand(_, _, child) if (child.outputSet -- e.references).nonEmpty =>
    +      e.copy(child = prunedChild(child, e.references))
         case g: Generate if !g.join && (g.child.outputSet -- g.references).nonEmpty =>
    -      g.copy(child = Project(g.references.toSeq, g.child))
    +      g.copy(child = prunedChild(g.child, g.references))
     
    +    // Turn off `join` for Generate if no column from it's child is used
         case p @ Project(_, g: Generate) if g.join && p.references.subsetOf(g.generatedSet) =>
           p.copy(child = g.copy(join = false))
     
    -    case p @ Project(projectList, g: Generate) if g.join =>
    -      val neededChildOutput = p.references -- g.generatorOutput ++ g.references
    -      if (neededChildOutput == g.child.outputSet) {
    -        p
    -      } else {
    -        Project(projectList, g.copy(child = Project(neededChildOutput.toSeq, g.child)))
    -      }
    -
    -    case p @ Project(projectList, a @ Aggregate(groupingExpressions, aggregateExpressions, child))
    -        if (a.outputSet -- p.references).nonEmpty =>
    -      Project(
    -        projectList,
    -        Aggregate(
    -          groupingExpressions,
    -          aggregateExpressions.filter(e => p.references.contains(e)),
    -          child))
    -
    -    // Eliminate unneeded attributes from either side of a Join.
    -    case Project(projectList, Join(left, right, joinType, condition)) =>
    -      // Collect the list of all references required either above or to evaluate the condition.
    -      val allReferences: AttributeSet =
    -        AttributeSet(
    -          projectList.flatMap(_.references.iterator)) ++
    -          condition.map(_.references).getOrElse(AttributeSet(Seq.empty))
    -
    -      /** Applies a projection only when the child is producing unnecessary attributes */
    -      def pruneJoinChild(c: LogicalPlan): LogicalPlan = prunedChild(c, allReferences)
    -
    -      Project(projectList, Join(pruneJoinChild(left), pruneJoinChild(right), joinType, condition))
    -
         // Eliminate unneeded attributes from right side of a LeftSemiJoin.
    -    case Join(left, right, LeftSemi, condition) =>
    -      // Collect the list of all references required to evaluate the condition.
    -      val allReferences: AttributeSet =
    -        condition.map(_.references).getOrElse(AttributeSet(Seq.empty))
    -
    -      Join(left, prunedChild(right, allReferences), LeftSemi, condition)
    -
    -    // Push down project through limit, so that we may have chance to push it further.
    -    case Project(projectList, Limit(exp, child)) =>
    -      Limit(exp, Project(projectList, child))
    +    case j @ Join(left, right, LeftSemi, condition) =>
    +      j.copy(right = prunedChild(right, j.references))
     
    -    // Push down project if possible when the child is sort.
    -    case p @ Project(projectList, s @ Sort(_, _, grandChild)) =>
    -      if (s.references.subsetOf(p.outputSet)) {
    -        s.copy(child = Project(projectList, grandChild))
    +    // Eliminate no-op Projects
    +    case p @ Project(projectList, child) if child.outputSet == p.outputSet => child
    +
    +    // all the columns will be used to compare, so we can't prune them
    +    case p @ Project(_, _: SetOperation) => p
    +    case p @ Project(_, _: Distinct) => p
    +
    +    // Can't prune the columns on LeafNode
    +    case p @ Project(_, l: LeafNode) => p
    +
    +    // for all other logical plans that inherits the output from it's children
    +    case p @ Project(_, child) =>
    +      val allAttributes = child.children.flatMap(_.outputSet).toSet
    +      val required = child.references ++ p.references
    +      if ((allAttributes -- required).nonEmpty) {
    +        val newChildren = child.children.map(c => prunedChild(c, required))
    +        p.copy(child = child.withNewChildren(newChildren))
           } else {
    -        val neededReferences = s.references ++ p.references
    -        if (neededReferences == grandChild.outputSet) {
    -          // No column we can prune, return the original plan.
    -          p
    -        } else {
    -          // Do not use neededReferences.toSeq directly, should respect grandChild's output order.
    -          val newProjectList = grandChild.output.filter(neededReferences.contains)
    -          p.copy(child = s.copy(child = Project(newProjectList, grandChild)))
    -        }
    +        p
           }
    -
    -    // Eliminate no-op Projects
    -    case Project(projectList, child) if child.output == projectList => child
       }
     
       /** Applies a projection only when the child is producing unnecessary attributes */
       private def prunedChild(c: LogicalPlan, allReferences: AttributeSet) =
         if ((c.outputSet -- allReferences.filter(c.outputSet.contains)).nonEmpty) {
    -      Project(allReferences.filter(c.outputSet.contains).toSeq, c)
    +      val proj = allReferences.filter(c.outputSet.contains).toSeq.sortBy(_.name)
    --- End diff --
    
    why do we need to sort it?



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-187500897
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-187413358
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51674/
    Test FAILed.



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-187500901
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51709/
    Test FAILed.



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11256#discussion_r53561554
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -300,97 +300,71 @@ object SetOperationPushDown extends Rule[LogicalPlan] with PredicateHelper {
      */
     object ColumnPruning extends Rule[LogicalPlan] {
       def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    -    case a @ Aggregate(_, _, e @ Expand(projects, output, child))
    -      if (e.outputSet -- a.references).nonEmpty =>
    -      val newOutput = output.filter(a.references.contains(_))
    -      val newProjects = projects.map { proj =>
    -        proj.zip(output).filter { case (e, a) =>
    +    // Prunes the unused columns from project list of Project/Aggregate/Window/Expand
    +    case p @ Project(_, p2: Project) if (p2.outputSet -- p.references).nonEmpty =>
    --- End diff --
    
    how would this rule interact with `CollapseProject`?



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11256#discussion_r53561724
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -300,97 +300,71 @@ object SetOperationPushDown extends Rule[LogicalPlan] with PredicateHelper {
      */
     object ColumnPruning extends Rule[LogicalPlan] {
       def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    -    case a @ Aggregate(_, _, e @ Expand(projects, output, child))
    -      if (e.outputSet -- a.references).nonEmpty =>
    -      val newOutput = output.filter(a.references.contains(_))
    -      val newProjects = projects.map { proj =>
    -        proj.zip(output).filter { case (e, a) =>
    +    // Prunes the unused columns from project list of Project/Aggregate/Window/Expand
    +    case p @ Project(_, p2: Project) if (p2.outputSet -- p.references).nonEmpty =>
    +      p.copy(child = p2.copy(projectList = p2.projectList.filter(p.references.contains)))
    +    case p @ Project(_, a: Aggregate) if (a.outputSet -- p.references).nonEmpty =>
    +      p.copy(
    +        child = a.copy(aggregateExpressions = a.aggregateExpressions.filter(p.references.contains)))
    +    case p @ Project(_, w: Window) if (w.outputSet -- p.references).nonEmpty =>
    +      p.copy(child = w.copy(
    +        projectList = w.projectList.filter(p.references.contains),
    +        windowExpressions = w.windowExpressions.filter(p.references.contains)))
    +    case a @ Project(_, e @ Expand(_, _, grandChild)) if (e.outputSet -- a.references).nonEmpty =>
    +      val newOutput = e.output.filter(a.references.contains(_))
    +      val newProjects = e.projections.map { proj =>
    +        proj.zip(e.output).filter { case (e, a) =>
               newOutput.contains(a)
             }.unzip._1
           }
    -      a.copy(child = Expand(newProjects, newOutput, child))
    +      a.copy(child = Expand(newProjects, newOutput, grandChild))
    +    // TODO: support some logical plan for Dataset
     
    -    case a @ Aggregate(_, _, e @ Expand(_, _, child))
    -      if (child.outputSet -- e.references -- a.references).nonEmpty =>
    -      a.copy(child = e.copy(child = prunedChild(child, e.references ++ a.references)))
    -
    -    // Eliminate attributes that are not needed to calculate the specified aggregates.
    +    // Prunes the unused columns from child of Aggregate/Window/Expand/Generate
         case a @ Aggregate(_, _, child) if (child.outputSet -- a.references).nonEmpty =>
    -      a.copy(child = Project(a.references.toSeq, child))
    -
    -    // Eliminate attributes that are not needed to calculate the Generate.
    +      a.copy(child = prunedChild(child, a.references))
    +    case w @ Window(_, _, _, _, child) if (child.outputSet -- w.references).nonEmpty =>
    +      w.copy(child = prunedChild(child, w.references))
    +    case e @ Expand(_, _, child) if (child.outputSet -- e.references).nonEmpty =>
    +      e.copy(child = prunedChild(child, e.references))
         case g: Generate if !g.join && (g.child.outputSet -- g.references).nonEmpty =>
    -      g.copy(child = Project(g.references.toSeq, g.child))
    +      g.copy(child = prunedChild(g.child, g.references))
     
    +    // Turn off `join` for Generate if no column from it's child is used
         case p @ Project(_, g: Generate) if g.join && p.references.subsetOf(g.generatedSet) =>
           p.copy(child = g.copy(join = false))
     
    -    case p @ Project(projectList, g: Generate) if g.join =>
    -      val neededChildOutput = p.references -- g.generatorOutput ++ g.references
    -      if (neededChildOutput == g.child.outputSet) {
    -        p
    -      } else {
    -        Project(projectList, g.copy(child = Project(neededChildOutput.toSeq, g.child)))
    -      }
    -
    -    case p @ Project(projectList, a @ Aggregate(groupingExpressions, aggregateExpressions, child))
    -        if (a.outputSet -- p.references).nonEmpty =>
    -      Project(
    -        projectList,
    -        Aggregate(
    -          groupingExpressions,
    -          aggregateExpressions.filter(e => p.references.contains(e)),
    -          child))
    -
    -    // Eliminate unneeded attributes from either side of a Join.
    -    case Project(projectList, Join(left, right, joinType, condition)) =>
    -      // Collect the list of all references required either above or to evaluate the condition.
    -      val allReferences: AttributeSet =
    -        AttributeSet(
    -          projectList.flatMap(_.references.iterator)) ++
    -          condition.map(_.references).getOrElse(AttributeSet(Seq.empty))
    -
    -      /** Applies a projection only when the child is producing unnecessary attributes */
    -      def pruneJoinChild(c: LogicalPlan): LogicalPlan = prunedChild(c, allReferences)
    -
    -      Project(projectList, Join(pruneJoinChild(left), pruneJoinChild(right), joinType, condition))
    -
         // Eliminate unneeded attributes from right side of a LeftSemiJoin.
    -    case Join(left, right, LeftSemi, condition) =>
    -      // Collect the list of all references required to evaluate the condition.
    -      val allReferences: AttributeSet =
    -        condition.map(_.references).getOrElse(AttributeSet(Seq.empty))
    -
    -      Join(left, prunedChild(right, allReferences), LeftSemi, condition)
    -
    -    // Push down project through limit, so that we may have chance to push it further.
    -    case Project(projectList, Limit(exp, child)) =>
    -      Limit(exp, Project(projectList, child))
    +    case j @ Join(left, right, LeftSemi, condition) =>
    +      j.copy(right = prunedChild(right, j.references))
     
    -    // Push down project if possible when the child is sort.
    -    case p @ Project(projectList, s @ Sort(_, _, grandChild)) =>
    -      if (s.references.subsetOf(p.outputSet)) {
    -        s.copy(child = Project(projectList, grandChild))
    +    // Eliminate no-op Projects
    +    case p @ Project(projectList, child) if child.outputSet == p.outputSet => child
    +
    +    // all the columns will be used to compare, so we can't prune them
    +    case p @ Project(_, _: SetOperation) => p
    +    case p @ Project(_, _: Distinct) => p
    +
    +    // Can't prune the columns on LeafNode
    +    case p @ Project(_, l: LeafNode) => p
    +
    +    // for all other logical plans that inherits the output from it's children
    +    case p @ Project(_, child) =>
    +      val allAttributes = child.children.flatMap(_.outputSet).toSet
    +      val required = child.references ++ p.references
    +      if ((allAttributes -- required).nonEmpty) {
    +        val newChildren = child.children.map(c => prunedChild(c, required))
    +        p.copy(child = child.withNewChildren(newChildren))
           } else {
    -        val neededReferences = s.references ++ p.references
    -        if (neededReferences == grandChild.outputSet) {
    -          // No column we can prune, return the original plan.
    -          p
    -        } else {
    -          // Do not use neededReferences.toSeq directly, should respect grandChild's output order.
    -          val newProjectList = grandChild.output.filter(neededReferences.contains)
    -          p.copy(child = s.copy(child = Project(newProjectList, grandChild)))
    -        }
    +        p
           }
    -
    -    // Eliminate no-op Projects
    -    case Project(projectList, child) if child.output == projectList => child
       }
     
       /** Applies a projection only when the child is producing unnecessary attributes */
       private def prunedChild(c: LogicalPlan, allReferences: AttributeSet) =
         if ((c.outputSet -- allReferences.filter(c.outputSet.contains)).nonEmpty) {
    -      Project(allReferences.filter(c.outputSet.contains).toSeq, c)
    +      val proj = allReferences.filter(c.outputSet.contains).toSeq.sortBy(_.name)
    --- End diff --
    
    If we don't sort, the plan checking will be flaky
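
A minimal sketch of the ordering concern, using plain Scala collections. `Attr`, `projectList`, and `SortedProjectionDemo` are hypothetical stand-ins introduced for illustration; they are not Catalyst's `AttributeSet` or the patch's `prunedChild`. The assumption is that a set gives no ordering guarantee, so a project list built from one is only stable for plan comparison if it is sorted by something stable such as the column name.

```scala
object SortedProjectionDemo {
  // Hypothetical stand-in for a Catalyst attribute (name plus expression id).
  final case class Attr(name: String, exprId: Long)

  // Keep only the referenced attributes that the child actually outputs,
  // then sort by name so the result does not depend on set iteration order.
  def projectList(references: Set[Attr], childOutput: Set[Attr]): Seq[Attr] =
    references.filter(childOutput.contains).toSeq.sortBy(_.name)

  def main(args: Array[String]): Unit = {
    val a = Attr("a", 3L)
    val b = Attr("b", 1L)
    val c = Attr("c", 2L)
    val childOutput = Set(a, b, c)

    // The same references, built in two different orders.
    val first = projectList(Set(c, a), childOutput)
    val second = projectList(Set(a, c), childOutput)

    // After sorting, both project lists are equal as sequences,
    // which is what a test comparing two plans relies on.
    assert(first == second)
    assert(first.map(_.name) == Seq("a", "c"))
    println(first.map(_.name).mkString(", "))
  }
}
```

Sorting by name is only one way to get a stable order; filtering the child's own output sequence by the required references would also give a deterministic list and would keep the child's column order.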



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11256#discussion_r53885102
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -313,97 +313,85 @@ object SetOperationPushDown extends Rule[LogicalPlan] with PredicateHelper {
      */
     object ColumnPruning extends Rule[LogicalPlan] {
       def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    -    case a @ Aggregate(_, _, e @ Expand(projects, output, child))
    -      if (e.outputSet -- a.references).nonEmpty =>
    -      val newOutput = output.filter(a.references.contains(_))
    -      val newProjects = projects.map { proj =>
    -        proj.zip(output).filter { case (e, a) =>
    +    // Prunes the unused columns from project list of Project/Aggregate/Window/Expand
    +    case p @ Project(_, p2: Project) if (p2.outputSet -- p.references).nonEmpty =>
    +      p.copy(child = p2.copy(projectList = p2.projectList.filter(p.references.contains)))
    +    case p @ Project(_, a: Aggregate) if (a.outputSet -- p.references).nonEmpty =>
    +      p.copy(
    +        child = a.copy(aggregateExpressions = a.aggregateExpressions.filter(p.references.contains)))
    +    case p @ Project(_, w: Window) if (w.outputSet -- p.references).nonEmpty =>
    +      p.copy(child = w.copy(
    +        projectList = w.projectList.filter(p.references.contains),
    +        windowExpressions = w.windowExpressions.filter(p.references.contains)))
    +    case a @ Project(_, e @ Expand(_, _, grandChild)) if (e.outputSet -- a.references).nonEmpty =>
    +      val newOutput = e.output.filter(a.references.contains(_))
    +      val newProjects = e.projections.map { proj =>
    +        proj.zip(e.output).filter { case (e, a) =>
               newOutput.contains(a)
             }.unzip._1
           }
    -      a.copy(child = Expand(newProjects, newOutput, child))
    +      a.copy(child = Expand(newProjects, newOutput, grandChild))
    +    // TODO: support some logical plan for Dataset
    --- End diff --
    
    https://issues.apache.org/jira/browse/SPARK-13463



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11256#discussion_r53850231
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala ---
    @@ -65,52 +64,6 @@ class FilterPushdownSuite extends PlanTest {
         comparePlans(optimized, correctAnswer)
       }
     
    -  test("column pruning for group") {
    --- End diff --
    
    These tests were moved to ColumnPruningSuite.



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-185984744
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-187476883
  
    **[Test build #51709 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51709/consoleFull)** for PR 11256 at commit [`face1c7`](https://github.com/apache/spark/commit/face1c7a0e14ce6ac2caba15f36f3e0fe41264d7).



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-187429269
  
    **[Test build #51696 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51696/consoleFull)** for PR 11256 at commit [`e31dec7`](https://github.com/apache/spark/commit/e31dec7d1ede3547862e78d9fc4959f773ccbb72).



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-187561925
  
    **[Test build #51728 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51728/consoleFull)** for PR 11256 at commit [`c1f155c`](https://github.com/apache/spark/commit/c1f155c163481c5866f46fa1be7781e269df3ea7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-13376] [SQL] improve column pruning

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11256#issuecomment-185979442
  
    **[Test build #51503 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51503/consoleFull)** for PR 11256 at commit [`1145f19`](https://github.com/apache/spark/commit/1145f1975481fc5dadc0efe876a7b746ce26cdc1).

