You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by heary-cao <gi...@git.apache.org> on 2018/02/08 08:06:20 UTC

[GitHub] spark pull request #20541: [SPARK-23356][SQL]Pushes Project to both sides of...

GitHub user heary-cao opened a pull request:

    https://github.com/apache/spark/pull/20541

    [SPARK-23356][SQL]Pushes Project to both sides of Union when expression is non-deterministic

    ## What changes were proposed in this pull request?
    
    Currently, PushProjectionThroughUnion optimizer only supports pushdown project operator to both sides of a Union operator when expression is deterministic , in fact, we can be like pushdown filters, also support pushdown project operator to both sides of a Union operator when expression is non-deterministic , this PR description fix this problem。now the explain looks like:
    
    ```
    === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion ===
    Input LogicalPlan:
    Project [a#0, rand(10) AS rnd#9]
    +- Union
       :- LocalRelation <empty>, [a#0, b#1, c#2]
       :- LocalRelation <empty>, [d#3, e#4, f#5]
       +- LocalRelation <empty>, [g#6, h#7, i#8]
    
    Output LogicalPlan:
    Project [a#0, rand(10) AS rnd#9]
    +- Union
       :- Project [a#0]
       :  +- LocalRelation <empty>, [a#0, b#1, c#2]
       :- Project [d#3]
       :  +- LocalRelation <empty>, [d#3, e#4, f#5]
       +- Project [g#6]
          +- LocalRelation <empty>, [g#6, h#7, i#8]
    ```
    
    ## How was this patch tested?
    
    add new test cases

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/heary-cao/spark PushProjectionThroughUnion

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20541.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20541
    
----
commit 36dbc9c543f36dc5952a89c354bd70067ddd6883
Author: caoxuewen <ca...@...>
Date:   2018-02-08T08:02:17Z

    Pushes Project to both sides of Union when expression is non-deterministic

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87210/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    **[Test build #87210 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87210/testReport)** for PR 20541 at commit [`36dbc9c`](https://github.com/apache/spark/commit/36dbc9c543f36dc5952a89c354bd70067ddd6883).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    I'm confused about why we need `PushProjectionThroughUnion`. Generally we only need to push down required columns, not entire project list, as there is no benifit of doing this. I think we just need to handle `Union` in the `ColumnPruning` rule, but I may miss something. cc @gatorsmile 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20541: [SPARK-23356][SQL]Pushes Project to both sides of...

Posted by heary-cao <gi...@git.apache.org>.
Github user heary-cao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20541#discussion_r167113352
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -400,13 +400,24 @@ object PushProjectionThroughUnion extends Rule[LogicalPlan] with PredicateHelper
         // Push down deterministic projection through UNION ALL
         case p @ Project(projectList, Union(children)) =>
           assert(children.nonEmpty)
    -      if (projectList.forall(_.deterministic)) {
    -        val newFirstChild = Project(projectList, children.head)
    +      val (deterministicList, nonDeterministic) = projectList.partition(_.deterministic)
    +
    +      if (deterministicList.nonEmpty) {
    +        val newFirstChild = Project(deterministicList, children.head)
             val newOtherChildren = children.tail.map { child =>
               val rewrites = buildRewrites(children.head, child)
    -          Project(projectList.map(pushToRight(_, rewrites)), child)
    +          Project(deterministicList.map(pushToRight(_, rewrites)), child)
    --- End diff --
    
    if we push a + 1 to union, just a + 1.
    seems this lack of test cases. let's add this test cases. thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by heary-cao <gi...@git.apache.org>.
Github user heary-cao commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    oh, I see, I fallback to the modification of the non-deterministic expression, and to keep the newly added test cases for a+1 and a+b, can you?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    I don't agree. `a + 1`/`a + b` are evaluated the same number of time, no matter you push in through Union or not. I don't see any performance benefit by doing this, except you can eliminate the entire project above Union.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    **[Test build #87237 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87237/testReport)** for PR 20541 at commit [`4f5d46b`](https://github.com/apache/spark/commit/4f5d46baca612caaa882cbabb3b35665e9c7ed8b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by heary-cao <gi...@git.apache.org>.
Github user heary-cao commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    @gatorsmile ,@cloud-fan Can you help me to review it. thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20541: [SPARK-23356][SQL]Pushes Project to both sides of...

Posted by heary-cao <gi...@git.apache.org>.
Github user heary-cao closed the pull request at:

    https://github.com/apache/spark/pull/20541


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by heary-cao <gi...@git.apache.org>.
Github user heary-cao commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    @gatorsmile, OK, I will do it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by heary-cao <gi...@git.apache.org>.
Github user heary-cao commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    oh ,yeah, there is a little difference, for a + 1 and a + b.
    **for a + 1**:
    ```
    `PushProjectionThroughUnion `rule handles:
    Union
    :- Project [(a#0 + 1) AS aa#10]
    :  +- LocalRelation <empty>, [a#0, b#1, c#2]
    :- Project [(d#3 + 1) AS aa#11]
    :  +- LocalRelation <empty>, [d#3, e#4, f#5]
    +- Project [(g#6 + 1) AS aa#12]
       +- LocalRelation <empty>, [g#6, h#7, i#8]
    
    `ColumnPruning `rule handles:
    Project [(a#0 + 1) AS aa#9]
    Union
    :- Project [a#0]
    :  +- LocalRelation <empty>, [a#0, b#1, c#2]
    :- Project [d#3]
    :  +- LocalRelation <empty>, [d#3, e#4, f#5]
    +- Project [g#6]
       +- LocalRelation <empty>, [g#6, h#7, i#8]
    ```
    	  
    **for a + b**:
    ```
    `PushProjectionThroughUnion `rule handles:
    Union
    :- Project [(a#0 + b#1) AS ab#10]
    :  +- LocalRelation <empty>, [a#0, b#1, c#2]
    :- Project [(d#3 + e#4) AS ab#11]
    :  +- LocalRelation <empty>, [d#3, e#4, f#5]
    +- Project [(g#6 + h#7) AS ab#12]
       +- LocalRelation <empty>, [g#6, h#7, i#8]	
    
    `ColumnPruning `rule handles:
    Project [(a#0 + b#1) AS ab#9]
    Union
    :- Project [a#0, b#1]
    :  +- LocalRelation <empty>, [a#0, b#1, c#2]
    :- Project [d#3, e#4]
    :  +- LocalRelation <empty>, [d#3, e#4, f#5]
    +- Project [g#6, h#7]
       +- LocalRelation <empty>, [g#6, h#7, i#8]
    ```
    	  
    So I think this may be the reason for the need to add the pushprojectionthroughunion rules. and to non-deterministic expression.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20541: [SPARK-23356][SQL]Pushes Project to both sides of...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20541#discussion_r166870474
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -400,13 +400,24 @@ object PushProjectionThroughUnion extends Rule[LogicalPlan] with PredicateHelper
         // Push down deterministic projection through UNION ALL
         case p @ Project(projectList, Union(children)) =>
           assert(children.nonEmpty)
    -      if (projectList.forall(_.deterministic)) {
    -        val newFirstChild = Project(projectList, children.head)
    +      val (deterministicList, nonDeterministic) = projectList.partition(_.deterministic)
    +
    +      if (deterministicList.nonEmpty) {
    +        val newFirstChild = Project(deterministicList, children.head)
             val newOtherChildren = children.tail.map { child =>
               val rewrites = buildRewrites(children.head, child)
    -          Project(projectList.map(pushToRight(_, rewrites)), child)
    +          Project(deterministicList.map(pushToRight(_, rewrites)), child)
    --- End diff --
    
    do we push `a + 1` to union children? or just `a`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    `ColumnPruning` rule handles `Union` already.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    I think the use case is, by pushing projects into Union, we are more likely to combine adjacent Unions. So I don't think we need to improve it to push part of the project list and still leave a project above Union.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    **[Test build #87210 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87210/testReport)** for PR 20541 at commit [`36dbc9c`](https://github.com/apache/spark/commit/36dbc9c543f36dc5952a89c354bd70067ddd6883).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87237/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by heary-cao <gi...@git.apache.org>.
Github user heary-cao commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    in my opinion, this is considered that PushProjectionThroughUnion optimizes rules when there are multiple columns of union in data sources, while projection requires only a few columns, and the performance of file operation is better.  thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    **[Test build #87237 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87237/testReport)** for PR 20541 at commit [`4f5d46b`](https://github.com/apache/spark/commit/4f5d46baca612caaa882cbabb3b35665e9c7ed8b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20541
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org