Posted to reviews@spark.apache.org by wangyum <gi...@git.apache.org> on 2017/11/12 16:23:12 UTC

[GitHub] spark pull request #19727: [WIP][SPARK-22497][SQL] Project reuse

GitHub user wangyum opened a pull request:

    https://github.com/apache/spark/pull/19727

    [WIP][SPARK-22497][SQL] Project reuse

    ## What changes were proposed in this pull request?
    
    The SQL below scans `table1` twice. This PR reuses `p1` so that `table1` is scanned only once.
    ```sql
    WITH p1 AS (SELECT * FROM table1 WHERE key < 100),
    s1 AS (SELECT key, count(*) FROM p1 GROUP BY key),
    s2 AS (SELECT key, count(*) FROM p1 WHERE key > -100 GROUP BY key)
    SELECT s1.* FROM s1 JOIN s2 ON s1.key = s2.key
    ```
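    
    As a rough illustration of the scan counts (not part of the patch), the sketch below models the query as a toy plan tree. The `Scan`, `Filter`, `Aggregate`, and `Join` classes and the `ScanCount` helper are hypothetical stand-ins, not Spark's classes, and "reuse" here assumes that the result of a structurally equal subtree can be computed once and shared.
    
    ```scala
    // Toy plan model. These are NOT Spark's classes, only illustrative stand-ins.
    sealed trait Plan { def children: Seq[Plan] }
    final case class Scan(table: String) extends Plan { def children: Seq[Plan] = Nil }
    final case class Filter(condition: String, child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }
    final case class Aggregate(groupBy: String, child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }
    final case class Join(condition: String, left: Plan, right: Plan) extends Plan {
      def children: Seq[Plan] = Seq(left, right)
    }

    object ScanCount {
      // Counts every Scan leaf in the tree, so a CTE inlined twice is scanned twice.
      def scansInlined(plan: Plan): Int = plan match {
        case _: Scan => 1
        case other   => other.children.map(scansInlined).sum
      }

      // Counts Scan leaves, but walks each structurally equal subtree only once,
      // assuming (hypothetically) that its result is computed once and shared.
      def scansWithReuse(root: Plan): Int = {
        val seen = scala.collection.mutable.Set.empty[Plan]
        def go(plan: Plan): Int =
          if (!seen.add(plan)) 0 // subtree already visited: treat as reused
          else plan match {
            case _: Scan => 1
            case other   => other.children.map(go).sum
          }
        go(root)
      }
    }

    object Demo {
      // The query from the description, with p1 referenced by both s1 and s2.
      val p1: Plan = Filter("key < 100", Scan("table1"))
      val s1: Plan = Aggregate("key", p1)
      val s2: Plan = Aggregate("key", Filter("key > -100", p1))
      val query: Plan = Join("s1.key = s2.key", s1, s2)

      def main(args: Array[String]): Unit = {
        println(s"scans when p1 is inlined: ${ScanCount.scansInlined(query)}")   // 2
        println(s"scans when p1 is shared:  ${ScanCount.scansWithReuse(query)}") // 1
      }
    }
    ```
    
    In this toy model the shared `p1` subtree is walked once. Whether a real engine saves the scan depends on the shared result actually being materialized or cached, which this sketch does not model.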
    
    ## How was this patch tested?
    
    unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-22497

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19727.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19727
    
----
commit 1c458b8860b3b17f137db18eff9f97df81b47a76
Author: Yuming Wang <wg...@gmail.com>
Date:   2017-11-12T16:14:38Z

    Reuse project

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19727: [WIP][SPARK-22497][SQL] Project reuse

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19727
  
    Simply reusing `ProjectExec` doesn't really reduce the scan. Duplicated execution of CTEs is a well-known issue. I've addressed it before, but there seems to be no solution yet that handles all possible cases.


---



[GitHub] spark issue #19727: [WIP][SPARK-22497][SQL] Project reuse

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19727
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #19727: [WIP][SPARK-22497][SQL] Project reuse

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19727
  
    **[Test build #83744 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83744/testReport)** for PR 19727 at commit [`1c458b8`](https://github.com/apache/spark/commit/1c458b8860b3b17f137db18eff9f97df81b47a76).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class ReusedProjectExec(override val output: Seq[Attribute], child: ProjectExec)`
      * `case class ReuseProject(conf: SQLConf) extends Rule[SparkPlan]`


---



[GitHub] spark issue #19727: [WIP][SPARK-22497][SQL] Project reuse

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19727
  
    **[Test build #83744 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83744/testReport)** for PR 19727 at commit [`1c458b8`](https://github.com/apache/spark/commit/1c458b8860b3b17f137db18eff9f97df81b47a76).


---



[GitHub] spark issue #19727: [WIP][SPARK-22497][SQL] Project reuse

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19727
  
    CTE reuse can cause performance regressions. It is hard to address without taking costs into account.
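    
    To illustrate why cost matters, here is a minimal sketch of a made-up cost comparison. The `Costs` fields, the formulas, and the numbers are assumptions chosen for illustration, not Spark's cost model: inlining pays the compute cost once per consumer, while reuse pays it once plus a write and one read per consumer.
    
    ```scala
    object ReuseCost {
      // Abstract cost units; the model and its fields are hypothetical.
      final case class Costs(compute: Double, write: Double, read: Double)

      // Inlined CTE: every consumer recomputes the CTE body.
      def inlinedCost(c: Costs, consumers: Int): Double =
        consumers * c.compute

      // Reused CTE: compute once, materialize once, then each consumer reads it back.
      def reusedCost(c: Costs, consumers: Int): Double =
        c.compute + c.write + consumers * c.read

      def reuseIsCheaper(c: Costs, consumers: Int): Boolean =
        reusedCost(c, consumers) < inlinedCost(c, consumers)

      def main(args: Array[String]): Unit = {
        // Expensive body, cheap materialization: reuse wins (120.0 vs 200.0).
        println(reuseIsCheaper(Costs(compute = 100, write = 10, read = 5), consumers = 2))
        // Cheap body, expensive materialization: reuse loses (40.0 vs 20.0).
        println(reuseIsCheaper(Costs(compute = 10, write = 20, read = 5), consumers = 2))
      }
    }
    ```
    
    Under these assumed numbers the same rewrite helps one query and slows down another, which is the kind of regression a cost-blind rule cannot rule out.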


---



[GitHub] spark pull request #19727: [WIP][SPARK-22497][SQL] Project reuse

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum closed the pull request at:

    https://github.com/apache/spark/pull/19727


---



[GitHub] spark issue #19727: [WIP][SPARK-22497][SQL] Project reuse

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19727
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83744/


---
