Posted to reviews@spark.apache.org by wangyum <gi...@git.apache.org> on 2017/11/12 16:23:12 UTC
[GitHub] spark pull request #19727: [WIP][SPARK-22497][SQL] Project reuse
GitHub user wangyum opened a pull request:
https://github.com/apache/spark/pull/19727
[WIP][SPARK-22497][SQL] Project reuse
## What changes were proposed in this pull request?
The SQL below scans `table1` twice. This PR reuses `p1` so that `table1` is scanned only once.
```sql
WITH p1 AS (SELECT * FROM table1 WHERE key < 100),
s1 AS (SELECT key, count(*) FROM p1 GROUP BY key),
s2 AS (SELECT key, count(*) FROM p1 WHERE key > -100 GROUP BY key)
SELECT s1.* FROM s1 JOIN s2 ON s1.key = s2.key
```
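To make the intent concrete, here is a minimal toy sketch in Python, not Spark code. It models the query above as a plan tree and counts how many `Scan` nodes would run with and without subtree reuse. The `Node` class, the operator names, and `count_scans` are hypothetical, and the model assumes that the output of a reused subtree is computed once and shared, which may not hold for a pipelined operator in a real engine.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical plan node; frozen so equal subtrees compare and hash equal."""
    op: str
    arg: str = ""
    children: Tuple["Node", ...] = ()


def count_scans(plan: Node, reuse: bool) -> int:
    """Count Scan nodes that would execute.

    With reuse=True, a subtree structurally equal to one already visited
    is treated as shared and contributes no additional scans.
    """
    seen = set()

    def visit(node: Node) -> int:
        if reuse and node in seen:
            return 0
        seen.add(node)
        own = 1 if node.op == "Scan" else 0
        return own + sum(visit(child) for child in node.children)

    return visit(plan)


def p1() -> Node:
    # The CTE body is expanded separately at each reference, as in the query above.
    return Node("Filter", "key < 100", (Node("Scan", "table1"),))


s1 = Node("Aggregate", "count(*) by key", (p1(),))
s2 = Node("Aggregate", "count(*) by key",
          (Node("Filter", "key > -100", (p1(),)),))
plan = Node("Join", "s1.key = s2.key", (s1, s2))

print(count_scans(plan, reuse=False))  # 2
print(count_scans(plan, reuse=True))   # 1
```

The sketch only detects duplicates by structural equality. It ignores optimizer rewrites such as merging the two filters in `s2`, which would make the two `p1` subtrees differ and defeat this kind of matching.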
## How was this patch tested?
unit tests
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wangyum/spark SPARK-22497
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19727.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19727
----
commit 1c458b8860b3b17f137db18eff9f97df81b47a76
Author: Yuming Wang <wg...@gmail.com>
Date: 2017-11-12T16:14:38Z
Reuse project
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19727: [WIP][SPARK-22497][SQL] Project reuse
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19727
Simply reusing `ProjectExec` doesn't really reduce the scan. The duplicated execution of CTEs is a well-known issue. I've addressed it before, but there seems to be no solution yet that handles all possible cases.
---
[GitHub] spark issue #19727: [WIP][SPARK-22497][SQL] Project reuse
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19727
Merged build finished. Test FAILed.
---
[GitHub] spark issue #19727: [WIP][SPARK-22497][SQL] Project reuse
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19727
**[Test build #83744 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83744/testReport)** for PR 19727 at commit [`1c458b8`](https://github.com/apache/spark/commit/1c458b8860b3b17f137db18eff9f97df81b47a76).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class ReusedProjectExec(override val output: Seq[Attribute], child: ProjectExec)`
  * `case class ReuseProject(conf: SQLConf) extends Rule[SparkPlan]`
---
[GitHub] spark issue #19727: [WIP][SPARK-22497][SQL] Project reuse
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19727
**[Test build #83744 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83744/testReport)** for PR 19727 at commit [`1c458b8`](https://github.com/apache/spark/commit/1c458b8860b3b17f137db18eff9f97df81b47a76).
---
[GitHub] spark issue #19727: [WIP][SPARK-22497][SQL] Project reuse
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/19727
CTE reuse can cause performance regressions. It is hard to address without taking costs into account.
---
[GitHub] spark pull request #19727: [WIP][SPARK-22497][SQL] Project reuse
Posted by wangyum <gi...@git.apache.org>.
Github user wangyum closed the pull request at:
https://github.com/apache/spark/pull/19727
---
[GitHub] spark issue #19727: [WIP][SPARK-22497][SQL] Project reuse
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19727
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83744/
---