You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by mgaido91 <gi...@git.apache.org> on 2018/09/21 13:26:09 UTC

[GitHub] spark pull request #22518: [SPARK-25482][SQL] ReuseSubquery can be useless w...

GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/22518

    [SPARK-25482][SQL] ReuseSubquery can be useless when the subqueries have the same id

    ## What changes were proposed in this pull request?
    
    When a `ExecSubqueryExpression` is copied, it may happen that there are two instances with the same `exprId`. This can happen for instance when a filter containing a scalar subquery is pushed to a DataSource. In this scenario, `ReuseSubquery` becomes useless, as replacing the `SubqueryExec` is ignored since the new plan is equal to the previous one.
    
    The PR avoids this problem by creating a new `ExprId` for the subquery expression when it is changed in `ReuseSubquery`, so that the changes are not ignored.
    
    ## How was this patch tested?
    
    added UT

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-25482

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22518.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22518
    
----
commit 36fa664c6d251901270984115ff2ebfd1b665fca
Author: Marco Gaido <ma...@...>
Date:   2018-09-21T12:50:11Z

    [SPARK-25482][SQL] ReuseSubquery can be effectless when the subqueries have the same id

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    **[Test build #96472 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96472/testReport)** for PR 22518 at commit [`ec72458`](https://github.com/apache/spark/commit/ec724581d4bee264c83dc0824fe6d8a7be43e2ef).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    **[Test build #96472 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96472/testReport)** for PR 22518 at commit [`ec72458`](https://github.com/apache/spark/commit/ec724581d4bee264c83dc0824fe6d8a7be43e2ef).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries t...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r233032650
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala ---
    @@ -22,7 +22,7 @@ import scala.collection.mutable.ArrayBuffer
     
     import org.apache.spark.sql.SparkSession
     import org.apache.spark.sql.catalyst.{expressions, InternalRow}
    -import org.apache.spark.sql.catalyst.expressions.{Expression, ExprId, InSet, Literal, PlanExpression}
    +import org.apache.spark.sql.catalyst.expressions.{Expression, ExprId, InSet, Literal, NamedExpression, PlanExpression}
    --- End diff --
    
    unnecessary change


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] ReuseSubquery can be useless w...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r232720903
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
    @@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
           assert(getNumSortsInQuery(query5) == 1)
         }
       }
    +
    +  test("SPARK-25482: Reuse same Subquery in order to execute it only once") {
    +    withTempView("t1", "t2") {
    +      sql("create temporary view t1(a int) using parquet")
    +      sql("create temporary view t2(b int) using parquet")
    +      val plan = sql("select * from t2 where b > (select max(a) from t1)")
    --- End diff --
    
    I think you are right about it, but it also means the data source scan must wait until the subquery is finished. We need to make tradeoffs carefully.
    
    I'd suggest we open a new ticket about scalar subquery filter pushdown to data source, and forbid it here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries t...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r232737458
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
    @@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
           assert(getNumSortsInQuery(query5) == 1)
         }
       }
    +
    +  test("SPARK-25482: Reuse same Subquery in order to execute it only once") {
    +    withTempView("t1", "t2") {
    +      sql("create temporary view t1(a int) using parquet")
    +      sql("create temporary view t2(b int) using parquet")
    +      val plan = sql("select * from t2 where b > (select max(a) from t1)")
    --- End diff --
    
    ah sorry I misread the code. Unless the subquery is rewritten into join, we must wait for all subqueries to be finished before executing the plan.
    
    We can rewrite scalar subquery in data source filters into literal, to make it work with the filter pushdown API.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] ReuseSubquery can be useless w...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r232688360
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
    @@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
           assert(getNumSortsInQuery(query5) == 1)
         }
       }
    +
    +  test("SPARK-25482: Reuse same Subquery in order to execute it only once") {
    +    withTempView("t1", "t2") {
    +      sql("create temporary view t1(a int) using parquet")
    +      sql("create temporary view t2(b int) using parquet")
    +      val plan = sql("select * from t2 where b > (select max(a) from t1)")
    --- End diff --
    
    Spark pushing filters doesn't mean the data source can always handle them and give perf improvement.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    **[Test build #98734 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98734/testReport)** for PR 22518 at commit [`ef0a953`](https://github.com/apache/spark/commit/ef0a953f0c3fb6f5ba50e51668a3f0b6938b5416).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    **[Test build #98681 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98681/testReport)** for PR 22518 at commit [`144091f`](https://github.com/apache/spark/commit/144091fed2a05d0d0a9d8ed3d592be61473d814b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98681/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    **[Test build #98777 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98777/testReport)** for PR 22518 at commit [`b414572`](https://github.com/apache/spark/commit/b4145721a30f83563dca264f838e042b2741d645).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    cc @cloud-fan @dongjoon-hyun @gatorsmile 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    BTW can you include a simple benchmark to show this problem? e.g. just run a query in spark-shell, and post the result before and after this PR.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries t...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r233035869
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala ---
    @@ -22,7 +22,7 @@ import scala.collection.mutable.ArrayBuffer
     
     import org.apache.spark.sql.SparkSession
     import org.apache.spark.sql.catalyst.{expressions, InternalRow}
    -import org.apache.spark.sql.catalyst.expressions.{Expression, ExprId, InSet, Literal, PlanExpression}
    +import org.apache.spark.sql.catalyst.expressions.{Expression, ExprId, InSet, Literal, NamedExpression, PlanExpression}
    --- End diff --
    
    thanks sorry, I missed it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] ReuseSubquery can be useless w...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r232704954
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
    @@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
           assert(getNumSortsInQuery(query5) == 1)
         }
       }
    +
    +  test("SPARK-25482: Reuse same Subquery in order to execute it only once") {
    +    withTempView("t1", "t2") {
    +      sql("create temporary view t1(a int) using parquet")
    +      sql("create temporary view t2(b int) using parquet")
    +      val plan = sql("select * from t2 where b > (select max(a) from t1)")
    --- End diff --
    
    I see your point. I'd argue that scalar subqueries like this may be used for Filter pushdown: since the subquery is evaluated before the plan execution, their result _may_ be used in the filter push down. With the current codebase you're 100% right and I can follow your comment/suggestion. But wouldn't it be worth to let them be used (and do that in another PR)?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries t...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r232906743
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
    @@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
           assert(getNumSortsInQuery(query5) == 1)
         }
       }
    +
    +  test("SPARK-25482: Reuse same Subquery in order to execute it only once") {
    --- End diff --
    
    let's update the test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    **[Test build #98779 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98779/testReport)** for PR 22518 at commit [`52ae956`](https://github.com/apache/spark/commit/52ae9561e58d65f2c26a112ce78a78994e83f868).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] ReuseSubquery can be useless w...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r232687865
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
    @@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
           assert(getNumSortsInQuery(query5) == 1)
         }
       }
    +
    +  test("SPARK-25482: Reuse same Subquery in order to execute it only once") {
    +    withTempView("t1", "t2") {
    +      sql("create temporary view t1(a int) using parquet")
    +      sql("create temporary view t2(b int) using parquet")
    +      val plan = sql("select * from t2 where b > (select max(a) from t1)")
    --- End diff --
    
    is there any data source can support subquery filter? for data source v1/v2, the public `Filter` API does not support subquery. For file source, they don't support subquery filter either.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4986/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] ReuseSubquery can be useless w...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r232725686
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
    @@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
           assert(getNumSortsInQuery(query5) == 1)
         }
       }
    +
    +  test("SPARK-25482: Reuse same Subquery in order to execute it only once") {
    +    withTempView("t1", "t2") {
    +      sql("create temporary view t1(a int) using parquet")
    +      sql("create temporary view t2(b int) using parquet")
    +      val plan = sql("select * from t2 where b > (select max(a) from t1)")
    --- End diff --
    
    >  it also means the data source scan must wait until the subquery is finished
    
    The subquery should be executed anyway sooner or later, right? So I don't see the problem here: am I missing something?
    
    Ok, thanks, I'll follow your suggestion and forbid it here and create a new ticket about pushing it down to data sources. Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    > hmm, how can this happen?
    
    you can check the UT which reproduces the issue. The scalar subquery is pushed down as part of the filter `GreaterThen`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    **[Test build #98775 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98775/testReport)** for PR 22518 at commit [`da3843e`](https://github.com/apache/spark/commit/da3843ed4711fb0ea6103e91d08285594dba5696).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    **[Test build #98681 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98681/testReport)** for PR 22518 at commit [`144091f`](https://github.com/apache/spark/commit/144091fed2a05d0d0a9d8ed3d592be61473d814b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    @cloud-fan this is the benchmark:
    ```
    (1 to 1000000).toSeq.toDF("a").write.save("/tmp/t1")
    spark.read.load("/tmp/t1").createTempView("t1")
    (1 to 2000).toSeq.toDF("b").write.save("/tmp/t2")
    spark.read.load("/tmp/t2").createTempView("t2")
    val plan = sql("select * from t2 where b > (select avg(a + 1) from t1)")
    val t0 = System.nanoTime()
    plan.show
    val t1 = System.nanoTime()
    println("Elapsed time: " + (t1 - t0) + "ns")
    ```
    
    the result is:
    
    ```
    Before PR: Elapsed time: 862499689ns
    After  PR: Elapsed time: 914728641ns
    ```
    The difference is very small because all the subqueries run in parallel. The execution time would be much more affected if there are several subqueries (our thread pool has 16 threads so a query like that but with 9 filters with subqueries would see a big performance gain after this PR).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    **[Test build #98779 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98779/testReport)** for PR 22518 at commit [`52ae956`](https://github.com/apache/spark/commit/52ae9561e58d65f2c26a112ce78a78994e83f868).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4984/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    thanks, merging to master!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    **[Test build #98775 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98775/testReport)** for PR 22518 at commit [`da3843e`](https://github.com/apache/spark/commit/da3843ed4711fb0ea6103e91d08285594dba5696).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries t...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r232906707
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala ---
    @@ -47,7 +47,8 @@ private[sql] object PruneFileSourcePartitions extends Rule[LogicalPlan] {
               case a: AttributeReference =>
                 a.withName(logicalRelation.output.find(_.semanticEquals(a)).get.name)
             }
    -      }
    +      }.filterNot(SubqueryExpression.hasSubquery)
    --- End diff --
    
    ditto


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries t...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r233032721
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
    @@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
           assert(getNumSortsInQuery(query5) == 1)
         }
       }
    +
    +  test("SPARK-25482: Forbid pushdown to dattasources of filters containing subqueries") {
    --- End diff --
    
    `dattasources` typo


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] ReuseSubquery can be useless w...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r232673035
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
    @@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
           assert(getNumSortsInQuery(query5) == 1)
         }
       }
    +
    +  test("SPARK-25482: Reuse same Subquery in order to execute it only once") {
    +    withTempView("t1", "t2") {
    +      sql("create temporary view t1(a int) using parquet")
    +      sql("create temporary view t2(b int) using parquet")
    +      val plan = sql("select * from t2 where b > (select max(a) from t1)")
    --- End diff --
    
    we have this problem if we `copy` the same subquery. I can't think of any other case than filter push-down, but I may be wrong.
    
    Forbidding to push down these filter may cause a perf regression, I am not sure it is the right solution.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    @gengliangwang no, let me cite and explain the PR description. I am not sure how to improve it, but if you have suggestions I am happy to. The main point of the PR is to address an issue which arise when:
    
    > When a `ExecSubqueryExpression` is copied
    
    Now the point is, can this condition happen? The answer is yes, and one situation in which this happens (as reported in the JIRA) is
    
    > when a filter containing a scalar subquery is pushed to a DataSource.
    
    So in the plan we have two `ExecSubqueryExpression` each with a copy of the same `SubqueryExec`. The problem which arises in this condition is that:
    
    > `ReuseSubquery` becomes useless, as replacing the `SubqueryExec` is ignored since the new plan is equal to the previous one.
    
    So this result in the subquery being executed twice (as the two `SubqueryExec` are distinct, despite they are the same).
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98778/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    I'd like to merge this simple PR first, to address the performance problem (unnecessary subquery execution).
    
    Let's create a new ticket for subquery filter pushing to data source, and have more people to attend the discussion.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98775/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] ReuseSubquery can be useless w...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r232668569
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
    @@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
           assert(getNumSortsInQuery(query5) == 1)
         }
       }
    +
    +  test("SPARK-25482: Reuse same Subquery in order to execute it only once") {
    +    withTempView("t1", "t2") {
    +      sql("create temporary view t1(a int) using parquet")
    +      sql("create temporary view t2(b int) using parquet")
    +      val plan = sql("select * from t2 where b > (select max(a) from t1)")
    --- End diff --
    
    Do we only have a problem when we have subquery in data source filter? If that's the case, I would suggest not pushdown subquery filter into data source.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    @mgaido91 I see, thanks for the explanation!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] ReuseSubquery can be useless w...

Posted by peter-toth <gi...@git.apache.org>.

Github user peter-toth commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r219617722
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala ---
    @@ -166,7 +168,7 @@ case class ReuseSubquery(conf: SQLConf) extends Rule[SparkPlan] {
             val sameSchema = subqueries.getOrElseUpdate(sub.plan.schema, ArrayBuffer[SubqueryExec]())
             val sameResult = sameSchema.find(_.sameResult(sub.plan))
             if (sameResult.isDefined) {
    -          sub.withNewPlan(sameResult.get)
    +          sub.withNewPlan(sameResult.get).withNewExprId()
    --- End diff --
    
    Can we avoid double copy()? Or is it cleaner this way?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    **[Test build #98778 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98778/testReport)** for PR 22518 at commit [`56ed812`](https://github.com/apache/spark/commit/56ed8129d0fa045c1a28914182d79cb9fa9d6103).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries t...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r232729788
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
    @@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
           assert(getNumSortsInQuery(query5) == 1)
         }
       }
    +
    +  test("SPARK-25482: Reuse same Subquery in order to execute it only once") {
    +    withTempView("t1", "t2") {
    +      sql("create temporary view t1(a int) using parquet")
    +      sql("create temporary view t2(b int) using parquet")
    +      val plan = sql("select * from t2 where b > (select max(a) from t1)")
    --- End diff --
    
    > The subquery should be executed anyway sooner or later, right?
    
    Yes, but we could execute scan and subquery at the same time (2 spark jobs running together), instead of executing them serialized.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    **[Test build #98734 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98734/testReport)** for PR 22518 at commit [`ef0a953`](https://github.com/apache/spark/commit/ef0a953f0c3fb6f5ba50e51668a3f0b6938b5416).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96427/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    **[Test build #96427 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96427/testReport)** for PR 22518 at commit [`36fa664`](https://github.com/apache/spark/commit/36fa664c6d251901270984115ff2ebfd1b665fca).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98734/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries t...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r232906652
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -155,15 +155,14 @@ object FileSourceStrategy extends Strategy with Logging {
               case a: AttributeReference =>
                 a.withName(l.output.find(_.semanticEquals(a)).get.name)
             }
    -      }
    +      }.filterNot(SubqueryExpression.hasSubquery)
    --- End diff --
    
    shall we do the filter before the `map`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98779/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] ReuseSubquery can be useless w...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r232558384
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
    @@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
           assert(getNumSortsInQuery(query5) == 1)
         }
       }
    +
    +  test("SPARK-25482: Reuse same Subquery in order to execute it only once") {
    +    withTempView("t1", "t2") {
    +      sql("create temporary view t1(a int) using parquet")
    +      sql("create temporary view t2(b int) using parquet")
    +      val plan = sql("select * from t2 where b > (select max(a) from t1)")
    --- End diff --
    
    sorry it has been a long time and I don't quite remember the context.
    
    What was the problem we are trying to fix? This test looks nothing related to subquery reuse.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries t...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22518


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4987/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3344/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] ReuseSubquery can be useless w...

Posted by peter-toth <gi...@git.apache.org>.

Github user peter-toth commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r219616464
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
    @@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
           assert(getNumSortsInQuery(query5) == 1)
         }
       }
    +
    +  test("SPARK-25482: Reuse same Subquery in order to execute it only once") {
    +    withTempView("t1", "t2", "t3") {
    --- End diff --
    
    There is no need for "t3".


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    **[Test build #96427 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96427/testReport)** for PR 22518 at commit [`36fa664`](https://github.com/apache/spark/commit/36fa664c6d251901270984115ff2ebfd1b665fca).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98777/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] ReuseSubquery can be useless w...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r219667342
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala ---
    @@ -166,7 +168,7 @@ case class ReuseSubquery(conf: SQLConf) extends Rule[SparkPlan] {
             val sameSchema = subqueries.getOrElseUpdate(sub.plan.schema, ArrayBuffer[SubqueryExec]())
             val sameResult = sameSchema.find(_.sameResult(sub.plan))
             if (sameResult.isDefined) {
    -          sub.withNewPlan(sameResult.get)
    +          sub.withNewPlan(sameResult.get).withNewExprId()
    --- End diff --
    
    I think it is cleaner this way. I don't expect this to happen very often (how many subquery can you have in a plan?) so I don't think it is an issue. But if there are cleaner options/solutions, I am open to suggestions, thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96428/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    **[Test build #96428 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96428/testReport)** for PR 22518 at commit [`7c75067`](https://github.com/apache/spark/commit/7c75067767ed6935960d09c7915da86fea3553fa).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    **[Test build #98778 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98778/testReport)** for PR 22518 at commit [`56ed812`](https://github.com/apache/spark/commit/56ed8129d0fa045c1a28914182d79cb9fa9d6103).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4955/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries t...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r232733022
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
    @@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
           assert(getNumSortsInQuery(query5) == 1)
         }
       }
    +
    +  test("SPARK-25482: Reuse same Subquery in order to execute it only once") {
    +    withTempView("t1", "t2") {
    +      sql("create temporary view t1(a int) using parquet")
    +      sql("create temporary view t2(b int) using parquet")
    +      val plan = sql("select * from t2 where b > (select max(a) from t1)")
    --- End diff --
    
    > we could execute scan and subquery at the same time (
    
    is this really possible? My understanding is that subqueries are executed before the plan they belong to (in `SparkPlan.executeQuery`). So my understanding is that when a subquery is running, the rest of the query is not.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4988/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    **[Test build #98777 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98777/testReport)** for PR 22518 at commit [`b414572`](https://github.com/apache/spark/commit/b4145721a30f83563dca264f838e042b2741d645).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] ReuseSubquery can be useless w...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r232580202
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
    @@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
           assert(getNumSortsInQuery(query5) == 1)
         }
       }
    +
    +  test("SPARK-25482: Reuse same Subquery in order to execute it only once") {
    +    withTempView("t1", "t2") {
    +      sql("create temporary view t1(a int) using parquet")
    +      sql("create temporary view t2(b int) using parquet")
    +      val plan = sql("select * from t2 where b > (select max(a) from t1)")
    --- End diff --
    
    Sure, please can you check the PR description? I think the context is quite well explained there.
    
    Anyway, as a quick summary: in this case `b > (select max(a) from t1)` is pushed down as a datasource filter. So we have 2 instances of `b > (select max(a) from t1)` and the result is not reused. It is not reused because the copied plan satisfies `==`, so even if `ReuseSubquery` replaces it, then the change is ignored.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4913/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    **[Test build #96428 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96428/testReport)** for PR 22518 at commit [`7c75067`](https://github.com/apache/spark/commit/7c75067767ed6935960d09c7915da86fea3553fa).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96472/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries t...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22518#discussion_r232754792
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
    @@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
           assert(getNumSortsInQuery(query5) == 1)
         }
       }
    +
    +  test("SPARK-25482: Reuse same Subquery in order to execute it only once") {
    +    withTempView("t1", "t2") {
    +      sql("create temporary view t1(a int) using parquet")
    +      sql("create temporary view t2(b int) using parquet")
    +      val plan = sql("select * from t2 where b > (select max(a) from t1)")
    --- End diff --
    
    yes, exactly. That's what I meant. Shall I revert the changes to the previous PR? And in the scope of the new JIRA do the rewriting to literals? Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3345/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] Avoid pushdown of subqueries to data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    > This can happen for instance when a filter containing a scalar subquery is pushed to a DataSource
    
    hmm, how can this happen? I don't think a data source can handle a filter of subquery...


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3379/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/22518
  
    @mgaido91 Sorry but can you have a more detailed explanation in the PR description? 
    With the code changes, the predicate with subquery result can be push down into data source. Is this the main point of the PR? And why is that creating a new expr ID can fix it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org