You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/03/12 18:51:14 UTC

[GitHub] [spark] peter-toth opened a new pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

peter-toth opened a new pull request #31820:
URL: https://github.com/apache/spark/pull/31820


   ### What changes were proposed in this pull request?
   
   This PR adds canonicalization to `FileScan.partitionFilters` and `FileScan.dataFilters` in `BatchScanExec` nodes.
   
   ### Why are the changes needed?
   
   Partition filters and data filters added to `FileScan` (in https://github.com/apache/spark/pull/27112 and https://github.com/apache/spark/pull/27157) caused that canonicalized form of `BatchScanExec` nodes don't match and so they prevents some reuse possibilities.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Added new UT.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-800108155


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40687/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-800108411


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40687/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth closed pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

peter-toth closed pull request #31820:
URL: https://github.com/apache/spark/pull/31820


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #31820:
URL: https://github.com/apache/spark/pull/31820#discussion_r593831495



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala
##########
@@ -49,6 +49,13 @@ case class BatchScanExec(
   }
 
   override def doCanonicalize(): BatchScanExec = {
-    this.copy(output = output.map(QueryPlan.normalizeExpressions(_, output)))
+    val canonicalizedScan = scan match {
+      case s: FileScan =>
+        s.withFilters(QueryPlan.normalizePredicates(s.partitionFilters, output),
+          QueryPlan.normalizePredicates(s.dataFilters, output))
+      case _ => scan
+    }
+    this.copy(output = output.map(QueryPlan.normalizeExpressions(_, output)),
+      scan = canonicalizedScan)

Review comment:
       ```suggestion
       this.copy(
         output = output.map(QueryPlan.normalizeExpressions(_, output)),
         scan = canonicalizedScan)
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-800078659


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136105/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-797703980


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40602/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #31820:
URL: https://github.com/apache/spark/pull/31820#discussion_r593648104



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala
##########
@@ -86,7 +86,7 @@ trait FileScan extends Scan
 
   override def equals(obj: Any): Boolean = obj match {
     case f: FileScan =>
-      fileIndex == f.fileIndex && readSchema == f.readSchema
+      fileIndex == f.fileIndex && readSchema == f.readSchema &&

Review comment:
       Oh... nice catch. cc: @gengliangwang , too.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on a change in pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

peter-toth commented on a change in pull request #31820:
URL: https://github.com/apache/spark/pull/31820#discussion_r593720033



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala
##########
@@ -86,7 +86,7 @@ trait FileScan extends Scan
 
   override def equals(obj: Any): Boolean = obj match {
     case f: FileScan =>
-      fileIndex == f.fileIndex && readSchema == f.readSchema
+      fileIndex == f.fileIndex && readSchema == f.readSchema &&

Review comment:
       Credit goes to @bersprockets for catching this in SPARK-33482.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-797941864


   **[Test build #136026 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136026/testReport)** for PR 31820 at commit [`c0fb9b2`](https://github.com/apache/spark/commit/c0fb9b2cdca15e06ff2f708b7046a44cfc965da1).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-800102602


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40687/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #31820:
URL: https://github.com/apache/spark/pull/31820#discussion_r594983681



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala
##########
@@ -49,6 +49,15 @@ case class BatchScanExec(
   }
 
   override def doCanonicalize(): BatchScanExec = {
-    this.copy(output = output.map(QueryPlan.normalizeExpressions(_, output)))
+    val canonicalizedScan = scan match {
+      case s: FileScan =>
+        s.withFilters(
+          QueryPlan.normalizePredicates(s.partitionFilters, output),
+          QueryPlan.normalizePredicates(s.dataFilters, output))

Review comment:
       This works, but is a bit hacky as it doesn't apply to all the `Scan` implementations.
   
   I think we should add doc in the `Scan` interface to explain how the `hashCode/equals` should be implemented.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-797706327


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40602/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #31820:
URL: https://github.com/apache/spark/pull/31820#discussion_r593831464



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala
##########
@@ -49,6 +49,13 @@ case class BatchScanExec(
   }
 
   override def doCanonicalize(): BatchScanExec = {
-    this.copy(output = output.map(QueryPlan.normalizeExpressions(_, output)))
+    val canonicalizedScan = scan match {
+      case s: FileScan =>
+        s.withFilters(QueryPlan.normalizePredicates(s.partitionFilters, output),
+          QueryPlan.normalizePredicates(s.dataFilters, output))

Review comment:
       ```suggestion
           s.withFilters(
             QueryPlan.normalizePredicates(s.partitionFilters, output),
             QueryPlan.normalizePredicates(s.dataFilters, output))
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-802449898


   We can close this now?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on a change in pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

peter-toth commented on a change in pull request #31820:
URL: https://github.com/apache/spark/pull/31820#discussion_r593386168



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala
##########
@@ -86,7 +86,7 @@ trait FileScan extends Scan
 
   override def equals(obj: Any): Boolean = obj match {
     case f: FileScan =>
-      fileIndex == f.fileIndex && readSchema == f.readSchema
+      fileIndex == f.fileIndex && readSchema == f.readSchema &&

Review comment:
       This change is not related to this PR, but it looks like a `&&` is missing from here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] c21 commented on a change in pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

c21 commented on a change in pull request #31820:
URL: https://github.com/apache/spark/pull/31820#discussion_r593843014



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
##########
@@ -4065,6 +4066,29 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
       }
     }
   }
+
+  test("SPARK-33482: Fix FileScan canonicalization") {
+    withSQLConf(SQLConf.USE_V1_SOURCE_LIST.key -> "") {
+      withTempPath { path =>
+        spark.range(5).toDF().write.mode("overwrite").parquet(path.toString)
+        withTempView("t") {
+          spark.read.parquet(path.toString).createOrReplaceTempView("t")
+          val df = sql(
+            """
+              |SELECT *
+              |FROM t AS t1
+              |JOIN t AS t2 ON t2.id = t1.id
+              |JOIN t AS t3 ON t3.id = t2.id
+              |""".stripMargin)
+          df.collect()

Review comment:
       do we really need `df.collect()` here? shouldn't `AdaptiveSparkPlanHelper.collect()` below take care of going through query plan properly with AQE being enabled?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #31820:
URL: https://github.com/apache/spark/pull/31820#discussion_r593831123



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala
##########
@@ -86,7 +86,7 @@ trait FileScan extends Scan
 
   override def equals(obj: Any): Boolean = obj match {
     case f: FileScan =>
-      fileIndex == f.fileIndex && readSchema == f.readSchema
+      fileIndex == f.fileIndex && readSchema == f.readSchema &&

Review comment:
       wow ..




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-797712127


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136018/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-798869365






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-797941864


   **[Test build #136026 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136026/testReport)** for PR 31820 at commit [`c0fb9b2`](https://github.com/apache/spark/commit/c0fb9b2cdca15e06ff2f708b7046a44cfc965da1).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-799713591


   Also, cc @cloud-fan .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-798869365






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #31820:
URL: https://github.com/apache/spark/pull/31820#discussion_r593649326



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
##########
@@ -4065,6 +4066,33 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
       }
     }
   }
+
+  test("SPARK-33482: Fix FileScan canonicalization") {
+    Seq(true, false).foreach { aqe =>
+      withSQLConf(SQLConf.USE_V1_SOURCE_LIST.key -> "",
+        SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> aqe.toString) {

Review comment:
       We need to set this config for this test purpose?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-797712127


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136018/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

peter-toth commented on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-802169877


   Reviewers, I've moved the e2e test to https://github.com/apache/spark/pull/31848 and fixed the issue there by changing `FileScan.equals()`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-800078659


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136105/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on a change in pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

peter-toth commented on a change in pull request #31820:
URL: https://github.com/apache/spark/pull/31820#discussion_r593857501



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
##########
@@ -4065,6 +4066,29 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
       }
     }
   }
+
+  test("SPARK-33482: Fix FileScan canonicalization") {
+    withSQLConf(SQLConf.USE_V1_SOURCE_LIST.key -> "") {
+      withTempPath { path =>
+        spark.range(5).toDF().write.mode("overwrite").parquet(path.toString)
+        withTempView("t") {
+          spark.read.parquet(path.toString).createOrReplaceTempView("t")
+          val df = sql(
+            """
+              |SELECT *
+              |FROM t AS t1
+              |JOIN t AS t2 ON t2.id = t1.id
+              |JOIN t AS t3 ON t3.id = t2.id
+              |""".stripMargin)
+          df.collect()

Review comment:
       Yes, we do need to run the query first and then check the plan as this is an AQE compatible query where `ReusedExchangeExec` nodes are inserted during execution.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-798022838


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40610/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-800108411


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40687/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-798382359


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136026/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] c21 commented on a change in pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

c21 commented on a change in pull request #31820:
URL: https://github.com/apache/spark/pull/31820#discussion_r593402812



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala
##########
@@ -86,7 +86,7 @@ trait FileScan extends Scan
 
   override def equals(obj: Any): Boolean = obj match {
     case f: FileScan =>
-      fileIndex == f.fileIndex && readSchema == f.readSchema
+      fileIndex == f.fileIndex && readSchema == f.readSchema &&

Review comment:
       Nice catch. It seems that https://github.com/apache/spark/pull/27112 introduced this. 
    cc @dongjoon-hyun if we want to backport this to `3.1` and `3.0`.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
##########
@@ -4065,6 +4066,33 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
       }
     }
   }
+
+  test("SPARK-33482: Fix FileScan canonicalization") {
+    Seq(true, false).foreach { aqe =>
+      withSQLConf(SQLConf.USE_V1_SOURCE_LIST.key -> "",
+        SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> aqe.toString) {
+        withTempPath { path =>
+          spark.range(5).toDF().write.mode("overwrite").parquet(path.toString)
+          withTempView("t") {
+            spark.read.parquet(path.toString).createOrReplaceTempView("t")
+            val df = sql(
+              """
+                |SELECT *
+                |FROM t AS t1
+                |JOIN t AS t2 ON t2.id = t1.id
+                |JOIN t AS t3 ON t3.id = t2.id
+                |""".stripMargin)
+            df.collect()
+            df.explain()

Review comment:
       do we really need these two statements?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-797706327


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40602/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-798022838






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

peter-toth commented on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-800063668


   > This is a nice patch, @peter-toth . For a better traceability, please proceed `FileScan` change in a separate PR. It looks worth to have a new JIRA because it looks like a correctness issue. And, if possible, with a separate test case focus on `FileScan.equals`.
   
   All right, reverted in https://github.com/apache/spark/pull/31820/commits/483686d98e887c278974c0ef89bf32cbb22b580d and extracted to https://github.com/apache/spark/pull/31848


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-798303644


   **[Test build #136026 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136026/testReport)** for PR 31820 at commit [`c0fb9b2`](https://github.com/apache/spark/commit/c0fb9b2cdca15e06ff2f708b7046a44cfc965da1).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on a change in pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

peter-toth commented on a change in pull request #31820:
URL: https://github.com/apache/spark/pull/31820#discussion_r593720458



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
##########
@@ -4065,6 +4066,33 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
       }
     }
   }
+
+  test("SPARK-33482: Fix FileScan canonicalization") {
+    Seq(true, false).foreach { aqe =>
+      withSQLConf(SQLConf.USE_V1_SOURCE_LIST.key -> "",
+        SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> aqe.toString) {
+        withTempPath { path =>
+          spark.range(5).toDF().write.mode("overwrite").parquet(path.toString)
+          withTempView("t") {
+            spark.read.parquet(path.toString).createOrReplaceTempView("t")
+            val df = sql(
+              """
+                |SELECT *
+                |FROM t AS t1
+                |JOIN t AS t2 ON t2.id = t1.id
+                |JOIN t AS t3 ON t3.id = t2.id
+                |""".stripMargin)
+            df.collect()
+            df.explain()

Review comment:
       No. I left `explain()` there accidentally, removed in: https://github.com/apache/spark/pull/31820/commits/c0fb9b2cdca15e06ff2f708b7046a44cfc965da1




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31820:
URL: https://github.com/apache/spark/pull/31820#issuecomment-797706312


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40602/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on a change in pull request #31820: [SPARK-33482][SQL] Fix FileScan canonicalization

Posted by GitBox <gi...@apache.org>.

peter-toth commented on a change in pull request #31820:
URL: https://github.com/apache/spark/pull/31820#discussion_r593721144



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
##########
@@ -4065,6 +4066,33 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
       }
     }
   }
+
+  test("SPARK-33482: Fix FileScan canonicalization") {
+    Seq(true, false).foreach { aqe =>
+      withSQLConf(SQLConf.USE_V1_SOURCE_LIST.key -> "",
+        SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> aqe.toString) {

Review comment:
       Well, reuse exchange code is very different in AQE/non-AQE paths, but I think you are right as we just need to test the canonicalization fix so I've removed this in https://github.com/apache/spark/pull/31820/commits/c0fb9b2cdca15e06ff2f708b7046a44cfc965da1




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org