Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/11 06:54:27 UTC

[GitHub] [spark] caican00 opened a new pull request, #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

caican00 opened a new pull request, #37479:
URL: https://github.com/apache/spark/pull/37479

   ### Why are the changes needed?
   ```
   SELECT id, data FROM testcat.ns1.ns2.table
   WHERE id = 2
   AND md5(data) = '8cde774d6f7333752ed72cacddb05126'
   AND trim(data) = 'a'
   ```
   For this SQL, we currently get the filters in the following order:
   
   ```
   // `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126) AND (trim(data#23, None) = a)` comes before `(id#22L = 2)`
   == Physical Plan ==
   *(1) Project [id#22L, data#23]
   +- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
      +- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan
   ```
   With this predicate order, every row is evaluated against the expensive `md5` and `trim` predicates even when it fails the cheap `id = 2` check, which may cause Spark tasks to execute slowly.
   
   So I think the untranslatable predicates, which must be evaluated after the scan, should automatically be placed on the far right, so that rows failing the cheaper predicates are never evaluated against them.
   
   As shown below:
   
   ```
   // `(id#22L = 2)` comes before `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126) AND (trim(data#23, None) = a)`
   == Physical Plan ==
   *(1) Project [id#22L, data#23]
   +- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (id#22L = 2)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a))
      +- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan
   ```
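   
   A quick way to reproduce this comparison (a hedged sketch, assuming an active SparkSession `spark` and the example table above):
   
   ```
   // Conjuncts in the Filter node are evaluated left to right, so the printed
   // condition shows which predicates run first on each row.
   val df = spark.sql(
     """SELECT id, data FROM testcat.ns1.ns2.table
       |WHERE id = 2
       |AND md5(data) = '8cde774d6f7333752ed72cacddb05126'
       |AND trim(data) = 'a'
       |""".stripMargin)
   df.explain()
   ```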
   
   ### How was this patch tested?
   1. Added a new test.
   2. Manually tested: the stage execution time for reading data dropped from 5min+ to 24s.
   
   




[GitHub] [spark] caican00 commented on a diff in pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
caican00 commented on code in PR #37479:
URL: https://github.com/apache/spark/pull/37479#discussion_r948773474


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##########
@@ -2549,6 +2551,47 @@ class DataSourceV2SQLSuite
     }
   }
 
+  test("Move the post-Scan Filters to the far right") {
+    val t1 = s"${catalogAndNamespace}table"
+    val sbq = "sbq"
+    withTable(t1) {
+      sql(s"CREATE TABLE $t1 (id bigint, data string) USING $v2Format")
+      sql(s"INSERT INTO $t1 VALUES (1, 'a'), (2, 'b'), (3, 'c')")
+
+      val filterBefore = spark.sql(
+        s"""
+           |SELECT id, data FROM $t1
+           |where md5(data) = '8cde774d6f7333752ed72cacddb05126'

Review Comment:
   > Can we use a Java UDF? I'm afraid we may support translating `md5` in the near future.
   
   @cloud-fan Okay, I will update this.







[GitHub] [spark] huaxingao commented on pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on PR #37479:
URL: https://github.com/apache/spark/pull/37479#issuecomment-1418391950

   @caican00 I have opened a new PR https://github.com/apache/spark/pull/39892. I don't have your GitHub account email to add you as a co-author. You can add yourself as a co-author to get the commit credit.




[GitHub] [spark] caican00 commented on pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
caican00 commented on PR #37479:
URL: https://github.com/apache/spark/pull/37479#issuecomment-1219160957

   > case r: SupportsPushDownV2Filters
   
   @huaxingao Thanks, I will update this.




[GitHub] [spark] shardulm94 commented on pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by "shardulm94 (via GitHub)" <gi...@apache.org>.
shardulm94 commented on PR #37479:
URL: https://github.com/apache/spark/pull/37479#issuecomment-1405528306

   Hey @caican00! Haven't seen an update from your side in the last few months. Are you still interested in contributing this patch to Spark?




[GitHub] [spark] AmplabJenkins commented on pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #37479:
URL: https://github.com/apache/spark/pull/37479#issuecomment-1212558027

   Can one of the admins verify this patch?




[GitHub] [spark] zinking commented on pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
zinking commented on PR #37479:
URL: https://github.com/apache/spark/pull/37479#issuecomment-1217364666

   > @zinking
   > 
   > > What about this case: `day(ts)=1` versus `col1 > 13`, where `ts` is a partition column while `col1` is not, and some v2 data source implementation is capable of pushing that `day` function down? What will happen?
   > 
   > Seems to me the current implementation can only push down filters of the form `attribute <cmp> literal`. Could you copy and paste a plan that pushes down `day(ts)=1`?
   
   Just wondering whether this simple patch might miss some weird edge cases; anyway, I don't have an example at hand yet. This is good for now.




[GitHub] [spark] huaxingao commented on pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on PR #37479:
URL: https://github.com/apache/spark/pull/37479#issuecomment-1411311731

   @caican00 If you don't have time for this anymore, is it OK with you if I take this over and finish it up? We have quite a few customers using DS v2, and it would be nice if this fix could be merged. Thanks!




[GitHub] [spark] caican00 commented on pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
caican00 commented on PR #37479:
URL: https://github.com/apache/spark/pull/37479#issuecomment-1217385775

   @cloud-fan
   Can you help verify this patch? Thanks




[GitHub] [spark] huaxingao commented on pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on PR #37479:
URL: https://github.com/apache/spark/pull/37479#issuecomment-1418414223

   I will close this PR for now @caican00 




[GitHub] [spark] huaxingao closed pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao closed pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates
URL: https://github.com/apache/spark/pull/37479




[GitHub] [spark] cloud-fan commented on a diff in pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37479:
URL: https://github.com/apache/spark/pull/37479#discussion_r948686704


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala:
##########
@@ -65,7 +65,7 @@ object PushDownUtils extends PredicateHelper {
         val postScanFilters = r.pushFilters(translatedFilters.toArray).map { filter =>
           DataSourceStrategy.rebuildExpressionFromFilter(filter, translatedFilterToExpr)
         }
-        (Left(r.pushedFilters()), (untranslatableExprs ++ postScanFilters).toSeq)
+        (Left(r.pushedFilters()), (postScanFilters ++ untranslatableExprs).toSeq)

Review Comment:
   I think this simple heuristic should be safe: it can't optimize all cases, but it won't make things worse.
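   
   A tiny plain-Scala illustration of that safety argument (a sketch, not Spark code; the predicates are made up): reordering deterministic conjuncts changes only how often each predicate runs, never which rows pass.
   
   ```
   val rows = (1L to 100L).map(i => (i, i.toString))
   val cheap  = (r: (Long, String)) => r._1 == 2L        // e.g. a translated filter like id = 2
   val costly = (r: (Long, String)) => r._2.length == 1  // stands in for an untranslatable expression
   
   val oldOrder = rows.filter(r => costly(r) && cheap(r)) // costly runs on all 100 rows
   val newOrder = rows.filter(r => cheap(r) && costly(r)) // costly runs on the single row with id == 2
   assert(oldOrder == newOrder)                           // same result set either way
   ```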





[GitHub] [spark] huaxingao commented on pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
huaxingao commented on PR #37479:
URL: https://github.com/apache/spark/pull/37479#issuecomment-1217325544

   @zinking 
   > What about this case: `day(ts)=1` versus `col1 > 13`, where `ts` is a partition column while `col1` is not, and some v2 data source implementation is capable of pushing that `day` function down? What will happen?
   
   Seems to me the current implementation can only push down filters of the form `attribute <cmp> literal`. Could you copy and paste a plan that pushes down `day(ts)=1`?




[GitHub] [spark] caican00 commented on pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
caican00 commented on PR #37479:
URL: https://github.com/apache/spark/pull/37479#issuecomment-1217384921

   > The point is clear and valid.
   > 
   > What about this case: `day(ts)=1` versus `col1 > 13`, where `ts` is a partition column while `col1` is not, and some v2 data source implementation is capable of pushing that `day` function down? What will happen?
   > 
   > Though I thought the fix would be more complicated.
   
   @zinking 
   I think currently only simple predicates can be pushed down. In your case, is it possible to rewrite `day(ts)=1 and col1 > 13` as `ts=20220801 and day(ts)=1 and col1 > 13`?
   Then the partition predicate on `ts` can be pushed down, and `day(ts)=1` can still be applied as a filter after the scan.
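   
   A hypothetical sketch of that rewrite (the table `t`, a DATE-typed `ts` partition column, and an active SparkSession `spark` are assumptions for illustration, not part of this PR):
   
   ```
   // Add an equivalent pushable predicate on the partition column so the scan
   // can prune partitions; `day(ts) = 1` is still evaluated after the scan.
   val rewritten = spark.sql(
     """SELECT * FROM t
       |WHERE ts = DATE'2022-08-01'
       |AND day(ts) = 1
       |AND col1 > 13
       |""".stripMargin)
   ```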
   




[GitHub] [spark] cloud-fan commented on pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #37479:
URL: https://github.com/apache/spark/pull/37479#issuecomment-1219072055

   cc @sigmod 




[GitHub] [spark] cloud-fan commented on a diff in pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37479:
URL: https://github.com/apache/spark/pull/37479#discussion_r951157596


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##########
@@ -2549,6 +2551,47 @@ class DataSourceV2SQLSuite
     }
   }
 
+  test("SPARK-40045: Move the post-Scan Filters to the far right") {
+    val t1 = s"${catalogAndNamespace}table"
+    withTable(t1) {
+      spark.udf.register("udfStrLen", (str: String) => str.length)
+      sql(s"CREATE TABLE $t1 (id bigint, data string) USING $v2Format")
+      sql(s"INSERT INTO $t1 VALUES (1, 'a'), (2, 'b'), (3, 'c')")
+
+      val filterBefore = spark.sql(
+        s"""
+           |SELECT id, data FROM $t1
+           |where udfStrLen(data) = 1
+           |and trim(data) = 'a'
+           |and id =2
+           |""".stripMargin
+      )
+      val filtersBefore = find(filterBefore.queryExecution.executedPlan)(_.isInstanceOf[FilterExec])
+        .head.asInstanceOf[FilterExec]
+        .condition.toString
+        .split("AND")
+      assert(filtersBefore.length == 5
+        && filtersBefore(3).trim.startsWith("(udfStrLen(data")
+        && filtersBefore(4).trim.startsWith("(trim(data"))
+
+      val filterAfter = spark.sql(
+        s"""
+           |SELECT id, data FROM $t1
+           |where id =2
+           |and udfStrLen(data) = 1
+           |and trim(data) = 'a'
+           |""".stripMargin
+      )
+      val filtersAfter = find(filterAfter.queryExecution.executedPlan)(_.isInstanceOf[FilterExec])
+        .head.asInstanceOf[FilterExec]
+        .condition.toString
+        .split("AND")
+      assert(filtersAfter.length == 5
+        && filtersAfter(3).trim.startsWith("(udfStrLen(data")

Review Comment:
   Wait, shouldn't this UDF be on the far right?





[GitHub] [spark] huaxingao commented on a diff in pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #37479:
URL: https://github.com/apache/spark/pull/37479#discussion_r952023794


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala:
##########
@@ -65,7 +65,7 @@ object PushDownUtils extends PredicateHelper {
         val postScanFilters = r.pushFilters(translatedFilters.toArray).map { filter =>
           DataSourceStrategy.rebuildExpressionFromFilter(filter, translatedFilterToExpr)
         }
-        (Left(r.pushedFilters()), (untranslatableExprs ++ postScanFilters).toSeq)
+        (Left(r.pushedFilters()), (postScanFilters ++ untranslatableExprs).toSeq)

Review Comment:
   Can we add a comment here to state that `untranslatableExprs` needs to be on the right side and also briefly explain the reason? 



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala:
##########
@@ -94,7 +94,7 @@ object PushDownUtils extends PredicateHelper {
         val postScanFilters = r.pushPredicates(translatedFilters.toArray).map { predicate =>
           DataSourceV2Strategy.rebuildExpressionFromFilter(predicate, translatedFilterToExpr)
         }
-        (Right(r.pushedPredicates), (untranslatableExprs ++ postScanFilters).toSeq)
+        (Right(r.pushedPredicates), (postScanFilters ++ untranslatableExprs).toSeq)

Review Comment:
   same comment as the above





[GitHub] [spark] huaxingao commented on a diff in pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #37479:
URL: https://github.com/apache/spark/pull/37479#discussion_r944137018


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala:
##########
@@ -65,7 +65,7 @@ object PushDownUtils extends PredicateHelper {
         val postScanFilters = r.pushFilters(translatedFilters.toArray).map { filter =>
           DataSourceStrategy.rebuildExpressionFromFilter(filter, translatedFilterToExpr)
         }
-        (Left(r.pushedFilters()), (untranslatableExprs ++ postScanFilters).toSeq)
+        (Left(r.pushedFilters()), (postScanFilters ++ untranslatableExprs).toSeq)

Review Comment:
   This order switching makes sense to me. I think the translated filters (`postScanFilters`) are simple filters that can be evaluated faster, while the untranslated filters are normally complicated filters that take more time to evaluate, so we want to evaluate the `postScanFilters` filters first.
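   
   A plain-Scala sketch of that cost argument (illustrative only; the counter and predicates are invented): with short-circuit AND, putting the cheap translated predicate first means the expensive expression runs only on rows that survive it.
   
   ```
   var expensiveCalls = 0
   def expensive(s: String): Boolean = { expensiveCalls += 1; s.length == 1 }
   
   val rows = (1L to 1000L).map(i => (i, "x"))
   rows.filter { case (id, s) => id == 2L && expensive(s) }
   println(expensiveCalls)  // 1: only the row with id == 2 reaches the expensive check
   ```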





[GitHub] [spark] caican00 commented on a diff in pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
caican00 commented on code in PR #37479:
URL: https://github.com/apache/spark/pull/37479#discussion_r965477862


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##########
@@ -2549,6 +2551,55 @@ class DataSourceV2SQLSuite
     }
   }
 
+  test("SPARK-40045: Move the post-Scan Filters to the far right") {
+    val t1 = s"${catalogAndNamespace}table"
+    withUserDefinedFunction("udfStrLen" -> true) {
+      withTable(t1) {
+        spark.udf.register("udfStrLen", (str: String) => str.length)
+        sql(s"CREATE TABLE $t1 (id bigint, data string) USING $v2Format")
+        sql(s"INSERT INTO $t1 VALUES (1, 'a'), (2, 'b'), (3, 'c')")
+
+        val filterBefore = spark.sql(
+          s"""
+             |SELECT id, data FROM $t1
+             |where udfStrLen(data) = 1
+             |and trim(data) = 'a'

Review Comment:
   > Can we remove `trim`, to make the test easier to understand?
   
   Okay, I have updated it. @cloud-fan





[GitHub] [spark] caican00 commented on pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
caican00 commented on PR #37479:
URL: https://github.com/apache/spark/pull/37479#issuecomment-1211628009

   Gently ping @rdblue @cloud-fan. Could you help review this patch?




[GitHub] [spark] mridulm commented on pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by "mridulm (via GitHub)" <gi...@apache.org>.
mridulm commented on PR #37479:
URL: https://github.com/apache/spark/pull/37479#issuecomment-1405522008

   QQ: Why is this PR targeting 3.3 and not master?




[GitHub] [spark] caican00 commented on a diff in pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
caican00 commented on code in PR #37479:
URL: https://github.com/apache/spark/pull/37479#discussion_r951056982


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##########
@@ -2549,6 +2551,47 @@ class DataSourceV2SQLSuite
     }
   }
 
+  test("Move the post-Scan Filters to the far right") {
+    val t1 = s"${catalogAndNamespace}table"
+    val sbq = "sbq"
+    withTable(t1) {
+      sql(s"CREATE TABLE $t1 (id bigint, data string) USING $v2Format")
+      sql(s"INSERT INTO $t1 VALUES (1, 'a'), (2, 'b'), (3, 'c')")
+
+      val filterBefore = spark.sql(
+        s"""
+           |SELECT id, data FROM $t1
+           |where md5(data) = '8cde774d6f7333752ed72cacddb05126'

Review Comment:
   > > Can we use a Java UDF? I'm afraid we may support translating `md5` in the near future.
   > 
   > @cloud-fan Okay, I will update this.
   
   @cloud-fan Updated.





[GitHub] [spark] caican00 commented on a diff in pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
caican00 commented on code in PR #37479:
URL: https://github.com/apache/spark/pull/37479#discussion_r951330558


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##########
@@ -2549,6 +2551,47 @@ class DataSourceV2SQLSuite
     }
   }
 
+  test("SPARK-40045: Move the post-Scan Filters to the far right") {
+    val t1 = s"${catalogAndNamespace}table"
+    withTable(t1) {
+      spark.udf.register("udfStrLen", (str: String) => str.length)
+      sql(s"CREATE TABLE $t1 (id bigint, data string) USING $v2Format")
+      sql(s"INSERT INTO $t1 VALUES (1, 'a'), (2, 'b'), (3, 'c')")
+
+      val filterBefore = spark.sql(
+        s"""
+           |SELECT id, data FROM $t1
+           |where udfStrLen(data) = 1
+           |and trim(data) = 'a'
+           |and id =2
+           |""".stripMargin
+      )
+      val filtersBefore = find(filterBefore.queryExecution.executedPlan)(_.isInstanceOf[FilterExec])
+        .head.asInstanceOf[FilterExec]
+        .condition.toString
+        .split("AND")
+      assert(filtersBefore.length == 5
+        && filtersBefore(3).trim.startsWith("(udfStrLen(data")
+        && filtersBefore(4).trim.startsWith("(trim(data"))
+
+      val filterAfter = spark.sql(
+        s"""
+           |SELECT id, data FROM $t1
+           |where id =2
+           |and udfStrLen(data) = 1
+           |and trim(data) = 'a'
+           |""".stripMargin
+      )
+      val filtersAfter = find(filterAfter.queryExecution.executedPlan)(_.isInstanceOf[FilterExec])
+        .head.asInstanceOf[FilterExec]
+        .condition.toString
+        .split("AND")

Review Comment:
   > nit: let's call `splitConjunctivePredicates` and check `toString` of the resulting predicates.
   
   @cloud-fan Thanks. Updated.







[GitHub] [spark] cloud-fan commented on a diff in pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37479:
URL: https://github.com/apache/spark/pull/37479#discussion_r951387333


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##########
@@ -2549,6 +2551,55 @@ class DataSourceV2SQLSuite
     }
   }
 
+  test("SPARK-40045: Move the post-Scan Filters to the far right") {
+    val t1 = s"${catalogAndNamespace}table"
+    withUserDefinedFunction("udfStrLen" -> true) {
+      withTable(t1) {
+        spark.udf.register("udfStrLen", (str: String) => str.length)
+        sql(s"CREATE TABLE $t1 (id bigint, data string) USING $v2Format")
+        sql(s"INSERT INTO $t1 VALUES (1, 'a'), (2, 'b'), (3, 'c')")
+
+        val filterBefore = spark.sql(
+          s"""
+             |SELECT id, data FROM $t1
+             |where udfStrLen(data) = 1
+             |and trim(data) = 'a'

Review Comment:
   Can we remove `trim`, to make the test easier to understand?





[GitHub] [spark] caican00 commented on pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
caican00 commented on PR #37479:
URL: https://github.com/apache/spark/pull/37479#issuecomment-1219169296

   > @caican00 Could you change the filters order in `case r: SupportsPushDownV2Filters` too?
   
   @huaxingao OK, I will update this case.




[GitHub] [spark] cloud-fan commented on a diff in pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37479:
URL: https://github.com/apache/spark/pull/37479#discussion_r951155252


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##########
@@ -2549,6 +2551,47 @@ class DataSourceV2SQLSuite
     }
   }
 
+  test("SPARK-40045: Move the post-Scan Filters to the far right") {
+    val t1 = s"${catalogAndNamespace}table"
+    withTable(t1) {
+      spark.udf.register("udfStrLen", (str: String) => str.length)

Review Comment:
   nit: wrap the test with `withUserDefinedFunction` to unregister the function at the end.





[GitHub] [spark] caican00 commented on a diff in pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
caican00 commented on code in PR #37479:
URL: https://github.com/apache/spark/pull/37479#discussion_r945427452


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala:
##########
@@ -65,7 +65,7 @@ object PushDownUtils extends PredicateHelper {
         val postScanFilters = r.pushFilters(translatedFilters.toArray).map { filter =>
           DataSourceStrategy.rebuildExpressionFromFilter(filter, translatedFilterToExpr)
         }
-        (Left(r.pushedFilters()), (untranslatableExprs ++ postScanFilters).toSeq)
+        (Left(r.pushedFilters()), (postScanFilters ++ untranslatableExprs).toSeq)

Review Comment:
   > This order switching makes sense to me. I think the translated filters (`postScanFilters`) are simple filters that can be evaluated faster, while the untranslated filters are normally complicated filters that take more time to evaluate, so we want to evaluate the `postScanFilters` filters first.
   
   Thank you for your reply. That's exactly what I was thinking.





[GitHub] [spark] huaxingao commented on a diff in pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #37479:
URL: https://github.com/apache/spark/pull/37479#discussion_r952023380


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTable.scala:
##########
@@ -253,7 +254,8 @@ class InMemoryTable(
   }
 
   class InMemoryScanBuilder(tableSchema: StructType) extends ScanBuilder
-      with SupportsPushDownRequiredColumns {
+      with SupportsPushDownRequiredColumns with SupportsPushDownFilters
+    with SupportsPushDownV2Filters {

Review Comment:
   I think this `SupportsPushDownV2Filters` should be implemented in `InMemoryV2FilterScanBuilder`





[GitHub] [spark] caican00 commented on a diff in pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
caican00 commented on code in PR #37479:
URL: https://github.com/apache/spark/pull/37479#discussion_r951334595


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##########
@@ -2549,6 +2551,47 @@ class DataSourceV2SQLSuite
     }
   }
 
+  test("SPARK-40045: Move the post-Scan Filters to the far right") {
+    val t1 = s"${catalogAndNamespace}table"
+    withTable(t1) {
+      spark.udf.register("udfStrLen", (str: String) => str.length)
+      sql(s"CREATE TABLE $t1 (id bigint, data string) USING $v2Format")
+      sql(s"INSERT INTO $t1 VALUES (1, 'a'), (2, 'b'), (3, 'c')")
+
+      val filterBefore = spark.sql(
+        s"""
+           |SELECT id, data FROM $t1
+           |where udfStrLen(data) = 1
+           |and trim(data) = 'a'
+           |and id =2
+           |""".stripMargin
+      )
+      val filtersBefore = find(filterBefore.queryExecution.executedPlan)(_.isInstanceOf[FilterExec])
+        .head.asInstanceOf[FilterExec]
+        .condition.toString
+        .split("AND")
+      assert(filtersBefore.length == 5
+        && filtersBefore(3).trim.startsWith("(udfStrLen(data")
+        && filtersBefore(4).trim.startsWith("(trim(data"))
+
+      val filterAfter = spark.sql(
+        s"""
+           |SELECT id, data FROM $t1
+           |where id =2
+           |and udfStrLen(data) = 1
+           |and trim(data) = 'a'
+           |""".stripMargin
+      )
+      val filtersAfter = find(filterAfter.queryExecution.executedPlan)(_.isInstanceOf[FilterExec])
+        .head.asInstanceOf[FilterExec]
+        .condition.toString
+        .split("AND")
+      assert(filtersAfter.length == 5
+        && filtersAfter(3).trim.startsWith("(udfStrLen(data")

Review Comment:
   > Wait, shouldn't this UDF be on the far right?
   
   @cloud-fan  In the following SQL:
   ```
   SELECT id, data FROM testcat.ns1.ns2.table
   where udfStrLen(data) = 1
   and trim(data) = 'a'
   and id =2
   ```
   The `udfStrLen` and `trim` functions are untranslatable, so they are placed to the right of `id = 2`. Before this optimization, `id = 2` was on the far right.
   
   ```
   == Physical Plan ==
   *(1) Project [id#24L, data#25]
   +- *(1) Filter (((isnotnull(id#24L) AND (id#24L = 2)) AND (udfStrLen(data#25) = 1)) AND (trim(data#25, None) = a))
      +- BatchScan[id#24L, data#25] class org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan RuntimeFilters: []
   ```





[GitHub] [spark] huaxingao commented on a diff in pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #37479:
URL: https://github.com/apache/spark/pull/37479#discussion_r944137954


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala:
##########
@@ -65,7 +65,7 @@ object PushDownUtils extends PredicateHelper {
         val postScanFilters = r.pushFilters(translatedFilters.toArray).map { filter =>
           DataSourceStrategy.rebuildExpressionFromFilter(filter, translatedFilterToExpr)
         }
-        (Left(r.pushedFilters()), (untranslatableExprs ++ postScanFilters).toSeq)
+        (Left(r.pushedFilters()), (postScanFilters ++ untranslatableExprs).toSeq)

Review Comment:
   cc @cloud-fan 





[GitHub] [spark] huaxingao commented on pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on PR #37479:
URL: https://github.com/apache/spark/pull/37479#issuecomment-1408035075

   @caican00 Do you want to finish this? I think you can just remove the `SupportsPushDownV2Filters` implementation here https://github.com/apache/spark/blob/10dbd4239787ad8db80b06cfde90380c3696c386/sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTable.scala#L258, and then it's ready to be merged.




[GitHub] [spark] zinking commented on pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
zinking commented on PR #37479:
URL: https://github.com/apache/spark/pull/37479#issuecomment-1212897707

   
   The point is clear and valid.
   
   What about this case: `day(ts)=1` versus `col1 > 13`, where `ts` is a partition column while `col1` is not, and some v2 data source implementation is capable of pushing that `day` function down? What will happen?
   
   Though I thought the fix would be more complicated.




[GitHub] [spark] shardulm94 commented on pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
shardulm94 commented on PR #37479:
URL: https://github.com/apache/spark/pull/37479#issuecomment-1294365611

   @caican00 Do you think this PR is ready for another round of review? In our organization, we have seen a number of users impacted by this after migration to DSv2, so it would be nice to get this merged.




[GitHub] [spark] caican00 commented on pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
caican00 commented on PR #37479:
URL: https://github.com/apache/spark/pull/37479#issuecomment-1221904476

   > @caican00 Could you change the filters order in `case r: SupportsPushDownV2Filters` too?
   
   @huaxingao Updated.




[GitHub] [spark] cloud-fan commented on a diff in pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37479:
URL: https://github.com/apache/spark/pull/37479#discussion_r951157110


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##########
@@ -2549,6 +2551,47 @@ class DataSourceV2SQLSuite
     }
   }
 
+  test("SPARK-40045: Move the post-Scan Filters to the far right") {
+    val t1 = s"${catalogAndNamespace}table"
+    withTable(t1) {
+      spark.udf.register("udfStrLen", (str: String) => str.length)
+      sql(s"CREATE TABLE $t1 (id bigint, data string) USING $v2Format")
+      sql(s"INSERT INTO $t1 VALUES (1, 'a'), (2, 'b'), (3, 'c')")
+
+      val filterBefore = spark.sql(
+        s"""
+           |SELECT id, data FROM $t1
+           |where udfStrLen(data) = 1
+           |and trim(data) = 'a'
+           |and id =2
+           |""".stripMargin
+      )
+      val filtersBefore = find(filterBefore.queryExecution.executedPlan)(_.isInstanceOf[FilterExec])
+        .head.asInstanceOf[FilterExec]
+        .condition.toString
+        .split("AND")
+      assert(filtersBefore.length == 5
+        && filtersBefore(3).trim.startsWith("(udfStrLen(data")
+        && filtersBefore(4).trim.startsWith("(trim(data"))
+
+      val filterAfter = spark.sql(
+        s"""
+           |SELECT id, data FROM $t1
+           |where id =2
+           |and udfStrLen(data) = 1
+           |and trim(data) = 'a'
+           |""".stripMargin
+      )
+      val filtersAfter = find(filterAfter.queryExecution.executedPlan)(_.isInstanceOf[FilterExec])
+        .head.asInstanceOf[FilterExec]
+        .condition.toString
+        .split("AND")

Review Comment:
   nit: let's call `splitConjunctivePredicates` and check `toString` of the resulting predicates.
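   
   A hedged sketch of what that might look like (assumes the suite mixes in `PredicateHelper`, which provides `splitConjunctivePredicates`, and reuses `filterAfter`, `find` and `FilterExec` from the test above; the predicate prefixes are assumptions):
   
   ```
   val condition = find(filterAfter.queryExecution.executedPlan)(_.isInstanceOf[FilterExec])
     .head.asInstanceOf[FilterExec]
     .condition
   
   // Split the conjunction into individual predicates instead of splitting the
   // rendered string on "AND", then assert on their relative order.
   val predicates = splitConjunctivePredicates(condition).map(_.toString)
   assert(predicates.indexWhere(_.startsWith("(id")) <
     predicates.indexWhere(_.startsWith("(udfStrLen")))
   ```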





[GitHub] [spark] caican00 commented on a diff in pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
caican00 commented on code in PR #37479:
URL: https://github.com/apache/spark/pull/37479#discussion_r951330058


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##########
@@ -2549,6 +2551,47 @@ class DataSourceV2SQLSuite
     }
   }
 
+  test("SPARK-40045: Move the post-Scan Filters to the far right") {
+    val t1 = s"${catalogAndNamespace}table"
+    withTable(t1) {
+      spark.udf.register("udfStrLen", (str: String) => str.length)

Review Comment:
   > nit: wrap the test with `withUserDefinedFunction` to unregister the function at the end.
   
   @cloud-fan Thanks. Updated





[GitHub] [spark] caican00 commented on a diff in pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
caican00 commented on code in PR #37479:
URL: https://github.com/apache/spark/pull/37479#discussion_r965477750


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##########
@@ -2549,6 +2551,55 @@ class DataSourceV2SQLSuite
     }
   }
 
+  test("SPARK-40045: Move the post-Scan Filters to the far right") {
+    val t1 = s"${catalogAndNamespace}table"
+    withUserDefinedFunction("udfStrLen" -> true) {
+      withTable(t1) {
+        spark.udf.register("udfStrLen", (str: String) => str.length)
+        sql(s"CREATE TABLE $t1 (id bigint, data string) USING $v2Format")
+        sql(s"INSERT INTO $t1 VALUES (1, 'a'), (2, 'b'), (3, 'c')")
+
+        val filterBefore = spark.sql(
+          s"""
+             |SELECT id, data FROM $t1
+             |where udfStrLen(data) = 1
+             |and trim(data) = 'a'

Review Comment:
   @cloud-fan Okay, I have updated it.





[GitHub] [spark] cloud-fan commented on a diff in pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37479:
URL: https://github.com/apache/spark/pull/37479#discussion_r948687132


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##########
@@ -2549,6 +2551,47 @@ class DataSourceV2SQLSuite
     }
   }
 
+  test("Move the post-Scan Filters to the far right") {
+    val t1 = s"${catalogAndNamespace}table"
+    val sbq = "sbq"
+    withTable(t1) {
+      sql(s"CREATE TABLE $t1 (id bigint, data string) USING $v2Format")
+      sql(s"INSERT INTO $t1 VALUES (1, 'a'), (2, 'b'), (3, 'c')")
+
+      val filterBefore = spark.sql(
+        s"""
+           |SELECT id, data FROM $t1
+           |where md5(data) = '8cde774d6f7333752ed72cacddb05126'

Review Comment:
   Can we use a Java UDF? I'm afraid we may support translating `md5` in the near future.







[GitHub] [spark] caican00 closed pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates

Posted by GitBox <gi...@apache.org>.
caican00 closed pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates
URL: https://github.com/apache/spark/pull/37479

