You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/17 04:27:04 UTC

[GitHub] [spark] wankunde opened a new pull request, #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance

wankunde opened a new pull request, #38682:
URL: https://github.com/apache/spark/pull/38682

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   
   We can improve multi like by reorder the match expressions.
   
   ### Why are the changes needed?
   
   Local benchmark `org.apache.spark.sql.execution.benchmark.LikeAnyBenchmark`, before this PR:
   ```
   [info] Multi like query:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] Query with multi like                              5144           5145           2          0.0       25717.9       1.0X
   [info] Query with LikeAny simplification                  5098           5099           0          0.0       25492.3       1.0X
   [info] Query without LikeAny simplification              14501          14599         169          0.0       72504.4       0.4X
   ```
   after this PR:
   ```
   [info] Multi like query:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] Query with multi like                              5149           5201          85          0.0       25747.3       1.0X
   [info] Query with LikeAny simplification                  9100           9335         210          0.0       45498.0       0.6X
   [info] Query without LikeAny simplification              15041          15078          48          0.0       75202.7       0.3X
   ```
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   
   ### How was this patch tested?
   
   Exists UT
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum commented on pull request #38682: [SPARK-41167][SQL] Improve multi like performance by creating a balanced expression tree predicate

Posted by GitBox <gi...@apache.org>.

wangyum commented on PR #38682:
URL: https://github.com/apache/spark/pull/38682#issuecomment-1320804263

   cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on PR #38682:
URL: https://github.com/apache/spark/pull/38682#issuecomment-1318132557

   Seems like it became slower after your PR (?)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance

Posted by GitBox <gi...@apache.org>.

wangyum commented on code in PR #38682:
URL: https://github.com/apache/spark/pull/38682#discussion_r1026470585


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala:
##########
@@ -117,4 +117,23 @@ object ExprUtils extends QueryErrorsBase {
       TypeCheckSuccess
     }
   }
+
+  /**
+   * Combine a number of boolean expressions into a balanced expression tree. These expressions are
+   * either combined by a logical [[And]] or a logical [[Or]].
+   *
+   * A balanced binary tree is created because regular left recursive trees cause considerable
+   * performance degradations and can cause stack overflows.
+   */
+  def reduceToExpressionTree(
+       patterns: Seq[Expression],
+       expressionCombiner: (Expression, Expression) => Expression): Expression = {
+    assert(patterns.size > 0)
+    var res = patterns
+    while (res.size > 1) {
+      res = res.sliding(2, 2).toSeq
+        .map(tup => if (tup.size == 2) expressionCombiner(tup.head, tup.last) else tup(0))
+    }
+    res.head

Review Comment:
   It seems that the original implementation is better than this one?
   ```scala
     def reduceToExpressionTree1(
         patterns: Seq[Expression],
         expressionCombiner: (Expression, Expression) => Expression): Expression = {
       var res = patterns
       while (res.size > 1) {
         res = res.sliding(2, 2).toSeq
           .map(tup => if (tup.size == 2) expressionCombiner(tup.head, tup.last) else tup(0))
       }
       res.head
     }
   
     def reduceToExpressionTree2(
         expressions: Seq[Expression],
         expressionCombiner: (Expression, Expression) => Expression): Expression = {
       def reduceToExpressionTree(low: Int, high: Int): Expression = high - low match {
         case 0 =>
           expressions(low)
         case 1 =>
           expressionCombiner(expressions(low), expressions(high))
         case x =>
           val mid = low + x / 2
           expressionCombiner(
             reduceToExpressionTree(low, mid),
             reduceToExpressionTree(mid + 1, high))
       }
       reduceToExpressionTree(0, expressions.size - 1)
     }
   
       Seq(10000, 100000, 1000000).foreach { N =>
         val expressions = Range(1, N).map(Literal(_))
   
         val benchmark = new Benchmark(s"Benchmark reduceToExpressionTree with $N elements", N, minNumIters = 10)
         benchmark.addCase("reduceToExpressionTree1") { _ =>
           reduceToExpressionTree1(expressions, Or.apply)
         }
   
         benchmark.addCase("reduceToExpressionTree2") { _ =>
           reduceToExpressionTree2(expressions, Or.apply)
         }
   
         benchmark.run()
       }
   ```
   
   ```
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 10.16
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   Benchmark reduceToExpressionTree with 1000000 elements:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   --------------------------------------------------------------------------------------------------------------------------------------
   reduceToExpressionTree1                                           823           1012         212          1.2         822.5       1.0X
   reduceToExpressionTree2                                           150            183          25          6.7         149.7       5.5X
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum commented on pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance

Posted by GitBox <gi...@apache.org>.

wangyum commented on PR #38682:
URL: https://github.com/apache/spark/pull/38682#issuecomment-1320735591

   @wankunde Please fix the PR title and description.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance

Posted by GitBox <gi...@apache.org>.

wangyum commented on code in PR #38682:
URL: https://github.com/apache/spark/pull/38682#discussion_r1026330362


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala:
##########
@@ -743,6 +743,21 @@ object LikeSimplification extends Rule[LogicalPlan] {
     }
   }
 
+  def combinePatterns(

Review Comment:
   Remove it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance

Posted by GitBox <gi...@apache.org>.

wangyum commented on code in PR #38682:
URL: https://github.com/apache/spark/pull/38682#discussion_r1026987556


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala:
##########
@@ -117,4 +117,23 @@ object ExprUtils extends QueryErrorsBase {
       TypeCheckSuccess
     }
   }
+
+  /**
+   * Combine a number of boolean expressions into a balanced expression tree. These expressions are
+   * either combined by a logical [[And]] or a logical [[Or]].
+   *
+   * A balanced binary tree is created because regular left recursive trees cause considerable
+   * performance degradations and can cause stack overflows.
+   */
+  def reduceToExpressionTree(
+       patterns: Seq[Expression],

Review Comment:
   `patterns` -> `expressions`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #38682: [SPARK-41167][SQL] Improve multi like performance by creating a balanced expression tree predicate

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on PR #38682:
URL: https://github.com/apache/spark/pull/38682#issuecomment-1320929894

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wankunde commented on pull request #38682: [SPARK-41167][SQL] Improve multi like performance by creating a balanced expression tree predicate

Posted by GitBox <gi...@apache.org>.

wankunde commented on PR #38682:
URL: https://github.com/apache/spark/pull/38682#issuecomment-1336852190

   Retest this please


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance

Posted by GitBox <gi...@apache.org>.

wangyum commented on code in PR #38682:
URL: https://github.com/apache/spark/pull/38682#discussion_r1026534660


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/LikeAnyBenchmark.scala:
##########
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * Benchmark to measure like any expressions performance.
+ *
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class>
+ *     --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/LikeAnyBenchmark-results.txt".
+ * }}}
+ */
+object LikeAnyBenchmark extends SqlBasedBenchmark {

Review Comment:
   Please remove this benchmark test.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wankunde commented on pull request #38682: [SPARK-41167][SQL] Improve multi like performance by creating a balanced expression tree predicate

Posted by GitBox <gi...@apache.org>.

wankunde commented on PR #38682:
URL: https://github.com/apache/spark/pull/38682#issuecomment-1337096043

   Retest this please


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum commented on pull request #38682: [SPARK-41167][SQL] Improve multi like performance by creating a balanced expression tree predicate

Posted by GitBox <gi...@apache.org>.

wangyum commented on PR #38682:
URL: https://github.com/apache/spark/pull/38682#issuecomment-1337393910

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #38682: [SPARK-41167][SQL] Improve multi like performance by creating a balanced expression tree predicate

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on PR #38682:
URL: https://github.com/apache/spark/pull/38682#issuecomment-1339368069

   late LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance

Posted by GitBox <gi...@apache.org>.

wangyum commented on code in PR #38682:
URL: https://github.com/apache/spark/pull/38682#discussion_r1026264942


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala:
##########
@@ -756,16 +771,16 @@ object LikeSimplification extends Rule[LogicalPlan] {
     } else {
       multi match {
         case l: LikeAll =>
-          val and = replacements.reduceLeft(And)
+          val and = combinePatterns(replacements, And.apply).get

Review Comment:
   Could you reuse these code to create a balanced tree:
   https://github.com/apache/spark/blob/5c43da6858721664318c3cdbcb051231b0e98175/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L1657-L1669



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Improve multi like performance by creating a balanced expression tree predicate

Posted by GitBox <gi...@apache.org>.

wangyum commented on code in PR #38682:
URL: https://github.com/apache/spark/pull/38682#discussion_r1039110128


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala:
##########
@@ -756,16 +756,16 @@ object LikeSimplification extends Rule[LogicalPlan] {
     } else {
       multi match {
         case l: LikeAll =>
-          val and = replacements.reduceLeft(And)
+          val and = buildBalancedPredicate(replacements, And.apply)

Review Comment:
   buildBalancedPredicate(replacements, And.apply) -> buildBalancedPredicate(replacements, And)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum closed pull request #38682: [SPARK-41167][SQL] Improve multi like performance by creating a balanced expression tree predicate

Posted by GitBox <gi...@apache.org>.

wangyum closed pull request #38682: [SPARK-41167][SQL] Improve multi like performance by creating a balanced expression tree predicate
URL: https://github.com/apache/spark/pull/38682


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wankunde commented on pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance

Posted by GitBox <gi...@apache.org>.

wankunde commented on PR #38682:
URL: https://github.com/apache/spark/pull/38682#issuecomment-1318136683

   > Seems like it became slower after your PR (?)
   
    Sorry for the mistake, I have update the benchmark result, after this PR,  `Query with LikeAny simplification` should be the same with `Query with multi like`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance

Posted by GitBox <gi...@apache.org>.

wangyum commented on code in PR #38682:
URL: https://github.com/apache/spark/pull/38682#discussion_r1026470585


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala:
##########
@@ -117,4 +117,23 @@ object ExprUtils extends QueryErrorsBase {
       TypeCheckSuccess
     }
   }
+
+  /**
+   * Combine a number of boolean expressions into a balanced expression tree. These expressions are
+   * either combined by a logical [[And]] or a logical [[Or]].
+   *
+   * A balanced binary tree is created because regular left recursive trees cause considerable
+   * performance degradations and can cause stack overflows.
+   */
+  def reduceToExpressionTree(
+       patterns: Seq[Expression],
+       expressionCombiner: (Expression, Expression) => Expression): Expression = {
+    assert(patterns.size > 0)
+    var res = patterns
+    while (res.size > 1) {
+      res = res.sliding(2, 2).toSeq
+        .map(tup => if (tup.size == 2) expressionCombiner(tup.head, tup.last) else tup(0))
+    }
+    res.head

Review Comment:
   seems that the original implementation is better than this one?
   ```scala
     def reduceToExpressionTree1(
         patterns: Seq[Expression],
         expressionCombiner: (Expression, Expression) => Expression): Expression = {
       var res = patterns
       while (res.size > 1) {
         res = res.sliding(2, 2).toSeq
           .map(tup => if (tup.size == 2) expressionCombiner(tup.head, tup.last) else tup(0))
       }
       res.head
     }
   
     def reduceToExpressionTree2(
         expressions: Seq[Expression],
         expressionCombiner: (Expression, Expression) => Expression): Expression = {
       def reduceToExpressionTree(low: Int, high: Int): Expression = high - low match {
         case 0 =>
           expressions(low)
         case 1 =>
           expressionCombiner(expressions(low), expressions(high))
         case x =>
           val mid = low + x / 2
           expressionCombiner(
             reduceToExpressionTree(low, mid),
             reduceToExpressionTree(mid + 1, high))
       }
       reduceToExpressionTree(0, expressions.size - 1)
     }
   
       Seq(10000, 100000, 1000000).foreach { N =>
         val expressions = Range(1, N).map(Literal(_))
   
         val benchmark = new Benchmark(s"Benchmark reduceToExpressionTree with $N elements", N, minNumIters = 10)
         benchmark.addCase("reduceToExpressionTree1") { _ =>
           reduceToExpressionTree1(expressions, Or.apply)
         }
   
         benchmark.addCase("reduceToExpressionTree2") { _ =>
           reduceToExpressionTree2(expressions, Or.apply)
         }
   
         benchmark.run()
       }
   ```
   
   ```
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 10.16
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   Benchmark reduceToExpressionTree with 1000000 elements:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   --------------------------------------------------------------------------------------------------------------------------------------
   reduceToExpressionTree1                                           823           1012         212          1.2         822.5       1.0X
   reduceToExpressionTree2                                           150            183          25          6.7         149.7       5.5X
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance

Posted by GitBox <gi...@apache.org>.

wangyum commented on code in PR #38682:
URL: https://github.com/apache/spark/pull/38682#discussion_r1026531799


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/LikeSimplificationSuite.scala:
##########
@@ -207,11 +207,17 @@ class LikeSimplificationSuite extends PlanTest {
 
     val optimized = Optimize.execute(originalQuery.analyze)
     val correctAnswer = testRelation
-      .where((((((StartsWith($"a", "abc") || EndsWith($"a", "xyz")) ||
-        (Length($"a") >= 6 && (StartsWith($"a", "abc") && EndsWith($"a", "def")))) ||
-        Contains($"a", "mn")) || ($"a" === "")) || ($"a" === "abc")) ||
-        ($"a" likeAny("abc\\%", "abc\\%def", "%mn\\%")))
-      .analyze
+      .where(
+        (
+          (

Review Comment:
   Please fix the format.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance

Posted by GitBox <gi...@apache.org>.

wangyum commented on code in PR #38682:
URL: https://github.com/apache/spark/pull/38682#discussion_r1027001153


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala:
##########
@@ -117,4 +117,29 @@ object ExprUtils extends QueryErrorsBase {
       TypeCheckSuccess
     }
   }
+
+  /**
+   * Combine a number of boolean expressions into a balanced expression tree. These expressions are
+   * either combined by a logical [[And]] or a logical [[Or]].
+   *
+   * A balanced binary tree is created because regular left recursive trees cause considerable
+   * performance degradations and can cause stack overflows.
+   */
+  def reduceToExpressionTree(
+       expressions: Seq[Expression],
+       expressionCombiner: (Expression, Expression) => Expression): Expression = {
+    assert(expressions.size > 0)

Review Comment:
   Remove this `assert`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org