Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/09 14:57:52 UTC

[GitHub] [spark] Dooyoung-Hwang opened a new pull request #35465: [SPARK-38168][SQL] LikeSimplification rule handles escape characters

Dooyoung-Hwang opened a new pull request #35465:
URL: https://github.com/apache/spark/pull/35465


   ### What changes were proposed in this pull request?
   Currently, the LikeSimplification rule is skipped whenever the pattern in a LIKE filter contains an escape character.
   As a result, the filter "LIKE '%100\\%'" in the query below is not optimized into an 'EndsWith' expression on the StringType column.
   ```
   spark.sql("SELECT * FROM tbl WHERE c_1 LIKE '%100\\%'").explain(true)
   
   ...
   == Optimized Logical Plan ==
   Filter (isnotnull(c_1#0) && c_1#0 LIKE %100\%)
   +- Relation[c_1#0,c_2#1,c_3#2]
   ...
   ```
   
   The LikeSimplification rule can treat a special character (a wildcard, % or _, or the escape character itself) as a plain character when it immediately follows an escape character. With that change, the rule can optimize the filter as shown below.
   
   ```
   spark.sql("SELECT * FROM tbl WHERE c_1 LIKE '%100\\%'").explain(true)
   
   ...
   == Optimized Logical Plan ==
   Filter (isnotnull(c_1#0) && EndsWith(c_1#0, 100%))
   +- Relation[c_1#0,c_2#1,c_3#2] 
   ```
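
The idea of resolving escape sequences before classifying the pattern can be sketched outside Spark. The block below is a minimal standalone model, not the code in this PR: `LikeSketch`, `Token`, `Simplified`, `tokenize` and `simplify` are hypothetical names, the result types only mirror the shapes of the regexes the rule uses today, and returning `None` for an invalid escape sequence is an assumption made here so that such patterns are simply left unoptimized.

```scala
// Illustrative sketch only: LikeSketch and everything in it are hypothetical
// names, not part of Spark's LikeSimplification rule.
object LikeSketch {
  sealed trait Token
  final case class Lit(c: Char) extends Token // plain character, or a special one that was escaped
  case object AnyString extends Token         // unescaped %
  case object AnyChar extends Token           // unescaped _

  sealed trait Simplified
  final case class EqualTo(s: String) extends Simplified
  final case class StartsWith(prefix: String) extends Simplified
  final case class EndsWith(suffix: String) extends Simplified
  final case class Contains(infix: String) extends Simplified
  final case class StartsAndEndsWith(prefix: String, suffix: String) extends Simplified

  // Resolve escape sequences first. Assumption: an invalid sequence (escape
  // followed by an ordinary character, or a trailing escape) yields None.
  def tokenize(pattern: String, escape: Char = '\\'): Option[List[Token]] = {
    val out = List.newBuilder[Token]
    var i = 0
    while (i < pattern.length) {
      val c = pattern.charAt(i)
      if (c == escape) {
        if (i + 1 >= pattern.length) return None
        val next = pattern.charAt(i + 1)
        if (next != '%' && next != '_' && next != escape) return None
        out += Lit(next) // the escaped character is now a plain literal
        i += 2
      } else {
        out += (c match {
          case '%'   => AnyString
          case '_'   => AnyChar
          case other => Lit(other)
        })
        i += 1
      }
    }
    Some(out.result())
  }

  // Split the tokens on unescaped % and return the literal text of each segment.
  private def split(tokens: List[Token]): List[String] = {
    val (lits, rest) = tokens.span(_ != AnyString)
    val head = lits.collect { case Lit(c) => c }.mkString
    rest match {
      case Nil       => List(head)
      case _ :: tail => head :: split(tail)
    }
  }

  // Classify the pattern; None means "leave the LIKE expression as it is".
  def simplify(pattern: String, escape: Char = '\\'): Option[Simplified] =
    tokenize(pattern, escape) match {
      case Some(tokens) if !tokens.contains(AnyChar) =>
        split(tokens) match {
          case List(s)                                => Some(EqualTo(s))
          case List(p, "") if p.nonEmpty              => Some(StartsWith(p))
          case List("", s) if s.nonEmpty              => Some(EndsWith(s))
          case List(p, s) if p.nonEmpty && s.nonEmpty => Some(StartsAndEndsWith(p, s))
          case List("", m, "") if m.nonEmpty          => Some(Contains(m))
          case _                                      => None
        }
      case _ => None
    }
}
```

Under these assumptions, `LikeSketch.simplify("%100\\%")` returns `Some(EndsWith("100%"))` and `LikeSketch.simplify("100\\%%90\\%")` returns `Some(StartsAndEndsWith("100%", "90%"))`, while a pattern with an unescaped `_` or with more than two unescaped `%` wildcards returns `None`.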
   
   ### Why are the changes needed?
   To improve the performance of LIKE filters whose patterns contain escaped characters, such as "LIKE '%100\\%'", "LIKE '\\%100\\%'", "LIKE '%\\%100\\%%'", "LIKE '100\\%%'", and "LIKE '100\\%%90\\%'".
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Test cases were added to LikeSimplificationSuite.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #35465: [SPARK-38168][SQL] LikeSimplification rule handles escape characters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #35465:
URL: https://github.com/apache/spark/pull/35465#issuecomment-1034512588


   Can one of the admins verify this patch?




[GitHub] [spark] cloud-fan commented on a change in pull request #35465: [SPARK-38168][SQL] LikeSimplification rule handles escape characters

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35465:
URL: https://github.com/apache/spark/pull/35465#discussion_r803309117



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
##########
@@ -680,42 +680,50 @@ object PushFoldableIntoBranches extends Rule[LogicalPlan] with PredicateHelper {
  * pattern.
  */
 object LikeSimplification extends Rule[LogicalPlan] {
-  // if guards below protect from escapes on trailing %.
-  // Cases like "something\%" are not optimized, but this does not affect correctness.
-  private val startsWith = "([^_%]+)%".r
-  private val endsWith = "%([^_%]+)".r
-  private val startsAndEndsWith = "([^_%]+)%([^_%]+)".r
-  private val contains = "%([^_%]+)%".r
-  private val equalTo = "([^_%]*)".r
-
   private def simplifyLike(
       input: Expression, pattern: String, escapeChar: Char = '\\'): Option[Expression] = {
-    if (pattern.contains(escapeChar)) {
-      // There are three different situations when pattern containing escapeChar:
-      // 1. pattern contains invalid escape sequence, e.g. 'm\aca'
-      // 2. pattern contains escaped wildcard character, e.g. 'ma\%ca'
-      // 3. pattern contains escaped escape character, e.g. 'ma\\ca'
-      // Although there are patterns can be optimized if we handle the escape first, we just
-      // skip this rule if pattern contains any escapeChar for simplicity.

Review comment:
       This is a trade-off between code simplicity and performance. The assumption is that escape characters are rarely used, so we shouldn't add a complicated implementation for them.
   
   Besides, Spark now provides `starts_with`, `ends_with`, etc. functions and people can use them directly.






[GitHub] [spark] Dooyoung-Hwang commented on a change in pull request #35465: [SPARK-38168][SQL] LikeSimplification rule handles escape characters

Posted by GitBox <gi...@apache.org>.
Dooyoung-Hwang commented on a change in pull request #35465:
URL: https://github.com/apache/spark/pull/35465#discussion_r803313544



Review comment:
       Ok, thank you.






[GitHub] [spark] Dooyoung-Hwang commented on pull request #35465: [SPARK-38168][SQL] LikeSimplification rule handles escape characters

Posted by GitBox <gi...@apache.org>.
Dooyoung-Hwang commented on pull request #35465:
URL: https://github.com/apache/spark/pull/35465#issuecomment-1033858911


   Could you review this patch?
   @cloud-fan @HyukjinKwon @dongjoon-hyun 
   




[GitHub] [spark] Dooyoung-Hwang closed pull request #35465: [SPARK-38168][SQL] LikeSimplification rule handles escape characters

Posted by GitBox <gi...@apache.org>.
Dooyoung-Hwang closed pull request #35465:
URL: https://github.com/apache/spark/pull/35465


   

