You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/06/27 12:57:58 UTC

[GitHub] [spark] MaxGekk opened a new pull request, #37004: [WIP][SQL] Add the `REGEXP_COUNT` function

MaxGekk opened a new pull request, #37004:
URL: https://github.com/apache/spark/pull/37004

   ### What changes were proposed in this pull request?
   In the PR, I propose to add new expression `RegExpCount` as a runtime replaceable expression.
   
   ### Why are the changes needed?
   To make the migration process from other systems to Spark SQL easier, and achieve feature parity to such systems. For example, ... supports the `REGEXP_COUNT` function, see ....
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   By running new tests:
   ```
   $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z regexp-functions.sql"
   $ build/sbt "sql/testOnly *ExpressionsSchemaSuite"
   $ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite"
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #37004: [SPARK-39618][SQL] Add the `REGEXP_COUNT` function

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on code in PR #37004:
URL: https://github.com/apache/spark/pull/37004#discussion_r908209618


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala:
##########
@@ -980,3 +980,38 @@ case class RegExpExtractAll(subject: Expression, regexp: Expression, idx: Expres
       newFirst: Expression, newSecond: Expression, newThird: Expression): RegExpExtractAll =
     copy(subject = newFirst, regexp = newSecond, idx = newThird)
 }
+
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = """
+    _FUNC_(str, regexp) - Returns a count of the number of times that the regular expression pattern `regexp` is matched in the string `str`.
+  """,
+  arguments = """
+    Arguments:
+      * str - a string expression.
+      * regexp - a string representing a regular expression. The regex string should be a
+          Java regular expression.
+  """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_('Steven Jones and Stephen Smith are the best players', 'Ste(v|ph)en');
+       2
+      > SELECT _FUNC_('abcdefghijklmnopqrstuvwxyz', '[a-z]{3}');
+       8
+  """,
+  since = "3.4.0",
+  group = "string_funcs")
+// scalastyle:on line.size.limit
+case class RegExpCount(left: Expression, right: Expression, replacement: Expression)
+  extends RuntimeReplaceable with InheritAnalysisRules {

Review Comment:
   Let me try to remove it ...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #37004: [SPARK-39618][SQL] Add the `REGEXP_COUNT` function

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37004:
URL: https://github.com/apache/spark/pull/37004#discussion_r908444535


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala:
##########
@@ -980,3 +980,41 @@ case class RegExpExtractAll(subject: Expression, regexp: Expression, idx: Expres
       newFirst: Expression, newSecond: Expression, newThird: Expression): RegExpExtractAll =
     copy(subject = newFirst, regexp = newSecond, idx = newThird)
 }
+
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = """
+    _FUNC_(str, regexp) - Returns a count of the number of times that the regular expression pattern `regexp` is matched in the string `str`.
+  """,
+  arguments = """
+    Arguments:
+      * str - a string expression.
+      * regexp - a string representing a regular expression. The regex string should be a
+          Java regular expression.
+  """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_('Steven Jones and Stephen Smith are the best players', 'Ste(v|ph)en');
+       2
+      > SELECT _FUNC_('abcdefghijklmnopqrstuvwxyz', '[a-z]{3}');
+       8
+  """,
+  since = "3.4.0",
+  group = "string_funcs")
+// scalastyle:on line.size.limit
+case class RegExpCount(left: Expression, right: Expression)
+  extends RuntimeReplaceable with ExpectsInputTypes {

Review Comment:
   shall we extend `ImplicitCastInputTypes` so that type coercion can happen?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #37004: [SPARK-39618][SQL] Add the `REGEXP_COUNT` function

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on PR #37004:
URL: https://github.com/apache/spark/pull/37004#issuecomment-1169041800

   Merging to master. Thank you, @cloud-fan for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #37004: [SPARK-39618][SQL] Add the `REGEXP_COUNT` function

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37004:
URL: https://github.com/apache/spark/pull/37004#discussion_r908193886


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala:
##########
@@ -980,3 +980,38 @@ case class RegExpExtractAll(subject: Expression, regexp: Expression, idx: Expres
       newFirst: Expression, newSecond: Expression, newThird: Expression): RegExpExtractAll =
     copy(subject = newFirst, regexp = newSecond, idx = newThird)
 }
+
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = """
+    _FUNC_(str, regexp) - Returns a count of the number of times that the regular expression pattern `regexp` is matched in the string `str`.
+  """,
+  arguments = """
+    Arguments:
+      * str - a string expression.
+      * regexp - a string representing a regular expression. The regex string should be a
+          Java regular expression.
+  """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_('Steven Jones and Stephen Smith are the best players', 'Ste(v|ph)en');
+       2
+      > SELECT _FUNC_('abcdefghijklmnopqrstuvwxyz', '[a-z]{3}');
+       8
+  """,
+  since = "3.4.0",
+  group = "string_funcs")
+// scalastyle:on line.size.limit
+case class RegExpCount(left: Expression, right: Expression, replacement: Expression)
+  extends RuntimeReplaceable with InheritAnalysisRules {

Review Comment:
   Do we need `InheritAnalysisRules`? Since the type coercion rule is very simple here: both of the 2 parameters should be string type.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #37004: [SPARK-39618][SQL] Add the `REGEXP_COUNT` function

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on code in PR #37004:
URL: https://github.com/apache/spark/pull/37004#discussion_r908275686


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala:
##########
@@ -980,3 +980,38 @@ case class RegExpExtractAll(subject: Expression, regexp: Expression, idx: Expres
       newFirst: Expression, newSecond: Expression, newThird: Expression): RegExpExtractAll =
     copy(subject = newFirst, regexp = newSecond, idx = newThird)
 }
+
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = """
+    _FUNC_(str, regexp) - Returns a count of the number of times that the regular expression pattern `regexp` is matched in the string `str`.
+  """,
+  arguments = """
+    Arguments:
+      * str - a string expression.
+      * regexp - a string representing a regular expression. The regex string should be a
+          Java regular expression.
+  """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_('Steven Jones and Stephen Smith are the best players', 'Ste(v|ph)en');
+       2
+      > SELECT _FUNC_('abcdefghijklmnopqrstuvwxyz', '[a-z]{3}');
+       8
+  """,
+  since = "3.4.0",
+  group = "string_funcs")
+// scalastyle:on line.size.limit
+case class RegExpCount(left: Expression, right: Expression, replacement: Expression)
+  extends RuntimeReplaceable with InheritAnalysisRules {

Review Comment:
   Probably, need to extend `ExpectsInputTypes`, I guess.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #37004: [SPARK-39618][SQL] Add the `REGEXP_COUNT` function

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on code in PR #37004:
URL: https://github.com/apache/spark/pull/37004#discussion_r908268235


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala:
##########
@@ -980,3 +980,38 @@ case class RegExpExtractAll(subject: Expression, regexp: Expression, idx: Expres
       newFirst: Expression, newSecond: Expression, newThird: Expression): RegExpExtractAll =
     copy(subject = newFirst, regexp = newSecond, idx = newThird)
 }
+
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = """
+    _FUNC_(str, regexp) - Returns a count of the number of times that the regular expression pattern `regexp` is matched in the string `str`.
+  """,
+  arguments = """
+    Arguments:
+      * str - a string expression.
+      * regexp - a string representing a regular expression. The regex string should be a
+          Java regular expression.
+  """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_('Steven Jones and Stephen Smith are the best players', 'Ste(v|ph)en');
+       2
+      > SELECT _FUNC_('abcdefghijklmnopqrstuvwxyz', '[a-z]{3}');
+       8
+  """,
+  since = "3.4.0",
+  group = "string_funcs")
+// scalastyle:on line.size.limit
+case class RegExpCount(left: Expression, right: Expression, replacement: Expression)
+  extends RuntimeReplaceable with InheritAnalysisRules {

Review Comment:
   I have removed `InheritAnalysisRules` and have to re-implement the expr to:
   ```scala
   case class RegExpCount(left: Expression, right: Expression) extends RuntimeReplaceable {
   
     override lazy val replacement: Expression =
       Size(RegExpExtractAll(left, right, Literal(0)), legacySizeOfNull = false)
   
     override def prettyName: String = "regexp_count"
   
     override def children: Seq[Expression] = Seq(left, right)
   
     override protected def withNewChildrenInternal(
         newChildren: IndexedSeq[Expression]): RegExpCount =
       copy(left = newChildren(0), right = newChildren(1))
   }
   ```
   and got the test failure on NULL handling:
   ```diff
    -- !query
    SELECT regexp_count(null, 'abc')
    -- !query schema
   -struct<regexp_count(NULL, abc):int>
   +struct<>
    -- !query output
   -NULL
   +java.lang.IllegalStateException
   +Illegal RuntimeReplaceable: regexp_count(null, abc)
   +Replacement is unresolved: size(regexp_extract_all(null, abc, 0), false)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk closed pull request #37004: [SPARK-39618][SQL] Add the `REGEXP_COUNT` function

Posted by GitBox <gi...@apache.org>.
MaxGekk closed pull request #37004: [SPARK-39618][SQL] Add the `REGEXP_COUNT` function
URL: https://github.com/apache/spark/pull/37004


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org