Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/02/09 08:05:35 UTC

[GitHub] [spark] beliefer opened a new pull request #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

beliefer opened a new pull request #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507
 
 
   ### What changes were proposed in this pull request?
   `regexp_extract_all` is a very useful function that expands the capabilities of `regexp_extract`.
   Some examples of this function:
   ```
   SELECT regexp_extract('1a 2b 14m', '\d+'); -- 1
   SELECT regexp_extract_all('1a 2b 14m', '\d+'); -- [1, 2, 14]
   SELECT regexp_extract('1a 2b 14m', '(\d+)([a-z]+)', 2); -- 'a'
   SELECT regexp_extract_all('1a 2b 14m', '(\d+)([a-z]+)', 2); -- ['a', 'b', 'm']
   ```
   Several mainstream systems support this syntax.
   **Presto:**
   https://prestodb.io/docs/current/functions/regexp.html
   
   **Pig:**
   https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html
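
The intended semantics can be sketched outside Spark with plain `java.util.regex`. This is a minimal illustration, not the code in this PR: the class and method names are hypothetical, the group index is always passed explicitly (0 meaning the whole match, as in `Matcher.group(int)`), and recording a non-participating group as an empty string is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the Spark implementation.
final class RegexpExtractAllSketch {

    // Collects group `idx` from every match of `regex` in `input`.
    // idx = 0 selects the whole match, as in Matcher.group(int).
    static List<String> extractAll(String input, String regex, int idx) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (idx < 0 || idx > m.groupCount()) {
            throw new IllegalArgumentException("group index out of range: " + idx);
        }
        List<String> out = new ArrayList<>();
        while (m.find()) {
            String g = m.group(idx);
            // Assumption: a group that did not participate in the match becomes "".
            out.add(g == null ? "" : g);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(extractAll("1a 2b 14m", "\\d+", 0));            // [1, 2, 14]
        System.out.println(extractAll("1a 2b 14m", "(\\d+)([a-z]+)", 2));  // [a, b, m]
    }
}
```

When nothing matches, the loop body never runs and the result is an empty list, which corresponds to the "empty array" behavior described for the new function.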
   
   ### Why are the changes needed?
   `regexp_extract_all` is a very useful function and makes work easier.
   
   
   ### Does this PR introduce any user-facing change?
   No
   
   
   ### How was this patch tested?
   New UT

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585246145
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583817457
 
 
   **[Test build #118091 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118091/testReport)** for PR 27507 at commit [`06bb690`](https://github.com/apache/spark/commit/06bb690dfd4e884c0db56af1f854cb0861bed810).



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585383256
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118304/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585249982
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23064/
   Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585738790
 
 
   **[Test build #118341 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118341/testReport)** for PR 27507 at commit [`5e3c092`](https://github.com/apache/spark/commit/5e3c092dc055ca0f1a2f523efa5f305555b991e6).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] beliefer edited a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer edited a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585518801
 
 
   > I have a high-level question. Is there a big advantage to generating Java code?
   > 
   > One advantage is that the generated code stores the result of `Pattern.compile()` in its own global variable for caching, while the non-generated code shares one variable for the cache.
   > On the other hand, the size of the result is not small. Which trade-off do we choose: space or performance?
   
   There is a discussion in https://github.com/apache/spark/pull/26875.
   All of the regex functions have this issue, and it is worth discussing.
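
For readers following the caching question, here is a minimal sketch of a single-entry pattern cache: it recompiles only when the regex string differs from the previous call. The class name and the compile counter are hypothetical and exist only for illustration; this is not Spark code, and like any such mutable cache it is not thread-safe.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical single-entry cache for illustration; not thread-safe.
final class LastPatternCacheSketch {
    private String lastRegex;   // regex string seen on the previous call
    private Pattern pattern;    // compiled form of lastRegex
    private int compileCount;   // how many times Pattern.compile actually ran

    // Returns a matcher for `input`, recompiling only if `regex` changed.
    Matcher matcherFor(String input, String regex) {
        if (!regex.equals(lastRegex)) {
            lastRegex = regex;
            pattern = Pattern.compile(regex);
            compileCount++;
        }
        return pattern.matcher(input);
    }

    int compileCount() {
        return compileCount;
    }

    public static void main(String[] args) {
        LastPatternCacheSketch cache = new LastPatternCacheSketch();
        cache.matcherFor("1a 2b", "\\d+");
        cache.matcherFor("14m", "\\d+");     // same regex: cached pattern is reused
        cache.matcherFor("14m", "[a-z]+");   // different regex: recompiled
        System.out.println(cache.compileCount()); // 2
    }
}
```

A cache like this pays off when the regex is a constant or changes rarely between rows; if the regex alternates on every row, each call recompiles and the cache gives no benefit.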



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585784282
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] SparkQA removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585002672
 
 
   **[Test build #118274 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118274/testReport)** for PR 27507 at commit [`1ed159f`](https://github.com/apache/spark/commit/1ed159f8a717d71060bcb4a3448ac7df552c4e68).



[GitHub] [spark] SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585601422
 
 
   **[Test build #118335 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118335/testReport)** for PR 27507 at commit [`5e3c092`](https://github.com/apache/spark/commit/5e3c092dc055ca0f1a2f523efa5f305555b991e6).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378292496
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/functions.scala
 ##########
 @@ -2383,6 +2383,17 @@ object functions {
     RegExpExtract(e.expr, lit(exp).expr, lit(groupIdx).expr)
   }
 
+  /**
+   * Extract all specific group matched by a Java regex, from the specified string column.
+   * If the regex did not match, or the specified group did not match, an empty array is returned.
 
 Review comment:
   let's document the behavior clearly.



[GitHub] [spark] cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378857718
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -419,39 +421,67 @@ object RegExpExtract {
   }
 }
 
+abstract class RegExpExtractBase extends TernaryExpression with ImplicitCastInputTypes {
+  def subject: Expression
+  def regexp: Expression
+  def idx: Expression
+
+  // last regex in string, we will update the pattern iff regexp value changed.
+  @transient private var lastRegex: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, IntegerType)
+  override def children: Seq[Expression] = subject :: regexp :: idx :: Nil
+
+  protected def getLastMatcher(s: Any, p: Any): Matcher = {
+    if (!p.equals(lastRegex)) {
+      // regex value changed
+      lastRegex = p.asInstanceOf[UTF8String].clone()
+      pattern = Pattern.compile(lastRegex.toString)
+    }
+    pattern.matcher(s.toString)
+  }
+}
+
 /**
  * Extract a specific(idx) group identified by a Java regex.
  *
  * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
  */
 @ExpressionDescription(
   usage = "_FUNC_(str, regexp[, idx]) - Extracts a group that matches `regexp`.",
 
 Review comment:
   `Extracts a group from the first match of regexp.`?



[GitHub] [spark] cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378240497
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/functions.scala
 ##########
 @@ -2383,6 +2383,17 @@ object functions {
     RegExpExtract(e.expr, lit(exp).expr, lit(groupIdx).expr)
   }
 
+  /**
+   * Extract all specific group matched by a Java regex, from the specified string column.
 
 Review comment:
   `groups`



[GitHub] [spark] cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r377613558
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -537,3 +557,71 @@ case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expressio
     })
   }
 }
+
+/**
+ * Extract all specific(idx) group identified by a Java regex.
+ *
+ * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, regexp[, idx]) - Extracts all group that matches `regexp`.",
 
 Review comment:
   Shall we explain the semantics of `idx`?



[GitHub] [spark] beliefer commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-587274746
 
 
   @kiszk Thanks for your suggestion. It is an alternative approach, but I think the current code is OK. If we plan to use the method you suggested, I think we should discuss it more broadly rather than modify only this place.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585173101
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23055/
   Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585081126
 
 
   **[Test build #118274 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118274/testReport)** for PR 27507 at commit [`1ed159f`](https://github.com/apache/spark/commit/1ed159f8a717d71060bcb4a3448ac7df552c4e68).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] kiszk commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
kiszk commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378371671
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -508,3 +535,99 @@ case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expressio
     })
   }
 }
+
+/**
+ * Extract all specific(idx) groups identified by a Java regex.
+ *
+ * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, regexp[, idx]) - Extracts all group that matches `regexp`.",
+  arguments = """
+    Arguments:
+      * str - a string expression of the input string.
+      * regexp - a string expression of the regex string.
+
+          Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL
+          parser. For example, to match "\abc", a regular expression for `regexp` can be
+          "^\\abc$".
+
+          There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
+          fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
+          if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
+      * idx - an int expression of the regex group index. The regex maybe contains multiple
 
 Review comment:
   nit: `maybe contains` -> `may contain`



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585173093
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585383244
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585784303
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23120/
   Test PASSed.



[GitHub] [spark] cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378858627
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -508,3 +535,98 @@ case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expressio
     })
   }
 }
+
+/**
+ * Extract all specific(idx) groups identified by a Java regex.
+ *
+ * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, regexp[, idx]) - Extracts all group that matches `regexp`.",
 
 Review comment:
   Can we refine it a little bit? What does "group" mean here when `idx` is specified, and when it is not?



[GitHub] [spark] SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585783549
 
 
   **[Test build #118363 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118363/testReport)** for PR 27507 at commit [`e9c5ef9`](https://github.com/apache/spark/commit/e9c5ef94934924472ae81c7a6e9c5f72bb08905a).



[GitHub] [spark] cloud-fan commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-586966345
 
 
   @kiszk I think one problem is that if we don't implement codegen, the child expressions can't be codegened either. Also, whole-stage codegen will be disabled if there are expressions that don't support codegen.



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583817611
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] beliefer commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-587385604
 
 
   > It would be good to read [this](https://www.bytefusion.de/2019/07/11/java-code-cache-full-how-to-measure-fill-level/).
   > The code cache is limited. Once the code cache is full, no further native code is generated.
   
   Thanks. I have learned from it.
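
As a rough illustration of the point about the code cache, the fill level can be read from inside the JVM through the standard `java.lang.management` API. This is only a sketch: the class name `CodeCacheUsage` is hypothetical, and the pool names it looks for ("Code Cache", "CodeHeap") are an assumption about HotSpot JVMs, so other JVMs may report nothing.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of Spark or of this PR.
final class CodeCacheUsage {
    // Sum the used bytes of all memory pools whose names look like
    // HotSpot code cache pools (assumed names: "Code Cache", "CodeHeap ...").
    static long usedBytes() {
        long used = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (name.contains("Code Cache") || name.contains("CodeHeap")) {
                MemoryUsage usage = pool.getUsage();
                if (usage != null) {
                    used += usage.getUsed();
                }
            }
        }
        return used;
    }

    public static void main(String[] args) {
        System.out.println("code cache used bytes: " + usedBytes());
    }
}
```

Comparing this value before and after running a query with many generated classes would give a first impression of how much code cache the generated code occupies.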



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585936349
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] SparkQA removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585245191
 
 
   **[Test build #118304 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118304/testReport)** for PR 27507 at commit [`e50f010`](https://github.com/apache/spark/commit/e50f010c661f68134a93ec9bd186c8f0bdbd4ca0).



[GitHub] [spark] SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585383320
 
 
   **[Test build #118306 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118306/testReport)** for PR 27507 at commit [`517251a`](https://github.com/apache/spark/commit/517251ad1072fc422f5971651df6747114b3d195).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378623722
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -508,3 +535,99 @@ case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expressio
     })
   }
 }
+
+/**
+ * Extract all specific(idx) groups identified by a Java regex.
+ *
+ * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, regexp[, idx]) - Extracts all group that matches `regexp`.",
+  arguments = """
+    Arguments:
+      * str - a string expression of the input string.
+      * regexp - a string expression of the regex string.
+
+          Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL
+          parser. For example, to match "\abc", a regular expression for `regexp` can be
+          "^\\abc$".
+
+          There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
+          fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
+          if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
+      * idx - an int expression of the regex group index. The regex maybe contains multiple
+          groups. `idx` indicates which regex group to extract.
+  """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_('100-200, 300-400', '(\\d+)-(\\d+)', 1);
+       ["100","300"]
+  """,
+  since = "3.0.0")
+case class RegExpExtractAll(subject: Expression, regexp: Expression, idx: Expression)
+  extends RegExpExtractBase {
+  def this(s: Expression, r: Expression) = this(s, r, Literal(1))
+
+  override def nullSafeEval(s: Any, p: Any, r: Any): Any = {
+    val m = getLastMatcher(s, p)
+    val matchResults = new ArrayBuffer[UTF8String]()
+    val mr: MatchResult = m.toMatchResult
+    while(m.find) {
+      val mr: MatchResult = m.toMatchResult
+      val index = r.asInstanceOf[Int]
+      RegExpExtractBase.checkGroupIndex(mr.groupCount, index)
+      val group = mr.group(index)
+      if (group == null) { // Pattern matched, but not optional group
 
 Review comment:
   This code is borrowed from `RegExpExtract`.
   I removed this branch and ran all the test cases, and found that the branch appears to be unreachable.
   I will remove this code.



[GitHub] [spark] beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378612641
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -508,3 +535,99 @@ case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expressio
     })
   }
 }
+
+/**
+ * Extract all specific(idx) groups identified by a Java regex.
+ *
+ * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, regexp[, idx]) - Extracts all group that matches `regexp`.",
+  arguments = """
+    Arguments:
+      * str - a string expression of the input string.
+      * regexp - a string expression of the regex string.
+
+          Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL
+          parser. For example, to match "\abc", a regular expression for `regexp` can be
+          "^\\abc$".
+
+          There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
+          fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
+          if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
+      * idx - an int expression of the regex group index. The regex maybe contains multiple
+          groups. `idx` indicates which regex group to extract.
+  """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_('100-200, 300-400', '(\\d+)-(\\d+)', 1);
+       ["100","300"]
+  """,
+  since = "3.0.0")
+case class RegExpExtractAll(subject: Expression, regexp: Expression, idx: Expression)
+  extends RegExpExtractBase {
+  def this(s: Expression, r: Expression) = this(s, r, Literal(1))
+
+  override def nullSafeEval(s: Any, p: Any, r: Any): Any = {
+    val m = getLastMatcher(s, p)
+    val matchResults = new ArrayBuffer[UTF8String]()
+    val mr: MatchResult = m.toMatchResult
 
 Review comment:
   I will remove it.



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585003249
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23033/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585289451
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118297/
   Test FAILed.



[GitHub] [spark] beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378890548
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -508,3 +535,98 @@ case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expressio
     })
   }
 }
+
+/**
+ * Extract all specific(idx) groups identified by a Java regex.
+ *
+ * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, regexp[, idx]) - Extracts all group that matches `regexp`.",
 
 Review comment:
   OK



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585003236
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583842007
 
 
   **[Test build #118096 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118096/testReport)** for PR 27507 at commit [`7a1c6d3`](https://github.com/apache/spark/commit/7a1c6d397b15673cf88da3416d144bb0405036b8).



[GitHub] [spark] SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585382436
 
 
   **[Test build #118304 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118304/testReport)** for PR 27507 at commit [`e50f010`](https://github.com/apache/spark/commit/e50f010c661f68134a93ec9bd186c8f0bdbd4ca0).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585289435
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583867818
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] kiszk commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
kiszk commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585879798
 
 
   I misunderstood one thing. As I now understand this PR, the pattern cache gives the same advantage with and without codegen.
   
   Do you have performance numbers with and without codegen? I think both can achieve similar performance when we execute the same SQL multiple times. Am I wrong?



[GitHub] [spark] cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378238116
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -419,39 +421,104 @@ object RegExpExtract {
   }
 }
 
+abstract class RegExpExtractBase extends TernaryExpression with ImplicitCastInputTypes {
+  def subject: Expression
+  def regexp: Expression
+  def idx: Expression
+
+  // last regex in string, we will update the pattern iff regexp value changed.
+  @transient private var lastRegex: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, IntegerType)
+  override def children: Seq[Expression] = subject :: regexp :: idx :: Nil
+
+  protected def getLastMatcher(s: Any, p: Any): Matcher = {
+    if (!p.equals(lastRegex)) {
+      // regex value changed
+      lastRegex = p.asInstanceOf[UTF8String].clone()
+      pattern = Pattern.compile(lastRegex.toString)
+    }
+    pattern.matcher(s.toString)
+  }
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val classNamePattern = classOf[Pattern].getCanonicalName
+    val classNameRegExpExtractBase = classOf[RegExpExtractBase].getCanonicalName
+    val matcher = ctx.freshName("matcher")
+    val matchResult = ctx.freshName("matchResult")
+
+    val termLastRegex = ctx.addMutableState("UTF8String", "lastRegex")
+    val termPattern = ctx.addMutableState(classNamePattern, "pattern")
+
+    val setEvNotNull = if (nullable) {
+      s"${ev.isNull} = false;"
+    } else {
+      ""
+    }
+    doNullSafeCodeGen(
+      ctx,
+      ev,
+      classNamePattern,
+      classNameRegExpExtractBase,
+      matcher,
+      matchResult,
+      termLastRegex,
+      termPattern,
+      setEvNotNull)
+  }
+
+  def doNullSafeCodeGen(
+      ctx: CodegenContext,
+      ev: ExprCode,
+      classNamePattern: String,
+      classNameRegExpExtractBase: String,
+      matcher: String,
+      matchResult: String,
+      termLastRegex: String,
+      termPattern: String,
+      setEvNotNull: String): ExprCode
+}
+
 /**
  * Extract a specific(idx) group identified by a Java regex.
  *
  * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
  */
 @ExpressionDescription(
   usage = "_FUNC_(str, regexp[, idx]) - Extracts a group that matches `regexp`.",
+  arguments = """
+    Arguments:
+      * str - a string expression
+      * regexp - a string expression. The regex string should be a Java regular expression.
+
+          Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL
+          parser. For example, to match "\abc", a regular expression for `regexp` can be
+          "^\\abc$".
+
+          There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
+          fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
+          if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
+      * idx - a int expression. The regex maybe contains multiple groups. `idx` represents the
+          index of regex group.
 
 Review comment:
   `\`idx\` indicates which regex group to extract.`



[GitHub] [spark] cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378416827
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -508,3 +535,99 @@ case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expressio
     })
   }
 }
+
+/**
+ * Extract all specific(idx) groups identified by a Java regex.
+ *
+ * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, regexp[, idx]) - Extracts all group that matches `regexp`.",
+  arguments = """
+    Arguments:
+      * str - a string expression of the input string.
+      * regexp - a string expression of the regex string.
+
+          Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL
+          parser. For example, to match "\abc", a regular expression for `regexp` can be
+          "^\\abc$".
+
+          There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
+          fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
+          if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
+      * idx - an int expression of the regex group index. The regex maybe contains multiple
+          groups. `idx` indicates which regex group to extract.
+  """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_('100-200, 300-400', '(\\d+)-(\\d+)', 1);
+       ["100","300"]
+  """,
+  since = "3.0.0")
+case class RegExpExtractAll(subject: Expression, regexp: Expression, idx: Expression)
+  extends RegExpExtractBase {
+  def this(s: Expression, r: Expression) = this(s, r, Literal(1))
+
+  override def nullSafeEval(s: Any, p: Any, r: Any): Any = {
+    val m = getLastMatcher(s, p)
+    val matchResults = new ArrayBuffer[UTF8String]()
+    val mr: MatchResult = m.toMatchResult
+    while(m.find) {
+      val mr: MatchResult = m.toMatchResult
+      val index = r.asInstanceOf[Int]
+      RegExpExtractBase.checkGroupIndex(mr.groupCount, index)
+      val group = mr.group(index)
+      if (group == null) { // Pattern matched, but not optional group
 
 Review comment:
   Do we have a test case for it?
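
One input that might exercise the null-group branch, sketched with plain `java.util.regex` rather than Spark: when a capturing group is optional and does not participate in a match, `Matcher.group(idx)` returns null for that match. The class and method names below are hypothetical, and the sketch simply keeps the null in the result list; how Spark should represent it in the returned array is left open here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the Spark implementation.
final class ExtractAllSketch {
    // Collect group `idx` of every match of `regex` in `input`.
    static List<String> extractAll(String input, String regex, int idx) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            // group(idx) is null when the group is optional and did not
            // participate in this particular match.
            out.add(m.group(idx));
        }
        return out;
    }

    public static void main(String[] args) {
        // Mirrors the example in the expression description.
        System.out.println(extractAll("100-200, 300-400", "(\\d+)-(\\d+)", 1)); // [100, 300]
        // The second match ("2") has no letters, so group 2 is null there.
        System.out.println(extractAll("1a 2 14m", "(\\d+)([a-z]+)?", 2)); // [a, null, m]
    }
}
```

A Spark test along the same lines would presumably use a pattern with an optional group, such as `(\d+)([a-z]+)?`, and ask for the optional group's index.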



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583842117
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585081783
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118274/
   Test FAILed.



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583867821
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118096/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585081778
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585384092
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118306/
   Test PASSed.



[GitHub] [spark] beliefer edited a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer edited a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585518801
 
 
   > I have a high-level question. Is there a significant advantage to generating Java code?
   > 
   > One advantage is that the generated code stores the result of `Pattern.compile()` in its own global variable for caching, while the non-generated code shares a single variable for the cache.
   > On the other hand, the size of the result is not small. Which trade-off do we choose: space or performance?
   
   LIKE and RLIKE cache the result of `Pattern.compile()`.
   RegExpReplace and RegExpExtract take a different approach: https://github.com/apache/spark/blob/5e3c092dc055ca0f1a2f523efa5f305555b991e6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L438.
   If the pattern string is a constant, the two approaches achieve the same goal.
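   
   For illustration only, here is a minimal sketch of the "last regex" caching idea, which recompiles only when the pattern string changes. `CachedMatcher` and `compileCount` are hypothetical names made up for this example, not Spark code; the real logic lives in `regexpExpressions.scala` and works on `UTF8String` values.
   
   ```scala
   import java.util.regex.{Matcher, Pattern}
   
   // Hypothetical helper (not part of Spark): remembers the last regex string
   // and its compiled Pattern, and recompiles only when the string changes.
   // Like the expression it imitates, it is not thread-safe.
   final class CachedMatcher {
     private var lastRegex: String = _
     private var pattern: Pattern = _
     // Number of compilations so far, exposed only to make the caching visible.
     var compileCount: Int = 0
   
     def matcher(input: String, regex: String): Matcher = {
       if (regex != lastRegex) {
         // The regex value changed (or this is the first call): recompile.
         lastRegex = regex
         pattern = Pattern.compile(regex)
         compileCount += 1
       }
       pattern.matcher(input)
     }
   }
   ```
   
   With a constant pattern string, repeated calls such as `cache.matcher(row, "\\d+")` should compile the pattern only once, which is why the two approaches end up equivalent in that case.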



[GitHub] [spark] beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378612746
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -508,3 +535,99 @@ case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expressio
     })
   }
 }
+
+/**
+ * Extract all specific(idx) groups identified by a Java regex.
+ *
+ * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, regexp[, idx]) - Extracts all group that matches `regexp`.",
+  arguments = """
+    Arguments:
+      * str - a string expression of the input string.
+      * regexp - a string expression of the regex string.
+
+          Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL
+          parser. For example, to match "\abc", a regular expression for `regexp` can be
+          "^\\abc$".
+
+          There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
+          fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
+          if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
+      * idx - an int expression of the regex group index. The regex maybe contains multiple
 
 Review comment:
   OK
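   
   As a rough sketch of the semantics described in the doc comment above, the following collects group `idx` from every match using plain `java.util.regex`. `ExtractAllSketch.extractAll` is a hypothetical helper written for this example and is not the implementation in this PR; it does no null handling and no validation of `idx`.
   
   ```scala
   import java.util.regex.Pattern
   import scala.collection.mutable.ArrayBuffer
   
   // Hypothetical helper (not the PR's implementation).
   object ExtractAllSketch {
     // Returns group `idx` of every match of `regex` in `input`, in order.
     // idx = 0 denotes the whole match (java.util.regex convention).
     // Matcher.group throws IndexOutOfBoundsException if idx exceeds the
     // number of groups in the pattern; that case is not handled here.
     def extractAll(input: String, regex: String, idx: Int): Seq[String] = {
       val m = Pattern.compile(regex).matcher(input)
       val out = ArrayBuffer.empty[String]
       while (m.find()) {
         out += m.group(idx)
       }
       out.toSeq
     }
   }
   ```
   
   For the input used in the PR description, `ExtractAllSketch.extractAll("1a 2b 14m", "(\\d+)([a-z]+)", 2)` yields `a`, `b`, `m`, and the same call with `"\\d+"` and `idx = 0` yields `1`, `2`, `14`.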



[GitHub] [spark] beliefer edited a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer edited a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-587274746
 
 
   @kiszk Thanks for your suggestion, which is an alternative approach. I think the current code is OK. If we plan to adopt the method you suggested, I think we should discuss it further rather than modify only this one place.



[GitHub] [spark] beliefer commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585601876
 
 
   retest this please



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585081778
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583836090
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585249966
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378240309
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -508,3 +568,96 @@ case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expressio
     })
   }
 }
+
+/**
+ * Extract all specific(idx) group identified by a Java regex.
 
 Review comment:
   `group` -> `groups`



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585173093
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585175290
 
 
   **[Test build #118297 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118297/testReport)** for PR 27507 at commit [`ea29d66`](https://github.com/apache/spark/commit/ea29d66b73af276ce7c17e176ac9a69a25c6dc06).



[GitHub] [spark] cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378238397
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -419,39 +421,104 @@ object RegExpExtract {
   }
 }
 
+abstract class RegExpExtractBase extends TernaryExpression with ImplicitCastInputTypes {
+  def subject: Expression
+  def regexp: Expression
+  def idx: Expression
+
+  // last regex in string, we will update the pattern iff regexp value changed.
+  @transient private var lastRegex: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, IntegerType)
+  override def children: Seq[Expression] = subject :: regexp :: idx :: Nil
+
+  protected def getLastMatcher(s: Any, p: Any): Matcher = {
+    if (!p.equals(lastRegex)) {
+      // regex value changed
+      lastRegex = p.asInstanceOf[UTF8String].clone()
+      pattern = Pattern.compile(lastRegex.toString)
+    }
+    pattern.matcher(s.toString)
+  }
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val classNamePattern = classOf[Pattern].getCanonicalName
+    val classNameRegExpExtractBase = classOf[RegExpExtractBase].getCanonicalName
+    val matcher = ctx.freshName("matcher")
+    val matchResult = ctx.freshName("matchResult")
+
+    val termLastRegex = ctx.addMutableState("UTF8String", "lastRegex")
+    val termPattern = ctx.addMutableState(classNamePattern, "pattern")
+
+    val setEvNotNull = if (nullable) {
+      s"${ev.isNull} = false;"
+    } else {
+      ""
+    }
+    doNullSafeCodeGen(
+      ctx,
+      ev,
+      classNamePattern,
+      classNameRegExpExtractBase,
+      matcher,
+      matchResult,
+      termLastRegex,
+      termPattern,
+      setEvNotNull)
+  }
+
+  def doNullSafeCodeGen(
+      ctx: CodegenContext,
+      ev: ExprCode,
+      classNamePattern: String,
+      classNameRegExpExtractBase: String,
+      matcher: String,
+      matchResult: String,
+      termLastRegex: String,
+      termPattern: String,
+      setEvNotNull: String): ExprCode
+}
+
 /**
  * Extract a specific(idx) group identified by a Java regex.
  *
  * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
  */
 @ExpressionDescription(
   usage = "_FUNC_(str, regexp[, idx]) - Extracts a group that matches `regexp`.",
+  arguments = """
+    Arguments:
+      * str - a string expression
 
 Review comment:
   `a string expression of the input string.`



[GitHub] [spark] SparkQA removed a comment on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583842007
 
 
   **[Test build #118096 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118096/testReport)** for PR 27507 at commit [`7a1c6d3`](https://github.com/apache/spark/commit/7a1c6d397b15673cf88da3416d144bb0405036b8).



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583842121
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22861/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585601658
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118335/
   Test FAILed.



[GitHub] [spark] SparkQA commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583867597
 
 
   **[Test build #118096 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118096/testReport)** for PR 27507 at commit [`7a1c6d3`](https://github.com/apache/spark/commit/7a1c6d397b15673cf88da3416d144bb0405036b8).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583842121
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22861/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585246151
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23062/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585936349
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378275154
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -508,3 +568,96 @@ case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expressio
     })
   }
 }
+
+/**
+ * Extract all specific(idx) group identified by a Java regex.
 
 Review comment:
   OK



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583817611
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583842117
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378009621
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -537,3 +557,71 @@ case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expressio
     })
   }
 }
+
+/**
+ * Extract all specific(idx) group identified by a Java regex.
+ *
+ * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, regexp[, idx]) - Extracts all group that matches `regexp`.",
 
 Review comment:
   OK.



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585289435
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585384092
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118306/
   Test PASSed.



[GitHub] [spark] beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378273173
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -419,39 +421,104 @@ object RegExpExtract {
   }
 }
 
+abstract class RegExpExtractBase extends TernaryExpression with ImplicitCastInputTypes {
+  def subject: Expression
+  def regexp: Expression
+  def idx: Expression
+
+  // last regex in string, we will update the pattern iff regexp value changed.
+  @transient private var lastRegex: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, IntegerType)
+  override def children: Seq[Expression] = subject :: regexp :: idx :: Nil
+
+  protected def getLastMatcher(s: Any, p: Any): Matcher = {
+    if (!p.equals(lastRegex)) {
+      // regex value changed
+      lastRegex = p.asInstanceOf[UTF8String].clone()
+      pattern = Pattern.compile(lastRegex.toString)
+    }
+    pattern.matcher(s.toString)
+  }
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val classNamePattern = classOf[Pattern].getCanonicalName
+    val classNameRegExpExtractBase = classOf[RegExpExtractBase].getCanonicalName
+    val matcher = ctx.freshName("matcher")
+    val matchResult = ctx.freshName("matchResult")
+
+    val termLastRegex = ctx.addMutableState("UTF8String", "lastRegex")
+    val termPattern = ctx.addMutableState(classNamePattern, "pattern")
+
+    val setEvNotNull = if (nullable) {
+      s"${ev.isNull} = false;"
+    } else {
+      ""
+    }
+    doNullSafeCodeGen(
+      ctx,
+      ev,
+      classNamePattern,
+      classNameRegExpExtractBase,
+      matcher,
+      matchResult,
+      termLastRegex,
+      termPattern,
+      setEvNotNull)
+  }
+
+  def doNullSafeCodeGen(
+      ctx: CodegenContext,
+      ev: ExprCode,
+      classNamePattern: String,
+      classNameRegExpExtractBase: String,
+      matcher: String,
+      matchResult: String,
+      termLastRegex: String,
+      termPattern: String,
+      setEvNotNull: String): ExprCode
+}
+
 /**
  * Extract a specific(idx) group identified by a Java regex.
  *
  * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
  */
 @ExpressionDescription(
   usage = "_FUNC_(str, regexp[, idx]) - Extracts a group that matches `regexp`.",
+  arguments = """
+    Arguments:
+      * str - a string expression
 
 Review comment:
   OK



[GitHub] [spark] beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378623722
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -508,3 +535,99 @@ case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expressio
     })
   }
 }
+
+/**
+ * Extract all specific(idx) groups identified by a Java regex.
+ *
+ * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, regexp[, idx]) - Extracts all group that matches `regexp`.",
+  arguments = """
+    Arguments:
+      * str - a string expression of the input string.
+      * regexp - a string expression of the regex string.
+
+          Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL
+          parser. For example, to match "\abc", a regular expression for `regexp` can be
+          "^\\abc$".
+
+          There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
+          fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
+          if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
+      * idx - an int expression of the regex group index. The regex maybe contains multiple
+          groups. `idx` indicates which regex group to extract.
+  """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_('100-200, 300-400', '(\\d+)-(\\d+)', 1);
+       ["100","300"]
+  """,
+  since = "3.0.0")
+case class RegExpExtractAll(subject: Expression, regexp: Expression, idx: Expression)
+  extends RegExpExtractBase {
+  def this(s: Expression, r: Expression) = this(s, r, Literal(1))
+
+  override def nullSafeEval(s: Any, p: Any, r: Any): Any = {
+    val m = getLastMatcher(s, p)
+    val matchResults = new ArrayBuffer[UTF8String]()
+    val mr: MatchResult = m.toMatchResult
+    while(m.find) {
+      val mr: MatchResult = m.toMatchResult
+      val index = r.asInstanceOf[Int]
+      RegExpExtractBase.checkGroupIndex(mr.groupCount, index)
+      val group = mr.group(index)
+      if (group == null) { // Pattern matched, but not optional group
 
 Review comment:
   This code is borrowed from `RegExpExtract`.
   I removed this branch and ran all the test cases, and found that the branch is unreachable.
   I will remove this code.
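
For reference on the `group == null` branch discussed here, the following is a minimal sketch in plain `java.util.regex`, independent of Spark. It shows the one case in which `Matcher.group(idx)` returns null after a successful match: an optional group that did not take part in it. The object name `OptionalGroupDemo`, the helper `firstGroup` and the sample pattern `(a)?b` are assumptions made for illustration and do not come from the PR.

```scala
import java.util.regex.Pattern

object OptionalGroupDemo {
  // Returns the text captured by group `idx` for the first match, or None
  // when nothing matches or the group did not participate in the match.
  def firstGroup(input: String, regex: String, idx: Int): Option[String] = {
    val m = Pattern.compile(regex).matcher(input)
    if (m.find()) Option(m.group(idx)) else None
  }

  def main(args: Array[String]): Unit = {
    // The optional group (a)? takes part in the match here.
    println(firstGroup("ab", "(a)?b", 1))
    // The pattern still matches "b", but group 1 captured nothing, so
    // Matcher.group(1) returns null and Option(...) turns it into None.
    println(firstGroup("b", "(a)?b", 1))
  }
}
```

Whether the existing tests reach such a branch would depend on whether any test pattern contains an optional group.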



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585249982
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23064/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585544654
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585601658
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118335/
   Test FAILed.



[GitHub] [spark] SparkQA commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583836013
 
 
   **[Test build #118091 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118091/testReport)** for PR 27507 at commit [`06bb690`](https://github.com/apache/spark/commit/06bb690dfd4e884c0db56af1f854cb0861bed810).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `abstract class RegExpExtractBase extends TernaryExpression with ImplicitCastInputTypes `
     * `case class RegExpExtractAll(subject: Expression, regexp: Expression, idx: Expression)`



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585936355
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118363/
   Test PASSed.



[GitHub] [spark] cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378238116
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -419,39 +421,104 @@ object RegExpExtract {
   }
 }
 
+abstract class RegExpExtractBase extends TernaryExpression with ImplicitCastInputTypes {
+  def subject: Expression
+  def regexp: Expression
+  def idx: Expression
+
+  // last regex in string, we will update the pattern iff regexp value changed.
+  @transient private var lastRegex: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, IntegerType)
+  override def children: Seq[Expression] = subject :: regexp :: idx :: Nil
+
+  protected def getLastMatcher(s: Any, p: Any): Matcher = {
+    if (!p.equals(lastRegex)) {
+      // regex value changed
+      lastRegex = p.asInstanceOf[UTF8String].clone()
+      pattern = Pattern.compile(lastRegex.toString)
+    }
+    pattern.matcher(s.toString)
+  }
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val classNamePattern = classOf[Pattern].getCanonicalName
+    val classNameRegExpExtractBase = classOf[RegExpExtractBase].getCanonicalName
+    val matcher = ctx.freshName("matcher")
+    val matchResult = ctx.freshName("matchResult")
+
+    val termLastRegex = ctx.addMutableState("UTF8String", "lastRegex")
+    val termPattern = ctx.addMutableState(classNamePattern, "pattern")
+
+    val setEvNotNull = if (nullable) {
+      s"${ev.isNull} = false;"
+    } else {
+      ""
+    }
+    doNullSafeCodeGen(
+      ctx,
+      ev,
+      classNamePattern,
+      classNameRegExpExtractBase,
+      matcher,
+      matchResult,
+      termLastRegex,
+      termPattern,
+      setEvNotNull)
+  }
+
+  def doNullSafeCodeGen(
+      ctx: CodegenContext,
+      ev: ExprCode,
+      classNamePattern: String,
+      classNameRegExpExtractBase: String,
+      matcher: String,
+      matchResult: String,
+      termLastRegex: String,
+      termPattern: String,
+      setEvNotNull: String): ExprCode
+}
+
 /**
  * Extract a specific(idx) group identified by a Java regex.
  *
  * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
  */
 @ExpressionDescription(
   usage = "_FUNC_(str, regexp[, idx]) - Extracts a group that matches `regexp`.",
+  arguments = """
+    Arguments:
+      * str - a string expression
+      * regexp - a string expression. The regex string should be a Java regular expression.
+
+          Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL
+          parser. For example, to match "\abc", a regular expression for `regexp` can be
+          "^\\abc$".
+
+          There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
+          fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
+          if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
+      * idx - a int expression. The regex maybe contains multiple groups. `idx` represents the
+          index of regex group.
 
 Review comment:
   ` \`idx\` indicates which regex group to extract.`



[GitHub] [spark] SparkQA removed a comment on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583817457
 
 
   **[Test build #118091 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118091/testReport)** for PR 27507 at commit [`06bb690`](https://github.com/apache/spark/commit/06bb690dfd4e884c0db56af1f854cb0861bed810).



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585936355
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118363/
   Test PASSed.



[GitHub] [spark] cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378238547
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -419,39 +421,104 @@ object RegExpExtract {
   }
 }
 
+abstract class RegExpExtractBase extends TernaryExpression with ImplicitCastInputTypes {
+  def subject: Expression
+  def regexp: Expression
+  def idx: Expression
+
+  // last regex in string, we will update the pattern iff regexp value changed.
+  @transient private var lastRegex: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, IntegerType)
+  override def children: Seq[Expression] = subject :: regexp :: idx :: Nil
+
+  protected def getLastMatcher(s: Any, p: Any): Matcher = {
+    if (!p.equals(lastRegex)) {
+      // regex value changed
+      lastRegex = p.asInstanceOf[UTF8String].clone()
+      pattern = Pattern.compile(lastRegex.toString)
+    }
+    pattern.matcher(s.toString)
+  }
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val classNamePattern = classOf[Pattern].getCanonicalName
+    val classNameRegExpExtractBase = classOf[RegExpExtractBase].getCanonicalName
+    val matcher = ctx.freshName("matcher")
+    val matchResult = ctx.freshName("matchResult")
+
+    val termLastRegex = ctx.addMutableState("UTF8String", "lastRegex")
+    val termPattern = ctx.addMutableState(classNamePattern, "pattern")
+
+    val setEvNotNull = if (nullable) {
+      s"${ev.isNull} = false;"
+    } else {
+      ""
+    }
+    doNullSafeCodeGen(
+      ctx,
+      ev,
+      classNamePattern,
+      classNameRegExpExtractBase,
+      matcher,
+      matchResult,
+      termLastRegex,
+      termPattern,
+      setEvNotNull)
+  }
+
+  def doNullSafeCodeGen(
+      ctx: CodegenContext,
+      ev: ExprCode,
+      classNamePattern: String,
+      classNameRegExpExtractBase: String,
+      matcher: String,
+      matchResult: String,
+      termLastRegex: String,
+      termPattern: String,
+      setEvNotNull: String): ExprCode
+}
+
 /**
  * Extract a specific(idx) group identified by a Java regex.
  *
  * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
  */
 @ExpressionDescription(
   usage = "_FUNC_(str, regexp[, idx]) - Extracts a group that matches `regexp`.",
+  arguments = """
+    Arguments:
+      * str - a string expression
+      * regexp - a string expression. The regex string should be a Java regular expression.
 
 Review comment:
   `a string expression of the regex string.`



[GitHub] [spark] beliefer edited a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer edited a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585518801
 
 
   > I have a high-level question. Do we have huge advantage to generate Java code?
   > 
   > One advantage is to store the result of `Pattern.compile()` into each global variable for caching while the non-generated code shares one variable for cache.
   > On the other hand, the size of the result is not small. Which trade-off do we select? Space or performance?
   
   LIKE and RLIKE cache the result of `Pattern.compile()`.
   RegExpReplace and RegExpExtract take a different approach: https://github.com/apache/spark/blob/5e3c092dc055ca0f1a2f523efa5f305555b991e6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L438.
   If the pattern string is a constant, the two approaches achieve the same goal.
   If the pattern string is a variable, the performance cost seems unavoidable.
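
The trade-off described above can be sketched outside Spark with a small "last pattern" cache, modelled loosely on `getLastMatcher` in the diff. This is only an illustration: the class name `LastPatternCache` and the `compileCount` counter are hypothetical, and a plain `String` stands in for `UTF8String`.

```scala
import java.util.regex.{Matcher, Pattern}

// Hypothetical helper: remembers the last regex string and its compiled
// Pattern, and recompiles only when a different regex string arrives.
final class LastPatternCache {
  private var lastRegex: String = null
  private var pattern: Pattern = null
  // Counts compilations so the effect of the cache can be observed.
  var compileCount: Int = 0

  def matcher(input: String, regex: String): Matcher = {
    if (regex != lastRegex) {
      // The regex value changed, so compile it again.
      lastRegex = regex
      pattern = Pattern.compile(regex)
      compileCount += 1
    }
    pattern.matcher(input)
  }
}
```

With a constant regex the pattern is compiled once no matter how many rows are processed. If the regex comes from a column whose value changes from row to row, each change triggers another compilation, which is the cost referred to above.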



[GitHub] [spark] cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r377613299
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -452,6 +454,46 @@ case class RegExpReplace(subject: Expression, regexp: Expression, rep: Expressio
   }
 }
 
+abstract class RegExpExtractBase extends TernaryExpression with ImplicitCastInputTypes {
+  def subject: Expression
+  def regexp: Expression
+  def idx: Expression
+
+  // last regex in string, we will update the pattern iff regexp value changed.
+  @transient private var lastRegex: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, IntegerType)
+  override def children: Seq[Expression] = subject :: regexp :: idx :: Nil
+
+  protected def getLastMatcher(s: Any, p: Any): Matcher = {
+    if (!p.equals(lastRegex)) {
+      // regex value changed
+      lastRegex = p.asInstanceOf[UTF8String].clone()
+      pattern = Pattern.compile(lastRegex.toString)
+    }
+    pattern.matcher(s.toString)
+  }
+
+  protected def getGenCodeVals(ctx: CodegenContext, ev: ExprCode) = {
+    val classNamePattern = classOf[Pattern].getCanonicalName
+    val matcher = ctx.freshName("matcher")
+    val matchResult = ctx.freshName("matchResult")
+
+    val termLastRegex = ctx.addMutableState("UTF8String", "lastRegex")
+    val termPattern = ctx.addMutableState(classNamePattern, "pattern")
+
+    val setEvNotNull = if (nullable) {
+      s"${ev.isNull} = false;"
+    } else {
+      ""
+    }
+
+    (classNamePattern, matcher, matchResult, termLastRegex, termPattern, setEvNotNull)
 
 Review comment:
   This is a little hard to read on the caller side.
   
   Can we implement `doGenCode` in the base class and have it call an abstract method that the sub-classes implement?
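
The structure suggested here is the template-method pattern. Below is a minimal sketch of its shape in plain Scala, with no Spark dependency; the names `ExtractBase`, `ExtractOne`, `ExtractEvery`, `genCode` and `genBody` are hypothetical stand-ins, and the strings are placeholders rather than real generated code.

```scala
// Hypothetical stand-ins for the real codegen types; only the shape matters.
abstract class ExtractBase {
  // Shared setup lives in one place, in the base class.
  final def genCode(name: String): String = {
    val matcher = s"${name}_matcher"
    val result = s"${name}_result"
    genBody(matcher, result)
  }

  // Each sub-class fills in only the part that differs.
  protected def genBody(matcher: String, result: String): String
}

final class ExtractOne extends ExtractBase {
  protected def genBody(matcher: String, result: String): String =
    s"if ($matcher.find()) { $result = $matcher.group(1); }"
}

final class ExtractEvery extends ExtractBase {
  protected def genBody(matcher: String, result: String): String =
    s"while ($matcher.find()) { $result.add($matcher.group(1)); }"
}
```

Callers only ever see `genCode`, so no tuple of intermediate names has to be unpacked at the call site.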



[GitHub] [spark] beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378890420
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/functions.scala
 ##########
 @@ -2383,6 +2385,19 @@ object functions {
     RegExpExtract(e.expr, lit(exp).expr, lit(groupIdx).expr)
   }
 
+  /**
+   * Extract all specific groups matched by a Java regex, from the specified string column.
+   * If the regex did not match, or the specified group did not match, return an empty array.
+   * if the specified group index exceeds the group count of regex, an IllegalArgumentException
+   * will be throwing.
 
 Review comment:
   OK
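
The behaviour described in the doc comment under review can be sketched with plain `java.util.regex`. This is not the PR's implementation: `ExtractAllSketch` and `extractAll` are hypothetical names, the group index is validated once before matching, and skipping a group that did not participate in a match is an assumption made to agree with the "return an empty array" wording.

```scala
import java.util.regex.Pattern
import scala.collection.mutable.ArrayBuffer

object ExtractAllSketch {
  // Collects group `idx` from every match of `regex` in `input`.
  def extractAll(input: String, regex: String, idx: Int): Seq[String] = {
    val m = Pattern.compile(regex).matcher(input)
    // Validate the index up front, so the error does not depend on the input.
    if (idx < 0 || idx > m.groupCount()) {
      throw new IllegalArgumentException(
        s"Regex group count is ${m.groupCount()}, but the specified group index is $idx")
    }
    val out = ArrayBuffer.empty[String]
    while (m.find()) {
      val g = m.group(idx)
      // Assumption: a group that did not participate in this match is skipped.
      if (g != null) out += g
    }
    out.toSeq
  }
}
```

Under these assumptions, `extractAll("1a 2b 14m", "(\\d+)([a-z]+)", 2)` yields `a`, `b`, `m`, in line with the example in the PR description, and an input with no match yields an empty sequence.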



[GitHub] [spark] SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585249378
 
 
   **[Test build #118306 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118306/testReport)** for PR 27507 at commit [`517251a`](https://github.com/apache/spark/commit/517251ad1072fc422f5971651df6747114b3d195).



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585784303
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23120/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585249966
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585383256
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118304/
   Test PASSed.



[GitHub] [spark] cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378238116
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -419,39 +421,104 @@ object RegExpExtract {
   }
 }
 
+abstract class RegExpExtractBase extends TernaryExpression with ImplicitCastInputTypes {
+  def subject: Expression
+  def regexp: Expression
+  def idx: Expression
+
+  // last regex in string, we will update the pattern iff regexp value changed.
+  @transient private var lastRegex: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, IntegerType)
+  override def children: Seq[Expression] = subject :: regexp :: idx :: Nil
+
+  protected def getLastMatcher(s: Any, p: Any): Matcher = {
+    if (!p.equals(lastRegex)) {
+      // regex value changed
+      lastRegex = p.asInstanceOf[UTF8String].clone()
+      pattern = Pattern.compile(lastRegex.toString)
+    }
+    pattern.matcher(s.toString)
+  }
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val classNamePattern = classOf[Pattern].getCanonicalName
+    val classNameRegExpExtractBase = classOf[RegExpExtractBase].getCanonicalName
+    val matcher = ctx.freshName("matcher")
+    val matchResult = ctx.freshName("matchResult")
+
+    val termLastRegex = ctx.addMutableState("UTF8String", "lastRegex")
+    val termPattern = ctx.addMutableState(classNamePattern, "pattern")
+
+    val setEvNotNull = if (nullable) {
+      s"${ev.isNull} = false;"
+    } else {
+      ""
+    }
+    doNullSafeCodeGen(
+      ctx,
+      ev,
+      classNamePattern,
+      classNameRegExpExtractBase,
+      matcher,
+      matchResult,
+      termLastRegex,
+      termPattern,
+      setEvNotNull)
+  }
+
+  def doNullSafeCodeGen(
+      ctx: CodegenContext,
+      ev: ExprCode,
+      classNamePattern: String,
+      classNameRegExpExtractBase: String,
+      matcher: String,
+      matchResult: String,
+      termLastRegex: String,
+      termPattern: String,
+      setEvNotNull: String): ExprCode
+}
+
 /**
  * Extract a specific(idx) group identified by a Java regex.
  *
  * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
  */
 @ExpressionDescription(
   usage = "_FUNC_(str, regexp[, idx]) - Extracts a group that matches `regexp`.",
+  arguments = """
+    Arguments:
+      * str - a string expression
+      * regexp - a string expression. The regex string should be a Java regular expression.
+
+          Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL
+          parser. For example, to match "\abc", a regular expression for `regexp` can be
+          "^\\abc$".
+
+          There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
+          fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
+          if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
+      * idx - a int expression. The regex maybe contains multiple groups. `idx` represents the
+          index of regex group.
 
 Review comment:
   `index of regex group to extract.`



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583836090
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585003236
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583817614
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22856/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583867818
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585003249
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23033/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585602645
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23098/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583867821
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118096/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585739502
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378002552
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -452,6 +454,46 @@ case class RegExpReplace(subject: Expression, regexp: Expression, rep: Expressio
   }
 }
 
+abstract class RegExpExtractBase extends TernaryExpression with ImplicitCastInputTypes {
+  def subject: Expression
+  def regexp: Expression
+  def idx: Expression
+
+  // last regex in string, we will update the pattern iff regexp value changed.
+  @transient private var lastRegex: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, IntegerType)
+  override def children: Seq[Expression] = subject :: regexp :: idx :: Nil
+
+  protected def getLastMatcher(s: Any, p: Any): Matcher = {
+    if (!p.equals(lastRegex)) {
+      // regex value changed
+      lastRegex = p.asInstanceOf[UTF8String].clone()
+      pattern = Pattern.compile(lastRegex.toString)
+    }
+    pattern.matcher(s.toString)
+  }
+
+  protected def getGenCodeVals(ctx: CodegenContext, ev: ExprCode) = {
+    val classNamePattern = classOf[Pattern].getCanonicalName
+    val matcher = ctx.freshName("matcher")
+    val matchResult = ctx.freshName("matchResult")
+
+    val termLastRegex = ctx.addMutableState("UTF8String", "lastRegex")
+    val termPattern = ctx.addMutableState(classNamePattern, "pattern")
+
+    val setEvNotNull = if (nullable) {
+      s"${ev.isNull} = false;"
+    } else {
+      ""
+    }
+
+    (classNamePattern, matcher, matchResult, termLastRegex, termPattern, setEvNotNull)
 
 Review comment:
   OK. Good idea.



[GitHub] [spark] beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378273983
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -419,39 +421,104 @@ object RegExpExtract {
   }
 }
 
+abstract class RegExpExtractBase extends TernaryExpression with ImplicitCastInputTypes {
+  def subject: Expression
+  def regexp: Expression
+  def idx: Expression
+
+  // last regex in string, we will update the pattern iff regexp value changed.
+  @transient private var lastRegex: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, IntegerType)
+  override def children: Seq[Expression] = subject :: regexp :: idx :: Nil
+
+  protected def getLastMatcher(s: Any, p: Any): Matcher = {
+    if (!p.equals(lastRegex)) {
+      // regex value changed
+      lastRegex = p.asInstanceOf[UTF8String].clone()
+      pattern = Pattern.compile(lastRegex.toString)
+    }
+    pattern.matcher(s.toString)
+  }
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val classNamePattern = classOf[Pattern].getCanonicalName
+    val classNameRegExpExtractBase = classOf[RegExpExtractBase].getCanonicalName
+    val matcher = ctx.freshName("matcher")
+    val matchResult = ctx.freshName("matchResult")
+
+    val termLastRegex = ctx.addMutableState("UTF8String", "lastRegex")
+    val termPattern = ctx.addMutableState(classNamePattern, "pattern")
+
+    val setEvNotNull = if (nullable) {
+      s"${ev.isNull} = false;"
+    } else {
+      ""
+    }
+    doNullSafeCodeGen(
+      ctx,
+      ev,
+      classNamePattern,
+      classNameRegExpExtractBase,
+      matcher,
+      matchResult,
+      termLastRegex,
+      termPattern,
+      setEvNotNull)
+  }
+
+  def doNullSafeCodeGen(
+      ctx: CodegenContext,
+      ev: ExprCode,
+      classNamePattern: String,
+      classNameRegExpExtractBase: String,
+      matcher: String,
+      matchResult: String,
+      termLastRegex: String,
+      termPattern: String,
+      setEvNotNull: String): ExprCode
+}
+
 /**
  * Extract a specific(idx) group identified by a Java regex.
  *
  * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
  */
 @ExpressionDescription(
   usage = "_FUNC_(str, regexp[, idx]) - Extracts a group that matches `regexp`.",
+  arguments = """
+    Arguments:
+      * str - a string expression
+      * regexp - a string expression. The regex string should be a Java regular expression.
 
 Review comment:
   OK. This just follows the doc comment of `RLIKE`.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585739502
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585173101
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23055/
   Test PASSed.



[GitHub] [spark] SparkQA removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585249378
 
 
   **[Test build #118306 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118306/testReport)** for PR 27507 at commit [`517251a`](https://github.com/apache/spark/commit/517251ad1072fc422f5971651df6747114b3d195).



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585602633
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585246145
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585602633
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585739514
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118341/
   Test PASSed.



[GitHub] [spark] SparkQA removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585544384
 
 
   **[Test build #118335 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118335/testReport)** for PR 27507 at commit [`5e3c092`](https://github.com/apache/spark/commit/5e3c092dc055ca0f1a2f523efa5f305555b991e6).



[GitHub] [spark] cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378416104
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -508,3 +535,99 @@ case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expressio
     })
   }
 }
+
+/**
+ * Extract all specific(idx) groups identified by a Java regex.
+ *
+ * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, regexp[, idx]) - Extracts all group that matches `regexp`.",
+  arguments = """
+    Arguments:
+      * str - a string expression of the input string.
+      * regexp - a string expression of the regex string.
+
+          Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL
+          parser. For example, to match "\abc", a regular expression for `regexp` can be
+          "^\\abc$".
+
+          There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
+          fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
+          if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
+      * idx - an int expression of the regex group index. The regex maybe contains multiple
+          groups. `idx` indicates which regex group to extract.
+  """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_('100-200, 300-400', '(\\d+)-(\\d+)', 1);
+       ["100","300"]
+  """,
+  since = "3.0.0")
+case class RegExpExtractAll(subject: Expression, regexp: Expression, idx: Expression)
+  extends RegExpExtractBase {
+  def this(s: Expression, r: Expression) = this(s, r, Literal(1))
+
+  override def nullSafeEval(s: Any, p: Any, r: Any): Any = {
+    val m = getLastMatcher(s, p)
+    val matchResults = new ArrayBuffer[UTF8String]()
+    val mr: MatchResult = m.toMatchResult
 
 Review comment:
   Where do we use this `mr`?
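
For context on the question above, here is a minimal standalone sketch of an extract-all loop. It is not the PR's code. `ExtractAllSketch` and `extractAll` are hypothetical names, and mapping a non-participating group to an empty string is an assumption of this sketch. It shows that each call to `Matcher.find()` advances to the next match and that `group(idx)` can be read directly from the matcher, so a `MatchResult` snapshot taken before the first `find()` call would not be needed.

```scala
import java.util.regex.Pattern
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper for illustration only; it does not depend on Spark.
object ExtractAllSketch {
  // Collects capture group `idx` from every match of `regex` in `input`.
  def extractAll(input: String, regex: String, idx: Int): Seq[String] = {
    val m = Pattern.compile(regex).matcher(input)
    val results = new ArrayBuffer[String]()
    // find() moves the matcher to the next match, if there is one.
    while (m.find()) {
      val g = m.group(idx)
      // A group that did not take part in the match is null.
      // Assumption of this sketch: represent it as an empty string.
      results += (if (g == null) "" else g)
    }
    results.toSeq
  }
}
```

With the example from the expression description, `ExtractAllSketch.extractAll("100-200, 300-400", "(\\d+)-(\\d+)", 1)` collects the first group of each match, and index 0 would collect each whole match.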



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585739514
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118341/
   Test PASSed.



[GitHub] [spark] cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378237273
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -419,39 +421,104 @@ object RegExpExtract {
   }
 }
 
+abstract class RegExpExtractBase extends TernaryExpression with ImplicitCastInputTypes {
+  def subject: Expression
+  def regexp: Expression
+  def idx: Expression
+
+  // last regex in string, we will update the pattern iff regexp value changed.
+  @transient private var lastRegex: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, IntegerType)
+  override def children: Seq[Expression] = subject :: regexp :: idx :: Nil
+
+  protected def getLastMatcher(s: Any, p: Any): Matcher = {
+    if (!p.equals(lastRegex)) {
+      // regex value changed
+      lastRegex = p.asInstanceOf[UTF8String].clone()
+      pattern = Pattern.compile(lastRegex.toString)
+    }
+    pattern.matcher(s.toString)
+  }
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val classNamePattern = classOf[Pattern].getCanonicalName
+    val classNameRegExpExtractBase = classOf[RegExpExtractBase].getCanonicalName
+    val matcher = ctx.freshName("matcher")
+    val matchResult = ctx.freshName("matchResult")
+
+    val termLastRegex = ctx.addMutableState("UTF8String", "lastRegex")
+    val termPattern = ctx.addMutableState(classNamePattern, "pattern")
+
+    val setEvNotNull = if (nullable) {
+      s"${ev.isNull} = false;"
+    } else {
+      ""
+    }
 
 Review comment:
   TBH I don't think there is much common code to share. Maybe we can have a
   `protected def setNotNullCode(ev: ExprCode) = ...` but that's all.
   
   How about we just let each sub-class implement `doGenCode` individually?
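
The layout suggested above could look roughly like the following sketch. It is illustrative only and does not use Spark's classes. `EvStub` is a hypothetical stand-in for `ExprCode`, and `RegexBaseSketch`, `SingleSketch`, and `AllSketch` are made-up names. The base class keeps only the cached pattern and a small `setNotNullCode` helper, and each subclass builds its own generated-code string.

```scala
import java.util.regex.{Matcher, Pattern}

// Hypothetical stand-in for Spark's ExprCode; only the field used here is modeled.
final case class EvStub(isNull: String)

abstract class RegexBaseSketch(val nullable: Boolean) {
  // Last regex string seen and its compiled pattern, cached between calls.
  private var lastRegex: String = null
  private var pattern: Pattern = null

  // Recompiles the pattern only when the regex string changes.
  def getLastMatcher(s: String, p: String): Matcher = {
    if (p != lastRegex) {
      lastRegex = p
      pattern = Pattern.compile(p)
    }
    pattern.matcher(s)
  }

  // The one shared codegen snippet mentioned in the review comment.
  protected def setNotNullCode(ev: EvStub): String =
    if (nullable) s"${ev.isNull} = false;" else ""

  // Each subclass implements its own code generation.
  def doGenCode(ev: EvStub): String
}

final class SingleSketch(nullable: Boolean) extends RegexBaseSketch(nullable) {
  def doGenCode(ev: EvStub): String =
    s"/* single-match body */ ${setNotNullCode(ev)}"
}

final class AllSketch(nullable: Boolean) extends RegexBaseSketch(nullable) {
  def doGenCode(ev: EvStub): String =
    s"/* all-matches body */ ${setNotNullCode(ev)}"
}
```

The trade-off is that the two `doGenCode` bodies may repeat a few local names, but the base class no longer needs an abstract method with a long parameter list.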



[GitHub] [spark] beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378282100
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/functions.scala
 ##########
 @@ -2383,6 +2383,17 @@ object functions {
     RegExpExtract(e.expr, lit(exp).expr, lit(groupIdx).expr)
   }
 
+  /**
+   * Extract all specific group matched by a Java regex, from the specified string column.
 
 Review comment:
   OK



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585601649
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585544654
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585383244
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] SparkQA removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585783549
 
 
   **[Test build #118363 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118363/testReport)** for PR 27507 at commit [`e9c5ef9`](https://github.com/apache/spark/commit/e9c5ef94934924472ae81c7a6e9c5f72bb08905a).



[GitHub] [spark] cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378238116
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -419,39 +421,104 @@ object RegExpExtract {
   }
 }
 
+abstract class RegExpExtractBase extends TernaryExpression with ImplicitCastInputTypes {
+  def subject: Expression
+  def regexp: Expression
+  def idx: Expression
+
+  // last regex in string, we will update the pattern iff regexp value changed.
+  @transient private var lastRegex: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, IntegerType)
+  override def children: Seq[Expression] = subject :: regexp :: idx :: Nil
+
+  protected def getLastMatcher(s: Any, p: Any): Matcher = {
+    if (!p.equals(lastRegex)) {
+      // regex value changed
+      lastRegex = p.asInstanceOf[UTF8String].clone()
+      pattern = Pattern.compile(lastRegex.toString)
+    }
+    pattern.matcher(s.toString)
+  }
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val classNamePattern = classOf[Pattern].getCanonicalName
+    val classNameRegExpExtractBase = classOf[RegExpExtractBase].getCanonicalName
+    val matcher = ctx.freshName("matcher")
+    val matchResult = ctx.freshName("matchResult")
+
+    val termLastRegex = ctx.addMutableState("UTF8String", "lastRegex")
+    val termPattern = ctx.addMutableState(classNamePattern, "pattern")
+
+    val setEvNotNull = if (nullable) {
+      s"${ev.isNull} = false;"
+    } else {
+      ""
+    }
+    doNullSafeCodeGen(
+      ctx,
+      ev,
+      classNamePattern,
+      classNameRegExpExtractBase,
+      matcher,
+      matchResult,
+      termLastRegex,
+      termPattern,
+      setEvNotNull)
+  }
+
+  def doNullSafeCodeGen(
+      ctx: CodegenContext,
+      ev: ExprCode,
+      classNamePattern: String,
+      classNameRegExpExtractBase: String,
+      matcher: String,
+      matchResult: String,
+      termLastRegex: String,
+      termPattern: String,
+      setEvNotNull: String): ExprCode
+}
+
 /**
  * Extract a specific(idx) group identified by a Java regex.
  *
  * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
  */
 @ExpressionDescription(
   usage = "_FUNC_(str, regexp[, idx]) - Extracts a group that matches `regexp`.",
+  arguments = """
+    Arguments:
+      * str - a string expression
+      * regexp - a string expression. The regex string should be a Java regular expression.
+
+          Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL
+          parser. For example, to match "\abc", a regular expression for `regexp` can be
+          "^\\abc$".
+
+          There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
+          fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
+          if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
+      * idx - a int expression. The regex maybe contains multiple groups. `idx` represents the
+          index of regex group.
 
 Review comment:
   `idx indicates which regex group to extract.`



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585601649
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583836093
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118091/
   Test FAILed.



[GitHub] [spark] cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378857718
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -419,39 +421,67 @@ object RegExpExtract {
   }
 }
 
+abstract class RegExpExtractBase extends TernaryExpression with ImplicitCastInputTypes {
+  def subject: Expression
+  def regexp: Expression
+  def idx: Expression
+
+  // last regex in string, we will update the pattern iff regexp value changed.
+  @transient private var lastRegex: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, IntegerType)
+  override def children: Seq[Expression] = subject :: regexp :: idx :: Nil
+
+  protected def getLastMatcher(s: Any, p: Any): Matcher = {
+    if (!p.equals(lastRegex)) {
+      // regex value changed
+      lastRegex = p.asInstanceOf[UTF8String].clone()
+      pattern = Pattern.compile(lastRegex.toString)
+    }
+    pattern.matcher(s.toString)
+  }
+}
+
 /**
  * Extract a specific(idx) group identified by a Java regex.
  *
  * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
  */
 @ExpressionDescription(
   usage = "_FUNC_(str, regexp[, idx]) - Extracts a group that matches `regexp`.",
 
 Review comment:
   `Extracts a group from the first match of regexp.`?



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585384082
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378297287
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/functions.scala
 ##########
 @@ -2383,6 +2383,17 @@ object functions {
     RegExpExtract(e.expr, lit(exp).expr, lit(groupIdx).expr)
   }
 
+  /**
+   * Extract all specific group matched by a Java regex, from the specified string column.
+   * If the regex did not match, or the specified group did not match, an empty array is returned.
 
 Review comment:
   OK



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583836093
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118091/
   Test FAILed.



[GitHub] [spark] kiszk commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
kiszk commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-587362754
 
 
   It would be good to read [this](https://www.bytefusion.de/2019/07/11/java-code-cache-full-how-to-measure-fill-level/).   
   The code cache is limited. If the code cache fills up, no further native code is generated.
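
   As a rough way to check how full the code cache is, the JVM's memory pool MXBeans can be queried. The sketch below is only an illustration: the object name `CodeCacheUsage` is made up, and matching pools by the substring "code" is an assumption about pool naming (for example "Code Cache" or the "CodeHeap ..." segments on HotSpot), so the list may be empty on other JVMs.

   ```scala
   import java.lang.management.ManagementFactory

   object CodeCacheUsage {
     // Returns (pool name, used bytes, max bytes) for every JVM memory pool whose name
     // mentions "code"; max is -1 when the JVM does not define a limit for the pool.
     def pools(): Seq[(String, Long, Long)] = {
       val beans = ManagementFactory.getMemoryPoolMXBeans
       (0 until beans.size).flatMap { i =>
         val bean = beans.get(i)
         // getUsage may return null if the pool is no longer valid.
         Option(bean.getUsage)
           .filter(_ => bean.getName.toLowerCase.contains("code"))
           .map(u => (bean.getName, u.getUsed, u.getMax))
       }
     }

     def main(args: Array[String]): Unit =
       pools().foreach { case (name, used, max) =>
         println(s"$name: used=$used bytes, max=$max bytes")
       }
   }
   ```

   Comparing `used` against `max` before and after running a query with many generated expressions gives a first impression of how much code cache the generated code consumes.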



[GitHub] [spark] SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585288636
 
 
   **[Test build #118297 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118297/testReport)** for PR 27507 at commit [`ea29d66`](https://github.com/apache/spark/commit/ea29d66b73af276ce7c17e176ac9a69a25c6dc06).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585544658
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23092/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585289451
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118297/
   Test FAILed.



[GitHub] [spark] SparkQA removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585175290
 
 
   **[Test build #118297 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118297/testReport)** for PR 27507 at commit [`ea29d66`](https://github.com/apache/spark/commit/ea29d66b73af276ce7c17e176ac9a69a25c6dc06).



[GitHub] [spark] cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378856977
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/functions.scala
 ##########
 @@ -2383,6 +2385,19 @@ object functions {
     RegExpExtract(e.expr, lit(exp).expr, lit(groupIdx).expr)
   }
 
+  /**
+   * Extract all specific groups matched by a Java regex, from the specified string column.
+   * If the regex did not match, or the specified group did not match, return an empty array.
+   * if the specified group index exceeds the group count of regex, an IllegalArgumentException
+   * will be throwing.
 
 Review comment:
   will be thrown
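
   For reference, the check that this doc sentence describes could look roughly like the sketch below. `GroupIndexCheckSketch` and `checkGroupIndex` are hypothetical names used for illustration; they are not taken from this patch, and the error message is made up.

   ```scala
   import java.util.regex.Pattern

   object GroupIndexCheckSketch {
     // Hypothetical helper: rejects a group index that the regex cannot satisfy.
     def checkGroupIndex(regex: String, idx: Int): Unit = {
       // groupCount() excludes group 0, which always denotes the whole match.
       val groupCount = Pattern.compile(regex).matcher("").groupCount()
       if (idx > groupCount) {
         throw new IllegalArgumentException(
           s"Regex group count is $groupCount, but the specified group index is $idx")
       }
     }
   }
   ```

   Without such a check, `Matcher.group(idx)` itself would throw an `IndexOutOfBoundsException` for an index beyond the group count.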



[GitHub] [spark] SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585544384
 
 
   **[Test build #118335 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118335/testReport)** for PR 27507 at commit [`5e3c092`](https://github.com/apache/spark/commit/5e3c092dc055ca0f1a2f523efa5f305555b991e6).



[GitHub] [spark] cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378241896
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/functions.scala
 ##########
 @@ -2383,6 +2383,17 @@ object functions {
     RegExpExtract(e.expr, lit(exp).expr, lit(groupIdx).expr)
   }
 
+  /**
+   * Extract all specific group matched by a Java regex, from the specified string column.
+   * If the regex did not match, or the specified group did not match, an empty array is returned.
 
 Review comment:
   The behavior seems to be:
   1. If the regex does not match, return an empty array.
   2. If the specified group does not match, put an empty string in the result array.
   
   Can we document this behavior in the SQL expression description? And can you verify that this is the standard behavior in other databases?
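
   The two cases can be reproduced with plain `java.util.regex`, outside Spark. The sketch below is my reading of the behavior, not the code in this patch: `ExtractAllSketch.extractAll` is a hypothetical helper, and mapping a non-participating group to an empty string is the assumption being illustrated.

   ```scala
   import java.util.regex.Pattern
   import scala.collection.mutable.ArrayBuffer

   object ExtractAllSketch {
     // Collects group `idx` from every match of `regex` in `subject`.
     def extractAll(subject: String, regex: String, idx: Int): Seq[String] = {
       val m = Pattern.compile(regex).matcher(subject)
       val out = ArrayBuffer.empty[String]
       while (m.find()) {
         // Matcher.group returns null when the group took no part in this match.
         val g = m.group(idx)
         out += (if (g == null) "" else g)
       }
       out.toSeq
     }
   }
   ```

   Under these assumptions, `extractAll("abc", "\\d+", 0)` yields an empty sequence (case 1), while `extractAll("1a 2 14m", "(\\d+)([a-z]+)?", 2)` yields `a`, an empty string, and `m` (case 2).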



[GitHub] [spark] SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585935213
 
 
   **[Test build #118363 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118363/testReport)** for PR 27507 at commit [`e9c5ef9`](https://github.com/apache/spark/commit/e9c5ef94934924472ae81c7a6e9c5f72bb08905a).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] beliefer commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585518801
 
 
   > I have a high-level question. Is there a big advantage to generating Java code?
   > 
   > One advantage is that the result of `Pattern.compile()` is stored in each global variable for caching, while the non-generated code shares one variable for the cache.
   > On the other hand, the size of the result is not small. Which side of the trade-off do we choose: space or performance?
   
   There is a discussion in https://github.com/apache/spark/pull/26875
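
   For context, the caching being discussed follows the same idea as `getLastMatcher` in this patch: recompile only when the regex value changes. Below is a self-contained sketch of that idea. It uses `String` instead of `UTF8String`, and the class name `CachedMatcher` and the `compileCount` counter are hypothetical additions for illustration.

   ```scala
   import java.util.regex.{Matcher, Pattern}

   // Not thread-safe: like the expression in this patch, it keeps mutable state.
   final class CachedMatcher {
     private var lastRegex: String = _
     private var pattern: Pattern = _
     // Hypothetical counter, only here to make the caching effect observable.
     var compileCount: Int = 0

     def matcher(subject: String, regex: String): Matcher = {
       if (regex != lastRegex) {
         // The regex value changed, so the cached pattern must be rebuilt.
         lastRegex = regex
         pattern = Pattern.compile(regex)
         compileCount += 1
       }
       pattern.matcher(subject)
     }
   }
   ```

   When the regex is a literal, which is the common case, the pattern is compiled once per instance; the cost is that every instance holds its own compiled `Pattern`, which is the space side of the trade-off.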



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585081783
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118274/
   Test FAILed.



[GitHub] [spark] beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378274860
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -419,39 +421,104 @@ object RegExpExtract {
   }
 }
 
+abstract class RegExpExtractBase extends TernaryExpression with ImplicitCastInputTypes {
+  def subject: Expression
+  def regexp: Expression
+  def idx: Expression
+
+  // last regex in string, we will update the pattern iff regexp value changed.
+  @transient private var lastRegex: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, IntegerType)
+  override def children: Seq[Expression] = subject :: regexp :: idx :: Nil
+
+  protected def getLastMatcher(s: Any, p: Any): Matcher = {
+    if (!p.equals(lastRegex)) {
+      // regex value changed
+      lastRegex = p.asInstanceOf[UTF8String].clone()
+      pattern = Pattern.compile(lastRegex.toString)
+    }
+    pattern.matcher(s.toString)
+  }
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val classNamePattern = classOf[Pattern].getCanonicalName
+    val classNameRegExpExtractBase = classOf[RegExpExtractBase].getCanonicalName
+    val matcher = ctx.freshName("matcher")
+    val matchResult = ctx.freshName("matchResult")
+
+    val termLastRegex = ctx.addMutableState("UTF8String", "lastRegex")
+    val termPattern = ctx.addMutableState(classNamePattern, "pattern")
+
+    val setEvNotNull = if (nullable) {
+      s"${ev.isNull} = false;"
+    } else {
+      ""
+    }
+    doNullSafeCodeGen(
+      ctx,
+      ev,
+      classNamePattern,
+      classNameRegExpExtractBase,
+      matcher,
+      matchResult,
+      termLastRegex,
+      termPattern,
+      setEvNotNull)
+  }
+
+  def doNullSafeCodeGen(
+      ctx: CodegenContext,
+      ev: ExprCode,
+      classNamePattern: String,
+      classNameRegExpExtractBase: String,
+      matcher: String,
+      matchResult: String,
+      termLastRegex: String,
+      termPattern: String,
+      setEvNotNull: String): ExprCode
+}
+
 /**
  * Extract a specific(idx) group identified by a Java regex.
  *
  * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
  */
 @ExpressionDescription(
   usage = "_FUNC_(str, regexp[, idx]) - Extracts a group that matches `regexp`.",
+  arguments = """
+    Arguments:
+      * str - a string expression
+      * regexp - a string expression. The regex string should be a Java regular expression.
+
+          Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL
+          parser. For example, to match "\abc", a regular expression for `regexp` can be
+          "^\\abc$".
+
+          There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
+          fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
+          if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
+      * idx - a int expression. The regex maybe contains multiple groups. `idx` represents the
+          index of regex group.
 
 Review comment:
   OK



[GitHub] [spark] SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585245191
 
 
   **[Test build #118304 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118304/testReport)** for PR 27507 at commit [`e50f010`](https://github.com/apache/spark/commit/e50f010c661f68134a93ec9bd186c8f0bdbd4ca0).



[GitHub] [spark] kiszk commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
kiszk commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585301466
 
 
   I have a high-level question. Do we gain a significant advantage by generating Java code?
   
   One advantage is that each piece of generated code stores the result of `Pattern.compile()` in its own global variable for caching, whereas the non-generated code shares a single variable as its cache.   
   On the other hand, the size of the result is not small. Which side of the trade-off do we choose: space or performance?
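
To make the caching trade-off in the comment above concrete, here is a minimal Scala sketch of the compile-on-change pattern that `getLastMatcher` in the quoted diff follows. It is not the PR's code: `CachedRegex`, `matcherFor`, and `compileCount` are hypothetical names, and plain `String` stands in for `UTF8String` so that the snippet runs without Spark.

```scala
import java.util.regex.{Matcher, Pattern}

// Hypothetical stand-in for the caching fields in RegExpExtractBase.
class CachedRegex {
  private var lastRegex: String = null
  private var pattern: Pattern = null
  // Counts calls to Pattern.compile, only to make the caching observable.
  var compileCount: Int = 0

  def matcherFor(subject: String, regex: String): Matcher = {
    if (regex != lastRegex) {
      // The regex value changed, so recompile and remember it.
      lastRegex = regex
      pattern = Pattern.compile(regex)
      compileCount += 1
    }
    pattern.matcher(subject)
  }
}
```

With a constant regex, `Pattern.compile` runs once regardless of how many rows are processed. The space side of the question is then about how many copies of this cached state exist.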



[GitHub] [spark] cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378239327
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -419,39 +421,104 @@ object RegExpExtract {
   }
 }
 
+abstract class RegExpExtractBase extends TernaryExpression with ImplicitCastInputTypes {
+  def subject: Expression
+  def regexp: Expression
+  def idx: Expression
+
+  // last regex in string, we will update the pattern iff regexp value changed.
+  @transient private var lastRegex: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, IntegerType)
+  override def children: Seq[Expression] = subject :: regexp :: idx :: Nil
+
+  protected def getLastMatcher(s: Any, p: Any): Matcher = {
+    if (!p.equals(lastRegex)) {
+      // regex value changed
+      lastRegex = p.asInstanceOf[UTF8String].clone()
+      pattern = Pattern.compile(lastRegex.toString)
+    }
+    pattern.matcher(s.toString)
+  }
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val classNamePattern = classOf[Pattern].getCanonicalName
+    val classNameRegExpExtractBase = classOf[RegExpExtractBase].getCanonicalName
+    val matcher = ctx.freshName("matcher")
+    val matchResult = ctx.freshName("matchResult")
+
+    val termLastRegex = ctx.addMutableState("UTF8String", "lastRegex")
+    val termPattern = ctx.addMutableState(classNamePattern, "pattern")
+
+    val setEvNotNull = if (nullable) {
+      s"${ev.isNull} = false;"
+    } else {
+      ""
+    }
+    doNullSafeCodeGen(
+      ctx,
+      ev,
+      classNamePattern,
+      classNameRegExpExtractBase,
+      matcher,
+      matchResult,
+      termLastRegex,
+      termPattern,
+      setEvNotNull)
+  }
+
+  def doNullSafeCodeGen(
+      ctx: CodegenContext,
+      ev: ExprCode,
+      classNamePattern: String,
+      classNameRegExpExtractBase: String,
+      matcher: String,
+      matchResult: String,
+      termLastRegex: String,
+      termPattern: String,
+      setEvNotNull: String): ExprCode
+}
+
 /**
  * Extract a specific(idx) group identified by a Java regex.
  *
  * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
  */
 @ExpressionDescription(
   usage = "_FUNC_(str, regexp[, idx]) - Extracts a group that matches `regexp`.",
+  arguments = """
+    Arguments:
+      * str - a string expression
+      * regexp - a string expression. The regex string should be a Java regular expression.
+
+          Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL
+          parser. For example, to match "\abc", a regular expression for `regexp` can be
+          "^\\abc$".
+
+          There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
+          fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
+          if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
+      * idx - a int expression. The regex maybe contains multiple groups. `idx` represents the
 
 Review comment:
   `an int expression of the regex group index.`
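
As a small illustration of what "regex group index" means for the `idx` argument discussed above, the following Scala sketch uses `java.util.regex` directly. `GroupIndexDemo` and `groupAt` are hypothetical names for this example only, and the range check is an assumption about reasonable behaviour, not the PR's implementation.

```scala
import java.util.regex.Pattern

object GroupIndexDemo {
  // Returns group `idx` of the first match, or None if the regex does not match.
  // Index 0 is the whole match; 1 to groupCount are the capturing groups.
  def groupAt(subject: String, regex: String, idx: Int): Option[String] = {
    val m = Pattern.compile(regex).matcher(subject)
    require(idx >= 0 && idx <= m.groupCount(),
      s"Group index $idx is out of range; the regex has ${m.groupCount()} group(s)")
    // Option(...) maps a null group (an optional group that did not match) to None.
    if (m.find()) Option(m.group(idx)) else None
  }
}
```

For the regex `(\d+)-(\d+)` applied to `100-200`, index 0 selects `100-200`, index 1 selects `100`, and index 2 selects `200`.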



[GitHub] [spark] SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585002672
 
 
   **[Test build #118274 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118274/testReport)** for PR 27507 at commit [`1ed159f`](https://github.com/apache/spark/commit/1ed159f8a717d71060bcb4a3448ac7df552c4e68).



[GitHub] [spark] SparkQA removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585602209
 
 
   **[Test build #118341 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118341/testReport)** for PR 27507 at commit [`5e3c092`](https://github.com/apache/spark/commit/5e3c092dc055ca0f1a2f523efa5f305555b991e6).



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585602645
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23098/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585384082
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378645357
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -508,3 +535,99 @@ case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expressio
     })
   }
 }
+
+/**
+ * Extract all specific(idx) groups identified by a Java regex.
+ *
+ * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, regexp[, idx]) - Extracts all group that matches `regexp`.",
+  arguments = """
+    Arguments:
+      * str - a string expression of the input string.
+      * regexp - a string expression of the regex string.
+
+          Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL
+          parser. For example, to match "\abc", a regular expression for `regexp` can be
+          "^\\abc$".
+
+          There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
+          fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
+          if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
+      * idx - an int expression of the regex group index. The regex maybe contains multiple
+          groups. `idx` indicates which regex group to extract.
+  """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_('100-200, 300-400', '(\\d+)-(\\d+)', 1);
+       ["100","300"]
+  """,
+  since = "3.0.0")
+case class RegExpExtractAll(subject: Expression, regexp: Expression, idx: Expression)
+  extends RegExpExtractBase {
+  def this(s: Expression, r: Expression) = this(s, r, Literal(1))
+
+  override def nullSafeEval(s: Any, p: Any, r: Any): Any = {
+    val m = getLastMatcher(s, p)
+    val matchResults = new ArrayBuffer[UTF8String]()
+    val mr: MatchResult = m.toMatchResult
+    while(m.find) {
+      val mr: MatchResult = m.toMatchResult
+      val index = r.asInstanceOf[Int]
+      RegExpExtractBase.checkGroupIndex(mr.groupCount, index)
+      val group = mr.group(index)
+      if (group == null) { // Pattern matched, but not optional group
 
 Review comment:
   OK.
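
The loop in the quoted hunk can be sketched outside Spark as follows. This is a simplified stand-in, not the PR's code: `ExtractAllSketch` and `extractAll` are hypothetical names, plain `String` replaces `UTF8String`, the `require` call only approximates what `RegExpExtractBase.checkGroupIndex` might do, and mapping a non-participating optional group to an empty string is an assumption.

```scala
import java.util.regex.Pattern
import scala.collection.mutable.ArrayBuffer

object ExtractAllSketch {
  def extractAll(subject: String, regex: String, idx: Int): Seq[String] = {
    val m = Pattern.compile(regex).matcher(subject)
    // Hypothetical range check; groupCount() is known before any match is attempted.
    require(idx >= 0 && idx <= m.groupCount(),
      s"Group index $idx is out of range; the regex has ${m.groupCount()} group(s)")
    val results = new ArrayBuffer[String]()
    while (m.find()) {
      val group = m.group(idx)
      // The pattern matched, but an optional group may not have participated.
      // Assumption: record an empty string in that case.
      results += (if (group == null) "" else group)
    }
    results.toSeq
  }
}
```

Applied to the example from the `ExpressionDescription` above, `extractAll("100-200, 300-400", "(\\d+)-(\\d+)", 1)` collects `100` and `300`.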



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support string function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-583817614
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22856/
   Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585602209
 
 
   **[Test build #118341 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118341/testReport)** for PR 27507 at commit [`5e3c092`](https://github.com/apache/spark/commit/5e3c092dc055ca0f1a2f523efa5f305555b991e6).



[GitHub] [spark] beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378274404
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -419,39 +421,104 @@ object RegExpExtract {
   }
 }
 
+abstract class RegExpExtractBase extends TernaryExpression with ImplicitCastInputTypes {
+  def subject: Expression
+  def regexp: Expression
+  def idx: Expression
+
+  // last regex in string, we will update the pattern iff regexp value changed.
+  @transient private var lastRegex: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, IntegerType)
+  override def children: Seq[Expression] = subject :: regexp :: idx :: Nil
+
+  protected def getLastMatcher(s: Any, p: Any): Matcher = {
+    if (!p.equals(lastRegex)) {
+      // regex value changed
+      lastRegex = p.asInstanceOf[UTF8String].clone()
+      pattern = Pattern.compile(lastRegex.toString)
+    }
+    pattern.matcher(s.toString)
+  }
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val classNamePattern = classOf[Pattern].getCanonicalName
+    val classNameRegExpExtractBase = classOf[RegExpExtractBase].getCanonicalName
+    val matcher = ctx.freshName("matcher")
+    val matchResult = ctx.freshName("matchResult")
+
+    val termLastRegex = ctx.addMutableState("UTF8String", "lastRegex")
+    val termPattern = ctx.addMutableState(classNamePattern, "pattern")
+
+    val setEvNotNull = if (nullable) {
+      s"${ev.isNull} = false;"
+    } else {
+      ""
+    }
+    doNullSafeCodeGen(
+      ctx,
+      ev,
+      classNamePattern,
+      classNameRegExpExtractBase,
+      matcher,
+      matchResult,
+      termLastRegex,
+      termPattern,
+      setEvNotNull)
+  }
+
+  def doNullSafeCodeGen(
+      ctx: CodegenContext,
+      ev: ExprCode,
+      classNamePattern: String,
+      classNameRegExpExtractBase: String,
+      matcher: String,
+      matchResult: String,
+      termLastRegex: String,
+      termPattern: String,
+      setEvNotNull: String): ExprCode
+}
+
 /**
  * Extract a specific(idx) group identified by a Java regex.
  *
  * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
  */
 @ExpressionDescription(
   usage = "_FUNC_(str, regexp[, idx]) - Extracts a group that matches `regexp`.",
+  arguments = """
+    Arguments:
+      * str - a string expression
+      * regexp - a string expression. The regex string should be a Java regular expression.
+
+          Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL
+          parser. For example, to match "\abc", a regular expression for `regexp` can be
+          "^\\abc$".
+
+          There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
+          fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
+          if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
+      * idx - a int expression. The regex maybe contains multiple groups. `idx` represents the
 
 Review comment:
   OK



[GitHub] [spark] AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585784282
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] kiszk commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
kiszk commented on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-587264772
 
 
   @cloud-fan It makes sense to support codegen for whole-stage codegen. 
   
   @beliefer Another question: do we need to generate the same whole code multiple times? Since I cannot see any specialized part in the generated code, how about generating a runtime call instead? [Here](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3876) is an example.
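
One possible shape of the "runtime call" idea above, sketched without any Spark dependency: instead of emitting the whole matching loop at every use site, the code generator emits a single Java statement that calls a shared helper. Everything here is hypothetical and for illustration only; `RuntimeCallSketch`, `genRuntimeCall`, and the `extractAll` helper method are not Spark APIs, and the variable names passed in stand for whatever the real codegen context would supply.

```scala
object RuntimeCallSketch {
  // Returns one line of Java source that delegates to a static helper method.
  // `helperClass` is the fully qualified name of a (hypothetical) helper class;
  // the remaining arguments are names of variables in the generated code.
  def genRuntimeCall(
      helperClass: String,
      result: String,
      subject: String,
      regex: String,
      idx: String): String = {
    s"$result = $helperClass.extractAll($subject, $regex, $idx);"
  }
}
```

The generated source then stays one line per call site, at the cost of a method call at run time. Where the cached `Pattern` would live in such a design is left open in this sketch.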



[GitHub] [spark] cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378239103
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -419,39 +421,104 @@ object RegExpExtract {
   }
 }
 
+abstract class RegExpExtractBase extends TernaryExpression with ImplicitCastInputTypes {
+  def subject: Expression
+  def regexp: Expression
+  def idx: Expression
+
+  // last regex in string, we will update the pattern iff regexp value changed.
+  @transient private var lastRegex: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, IntegerType)
+  override def children: Seq[Expression] = subject :: regexp :: idx :: Nil
+
+  protected def getLastMatcher(s: Any, p: Any): Matcher = {
+    if (!p.equals(lastRegex)) {
+      // regex value changed
+      lastRegex = p.asInstanceOf[UTF8String].clone()
+      pattern = Pattern.compile(lastRegex.toString)
+    }
+    pattern.matcher(s.toString)
+  }
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val classNamePattern = classOf[Pattern].getCanonicalName
+    val classNameRegExpExtractBase = classOf[RegExpExtractBase].getCanonicalName
+    val matcher = ctx.freshName("matcher")
+    val matchResult = ctx.freshName("matchResult")
+
+    val termLastRegex = ctx.addMutableState("UTF8String", "lastRegex")
+    val termPattern = ctx.addMutableState(classNamePattern, "pattern")
+
+    val setEvNotNull = if (nullable) {
+      s"${ev.isNull} = false;"
+    } else {
+      ""
+    }
+    doNullSafeCodeGen(
+      ctx,
+      ev,
+      classNamePattern,
+      classNameRegExpExtractBase,
+      matcher,
+      matchResult,
+      termLastRegex,
+      termPattern,
+      setEvNotNull)
+  }
+
+  def doNullSafeCodeGen(
+      ctx: CodegenContext,
+      ev: ExprCode,
+      classNamePattern: String,
+      classNameRegExpExtractBase: String,
+      matcher: String,
+      matchResult: String,
+      termLastRegex: String,
+      termPattern: String,
+      setEvNotNull: String): ExprCode
+}
+
 /**
  * Extract a specific(idx) group identified by a Java regex.
  *
  * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
  */
 @ExpressionDescription(
   usage = "_FUNC_(str, regexp[, idx]) - Extracts a group that matches `regexp`.",
+  arguments = """
+    Arguments:
+      * str - a string expression
+      * regexp - a string expression. The regex string should be a Java regular expression.
 
 Review comment:
   What is a `Java regular expression`?
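
For context on the question above: the quoted diff compiles the pattern with `Pattern.compile`, so the regex syntax in play is that of `java.util.regex.Pattern`. The Scala sketch below checks the escaping example from the argument description at the regex level only, after any SQL string-literal unescaping has already happened. `JavaRegexDemo` and `matchesBackslashAbc` are hypothetical names.

```scala
import java.util.regex.Pattern

object JavaRegexDemo {
  // The seven-character regex ^\\abc$ matches the literal text \abc, because
  // \\ in java.util.regex syntax denotes a single literal backslash.
  val regex: String = """^\\abc$"""

  def matchesBackslashAbc(s: String): Boolean =
    Pattern.compile(regex).matcher(s).matches()
}
```

Triple-quoted strings are used so that the Scala compiler does not add a second layer of backslash escaping on top of the regex's own.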



[GitHub] [spark] beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378277156
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ##########
 @@ -419,39 +421,104 @@ object RegExpExtract {
   }
 }
 
+abstract class RegExpExtractBase extends TernaryExpression with ImplicitCastInputTypes {
+  def subject: Expression
+  def regexp: Expression
+  def idx: Expression
+
+  // last regex in string, we will update the pattern iff regexp value changed.
+  @transient private var lastRegex: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, IntegerType)
+  override def children: Seq[Expression] = subject :: regexp :: idx :: Nil
+
+  protected def getLastMatcher(s: Any, p: Any): Matcher = {
+    if (!p.equals(lastRegex)) {
+      // regex value changed
+      lastRegex = p.asInstanceOf[UTF8String].clone()
+      pattern = Pattern.compile(lastRegex.toString)
+    }
+    pattern.matcher(s.toString)
+  }
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val classNamePattern = classOf[Pattern].getCanonicalName
+    val classNameRegExpExtractBase = classOf[RegExpExtractBase].getCanonicalName
+    val matcher = ctx.freshName("matcher")
+    val matchResult = ctx.freshName("matchResult")
+
+    val termLastRegex = ctx.addMutableState("UTF8String", "lastRegex")
+    val termPattern = ctx.addMutableState(classNamePattern, "pattern")
+
+    val setEvNotNull = if (nullable) {
+      s"${ev.isNull} = false;"
+    } else {
+      ""
+    }
 
 Review comment:
   OK.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585544658
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23092/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#issuecomment-585246151
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23062/
   Test PASSed.



[GitHub] [spark] beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
URL: https://github.com/apache/spark/pull/27507#discussion_r378290232
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/functions.scala
 ##########
 @@ -2383,6 +2383,17 @@ object functions {
     RegExpExtract(e.expr, lit(exp).expr, lit(groupIdx).expr)
   }
 
+  /**
+   * Extract all strings that match the specified group of a Java regex, from the specified string column.
+   * If the regex did not match, or the specified group did not match, an empty array is returned.
 
 Review comment:
   2. It should throw an IllegalArgumentException.
   https://github.com/apache/spark/pull/27508
   The behavior of Hive is:
   `FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments ‘2’: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public java.lang.String org.apache.hadoop.hive.ql.udf.UDFRegExpExtract.evaluate(java.lang.String,java.lang.String,java.lang.Integer) on object org.apache.hadoop.hive.ql.udf.UDFRegExpExtract@2cf5e0f0 of class org.apache.hadoop.hive.ql.udf.UDFRegExpExtract with arguments {x=a3&x=18abc&x=2&y=3&x=4:java.lang.String, x=([0-9]+)[a-z]:java.lang.String, 2:java.lang.Integer} of size 3`
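
The behavior discussed above (reject a group index larger than the regex's group count, otherwise collect the group from every match) can be sketched as follows. This is only an illustration on plain strings. The object name `RegexpExtractAllSketch`, the function `extractAll`, the error message wording, and the choice to emit an empty string for a group that did not participate in a match are all assumptions, not the PR's implementation.

```scala
import java.util.regex.Pattern
import scala.collection.mutable.ArrayBuffer

// Hypothetical standalone sketch; Spark's expression works on UTF8String and array data.
object RegexpExtractAllSketch {
  def extractAll(input: String, regex: String, groupIdx: Int): Seq[String] = {
    val m = Pattern.compile(regex).matcher(input)
    // groupCount() is the number of capturing groups; group 0 is the whole match.
    if (groupIdx < 0 || groupIdx > m.groupCount()) {
      throw new IllegalArgumentException(
        s"Regex group count is ${m.groupCount()}, but the specified group index is $groupIdx")
    }
    val out = ArrayBuffer.empty[String]
    while (m.find()) {
      val g = m.group(groupIdx)
      // A group that did not participate in the match is null; "" is an assumed choice here.
      out += (if (g == null) "" else g)
    }
    out.toSeq
  }
}
```

With this sketch, a regex that never matches yields an empty sequence, while an out-of-range group index fails fast before any matching is attempted.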
