You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by "cgivre (via GitHub)" <gi...@apache.org> on 2023/02/21 04:03:12 UTC

[GitHub] [drill] cgivre opened a new pull request, #2762: DRILL-8402: Add REGEXP_EXTRACT Function

cgivre opened a new pull request, #2762:
URL: https://github.com/apache/drill/pull/2762

   # [DRILL-8402](https://issues.apache.org/jira/browse/DRILL-8402): Add REGEXP_EXTRACT Function
   
   ## Description
   Adds `regexp_extract` functions to Drill.
   
   ## Documentation
   This PR adds support for `regexp_extract(<text>, <pattern>)` which returns an array of text corresponding with the capturing groups in the regex.  It also includes `regexp_extract(<text>, <pattern>, <index>)` which returns the text of a specific capturing group.
   
   ```sql
   SELECT regexp_extract('123-456-789', '([0-9]{3})-([0-9]{3})-([0-9]{3})');
   +---------------------+
   |       EXPR$0        |
   +---------------------+
   | ["123","456","789"] |
   +---------------------+
   
   SELECT regexp_extract('123-456-789', '([0-9]{3})-([0-9]{3})-([0-9]{3})', 0);
   +-------------+
   |   EXPR$0    |
   +-------------+
   | 123-456-789 |
   +-------------+
   
   SELECT regexp_extract('123-456-789', '([0-9]{3})-([0-9]{3})-([0-9]{3})', 3);
   +--------+
   | EXPR$0 |
   +--------+
   | 789    |
   +--------+
   ```
   
   
   ## Testing
   Added unit tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [drill] cgivre commented on pull request #2762: DRILL-8402: Add REGEXP_EXTRACT Function

Posted by "cgivre (via GitHub)" <gi...@apache.org>.
cgivre commented on PR #2762:
URL: https://github.com/apache/drill/pull/2762#issuecomment-1440613580

   > 
   
   @vvysotskyi  Should we proceed with this?  Is that a LGTM +1?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] DRILL-8402: Add REGEXP_EXTRACT Function (drill)

Posted by "vvysotskyi (via GitHub)" <gi...@apache.org>.
vvysotskyi commented on code in PR #2762:
URL: https://github.com/apache/drill/pull/2762#discussion_r1116028724


##########
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctions.java:
##########
@@ -293,6 +293,109 @@ public void eval() {
     }
   }
 
+  /*
+   * This function returns the capturing groups from a regex.
+   */
+  @FunctionTemplate(name = "regexp_extract", scope = FunctionScope.SIMPLE,
+      outputWidthCalculatorType = OutputWidthCalculatorType.CUSTOM_FIXED_WIDTH_DEFAULT)
+  public static class RegexpExtract implements DrillSimpleFunc {
+
+    @Param VarCharHolder input;
+    @Param(constant=true) VarCharHolder pattern;
+    @Inject
+    DrillBuf buffer;
+    @Workspace
+    java.util.regex.Matcher matcher;
+    @Workspace
+    org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper charSequenceWrapper;
+    @Output
+    ComplexWriter out;
+
+    @Override
+    public void setup() {
+      matcher = java.util.regex.Pattern.compile(org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(pattern.start,  pattern.end,  pattern.buffer)).matcher("");
+      charSequenceWrapper = new org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper();
+      matcher.reset(charSequenceWrapper);
+    }
+
+    @Override
+    public void eval() {
+      charSequenceWrapper.setBuffer(input.start, input.end, input.buffer);
+
+      // Reusing same charSequenceWrapper, no need to pass it in.
+      matcher.reset();
+      boolean result = matcher.find();
+
+      // Start the list here.  If there are no matches, we return an empty list.
+      org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter = out.rootAsList();
+      listWriter.startList();
+
+      if (result) {
+        org.apache.drill.exec.vector.complex.writer.VarCharWriter varCharWriter = listWriter.varChar();
+
+        for(int i = 1; i <= matcher.groupCount(); i++) {
+          final byte[] strBytes = matcher.group(i).getBytes(com.google.common.base.Charsets.UTF_8);

Review Comment:
   `matcher.group(i)` creates and returns string



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [drill] cgivre commented on pull request #2762: DRILL-8402: Add REGEXP_EXTRACT Function

Posted by "cgivre (via GitHub)" <gi...@apache.org>.
cgivre commented on PR #2762:
URL: https://github.com/apache/drill/pull/2762#issuecomment-1438871668

   > Wouldn't this change introduce ReDoS vulnerability?
   
   Potentially, but we already allow `REGEXP_REPLACE` and `REGEX_MATCHES`, so I don't know that this actually makes anything worse.  I did try adding a validator with this `saferegex`[1] but that library is not suitable for inclusion in Drill. (It prints all kinds of stuff to STDOUT.) 
   
   [1]: https://github.com/jkutner/saferegex


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [drill] cgivre commented on a diff in pull request #2762: DRILL-8402: Add REGEXP_EXTRACT Function

Posted by "cgivre (via GitHub)" <gi...@apache.org>.
cgivre commented on code in PR #2762:
URL: https://github.com/apache/drill/pull/2762#discussion_r1115197954


##########
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctions.java:
##########
@@ -293,6 +293,115 @@ public void eval() {
     }
   }
 
+  /*
+   * This function returns the capturing groups from a regex.
+   */
+  @FunctionTemplate(name = "regexp_extract", scope = FunctionScope.SIMPLE,
+      outputWidthCalculatorType = OutputWidthCalculatorType.CUSTOM_FIXED_WIDTH_DEFAULT)
+  public static class RegexpExtract implements DrillSimpleFunc {
+
+    @Param VarCharHolder input;
+    @Param(constant=true) VarCharHolder pattern;
+    @Inject
+    DrillBuf buffer;
+    @Workspace
+    java.util.regex.Matcher matcher;
+    @Workspace
+    org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper charSequenceWrapper;
+    @Output
+    ComplexWriter out;
+
+    @Override
+    public void setup() {
+      matcher = java.util.regex.Pattern.compile(org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(pattern.start,  pattern.end,  pattern.buffer)).matcher("");
+      charSequenceWrapper = new org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper();
+      matcher.reset(charSequenceWrapper);
+    }
+
+    @Override
+    public void eval() {
+      charSequenceWrapper.setBuffer(input.start, input.end, input.buffer);
+
+      // Reusing same charSequenceWrapper, no need to pass it in.
+      matcher.reset();
+      boolean result = matcher.find();
+
+      // Start the list here.  If there are no matches, we return an empty list.
+      org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter = out.rootAsList();
+      listWriter.startList();
+
+      if (result) {
+        org.apache.drill.exec.vector.complex.writer.VarCharWriter varCharWriter = listWriter.varChar();
+        String extractedResult;
+        for(int i = 1; i <= matcher.groupCount(); i++) {
+          extractedResult = matcher.group(i);

Review Comment:
   Fixed. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [drill] cgivre commented on a diff in pull request #2762: DRILL-8402: Add REGEXP_EXTRACT Function

Posted by "cgivre (via GitHub)" <gi...@apache.org>.
cgivre commented on code in PR #2762:
URL: https://github.com/apache/drill/pull/2762#discussion_r1115193957


##########
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/CharSequenceWrapper.java:
##########
@@ -90,7 +90,10 @@ public char charAt(int index) {
    */
   @Override
   public CharSequence subSequence(int start, int end) {
-    throw new UnsupportedOperationException();
+    // throw new UnsupportedOperationException();

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [drill] vvysotskyi commented on a diff in pull request #2762: DRILL-8402: Add REGEXP_EXTRACT Function

Posted by "vvysotskyi (via GitHub)" <gi...@apache.org>.
vvysotskyi commented on code in PR #2762:
URL: https://github.com/apache/drill/pull/2762#discussion_r1114857738


##########
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctions.java:
##########
@@ -293,6 +293,115 @@ public void eval() {
     }
   }
 
+  /*
+   * This function returns the capturing groups from a regex.
+   */
+  @FunctionTemplate(name = "regexp_extract", scope = FunctionScope.SIMPLE,
+      outputWidthCalculatorType = OutputWidthCalculatorType.CUSTOM_FIXED_WIDTH_DEFAULT)
+  public static class RegexpExtract implements DrillSimpleFunc {
+
+    @Param VarCharHolder input;
+    @Param(constant=true) VarCharHolder pattern;
+    @Inject
+    DrillBuf buffer;
+    @Workspace
+    java.util.regex.Matcher matcher;
+    @Workspace
+    org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper charSequenceWrapper;
+    @Output
+    ComplexWriter out;
+
+    @Override
+    public void setup() {
+      matcher = java.util.regex.Pattern.compile(org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(pattern.start,  pattern.end,  pattern.buffer)).matcher("");
+      charSequenceWrapper = new org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper();
+      matcher.reset(charSequenceWrapper);
+    }
+
+    @Override
+    public void eval() {
+      charSequenceWrapper.setBuffer(input.start, input.end, input.buffer);
+
+      // Reusing same charSequenceWrapper, no need to pass it in.
+      matcher.reset();
+      boolean result = matcher.find();
+
+      // Start the list here.  If there are no matches, we return an empty list.
+      org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter = out.rootAsList();
+      listWriter.startList();
+
+      if (result) {
+        org.apache.drill.exec.vector.complex.writer.VarCharWriter varCharWriter = listWriter.varChar();
+        String extractedResult;
+        for(int i = 1; i <= matcher.groupCount(); i++) {
+          extractedResult = matcher.group(i);

Review Comment:
   It is better to avoid creating extra objects in UDFs to reduce the load on the garbage collector. Matcher has `Matcher.start(int group)` and `Matcher.end(int group)`, so please use them to obtain bytes that correspond to marching subsequence.



##########
NOTICE:
##########
@@ -1,5 +1,5 @@
 Apache Drill
-Copyright 2013-2022 The Apache Software Foundation
+Copyright 2013-2023 The Apache Software Foundation

Review Comment:
   Looks like this PR should be rebased on the latest master. Probably these changes are present because of the force push to master.



##########
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/CharSequenceWrapper.java:
##########
@@ -90,7 +90,10 @@ public char charAt(int index) {
    */
   @Override
   public CharSequence subSequence(int start, int end) {
-    throw new UnsupportedOperationException();
+    // throw new UnsupportedOperationException();

Review Comment:
   Please remove commented code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] DRILL-8402: Add REGEXP_EXTRACT Function (drill)

Posted by "cgivre (via GitHub)" <gi...@apache.org>.
cgivre commented on PR #2762:
URL: https://github.com/apache/drill/pull/2762#issuecomment-1441742721

   @vvysotskyi Thanks for the review.  I refactored the functions so that they are not creating extra String objects.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [drill] vvysotskyi commented on pull request #2762: DRILL-8402: Add REGEXP_EXTRACT Function

Posted by "vvysotskyi (via GitHub)" <gi...@apache.org>.
vvysotskyi commented on PR #2762:
URL: https://github.com/apache/drill/pull/2762#issuecomment-1438117260

   Wouldn't this change introduce ReDoS vulnerability?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [drill] cgivre commented on a diff in pull request #2762: DRILL-8402: Add REGEXP_EXTRACT Function

Posted by "cgivre (via GitHub)" <gi...@apache.org>.
cgivre commented on code in PR #2762:
URL: https://github.com/apache/drill/pull/2762#discussion_r1115193471


##########
NOTICE:
##########
@@ -1,5 +1,5 @@
 Apache Drill
-Copyright 2013-2022 The Apache Software Foundation
+Copyright 2013-2023 The Apache Software Foundation

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] DRILL-8402: Add REGEXP_EXTRACT Function (drill)

Posted by "cgivre (via GitHub)" <gi...@apache.org>.
cgivre merged PR #2762:
URL: https://github.com/apache/drill/pull/2762


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [drill] vvysotskyi commented on pull request #2762: DRILL-8402: Add REGEXP_EXTRACT Function

Posted by "vvysotskyi (via GitHub)" <gi...@apache.org>.
vvysotskyi commented on PR #2762:
URL: https://github.com/apache/drill/pull/2762#issuecomment-1439487042

   Ok, in this case, we can add this UDF.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org