Posted to github@arrow.apache.org by "Jefffrey (via GitHub)" <gi...@apache.org> on 2023/03/05 00:00:42 UTC

[GitHub] [arrow-rs] Jefffrey opened a new issue, #3803: regexp_match skips first match

Jefffrey opened a new issue, #3803:
URL: https://github.com/apache/arrow-rs/issues/3803

   **Describe the bug**
   
   In some cases `regexp_match` skips the first (and only) match.
   
   For example, if the pattern is `foo` and the string to match is `foo`, it should return the single match `foo`. It currently returns an empty array for the match: it correctly detects that there is a match, but does not return the matched text.
   
   **To Reproduce**
   
   Example test in [arrow-string/src/regexp.rs](https://github.com/apache/arrow-rs/blob/79518cf67a6dd5fc391e271fd92c0c21ee7e8a74/arrow-string/src/regexp.rs)
   
   ```rust
   #[test]
   fn sandbox() {
       let array = StringArray::from(vec![Some("foo")]);
       // The pattern has no parenthesized capture group.
       let pattern = GenericStringArray::<i32>::from(vec![r"foo"]);
       let actual = regexp_match(&array, &pattern, None).unwrap();
       let result = actual.as_any().downcast_ref::<ListArray>().unwrap();

       // Expected: one list entry holding the whole match, "foo".
       let elem_builder: GenericStringBuilder<i32> = GenericStringBuilder::new();
       let mut expected_builder = ListBuilder::new(elem_builder);
       expected_builder.values().append_value("foo");
       expected_builder.append(true);
       let expected = expected_builder.finish();
       assert_eq!(&expected, result);
   }
   ```
   
   **Expected behavior**
   
   Test should succeed.
   
   **Additional context**
   
   This seems to be because the code skips the first entry of the captures (group `0`) by default:
   
   https://github.com/apache/arrow-rs/blob/79518cf67a6dd5fc391e271fd92c0c21ee7e8a74/arrow-string/src/regexp.rs#L210-L218
   
   In the test example above, `caps` has the value:
   
   ```
   [arrow-string/src/regexp.rs:212] &caps = Captures(
       {
           0: Some(
               "foo",
           ),
       },
   )
   ```
   
   Relevant regex doc: https://docs.rs/regex/latest/regex/struct.Regex.html#method.captures
   
   Specifically:
   
   > Capture group `0` always corresponds to the entire match.
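
   To make the effect of group `0` concrete, here is a minimal, self-contained sketch. It does not use the `regex` crate or the arrow-rs code: `Caps` and `groups_skipping_whole_match` are hypothetical names, and the `skip(1)` step is an assumption about what the linked lines do.

   ```rust
   /// Hypothetical stand-in for `regex::Captures`: index 0 is the whole match,
   /// indices 1.. are the parenthesized groups (`None` if a group did not participate).
   type Caps<'a> = Vec<Option<&'a str>>;

   /// Assumed behaviour of the linked lines: drop group 0, keep the remaining groups.
   fn groups_skipping_whole_match<'a>(caps: &Caps<'a>) -> Vec<&'a str> {
       caps.iter()
           .skip(1) // discards the whole-match entry
           .flatten() // drops groups that did not participate
           .copied()
           .collect()
   }
   ```

   With the pattern `foo` the captures hold only group `0`, so `groups_skipping_whole_match(&vec![Some("foo")])` is empty, which corresponds to the empty array seen in the failing test. With `(foo)` the captures are `[Some("foo"), Some("foo")]`, so one element survives the skip.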
   
   Original issue: https://github.com/apache/arrow-datafusion/issues/5479


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Jefffrey commented on issue #3803: regexp_match skips first match when returning match

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on issue #3803:
URL: https://github.com/apache/arrow-rs/issues/3803#issuecomment-1456915003

   If parity with Postgres is desired, then this would be considered a bug. Relevant extract:
   
   > If a match is found, and the `pattern` contains no parenthesized subexpressions, then the result is a single-element text array containing the substring matching the whole pattern
   
   From: https://www.postgresql.org/docs/current/functions-matching.html
   
   The current behaviour is also somewhat confusing: a non-null value in the output ListArray indicates that a match was found (otherwise the entry would be null instead of a StringArray), yet the resulting StringArray is empty and does not contain the match. That seems inconsistent.
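
   The Postgres rule quoted above can be sketched as a small selection step over the captures of a successful match. This is only an illustration under stated assumptions, not the arrow-rs implementation: `postgres_style_result` is a hypothetical name, and the slice models the captures with the whole match at index 0 followed by the parenthesized groups.

   ```rust
   /// Hypothetical helper: picks the values to return for one successful match.
   /// `caps[0]` is the whole match; `caps[1..]` are the parenthesized groups.
   fn postgres_style_result<'a>(caps: &[Option<&'a str>]) -> Vec<Option<&'a str>> {
       match caps {
           // No captures at all: nothing to return.
           [] => Vec::new(),
           // No parenthesized subexpressions: a single element, the whole match.
           [whole] => vec![*whole],
           // Otherwise return only the groups; a group that did not
           // participate in the match stays `None` in this model.
           [_, groups @ ..] => groups.to_vec(),
       }
   }
   ```

   Under this model the pattern `foo` against `foo` yields `[Some("foo")]` rather than an empty list, while a pattern with groups still returns only the groups.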


[GitHub] [arrow-rs] tustvold commented on issue #3803: regexp_match skips first match when returning match

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #3803:
URL: https://github.com/apache/arrow-rs/issues/3803#issuecomment-1457938997

   > If parity with Postgres is desired, then this would be considered a bug
   
   Makes sense, we should probably update the function's docs to match


[GitHub] [arrow-rs] Weijun-H commented on issue #3803: regexp_match skips first match when returning match

Posted by "Weijun-H (via GitHub)" <gi...@apache.org>.
Weijun-H commented on issue #3803:
URL: https://github.com/apache/arrow-rs/issues/3803#issuecomment-1456921556

   > I'm not sure if I'm missing something but `foo` is not a capture group? If you instead change the pattern to contain a capture group, e.g. `(foo)` it works correctly?
   
   `(foo)` does work.


[GitHub] [arrow-rs] tustvold closed issue #3803: regexp_match skips first match when returning match

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold closed issue #3803: regexp_match skips first match when returning match
URL: https://github.com/apache/arrow-rs/issues/3803


[GitHub] [arrow-rs] tustvold commented on issue #3803: regexp_match skips first match when returning match

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #3803:
URL: https://github.com/apache/arrow-rs/issues/3803#issuecomment-1456357073

   I'm not sure if I'm missing something but `"foo"` is not a capture group? If you instead change the pattern to contain a capture group, e.g. `(foo)` it works correctly?
   
   


[GitHub] [arrow-rs] tustvold commented on issue #3803: regexp_match skips first match when returning match

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #3803:
URL: https://github.com/apache/arrow-rs/issues/3803#issuecomment-1463940542

   `label_issue.py` automatically added labels {'arrow'} from #3807

