Posted to github@arrow.apache.org by "Jefffrey (via GitHub)" <gi...@apache.org> on 2023/03/05 00:00:42 UTC
[GitHub] [arrow-rs] Jefffrey opened a new issue, #3803: regexp_match skips first match
Jefffrey opened a new issue, #3803:
URL: https://github.com/apache/arrow-rs/issues/3803
**Describe the bug**
In some cases `regexp_match` skips the first (and only) match.
For example, if the pattern is `foo` and the string to match is `foo`, it should return the single match `foo`. Currently it returns an empty array for the match: it correctly detects that there is a match, but does not return the matched text.
**To Reproduce**
Example test in [arrow-string/src/regexp.rs](https://github.com/apache/arrow-rs/blob/79518cf67a6dd5fc391e271fd92c0c21ee7e8a74/arrow-string/src/regexp.rs)
```rust
#[test]
fn sandbox() {
    // Input and pattern are both the literal `foo` (no capture group).
    let array = StringArray::from(vec![Some("foo")]);
    let pattern = GenericStringArray::<i32>::from(vec![r"foo"]);
    let actual = regexp_match(&array, &pattern, None).unwrap();
    let result = actual.as_any().downcast_ref::<ListArray>().unwrap();

    // Expected: one list entry containing the whole match `foo`.
    let elem_builder: GenericStringBuilder<i32> = GenericStringBuilder::new();
    let mut expected_builder = ListBuilder::new(elem_builder);
    expected_builder.values().append_value("foo");
    expected_builder.append(true);
    let expected = expected_builder.finish();

    assert_eq!(&expected, result);
}
```
**Expected behavior**
Test should succeed.
**Additional context**
It seems this is because the code skips the first entry of the captures by default:
https://github.com/apache/arrow-rs/blob/79518cf67a6dd5fc391e271fd92c0c21ee7e8a74/arrow-string/src/regexp.rs#L210-L218
In the test example above, `caps` has the value:
```
[arrow-string/src/regexp.rs:212] &caps = Captures(
    {
        0: Some(
            "foo",
        ),
    },
)
```
Relevant regex doc: https://docs.rs/regex/latest/regex/struct.Regex.html#method.captures
Specifically:
> Capture group `0` always corresponds to the entire match.
Original issue: https://github.com/apache/arrow-datafusion/issues/5479
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] Jefffrey commented on issue #3803: regexp_match skips first match when returning match
Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on issue #3803:
URL: https://github.com/apache/arrow-rs/issues/3803#issuecomment-1456915003
If parity with Postgres is desired, then this would be considered a bug. Relevant extract:
> If a match is found, and the `pattern` contains no parenthesized subexpressions, then the result is a single-element text array containing the substring matching the whole pattern
From: https://www.postgresql.org/docs/current/functions-matching.html
It is also somewhat confusing: a non-null value in the output ListArray indicates that a match was found (otherwise the entry would be null instead of a StringArray), yet the resulting StringArray is itself empty and does not contain the match. The behaviour seems somewhat inconsistent.
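A minimal sketch of the two behaviours discussed in this issue; it is not the actual arrow-rs implementation. It assumes the groups of a regex match can be modelled as a plain slice (index `0` is the whole match, later indices are parenthesized groups), so it runs without any crates. The function names `skip_group_zero` and `postgres_style` are hypothetical and exist only for illustration.

```rust
/// Hypothetical model of the reported behaviour: always drop group 0
/// (the whole match) and keep only the parenthesized groups.
fn skip_group_zero<'a>(groups: &[Option<&'a str>]) -> Vec<&'a str> {
    groups
        .iter()
        .skip(1)
        .copied()
        .map(|g| g.unwrap_or(""))
        .collect()
}

/// Hypothetical model of the Postgres rule quoted above: if the pattern
/// has no parenthesized groups, return the whole match as a single
/// element; otherwise return the parenthesized groups.
fn postgres_style<'a>(groups: &[Option<&'a str>]) -> Vec<&'a str> {
    if groups.len() == 1 {
        // Only group 0 exists, so the pattern had no capture groups.
        vec![groups[0].unwrap_or("")]
    } else {
        skip_group_zero(groups)
    }
}
```

With the pattern `foo` the slice holds only group `0`, so `skip_group_zero` yields an empty list while `postgres_style` yields the whole match. With `(foo)` both functions agree.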
[GitHub] [arrow-rs] tustvold commented on issue #3803: regexp_match skips first match when returning match
Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #3803:
URL: https://github.com/apache/arrow-rs/issues/3803#issuecomment-1457938997
> If parity with Postgres is desired, then this would be considered a bug
Makes sense; we should probably update the function's docs to match.
[GitHub] [arrow-rs] Weijun-H commented on issue #3803: regexp_match skips first match when returning match
Posted by "Weijun-H (via GitHub)" <gi...@apache.org>.
Weijun-H commented on issue #3803:
URL: https://github.com/apache/arrow-rs/issues/3803#issuecomment-1456921556
> I'm not sure if I'm missing something but `foo` is not a capture group? If you instead change the pattern to contain a capture group, e.g. `(foo)` it works correctly?
Using `(foo)` does work.
[GitHub] [arrow-rs] tustvold closed issue #3803: regexp_match skips first match when returning match
Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold closed issue #3803: regexp_match skips first match when returning match
URL: https://github.com/apache/arrow-rs/issues/3803
[GitHub] [arrow-rs] tustvold commented on issue #3803: regexp_match skips first match when returning match
Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #3803:
URL: https://github.com/apache/arrow-rs/issues/3803#issuecomment-1456357073
I'm not sure if I'm missing something but `"foo"` is not a capture group? If you instead change the pattern to contain a capture group, e.g. `(foo)` it works correctly?
[GitHub] [arrow-rs] tustvold commented on issue #3803: regexp_match skips first match when returning match
Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #3803:
URL: https://github.com/apache/arrow-rs/issues/3803#issuecomment-1463940542
`label_issue.py` automatically added labels {'arrow'} from #3807