Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2021/05/10 16:20:00 UTC

[jira] [Commented] (ARROW-12670) [C++] extract_regex gives bizarre behavior after nulls or non-matches

    [ https://issues.apache.org/jira/browse/ARROW-12670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17341991#comment-17341991 ] 

Antoine Pitrou commented on ARROW-12670:
----------------------------------------

Thanks for the report!

> [C++] extract_regex gives bizarre behavior after nulls or non-matches
> ---------------------------------------------------------------------
>
>                 Key: ARROW-12670
>                 URL: https://issues.apache.org/jira/browse/ARROW-12670
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 4.0.0
>            Reporter: Adam Hooper
>            Assignee: Antoine Pitrou
>            Priority: Major
>             Fix For: 4.0.1
>
>
> After a non-match, the *subsequent* string may match ... but its data is in the wrong array element.
> {code}
> >>> pa.compute.extract_regex(pa.array(["a", "b", "c", "d"]), pattern="(?P<x>[^b])")
> <pyarrow.lib.StructArray object at 0x7f80de918ee0>
> -- is_valid:
>   [
>     true,
>     false,
>     true,
>     true
>   ]
> -- child 0 type: string
>   [
>     "a",
>     "",
>     "",
>     "c"
>   ]
> {code}
> Same if trying to match after {{null}}:
> {code}
> >>> pa.compute.extract_regex(pa.array(["a", None, "c", "d", "e"]), pattern="(?P<x>[^b])")
> <pyarrow.lib.StructArray object at 0x7f80de918ee0>
> -- is_valid:
>   [
>     true,
>     false,
>     true,
>     true,
>     true
>   ]
> -- child 0 type: string
>   [
>     "a",
>     "",
>     "",
>     "c",
>     "d"
>   ]
> {code}
> Workaround: 1) filter out nulls and non-matches; 2) run extract_regex on only the matching strings; 3) use take() with a nullable index to put nulls back at the original positions:
> {code:python}
> import numpy as np
> import pyarrow as pa
> import pyarrow.compute  # makes pa.compute available
>
> def _extract_regex_workaround_arrow_12670(
>     array: pa.StringArray, *, pattern: str
> ) -> pa.StructArray:
>     # ok: True on match, False on non-match, null where the input is null
>     ok = pa.compute.match_substring_regex(array, pattern=pattern)
>     good = array.filter(ok)
>     good_matches = pa.compute.extract_regex(good, pattern=pattern)
>     # Build array that looks like [None, 0, None, 1, 2, 3, None, 4]
>     # ... ok_nonnull: [False, True, False, True, True, True, False, True]
>     # (not ok.fill_null(False).cast(pa.int8()) because of ARROW-12672 segfault)
>     ok_nonnull = pa.compute.and_kleene(ok.is_valid(), ok)
>     # ... np_ok: [0, 1, 0, 1, 1, 1, 0, 1]
>     np_ok = ok_nonnull.cast(pa.int8()).to_numpy(zero_copy_only=False)
>     # ... np_index: [-1, 0, 0, 1, 2, 3, 3, 4]
>     np_index = np.cumsum(np_ok, dtype=np.int64) - 1
>     # ... index_or_null: [None, 0, None, 1, 2, 3, None, 4]
>     # (the data bitmap of ok_nonnull is reused as the validity bitmap)
>     valid = ok_nonnull.buffers()[1]
>     index_or_null = pa.Array.from_buffers(
>         pa.int64(), len(array), [valid, pa.py_buffer(np_index)]
>     )
>     return good_matches.take(index_or_null)
> {code}
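
For comparison with the two outputs quoted above, here is a minimal pure-Python sketch of the result one would expect: one output slot per input slot, with nulls and non-matches mapping to null. The helper name expected_extract_regex is hypothetical and not part of pyarrow. The sketch assumes search-style (partial) matching and uses Python's re module, not the regex engine Arrow uses, so it only illustrates the expected alignment.

```python
import re
from typing import Dict, List, Optional


def expected_extract_regex(
    values: List[Optional[str]], pattern: str
) -> List[Optional[Dict[str, str]]]:
    """Hypothetical reference: named groups per element, None for null or non-match."""
    compiled = re.compile(pattern)
    out: List[Optional[Dict[str, str]]] = []
    for value in values:
        if value is None:
            # Null input stays null in the same position.
            out.append(None)
            continue
        match = compiled.search(value)
        # A non-match also yields null, without shifting later elements.
        out.append(match.groupdict() if match else None)
    return out


print(expected_extract_regex(["a", "b", "c", "d"], "(?P<x>[^b])"))
# [{'x': 'a'}, None, {'x': 'c'}, {'x': 'd'}]
print(expected_extract_regex(["a", None, "c", "d", "e"], "(?P<x>[^b])"))
# [{'x': 'a'}, None, {'x': 'c'}, {'x': 'd'}, {'x': 'e'}]
```

In both cases the value for "c" stays at index 2, whereas the reported output shows it shifted to index 3.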



--
This message was sent by Atlassian Jira
(v8.3.4#803005)