Posted to jira@arrow.apache.org by "Niranda Perera (Jira)" <ji...@apache.org> on 2021/05/14 01:15:00 UTC

[jira] [Assigned] (ARROW-12774) [C++][Compute] replace_substring_regex() creates invalid arrays => crash

     [ https://issues.apache.org/jira/browse/ARROW-12774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niranda Perera reassigned ARROW-12774:
--------------------------------------

    Assignee: Niranda Perera

> [C++][Compute] replace_substring_regex() creates invalid arrays => crash
> ------------------------------------------------------------------------
>
>                 Key: ARROW-12774
>                 URL: https://issues.apache.org/jira/browse/ARROW-12774
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 4.0.0
>            Reporter: Adam Hooper
>            Assignee: Niranda Perera
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.1
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Minimal reproduction:
> {code:python}
> import pyarrow as pa
> import pyarrow.compute  # makes pa.compute available
>
> arr = pa.array(['A'] * 16)
> arr2 = pa.compute.replace_substring_regex(arr, pattern="X", replacement="Y")
> arr2.validate(full=True)
> {code}
> Expected results: a valid array
> Actual results: {{pyarrow.lib.ArrowInvalid: Offset invariant failure: non-monotonic offset at slot 64: 0 < 63}}
> Because the result is invalid, later operations on it can crash the process. For example, {{arr.diff(arr2)}} produces something like:
> {code}
> terminate called after throwing an instance of 'std::length_error'
>   what():  basic_string::_S_create
> Aborted (core dumped)
> {code}
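> This seems to happen if and only if the input array length is a multiple of 16. That leads to an ugly workaround: split the array into two chunks whose lengths (1 and n - 1) are not multiples of 16, run the replacement on the chunked array, and recombine: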
> {code:python}
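> import pyarrow as pa
> import pyarrow.compute  # makes pa.compute available
>
> def replace_substring_regex_workaround_12774(
>     array: pa.Array,
>     *,
>     pattern: str,
>     replacement: str
> ) -> pa.Array:
>     if len(array) > 0 and len(array) % 16 == 0:
>         # Split into chunks of length 1 and n - 1 so neither is a multiple of 16.
>         chunked_array = pa.chunked_array([array.slice(0, 1), array.slice(1)], type=array.type)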
>         return pa.compute.replace_substring_regex(
>             chunked_array,
>             pattern=pattern,
>             replacement=replacement
>         ).combine_chunks()
>     else:
>         return pa.compute.replace_substring_regex(
>             array,
>             pattern=pattern,
>             replacement=replacement
>         )
> {code}
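
The "non-monotonic offset" error quoted above refers to the offsets of a variable-length string array: slot i holds the bytes between offsets[i] and offsets[i + 1], so the offsets must never decrease. The following is a minimal pure-Python sketch of that invariant, not Arrow's actual validation code. The function name and both sample offset lists are hypothetical and chosen only for illustration, and the way Arrow numbers the failing slot in its message may differ from the index returned here.

```python
from typing import Optional, Sequence


def first_non_monotonic_slot(offsets: Sequence[int]) -> Optional[int]:
    """Return the first index i where offsets[i] < offsets[i - 1], or None."""
    for i in range(1, len(offsets)):
        if offsets[i] < offsets[i - 1]:
            return i
    return None


# Offsets for ['A', 'A', 'A']: each value is one byte long.
valid_offsets = [0, 1, 2, 3]

# Hypothetical corrupted offsets: the last entry falls back to 0,
# which would make the final string's length negative.
broken_offsets = [0, 1, 2, 0]

print(first_non_monotonic_slot(valid_offsets))
print(first_non_monotonic_slot(broken_offsets))
```

A reader that trusts such offsets computes a negative length for the affected slot, which is consistent with the {{std::length_error}} abort shown in the report.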



--
This message was sent by Atlassian Jira
(v8.3.4#803005)