Posted to jira@arrow.apache.org by "Adam Hooper (Jira)" <ji...@apache.org> on 2021/05/13 13:18:00 UTC

[jira] [Updated] (ARROW-12774) replace_substring_regex() creates invalid arrays => crash

     [ https://issues.apache.org/jira/browse/ARROW-12774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Hooper updated ARROW-12774:
--------------------------------
    Description: 
{code:python}
import pyarrow as pa
import pyarrow.compute  # makes pa.compute available

arr = pa.array(['A'] * 16)
arr2 = pa.compute.replace_substring_regex(arr, pattern="X", replacement="Y")
arr2.validate(full=True)
{code}

Expected results: a valid array
Actual results: {{pyarrow.lib.ArrowInvalid: Offset invariant failure: non-monotonic offset at slot 64: 0 < 63}}

Because the result is invalid, later operations on it can crash. For example, {{arr.diff(arr2)}} gives something like:

{code}
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_S_create
Aborted (core dumped)
{code}
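
For readers unfamiliar with the "non-monotonic offset" message: a string array stores n + 1 offsets into one data buffer, and those offsets must never decrease. The following is a minimal pure-Python sketch of that invariant, not Arrow's actual validation code. The function name and the corrupted example buffer are hypothetical, and the slot numbering here is not claimed to match Arrow's error message.

```python
from typing import Optional, Sequence


def first_non_monotonic_slot(offsets: Sequence[int]) -> Optional[int]:
    """Return the first slot whose end offset is smaller than its start offset.

    `offsets` stands for the n + 1 entries of a string array's offsets
    buffer: value i occupies data[offsets[i]:offsets[i + 1]].
    """
    for slot in range(len(offsets) - 1):
        if offsets[slot + 1] < offsets[slot]:
            return slot
    return None


# Offsets for sixteen one-byte strings: 0, 1, ..., 16.
valid_offsets = list(range(17))
# Hypothetical corrupted buffer: the final offset was left at 0.
broken_offsets = list(range(16)) + [0]

print(first_non_monotonic_slot(valid_offsets))   # no violation found
print(first_non_monotonic_slot(broken_offsets))  # index of the bad slot
```

A consumer that trusts such offsets would compute a negative length for the bad slot, which is consistent with the {{std::length_error}} seen above.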

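The invalid result appears whenever the input array length is a multiple of 16. That leads to an ugly workaround: split such an array into two chunks of lengths 1 and n - 1, neither of which is a multiple of 16, then combine the results: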

{code:python}
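def replace_substring_regex_workaround_12774(array: pa.Array, *, pattern: str, replacement: str) -> pa.Array:
    # Per this report, the bug appears when the length is a multiple of 16.
    # Empty arrays are excluded because they cannot be split.
    if len(array) > 0 and len(array) % 16 == 0:
        # Split into chunks of length 1 and n - 1, so neither length is a multiple of 16.
        chunked_array = pa.chunked_array([array.slice(0, 1), array.slice(1)], type=array.type)
        return pa.compute.replace_substring_regex(chunked_array, pattern=pattern, replacement=replacement).combine_chunks()
    else:
        # Other lengths do not trigger the bug; call the kernel directly.
        return pa.compute.replace_substring_regex(array, pattern=pattern, replacement=replacement)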
{code}
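
To see why the split sidesteps the trigger, here is a small pure-Python sketch of the chunk lengths the workaround hands to the kernel. It assumes, as this report suggests but does not prove, that the bug depends only on the length of each chunk. The helper name is hypothetical and no pyarrow calls are made.

```python
from typing import List


def workaround_chunk_lengths(n: int) -> List[int]:
    """Chunk lengths the workaround would pass to the kernel for n rows."""
    if n > 0 and n % 16 == 0:
        # Same split as array.slice(0, 1) and array.slice(1).
        return [1, n - 1]
    return [n]


for n in (0, 15, 16, 32, 33):
    lengths = workaround_chunk_lengths(n)
    # No non-empty chunk has a length that is a multiple of 16.
    assert all(length == 0 or length % 16 != 0 for length in lengths)
    print(n, lengths)
```

Since n - 1 is one less than a multiple of 16, it can never itself be a multiple of 16, so a single split is enough under that assumption.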

  was:
{code:python}
arr = pa.array(['Brussels', 'Brussels', 'Brussels', 'Brussels', 'Brussels', 'Brussels', 'Brussels', 'Brussels', 'Flanders', 'Flanders', 'Flanders', 'Flanders', 'Flanders', 'Flanders', 'Flanders', 'Flanders', 'Ostbelgien', 'Ostbelgien', 'Ostbelgien', 'Ostbelgien', 'Ostbelgien', 'Ostbelgien', 'Ostbelgien', 'Ostbelgien', 'Wallonia', 'Wallonia', 'Wallonia', 'Wallonia', 'Wallonia', 'Wallonia', 'Wallonia', 'Wallonia'])
arr2 = pa.compute.replace_substring_regex(arr, pattern="X", replacement="Y")
arr2.validate(full=True)
{code}

Expected results: a valid array
Actual results: {{pyarrow.lib.ArrowInvalid: Offset invariant failure: non-monotonic offset at slot 32: 0 < 264}}

So if you run {{arr.diff(arr2)}}, you'll get:

{code}
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_S_create
Aborted (core dumped)
{code}


> replace_substring_regex() creates invalid arrays => crash
> ---------------------------------------------------------
>
>                 Key: ARROW-12774
>                 URL: https://issues.apache.org/jira/browse/ARROW-12774
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 4.0.0
>            Reporter: Adam Hooper
>            Priority: Major


