Posted to jira@arrow.apache.org by "Dror Speiser (Jira)" <ji...@apache.org> on 2021/05/26 19:56:00 UTC

[jira] [Closed] (ARROW-12889) [Python] compute.replace_substring_regex sometimes returns incorrect offsets, causing crashes/ub

     [ https://issues.apache.org/jira/browse/ARROW-12889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dror Speiser closed ARROW-12889.
--------------------------------
    Fix Version/s: 4.0.1
       Resolution: Fixed

> [Python] compute.replace_substring_regex sometimes returns incorrect offsets, causing crashes/ub
> ------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12889
>                 URL: https://issues.apache.org/jira/browse/ARROW-12889
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 4.0.0
>         Environment: ubuntu 20.04 or macos catalina running docker engine 20.10.2 and python 3.8.6
>            Reporter: Dror Speiser
>            Priority: Major
>              Labels: compute, pyarrow
>             Fix For: 4.0.1
>
>
> I've come across cases where calling `pyarrow.compute.replace_substring_regex` led to a segfault as soon as the result was used. After some experimentation, I found that the problem lies in the offsets buffer of the result.
> Here is a Dockerfile that reproduces the problem in a few lines (though without an immediate crash):
> {code:java}
> FROM python:3.8
> RUN pip install pyarrow
> RUN echo "import pyarrow; \
>     import pyarrow.compute; \
>     options = pyarrow.compute.ReplaceSubstringOptions('a', ''); \
>     values = [''] * 16; \
>     arr = pyarrow.array(values, pyarrow.string()); \
>     res = pyarrow.compute.replace_substring_regex(arr, options=options); \
>     offsets = res.buffers()[1]; \
>     assert any(offset != 0 for offset in offsets[-4:]);" > /test.py
> RUN python /test.py
> {code}
> The Docker image installs pyarrow (4.0.0 at the time of submitting this issue), and then runs Python code which creates an array of 16 empty strings and calls `replace_substring_regex` on it.
> The script then asserts that the last 4 bytes of the offsets buffer (which represent the last offset) are not all zero, and that assertion fails.
> Everything but the last offset looks fine: the validity buffer, the rest of the offsets, and the data buffer.
> I have more elaborate examples of arrays for which the last offset comes back as a random value, which causes crashes sooner than a plain 0 at the end does.
> Another hint that might help: the problem occurs when the array length is a multiple of 16, i.e. changing 16 to 32, 48, etc. still shows the problem, but other lengths don't.
>   
> When I cloned the latest master, built Arrow, and ran the example, there was no problem. But since I didn't see the issue here on JIRA, I thought I should probably post it. I have no idea if I'm building correctly, and maybe I'm adding a bug to a bug :)
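
For readers unfamiliar with the layout being checked in the report: an Arrow string array of length n carries n + 1 little-endian int32 offsets into its data buffer, so each offset must be at least as large as the previous one and the last offset must not exceed the size of the data buffer. Below is a minimal, stdlib-only sketch of that consistency check. The helper names `decode_offsets` and `check_string_offsets` are hypothetical (they are not pyarrow APIs), and the sketch assumes an unsliced array, i.e. an array offset of 0.

```python
import struct


def decode_offsets(raw: bytes) -> list:
    """Decode a bytes object of little-endian int32 offsets into a list of ints."""
    if len(raw) % 4 != 0:
        raise ValueError("offsets buffer length must be a multiple of 4 bytes")
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}i", raw))


def check_string_offsets(raw_offsets: bytes, length: int, data_size: int) -> list:
    """Return a list of problems found; an empty list means the offsets look consistent.

    Assumption: the array is unsliced, so the first `length + 1` offsets apply.
    """
    problems = []
    needed = (length + 1) * 4
    if len(raw_offsets) < needed:
        problems.append(f"offsets buffer has {len(raw_offsets)} bytes, need {needed}")
        return problems

    # Buffers may be padded, so only look at the offsets the array actually uses.
    offsets = decode_offsets(raw_offsets[:needed])
    if offsets[0] < 0:
        problems.append(f"first offset is negative: {offsets[0]}")
    for i in range(length):
        if offsets[i + 1] < offsets[i]:
            problems.append(
                f"offset {i + 1} ({offsets[i + 1]}) is smaller than offset {i} ({offsets[i]})"
            )
    if offsets[-1] > data_size:
        problems.append(f"last offset {offsets[-1]} exceeds data size {data_size}")
    return problems


# Offsets for the strings ["ab", "", "cde"], whose data buffer is b"abcde".
good = struct.pack("<4i", 0, 2, 2, 5)
# Same array, but with a corrupted last offset.
bad = struct.pack("<4i", 0, 2, 2, 1)

print(check_string_offsets(good, length=3, data_size=5))  # -> []
print(check_string_offsets(bad, length=3, data_size=5))
```

With pyarrow installed, the inputs for a string array `res` can be taken from `res.buffers()[1].to_pybytes()`, `len(res)`, and `res.buffers()[2].size`. pyarrow also provides `Array.validate(full=True)`, which is intended to perform checks of this kind; whether it flags the specific case in this report on 4.0.0 is not something the report states.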



--
This message was sent by Atlassian Jira
(v8.3.4#803005)