Posted to jira@arrow.apache.org by "Niranda Perera (Jira)" <ji...@apache.org> on 2021/05/13 19:59:00 UTC

[jira] [Comment Edited] (ARROW-12774) [C++][Compute] replace_substring_regex() creates invalid arrays => crash

    [ https://issues.apache.org/jira/browse/ARROW-12774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344106#comment-17344106 ] 

Niranda Perera edited comment on ARROW-12774 at 5/13/21, 7:58 PM:
------------------------------------------------------------------

[~apitrou] I tried the following test cases in C++:
{code:java}
// ARROW-12774
  ReplaceSubstringOptions options_regex3{"X", "Y"};
  this->CheckUnary("replace_substring_regex", R"(["AAAAAAAAAAAAAAAA"])", this->type(),
                   R"(["AAAAAAAAAAAAAAAA"])", &options_regex3);
  this->CheckUnary("replace_substring_regex", R"(["A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"])", this->type(),
                   R"(["A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"])", &options_regex3);
{code}
The second case fails with the following error:
{code:java}
../src/arrow/compute/kernels/test_util.cc:52: Failure
Failed
'actual->ValidateFull()' failed with Invalid: Negative offsets in binary array
../src/arrow/compute/kernels/test_util.cc:90: Failure
Failed
replace_substring_regex() =  != A
../src/arrow/compute/kernels/test_util.cc:90: Failure
Failed
replace_substring_regex(\0) = \0 != A
../src/arrow/compute/kernels/test_util.cc:90: Failure
Failed
replace_substring_regex(\0) = \0 != A
../src/arrow/compute/kernels/test_util.cc:90: Failure
Failed
replace_substring_regex(\0) = \0 != A
../src/arrow/testing/gtest_util.cc:132: Failure
Failed@@ -0, +0 @@
-"A"
-"A"
-"A"
-"A"
+""
+"\0"
+"\0"
+"\0"
Expected:
  [
    "A",
    "A",
    "A",
    "A",
    "A"
  ]
Actual:
  [
    "",
    "\0",
    "\0",
    "\0",
    "A"
  ]
../src/arrow/testing/gtest_util.cc:221: Failure
Failed
Got: 
  [
    [
      "",
      "\0",
      "\0",
      "\0",
      "A"
    ],
    [
      "A",
      "A",
      "A",
      "A",
      "A",
      "A",
      "A",
      "A",
      "A",
      "A",
      "A"
    ]
  ]
Expected: 
  [
    [
      "A",
      "A",
      "A",
      "A",
      "A"
    ],
    [
      "A",
      "A",
      "A",
      "A",
      "A",
      "A",
      "A",
      "A",
      "A",
      "A",
      "A"
    ]
  ]
{code}
So, I believe the bug is confirmed. Let me see if I can figure out the reason behind it.
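
For context, the "Negative offsets in binary array" failure concerns the offsets buffer of the string array: a variable-size binary array with n slots carries n + 1 offsets, which must be non-negative, non-decreasing, and no larger than the size of the data buffer. The following is only a rough Python sketch of that invariant, not Arrow's actual ValidateFull() code; the function name find_offset_violation and its message strings are made up for illustration.

```python
from typing import List, Optional


def find_offset_violation(offsets: List[int], data_length: int) -> Optional[str]:
    """Return a description of the first broken invariant, or None if valid.

    Hypothetical helper: it mirrors the kind of checks a full validation
    would make on a string array's offsets, but it is not Arrow code.
    """
    if not offsets:
        return "offsets buffer is empty"
    if offsets[0] < 0:
        return f"negative offset at slot 0: {offsets[0]}"
    for i in range(1, len(offsets)):
        if offsets[i] < 0:
            return f"negative offset at slot {i}: {offsets[i]}"
        if offsets[i] < offsets[i - 1]:
            return f"non-monotonic offset at slot {i}: {offsets[i]} < {offsets[i - 1]}"
    if offsets[-1] > data_length:
        return f"last offset {offsets[-1]} exceeds data length {data_length}"
    return None


# Sixteen one-character strings: offsets 0, 1, ..., 16 over 16 bytes of data.
good_offsets = list(range(17))
print(find_offset_violation(good_offsets, data_length=16))

# An offsets buffer that falls back to 0 partway through is rejected.
bad_offsets = [0, 1, 2, 0]
print(find_offset_violation(bad_offsets, data_length=3))
```

An output array whose offsets were never written correctly, as the empty and "\0" entries in the failing case suggest, would fail a check of this kind.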


was (Author: niranda):
[~apitrou] I tried the following test cases in C++ 
{code:java}
// ARROW-12774
  ReplaceSubstringOptions options_regex3{"X", "Y"};
  this->CheckUnary("replace_substring_regex", R"(["AAAAAAAAAAAAAAAA"])", this->type(),
                   R"(["AAAAAAAAAAAAAAAA"])", &options_regex3);
  this->CheckUnary("replace_substring_regex", R"(["A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"])", this->type(),
                   R"(["A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"])", &options_regex3);
{code}
The 2nd case fails with the following error.
{code:java}
../src/arrow/compute/kernels/test_util.cc:52: Failure
Failed
'actual->ValidateFull()' failed with Invalid: Negative offsets in binary array
../src/arrow/compute/kernels/test_util.cc:90: Failure
Failed
replace_substring_regex(A) = A != 
../src/arrow/compute/kernels/test_util.cc:90: Failure
Failed
replace_substring_regex(A) = A != \0
../src/arrow/compute/kernels/test_util.cc:90: Failure
Failed
replace_substring_regex(A) = A != \0
../src/arrow/compute/kernels/test_util.cc:90: Failure
Failed
replace_substring_regex(A) = A != \0
../src/arrow/testing/gtest_util.cc:132: Failure
Failed@@ -0, +0 @@
-""
-"\0"
-"\0"
-"\0"
@@ -5, +1 @@
+"A"
+"A"
+"A"
+"A"
Expected:
  [
    "",
    "\0",
    "\0",
    "\0",
    "A"
  ]
Actual:
  [
    "A",
    "A",
    "A",
    "A",
    "A"
  ]
../src/arrow/testing/gtest_util.cc:221: Failure
Failed
Got: 
  [
    [
      "A",
      "A",
      "A",
      "A",
      "A"
    ],
    [
      "A",
      "A",
      "A",
      "A",
      "A",
      "A",
      "A",
      "A",
      "A",
      "A",
      "A"
    ]
  ]
Expected: 
  [
    [
      "",
      "\0",
      "\0",
      "\0",
      "A"
    ],
    [
      "A",
      "A",
      "A",
      "A",
      "A",
      "A",
      "A",
      "A",
      "A",
      "A",
      "A"
    ]
  ]
{code}
So, I believe the bug is confirmed. Let me see if I could figure out the reason behind this.

> [C++][Compute] replace_substring_regex() creates invalid arrays => crash
> ------------------------------------------------------------------------
>
>                 Key: ARROW-12774
>                 URL: https://issues.apache.org/jira/browse/ARROW-12774
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 4.0.0
>            Reporter: Adam Hooper
>            Priority: Major
>             Fix For: 4.0.1
>
>
> min
> {code:python}
> arr = pa.array(['A'] * 16)
> arr2 = pa.compute.replace_substring_regex(arr, pattern="X", replacement="Y")
> arr2.validate(full=True)
> {code}
> Expected results: a valid array
> Actual results: {{pyarrow.lib.ArrowInvalid: Offset invariant failure: non-monotonic offset at slot 64: 0 < 63}}
> So if you run {{arr.diff(arr2)}}, you'll get something like:
> {code:java}
> terminate called after throwing an instance of 'std::length_error'
>   what():  basic_string::_S_create
> Aborted (core dumped)
> {code}
> This seems to happen if and only if the input array length is a multiple of 16. That leads to an ugly workaround:
> {code:python}
> def replace_substring_regex_workaround_12774(
>     array: pa.Array,
>     *,
>     pattern: str,
>     replacement: str
> ) -> pa.Array:
>     if len(array) > 0 and len(array) % 16 == 0:
>         chunked_array = pa.chunked_array([array.slice(0, 1), array.slice(1)], type=array.type)
>         return pa.compute.replace_substring_regex(
>             chunked_array,
>             pattern=pattern,
>             replacement=replacement
>         ).combine_chunks()
>     else:
>         return pa.compute.replace_substring_regex(
>             array,
>             pattern=pattern,
>             replacement=replacement
>         )
> {code}


