Posted to jira@arrow.apache.org by "Niranda Perera (Jira)" <ji...@apache.org> on 2021/05/13 19:59:00 UTC
[jira] [Comment Edited] (ARROW-12774) [C++][Compute] replace_substring_regex() creates invalid arrays => crash
[ https://issues.apache.org/jira/browse/ARROW-12774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344106#comment-17344106 ]
Niranda Perera edited comment on ARROW-12774 at 5/13/21, 7:58 PM:
------------------------------------------------------------------
[~apitrou] I tried the following test cases in C++:
{code:java}
// ARROW-12774
ReplaceSubstringOptions options_regex3{"X", "Y"};
this->CheckUnary("replace_substring_regex", R"(["AAAAAAAAAAAAAAAA"])", this->type(),
                 R"(["AAAAAAAAAAAAAAAA"])", &options_regex3);
this->CheckUnary("replace_substring_regex",
                 R"(["A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"])",
                 this->type(),
                 R"(["A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"])",
                 &options_regex3);
{code}
The 2nd case fails with the following error.
{code:java}
../src/arrow/compute/kernels/test_util.cc:52: Failure
Failed
'actual->ValidateFull()' failed with Invalid: Negative offsets in binary array
../src/arrow/compute/kernels/test_util.cc:90: Failure
Failed
replace_substring_regex() = != A
../src/arrow/compute/kernels/test_util.cc:90: Failure
Failed
replace_substring_regex(\0) = \0 != A
../src/arrow/compute/kernels/test_util.cc:90: Failure
Failed
replace_substring_regex(\0) = \0 != A
../src/arrow/compute/kernels/test_util.cc:90: Failure
Failed
replace_substring_regex(\0) = \0 != A
../src/arrow/testing/gtest_util.cc:132: Failure
Failed
@@ -0, +0 @@
-"A"
-"A"
-"A"
-"A"
+""
+"\0"
+"\0"
+"\0"
Expected:
[
"A",
"A",
"A",
"A",
"A"
]
Actual:
[
"",
"\0",
"\0",
"\0",
"A"
]
../src/arrow/testing/gtest_util.cc:221: Failure
Failed
Got:
[
[
"",
"\0",
"\0",
"\0",
"A"
],
[
"A",
"A",
"A",
"A",
"A",
"A",
"A",
"A",
"A",
"A",
"A"
]
]
Expected:
[
[
"A",
"A",
"A",
"A",
"A"
],
[
"A",
"A",
"A",
"A",
"A",
"A",
"A",
"A",
"A",
"A",
"A"
]
]
{code}
So, I believe the bug is confirmed. Let me see if I can figure out the reason behind this.
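For context, the "Negative offsets" message above and the "non-monotonic offset" message in the issue description both concern the offsets buffer of a string array: every offset has to be non-negative and no smaller than the one before it. The following is only a simplified Python sketch of those two invariants, not Arrow's actual validation code; {{check_offsets}} is a hypothetical helper name and the message wording is just modelled on the errors quoted in this issue.

```python
from typing import List, Optional


def check_offsets(offsets: List[int], data_length: int) -> Optional[str]:
    """Return a description of the first violated invariant, or None if valid.

    Simplified illustration only; this is not Arrow's implementation.
    """
    if not offsets:
        # Nothing to check for an empty offsets list in this sketch.
        return None
    previous = offsets[0]
    if previous < 0:
        return "negative offset at slot 0"
    for slot, current in enumerate(offsets[1:], start=1):
        if current < 0:
            return f"negative offset at slot {slot}"
        if current < previous:
            return f"non-monotonic offset at slot {slot}: {current} < {previous}"
        previous = current
    if previous > data_length:
        return f"last offset {previous} exceeds data length {data_length}"
    return None


# Five one-character strings: offsets grow by one per slot, so this is valid.
print(check_offsets([0, 1, 2, 3, 4, 5], data_length=5))
# An offset that jumps back to 0 breaks monotonicity.
print(check_offsets([0, 1, 2, 0, 4, 5], data_length=5))
```

An output array whose offsets were never written (or were left at zero) would fail a check of this kind, which would be consistent with the empty and "\0" strings in the actual output above, although the root cause is still to be determined.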
was (Author: niranda):
[~apitrou] I tried the following test cases in C++
{code:java}
// ARROW-12774
ReplaceSubstringOptions options_regex3{"X", "Y"};
this->CheckUnary("replace_substring_regex", R"(["AAAAAAAAAAAAAAAA"])", this->type(),
                 R"(["AAAAAAAAAAAAAAAA"])", &options_regex3);
this->CheckUnary("replace_substring_regex",
                 R"(["A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"])",
                 this->type(),
                 R"(["A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"])",
                 &options_regex3);
{code}
The 2nd case fails with the following error.
{code:java}
../src/arrow/compute/kernels/test_util.cc:52: Failure
Failed
'actual->ValidateFull()' failed with Invalid: Negative offsets in binary array
../src/arrow/compute/kernels/test_util.cc:90: Failure
Failed
replace_substring_regex(A) = A !=
../src/arrow/compute/kernels/test_util.cc:90: Failure
Failed
replace_substring_regex(A) = A != \0
../src/arrow/compute/kernels/test_util.cc:90: Failure
Failed
replace_substring_regex(A) = A != \0
../src/arrow/compute/kernels/test_util.cc:90: Failure
Failed
replace_substring_regex(A) = A != \0
../src/arrow/testing/gtest_util.cc:132: Failure
Failed
@@ -0, +0 @@
-""
-"\0"
-"\0"
-"\0"
@@ -5, +1 @@
+"A"
+"A"
+"A"
+"A"
Expected:
[
"",
"\0",
"\0",
"\0",
"A"
]
Actual:
[
"A",
"A",
"A",
"A",
"A"
]
../src/arrow/testing/gtest_util.cc:221: Failure
Failed
Got:
[
[
"A",
"A",
"A",
"A",
"A"
],
[
"A",
"A",
"A",
"A",
"A",
"A",
"A",
"A",
"A",
"A",
"A"
]
]
Expected:
[
[
"",
"\0",
"\0",
"\0",
"A"
],
[
"A",
"A",
"A",
"A",
"A",
"A",
"A",
"A",
"A",
"A",
"A"
]
]
{code}
So, I believe the bug is confirmed. Let me see if I can figure out the reason behind this.
> [C++][Compute] replace_substring_regex() creates invalid arrays => crash
> ------------------------------------------------------------------------
>
> Key: ARROW-12774
> URL: https://issues.apache.org/jira/browse/ARROW-12774
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 4.0.0
> Reporter: Adam Hooper
> Priority: Major
> Fix For: 4.0.1
>
>
> Minimal reproduction:
> {code:python}
> import pyarrow as pa
> import pyarrow.compute  # ensures pa.compute is loaded
>
> arr = pa.array(['A'] * 16)
> arr2 = pa.compute.replace_substring_regex(arr, pattern="X", replacement="Y")
> arr2.validate(full=True)
> {code}
> Expected results: a valid array
> Actual results: {{pyarrow.lib.ArrowInvalid: Offset invariant failure: non-monotonic offset at slot 64: 0 < 63}}
> So if you run {{arr.diff(arr2)}}, you'll get something like:
> {code:java}
> terminate called after throwing an instance of 'std::length_error'
> what(): basic_string::_S_create
> Aborted (core dumped)
> {code}
> This seems to happen if and only if the input array length is a multiple of 16. That leads to an ugly workaround:
> {code:python}
> import pyarrow as pa
> import pyarrow.compute  # ensures pa.compute is loaded
>
> def replace_substring_regex_workaround_12774(
>     array: pa.Array,
>     *,
>     pattern: str,
>     replacement: str
> ) -> pa.Array:
>     if len(array) > 0 and len(array) % 16 == 0:
>         # Split so that neither chunk's length is a multiple of 16.
>         chunked_array = pa.chunked_array([array.slice(0, 1), array.slice(1)], type=array.type)
>         return pa.compute.replace_substring_regex(
>             chunked_array,
>             pattern=pattern,
>             replacement=replacement
>         ).combine_chunks()
>     else:
>         return pa.compute.replace_substring_regex(
>             array,
>             pattern=pattern,
>             replacement=replacement
>         )
> {code}
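The workaround quoted above depends only on array lengths: when the length is a non-zero multiple of 16, it splits the input into chunks of length 1 and n - 1, neither of which is a multiple of 16. Below is a small pure-Python sketch of that length logic alone. {{workaround_chunk_lengths}} is a hypothetical helper used for illustration; it does not call pyarrow and takes the reporter's "multiple of 16" observation as given.

```python
from typing import List


def workaround_chunk_lengths(n: int) -> List[int]:
    """Chunk lengths the quoted workaround would hand to the kernel.

    Hypothetical helper for illustration; it mirrors only the length logic.
    """
    if n > 0 and n % 16 == 0:
        # Corresponds to array.slice(0, 1) and array.slice(1).
        return [1, n - 1]
    return [n]


for n in (0, 15, 16, 32, 33):
    print(n, workaround_chunk_lengths(n))
```

Since n - 1 leaves a remainder of 15 when n is a multiple of 16, no chunk produced this way has a length that is a non-zero multiple of 16.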