Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2022/09/23 15:11:00 UTC
[jira] [Commented] (ARROW-10172) [Python] pyarrow.concat_arrays segfaults if a resulting StringArray's capacity overflows
[ https://issues.apache.org/jira/browse/ARROW-10172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17608797#comment-17608797 ]
David Li commented on ARROW-10172:
----------------------------------
I happened to see this. The segfault is gone and the cast to large_string now works, but concat_arrays still raises an offset overflow error:
{code:python}
>>> import pyarrow as pa
>>> str_array = pa.array(['a' * 128] * 10**8)
>>> str_array
<pyarrow.lib.ChunkedArray object at 0x7f7198a89c70>
...
>>> type(str_array)
<class 'pyarrow.lib.ChunkedArray'>
>>> str_array.cast(pa.large_string())
<pyarrow.lib.ChunkedArray object at 0x7f71ca5bd400>
...
>>> large_array = pa.concat_arrays([str_array] * 50)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 2884, in pyarrow.lib.concat_arrays
TypeError: Iterable should contain Array objects, got <class 'pyarrow.lib.ChunkedArray'> instead
>>> pa.concat_arrays(str_array.chunks)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 2889, in pyarrow.lib.concat_arrays
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
{code}
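For context, the overflow is consistent with simple arithmetic on offset widths: Arrow's string type uses 32-bit offsets into its value buffer, while large_string uses 64-bit offsets. Below is a rough sketch of the sizes involved, using only the numbers from the report. The variable names are just for illustration, and the chunk count is only a lower bound, not necessarily how pyarrow actually splits the data.

```python
# Back-of-the-envelope check of the offset overflow (sizes taken from the report).
INT32_MAX = 2**31 - 1   # largest offset a string array (32-bit offsets) can hold
INT64_MAX = 2**63 - 1   # largest offset a large_string array (64-bit offsets) can hold

value_bytes = 128        # each value is 'a' * 128
num_values = 10**8       # number of values in str_array
copies = 50              # pa.concat_arrays([str_array] * 50) in the original report

single_bytes = value_bytes * num_values   # bytes of string data in one str_array
total_bytes = single_bytes * copies       # bytes after concatenating 50 copies

# Lower bound on the chunks needed so each chunk stays within 32-bit offsets
# (ceiling division; illustrative only, pyarrow may chunk differently).
min_chunks = -(-single_bytes // INT32_MAX)

print(single_bytes > INT32_MAX)   # True: one contiguous string array cannot hold the data
print(min_chunks)                 # 6
print(total_bytes <= INT64_MAX)   # True: 64-bit offsets would have room
```

This would explain both why pa.array returns a ChunkedArray here and why concatenating the string chunks into a single array reports an offset overflow.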
There's more discussion now in ARROW-17828, so I'm going to close this in favor of that one to keep things in one place.
> [Python] pyarrow.concat_arrays segfaults if a resulting StringArray's capacity overflows
> ----------------------------------------------------------------------------------------
>
> Key: ARROW-10172
> URL: https://issues.apache.org/jira/browse/ARROW-10172
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.1, 2.0.0
> Reporter: Artem KOZHEVNIKOV
> Priority: Major
>
> I'm sorry if this was already reported, but there's an overflow issue when concatenating large arrays:
> {code:python}
> In [1]: import pyarrow as pa
> In [2]: str_array = pa.array(['a' * 128] * 10**8)
> In [3]: large_array = pa.concat_arrays([str_array] * 50)
> Segmentation fault (core dumped)
> {code}
> I suppose that this should be handled by upcast to large_string.