Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2022/09/23 15:11:00 UTC

[jira] [Commented] (ARROW-10172) [Python] pyarrow.concat_arrays segfaults if a resulting StringArray's capacity overflows

    [ https://issues.apache.org/jira/browse/ARROW-10172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17608797#comment-17608797 ] 

David Li commented on ARROW-10172:
----------------------------------

I happened to see this. The segfault is gone and casting to large_string now works, but concat_arrays still fails with an offset overflow error.

{code:python}
>>> import pyarrow as pa
>>> str_array = pa.array(['a' * 128] * 10**8)
>>> str_array
<pyarrow.lib.ChunkedArray object at 0x7f7198a89c70>
...
>>> type(str_array)
<class 'pyarrow.lib.ChunkedArray'>
>>> str_array.cast(pa.large_string())
<pyarrow.lib.ChunkedArray object at 0x7f71ca5bd400>
...
>>> large_array = pa.concat_arrays([str_array] * 50)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 2884, in pyarrow.lib.concat_arrays
TypeError: Iterable should contain Array objects, got <class 'pyarrow.lib.ChunkedArray'> instead
>>> pa.concat_arrays(str_array.chunks)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 2889, in pyarrow.lib.concat_arrays
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
{code}

There's more discussion now in ARROW-17828, so I'm going to close this in favor of that one to keep things in one place.

> [Python] pyarrow.concat_arrays segfaults if a resulting StringArray's capacity overflows
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-10172
>                 URL: https://issues.apache.org/jira/browse/ARROW-10172
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1, 2.0.0
>            Reporter: Artem KOZHEVNIKOV
>            Priority: Major
>
> I'm sorry if this was already reported, but there's an overflow issue when concatenating large arrays:
> {code:python}
> In [1]: import pyarrow as pa
> In [2]: str_array = pa.array(['a' * 128] * 10**8)
> In [3]: large_array = pa.concat_arrays([str_array] * 50)
> Segmentation fault (core dumped)
> {code}
> I suppose that this should be handled by upcasting to large_string.


