Posted to issues@arrow.apache.org by "Joris Van den Bossche (JIRA)" <ji...@apache.org> on 2019/08/07 09:23:00 UTC
[jira] [Assigned] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays
[ https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche reassigned ARROW-6132:
--------------------------------------------
Assignee: Joris Van den Bossche
> [Python] ListArray.from_arrays does not check validity of input arrays
> ----------------------------------------------------------------------
>
> Key: ARROW-6132
> URL: https://issues.apache.org/jira/browse/ARROW-6132
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Joris Van den Bossche
> Assignee: Joris Van den Bossche
> Priority: Minor
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in Python, there is no validation that the offsets start with 0 and end with the length of the values array. (But is that required? The docs seem to indicate so: https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type says "The first value in the offsets array is 0, and the last element is the length of the values array.")
> The array you get "seems" OK (judging by the repr), but converting it to Python objects or flattening it goes wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5))
> In [62]: a
> Out[62]:
> <pyarrow.lib.ListArray object at 0x7fdd9c468678>
> [
>   [
>     1,
>     2
>   ],
>   [
>     3,
>     4
>   ]
> ]
> In [63]: a.flatten()
> Out[63]:
> <pyarrow.lib.Int64Array object at 0x7fdd9cbfe9e8>
> [
>   0, # <--- includes the 0
>   1,
>   2,
>   3,
>   4
> ]
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]] # <--includes more elements as garbage
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++, the main constructors are not safe: as the caller, you need to either ensure that the data is correct or call a safe (slower) constructor. But do we want to use the unsafe, fast constructors without validation by default in Python as well? Or should we call {{validate}} here?
> A quick search seems to indicate that `pa.Array.from_buffers` does validation, but the other `from_arrays` methods don't seem to do this explicitly.
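For illustration only, the offsets invariant discussed above could be checked before construction roughly as follows. This is a minimal sketch, not pyarrow's implementation: `check_list_offsets` is a hypothetical helper, it works on plain Python sequences rather than Arrow arrays, it ignores null offsets, and it assumes the strict reading of the layout docs quoted in the issue (first offset is 0, last offset equals the values length), which the issue itself questions.

```python
def check_list_offsets(offsets, values_length):
    """Raise ValueError if `offsets` violates the list-layout invariant.

    Hypothetical helper (not part of pyarrow). Assumes the strict reading
    of the layout docs: offsets start at 0, never decrease, and end at
    the length of the values array.
    """
    offsets = list(offsets)
    if not offsets:
        raise ValueError("offsets must contain at least one element")
    if offsets[0] != 0:
        raise ValueError(f"first offset must be 0, got {offsets[0]}")
    # Each list slot spans values[offsets[i]:offsets[i + 1]], so the
    # offsets must be non-decreasing.
    for prev, cur in zip(offsets, offsets[1:]):
        if cur < prev:
            raise ValueError(f"offsets must not decrease: {prev} > {cur}")
    if offsets[-1] != values_length:
        raise ValueError(
            f"final offset {offsets[-1]} != values length {values_length}"
        )
```

With the input from the issue, `check_list_offsets([1, 3, 10], 5)` raises `ValueError` because the first offset is 1, whereas `check_list_offsets([0, 2, 5], 5)` passes silently. Whether such a check belongs in `from_arrays` by default, or should stay an explicit `validate` call, is exactly the trade-off raised above.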