Posted to dev@arrow.apache.org by "Joris Van den Bossche (JIRA)" <ji...@apache.org> on 2019/08/05 08:53:00 UTC

[jira] [Created] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays

Joris Van den Bossche created ARROW-6132:
--------------------------------------------

             Summary: [Python] ListArray.from_arrays does not check validity of input arrays
                 Key: ARROW-6132
                 URL: https://issues.apache.org/jira/browse/ARROW-6132
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Joris Van den Bossche


From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.

When creating a ListArray from offsets and values in Python, the offsets are not validated: nothing checks that they start with 0 and end with the length of the values array. (Is that actually required? The docs seem to indicate so: https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type says "The first value in the offsets array is 0, and the last element is the length of the values array.")
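To make the invariant quoted from the docs concrete, here is a minimal pure-Python sketch of such a check. The name {{check_list_offsets}} and its error messages are hypothetical and not part of pyarrow; it only encodes the two conditions from the quoted sentence, assuming they are indeed required.

```python
def check_list_offsets(offsets, values_length):
    """Raise ValueError if offsets break the two invariants quoted from the docs.

    Hypothetical helper for illustration only; not a pyarrow API.
    """
    offsets = list(offsets)
    if not offsets:
        raise ValueError("offsets must contain at least one element")
    # "The first value in the offsets array is 0 ..."
    if offsets[0] != 0:
        raise ValueError(f"First offset is {offsets[0]}, expected 0")
    # "... and the last element is the length of the values array."
    if offsets[-1] != values_length:
        raise ValueError(
            f"Final offset {offsets[-1]} does not equal values length {values_length}"
        )


# Consistent input passes silently.
check_list_offsets([0, 2, 5], 5)

# The offsets from this report fail on both counts; the first check trips first.
try:
    check_list_offsets([1, 3, 10], 5)
except ValueError as exc:
    print(exc)
```

With the offsets used in this report ([1, 3, 10] against 5 values), both conditions are violated.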

The resulting array looks fine in its repr, but flattening it or converting it to Python objects gives wrong results:

{code}
In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 

In [62]: a
Out[62]: 
<pyarrow.lib.ListArray object at 0x7fdd9c468678>
[
  [
    1,
    2
  ],
  [
    3,
    4
  ]
]

In [63]: a.flatten()
Out[63]: 
<pyarrow.lib.Int64Array object at 0x7fdd9cbfe9e8>
[
  0,   # <--- includes the 0
  1,
  2,
  3,
  4
]

In [64]: a.to_pylist()
Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <--includes more elements as garbage
{code}


Calling {{validate}} manually correctly raises:

{code}
In [65]: a.validate()
...
ArrowInvalid: Final offset invariant not equal to values length: 10!=5
{code}

In C++ the main constructors are not safe: the caller must either ensure that the data is correct or call a safe (slower) constructor. But do we want Python to use the unsafe / fast constructors without validation by default as well? Or should we call {{validate}} here?

A quick search seems to indicate that {{pa.Array.from_buffers}} does validation, but the other {{from_arrays}} methods don't seem to do this explicitly.
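As a rough sketch of the "validate after construction" option, the pattern could look like the following. {{construct_validated}} is a hypothetical helper, not an existing pyarrow function; the only assumption is that the constructed object exposes a {{validate()}} method that raises on inconsistent data, as the ListArray in the session above does.

```python
def construct_validated(constructor, *args, **kwargs):
    """Build an array with `constructor`, then validate it before returning.

    Hypothetical wrapper for illustration; relies only on the result having
    a validate() method that raises when the data is inconsistent.
    """
    array = constructor(*args, **kwargs)
    array.validate()  # propagates the error instead of returning a bad array
    return array
```

With pyarrow this would be called as {{construct_validated(pa.ListArray.from_arrays, offsets, values)}}, so the error shown by {{a.validate()}} above would surface at construction time rather than later in {{flatten}} or {{to_pylist}}. The cost is one extra validation pass per construction, which is the trade-off the question above is about.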


