Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2022/03/03 17:30:00 UTC

[jira] [Updated] (ARROW-15837) [C++][Python][Doc] ListArray.offsets is wrong when it contains both lists and null values

     [ https://issues.apache.org/jira/browse/ARROW-15837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-15837:
-----------------------------------
    Summary: [C++][Python][Doc] ListArray.offsets is wrong when it contains both lists and null values  (was: [Python] ListArray.offsets is wrong when it contains both lists and null values)

> [C++][Python][Doc] ListArray.offsets is wrong when it contains both lists and null values
> -----------------------------------------------------------------------------------------
>
>                 Key: ARROW-15837
>                 URL: https://issues.apache.org/jira/browse/ARROW-15837
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Documentation, Python
>    Affects Versions: 7.0.0
>            Reporter: quentin lhoest
>            Priority: Major
>             Fix For: 8.0.0
>
>
> Hi! I noticed this bug when running this code:
> {code:python}
> import pyarrow as pa
> arr = pa.array([None, [0]])
> reconstructed_arr = pa.ListArray.from_arrays(arr.offsets, arr.values)
> print(reconstructed_arr.to_pylist())
> # [[], [0]] {code}
> The resulting array, reconstructed from the offsets and values of the original array, *is not the same as the original array*.
> This seems to be because `arr.offsets` is wrong: it returns `[0, 0, 1]` instead of `[None, 0, 1]`:
> {code:python}
> print(arr.offsets.to_pylist())
> # [0, 0, 1]
> fixed_reconstructed_arr = pa.ListArray.from_arrays(pa.array([None, 0, 1]), arr.values)
> print(fixed_reconstructed_arr.to_pylist())
> # [None, [0]]{code}
> If it can help, here is my investigation:
> The offsets seem to be wrong because they don't include the validity bitmap from `arr.buffers()[0]`, which marks which values are null and which are non-null. Therefore the `None` is replaced by `0`.
> Even setting aside that the validity bitmap is not taken into account, I checked its value and it was not what I expected: the validity bitmap at `arr.buffers()[0]` is supposed to be `110` (in order to mask the None in `[None, 0, 1]`), but it is `10` for some reason:
> {code:python}
> bin(int(arr.buffers()[0].hex(), 16))
> # '0b10'
> # I think it should be 0b110 - 1 corresponds to non-null and 0 corresponds to null, if you take the bits in reverse order {code}
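
For reference, here is a minimal pure-Python sketch of how the validity bitmap and the offsets could be combined. It assumes the layout described in the Arrow columnar format: one validity bit per list (not per offset), read least-significant bit first, with one more offset than there are lists. The helper names `validity_bits` and `rebuild_lists` are hypothetical and not part of pyarrow. Under that assumption, an array of two lists needs only two bits, so `0b10` would be consistent with `[None, [0]]`.

```python
def validity_bits(bitmap: bytes, length: int) -> list:
    """Decode `length` validity flags, least-significant bit first (assumed layout)."""
    return [bool((bitmap[i // 8] >> (i % 8)) & 1) for i in range(length)]


def rebuild_lists(offsets, values, valid):
    """Slice `values` using consecutive offsets; emit None where the slot is invalid."""
    result = []
    for i, is_valid in enumerate(valid):
        if is_valid:
            result.append(values[offsets[i]:offsets[i + 1]])
        else:
            result.append(None)
    return result


# Values taken from the report above.
bitmap = bytes([0b10])   # arr.buffers()[0]
offsets = [0, 0, 1]      # arr.offsets
values = [0]             # arr.values

# Two lists means two validity bits, i.e. len(offsets) - 1.
valid = validity_bits(bitmap, len(offsets) - 1)
print(valid)                                   # [False, True]
print(rebuild_lists(offsets, values, valid))   # [None, [0]]
```

In this reading, the offsets alone do not carry nullness, so reconstructing the original array requires passing the validity information separately.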



--
This message was sent by Atlassian Jira
(v8.20.1#820001)