Posted to issues@arrow.apache.org by "Jim Pivarski (Jira)" <ji...@apache.org> on 2021/11/01 20:17:00 UTC

[jira] [Created] (ARROW-14547) Reading FixedSizeListArray from Parquet with nulls

Jim Pivarski created ARROW-14547:
------------------------------------

             Summary: Reading FixedSizeListArray from Parquet with nulls
                 Key: ARROW-14547
                 URL: https://issues.apache.org/jira/browse/ARROW-14547
             Project: Apache Arrow
          Issue Type: Bug
          Components: Parquet, Python
    Affects Versions: 6.0.0
            Reporter: Jim Pivarski


This one is easy to describe: given an array of fixed-size lists, some of which are null,
{code:python}
>>> import numpy as np
>>> import pyarrow as pa
>>> import pyarrow.parquet
>>> a = pa.FixedSizeListArray.from_arrays(np.arange(10), 5).take([0, None])
>>> a
<pyarrow.lib.FixedSizeListArray object at 0x7ff801cb2760>
[
  [
    0,
    1,
    2,
    3,
    4
  ],
  null
]
{code}
you can write them to a Parquet file, but not read them back:
{code:python}
>>> pa.parquet.write_table(pa.table({"": a}), "tmp.parquet")
>>> pa.parquet.read_table("tmp.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 1941, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 1776, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 491, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3235, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected all lists to be of size=5 but index 2 had size=0
{code}
Judging by the {{size=0}} in the error message, it could be that, at some level, the second (null) list is considered to be empty rather than null.
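
To make that guess concrete, here is a minimal pure-Python sketch. It is not Arrow's reader code, and the helper names {{slot_sizes}} and {{first_bad_slot}} are hypothetical. It assumes the reader rebuilds list lengths from an offsets-like sequence and then checks each length against the fixed size. In Arrow's fixed-size list layout a null slot still occupies a full-width run of child values, so if a null came back with zero child values, a check of this kind would fail. The sketch does not try to reproduce the exact index reported in the traceback.

```python
def slot_sizes(offsets):
    """Return the length of each list slot implied by an offsets sequence."""
    return [end - start for start, end in zip(offsets, offsets[1:])]


def first_bad_slot(offsets, size):
    """Return (slot, actual_size) for the first slot whose length differs
    from `size`, or None if every slot has the expected length."""
    for slot, actual in enumerate(slot_sizes(offsets)):
        if actual != size:
            return slot, actual
    return None


# [[0, 1, 2, 3, 4], null] with the null slot padded to full width,
# as the fixed-size list layout expects.
padded = [0, 5, 10]

# The same data with the null slot stored as a zero-length list
# (the situation hypothesized above).
collapsed = [0, 5, 5]

print(first_bad_slot(padded, 5))     # None
print(first_bad_slot(collapsed, 5))  # (1, 0)
```

If this is what happens, the failure is in how the null slot is sized on read, not in the data that was written.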

For completeness, this doesn't happen if the fixed-size lists have no nulls:
{code:python}
>>> b = pa.FixedSizeListArray.from_arrays(np.arange(10), 5)
>>> b
<pyarrow.lib.FixedSizeListArray object at 0x7ff801c1ed60>
[
  [
    0,
    1,
    2,
    3,
    4
  ],
  [
    5,
    6,
    7,
    8,
    9
  ]
]
>>> pa.parquet.write_table(pa.table({"": b}), "tmp2.parquet")
>>> pa.parquet.read_table("tmp2.parquet")
pyarrow.Table
: fixed_size_list<item: int64>[5]
  child 0, item: int64
----
: [[[0,1,2,3,4],[5,6,7,8,9]]]
{code}
