Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2021/11/01 20:28:00 UTC

[jira] [Commented] (ARROW-14547) Reading FixedSizeListArray from Parquet with nulls

    [ https://issues.apache.org/jira/browse/ARROW-14547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437011#comment-17437011 ] 

Antoine Pitrou commented on ARROW-14547:
----------------------------------------

[~emkornfield]

> Reading FixedSizeListArray from Parquet with nulls
> --------------------------------------------------
>
>                 Key: ARROW-14547
>                 URL: https://issues.apache.org/jira/browse/ARROW-14547
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>    Affects Versions: 6.0.0
>            Reporter: Jim Pivarski
>            Priority: Major
>
> This one is easy to describe: given an array of fixed-size lists in which some are null,
> {code:python}
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> import pyarrow.parquet
> >>> a = pa.FixedSizeListArray.from_arrays(np.arange(10), 5).take([0, None])
> >>> a
> <pyarrow.lib.FixedSizeListArray object at 0x7ff801cb2760>
> [
>   [
>     0,
>     1,
>     2,
>     3,
>     4
>   ],
>   null
> ]
> {code}
> you can write them to a Parquet file, but not read them back:
> {code:python}
> >>> pa.parquet.write_table(pa.table({"": a}), "tmp.parquet")
> >>> pa.parquet.read_table("tmp.parquet")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 1941, in read_table
>     return dataset.read(columns=columns, use_threads=use_threads,
>   File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 1776, in read
>     table = self._dataset.to_table(
>   File "pyarrow/_dataset.pyx", line 491, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 3235, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Expected all lists to be of size=5 but index 2 had size=0
> {code}
> The error message (size=0) suggests that, at some level, the null list is being treated as an empty list rather than as a null.
> For completeness, this doesn't happen if the fixed-size lists have no nulls:
> {code:python}
> >>> b = pa.FixedSizeListArray.from_arrays(np.arange(10), 5)
> >>> b
> <pyarrow.lib.FixedSizeListArray object at 0x7ff801c1ed60>
> [
>   [
>     0,
>     1,
>     2,
>     3,
>     4
>   ],
>   [
>     5,
>     6,
>     7,
>     8,
>     9
>   ]
> ]
> >>> pa.parquet.write_table(pa.table({"": b}), "tmp2.parquet")
> >>> pa.parquet.read_table("tmp2.parquet")
> pyarrow.Table
> : fixed_size_list<item: int64>[5]
>   child 0, item: int64
> ----
> : [[[0,1,2,3,4],[5,6,7,8,9]]]
> {code}
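
The hypothesis in the description (a null list being seen as an empty one) can be sketched in pure Python. This is only an illustrative model, not Arrow's or Parquet's actual reader code: it assumes a Parquet-style three-level list encoding with definition levels 0 = null list, 1 = empty list, 2 = null item, 3 = present item, and all function names (encode, list_sizes, check_naive, check_null_aware) are hypothetical. Whether the real reader fails for this reason is not confirmed by the report.

```python
# Illustrative model only; names and logic are assumptions, not Arrow internals.

def encode(lists):
    """Flatten nullable lists into Parquet-style (repetition, definition) levels."""
    rep, dfn = [], []
    for lst in lists:
        if lst is None:          # null list: one entry, no child values
            rep.append(0)
            dfn.append(0)
        elif len(lst) == 0:      # empty list: one entry, no child values
            rep.append(0)
            dfn.append(1)
        else:
            for i, item in enumerate(lst):
                rep.append(0 if i == 0 else 1)   # 0 starts a new list
                dfn.append(2 if item is None else 3)
    return rep, dfn


def list_sizes(rep, dfn):
    """Recover per-list sizes and validity; a null list comes back with size 0."""
    sizes, valid = [], []
    for r, d in zip(rep, dfn):
        if r == 0:               # start of a new list slot
            sizes.append(0)
            valid.append(d > 0)
        if d >= 2:               # an item slot (null or present) exists
            sizes[-1] += 1
    return sizes, valid


def check_naive(sizes, list_size):
    """Size check that ignores validity, so null slots fail it."""
    for i, s in enumerate(sizes):
        if s != list_size:
            raise ValueError(f"slot {i}: expected size {list_size}, got {s}")


def check_null_aware(sizes, valid, list_size):
    """Size check that skips null slots."""
    for i, (s, ok) in enumerate(zip(sizes, valid)):
        if ok and s != list_size:
            raise ValueError(f"slot {i}: expected size {list_size}, got {s}")


rep, dfn = encode([[0, 1, 2, 3, 4], None])
sizes, valid = list_sizes(rep, dfn)
check_null_aware(sizes, valid, 5)   # passes: the null slot is skipped
```

In this model the null slot decodes to size 0, so a validity-blind check rejects it while a null-aware check accepts it; in the no-nulls case both checks pass, which matches the behaviour reported above.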
