Posted to jira@arrow.apache.org by "Jim Pivarski (Jira)" <ji...@apache.org> on 2021/10/27 01:29:00 UTC
[jira] [Created] (ARROW-14485) ParquetFile.read_row_group loses
struct nullability when selecting one column from a struct
Jim Pivarski created ARROW-14485:
------------------------------------
Summary: ParquetFile.read_row_group loses struct nullability when selecting one column from a struct
Key: ARROW-14485
URL: https://issues.apache.org/jira/browse/ARROW-14485
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 6.0.0
Reporter: Jim Pivarski
Attachments: test8.parquet
This appeared minutes ago: our test suite caught it as soon as Arrow 6.0.0 landed on PyPI. (Congrats, by the way! I've been looking forward to this one!)
Below, you'll see one thing that version 6 fixed (asking for one column in a nested struct returns only that one column) and a new error (it does not preserve nullability of the surrounding struct). Here, I'll write down the steps to reproduce and then explain.
{code:python}
Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.parquet
>>> pyarrow.__version__
'5.0.0'
>>> file = pyarrow.parquet.ParquetFile("test8.parquet")
>>> file.schema
<pyarrow._parquet.ParquetSchema object at 0x7fcf39be7c80>
required group field_id=-1 schema {
  required group field_id=-1 x (List) {
    repeated group field_id=-1 list {
      required group field_id=-1 item {
        required int64 field_id=-1 y;
        required double field_id=-1 z;
      }
    }
  }
}
>>> file.schema_arrow
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
child 0, item: struct<y: int64 not null, z: double not null> not null
child 0, y: int64 not null
child 1, z: double not null
>>> file.read_row_group(0, ["x.list.item.y"]).schema
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
child 0, item: struct<y: int64 not null, z: double not null> not null
child 0, y: int64 not null
child 1, z: double not null
>>> file.read_row_group(0, ["x.list.item.y", "x.list.item.z"]).schema
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
child 0, item: struct<y: int64 not null, z: double not null> not null
child 0, y: int64 not null
child 1, z: double not null
>>> file.read_row_group(0).schema
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
child 0, item: struct<y: int64 not null, z: double not null> not null
child 0, y: int64 not null
child 1, z: double not null
Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.parquet
>>> pyarrow.__version__
'6.0.0'
>>> file = pyarrow.parquet.ParquetFile("test8.parquet")
>>> file.schema
<pyarrow._parquet.ParquetSchema object at 0x7f61e71321c0>
required group field_id=-1 schema {
  required group field_id=-1 x (List) {
    repeated group field_id=-1 list {
      required group field_id=-1 item {
        required int64 field_id=-1 y;
        required double field_id=-1 z;
      }
    }
  }
}
>>> file.schema_arrow
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
child 0, item: struct<y: int64 not null, z: double not null> not null
child 0, y: int64 not null
child 1, z: double not null
>>> file.read_row_group(0, ["x.list.item.y"]).schema
x: large_list<item: struct<y: int64 not null>> not null
child 0, item: struct<y: int64 not null>
child 0, y: int64 not null
>>> file.read_row_group(0, ["x.list.item.y", "x.list.item.z"]).schema
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
child 0, item: struct<y: int64 not null, z: double not null> not null
child 0, y: int64 not null
child 1, z: double not null
>>> file.read_row_group(0).schema
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
child 0, item: struct<y: int64 not null, z: double not null> not null
child 0, y: int64 not null
child 1, z: double not null
{code}
In Arrow 5, asking for only column {{"x.list.item.y"}} returns a column of type {{x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null}}. That was undesirable because it unnecessarily read the {{"z"}} column, but it got all of the {{"not null"}} annotations right: in test8.parquet, the data are non-nullable at every level.
In Arrow 6, asking for only column {{"x.list.item.y"}} returns a column of type {{x: large_list<item: struct<y: int64 not null>> not null}}. That is great because it no longer reads the {{"z"}} column, but the struct's nullability is wrong: there should be three {{"not null"}} annotations here, one for the data in {{y}}, one for the {{struct}}, and one for the {{list}}. Only the middle one is missing.
When I ask for two columns specifically or don't specify the columns, the nullability is correct. I think that can help to narrow it down.
I've attached the file (test8.parquet). The same file, generated by Arrow 5, was used in both of the tests above.
I labeled this as "Python" because I've only seen the symptom in Python, but I suspect that the actual error is in C++.