Posted to jira@arrow.apache.org by "Jim Pivarski (Jira)" <ji...@apache.org> on 2021/10/27 01:29:00 UTC
[jira] [Created] (ARROW-14485) ParquetFile.read_row_group loses struct nullability when selecting one column from a struct

Jim Pivarski created ARROW-14485:
------------------------------------

             Summary: ParquetFile.read_row_group loses struct nullability when selecting one column from a struct
                 Key: ARROW-14485
                 URL: https://issues.apache.org/jira/browse/ARROW-14485
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 6.0.0
            Reporter: Jim Pivarski
         Attachments: test8.parquet

This appeared minutes ago because we have a test suite that saw Arrow 6.0.0 land in PyPI. (Congrats, by the way! I've been looking forward to this one!)

Below, you'll see one thing that version 6 fixed (asking for one column in a nested struct returns only that one column) and a new error (it does not preserve nullability of the surrounding struct). Here, I'll write down the steps to reproduce and then explain.
{code:python}
Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.parquet
>>> pyarrow.__version__
'5.0.0'
>>> file = pyarrow.parquet.ParquetFile("test8.parquet")
>>> file.schema
<pyarrow._parquet.ParquetSchema object at 0x7fcf39be7c80>
required group field_id=-1 schema {
  required group field_id=-1 x (List) {
    repeated group field_id=-1 list {
      required group field_id=-1 item {
        required int64 field_id=-1 y;
        required double field_id=-1 z;
      }
    }
  }
}

>>> file.schema_arrow
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
  child 0, item: struct<y: int64 not null, z: double not null> not null
      child 0, y: int64 not null
      child 1, z: double not null
>>> file.read_row_group(0, ["x.list.item.y"]).schema
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
  child 0, item: struct<y: int64 not null, z: double not null> not null
      child 0, y: int64 not null
      child 1, z: double not null
>>> file.read_row_group(0, ["x.list.item.y", "x.list.item.z"]).schema
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
  child 0, item: struct<y: int64 not null, z: double not null> not null
      child 0, y: int64 not null
      child 1, z: double not null
>>> file.read_row_group(0).schema
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
  child 0, item: struct<y: int64 not null, z: double not null> not null
      child 0, y: int64 not null
      child 1, z: double not null

Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.parquet
>>> pyarrow.__version__
'6.0.0'
>>> file = pyarrow.parquet.ParquetFile("test8.parquet")
>>> file.schema
<pyarrow._parquet.ParquetSchema object at 0x7f61e71321c0>
required group field_id=-1 schema {
  required group field_id=-1 x (List) {
    repeated group field_id=-1 list {
      required group field_id=-1 item {
        required int64 field_id=-1 y;
        required double field_id=-1 z;
      }
    }
  }
}

>>> file.schema_arrow
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
  child 0, item: struct<y: int64 not null, z: double not null> not null
      child 0, y: int64 not null
      child 1, z: double not null
>>> file.read_row_group(0, ["x.list.item.y"]).schema
x: large_list<item: struct<y: int64 not null>> not null
  child 0, item: struct<y: int64 not null>
      child 0, y: int64 not null
>>> file.read_row_group(0, ["x.list.item.y", "x.list.item.z"]).schema
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
  child 0, item: struct<y: int64 not null, z: double not null> not null
      child 0, y: int64 not null
      child 1, z: double not null
>>> file.read_row_group(0).schema
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
  child 0, item: struct<y: int64 not null, z: double not null> not null
      child 0, y: int64 not null
      child 1, z: double not null
{code}
In Arrow 5, asking for only column {{"x.list.item.y"}} returns a table whose schema is {{x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null}}. That was undesirable because it unnecessarily reads the {{"z"}} column, but it got all of the {{"not null"}} markers right. In test8.parquet, the data are non-nullable at every level.

In Arrow 6, asking for only column {{"x.list.item.y"}} returns a table whose schema is {{x: large_list<item: struct<y: int64 not null>> not null}}. That is great because it no longer reads the {{"z"}} column, but the struct's nullability is wrong: we should see three {{"not null"}} markers here, one for the data in {{y}}, one for the {{struct}}, and one for the {{list}}. Only the middle one is missing.

When I ask for two columns specifically or don't specify the columns, the nullability is correct. I think that can help to narrow it down.

I've attached the file (test8.parquet). The same file, generated by Arrow 5, was used in both of the tests above.

I labeled this as "Python" because I've only seen the symptom in Python, but I suspect that the actual error is in C++.


