Posted to issues@arrow.apache.org by "Neville Dipale (Jira)" <ji...@apache.org> on 2021/01/16 19:54:00 UTC

[jira] [Created] (ARROW-11271) [Rust] [Parquet] List schema to Arrow parser misinterpreting child nullability

Neville Dipale created ARROW-11271:
--------------------------------------

             Summary: [Rust] [Parquet] List schema to Arrow parser misinterpreting child nullability
                 Key: ARROW-11271
                 URL: https://issues.apache.org/jira/browse/ARROW-11271
             Project: Apache Arrow
          Issue Type: Bug
          Components: Rust
    Affects Versions: 2.0.0
            Reporter: Neville Dipale
            Assignee: Neville Dipale


We currently do not propagate child nullability correctly when reading Parquet files written by Spark 3.0.1 (parquet-mr 1.10.1).

For example, the following schema, taken from [https://github.com/apache/parquet-format/blob/master/LogicalTypes.md], is currently interpreted incorrectly:

 
{code:java}
// List<String> (list nullable, elements non-null) 
optional group my_list (LIST) {
    repeated group list { 
        required binary element (UTF8); 
    } 
}{code}
The Arrow type should be:
{code:java}
Field::new(
    "my_list",
    DataType::List(
        // element: non-nullable
        Box::new(Field::new("element", DataType::Utf8, false)),
    ),
    true, // list: nullable
){code}
but we currently end up with:
{code:java}
Field::new(
    "my_list",
    DataType::List(
        // wrong name ("list") and wrong nullability (true)
        Box::new(Field::new("list", DataType::Utf8, true)),
    ),
    true, // list: nullable
)
{code}
This does not appear to be an issue on the master branch as of opening this issue, so it may not be severe enough to try to force into the 3.0.0 release.

I tested null and non-null Spark files, and was able to read them correctly. This becomes an issue with nested lists, which I'm working on.
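
The intended mapping described above can be sketched with simplified stand-in types. Everything in this block (`Repetition`, `ParquetNode`, `ArrowType`, `ArrowField`, `to_arrow_field`, `example_schema`) is hypothetical and invented for illustration. None of it is the `parquet` or `arrow` crate API, and all leaves are assumed to be UTF8 strings. The point is only that the list field takes its nullability from the outer LIST group, while the item field takes its name and nullability from the innermost `element` node rather than from the repeated `list` group.

```rust
// Hypothetical, simplified stand-ins; NOT the `parquet` or `arrow` crate types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Repetition {
    Required,
    Optional,
    Repeated,
}

#[derive(Debug, Clone, PartialEq)]
enum ParquetNode {
    // A leaf column such as `required binary element (UTF8)`.
    // Assumption: every leaf is a UTF8 string.
    Leaf { name: String, repetition: Repetition },
    // A group such as `optional group my_list (LIST)`.
    // Assumption: every group follows the three-level LIST layout.
    Group { name: String, repetition: Repetition, children: Vec<ParquetNode> },
}

#[derive(Debug, Clone, PartialEq)]
enum ArrowType {
    Utf8,
    List(Box<ArrowField>),
}

#[derive(Debug, Clone, PartialEq)]
struct ArrowField {
    name: String,
    data_type: ArrowType,
    nullable: bool,
}

// Converts a node to a field, returning None if a group does not have the
// expected `<outer> { repeated group { <element> } }` shape.
fn to_arrow_field(node: &ParquetNode) -> Option<ArrowField> {
    match node {
        ParquetNode::Leaf { name, repetition } => Some(ArrowField {
            name: name.clone(),
            data_type: ArrowType::Utf8,
            // Only an `optional` leaf is nullable.
            nullable: *repetition == Repetition::Optional,
        }),
        ParquetNode::Group { name, repetition, children } => {
            // Skip over the single repeated middle group to reach the element.
            let element = match children.as_slice() {
                [ParquetNode::Group { repetition: Repetition::Repeated, children: inner, .. }] => {
                    match inner.as_slice() {
                        [element] => element,
                        _ => return None,
                    }
                }
                _ => return None,
            };
            // The item field comes from the element itself, so its name and
            // nullability are preserved; recursion covers nested lists.
            let item = to_arrow_field(element)?;
            Some(ArrowField {
                name: name.clone(),
                data_type: ArrowType::List(Box::new(item)),
                // The list's own nullability comes from the outer group.
                nullable: *repetition == Repetition::Optional,
            })
        }
    }
}

// The `my_list` schema from the description above.
fn example_schema() -> ParquetNode {
    ParquetNode::Group {
        name: "my_list".to_string(),
        repetition: Repetition::Optional,
        children: vec![ParquetNode::Group {
            name: "list".to_string(),
            repetition: Repetition::Repeated,
            children: vec![ParquetNode::Leaf {
                name: "element".to_string(),
                repetition: Repetition::Required,
            }],
        }],
    }
}
```

Calling `to_arrow_field(&example_schema())` in this sketch yields a nullable `my_list` field whose item is a non-nullable field named `element`. Because the element is converted recursively, the same rule would apply at each level of a nested list.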

 


