Posted to github@arrow.apache.org by "zeroshade (via GitHub)" <gi...@apache.org> on 2023/04/14 21:08:08 UTC

[GitHub] [arrow] zeroshade commented on issue #35119: [Go] parquet: what is the correct way write repeated fields?

zeroshade commented on issue #35119:
URL: https://github.com/apache/arrow/issues/35119#issuecomment-1509273233

   The misunderstanding comes from using the older, 2-level way of specifying a list schema.
   
   The schema you provided is:
   ```
   Schema:
   required group schema {
     optional group a (LIST) {
       repeated boolean a-datum;
     }
     optional group b (LIST) {
       repeated boolean b-datum;
     }
   }
   ```
   
   Note the 2-level format used to describe the lists. Per the Parquet Format spec, this backwards-compatible format means that the elements are non-nullable:
   
   ```
   // List<Integer> (nullable list, non-null elements)
   optional group my_list (LIST) {
     repeated int32 element;
   }
   ```
   
   This means that the max definition level (describing whether data exists) is `2`, not `3` as you assumed: the optional group contributes one level and the repeated field contributes one more. If you write the values with a definition level of `2`, you should get the data you expect.
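   
   To make the counting concrete, here is a minimal sketch of how the maximum levels fall out of a column path. It is plain Go and does not use the Arrow parquet package; `Repetition` and `MaxLevels` are hypothetical names made up for this illustration. The assumption is the usual rule: every optional or repeated node on the path adds one definition level, every repeated node adds one repetition level, and the root is not counted.
   
   ```go
   package main
   
   import "fmt"
   
   // Repetition mirrors the three Parquet repetition types.
   // (Hypothetical type for illustration, not the Arrow API.)
   type Repetition int
   
   const (
   	Required Repetition = iota
   	Optional
   	Repeated
   )
   
   // MaxLevels walks the repetition types along a column path (root excluded)
   // and returns the maximum definition and repetition levels.
   func MaxLevels(path []Repetition) (maxDef, maxRep int) {
   	for _, r := range path {
   		switch r {
   		case Optional:
   			maxDef++ // the node may be null
   		case Repeated:
   			maxDef++ // the node may have zero entries
   			maxRep++ // the node starts a new repetition depth
   		}
   	}
   	return maxDef, maxRep
   }
   
   func main() {
   	// optional group a (LIST) { repeated boolean a-datum; }
   	twoLevel := []Repetition{Optional, Repeated}
   
   	maxDef, maxRep := MaxLevels(twoLevel)
   	fmt.Println(maxDef, maxRep) // 2 1
   }
   ```
   
   With this schema, a definition level of `0` means the list is null, `1` means the list is empty, and `2` means an element is present.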
   
   The *correct* way to specify a list would be this:
   
   ```
   repeated group field_id=-1 schema {
     optional group field_id=-1 a (List) {
       repeated group field_id=-1 list {
         optional boolean field_id=-1 element;
       }
     }
     optional group field_id=-1 b (List) {
       repeated group field_id=-1 list {
         optional boolean field_id=-1 element;
       }
     }
   }
   ```
   
   Note the 3-level format here. It lets you specify that the elements can be null, and it results in a maximum definition level of `3` as you were expecting, where a `2` indicates a null element in the list.
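   
   As a rough illustration of what the levels look like for the 3-level schema, the sketch below shreds a few rows by hand. It is plain Go, not the Arrow parquet writer; `Levels` and `ShredBoolLists` are hypothetical names, and the representation (a nil slice for a null list, a nil `*bool` for a null element) is an assumption made for the example.
   
   ```go
   package main
   
   import "fmt"
   
   // Levels holds one column in shredded form: a definition and a repetition
   // level per slot, plus only the non-null values.
   type Levels struct {
   	Def    []int16
   	Rep    []int16
   	Values []bool
   }
   
   // ShredBoolLists (hypothetical helper) shreds rows of
   // "optional list of optional bool" for the 3-level schema.
   // A nil row is a null list; a nil *bool is a null element.
   func ShredBoolLists(rows [][]*bool) Levels {
   	var out Levels
   	for _, row := range rows {
   		switch {
   		case row == nil: // null list
   			out.Def = append(out.Def, 0)
   			out.Rep = append(out.Rep, 0)
   		case len(row) == 0: // list present but empty
   			out.Def = append(out.Def, 1)
   			out.Rep = append(out.Rep, 0)
   		default:
   			for i, el := range row {
   				rep := int16(1) // continues the current list
   				if i == 0 {
   					rep = 0 // first slot of a new row
   				}
   				out.Rep = append(out.Rep, rep)
   				if el == nil {
   					out.Def = append(out.Def, 2) // null element
   				} else {
   					out.Def = append(out.Def, 3) // element has a value
   					out.Values = append(out.Values, *el)
   				}
   			}
   		}
   	}
   	return out
   }
   
   func main() {
   	t, f := true, false
   	// Rows: [true, null, false], null, []
   	rows := [][]*bool{{&t, nil, &f}, nil, {}}
   
   	lv := ShredBoolLists(rows)
   	fmt.Println(lv.Def)    // [3 2 3 0 1]
   	fmt.Println(lv.Rep)    // [0 1 1 0 0]
   	fmt.Println(lv.Values) // [true false]
   }
   ```
   
   Only the slots with definition level `3` carry a value, so the values slice is shorter than the level slices whenever a list or element is null or a list is empty.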
   
   Hope this helps!

