You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/12 18:23:26 UTC

[GitHub] [arrow-rs] helgikrs opened a new pull request #1166: Bugfix in parquet writing empty lists of structs

helgikrs opened a new pull request #1166:
URL: https://github.com/apache/arrow-rs/pull/1166


   # Which issue does this PR close?
   
   Closes #703 
   
   # What changes are included in this PR?
   
   Fix a bug in the definition level calculation for fields nested within a struct and a list. When a list is empty or null in parquet the nested field gets a null value. However, in arrow, the value is simply missing.  When serializing an immediate child of the list, the list offsets are used to calculate the correct definition level for its children, but it is not carried further to fields nested deeper (e.g., fields on a struct within a list).  This (somewhat hacky) fix treats a struct within a list as if it were a list.
   
   # Are there any user-facing changes?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb merged pull request #1166: Bugfix in parquet writing empty lists of structs

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #1166:
URL: https://github.com/apache/arrow-rs/pull/1166


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #1166: Bugfix in parquet writing empty lists of structs

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1166:
URL: https://github.com/apache/arrow-rs/pull/1166#issuecomment-1013345423


   Thanks @helgikrs !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter commented on pull request #1166: Bugfix in parquet writing empty lists of structs

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1166:
URL: https://github.com/apache/arrow-rs/pull/1166#issuecomment-1011338811


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1166?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1166](https://codecov.io/gh/apache/arrow-rs/pull/1166?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (608d211) into [master](https://codecov.io/gh/apache/arrow-rs/commit/884c6a633f5f7e9cb2609ba1e8ac05652089d768?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (884c6a6) will **increase** coverage by `0.04%`.
   > The diff coverage is `100.00%`.
   
   > :exclamation: Current head 608d211 differs from pull request most recent head 33b6543. Consider uploading reports for the commit 33b6543 to get more accurate results
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/1166/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/1166?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1166      +/-   ##
   ==========================================
   + Coverage   82.55%   82.59%   +0.04%     
   ==========================================
     Files         173      173              
     Lines       50673    50753      +80     
   ==========================================
   + Hits        41833    41920      +87     
   + Misses       8840     8833       -7     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1166?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/arrow/levels.rs](https://codecov.io/gh/apache/arrow-rs/pull/1166/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvbGV2ZWxzLnJz) | `84.28% <100.00%> (+1.40%)` | :arrow_up: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1166/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `85.43% <0.00%> (-0.27%)` | :arrow_down: |
   | [arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow-rs/pull/1166/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2J1aWxkZXIucnM=) | `86.60% <0.00%> (+0.11%)` | :arrow_up: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1166/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `66.21% <0.00%> (+0.22%)` | :arrow_up: |
   | [arrow/src/compute/kernels/arithmetic.rs](https://codecov.io/gh/apache/arrow-rs/pull/1166/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9hcml0aG1ldGljLnJz) | `85.57% <0.00%> (+0.92%)` | :arrow_up: |
   | [parquet/src/arrow/arrow\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1166/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfcmVhZGVyLnJz) | `90.93% <0.00%> (+1.00%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1166?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1166?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [884c6a6...33b6543](https://codecov.io/gh/apache/arrow-rs/pull/1166?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #1166: Bugfix in parquet writing empty lists of structs

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #1166:
URL: https://github.com/apache/arrow-rs/pull/1166#issuecomment-1011338811


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1166?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1166](https://codecov.io/gh/apache/arrow-rs/pull/1166?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (33b6543) into [master](https://codecov.io/gh/apache/arrow-rs/commit/884c6a633f5f7e9cb2609ba1e8ac05652089d768?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (884c6a6) will **increase** coverage by `0.04%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/1166/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/1166?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1166      +/-   ##
   ==========================================
   + Coverage   82.55%   82.59%   +0.04%     
   ==========================================
     Files         173      173              
     Lines       50673    50753      +80     
   ==========================================
   + Hits        41833    41921      +88     
   + Misses       8840     8832       -8     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1166?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/arrow/levels.rs](https://codecov.io/gh/apache/arrow-rs/pull/1166/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvbGV2ZWxzLnJz) | `84.28% <100.00%> (+1.40%)` | :arrow_up: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1166/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `85.56% <0.00%> (-0.14%)` | :arrow_down: |
   | [arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow-rs/pull/1166/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2J1aWxkZXIucnM=) | `86.60% <0.00%> (+0.11%)` | :arrow_up: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1166/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `66.21% <0.00%> (+0.22%)` | :arrow_up: |
   | [arrow/src/compute/kernels/arithmetic.rs](https://codecov.io/gh/apache/arrow-rs/pull/1166/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9hcml0aG1ldGljLnJz) | `85.57% <0.00%> (+0.92%)` | :arrow_up: |
   | [parquet/src/arrow/arrow\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1166/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfcmVhZGVyLnJz) | `90.93% <0.00%> (+1.00%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1166?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1166?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [884c6a6...33b6543](https://codecov.io/gh/apache/arrow-rs/pull/1166?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1166: Bugfix in parquet writing empty lists of structs

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1166:
URL: https://github.com/apache/arrow-rs/pull/1166#discussion_r785017970



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -1675,4 +1682,95 @@ mod tests {
         };
         assert_eq!(list_level, &expected_level);
     }
+
+    #[test]
+    fn test_list_of_struct() {
+        // define schema
+        let int_field = Field::new("a", DataType::Int32, true);
+        let item_field =
+            Field::new("item", DataType::Struct(vec![int_field.clone()]), true);
+        let list_field = Field::new("list", DataType::List(Box::new(item_field)), true);
+
+        let int_builder = Int32Builder::new(10);
+        let struct_builder =
+            StructBuilder::new(vec![int_field], vec![Box::new(int_builder)]);
+        let mut list_builder = ListBuilder::new(struct_builder);
+
+        // [{a: 1}], [], null, [null, null], [{a: null}], [{a: 2}]
+        //
+        // [{a: 1}]
+        let values = list_builder.values();
+        values
+            .field_builder::<Int32Builder>(0)
+            .unwrap()
+            .append_value(1)
+            .unwrap();
+        values.append(true).unwrap();
+        list_builder.append(true).unwrap();
+
+        // []
+        list_builder.append(true).unwrap();
+
+        // null
+        list_builder.append(false).unwrap();
+
+        // [null, null]
+        let values = list_builder.values();
+        values
+            .field_builder::<Int32Builder>(0)
+            .unwrap()
+            .append_null()
+            .unwrap();
+        values.append(false).unwrap();
+        values
+            .field_builder::<Int32Builder>(0)
+            .unwrap()
+            .append_null()
+            .unwrap();
+        values.append(false).unwrap();
+        list_builder.append(true).unwrap();
+
+        // [{a: null}]
+        let values = list_builder.values();
+        values
+            .field_builder::<Int32Builder>(0)
+            .unwrap()
+            .append_null()
+            .unwrap();
+        values.append(true).unwrap();
+        list_builder.append(true).unwrap();
+
+        // [{a: 2}]
+        let values = list_builder.values();
+        values
+            .field_builder::<Int32Builder>(0)
+            .unwrap()
+            .append_value(2)
+            .unwrap();
+        values.append(true).unwrap();
+        list_builder.append(true).unwrap();
+
+        let array = Arc::new(list_builder.finish());
+
+        let schema = Arc::new(Schema::new(vec![list_field]));
+
+        let rb = RecordBatch::try_new(schema, vec![array]).unwrap();
+
+        let batch_level = LevelInfo::new(0, rb.num_rows());
+        let list_level =
+            &batch_level.calculate_array_levels(rb.column(0), rb.schema().field(0))[0];
+
+        let expected_level = LevelInfo {

Review comment:
       Can confirm, these numbers check out :+1: 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #1166: Bugfix in parquet writing empty lists of structs

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1166:
URL: https://github.com/apache/arrow-rs/pull/1166#discussion_r784236336



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -1675,4 +1682,95 @@ mod tests {
         };
         assert_eq!(list_level, &expected_level);
     }
+
+    #[test]
+    fn test_list_of_struct() {
+        // define schema
+        let int_field = Field::new("a", DataType::Int32, true);
+        let item_field =
+            Field::new("item", DataType::Struct(vec![int_field.clone()]), true);
+        let list_field = Field::new("list", DataType::List(Box::new(item_field)), true);
+
+        let int_builder = Int32Builder::new(10);
+        let struct_builder =
+            StructBuilder::new(vec![int_field], vec![Box::new(int_builder)]);
+        let mut list_builder = ListBuilder::new(struct_builder);
+
+        // [{a: 1}], [], null, [null, null], [{a: null}], [{a: 2}]
+        //
+        // [{a: 1}]
+        let values = list_builder.values();
+        values
+            .field_builder::<Int32Builder>(0)
+            .unwrap()
+            .append_value(1)
+            .unwrap();
+        values.append(true).unwrap();
+        list_builder.append(true).unwrap();
+
+        // []
+        list_builder.append(true).unwrap();
+
+        // null
+        list_builder.append(false).unwrap();
+
+        // [null, null]
+        let values = list_builder.values();
+        values
+            .field_builder::<Int32Builder>(0)
+            .unwrap()
+            .append_null()
+            .unwrap();
+        values.append(false).unwrap();
+        values
+            .field_builder::<Int32Builder>(0)
+            .unwrap()
+            .append_null()
+            .unwrap();
+        values.append(false).unwrap();
+        list_builder.append(true).unwrap();
+
+        // [{a: null}]
+        let values = list_builder.values();
+        values
+            .field_builder::<Int32Builder>(0)
+            .unwrap()
+            .append_null()
+            .unwrap();
+        values.append(true).unwrap();
+        list_builder.append(true).unwrap();
+
+        // [{a: 2}]
+        let values = list_builder.values();
+        values
+            .field_builder::<Int32Builder>(0)
+            .unwrap()
+            .append_value(2)
+            .unwrap();
+        values.append(true).unwrap();
+        list_builder.append(true).unwrap();

Review comment:
       I double checked that the code matches the comments about what structure is intended to be created

##########
File path: parquet/src/arrow/levels.rs
##########
@@ -1675,4 +1682,95 @@ mod tests {
         };
         assert_eq!(list_level, &expected_level);
     }
+
+    #[test]
+    fn test_list_of_struct() {
+        // define schema
+        let int_field = Field::new("a", DataType::Int32, true);
+        let item_field =
+            Field::new("item", DataType::Struct(vec![int_field.clone()]), true);
+        let list_field = Field::new("list", DataType::List(Box::new(item_field)), true);
+
+        let int_builder = Int32Builder::new(10);
+        let struct_builder =
+            StructBuilder::new(vec![int_field], vec![Box::new(int_builder)]);
+        let mut list_builder = ListBuilder::new(struct_builder);
+
+        // [{a: 1}], [], null, [null, null], [{a: null}], [{a: 2}]
+        //
+        // [{a: 1}]
+        let values = list_builder.values();
+        values
+            .field_builder::<Int32Builder>(0)
+            .unwrap()
+            .append_value(1)
+            .unwrap();
+        values.append(true).unwrap();
+        list_builder.append(true).unwrap();
+
+        // []
+        list_builder.append(true).unwrap();
+
+        // null
+        list_builder.append(false).unwrap();
+
+        // [null, null]
+        let values = list_builder.values();
+        values
+            .field_builder::<Int32Builder>(0)
+            .unwrap()
+            .append_null()
+            .unwrap();
+        values.append(false).unwrap();
+        values
+            .field_builder::<Int32Builder>(0)
+            .unwrap()
+            .append_null()
+            .unwrap();
+        values.append(false).unwrap();
+        list_builder.append(true).unwrap();
+
+        // [{a: null}]
+        let values = list_builder.values();
+        values
+            .field_builder::<Int32Builder>(0)
+            .unwrap()
+            .append_null()
+            .unwrap();
+        values.append(true).unwrap();
+        list_builder.append(true).unwrap();
+
+        // [{a: 2}]
+        let values = list_builder.values();
+        values
+            .field_builder::<Int32Builder>(0)
+            .unwrap()
+            .append_value(2)
+            .unwrap();
+        values.append(true).unwrap();
+        list_builder.append(true).unwrap();
+
+        let array = Arc::new(list_builder.finish());
+
+        let schema = Arc::new(Schema::new(vec![list_field]));
+
+        let rb = RecordBatch::try_new(schema, vec![array]).unwrap();
+
+        let batch_level = LevelInfo::new(0, rb.num_rows());
+        let list_level =
+            &batch_level.calculate_array_levels(rb.column(0), rb.schema().field(0))[0];
+
+        let expected_level = LevelInfo {

Review comment:
       it would be great if someone else who knew this better than I could double check this




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #1166: Bugfix in parquet writing empty lists of structs

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1166:
URL: https://github.com/apache/arrow-rs/pull/1166#issuecomment-1012413080


   cc @mosyp and @chadbrewbaker


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] helgikrs commented on a change in pull request #1166: Bugfix in parquet writing empty lists of structs

Posted by GitBox <gi...@apache.org>.
helgikrs commented on a change in pull request #1166:
URL: https://github.com/apache/arrow-rs/pull/1166#discussion_r784256363



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -1675,4 +1682,95 @@ mod tests {
         };
         assert_eq!(list_level, &expected_level);
     }
+
+    #[test]
+    fn test_list_of_struct() {
+        // define schema
+        let int_field = Field::new("a", DataType::Int32, true);
+        let item_field =
+            Field::new("item", DataType::Struct(vec![int_field.clone()]), true);
+        let list_field = Field::new("list", DataType::List(Box::new(item_field)), true);
+
+        let int_builder = Int32Builder::new(10);
+        let struct_builder =
+            StructBuilder::new(vec![int_field], vec![Box::new(int_builder)]);
+        let mut list_builder = ListBuilder::new(struct_builder);
+
+        // [{a: 1}], [], null, [null, null], [{a: null}], [{a: 2}]
+        //
+        // [{a: 1}]
+        let values = list_builder.values();
+        values
+            .field_builder::<Int32Builder>(0)
+            .unwrap()
+            .append_value(1)
+            .unwrap();
+        values.append(true).unwrap();
+        list_builder.append(true).unwrap();
+
+        // []
+        list_builder.append(true).unwrap();
+
+        // null
+        list_builder.append(false).unwrap();
+
+        // [null, null]
+        let values = list_builder.values();
+        values
+            .field_builder::<Int32Builder>(0)
+            .unwrap()
+            .append_null()
+            .unwrap();
+        values.append(false).unwrap();
+        values
+            .field_builder::<Int32Builder>(0)
+            .unwrap()
+            .append_null()
+            .unwrap();
+        values.append(false).unwrap();
+        list_builder.append(true).unwrap();
+
+        // [{a: null}]
+        let values = list_builder.values();
+        values
+            .field_builder::<Int32Builder>(0)
+            .unwrap()
+            .append_null()
+            .unwrap();
+        values.append(true).unwrap();
+        list_builder.append(true).unwrap();
+
+        // [{a: 2}]
+        let values = list_builder.values();
+        values
+            .field_builder::<Int32Builder>(0)
+            .unwrap()
+            .append_value(2)
+            .unwrap();
+        values.append(true).unwrap();
+        list_builder.append(true).unwrap();
+
+        let array = Arc::new(list_builder.finish());
+
+        let schema = Arc::new(Schema::new(vec![list_field]));
+
+        let rb = RecordBatch::try_new(schema, vec![array]).unwrap();
+
+        let batch_level = LevelInfo::new(0, rb.num_rows());
+        let list_level =
+            &batch_level.calculate_array_levels(rb.column(0), rb.schema().field(0))[0];
+
+        let expected_level = LevelInfo {

Review comment:
       I'm not super confident in this either--it would be great if someone with knowledge about the details of this code could chime in.
   
   The definition and repetition levels I compared with what the c++ parquet writer produces. I exported the above record batch and used the C++ parquet writer to generate a parquet file. I then used `parquet-dump` on the resulting file, which produced the following
   ```
   value 1: R:0 D:4 V:1
   value 2: R:0 D:1 V:<null>
   value 3: R:0 D:0 V:<null>
   value 4: R:0 D:2 V:<null>
   value 5: R:1 D:2 V:<null>
   value 6: R:0 D:3 V:<null>
   value 7: R:0 D:4 V:2
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org