You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/18 23:02:19 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request, #1588: WIP: Add support for nested list arrays (#993)

tustvold opened a new pull request, #1588:
URL: https://github.com/apache/arrow-rs/pull/1588

   **WIP and likely containing bugs, creating as draft for visibility that I'm working on this**
   
   # Which issue does this PR close?
   
   Closes #993.
   
   # Rationale for this change
    
   Adds support for reading nested lists from parquet.
   
   # What changes are included in this PR?
   
   Reworks the ListArrayReader to handle nested repetition levels.
   
   # Are there any user-facing changes?
   
   This no longer ensures that nulls are zero-length, which currently causes tests to fail due to https://github.com/apache/arrow-rs/issues/626. Code that relies on this property, whilst incorrect, may break with this change
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] codecov-commenter commented on pull request #1588: WIP: Add support for nested list arrays (#993)

Posted by GitBox <gi...@apache.org>.

codecov-commenter commented on PR #1588:
URL: https://github.com/apache/arrow-rs/pull/1588#issuecomment-1117547806

   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1588?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1588](https://codecov.io/gh/apache/arrow-rs/pull/1588?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (70c16c3) into [master](https://codecov.io/gh/apache/arrow-rs/commit/8f24c45fbf944486b2f9fd921eeede0725b4316b?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (8f24c45) will **increase** coverage by `0.03%`.
   > The diff coverage is `80.07%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1588      +/-   ##
   ==========================================
   + Coverage   83.02%   83.06%   +0.03%     
   ==========================================
     Files         193      193              
     Lines       55612    55709      +97     
   ==========================================
   + Hits        46174    46276     +102     
   + Misses       9438     9433       -5     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1588?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1588/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyLnJz) | `89.76% <ø> (-0.07%)` | :arrow_down: |
   | [parquet/src/schema/visitor.rs](https://codecov.io/gh/apache/arrow-rs/pull/1588/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvc2NoZW1hL3Zpc2l0b3IucnM=) | `68.00% <0.00%> (+0.89%)` | :arrow_up: |
   | [parquet/src/arrow/array\_reader/test\_util.rs](https://codecov.io/gh/apache/arrow-rs/pull/1588/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL3Rlc3RfdXRpbC5ycw==) | `78.57% <50.00%> (-4.77%)` | :arrow_down: |
   | [parquet/src/arrow/array\_reader/builder.rs](https://codecov.io/gh/apache/arrow-rs/pull/1588/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL2J1aWxkZXIucnM=) | `68.97% <59.77%> (+0.96%)` | :arrow_up: |
   | [parquet/src/arrow/array\_reader/list\_array.rs](https://codecov.io/gh/apache/arrow-rs/pull/1588/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL2xpc3RfYXJyYXkucnM=) | `92.85% <92.52%> (+4.88%)` | :arrow_up: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1588/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.98% <0.00%> (-0.23%)` | :arrow_down: |
   | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/1588/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `93.37% <0.00%> (-0.19%)` | :arrow_down: |
   | [arrow/src/compute/kernels/take.rs](https://codecov.io/gh/apache/arrow-rs/pull/1588/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy90YWtlLnJz) | `95.27% <0.00%> (-0.14%)` | :arrow_down: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1588/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `86.80% <0.00%> (+0.11%)` | :arrow_up: |
   | [arrow/src/array/data.rs](https://codecov.io/gh/apache/arrow-rs/pull/1588/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2RhdGEucnM=) | `83.30% <0.00%> (+0.25%)` | :arrow_up: |
   | ... and [3 more](https://codecov.io/gh/apache/arrow-rs/pull/1588/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1588?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1588?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [8f24c45...70c16c3](https://codecov.io/gh/apache/arrow-rs/pull/1588?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1588: WIP: Add support for nested list arrays (#993)

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #1588:
URL: https://github.com/apache/arrow-rs/pull/1588#discussion_r864988394


##########
parquet/src/arrow/array_reader/list_array.rs:
##########
@@ -193,122 +239,267 @@ mod tests {
     use crate::file::reader::{FileReader, SerializedFileReader};
     use crate::schema::parser::parse_message_type;
     use crate::schema::types::SchemaDescriptor;
-    use arrow::array::{Array, LargeListArray, ListArray, PrimitiveArray};
-    use arrow::datatypes::{Field, Int32Type as ArrowInt32};
+    use arrow::array::{Array, ArrayDataBuilder, PrimitiveArray};
+    use arrow::datatypes::{Field, Int32Type as ArrowInt32, Int32Type};
     use std::sync::Arc;
 
-    #[test]
-    fn test_list_array_reader() {
-        // [[1, null, 2], null, [3, 4]]
+    fn list_type<OffsetSize: OffsetSizeTrait>(
+        data_type: ArrowType,
+        item_nullable: bool,
+    ) -> ArrowType {
+        let field = Box::new(Field::new("item", data_type, item_nullable));
+        match OffsetSize::is_large() {
+            true => ArrowType::LargeList(field),
+            false => ArrowType::List(field),
+        }
+    }
+
+    fn downcast<OffsetSize: OffsetSizeTrait>(
+        array: &ArrayRef,
+    ) -> &'_ GenericListArray<OffsetSize> {
+        array
+            .as_any()
+            .downcast_ref::<GenericListArray<OffsetSize>>()
+            .unwrap()
+    }
+
+    fn to_offsets<OffsetSize: OffsetSizeTrait>(values: Vec<usize>) -> Buffer {
+        Buffer::from_iter(
+            values
+                .into_iter()
+                .map(|x| OffsetSize::from_usize(x).unwrap()),
+        )
+    }
+
+    fn test_nested_list<OffsetSize: OffsetSizeTrait>() {

Review Comment:
   I think this wins the prize of the most :exploding_head: test I've ever written



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1588: WIP: Add support for nested list arrays (#993)

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #1588:
URL: https://github.com/apache/arrow-rs/pull/1588#discussion_r852471389


##########
parquet/src/arrow/array_reader/list_array.rs:
##########
@@ -212,7 +196,7 @@ mod tests {
         let item_array_reader = InMemoryArrayReader::new(
             ArrowType::Int32,
             array,
-            Some(vec![3, 2, 3, 0, 3, 3]),
+            Some(vec![2, 1, 2, 0, 2, 2]),

Review Comment:
   This is just a drive-by fix, it isn't required to get the test to pass, but was technically wrong before



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1588: Add support for nested list arrays from parquet to arrow arrays (#993)

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #1588:
URL: https://github.com/apache/arrow-rs/pull/1588#discussion_r867520183


##########
parquet/src/arrow/array_reader.rs:
##########
@@ -1532,15 +1532,15 @@ mod tests {
             ArrowType::Int32,
             array_1.clone(),
             Some(vec![0, 1, 2, 3, 1]),
-            Some(vec![1, 1, 1, 1, 1]),
+            Some(vec![0, 1, 1, 1, 1]),

Review Comment:
   Drive by fix, this set of levels is impossible, as the first rep level must be 0. Nothing cares as this test doesn't actually decode the implied list, but :shrug: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] andrei-ionescu commented on pull request #1588: Add support for nested list arrays (#993)

Posted by GitBox <gi...@apache.org>.

andrei-ionescu commented on PR #1588:
URL: https://github.com/apache/arrow-rs/pull/1588#issuecomment-1117610364

   Thank you @tustvold 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1588: Add support for nested list arrays from parquet to arrow arrays (#993)

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #1588:
URL: https://github.com/apache/arrow-rs/pull/1588#discussion_r867510844


##########
parquet/src/arrow/array_reader/list_array.rs:
##########
@@ -193,122 +245,267 @@ mod tests {
     use crate::file::reader::{FileReader, SerializedFileReader};
     use crate::schema::parser::parse_message_type;
     use crate::schema::types::SchemaDescriptor;
-    use arrow::array::{Array, LargeListArray, ListArray, PrimitiveArray};
-    use arrow::datatypes::{Field, Int32Type as ArrowInt32};
+    use arrow::array::{Array, ArrayDataBuilder, PrimitiveArray};
+    use arrow::datatypes::{Field, Int32Type as ArrowInt32, Int32Type};
     use std::sync::Arc;
 
-    #[test]
-    fn test_list_array_reader() {
-        // [[1, null, 2], null, [3, 4]]
+    fn list_type<OffsetSize: OffsetSizeTrait>(
+        data_type: ArrowType,
+        item_nullable: bool,
+    ) -> ArrowType {
+        let field = Box::new(Field::new("item", data_type, item_nullable));
+        match OffsetSize::is_large() {
+            true => ArrowType::LargeList(field),
+            false => ArrowType::List(field),
+        }
+    }
+
+    fn downcast<OffsetSize: OffsetSizeTrait>(
+        array: &ArrayRef,
+    ) -> &'_ GenericListArray<OffsetSize> {
+        array
+            .as_any()
+            .downcast_ref::<GenericListArray<OffsetSize>>()
+            .unwrap()
+    }
+
+    fn to_offsets<OffsetSize: OffsetSizeTrait>(values: Vec<usize>) -> Buffer {
+        Buffer::from_iter(
+            values
+                .into_iter()
+                .map(|x| OffsetSize::from_usize(x).unwrap()),
+        )
+    }
+
+    fn test_nested_list<OffsetSize: OffsetSizeTrait>() {
+        // 3 lists, with first and third nullable
+        // [
+        //     [
+        //         [[1, null], null, [4], []],
+        //         [],
+        //         [[7]],
+        //         [[]]
+        //     ],
+        //     null,
+        //     [],
+        //     [[[11]]]
+        // ]
+
+        let l3_item_type = ArrowType::Int32;
+        let l3_type = list_type::<OffsetSize>(l3_item_type.clone(), true);
+
+        let l2_item_type = l3_type.clone();
+        let l2_type = list_type::<OffsetSize>(l2_item_type.clone(), true);
+
+        let l1_item_type = l2_type.clone();
+        let l1_type = list_type::<OffsetSize>(l1_item_type.clone(), false);
+
+        let leaf = PrimitiveArray::<Int32Type>::from_iter(vec![
+            Some(1),
+            None,
+            Some(4),
+            Some(7),
+            Some(11),
+        ]);
+
+        // [[1, null], null, [4], [], [7], [], [11]]
+        let offsets = to_offsets::<OffsetSize>(vec![0, 2, 2, 3, 3, 4, 4, 5]);
+        let l3 = ArrayDataBuilder::new(l3_type.clone())
+            .len(7)
+            .add_buffer(offsets)
+            .add_child_data(leaf.data().clone())
+            .null_bit_buffer(Buffer::from([0b01111101]))
+            .build()
+            .unwrap();
+
+        // [[[1, null], null, [4], []],[], [[7]], [[]], [[11]]]
+        let offsets = to_offsets::<OffsetSize>(vec![0, 4, 4, 5, 6, 7]);
+        let l2 = ArrayDataBuilder::new(l2_type.clone())
+            .len(5)
+            .add_buffer(offsets)
+            .add_child_data(l3)
+            .build()
+            .unwrap();
+
+        let offsets = to_offsets::<OffsetSize>(vec![0, 4, 4, 4, 5]);
+        let l1 = ArrayDataBuilder::new(l1_type.clone())
+            .len(4)
+            .add_buffer(offsets)
+            .add_child_data(l2)
+            .null_bit_buffer(Buffer::from([0b00001101]))
+            .build()
+            .unwrap();
+
+        let expected = GenericListArray::<OffsetSize>::from(l1);
+
+        let values = Arc::new(PrimitiveArray::<Int32Type>::from(vec![
+            Some(1),
+            None,
+            None,
+            Some(4),
+            None,
+            None,
+            Some(7),
+            None,
+            None,
+            None,
+            Some(11),
+        ]));
+
+        let item_array_reader = InMemoryArrayReader::new(
+            ArrowType::Int32,
+            values,
+            Some(vec![6, 5, 3, 6, 4, 2, 6, 4, 0, 1, 6]),
+            Some(vec![0, 3, 2, 2, 2, 1, 1, 1, 0, 0, 0]),
+        );
+
+        let l3 = ListArrayReader::<OffsetSize>::new(
+            Box::new(item_array_reader),
+            l3_type,
+            l3_item_type,
+            5,
+            3,
+            true,
+        );
+
+        let l2 = ListArrayReader::<OffsetSize>::new(
+            Box::new(l3),
+            l2_type,
+            l2_item_type,
+            3,
+            2,
+            false,
+        );
+
+        let mut l1 = ListArrayReader::<OffsetSize>::new(
+            Box::new(l2),
+            l1_type,
+            l1_item_type,
+            2,
+            1,
+            true,
+        );
+
+        let actual = l1.next_batch(1024).unwrap();

Review Comment:
   Done in https://github.com/apache/arrow-rs/pull/1588/commits/05c231190aef2c9054b9ebb7fbb59f4a1daf129f, required adding functionality to InMemoryArrayReader to actually respect it



##########
parquet/src/arrow/array_reader/list_array.rs:
##########
@@ -193,122 +245,267 @@ mod tests {
     use crate::file::reader::{FileReader, SerializedFileReader};
     use crate::schema::parser::parse_message_type;
     use crate::schema::types::SchemaDescriptor;
-    use arrow::array::{Array, LargeListArray, ListArray, PrimitiveArray};
-    use arrow::datatypes::{Field, Int32Type as ArrowInt32};
+    use arrow::array::{Array, ArrayDataBuilder, PrimitiveArray};
+    use arrow::datatypes::{Field, Int32Type as ArrowInt32, Int32Type};
     use std::sync::Arc;
 
-    #[test]
-    fn test_list_array_reader() {
-        // [[1, null, 2], null, [3, 4]]
+    fn list_type<OffsetSize: OffsetSizeTrait>(
+        data_type: ArrowType,
+        item_nullable: bool,
+    ) -> ArrowType {
+        let field = Box::new(Field::new("item", data_type, item_nullable));
+        match OffsetSize::is_large() {
+            true => ArrowType::LargeList(field),
+            false => ArrowType::List(field),
+        }
+    }
+
+    fn downcast<OffsetSize: OffsetSizeTrait>(
+        array: &ArrayRef,
+    ) -> &'_ GenericListArray<OffsetSize> {
+        array
+            .as_any()
+            .downcast_ref::<GenericListArray<OffsetSize>>()
+            .unwrap()
+    }
+
+    fn to_offsets<OffsetSize: OffsetSizeTrait>(values: Vec<usize>) -> Buffer {
+        Buffer::from_iter(
+            values
+                .into_iter()
+                .map(|x| OffsetSize::from_usize(x).unwrap()),
+        )
+    }
+
+    fn test_nested_list<OffsetSize: OffsetSizeTrait>() {
+        // 3 lists, with first and third nullable
+        // [
+        //     [
+        //         [[1, null], null, [4], []],
+        //         [],
+        //         [[7]],
+        //         [[]]
+        //     ],
+        //     null,
+        //     [],
+        //     [[[11]]]
+        // ]
+
+        let l3_item_type = ArrowType::Int32;
+        let l3_type = list_type::<OffsetSize>(l3_item_type.clone(), true);
+
+        let l2_item_type = l3_type.clone();
+        let l2_type = list_type::<OffsetSize>(l2_item_type.clone(), true);
+
+        let l1_item_type = l2_type.clone();
+        let l1_type = list_type::<OffsetSize>(l1_item_type.clone(), false);
+
+        let leaf = PrimitiveArray::<Int32Type>::from_iter(vec![
+            Some(1),
+            None,
+            Some(4),
+            Some(7),
+            Some(11),
+        ]);
+
+        // [[1, null], null, [4], [], [7], [], [11]]
+        let offsets = to_offsets::<OffsetSize>(vec![0, 2, 2, 3, 3, 4, 4, 5]);
+        let l3 = ArrayDataBuilder::new(l3_type.clone())
+            .len(7)
+            .add_buffer(offsets)
+            .add_child_data(leaf.data().clone())
+            .null_bit_buffer(Buffer::from([0b01111101]))
+            .build()
+            .unwrap();
+
+        // [[[1, null], null, [4], []],[], [[7]], [[]], [[11]]]
+        let offsets = to_offsets::<OffsetSize>(vec![0, 4, 4, 5, 6, 7]);
+        let l2 = ArrayDataBuilder::new(l2_type.clone())
+            .len(5)
+            .add_buffer(offsets)
+            .add_child_data(l3)
+            .build()
+            .unwrap();
+
+        let offsets = to_offsets::<OffsetSize>(vec![0, 4, 4, 4, 5]);

Review Comment:
   Done in https://github.com/apache/arrow-rs/pull/1588/commits/05c231190aef2c9054b9ebb7fbb59f4a1daf129f



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] alamb commented on a diff in pull request #1588: Add support for nested list arrays from parquet to arrow arrays (#993)

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #1588:
URL: https://github.com/apache/arrow-rs/pull/1588#discussion_r867065942


##########
parquet/src/arrow/array_reader/list_array.rs:
##########
@@ -35,12 +35,12 @@ pub struct ListArrayReader<OffsetSize: OffsetSizeTrait> {
     item_reader: Box<dyn ArrayReader>,
     data_type: ArrowType,
     item_type: ArrowType,
-    list_def_level: i16,
-    list_rep_level: i16,
-    list_empty_def_level: i16,
-    list_null_def_level: i16,
-    def_level_buffer: Option<Buffer>,
-    rep_level_buffer: Option<Buffer>,
+    // The definition level at which this list is not null
+    def_level: i16,
+    // The repetition level that corresponds to a new value in this array

Review Comment:
   ```suggestion
       /// The repetition level that corresponds to a new value in this array
   ```



##########
parquet/src/arrow/array_reader/list_array.rs:
##########
@@ -35,12 +35,12 @@ pub struct ListArrayReader<OffsetSize: OffsetSizeTrait> {
     item_reader: Box<dyn ArrayReader>,
     data_type: ArrowType,
     item_type: ArrowType,
-    list_def_level: i16,
-    list_rep_level: i16,
-    list_empty_def_level: i16,
-    list_null_def_level: i16,
-    def_level_buffer: Option<Buffer>,
-    rep_level_buffer: Option<Buffer>,
+    // The definition level at which this list is not null
+    def_level: i16,
+    // The repetition level that corresponds to a new value in this array
+    rep_level: i16,
+    // If this list is nullable

Review Comment:
   ```suggestion
       /// If this list is nullable
   ```



##########
parquet/src/arrow/array_reader/list_array.rs:
##########
@@ -35,12 +35,12 @@ pub struct ListArrayReader<OffsetSize: OffsetSizeTrait> {
     item_reader: Box<dyn ArrayReader>,
     data_type: ArrowType,
     item_type: ArrowType,
-    list_def_level: i16,
-    list_rep_level: i16,
-    list_empty_def_level: i16,
-    list_null_def_level: i16,
-    def_level_buffer: Option<Buffer>,
-    rep_level_buffer: Option<Buffer>,
+    // The definition level at which this list is not null

Review Comment:
   ```suggestion
       /// The definition level at which this list is not null
   ```



##########
parquet/src/schema/visitor.rs:
##########
@@ -27,17 +27,30 @@ pub trait TypeVisitor<R, C> {
 
     /// Default implementation when visiting a list.
     ///
-    /// It checks list type definition and calls `visit_list_with_item` with extracted
+    /// It checks list type definition and calls [`Self::visit_list_with_item`] with extracted
     /// item type.
     ///
     /// To fully understand this algorithm, please refer to
     /// [parquet doc](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md).
+    ///
+    /// For example, a standard list type looks like:

Review Comment:
   ❤️ 



##########
parquet/src/arrow/array_reader/list_array.rs:
##########
@@ -193,122 +245,267 @@ mod tests {
     use crate::file::reader::{FileReader, SerializedFileReader};
     use crate::schema::parser::parse_message_type;
     use crate::schema::types::SchemaDescriptor;
-    use arrow::array::{Array, LargeListArray, ListArray, PrimitiveArray};
-    use arrow::datatypes::{Field, Int32Type as ArrowInt32};
+    use arrow::array::{Array, ArrayDataBuilder, PrimitiveArray};
+    use arrow::datatypes::{Field, Int32Type as ArrowInt32, Int32Type};
     use std::sync::Arc;
 
-    #[test]
-    fn test_list_array_reader() {
-        // [[1, null, 2], null, [3, 4]]
+    fn list_type<OffsetSize: OffsetSizeTrait>(
+        data_type: ArrowType,
+        item_nullable: bool,
+    ) -> ArrowType {
+        let field = Box::new(Field::new("item", data_type, item_nullable));
+        match OffsetSize::is_large() {
+            true => ArrowType::LargeList(field),
+            false => ArrowType::List(field),
+        }
+    }
+
+    fn downcast<OffsetSize: OffsetSizeTrait>(
+        array: &ArrayRef,
+    ) -> &'_ GenericListArray<OffsetSize> {
+        array
+            .as_any()
+            .downcast_ref::<GenericListArray<OffsetSize>>()
+            .unwrap()
+    }
+
+    fn to_offsets<OffsetSize: OffsetSizeTrait>(values: Vec<usize>) -> Buffer {
+        Buffer::from_iter(
+            values
+                .into_iter()
+                .map(|x| OffsetSize::from_usize(x).unwrap()),
+        )
+    }
+
+    fn test_nested_list<OffsetSize: OffsetSizeTrait>() {
+        // 3 lists, with first and third nullable
+        // [
+        //     [
+        //         [[1, null], null, [4], []],
+        //         [],
+        //         [[7]],
+        //         [[]]
+        //     ],
+        //     null,
+        //     [],
+        //     [[[11]]]
+        // ]
+
+        let l3_item_type = ArrowType::Int32;
+        let l3_type = list_type::<OffsetSize>(l3_item_type.clone(), true);
+
+        let l2_item_type = l3_type.clone();
+        let l2_type = list_type::<OffsetSize>(l2_item_type.clone(), true);
+
+        let l1_item_type = l2_type.clone();
+        let l1_type = list_type::<OffsetSize>(l1_item_type.clone(), false);
+
+        let leaf = PrimitiveArray::<Int32Type>::from_iter(vec![
+            Some(1),
+            None,
+            Some(4),
+            Some(7),
+            Some(11),
+        ]);
+
+        // [[1, null], null, [4], [], [7], [], [11]]
+        let offsets = to_offsets::<OffsetSize>(vec![0, 2, 2, 3, 3, 4, 4, 5]);
+        let l3 = ArrayDataBuilder::new(l3_type.clone())
+            .len(7)
+            .add_buffer(offsets)
+            .add_child_data(leaf.data().clone())
+            .null_bit_buffer(Buffer::from([0b01111101]))
+            .build()
+            .unwrap();
+
+        // [[[1, null], null, [4], []],[], [[7]], [[]], [[11]]]
+        let offsets = to_offsets::<OffsetSize>(vec![0, 4, 4, 5, 6, 7]);
+        let l2 = ArrayDataBuilder::new(l2_type.clone())
+            .len(5)
+            .add_buffer(offsets)
+            .add_child_data(l3)
+            .build()
+            .unwrap();
+
+        let offsets = to_offsets::<OffsetSize>(vec![0, 4, 4, 4, 5]);

Review Comment:
   Would it be valuable to have a sublist that has more than 1 item? 
   
   It appears that this test only has `null`, `[]` or a single element `[x]` list. I may be misunderstanding the intent of the test or what is possible in nested lists



##########
parquet/src/arrow/array_reader/list_array.rs:
##########
@@ -193,122 +245,267 @@ mod tests {
     use crate::file::reader::{FileReader, SerializedFileReader};
     use crate::schema::parser::parse_message_type;
     use crate::schema::types::SchemaDescriptor;
-    use arrow::array::{Array, LargeListArray, ListArray, PrimitiveArray};
-    use arrow::datatypes::{Field, Int32Type as ArrowInt32};
+    use arrow::array::{Array, ArrayDataBuilder, PrimitiveArray};
+    use arrow::datatypes::{Field, Int32Type as ArrowInt32, Int32Type};
     use std::sync::Arc;
 
-    #[test]
-    fn test_list_array_reader() {
-        // [[1, null, 2], null, [3, 4]]
+    fn list_type<OffsetSize: OffsetSizeTrait>(
+        data_type: ArrowType,
+        item_nullable: bool,
+    ) -> ArrowType {
+        let field = Box::new(Field::new("item", data_type, item_nullable));
+        match OffsetSize::is_large() {
+            true => ArrowType::LargeList(field),
+            false => ArrowType::List(field),
+        }
+    }
+
+    fn downcast<OffsetSize: OffsetSizeTrait>(
+        array: &ArrayRef,
+    ) -> &'_ GenericListArray<OffsetSize> {
+        array
+            .as_any()
+            .downcast_ref::<GenericListArray<OffsetSize>>()
+            .unwrap()
+    }
+
+    fn to_offsets<OffsetSize: OffsetSizeTrait>(values: Vec<usize>) -> Buffer {
+        Buffer::from_iter(
+            values
+                .into_iter()
+                .map(|x| OffsetSize::from_usize(x).unwrap()),
+        )
+    }
+
+    fn test_nested_list<OffsetSize: OffsetSizeTrait>() {
+        // 3 lists, with first and third nullable
+        // [
+        //     [
+        //         [[1, null], null, [4], []],
+        //         [],
+        //         [[7]],
+        //         [[]]
+        //     ],
+        //     null,
+        //     [],
+        //     [[[11]]]
+        // ]
+
+        let l3_item_type = ArrowType::Int32;
+        let l3_type = list_type::<OffsetSize>(l3_item_type.clone(), true);
+
+        let l2_item_type = l3_type.clone();
+        let l2_type = list_type::<OffsetSize>(l2_item_type.clone(), true);
+
+        let l1_item_type = l2_type.clone();
+        let l1_type = list_type::<OffsetSize>(l1_item_type.clone(), false);
+
+        let leaf = PrimitiveArray::<Int32Type>::from_iter(vec![
+            Some(1),
+            None,
+            Some(4),
+            Some(7),
+            Some(11),
+        ]);
+
+        // [[1, null], null, [4], [], [7], [], [11]]
+        let offsets = to_offsets::<OffsetSize>(vec![0, 2, 2, 3, 3, 4, 4, 5]);
+        let l3 = ArrayDataBuilder::new(l3_type.clone())
+            .len(7)
+            .add_buffer(offsets)
+            .add_child_data(leaf.data().clone())
+            .null_bit_buffer(Buffer::from([0b01111101]))
+            .build()
+            .unwrap();
+
+        // [[[1, null], null, [4], []],[], [[7]], [[]], [[11]]]
+        let offsets = to_offsets::<OffsetSize>(vec![0, 4, 4, 5, 6, 7]);
+        let l2 = ArrayDataBuilder::new(l2_type.clone())
+            .len(5)
+            .add_buffer(offsets)
+            .add_child_data(l3)
+            .build()
+            .unwrap();
+
+        let offsets = to_offsets::<OffsetSize>(vec![0, 4, 4, 4, 5]);
+        let l1 = ArrayDataBuilder::new(l1_type.clone())
+            .len(4)
+            .add_buffer(offsets)
+            .add_child_data(l2)
+            .null_bit_buffer(Buffer::from([0b00001101]))
+            .build()
+            .unwrap();
+
+        let expected = GenericListArray::<OffsetSize>::from(l1);
+
+        let values = Arc::new(PrimitiveArray::<Int32Type>::from(vec![
+            Some(1),
+            None,
+            None,
+            Some(4),
+            None,
+            None,
+            Some(7),
+            None,
+            None,
+            None,
+            Some(11),
+        ]));
+
+        let item_array_reader = InMemoryArrayReader::new(
+            ArrowType::Int32,
+            values,
+            Some(vec![6, 5, 3, 6, 4, 2, 6, 4, 0, 1, 6]),
+            Some(vec![0, 3, 2, 2, 2, 1, 1, 1, 0, 0, 0]),
+        );
+
+        let l3 = ListArrayReader::<OffsetSize>::new(
+            Box::new(item_array_reader),
+            l3_type,
+            l3_item_type,
+            5,
+            3,
+            true,
+        );
+
+        let l2 = ListArrayReader::<OffsetSize>::new(
+            Box::new(l3),
+            l2_type,
+            l2_item_type,
+            3,
+            2,
+            false,
+        );
+
+        let mut l1 = ListArrayReader::<OffsetSize>::new(
+            Box::new(l2),
+            l1_type,
+            l1_item_type,
+            2,
+            1,
+            true,
+        );
+
+        let actual = l1.next_batch(1024).unwrap();

Review Comment:
   Would it also make sense to test with `l1.next_batch(2)` (aka something that doesn't decode the entire array in one go?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] viirya commented on a diff in pull request #1588: Add support for nested list arrays from parquet to arrow arrays (#993)

Posted by GitBox <gi...@apache.org>.

viirya commented on code in PR #1588:
URL: https://github.com/apache/arrow-rs/pull/1588#discussion_r867524585


##########
parquet/src/schema/visitor.rs:
##########
@@ -27,17 +27,30 @@ pub trait TypeVisitor<R, C> {
 
     /// Default implementation when visiting a list.
     ///
-    /// It checks list type definition and calls `visit_list_with_item` with extracted
+    /// It checks list type definition and calls [`Self::visit_list_with_item`] with extracted
     /// item type.
     ///
     /// To fully understand this algorithm, please refer to
     /// [parquet doc](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md).
+    ///
+    /// For example, a standard list type looks like:

Review Comment:
   nice document addition 👍 



##########
parquet/src/arrow/array_reader/list_array.rs:
##########
@@ -35,12 +35,12 @@ pub struct ListArrayReader<OffsetSize: OffsetSizeTrait> {
     item_reader: Box<dyn ArrayReader>,
     data_type: ArrowType,
     item_type: ArrowType,
-    list_def_level: i16,
-    list_rep_level: i16,
-    list_empty_def_level: i16,
-    list_null_def_level: i16,
-    def_level_buffer: Option<Buffer>,
-    rep_level_buffer: Option<Buffer>,
+    /// The definition level at which this list is not null
+    def_level: i16,
+    /// The repetition level that corresponds to a new value in this array
+    rep_level: i16,
+    /// If this list is nullable
+    nullable: bool,

Review Comment:
   The comment looks clear. 👍



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1588: WIP: Add support for nested list arrays (#993)

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #1588:
URL: https://github.com/apache/arrow-rs/pull/1588#discussion_r864990665


##########
parquet/src/arrow/array_reader/builder.rs:
##########
@@ -295,96 +297,80 @@ impl<'a> TypeVisitor<Option<Box<dyn ArrayReader>>, &'a ArrayReaderBuilderContext
             .first()
             .ok_or_else(|| ArrowError("List field must have a child.".to_string()))?
             .clone();
-        let mut new_context = context.clone();
 
+        // If the list can contain nulls
+        let nullable = match list_type.get_basic_info().repetition() {
+            Repetition::REQUIRED => false,
+            Repetition::OPTIONAL => true,
+            Repetition::REPEATED => {
+                return Err(general_err!("List type cannot be repeated"))
+            }
+        };
+
+        let mut new_context = context.clone();
         new_context.path.append(vec![list_type.name().to_string()]);
-        // We need to know at what definition a list or its child is null
-        let list_null_def = new_context.def_level;
-        let mut list_empty_def = new_context.def_level;
+
+        // The repeated field
+        new_context.rep_level += 1;
+        new_context.def_level += 1;
 
         // If the list's root is nullable
-        if let Repetition::OPTIONAL = list_type.get_basic_info().repetition() {
+        if nullable {
             new_context.def_level += 1;
-            // current level is nullable, increment to get level for empty list slot
-            list_empty_def += 1;
         }
 
-        match list_child.get_basic_info().repetition() {
-            Repetition::REPEATED => {
-                new_context.def_level += 1;
-                new_context.rep_level += 1;
-            }
-            Repetition::OPTIONAL => {
-                new_context.def_level += 1;
-            }
-            _ => (),
-        }
+        match self.dispatch(item_type, &new_context) {
+            Ok(Some(item_reader)) => {
+                let item_type = item_reader.get_data_type().clone();
+
+                // a list is a group type with a single child. The list child's
+                // name comes from the child's field name.
+                // if the child's name is "list" and it has a child, then use this child
+                if list_child.name() == "list" && !list_child.get_fields().is_empty() {
+                    list_child = list_child.get_fields().first().unwrap();
+                }
 
-        let reader = self.dispatch(item_type.clone(), &new_context);
-        if let Ok(Some(item_reader)) = reader {
-            let item_reader_type = item_reader.get_data_type().clone();
-
-            match item_reader_type {
-                ArrowType::List(_)
-                | ArrowType::FixedSizeList(_, _)
-                | ArrowType::Dictionary(_, _) => Err(ArrowError(format!(
-                    "reading List({:?}) into arrow not supported yet",

Review Comment:
   This is the error that would previously be returned on a nested list



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1588: Add support for nested list arrays from parquet to arrow arrays (#993)

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #1588:
URL: https://github.com/apache/arrow-rs/pull/1588#discussion_r867163842


##########
parquet/src/arrow/array_reader/list_array.rs:
##########
@@ -193,122 +245,267 @@ mod tests {
     use crate::file::reader::{FileReader, SerializedFileReader};
     use crate::schema::parser::parse_message_type;
     use crate::schema::types::SchemaDescriptor;
-    use arrow::array::{Array, LargeListArray, ListArray, PrimitiveArray};
-    use arrow::datatypes::{Field, Int32Type as ArrowInt32};
+    use arrow::array::{Array, ArrayDataBuilder, PrimitiveArray};
+    use arrow::datatypes::{Field, Int32Type as ArrowInt32, Int32Type};
     use std::sync::Arc;
 
-    #[test]
-    fn test_list_array_reader() {
-        // [[1, null, 2], null, [3, 4]]
+    fn list_type<OffsetSize: OffsetSizeTrait>(
+        data_type: ArrowType,
+        item_nullable: bool,
+    ) -> ArrowType {
+        let field = Box::new(Field::new("item", data_type, item_nullable));
+        match OffsetSize::is_large() {
+            true => ArrowType::LargeList(field),
+            false => ArrowType::List(field),
+        }
+    }
+
+    fn downcast<OffsetSize: OffsetSizeTrait>(
+        array: &ArrayRef,
+    ) -> &'_ GenericListArray<OffsetSize> {
+        array
+            .as_any()
+            .downcast_ref::<GenericListArray<OffsetSize>>()
+            .unwrap()
+    }
+
+    fn to_offsets<OffsetSize: OffsetSizeTrait>(values: Vec<usize>) -> Buffer {
+        Buffer::from_iter(
+            values
+                .into_iter()
+                .map(|x| OffsetSize::from_usize(x).unwrap()),
+        )
+    }
+
+    fn test_nested_list<OffsetSize: OffsetSizeTrait>() {
+        // 3 lists, with first and third nullable
+        // [
+        //     [
+        //         [[1, null], null, [4], []],
+        //         [],
+        //         [[7]],
+        //         [[]]
+        //     ],
+        //     null,
+        //     [],
+        //     [[[11]]]
+        // ]
+
+        let l3_item_type = ArrowType::Int32;
+        let l3_type = list_type::<OffsetSize>(l3_item_type.clone(), true);
+
+        let l2_item_type = l3_type.clone();
+        let l2_type = list_type::<OffsetSize>(l2_item_type.clone(), true);
+
+        let l1_item_type = l2_type.clone();
+        let l1_type = list_type::<OffsetSize>(l1_item_type.clone(), false);
+
+        let leaf = PrimitiveArray::<Int32Type>::from_iter(vec![
+            Some(1),
+            None,
+            Some(4),
+            Some(7),
+            Some(11),
+        ]);
+
+        // [[1, null], null, [4], [], [7], [], [11]]
+        let offsets = to_offsets::<OffsetSize>(vec![0, 2, 2, 3, 3, 4, 4, 5]);
+        let l3 = ArrayDataBuilder::new(l3_type.clone())
+            .len(7)
+            .add_buffer(offsets)
+            .add_child_data(leaf.data().clone())
+            .null_bit_buffer(Buffer::from([0b01111101]))
+            .build()
+            .unwrap();
+
+        // [[[1, null], null, [4], []],[], [[7]], [[]], [[11]]]
+        let offsets = to_offsets::<OffsetSize>(vec![0, 4, 4, 5, 6, 7]);
+        let l2 = ArrayDataBuilder::new(l2_type.clone())
+            .len(5)
+            .add_buffer(offsets)
+            .add_child_data(l3)
+            .build()
+            .unwrap();
+
+        let offsets = to_offsets::<OffsetSize>(vec![0, 4, 4, 4, 5]);

Review Comment:
   There is a `[1, null]`, can probably add more though



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on pull request #1588: Add support for nested list arrays (#993)

Posted by GitBox <gi...@apache.org>.

tustvold commented on PR #1588:
URL: https://github.com/apache/arrow-rs/pull/1588#issuecomment-1117608855

   I've confirmed this can read [nested_schema.parquet](https://github.com/apache/arrow-datafusion/files/7621703/nested_schema_1row.parquet.zip) from @andrei-ionescu on https://github.com/apache/arrow-datafusion/issues/1383.
   
   I've also confirmed this can read https://github.com/Igosuki/arrow2/blob/main/part-00000-b4749aa1-94e4-4ddb-bab2-954c4d3a290f.c000.snappy.parquet and https://github.com/chauhanVritul/sampleparquet/blob/main/part-00000-8e5acb24-eb4e-491c-8c85-88799f25d1f0-c000.snappy.parquet provided by @Igosuki on #720 
   
   I've done a manual inspection that the output is the same as duckdb, and it looks good


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1588: Add support for nested list arrays (#993)

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #1588:
URL: https://github.com/apache/arrow-rs/pull/1588#discussion_r865045026


##########
parquet/src/arrow/array_reader/list_array.rs:
##########
@@ -88,97 +84,153 @@ impl<OffsetSize: OffsetSizeTrait> ArrayReader for ListArrayReader<OffsetSize> {
         if next_batch_array.len() == 0 {
             return Ok(new_empty_array(&self.data_type));
         }
+
         let def_levels = self
             .item_reader
             .get_def_levels()
-            .ok_or_else(|| ArrowError("item_reader def levels are None.".to_string()))?;
+            .ok_or_else(|| general_err!("item_reader def levels are None."))?;
+
         let rep_levels = self
             .item_reader
             .get_rep_levels()
-            .ok_or_else(|| ArrowError("item_reader rep levels are None.".to_string()))?;
+            .ok_or_else(|| general_err!("item_reader rep levels are None."))?;
 
-        if !((def_levels.len() == rep_levels.len())
-            && (rep_levels.len() == next_batch_array.len()))
-        {
-            return Err(ArrowError(
-                format!("Expected item_reader def_levels {} and rep_levels {} to be same length as batch {}", def_levels.len(), rep_levels.len(), next_batch_array.len()),
+        if OffsetSize::from_usize(next_batch_array.len()).is_none() {
+            return Err(general_err!(
+                "offset of {} would overflow list array",
+                next_batch_array.len()
             ));
         }
 
-        // List definitions can be encoded as 4 values:
-        // - n + 0: the list slot is null
-        // - n + 1: the list slot is not null, but is empty (i.e. [])
-        // - n + 2: the list slot is not null, but its child is empty (i.e. [ null ])
-        // - n + 3: the list slot is not null, and its child is not empty
-        // Where n is the max definition level of the list's parent.
-        // If a Parquet schema's only leaf is the list, then n = 0.
-
-        // If the list index is at empty definition, the child slot is null
-        let non_null_list_indices =
-            def_levels.iter().enumerate().filter_map(|(index, def)| {
-                (*def > self.list_empty_def_level).then(|| index as u32)
-            });
-        let indices = UInt32Array::from_iter_values(non_null_list_indices);
-        let batch_values =
-            arrow::compute::take(&*next_batch_array.clone(), &indices, None)?;
-
-        // first item in each list has rep_level = 0, subsequent items have rep_level = 1
-        let mut offsets: Vec<OffsetSize> = Vec::new();
-        let mut cur_offset = OffsetSize::zero();
-        def_levels.iter().zip(rep_levels).for_each(|(d, r)| {
-            if *r == 0 || d == &self.list_empty_def_level {
-                offsets.push(cur_offset);
-            }
-            if d > &self.list_empty_def_level {
-                cur_offset += OffsetSize::one();
-            }
-        });
-
-        offsets.push(cur_offset);
-
-        let num_bytes = bit_util::ceil(offsets.len(), 8);
-        // TODO: A useful optimization is to use the null count to fill with
-        // 0 or null, to reduce individual bits set in a loop.
-        // To favour dense data, set every slot to true, then unset
-        let mut null_buf = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
-        let null_slice = null_buf.as_slice_mut();
-        let mut list_index = 0;
-        for i in 0..rep_levels.len() {
-            // If the level is lower than empty, then the slot is null.
-            // When a list is non-nullable, its empty level = null level,
-            // so this automatically factors that in.
-            if rep_levels[i] == 0 && def_levels[i] < self.list_empty_def_level {
-                bit_util::unset_bit(null_slice, list_index);
+        // A non-nullable list has a single definition level indicating if the list is empty
+        //
+        // A nullable list has two definition levels associated with it:
+        //
+        // The first identifies if the list is null
+        // The second identifies if the list is empty
+        //
+        // The child data returned above is padded with a value for each not-fully defined level.
+        // Therefore null and empty lists will correspond to a value in the child array.
+        //
+        // Whilst nulls may have a non-zero slice in the offsets array, empty lists must
+        // be of zero length. As a result we MUST filter out values corresponding to empty
+        // lists, and for consistency we do the same for nulls.
+
+        // The output offsets for the computed ListArray
+        let mut list_offsets: Vec<OffsetSize> =
+            Vec::with_capacity(next_batch_array.len());
+
+        // The validity mask of the computed ListArray if nullable
+        let mut validity = self
+            .nullable
+            .then(|| BooleanBufferBuilder::new(next_batch_array.len()));
+
+        // The offset into the filtered child data of the current level being considered
+        let mut cur_offset = 0;
+
+        // Identifies the start of a run of values to copy from the source child data
+        let mut filter_start = None;
+
+        // The number of child values skipped due to empty lists or nulls
+        let mut skipped = 0;
+
+        // Builder used to construct the filtered child data, skipping empty lists and nulls
+        let mut child_data_builder = MutableArrayData::new(

Review Comment:
   I switched to using this over the take kernel as I think it makes it easier to follow what is going on, it should also be significantly faster where nulls and empty lists are rare



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold merged pull request #1588: Add support for nested list arrays from parquet to arrow arrays (#993)

Posted by GitBox <gi...@apache.org>.

tustvold merged PR #1588:
URL: https://github.com/apache/arrow-rs/pull/1588


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1588: WIP: Add support for nested list arrays (#993)

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #1588:
URL: https://github.com/apache/arrow-rs/pull/1588#discussion_r864742901


##########
parquet/src/arrow/array_reader/builder.rs:
##########
@@ -295,96 +297,80 @@ impl<'a> TypeVisitor<Option<Box<dyn ArrayReader>>, &'a ArrayReaderBuilderContext
             .first()
             .ok_or_else(|| ArrowError("List field must have a child.".to_string()))?
             .clone();
-        let mut new_context = context.clone();
 
+        // If the list can contain nulls

Review Comment:
   I recommend reading this with whitespace disabled https://github.com/apache/arrow-rs/pull/1588/files?w=1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1588: Add support for nested list arrays (#993)

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #1588:
URL: https://github.com/apache/arrow-rs/pull/1588#discussion_r852471853


##########
parquet/src/arrow/array_reader/list_array.rs:
##########
@@ -88,97 +84,85 @@ impl<OffsetSize: OffsetSizeTrait> ArrayReader for ListArrayReader<OffsetSize> {
         if next_batch_array.len() == 0 {
             return Ok(new_empty_array(&self.data_type));
         }
+
         let def_levels = self
             .item_reader
             .get_def_levels()
-            .ok_or_else(|| ArrowError("item_reader def levels are None.".to_string()))?;
+            .ok_or_else(|| general_err!("item_reader def levels are None."))?;
+
         let rep_levels = self
             .item_reader
             .get_rep_levels()
-            .ok_or_else(|| ArrowError("item_reader rep levels are None.".to_string()))?;
-
-        if !((def_levels.len() == rep_levels.len())
-            && (rep_levels.len() == next_batch_array.len()))
-        {
-            return Err(ArrowError(
-                format!("Expected item_reader def_levels {} and rep_levels {} to be same length as batch {}", def_levels.len(), rep_levels.len(), next_batch_array.len()),
-            ));
-        }
+            .ok_or_else(|| general_err!("item_reader rep levels are None."))?;
 
-        // List definitions can be encoded as 4 values:
-        // - n + 0: the list slot is null
-        // - n + 1: the list slot is not null, but is empty (i.e. [])
-        // - n + 2: the list slot is not null, but its child is empty (i.e. [ null ])
-        // - n + 3: the list slot is not null, and its child is not empty
-        // Where n is the max definition level of the list's parent.
-        // If a Parquet schema's only leaf is the list, then n = 0.
-
-        // If the list index is at empty definition, the child slot is null
-        let non_null_list_indices =
-            def_levels.iter().enumerate().filter_map(|(index, def)| {
-                (*def > self.list_empty_def_level).then(|| index as u32)
-            });
-        let indices = UInt32Array::from_iter_values(non_null_list_indices);
-        let batch_values =
-            arrow::compute::take(&*next_batch_array.clone(), &indices, None)?;

Review Comment:
   We no longer do this, which is likely better for performance, but comes with the caveat that nulls take up more space than before.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1588: WIP: Add support for nested list arrays (#993)

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #1588:
URL: https://github.com/apache/arrow-rs/pull/1588#discussion_r852471853


##########
parquet/src/arrow/array_reader/list_array.rs:
##########
@@ -88,97 +84,85 @@ impl<OffsetSize: OffsetSizeTrait> ArrayReader for ListArrayReader<OffsetSize> {
         if next_batch_array.len() == 0 {
             return Ok(new_empty_array(&self.data_type));
         }
+
         let def_levels = self
             .item_reader
             .get_def_levels()
-            .ok_or_else(|| ArrowError("item_reader def levels are None.".to_string()))?;
+            .ok_or_else(|| general_err!("item_reader def levels are None."))?;
+
         let rep_levels = self
             .item_reader
             .get_rep_levels()
-            .ok_or_else(|| ArrowError("item_reader rep levels are None.".to_string()))?;
-
-        if !((def_levels.len() == rep_levels.len())
-            && (rep_levels.len() == next_batch_array.len()))
-        {
-            return Err(ArrowError(
-                format!("Expected item_reader def_levels {} and rep_levels {} to be same length as batch {}", def_levels.len(), rep_levels.len(), next_batch_array.len()),
-            ));
-        }
+            .ok_or_else(|| general_err!("item_reader rep levels are None."))?;
 
-        // List definitions can be encoded as 4 values:
-        // - n + 0: the list slot is null
-        // - n + 1: the list slot is not null, but is empty (i.e. [])
-        // - n + 2: the list slot is not null, but its child is empty (i.e. [ null ])
-        // - n + 3: the list slot is not null, and its child is not empty
-        // Where n is the max definition level of the list's parent.
-        // If a Parquet schema's only leaf is the list, then n = 0.
-
-        // If the list index is at empty definition, the child slot is null
-        let non_null_list_indices =
-            def_levels.iter().enumerate().filter_map(|(index, def)| {
-                (*def > self.list_empty_def_level).then(|| index as u32)
-            });
-        let indices = UInt32Array::from_iter_values(non_null_list_indices);
-        let batch_values =
-            arrow::compute::take(&*next_batch_array.clone(), &indices, None)?;

Review Comment:
   We no longer do this, which is likely better for performance, but comes with the caveat that nulls take up more space than before.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1588: WIP: Add support for nested list arrays (#993)

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #1588:
URL: https://github.com/apache/arrow-rs/pull/1588#discussion_r864741142


##########
parquet/src/arrow/array_reader/builder.rs:
##########
@@ -208,42 +191,61 @@ impl<'a> TypeVisitor<Option<Box<dyn ArrayReader>>, &'a ArrayReaderBuilderContext
         // Add map type to context
         let mut new_context = context.clone();
         new_context.path.append(vec![map_type.name().to_string()]);
-        if let Repetition::OPTIONAL = map_type.get_basic_info().repetition() {
-            new_context.def_level += 1;
+
+        match map_type.get_basic_info().repetition() {
+            Repetition::REQUIRED => {}
+            Repetition::OPTIONAL => {
+                new_context.def_level += 1;
+            }
+            Repetition::REPEATED => {
+                return Err(ArrowError("Map cannot be repeated".to_string()))
+            }
+        }
+
+        if map_type.get_fields().len() != 1 {
+            return Err(ArrowError(format!(
+                "Map field must have exactly one key_value child, found {}",
+                map_type.get_fields().len()
+            )));
         }
 
         // Add map entry (key_value) to context
-        let map_key_value = map_type.get_fields().first().ok_or_else(|| {
-            ArrowError("Map field must have a key_value entry".to_string())
-        })?;
+        let map_key_value = &map_type.get_fields()[0];
+        if map_key_value.get_basic_info().repetition() != Repetition::REPEATED {
+            return Err(ArrowError(
+                "Child of map field must be repeated".to_string(),
+            ));
+        }
+
         new_context
             .path
             .append(vec![map_key_value.name().to_string()]);
+
         new_context.rep_level += 1;
+        new_context.def_level += 1;
+
+        if map_key_value.get_fields().len() != 2 {
+            // According to the specification the values are optional (#1642)
+            return Err(ArrowError(format!(
+                "Child of map field must have two children, found {}",
+                map_key_value.get_fields().len()
+            )));
+        }
 
         // Get key and value, and create context for each
-        let map_key = map_key_value
-            .get_fields()
-            .first()
-            .ok_or_else(|| ArrowError("Map entry must have a key".to_string()))?;
-        let map_value = map_key_value
-            .get_fields()
-            .get(1)
-            .ok_or_else(|| ArrowError("Map entry must have a value".to_string()))?;
-
-        let key_reader = {
-            let mut key_context = new_context.clone();
-            key_context.def_level += 1;
-            key_context.path.append(vec![map_key.name().to_string()]);

Review Comment:
   This is a drive-by fix, the context is the context of the parent, not the value being dispatched. This would result in map_key appearing twice



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1588: WIP: Add support for nested list arrays (#993)

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #1588:
URL: https://github.com/apache/arrow-rs/pull/1588#discussion_r864989404


##########
parquet/src/arrow/array_reader/builder.rs:
##########
@@ -129,9 +129,10 @@ impl<'a> TypeVisitor<Option<Box<dyn ArrayReader>>, &'a ArrayReaderBuilderContext
 
             let null_mask_only = match cur_type.get_basic_info().repetition() {
                 Repetition::REPEATED => {
-                    new_context.def_level += 1;
-                    new_context.rep_level += 1;
-                    false
+                    return Err(ArrowError(format!(

Review Comment:
   This just moves the error to earlier, there is no point continuing if this is not supported anyway :grin: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1588: WIP: Add support for nested list arrays (#993)

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #1588:
URL: https://github.com/apache/arrow-rs/pull/1588#discussion_r852471389


##########
parquet/src/arrow/array_reader/list_array.rs:
##########
@@ -212,7 +196,7 @@ mod tests {
         let item_array_reader = InMemoryArrayReader::new(
             ArrowType::Int32,
             array,
-            Some(vec![3, 2, 3, 0, 3, 3]),
+            Some(vec![2, 1, 2, 0, 2, 2]),

Review Comment:
   This is just a drive-by fix, it isn't required to get the test to pass, but was technically wrong before



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org