You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/15 20:39:41 UTC

[GitHub] [arrow-rs] viirya opened a new pull request #1448: Fix Parquet reader for null list

viirya opened a new pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448


   # Which issue does this PR close?
   
   <!---
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #1399.
   
   # Rationale for this change
    
    <!---
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   # What changes are included in this PR?
   
   <!---
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   
   
   <!---
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] sunchao commented on a change in pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#discussion_r828219982



##########
File path: parquet/src/arrow/arrow_reader.rs
##########
@@ -1210,4 +1210,19 @@ mod tests {
         assert_eq!(get_dict(&batches[3]), get_dict(&batches[4]));
         assert_eq!(get_dict(&batches[4]), get_dict(&batches[5]));
     }
+
+    #[test]
+    fn test_read_null_list() {
+        let testdata = arrow::util::test_util::parquet_test_data();
+        let path = format!("{}/null_list.parquet", testdata);

Review comment:
       @novemberkilo could you open a PR against `parquet-testing` and add your test file there?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#discussion_r831805322



##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -885,6 +885,9 @@ fn remove_indices(
                 Ok(Arc::new(StructArray::from((new_columns, valid.finish()))))
             }
         }
+        ArrowType::Null => {

Review comment:
       Yea, seems so.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#discussion_r828239497



##########
File path: parquet/src/arrow/arrow_reader.rs
##########
@@ -1210,4 +1210,19 @@ mod tests {
         assert_eq!(get_dict(&batches[3]), get_dict(&batches[4]));
         assert_eq!(get_dict(&batches[4]), get_dict(&batches[5]));
     }
+
+    #[test]
+    fn test_read_null_list() {
+        let testdata = arrow::util::test_util::parquet_test_data();
+        let path = format!("{}/null_list.parquet", testdata);

Review comment:
       yea, I think it is good idea.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#discussion_r827403115



##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -924,7 +927,7 @@ impl<OffsetSize: OffsetSizeTrait> ArrayReader for ListArrayReader<OffsetSize> {
             && (rep_levels.len() == next_batch_array.len()))
         {
             return Err(ArrowError(
-                "Expected item_reader def_levels and rep_levels to be same length as batch".to_string(),
+                format!("Expected item_reader def_levels {} and rep_levels {} to be same length as batch {}", def_levels.len(), rep_levels.len(), next_batch_array.len()),

Review comment:
       Purely for debugging purpose. I feel it is clear.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#issuecomment-1075351627


   Thank you @alamb @tustvold @novemberkilo @sunchao !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#discussion_r828713806



##########
File path: parquet/src/arrow/arrow_reader.rs
##########
@@ -1210,4 +1210,19 @@ mod tests {
         assert_eq!(get_dict(&batches[3]), get_dict(&batches[4]));
         assert_eq!(get_dict(&batches[4]), get_dict(&batches[5]));
     }
+
+    #[test]
+    fn test_read_null_list() {
+        let testdata = arrow::util::test_util::parquet_test_data();
+        let path = format!("{}/null_list.parquet", testdata);

Review comment:
       Oh, I see. Thanks. I added the file manually to test it before.
   
   And seems updating `.gitmodules` doesn't work for CI here. We may need to wait for the test file being merged in `parquet-testing`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] novemberkilo commented on a change in pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
novemberkilo commented on a change in pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#discussion_r827715048



##########
File path: parquet/src/arrow/arrow_reader.rs
##########
@@ -1210,4 +1210,19 @@ mod tests {
         assert_eq!(get_dict(&batches[3]), get_dict(&batches[4]));
         assert_eq!(get_dict(&batches[4]), get_dict(&batches[5]));
     }
+
+    #[test]
+    fn test_read_null_list() {
+        let testdata = arrow::util::test_util::parquet_test_data();
+        let path = format!("{}/null_list.parquet", testdata);

Review comment:
       You could point to my parquet-testing fork like I have done here https://github.com/novemberkilo/arrow-rs/commit/e5952ae74344d253fad7212ec46cb913d5e4f1cb#diff-fe7afb5c9c916e521401d3fcfb4277d5071798c3baf83baf11d6071742823584 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#discussion_r827392367



##########
File path: parquet/src/arrow/arrow_reader.rs
##########
@@ -1210,4 +1210,19 @@ mod tests {
         assert_eq!(get_dict(&batches[3]), get_dict(&batches[4]));
         assert_eq!(get_dict(&batches[4]), get_dict(&batches[5]));
     }
+
+    #[test]
+    fn test_read_null_list() {
+        let testdata = arrow::util::test_util::parquet_test_data();
+        let path = format!("{}/null_list.parquet", testdata);

Review comment:
       I downloaded the test parquet file from @novemberkilo's `parquet-testing` fork. Not sure how we can put the file there?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] novemberkilo commented on pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
novemberkilo commented on pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#issuecomment-1068835459


   Thanks for taking this on @viirya -- I'm following along with interest :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb merged pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#discussion_r831658520



##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -1984,12 +1990,14 @@ impl<'a> ArrayReaderBuilder {
                     field = self.arrow_schema.field_with_name(part).ok();
                 } else if let Some(f) = field {
                     if let ArrowType::Struct(fields) = f.data_type() {
-                        field = fields.iter().find(|f| f.name() == part)
+                        field = fields.iter().find(|f| f.name() == part);
+                    } else if let ArrowType::List(list_field) = f.data_type() {

Review comment:
       Yea, this is not specific to nulls.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] novemberkilo commented on a change in pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
novemberkilo commented on a change in pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#discussion_r828693723



##########
File path: parquet/src/arrow/arrow_reader.rs
##########
@@ -1210,4 +1210,19 @@ mod tests {
         assert_eq!(get_dict(&batches[3]), get_dict(&batches[4]));
         assert_eq!(get_dict(&batches[4]), get_dict(&batches[5]));
     }
+
+    #[test]
+    fn test_read_null_list() {
+        let testdata = arrow::util::test_util::parquet_test_data();
+        let path = format!("{}/null_list.parquet", testdata);

Review comment:
       @viirya fwiw I did something like:
   
   - delete the parquet-testing submodule
   - `git submodule add -b topic/null-list-reader git@github.com:novemberkilo/parquet-testing`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#discussion_r827392367



##########
File path: parquet/src/arrow/arrow_reader.rs
##########
@@ -1210,4 +1210,19 @@ mod tests {
         assert_eq!(get_dict(&batches[3]), get_dict(&batches[4]));
         assert_eq!(get_dict(&batches[4]), get_dict(&batches[5]));
     }
+
+    #[test]
+    fn test_read_null_list() {
+        let testdata = arrow::util::test_util::parquet_test_data();
+        let path = format!("{}/null_list.parquet", testdata);

Review comment:
       I downloaded the test parquet file from @novemberkilo's `parquet-testing` fork to test it locally. Not sure how we can put the file there?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter commented on pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#issuecomment-1074802098


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1448?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1448](https://codecov.io/gh/apache/arrow-rs/pull/1448?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (acd8d47) into [master](https://codecov.io/gh/apache/arrow-rs/commit/717216f2a296566b10d786a113ceea937e95a459?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (717216f) will **increase** coverage by `0.04%`.
   > The diff coverage is `82.75%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1448      +/-   ##
   ==========================================
   + Coverage   82.67%   82.72%   +0.04%     
   ==========================================
     Files         185      187       +2     
     Lines       53866    54186     +320     
   ==========================================
   + Hits        44535    44824     +289     
   - Misses       9331     9362      +31     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1448?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1448/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyLnJz) | `78.72% <64.28%> (+0.45%)` | :arrow_up: |
   | [parquet/src/arrow/arrow\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1448/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfcmVhZGVyLnJz) | `93.79% <100.00%> (+0.22%)` | :arrow_up: |
   | [arrow/src/array/data.rs](https://codecov.io/gh/apache/arrow-rs/pull/1448/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2RhdGEucnM=) | `82.95% <0.00%> (-0.21%)` | :arrow_down: |
   | [arrow/src/array/array\_list.rs](https://codecov.io/gh/apache/arrow-rs/pull/1448/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X2xpc3QucnM=) | `95.52% <0.00%> (ø)` | |
   | [integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow-rs/pull/1448/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aW50ZWdyYXRpb24tdGVzdGluZy9zcmMvbGliLnJz) | `0.00% <0.00%> (ø)` | |
   | [arrow/src/compute/kernels/window.rs](https://codecov.io/gh/apache/arrow-rs/pull/1448/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy93aW5kb3cucnM=) | `100.00% <0.00%> (ø)` | |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1448/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.98% <0.00%> (ø)` | |
   | [arrow/src/array/transform/union.rs](https://codecov.io/gh/apache/arrow-rs/pull/1448/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS91bmlvbi5ycw==) | `75.00% <0.00%> (ø)` | |
   | [arrow/src/array/transform/fixed\_size\_list.rs](https://codecov.io/gh/apache/arrow-rs/pull/1448/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9maXhlZF9zaXplX2xpc3QucnM=) | `73.68% <0.00%> (ø)` | |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1448/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `86.46% <0.00%> (+0.03%)` | :arrow_up: |
   | ... and [6 more](https://codecov.io/gh/apache/arrow-rs/pull/1448/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1448?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1448?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [717216f...acd8d47](https://codecov.io/gh/apache/arrow-rs/pull/1448?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] novemberkilo commented on a change in pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
novemberkilo commented on a change in pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#discussion_r828691588



##########
File path: parquet/src/arrow/arrow_reader.rs
##########
@@ -1210,4 +1210,19 @@ mod tests {
         assert_eq!(get_dict(&batches[3]), get_dict(&batches[4]));
         assert_eq!(get_dict(&batches[4]), get_dict(&batches[5]));
     }
+
+    #[test]
+    fn test_read_null_list() {
+        let testdata = arrow::util::test_util::parquet_test_data();
+        let path = format!("{}/null_list.parquet", testdata);

Review comment:
       @sunchao https://github.com/apache/parquet-testing/pull/23




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#issuecomment-1074978752


   Merging while we get the permissions for @tustvold  sorted out. Thanks all!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#discussion_r831503296



##########
File path: parquet/src/arrow/arrow_reader.rs
##########
@@ -1210,4 +1210,19 @@ mod tests {
         assert_eq!(get_dict(&batches[3]), get_dict(&batches[4]));
         assert_eq!(get_dict(&batches[4]), get_dict(&batches[5]));
     }
+
+    #[test]
+    fn test_read_null_list() {
+        let testdata = arrow::util::test_util::parquet_test_data();
+        let path = format!("{}/null_list.parquet", testdata);

Review comment:
       Thanks @tustvold. The parquet file was merged. It'd be great to get testing parquet test files checked in faster.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#discussion_r831229793



##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -964,7 +967,10 @@ impl<OffsetSize: OffsetSizeTrait> ArrayReader for ListArrayReader<OffsetSize> {
                 cur_offset += OffsetSize::one();
             }
         });
-        offsets.push(cur_offset);
+
+        if !cur_offset.is_zero() {

Review comment:
       I'm not sure this is correct, I think it will convert an array containing no non-null values, into an empty array?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#discussion_r828249526



##########
File path: parquet/src/arrow/arrow_reader.rs
##########
@@ -1210,4 +1210,19 @@ mod tests {
         assert_eq!(get_dict(&batches[3]), get_dict(&batches[4]));
         assert_eq!(get_dict(&batches[4]), get_dict(&batches[5]));
     }
+
+    #[test]
+    fn test_read_null_list() {
+        let testdata = arrow::util::test_util::parquet_test_data();
+        let path = format!("{}/null_list.parquet", testdata);

Review comment:
       Hmm, seems updating `.gitmodules` doesn't work? I also updated it locally and do `git submodule update --init` again, but I don't see the file in `parquet-testing/data`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#discussion_r831212258



##########
File path: parquet/src/arrow/arrow_reader.rs
##########
@@ -1210,4 +1210,19 @@ mod tests {
         assert_eq!(get_dict(&batches[3]), get_dict(&batches[4]));
         assert_eq!(get_dict(&batches[4]), get_dict(&batches[5]));
     }
+
+    #[test]
+    fn test_read_null_list() {
+        let testdata = arrow::util::test_util::parquet_test_data();
+        let path = format!("{}/null_list.parquet", testdata);
+        let parquet_file_reader =
+            SerializedFileReader::try_from(File::open(&path).unwrap()).unwrap();
+        let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(parquet_file_reader));
+        let mut record_batch_reader = arrow_reader
+            .get_record_reader(60)
+            .expect("Failed to read into array!");
+
+        let batch = record_batch_reader.next();
+        assert!(batch.is_none());

Review comment:
       I don't think this is correct, locally I ran
   
   ```
   > duckdb.query(f"select count(*) from 'null_list.parquet'").fetchall()
   [(1,)]
   ```
   
   Which would imply that the data contains a single row, with a single null value




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#discussion_r827723838



##########
File path: parquet/src/arrow/arrow_reader.rs
##########
@@ -1210,4 +1210,19 @@ mod tests {
         assert_eq!(get_dict(&batches[3]), get_dict(&batches[4]));
         assert_eq!(get_dict(&batches[4]), get_dict(&batches[5]));
     }
+
+    #[test]
+    fn test_read_null_list() {
+        let testdata = arrow::util::test_util::parquet_test_data();
+        let path = format!("{}/null_list.parquet", testdata);

Review comment:
       Okay, I think it works for CI purpose in the PR. Not sure if we can merge into master like that. Anyway, let me point to your fork first to test it. Thanks.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#discussion_r831212258



##########
File path: parquet/src/arrow/arrow_reader.rs
##########
@@ -1210,4 +1210,19 @@ mod tests {
         assert_eq!(get_dict(&batches[3]), get_dict(&batches[4]));
         assert_eq!(get_dict(&batches[4]), get_dict(&batches[5]));
     }
+
+    #[test]
+    fn test_read_null_list() {
+        let testdata = arrow::util::test_util::parquet_test_data();
+        let path = format!("{}/null_list.parquet", testdata);
+        let parquet_file_reader =
+            SerializedFileReader::try_from(File::open(&path).unwrap()).unwrap();
+        let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(parquet_file_reader));
+        let mut record_batch_reader = arrow_reader
+            .get_record_reader(60)
+            .expect("Failed to read into array!");
+
+        let batch = record_batch_reader.next();
+        assert!(batch.is_none());

Review comment:
       I don't think this is correct, locally I ran
   
   ```
   > duckdb.query(f"select count(*) from 'null_list.parquet'").fetchall()
   [(1,)]
   > duckdb.query(f"select * from 'null_list.parquet'").fetchall()
   [(None,)]
   ```
   
   Which would imply that the data contains a single row, with a single null value
   
   
   Perhaps it should be
   
   ```
   let batch = record_batch_reader.next().unwrap().unwrap();
   assert_eq!(batch.num_rows(), 1);
   assert_eq!(batch.num_columns(), 1);
   assert_eq!(batch.column(0).len(), 1);
   
   let list = batch.column(0).as_any().downcast_ref::<ListArray>().unwrap();
   assert_eq!(list.len(), 1);
   assert_eq!(list.is_valid(0), true);
   
   let val = list.value(0);
   assert_eq!(val.len(), 0);
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1448: Fix Parquet reader for null list

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#discussion_r831080215



##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -885,6 +885,9 @@ fn remove_indices(
                 Ok(Arc::new(StructArray::from((new_columns, valid.finish()))))
             }
         }
+        ArrowType::Null => {

Review comment:
       Probably beyond the scope of this PR, but it occurs to me that this `remove_indices` function is basically an inverse of the arrow `take` kernel. I wonder if we could flip the construction of `null_list_indices` to be instead the non-null positions - and just use a regular take kernel?
   
   I might have a play with this :thinking: 

##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -1984,12 +1990,14 @@ impl<'a> ArrayReaderBuilder {
                     field = self.arrow_schema.field_with_name(part).ok();

Review comment:
       Not part of this PR, but I wonder why this doesn't just initialize `field` to be `self.arrow_schema.field_with_name(parts[1]).ok()` and then do `skip(2)` given it has already done the length check above...

##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -1984,12 +1990,14 @@ impl<'a> ArrayReaderBuilder {
                     field = self.arrow_schema.field_with_name(part).ok();
                 } else if let Some(f) = field {
                     if let ArrowType::Struct(fields) = f.data_type() {
-                        field = fields.iter().find(|f| f.name() == part)
+                        field = fields.iter().find(|f| f.name() == part);

Review comment:
       Perhaps a match block might be clearer?

##########
File path: parquet/src/arrow/arrow_reader.rs
##########
@@ -1210,4 +1210,19 @@ mod tests {
         assert_eq!(get_dict(&batches[3]), get_dict(&batches[4]));
         assert_eq!(get_dict(&batches[4]), get_dict(&batches[5]));
     }
+
+    #[test]
+    fn test_read_null_list() {
+        let testdata = arrow::util::test_util::parquet_test_data();
+        let path = format!("{}/null_list.parquet", testdata);

Review comment:
       I will have a play with creating a test that generates a parquet file, so that we can get this PR in without waiting for parquet-testing. I will also file a ticket to start a discussion on a faster way to get parquet test files checked in, without relying on an upstream repo 

##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -1984,12 +1990,14 @@ impl<'a> ArrayReaderBuilder {
                     field = self.arrow_schema.field_with_name(part).ok();
                 } else if let Some(f) = field {
                     if let ArrowType::Struct(fields) = f.data_type() {
-                        field = fields.iter().find(|f| f.name() == part)
+                        field = fields.iter().find(|f| f.name() == part);
+                    } else if let ArrowType::List(list_field) = f.data_type() {

Review comment:
       If I'm not mistaken this change is necessary to support a StructArray containing a ListArray in general, without being specific to nulls?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org