You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/21 13:20:53 UTC

[GitHub] [arrow-rs] tustvold commented on a change in pull request #1448: Fix Parquet reader for null list

tustvold commented on a change in pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#discussion_r831080215



##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -885,6 +885,9 @@ fn remove_indices(
                 Ok(Arc::new(StructArray::from((new_columns, valid.finish()))))
             }
         }
+        ArrowType::Null => {

Review comment:
       Probably beyond the scope of this PR, but it occurs to me that this `remove_indices` function is basically an inverse of the arrow `take` kernel. I wonder if we could flip the construction of `null_list_indices` to be instead the non-null positions - and just use a regular take kernel?
   
   I might have a play with this :thinking: 

##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -1984,12 +1990,14 @@ impl<'a> ArrayReaderBuilder {
                     field = self.arrow_schema.field_with_name(part).ok();

Review comment:
       Not part of this PR, but I wonder why this doesn't just initialize `field` to be `self.arrow_schema.field_with_name(parts[1]).ok()` and then do `skip(2)` given it has already done the length check above...

##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -1984,12 +1990,14 @@ impl<'a> ArrayReaderBuilder {
                     field = self.arrow_schema.field_with_name(part).ok();
                 } else if let Some(f) = field {
                     if let ArrowType::Struct(fields) = f.data_type() {
-                        field = fields.iter().find(|f| f.name() == part)
+                        field = fields.iter().find(|f| f.name() == part);

Review comment:
       Perhaps a match block might be clearer?

##########
File path: parquet/src/arrow/arrow_reader.rs
##########
@@ -1210,4 +1210,19 @@ mod tests {
         assert_eq!(get_dict(&batches[3]), get_dict(&batches[4]));
         assert_eq!(get_dict(&batches[4]), get_dict(&batches[5]));
     }
+
+    #[test]
+    fn test_read_null_list() {
+        let testdata = arrow::util::test_util::parquet_test_data();
+        let path = format!("{}/null_list.parquet", testdata);

Review comment:
       I will have a play with creating a test that generates a parquet file, so that we can get this PR in without waiting for parquet-testing. I will also file a ticket to start a discussion on a faster way to get parquet test files checked in, without relying on an upstream repo 

##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -1984,12 +1990,14 @@ impl<'a> ArrayReaderBuilder {
                     field = self.arrow_schema.field_with_name(part).ok();
                 } else if let Some(f) = field {
                     if let ArrowType::Struct(fields) = f.data_type() {
-                        field = fields.iter().find(|f| f.name() == part)
+                        field = fields.iter().find(|f| f.name() == part);
+                    } else if let ArrowType::List(list_field) = f.data_type() {

Review comment:
       If I'm not mistaken this change is necessary to support a StructArray containing a ListArray in general, without being specific to nulls?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org