You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/05/27 15:14:47 UTC

[GitHub] [arrow] rdettai opened a new pull request #7291: ARROW-8455: [Rust] Parquet Arrow column read on partially compatible files FIX

rdettai opened a new pull request #7291:
URL: https://github.com/apache/arrow/pull/7291


   This is a fix for the faulty https://github.com/apache/arrow/pull/6935
   
   It addresses https://issues.apache.org/jira/browse/ARROW-8455
   >Seen behavior: When reading a Parquet file into Arrow with get_record_reader_by_columns, it will fail if one of the column of the file is a list (or any other unsupported type).
   
   >Expected behavior: it should only fail if you are actually reading the column with unsuported type.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maxburke commented on pull request #7291: ARROW-8455: [Rust] Parquet Arrow column read on partially compatible files FIX

Posted by GitBox <gi...@apache.org>.
maxburke commented on pull request #7291:
URL: https://github.com/apache/arrow/pull/7291#issuecomment-635062239


   @rdettai Your change works for me! Thank you so much for looking into this! :smile: 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sunchao closed pull request #7291: ARROW-8455: [Rust] Parquet Arrow column read on partially compatible files FIX

Posted by GitBox <gi...@apache.org>.
sunchao closed pull request #7291:
URL: https://github.com/apache/arrow/pull/7291


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maxburke commented on a change in pull request #7291: ARROW-8455: [Rust] Parquet Arrow column read on partially compatible files FIX

Posted by GitBox <gi...@apache.org>.
maxburke commented on a change in pull request #7291:
URL: https://github.com/apache/arrow/pull/7291#discussion_r431553463



##########
File path: rust/parquet/src/arrow/array_reader.rs
##########
@@ -609,23 +609,34 @@ where
 {
     let mut leaves = HashMap::<*const Type, usize>::new();
 
-    let mut filtered_fields: Vec<Rc<Type>> = Vec::new();
+    let mut filtered_root_names = HashSet::<String>::new();
 
     for c in column_indices {
         let column = parquet_schema.column(c).self_type() as *const Type;
         leaves.insert(column, c);
 
         let root = parquet_schema.get_column_root_ptr(c);
-        filtered_fields.push(root);
+        filtered_root_names.insert(root.name().to_string());
     }
 
     if leaves.is_empty() {
         return Err(general_err!("Can't build array reader without columns!"));
     }
 
+    // Only pass root fields that take part in the projection
+    // to avoid traversal of columns that are not read.
+    // TODO: also prune unread parts of the tree in child structures
+    let filtered_root_fields = parquet_schema
+        .root_schema()
+        .get_fields()
+        .into_iter()
+        .filter(|field| filtered_root_names.contains(field.name()))
+        .map(|field| field.clone())

Review comment:
       I think this could probably use `.cloned()` instead of `.map(|field| field.clone())`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7291: ARROW-8455: [Rust] Parquet Arrow column read on partially compatible files FIX

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7291:
URL: https://github.com/apache/arrow/pull/7291#issuecomment-634732266


   https://issues.apache.org/jira/browse/ARROW-8455


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sunchao commented on pull request #7291: ARROW-8455: [Rust] Parquet Arrow column read on partially compatible files FIX

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #7291:
URL: https://github.com/apache/arrow/pull/7291#issuecomment-635063404


   Merged. Thanks @rdettai for the fix and @maxburke for reporting this!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rdettai commented on pull request #7291: ARROW-8455: [Rust] Parquet Arrow column read on partially compatible files FIX

Posted by GitBox <gi...@apache.org>.
rdettai commented on pull request #7291:
URL: https://github.com/apache/arrow/pull/7291#issuecomment-634741622


   @maxburke this should definitively fix your issue ! Sorry for the inconvenience. There are really very few safety nets on this codebase :-)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rdettai commented on a change in pull request #7291: ARROW-8455: [Rust] Parquet Arrow column read on partially compatible files FIX

Posted by GitBox <gi...@apache.org>.
rdettai commented on a change in pull request #7291:
URL: https://github.com/apache/arrow/pull/7291#discussion_r431617782



##########
File path: rust/parquet/src/arrow/array_reader.rs
##########
@@ -609,23 +609,34 @@ where
 {
     let mut leaves = HashMap::<*const Type, usize>::new();
 
-    let mut filtered_fields: Vec<Rc<Type>> = Vec::new();
+    let mut filtered_root_names = HashSet::<String>::new();
 
     for c in column_indices {
         let column = parquet_schema.column(c).self_type() as *const Type;
         leaves.insert(column, c);
 
         let root = parquet_schema.get_column_root_ptr(c);
-        filtered_fields.push(root);
+        filtered_root_names.insert(root.name().to_string());
     }
 
     if leaves.is_empty() {
         return Err(general_err!("Can't build array reader without columns!"));
     }
 
+    // Only pass root fields that take part in the projection
+    // to avoid traversal of columns that are not read.
+    // TODO: also prune unread parts of the tree in child structures
+    let filtered_root_fields = parquet_schema
+        .root_schema()
+        .get_fields()
+        .into_iter()
+        .filter(|field| filtered_root_names.contains(field.name()))
+        .map(|field| field.clone())

Review comment:
       Thanks for the tip !




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org