Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/05 20:36:10 UTC

[GitHub] [arrow-rs] sunchao commented on a change in pull request #1389: Introduce `ReadOptions` with builder API, filter row groups that satisfy all filters, and enable filter row groups by range.

sunchao commented on a change in pull request #1389:
URL: https://github.com/apache/arrow-rs/pull/1389#discussion_r820143715



##########
File path: parquet/src/file/serialized_reader.rs
##########
@@ -127,6 +127,56 @@ pub struct SerializedFileReader<R: ChunkReader> {
     metadata: ParquetMetaData,
 }
 
+/// A builder for [`ReadOptions`].
+/// Predicates added to the builder are chained with 'AND'
+/// when filtering row groups.
+pub struct ReadOptionsBuilder {
+    predicates: Vec<Box<dyn FnMut(&RowGroupMetaData, usize) -> bool>>,
+}
+
+impl ReadOptionsBuilder {
+    /// New builder
+    pub fn new() -> Self {
+        ReadOptionsBuilder { predicates: vec![] }
+    }
+
+    /// Add a predicate on row group metadata to the read options.
+    /// Only row groups that match the predicate will be read.
+    pub fn with_predicate(
+        mut self,
+        predicate: Box<dyn FnMut(&RowGroupMetaData, usize) -> bool>,
+    ) -> Self {
+        self.predicates.push(predicate);
+        self
+    }
+
+/// Add a range predicate; only row groups whose midpoints fall within the range are read

Review comment:
       nit: maybe indicate whether the start and end are inclusive or exclusive
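For context, the 'AND' chaining that the builder's doc comment describes can be sketched with a minimal, self-contained example. Note that `RowGroupMeta` and the `filter` method below are hypothetical stand-ins (the real crate uses `RowGroupMetaData` and applies the predicates inside `new_with_options`); only the chaining logic mirrors the diff.

```rust
// Stand-in for the crate's `RowGroupMetaData` (hypothetical, simplified).
struct RowGroupMeta {
    num_rows: i64,
}

// Sketch of the builder from the diff: predicates are collected and
// combined with 'AND' when filtering row groups.
struct ReadOptionsBuilder {
    predicates: Vec<Box<dyn FnMut(&RowGroupMeta, usize) -> bool>>,
}

impl ReadOptionsBuilder {
    fn new() -> Self {
        ReadOptionsBuilder { predicates: vec![] }
    }

    fn with_predicate(
        mut self,
        predicate: Box<dyn FnMut(&RowGroupMeta, usize) -> bool>,
    ) -> Self {
        self.predicates.push(predicate);
        self
    }

    // Hypothetical helper: keep a row group only if *every* predicate
    // returns true for it ('AND' semantics, as in the PR's loop).
    fn filter(mut self, groups: Vec<RowGroupMeta>) -> Vec<RowGroupMeta> {
        let mut kept = Vec::new();
        for (i, g) in groups.into_iter().enumerate() {
            if self.predicates.iter_mut().all(|p| p(&g, i)) {
                kept.push(g);
            }
        }
        kept
    }
}

fn main() {
    let groups = vec![
        RowGroupMeta { num_rows: 10 },
        RowGroupMeta { num_rows: 500 },
        RowGroupMeta { num_rows: 800 },
    ];
    let kept = ReadOptionsBuilder::new()
        .with_predicate(Box::new(|g, _| g.num_rows > 100))
        .with_predicate(Box::new(|_, i| i < 2))
        .filter(groups);
    // Only the group at index 1 (500 rows) satisfies both predicates.
    assert_eq!(kept.len(), 1);
    assert_eq!(kept[0].num_rows, 500);
}
```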

##########
File path: parquet/src/file/serialized_reader.rs
##########
@@ -138,25 +188,48 @@ impl<R: 'static + ChunkReader> SerializedFileReader<R> {
         })
     }
 
-    /// Filters row group metadata to only those row groups,
-    /// for which the predicate function returns true
-    pub fn filter_row_groups(
-        &mut self,
-        predicate: &dyn Fn(&RowGroupMetaData, usize) -> bool,
-    ) {
+    /// Creates file reader from a Parquet file with read options.
+    /// Returns error if Parquet file does not exist or is corrupt.
+    pub fn new_with_options(chunk_reader: R, options: ReadOptions) -> Result<Self> {
+        let metadata = footer::parse_metadata(&chunk_reader)?;
+        let mut predicates = options.predicates;
+        let row_groups = metadata.row_groups().to_vec();
         let mut filtered_row_groups = Vec::<RowGroupMetaData>::new();
-        for (i, row_group_metadata) in self.metadata.row_groups().iter().enumerate() {
-            if predicate(row_group_metadata, i) {
-                filtered_row_groups.push(row_group_metadata.clone());
+        for (i, rg_meta) in row_groups.into_iter().enumerate() {
+            let mut keep = true;
+            for predicate in &mut predicates {
+                if !predicate(&rg_meta, i) {
+                    keep = false;
+                    break;
+                }
+            }
+            if keep {
+                filtered_row_groups.push(rg_meta);
             }
         }
-        self.metadata = ParquetMetaData::new(
-            self.metadata.file_metadata().clone(),
-            filtered_row_groups,
-        );
+
+        Ok(Self {
+            chunk_reader: Arc::new(chunk_reader),
+            metadata: ParquetMetaData::new(
+                metadata.file_metadata().clone(),
+                filtered_row_groups,
+            ),
+        })
     }
 }
 
+/// Get midpoint offset for a row group
+fn get_midpoint_offset(meta: &RowGroupMetaData) -> i64 {
+    let col = meta.column(0);
+    let mut offset = col.data_page_offset();

Review comment:
       For encrypted Parquet files we'll need to use `file_offset`, but that's fine for now since encryption isn't supported anyway.
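The hunk above cuts off after the first line of the offset computation. A plausible shape of the midpoint calculation, sketched with stand-in types (`ColumnMeta`/`RowGroupMeta` and their field names are assumptions, not the crate's actual API), is: take the earliest page offset of the first column (a dictionary page, if present, precedes the data pages) and add half the row group's compressed size.

```rust
// Hypothetical stand-in for the crate's `ColumnChunkMetaData`.
struct ColumnMeta {
    data_page_offset: i64,
    // A dictionary page, when present, sits before the data pages.
    dictionary_page_offset: Option<i64>,
}

// Hypothetical stand-in for the crate's `RowGroupMetaData`.
struct RowGroupMeta {
    columns: Vec<ColumnMeta>,
    compressed_size: i64,
}

// Midpoint = start of the row group + half its compressed size,
// where the start is the smallest page offset of the first column.
fn get_midpoint_offset(meta: &RowGroupMeta) -> i64 {
    let col = &meta.columns[0];
    let mut offset = col.data_page_offset;
    if let Some(dict) = col.dictionary_page_offset {
        offset = offset.min(dict);
    }
    offset + meta.compressed_size / 2
}

fn main() {
    let rg = RowGroupMeta {
        columns: vec![ColumnMeta {
            data_page_offset: 100,
            dictionary_page_offset: Some(64),
        }],
        compressed_size: 200,
    };
    // start = min(100, 64) = 64; midpoint = 64 + 200 / 2 = 164
    assert_eq!(get_midpoint_offset(&rg), 164);
}
```

A row group is then kept by a range predicate when this midpoint falls inside the requested byte range, which is why the review asks whether the bounds are inclusive.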




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
