You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/04 20:46:35 UTC

[GitHub] [arrow] sunchao commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables

sunchao commented on a change in pull request #9064:
URL: https://github.com/apache/arrow/pull/9064#discussion_r551558307



##########
File path: rust/datafusion/src/datasource/parquet.rs
##########
@@ -62,17 +64,37 @@ impl TableProvider for ParquetTable {
         self.schema.clone()
     }
 
+    fn supports_filter_pushdown(
+        &self,
+        _filter: &Expr,
+    ) -> Result<TableProviderFilterPushDown> {
+        Ok(TableProviderFilterPushDown::Inexact)
+    }
+
     /// Scan the file(s), using the provided projection, and return one BatchIterator per
     /// partition.
     fn scan(
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
-        _filters: &[Expr],
+        filters: &[Expr],

Review comment:
       Ideally, I feel we should have a proper filter API defined in data fusion which can be shared among various data sources. On the other hand, the actual filtering logic should be implemented by different data sources / formats, probably via converting the data fusion's filter API to the corresponding ones from the latter. 
   
   But this is a very good start and we can probably do them as follow ups (if we don't care much for API changes).

##########
File path: rust/parquet/src/file/serialized_reader.rs
##########
@@ -137,6 +137,22 @@ impl<R: 'static + ChunkReader> SerializedFileReader<R> {
             metadata,
         })
     }
+
+    pub fn filter_row_groups(

Review comment:
       Yeah I think we can either move this to the application layer (i.e., data fusion), or expose it as a util function from `footer.rs`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org