You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/15 15:10:38 UTC

[GitHub] [arrow] rdettai commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

rdettai commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r543418383



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        filters: &[Expr],

Review comment:
       I would be tempted to make this async in case filtering requires some IO.

##########
File path: rust/datafusion/src/datasource/parquet.rs
##########
@@ -65,6 +66,7 @@ impl TableProvider for ParquetTable {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        _filters: &[Expr],

Review comment:
       Shouldn't we try to see how this could work with parquet and its footer metadata?

##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        filters: &[Expr],
     ) -> Result<Arc<dyn ExecutionPlan>>;
 
     /// Returns the table Statistics
     /// Statistics should be optional because not all data sources can provide statistics.
     fn statistics(&self) -> Statistics;
+
+    /// Tests whether the table provider can make use of a filter expression
+    /// to optimise data retrieval.
+    fn test_filter_pushdown(

Review comment:
       I feel this makes us evaluate the expression twice: once for testing filter support and once at scan time. Moreover, it is likely that different partitions will have different `TableProviderFilterPushDown` values. It would be great optimize out filters on that finer granularity 🚀 But that would imply doing that optimization in the physical plan.

##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -34,6 +35,23 @@ pub struct Statistics {
     pub total_byte_size: Option<usize>,
 }
 
+/// Indicates whether and how a filter expression can be handled by a
+/// TableProvider for table scans.
+#[derive(Debug, Clone)]
+pub enum TableProviderFilterPushDown {
+    /// The expression cannot be used by the provider.
+    Unsupported,
+    /// The expression can be used to help minimise the data retrieved,
+    /// but the provider cannot guarantee that all returned tuples

Review comment:
       How will the distinction between a filter that is `Unsupported` and `Inexact` be leveraged? If `TableProviderFilterPushDown` is only used to know which filter expression can be optimized out, I think that a boolean would be enough.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org