You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/14 19:44:16 UTC

[GitHub] [arrow] returnString opened a new pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

returnString opened a new pull request #8917:
URL: https://github.com/apache/arrow/pull/8917


   I've got a use case for this with a custom TableProvider implementation, so thought I'd give this a go :)
   
   This PR allows TableProviders to optionally indicate that they support handling filter expressions either:
   - Inexactly, to simply optimise data retrieval in an approximate fashion; e.g. pruning in your classic chunked storage system with min/max column metadata stored per chunk
   - Exactly, in which case the relevant filter plan nodes can be optimised out entirely
   
   Some preemptive concerns from my side:
   - Most of these concepts could probably have better names, open to suggestions here.
   - I'm not sure whether expressions are the correct thing to be pushing down to the provider.
   - I've had to update quite a few `scan` callsites with empty filter lists. Could this be handled in a better way?
   - Currently, only table scans using TableSource::FromProvider are supported, because we need a reference to the provider at optimisation time. #8910 removes the provider/named-based reference distinction entirely so I can rebase this once that's merged and add an extra test using an ordinary sql statement, rather than just a `ctx.read_table(provider)` call.
   
   I'd appreciate any thoughts or feedback!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r543730441



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        filters: &[Expr],

Review comment:
       If we don't like passing around `&[]` all over the place, we could name `scan` --> `scan_with_filter` and add a default implementation of `scan` 
   
   That would also have the nice benefit of keeping the signatures the same for existing TableProvider implementations. I do not feel strongly about this

##########
File path: rust/datafusion/src/logical_plan/plan.rs
##########
@@ -477,17 +479,29 @@ impl LogicalPlan {
                     LogicalPlan::TableScan {
                         ref source,
                         ref projection,
+                        ref filters,
                         ..
-                    } => match source {
-                        TableSource::FromContext(table_name) => write!(
-                            f,
-                            "TableScan: {} projection={:?}",
-                            table_name, projection
-                        ),
-                        TableSource::FromProvider(_) => {
-                            write!(f, "TableScan: projection={:?}", projection)
+                    } => {
+                        match source {
+                            TableSource::FromContext(table_name) => write!(
+                                f,
+                                "TableScan: {} projection={:?}",
+                                table_name, projection
+                            )?,
+                            TableSource::FromProvider(_) => {
+                                write!(f, "TableScan: projection={:?}", projection)?
+                            }
+                        };
+
+                        // TODO: this probably shouldn't be conditional on a non-empty list

Review comment:
       I think it is fine to not print out filters if they aren't present

##########
File path: rust/datafusion/src/datasource/parquet.rs
##########
@@ -65,6 +66,7 @@ impl TableProvider for ParquetTable {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        _filters: &[Expr],

Review comment:
       > I think it makes sense to separate the generic support for predicate push-down to the data source from the implementation for various data sources such as parquet because each change will be fairly big so makes sense to split into smaller changes;
   
   I think this approach makes a lot of sense. 
   
   I have also been thinking about pushing predicates down into custom table providers, and FWIW I think this interface would work well for my use. 

##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -34,6 +35,23 @@ pub struct Statistics {
     pub total_byte_size: Option<usize>,
 }
 
+/// Indicates whether and how a filter expression can be handled by a
+/// TableProvider for table scans.
+#[derive(Debug, Clone)]
+pub enum TableProviderFilterPushDown {
+    /// The expression cannot be used by the provider.
+    Unsupported,
+    /// The expression can be used to help minimise the data retrieved,
+    /// but the provider cannot guarantee that all returned tuples

Review comment:
       I think that these names make sense as does the behavior described

##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        filters: &[Expr],

Review comment:
       I think if filtering requires IO it could be done as part of the actual Execution. 
   
   It seems like the main benefit of removing plan nodes during planning is to avoid potentially expensive I/O that could be ruled out with Metadata queries (as @returnString  describes is his usecase) 
   
   Thus if the system has to go and do I/O anyways to figure out if it rule out a plan node, much of the benefit of pruning such nodes from the plan seem to be lost. In that case it seems better to me to do do such pruning as part of Execution. 
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rdettai commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
rdettai commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r544353356



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        filters: &[Expr],

Review comment:
       I agree that adding async requires quite a bit of propagation. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] returnString commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
returnString commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r543497080



##########
File path: rust/datafusion/src/datasource/parquet.rs
##########
@@ -65,6 +66,7 @@ impl TableProvider for ParquetTable {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        _filters: &[Expr],

Review comment:
       I honestly know very little about the internals of Parquet and I'd like to avoid too much extra complexity in this PR, I wonder if making use of this in the various existing implementations where feasible would work better as a followup task (or even task per provider)?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] returnString commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
returnString commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r543527000



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        filters: &[Expr],
     ) -> Result<Arc<dyn ExecutionPlan>>;
 
     /// Returns the table Statistics
     /// Statistics should be optional because not all data sources can provide statistics.
     fn statistics(&self) -> Statistics;
+
+    /// Tests whether the table provider can make use of a filter expression
+    /// to optimise data retrieval.
+    fn test_filter_pushdown(

Review comment:
       > Maybe `supports_filter_pushdown`?
   
   That's definitely better, other people seem to like it as well, sold :p
   
   > I feel this makes us evaluate the expression twice: once for testing filter support and once at scan time. Moreover, it is likely that different partitions will have different `TableProviderFilterPushDown` values. It would be great optimize out filters on that finer granularity 🚀 But that would imply doing that optimization in the physical plan.
   
   Yeah, I think this depends on whether it's preferable to:
   - offer extensibility for providers, allowing them to interpret filters however they like
   - promote some notion of external data chunk into an actual DataFusion concept (I need to read through your catalogue proposal doc properly!) and get providers to return a list of available chunks with their bounds, then having DataFusion perform the high-level filtering where possible, and pass that more selective list of chunk IDs back for actual data retrieval
   
   And personally I see pros and cons for both of those options (flexibility is cool, but sometimes dangerous).
   
   I'm not familiar enough with partitioning in DataFusion yet to quite understand what you mean for the different pushdown support return values across partitions; I would expect that the filter pushdown status is usually a property of the column in question? e.g. depending on whether a column tracks sufficient statistics and the provider storage mechanism allows for it




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r545483003



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -34,6 +35,23 @@ pub struct Statistics {
     pub total_byte_size: Option<usize>,
 }
 
+/// Indicates whether and how a filter expression can be handled by a
+/// TableProvider for table scans.
+#[derive(Debug, Clone)]
+pub enum TableProviderFilterPushDown {
+    /// The expression cannot be used by the provider.
+    Unsupported,
+    /// The expression can be used to help minimise the data retrieved,
+    /// but the provider cannot guarantee that all returned tuples

Review comment:
       Basically I don't have a strong opinion either way -- putting in three enum variants when one may be of limited utility seems ok to me -- we could always remove / collapse them later
   
   @rdettai  -- what do you think about merging this PR in and then proposing a clean up  in a subsequent PR? I feel like the design by @returnString is pretty good, even if one of the enum values is of limited utility.
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #8917:
URL: https://github.com/apache/arrow/pull/8917


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] returnString commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
returnString commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r543494569



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -34,6 +35,23 @@ pub struct Statistics {
     pub total_byte_size: Option<usize>,
 }
 
+/// Indicates whether and how a filter expression can be handled by a
+/// TableProvider for table scans.
+#[derive(Debug, Clone)]
+pub enum TableProviderFilterPushDown {
+    /// The expression cannot be used by the provider.
+    Unsupported,
+    /// The expression can be used to help minimise the data retrieved,
+    /// but the provider cannot guarantee that all returned tuples

Review comment:
       This result is used to drive the logical plan optimisation. Returning `Unsupported` _or_ `Inexact` will both result in the original filter node being preserved in the plan (only `Exact` can provide optimisation guarantees that the post-hoc filtering is unecessary) but `Inexact` means that the filter expr in question will be passed to `TableProvider::scan` when creating the execution plan, whereas `Unsupported` means the provider can do nothing useful with the expr at all (relevant code is [here](https://github.com/apache/arrow/pull/8917/files#diff-5f6ab9edcef3a4b9a8eb75f641ac0930cdbae92156f056fea400e0c58218a094R354-R369)).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rdettai commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
rdettai commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r544337093



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        filters: &[Expr],
     ) -> Result<Arc<dyn ExecutionPlan>>;
 
     /// Returns the table Statistics
     /// Statistics should be optional because not all data sources can provide statistics.
     fn statistics(&self) -> Statistics;
+
+    /// Tests whether the table provider can make use of a filter expression
+    /// to optimise data retrieval.
+    fn test_filter_pushdown(

Review comment:
       > I'm not familiar enough with partitioning in DataFusion yet to quite understand what you mean for the different pushdown support return values across partitions
   
   Sorry, I was not very clear! Let me try to do better 😄 
   
   When the logical plan is converted, DataFusion allows the target physical plan to specify a number of partitions. This typically allows you to use that many threads (or workers) to execute your physical plan. In the context of datasources, this is often the number of files, as they can always be decoded separately and in parallel. But it could be any other relevant partitioning of your data. My point, is that you will often have the statistics at the granularity of the file <=> DataFusion partitions. 
   
   So the workflow here will be the following:
   - In the logical plan, you are evaluating the pushed-down expressions on the entire datasource (all of its files) to know if some of the expressions can be simplified out (using `supports_filter_pushdown()`). To achieve this in your TableProvider implem, you will need to merge all the statistics of your files into one that characterizes the entire datasource, then select the expressions.
   - You will likely need to do the same expression pruning again at the partition level (so in the physical expression). Indeed each file will have more focused (= accurate) statistics that you will want to take advantage of, because you will likely be able to prune out more parts of the filter expression.
   
   My question is whether it is worth adding the complexity of trying to simplify the filtering expression here in the logical plan when it will better done at the partition level in the physical plan.

##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -34,6 +35,23 @@ pub struct Statistics {
     pub total_byte_size: Option<usize>,
 }
 
+/// Indicates whether and how a filter expression can be handled by a
+/// TableProvider for table scans.
+#[derive(Debug, Clone)]
+pub enum TableProviderFilterPushDown {
+    /// The expression cannot be used by the provider.
+    Unsupported,
+    /// The expression can be used to help minimise the data retrieved,
+    /// but the provider cannot guarantee that all returned tuples

Review comment:
       I agree the names explicit! 
   
   I am challenging whether having 3 values is necessary. `Unsupported` vs `Inexact` is something that you don't need to expose publicly here as they are used internally by the the TableProvider when they will be passed back to the `scan()` method (or not). Said otherwise, you can leave it to the the implementation of `scan()` to decide what to do with the expressions, you don't need do do it with `supports_filter_pushdown()`. Only `Exact` vs `Unsupported|Inexact` really needs to be exposed because it is used externally (by the planner)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] returnString commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
returnString commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r543504211



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        filters: &[Expr],

Review comment:
       That's an interesting idea! In my use case, the TableProvider impl interprets the filters synchronously after _asynchronously_ preparing the necessary data externally from DataFusion, but yeah, I could see that happening too.
   
   If we didn't make this change, I suppose implementations could still use these filters within the returned `ExecutionPlan`'s `execute` method for now to enable async use cases?
   
   Edit: my main concern here would be code churn, propagating that new async-ness up through all the callsites.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] returnString commented on pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
returnString commented on pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#issuecomment-745164026


   > FYI this will conflict with parts of #8910. IMO we should merge #8910 first, as it simplifies some of the code, making this PR easier.
   
   Yeah definitely; like I mentioned, that's actually _required_ for this to work with ordinary query methods (not just `read_table(provider)` as the tests are using currently) because we need a reference to the provider whilst doing logical plan optimisation :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#issuecomment-747071989


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8917?src=pr&el=h1) Report
   > Merging [#8917](https://codecov.io/gh/apache/arrow/pull/8917?src=pr&el=desc) (2ba19ba) into [master](https://codecov.io/gh/apache/arrow/commit/d65ba4ec5daeb93ca5031f883d08d559b68753b2?el=desc) (d65ba4e) will **increase** coverage by `0.01%`.
   > The diff coverage is `87.41%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8917/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8917?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #8917      +/-   ##
   ==========================================
   + Coverage   83.25%   83.27%   +0.01%     
   ==========================================
     Files         196      198       +2     
     Lines       48116    48251     +135     
   ==========================================
   + Hits        40059    40181     +122     
   - Misses       8057     8070      +13     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8917?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/datasource/csv.rs](https://codecov.io/gh/apache/arrow/pull/8917/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2Nzdi5ycw==) | `75.00% <ø> (ø)` | |
   | [rust/datafusion/src/datasource/empty.rs](https://codecov.io/gh/apache/arrow/pull/8917/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2VtcHR5LnJz) | `52.94% <ø> (ø)` | |
   | [rust/datafusion/tests/custom\_sources.rs](https://codecov.io/gh/apache/arrow/pull/8917/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL2N1c3RvbV9zb3VyY2VzLnJz) | `75.00% <ø> (ø)` | |
   | [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/8917/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `83.59% <75.00%> (ø)` | |
   | [rust/datafusion/src/physical\_plan/planner.rs](https://codecov.io/gh/apache/arrow/pull/8917/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BsYW5uZXIucnM=) | `80.45% <80.00%> (-0.10%)` | :arrow_down: |
   | [rust/datafusion/tests/provider\_filter\_pushdown.rs](https://codecov.io/gh/apache/arrow/pull/8917/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3Byb3ZpZGVyX2ZpbHRlcl9wdXNoZG93bi5ycw==) | `85.24% <85.24%> (ø)` | |
   | [rust/datafusion/src/optimizer/filter\_push\_down.rs](https://codecov.io/gh/apache/arrow/pull/8917/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvZmlsdGVyX3B1c2hfZG93bi5ycw==) | `97.65% <88.33%> (-1.74%)` | :arrow_down: |
   | [rust/datafusion/src/datasource/datasource.rs](https://codecov.io/gh/apache/arrow/pull/8917/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2RhdGFzb3VyY2UucnM=) | `100.00% <100.00%> (ø)` | |
   | [rust/datafusion/src/datasource/parquet.rs](https://codecov.io/gh/apache/arrow/pull/8917/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL3BhcnF1ZXQucnM=) | `95.31% <100.00%> (ø)` | |
   | [rust/datafusion/src/execution/context.rs](https://codecov.io/gh/apache/arrow/pull/8917/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9leGVjdXRpb24vY29udGV4dC5ycw==) | `89.98% <100.00%> (+0.02%)` | :arrow_up: |
   | ... and [6 more](https://codecov.io/gh/apache/arrow/pull/8917/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8917?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8917?src=pr&el=footer). Last update [d65ba4e...2ba19ba](https://codecov.io/gh/apache/arrow/pull/8917?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rdettai commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
rdettai commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r543418383



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        filters: &[Expr],

Review comment:
       I would be tempted to make this async in case filtering requires some IO.

##########
File path: rust/datafusion/src/datasource/parquet.rs
##########
@@ -65,6 +66,7 @@ impl TableProvider for ParquetTable {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        _filters: &[Expr],

Review comment:
       Shouldn't we try to see how this could work with parquet and its footer metadata?

##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        filters: &[Expr],
     ) -> Result<Arc<dyn ExecutionPlan>>;
 
     /// Returns the table Statistics
     /// Statistics should be optional because not all data sources can provide statistics.
     fn statistics(&self) -> Statistics;
+
+    /// Tests whether the table provider can make use of a filter expression
+    /// to optimise data retrieval.
+    fn test_filter_pushdown(

Review comment:
       I feel this makes us evaluate the expression twice: once for testing filter support and once at scan time. Moreover, it is likely that different partitions will have different `TableProviderFilterPushDown` values. It would be great optimize out filters on that finer granularity 🚀 But that would imply doing that optimization in the physical plan.

##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -34,6 +35,23 @@ pub struct Statistics {
     pub total_byte_size: Option<usize>,
 }
 
+/// Indicates whether and how a filter expression can be handled by a
+/// TableProvider for table scans.
+#[derive(Debug, Clone)]
+pub enum TableProviderFilterPushDown {
+    /// The expression cannot be used by the provider.
+    Unsupported,
+    /// The expression can be used to help minimise the data retrieved,
+    /// but the provider cannot guarantee that all returned tuples

Review comment:
       How will the distinction between a filter that is `Unsupported` and `Inexact` be leveraged? If `TableProviderFilterPushDown` is only used to know which filter expression can be optimized out, I think that a boolean would be enough.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] returnString commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
returnString commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r544457191



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        filters: &[Expr],

Review comment:
       I think I'll leave this for the moment; on reflection, perhaps it'd be better to come up with a more future-proof design if we anticipate needing to expose more info to providers at scan-time; something along the lines of passing a "ScanArgs" struct rather than separate function args here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rdettai commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
rdettai commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r544353356



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        filters: &[Expr],

Review comment:
       I agree that adding async requires quite a bit of propagation. I don't think it is important right now (and it may never be 😄 )




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] returnString commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
returnString commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r544457191



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        filters: &[Expr],

Review comment:
       I think I'll leave this for the moment; on reflection, perhaps it'd be better to come up with a more future-proof design if we anticipate needing to expose more info to providers at scan-time; something along the lines of passing a "ScanArgs" struct rather than separate function args here? (as a future PR, I mean)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r542747164



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -34,6 +35,23 @@ pub struct Statistics {
     pub total_byte_size: Option<usize>,
 }
 
+/// Indicates whether and how a filter expression can be handled by a
+/// TableProvider for table scans.
+#[derive(Debug, Clone)]
+pub enum TableProviderFilterPushDown {
+    /// The expression cannot be used by the provider.
+    Unsupported,
+    /// The expression can be used to help minimise the data retrieved,
+    /// but the provider cannot guarantee that all returned tuples

Review comment:
       I think those are good names and good docs as well




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#issuecomment-748529089


   Thanks again @returnString  -- @rdettai I believe I understand your concern -- and we can always enhance this API / trait to do pruning at the physical level as well.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] returnString commented on pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
returnString commented on pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#issuecomment-747059764


   One more rebase for some minor test changes that just landed :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rdettai commented on pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
rdettai commented on pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#issuecomment-745333390


   Can you take a look at [ARROW-10902](https://issues.apache.org/jira/browse/ARROW-10902). I would be very interested by your input as it is very closely related.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rdettai commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
rdettai commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r544351700



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -34,6 +35,23 @@ pub struct Statistics {
     pub total_byte_size: Option<usize>,
 }
 
+/// Indicates whether and how a filter expression can be handled by a
+/// TableProvider for table scans.
+#[derive(Debug, Clone)]
+pub enum TableProviderFilterPushDown {
+    /// The expression cannot be used by the provider.
+    Unsupported,
+    /// The expression can be used to help minimise the data retrieved,
+    /// but the provider cannot guarantee that all returned tuples

Review comment:
       I agree the names are explicit! 
   
   I am challenging whether having 3 values is necessary. `Unsupported` vs `Inexact` is something that you don't need to expose publicly here as they are used internally by the the TableProvider when they will be passed back to the `scan()` method (or not). Said otherwise, you can leave it to the the implementation of `scan()` to decide what to do with the expressions, you don't need to do it with `supports_filter_pushdown()`. Only `Exact` vs `Unsupported|Inexact` really needs to be exposed because it is used externally (by the planner)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rdettai commented on pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
rdettai commented on pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#issuecomment-746426491


   I am very happy with the general heading of this PR. Thanks for your great work @returnString 😄 . 
   
   I can't shake the filling that we are adding too much complexity to the `TableProvider` enum (1 extra method + 1 signature change + extra enum type). Are we really sure that we couldn't achieve the same with less? (cf [comment](https://github.com/apache/arrow/pull/8917/files#r544351700) and [comment](https://github.com/apache/arrow/pull/8917#discussion_r544337093))


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rdettai commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
rdettai commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r546227882



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -34,6 +35,23 @@ pub struct Statistics {
     pub total_byte_size: Option<usize>,
 }
 
+/// Indicates whether and how a filter expression can be handled by a
+/// TableProvider for table scans.
+#[derive(Debug, Clone)]
+pub enum TableProviderFilterPushDown {
+    /// The expression cannot be used by the provider.
+    Unsupported,
+    /// The expression can be used to help minimise the data retrieved,
+    /// but the provider cannot guarantee that all returned tuples

Review comment:
       I leave it to you to decide, it won't take me long to adapt my code to these changes anyway.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] returnString commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
returnString commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r543706456



##########
File path: rust/datafusion/src/datasource/parquet.rs
##########
@@ -65,6 +66,7 @@ impl TableProvider for ParquetTable {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        _filters: &[Expr],

Review comment:
       @yordan-pavlov Likewise!
   
   Are you broadly happy that this interface and implementation would suffice for your Parquet-specific work? My use case is an entirely external TableProvider implementation, so hopefully that's kept my assumptions to a minimum.
   
   If you've got time, seeing as you've been working in this area, I'd really appreciate any other thoughts you might have here :)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rdettai commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
rdettai commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r544355966



##########
File path: rust/datafusion/src/datasource/parquet.rs
##########
@@ -65,6 +66,7 @@ impl TableProvider for ParquetTable {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        _filters: &[Expr],

Review comment:
       Okay for doing it later, I was just curious to confront the API changes with a concrete use case. If Yordan tried it out, I'm reassured 😉 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rdettai commented on pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
rdettai commented on pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#issuecomment-748462140


   Nobody replied to this [comment](https://github.com/apache/arrow/pull/8917#discussion_r544337093). The question is whether it is worth doing the filter pruning at the logical plan level or if it is more something for the physical plan. Otherwise you can merge! 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rdettai commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
rdettai commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r545240967



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -34,6 +35,23 @@ pub struct Statistics {
     pub total_byte_size: Option<usize>,
 }
 
+/// Indicates whether and how a filter expression can be handled by a
+/// TableProvider for table scans.
+#[derive(Debug, Clone)]
+pub enum TableProviderFilterPushDown {
+    /// The expression cannot be used by the provider.
+    Unsupported,
+    /// The expression can be used to help minimise the data retrieved,
+    /// but the provider cannot guarantee that all returned tuples

Review comment:
       I see... that's annoying! how many tests fail approximately? 
   
   If tests influence our design, we should be sure that they are doing so for a reason (not just because they are poorly written). Here I feel that maybe we should fix them, no? In #8910, I already tweaked an API because otherwise I would have had lots of tests to change (more details [here](https://github.com/apache/arrow/pull/8910#discussion_r543331995)). This was a mistake (due mainly to laziness 🙃), but here I believe the interface is too important to do the same 😄. Any opinion @alamb ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] returnString commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
returnString commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r544460440



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -34,6 +35,23 @@ pub struct Statistics {
     pub total_byte_size: Option<usize>,
 }
 
+/// Indicates whether and how a filter expression can be handled by a
+/// TableProvider for table scans.
+#[derive(Debug, Clone)]
+pub enum TableProviderFilterPushDown {
+    /// The expression cannot be used by the provider.
+    Unsupported,
+    /// The expression can be used to help minimise the data retrieved,
+    /// but the provider cannot guarantee that all returned tuples

Review comment:
       Right, I see what you mean now :) Yes, from a semantics point-of-view, that would result in the same results being delivered. The major issue is that this would add filters to scan nodes unnecessarily and there's currently quite a lot of exact string matching for plan outputs in tests, so this would break lots of existing code.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] returnString commented on pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
returnString commented on pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#issuecomment-746628943


   Rebase is done, plus I've augmented the test coverage to also use the SQL code paths now that providers are resolved eagerly (thanks @rdettai 🥇)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#issuecomment-745083848


   FYI this will conflict with parts of #8910. IMO we should merge #8910 first, as it simplifies some of the code, making this PR easier.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r546286050



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        filters: &[Expr],
     ) -> Result<Arc<dyn ExecutionPlan>>;
 
     /// Returns the table Statistics
     /// Statistics should be optional because not all data sources can provide statistics.
     fn statistics(&self) -> Statistics;
+
+    /// Tests whether the table provider can make use of a filter expression
+    /// to optimise data retrieval.
+    fn test_filter_pushdown(

Review comment:
       @rdettai  -- I think it is reasonable to assume something (I am purposely being vague here) could be done at logical planning time that could save execution time later on. While I don't fully follow the scenario you are describing, I can see how in some cases you won't save much by trying to do partition pruning at the logical planning level.
   
   However, I can think of scenarios where you might be able to (like deciding what hosts to send a plan to, for example). You might be right that that should be done at Physical Planning time. 
   
   It is my opinion that DataFusion is moving sufficiently fast at this point that we are discovering the requirements as we go that having a potentially more complicated API is an acceptable tradeoff, especially given how well this one is documented. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r542743173



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        filters: &[Expr],
     ) -> Result<Arc<dyn ExecutionPlan>>;
 
     /// Returns the table Statistics
     /// Statistics should be optional because not all data sources can provide statistics.
     fn statistics(&self) -> Statistics;
+
+    /// Tests whether the table provider can make use of a filter expression
+    /// to optimise data retrieval.
+    fn test_filter_pushdown(

Review comment:
       Maybe `supports_filter_pushdown`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] returnString commented on pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
returnString commented on pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#issuecomment-744668250


   ...aaand a new conflict has already landed since I started writing up this post, will rebase shortly :D


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rdettai edited a comment on pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
rdettai edited a comment on pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#issuecomment-746426491


   I am very happy with the general heading of this PR. Thanks for your great work @returnString 😄 . 
   
   I can't shake the feeling that we are adding too much complexity to the `TableProvider` enum (1 extra method + 1 signature change + extra enum type). Are we really sure that we couldn't achieve the same with less? (cf [comment](https://github.com/apache/arrow/pull/8917/files#r544351700) and [comment](https://github.com/apache/arrow/pull/8917#discussion_r544337093))


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#issuecomment-744683114


   https://issues.apache.org/jira/browse/ARROW-9828


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
yordan-pavlov commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r543655758



##########
File path: rust/datafusion/src/datasource/parquet.rs
##########
@@ -65,6 +66,7 @@ impl TableProvider for ParquetTable {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        _filters: &[Expr],

Review comment:
       @returnString it's great to see someone else working on predicate push-down as well; 
   I have been working on this for a couple of weeks, targeting an end-to-end implementation for parquet and have done similar changes to the filter push-down optimizer but your implementation is better because of the idea for full vs partial filter push-down; in my version I have `predicate: &Option<Expr>`, but `filters: &[Expr]` should work as well;
   
   I think it makes sense to separate the generic support for predicate push-down to the data source from the implementation for various data sources such as parquet because each change will be fairly big so makes sense to split into smaller changes;
   
   regarding a parquet implementation of predicate push-down I have been working on the idea of building arrays from the min / max statistics in row groups and then reusing the existing physical expressions already implemented in datafusion; I already have the code that builds statistics arrays, next working on the expression evaluation - hopefully will have enough to start a PR in the next couple of weeks;
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#issuecomment-744675886


   Thank you @returnString ! This looks awesome -- I will try and review it carefully tomorrow.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] returnString commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
returnString commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r543504211



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        filters: &[Expr],

Review comment:
       That's an interesting idea! In my use case, the TableProvider impl interprets the filters synchronously after _asynchronously_ preparing the necessary data externally from DataFusion, but yeah, I could see that happening too.
   
   If we didn't make this change, I suppose implementations could still use these filters within the returned `ExecutionPlan`'s `execute` method for now to enable async use cases?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#issuecomment-745599394


   > #8910 removes the provider/named-based reference distinction entirely so I can rebase this once that's merged and add an extra test using an ordinary sql statement, rather than just a ctx.read_table(provider) call.
   
   https://github.com/apache/arrow/pull/8910 is merged


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org