You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "avantgardnerio (via GitHub)" <gi...@apache.org> on 2023/02/21 19:40:19 UTC

[GitHub] [arrow-datafusion] avantgardnerio opened a new issue, #5357: `supports_filter_pushdown()` receives broken-up `filter`s

avantgardnerio opened a new issue, #5357:
URL: https://github.com/apache/arrow-datafusion/issues/5357

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   I have a `TableProvider` that supports multi-column indexes and could theoretically support filter pushdown, however when running the query: 
   
   ```
   SELECT s_quantity, s_data, s_dist_01, s_dist_02, s_dist_03, s_dist_04, s_dist_05, s_dist_06, s_dist_07, s_dist_08, s_dist_09, s_dist_10 FROM stock WHERE s_i_id = $1 AND s_w_id = $2 FOR UPDATE
   ```
   
   on the table
   
   ```
   create table stock (
   s_i_id int not null, 
   s_w_id smallint not null, 
   ...
   s_data varchar(50),
   PRIMARY KEY(s_w_id, s_i_id) ) ;
   ```
   
   I observe the following in my logs: 
   
   ```
   ArrowTableProvider::supports_filter_pushdown() proj=stock.s_i_id = Int32(3604)
   ArrowTableProvider::supports_filter_pushdown() proj=stock.s_w_id = Int32(1)
   ArrowTableProvider::supports_filter_pushdown() proj=stock.s_i_id = Int32(3604)
   ArrowTableProvider::supports_filter_pushdown() proj=stock.s_w_id = Int32(1)
   ```
   
   ```
       fn supports_filter_pushdown(&self, filter: &Expr) -> Result<TableProviderFilterPushDown, DataFusionError> {
           // TODO: parse expression and return "true" if it's a simple range on an index we have
           println!("ArrowTableProvider::supports_filter_pushdown() filter={filter:?}");
           Ok(TableProviderFilterPushDown::Unsupported)
       }
   ```
   
   So I would have to do complicated logic (append each portion of the predicate, always return "true", then "and" them back together during scan()) to properly apply this filter push down in my code.
   
   **Describe the solution you'd like**
   
   DataFusion would provide raw `Expr`s of the entire filter it is trying to push down, or - as a compromise, try first with the whole filter, then if the `TableProvider` returns `Unsupported` try again with each part as suggested by @andygrove .
   
   **Describe alternatives you've considered**
   
   - Try to reconstruct the original filter in my TableProvider asynchronously
   - Give up and don't use indexes
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio closed issue #5357: `supports_filter_pushdown()` receives broken-up `filter`s

Posted by "avantgardnerio (via GitHub)" <gi...@apache.org>.
avantgardnerio closed issue #5357: `supports_filter_pushdown()` receives broken-up `filter`s
URL: https://github.com/apache/arrow-datafusion/issues/5357


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on issue #5357: `supports_filter_pushdown()` receives broken-up `filter`s

Posted by "avantgardnerio (via GitHub)" <gi...@apache.org>.
avantgardnerio commented on issue #5357:
URL: https://github.com/apache/arrow-datafusion/issues/5357#issuecomment-1440678455

   An alternative approach might be something like: 
   
   ```
       fn supports_filter_pushdown(&self, filter: &Expr) -> Result<Expr, DataFusionError> {
           // reduce the input expression down to only the parts of the expression this TableProvider supports
       }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on issue #5357: `supports_filter_pushdown()` receives broken-up `filter`s

Posted by "avantgardnerio (via GitHub)" <gi...@apache.org>.
avantgardnerio commented on issue #5357:
URL: https://github.com/apache/arrow-datafusion/issues/5357#issuecomment-1439115607

   @alamb I guess the alternative is going full-bore on what we had talked about and having TableProviders expose something like:
   
   ```
   pub fn get_indexes(&self) -> Vec<Vec<(Column, SortOrder)>> {}
   ```
   
   And taking care of this in the planner / optimizer.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on issue #5357: `supports_filter_pushdown()` receives broken-up `filter`s

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on issue #5357:
URL: https://github.com/apache/arrow-datafusion/issues/5357#issuecomment-1439010562

   Just for my understanding, what kind of `TableProvider` are you implementing?
   If it's a relational database, I guess it would be best to just `AND` the expressions together and push the full filter down to the database and let it find out the index by itself? What would be the reason to push down certain parts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on issue #5357: `supports_filter_pushdown()` receives broken-up `filter`s

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on issue #5357:
URL: https://github.com/apache/arrow-datafusion/issues/5357#issuecomment-1442191620

   I like that latest suggestion


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on issue #5357: `supports_filter_pushdown()` receives broken-up `filter`s

Posted by "avantgardnerio (via GitHub)" <gi...@apache.org>.
avantgardnerio commented on issue #5357:
URL: https://github.com/apache/arrow-datafusion/issues/5357#issuecomment-1439066749

   Honestly, I think the fix might just be removing this line: https://github.com/apache/arrow-datafusion/blob/2a5ef432aef45793b60c2847552d6d597589d67d/datafusion/optimizer/src/push_down_filter.rs#L706
   
   @Dandandan @andygrove , thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on issue #5357: `supports_filter_pushdown()` receives broken-up `filter`s

Posted by "avantgardnerio (via GitHub)" <gi...@apache.org>.
avantgardnerio commented on issue #5357:
URL: https://github.com/apache/arrow-datafusion/issues/5357#issuecomment-1439101861

   @Dandandan what do you think about Andy's suggestion to:
   
   1. try with the whole filter - the TableProvider is free to say "no"
   2. if "no" split the conjunction and try with each part
   
   ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on issue #5357: `supports_filter_pushdown()` receives broken-up `filter`s

Posted by "andygrove (via GitHub)" <gi...@apache.org>.
andygrove commented on issue #5357:
URL: https://github.com/apache/arrow-datafusion/issues/5357#issuecomment-1443747066

   The API change seems reasonable to me too


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #5357: `supports_filter_pushdown()` receives broken-up `filter`s

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5357:
URL: https://github.com/apache/arrow-datafusion/issues/5357#issuecomment-1441721936

   @avantgardnerio  I wonder if your `TableProvider` could return `TableProviderFilterPushDown::Inexact` for all the filters
   * DataFusion will not rely on pushed down filters for correctness (there will still be a Filter after the scan)
   * The `TableProvider`  can then figure out what, if any, indexes can be used given the list of predicates it gets.
   * Any predicate that can't be used in the index is simply ignored
   
   https://docs.rs/datafusion/18.0.0/datafusion/datasource/datasource/enum.TableProviderFilterPushDown.html#variant.Inexact
   
   Not only does this seem (?) to handle the usecase in this ticket's description, it would also allow you to handle predicates where you only have indexes on part of the predicate. For example, instead of 
   
   ```sql
    WHERE s_i_id = $1 AND s_w_id = $2
   ```
   
   What if you had a predicate like 
   ```sql
    WHERE s_i_id = $1 AND s_w_id = $2 AND s_i_age > 21
   ```
   
   Or something? You still probably want to use the  multi-column index on `(s_i_id, s_w_id)`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on issue #5357: `supports_filter_pushdown()` receives broken-up `filter`s

Posted by "avantgardnerio (via GitHub)" <gi...@apache.org>.
avantgardnerio commented on issue #5357:
URL: https://github.com/apache/arrow-datafusion/issues/5357#issuecomment-1441974889

   > return TableProviderFilterPushDown::Inexact for all the filters
   
   I think this would work, but not perform optimally - the TableProvider would actually match exactly, but DataFusion would still have a (useless) `Filter` in the plan.
   
   I think I have a solution that I believe would work for all the cases. Simply change the existing signature of:
   
   ```
   fn supports_filter_pushdown( &self, filters: &Expr) -> Result<TableProviderFilterPushDown>
   ```
   
   to:
   
   ```
   fn supports_filter_pushdown( &self, filters: &[Expr]) -> Result<Vec<TableProviderFilterPushDown>>
   ```
   
   Then my TableProvider could see both `s_i_id` and `s_w_id` at the same time, find the appropriate index, and return `Exact` for both of those but `NotSupported` for `s_i_age`.
   
   It's barely a change from the existing signature, and we can even wrap it in an adapter method with a default impl that will keep existing TableProviders working by calling it iteratively, mark the old one `@Deprecated` and gracefully change it out over the next few releases.
   
   Thoughts @alamb ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on issue #5357: `supports_filter_pushdown()` receives broken-up `filter`s

Posted by "avantgardnerio (via GitHub)" <gi...@apache.org>.
avantgardnerio commented on issue #5357:
URL: https://github.com/apache/arrow-datafusion/issues/5357#issuecomment-1442341847

   > I like that latest suggestion
   
   done: https://github.com/apache/arrow-datafusion/pull/5367


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on issue #5357: `supports_filter_pushdown()` receives broken-up `filter`s

Posted by "avantgardnerio (via GitHub)" <gi...@apache.org>.
avantgardnerio commented on issue #5357:
URL: https://github.com/apache/arrow-datafusion/issues/5357#issuecomment-1439060639

   > what kind of TableProvider are you implementing? .. If it's a relational database
   yes, it is
   
   > push the full filter down to the database
   replace "database" with TableProvider and I agree with you there
   
   > What would be the reason to push down certain parts?
   Only that this is how it happens now. I assume others are attached to the present behavior? It seems way more simple to have the TableProvider split the conjunction (if it wants) than have it re-assemble the original filter from a bunch of separate calls.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on issue #5357: `supports_filter_pushdown()` receives broken-up `filter`s

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on issue #5357:
URL: https://github.com/apache/arrow-datafusion/issues/5357#issuecomment-1439072041

   I think the idea of splitting the filter is that a tableprovider might support certain expressions/functions but not others. This way it can be communicated, so filter push down can push down only certain expressions. If you would pass the complete expression, it only could communicate this for the entire filter.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org