You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/15 13:41:23 UTC

[GitHub] [arrow-datafusion] jackwener opened a new issue #2015: Eliminate some filter which was pushed down

jackwener opened a new issue #2015:
URL: https://github.com/apache/arrow-datafusion/issues/2015


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   I found there are some filter was pushed down, but it's still in the logicplan.
   
   ```sql
    explain select c1, c2 from test where c3 = true and c2 = 0.000001;
   
   +---------------+-------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                                                                |
   +---------------+-------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Projection: #test.c1, #test.c2                                                                                                      |
   |               |   Filter: #test.c3                                                                                                                  |
   |               |     Filter: #test.c2 = Float64(0.000001)                                                                                            |
   |               |       TableScan: test projection=Some([0, 1, 2]), filters=[#test.c3, #test.c2 = Float64(0.000001)]                                  |
   | physical_plan | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                                                                                       |
   |               |   CoalesceBatchesExec: target_batch_size=4096                                                                                       |
   |               |     FilterExec: c3@2                                                                                                                |
   |               |       CoalesceBatchesExec: target_batch_size=4096                                                                                   |
   |               |         FilterExec: c2@1 = 0.000001                                                                                                 |
   |               |           RepartitionExec: partitioning=RoundRobinBatch(8)                                                                          |
   |               |             CsvExec: files=[/home/jakevin/code/arrow-datafusion/datafusion/tests/aggregate_simple.csv], has_header=true, limit=None |
   |               |                                                                                                                                     |
   +---------------+-------------------------------------------------------------------------------------------------------------------------------------+
   ```
   
   ` Filter: #test.c2 = Float64(0.000001)`   is included in the `TableScan`
   
   Maybe we can eliminate it. How do you think about it ? @alamb @Dandandan 
   
   **Describe the solution you'd like**
   Correct the `push down filter` in the optimizer
   
   **Describe alternatives you've considered**
   
   **Additional context**
   In PG
   ```sql
   create table scientist (id integer, firstname varchar(100), lastname varchar(100));
   insert into scientist (id, firstname, lastname) values (1, 'albert', 'einstein');
   insert into scientist (id, firstname, lastname) values (2, 'isaac', 'newton');
   insert into scientist (id, firstname, lastname) values (3, 'marie', 'curie');
   explain select id, firstname from scientist where lastname = 'einstein' and id > 2;
   ```
   
   ```
    explain select id, firstname from scientist where lastname = 'einstein' and id > 2;
                              QUERY PLAN
   ----------------------------------------------------------------
    Seq Scan on scientist  (cost=0.00..12.55 rows=1 width=222)
      Filter: ((id > 2) AND ((lastname)::text = 'einstein'::text))
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on issue #2015: Eliminate some filter which was pushed down

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #2015:
URL: https://github.com/apache/arrow-datafusion/issues/2015#issuecomment-1068016853


   > **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   > I found there are some filter was pushed down, but it's still in the logicplan.
   > 
   > ```sql
   >  explain select c1, c2 from test where c3 = true and c2 = 0.000001;
   > 
   > +---------------+-------------------------------------------------------------------------------------------------------------------------------------+
   > | plan_type     | plan                                                                                                                                |
   > +---------------+-------------------------------------------------------------------------------------------------------------------------------------+
   > | logical_plan  | Projection: #test.c1, #test.c2                                                                                                      |
   > |               |   Filter: #test.c3                                                                                                                  |
   > |               |     Filter: #test.c2 = Float64(0.000001)                                                                                            |
   > |               |       TableScan: test projection=Some([0, 1, 2]), filters=[#test.c3, #test.c2 = Float64(0.000001)]                                  |
   > | physical_plan | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                                                                                       |
   > |               |   CoalesceBatchesExec: target_batch_size=4096                                                                                       |
   > |               |     FilterExec: c3@2                                                                                                                |
   > |               |       CoalesceBatchesExec: target_batch_size=4096                                                                                   |
   > |               |         FilterExec: c2@1 = 0.000001                                                                                                 |
   > |               |           RepartitionExec: partitioning=RoundRobinBatch(8)                                                                          |
   > |               |             CsvExec: files=[/home/jakevin/code/arrow-datafusion/datafusion/tests/aggregate_simple.csv], has_header=true, limit=None |
   > |               |                                                                                                                                     |
   > +---------------+-------------------------------------------------------------------------------------------------------------------------------------+
   > ```
   > 
   > ` Filter: #test.c2 = Float64(0.000001)`   is included in the `TableScan`
   > 
   > Maybe we can eliminate it. How do you think about it ? @alamb @Dandandan 
   > 
   > **Describe the solution you'd like**
   > Correct the `push down filter` in the optimizer
   > 
   > **Describe alternatives you've considered**
   > 
   > **Additional context**
   > In PG
   > ```sql
   > create table scientist (id integer, firstname varchar(100), lastname varchar(100));
   > insert into scientist (id, firstname, lastname) values (1, 'albert', 'einstein');
   > insert into scientist (id, firstname, lastname) values (2, 'isaac', 'newton');
   > insert into scientist (id, firstname, lastname) values (3, 'marie', 'curie');
   > explain select id, firstname from scientist where lastname = 'einstein' and id > 2;
   > ```
   > 
   > ```
   >  explain select id, firstname from scientist where lastname = 'einstein' and id > 2;
   >                            QUERY PLAN
   > ----------------------------------------------------------------
   >  Seq Scan on scientist  (cost=0.00..12.55 rows=1 width=222)
   >    Filter: ((id > 2) AND ((lastname)::text = 'einstein'::text))
   > ```
   
   Pushing down the filter to table scan doesn't mean the filter can be removed from the plan. For example, parquet data doesn't support exact filtering: it returns more rows than those matching the predicates based onstatistics in the file. Another filter needs to filter out those rows not matching the predicate.
   
   PostgreSQL has a different way of scanning data: it will read rows one by one and apply the filter directly to the rows.
   
   What can be done I think in terms of optimizing is combining the two filters in one `FilterExec` (`FilterExec: c2@1 = 0.000001 AND c3@2`).
    I guess that could be in some cases faster than having two filters.
   
   Another optimization could be to apply the most selective filter expressjon first and/or the most expensive expression (such as a complex regex or case-when expression) last.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jackwener closed issue #2015: Eliminate some filter which was pushed down

Posted by GitBox <gi...@apache.org>.
jackwener closed issue #2015:
URL: https://github.com/apache/arrow-datafusion/issues/2015


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Ted-Jiang commented on issue #2015: Eliminate some filter which was pushed down

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on issue #2015:
URL: https://github.com/apache/arrow-datafusion/issues/2015#issuecomment-1069078448


   > I wonder if we could/should change the DataFusion explain plans to make it clearer that the predicate / filter in the scan may be inexact / partially applied
   > 
   > In other words, if `TableProvider:: supports_filter_pushdown` returns `TableProviderFilterPushDown::Inexact`
   > 
   > https://github.com/apache/arrow-datafusion/blob/d31c1579fdc2ad060bc46c4fcfef14cc7676da6b/datafusion/src/datasource/datasource.rs#L116-L123
   > 
   > https://github.com/apache/arrow-datafusion/blob/d31c1579fdc2ad060bc46c4fcfef14cc7676da6b/datafusion/src/datasource/datasource.rs#L54-L67
   
   I will have a try  😊  and familiarize the explain path.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #2015: Eliminate some filter which was pushed down

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2015:
URL: https://github.com/apache/arrow-datafusion/issues/2015#issuecomment-1068820110


   I wonder if we could/should change the DataFusion explain plans to make it clearer that the predicate / filter in the scan may be inexact / partially applied
   
   In other words, if `TableProvider:: supports_filter_pushdown` returns `TableProviderFilterPushDown::Inexact`
   
   https://github.com/apache/arrow-datafusion/blob/d31c1579fdc2ad060bc46c4fcfef14cc7676da6b/datafusion/src/datasource/datasource.rs#L116-L123
   
   https://github.com/apache/arrow-datafusion/blob/d31c1579fdc2ad060bc46c4fcfef14cc7676da6b/datafusion/src/datasource/datasource.rs#L54-L67


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan edited a comment on issue #2015: Eliminate some filter which was pushed down

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on issue #2015:
URL: https://github.com/apache/arrow-datafusion/issues/2015#issuecomment-1068016853


   > **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   > I found there are some filter was pushed down, but it's still in the logicplan.
   > 
   > ```sql
   >  explain select c1, c2 from test where c3 = true and c2 = 0.000001;
   > 
   > +---------------+-------------------------------------------------------------------------------------------------------------------------------------+
   > | plan_type     | plan                                                                                                                                |
   > +---------------+-------------------------------------------------------------------------------------------------------------------------------------+
   > | logical_plan  | Projection: #test.c1, #test.c2                                                                                                      |
   > |               |   Filter: #test.c3                                                                                                                  |
   > |               |     Filter: #test.c2 = Float64(0.000001)                                                                                            |
   > |               |       TableScan: test projection=Some([0, 1, 2]), filters=[#test.c3, #test.c2 = Float64(0.000001)]                                  |
   > | physical_plan | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                                                                                       |
   > |               |   CoalesceBatchesExec: target_batch_size=4096                                                                                       |
   > |               |     FilterExec: c3@2                                                                                                                |
   > |               |       CoalesceBatchesExec: target_batch_size=4096                                                                                   |
   > |               |         FilterExec: c2@1 = 0.000001                                                                                                 |
   > |               |           RepartitionExec: partitioning=RoundRobinBatch(8)                                                                          |
   > |               |             CsvExec: files=[/home/jakevin/code/arrow-datafusion/datafusion/tests/aggregate_simple.csv], has_header=true, limit=None |
   > |               |                                                                                                                                     |
   > +---------------+-------------------------------------------------------------------------------------------------------------------------------------+
   > ```
   > 
   > ` Filter: #test.c2 = Float64(0.000001)`   is included in the `TableScan`
   > 
   > Maybe we can eliminate it. How do you think about it ? @alamb @Dandandan 
   > 
   > **Describe the solution you'd like**
   > Correct the `push down filter` in the optimizer
   > 
   > **Describe alternatives you've considered**
   > 
   > **Additional context**
   > In PG
   > ```sql
   > create table scientist (id integer, firstname varchar(100), lastname varchar(100));
   > insert into scientist (id, firstname, lastname) values (1, 'albert', 'einstein');
   > insert into scientist (id, firstname, lastname) values (2, 'isaac', 'newton');
   > insert into scientist (id, firstname, lastname) values (3, 'marie', 'curie');
   > explain select id, firstname from scientist where lastname = 'einstein' and id > 2;
   > ```
   > 
   > ```
   >  explain select id, firstname from scientist where lastname = 'einstein' and id > 2;
   >                            QUERY PLAN
   > ----------------------------------------------------------------
   >  Seq Scan on scientist  (cost=0.00..12.55 rows=1 width=222)
   >    Filter: ((id > 2) AND ((lastname)::text = 'einstein'::text))
   > ```
   
   Pushing down the filter to table scan doesn't mean the filter can be removed from the plan. For example, parquet data doesn't support exact filtering: it returns more rows than those matching the predicates based onstatistics in the file. Another filter needs to filter out those rows not matching the predicate.
   
   PostgreSQL has a different way of scanning data: it will read rows one by one and apply the filter directly to the rows.
   
   What can be done I think in terms of optimizing is combining the two filters in one `FilterExec`: (`FilterExec: c2@1 = 0.000001 AND c3@2`).
    I guess that could be in some cases faster than having two filters plus a coalescebatches.
   
   Another optimization could be to apply the most selective filter expressjon first and/or the most expensive expression (such as a complex regex or case-when expression) last.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #2015: Eliminate some filter which was pushed down

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2015:
URL: https://github.com/apache/arrow-datafusion/issues/2015#issuecomment-1069080124


   Filed https://github.com/apache/arrow-datafusion/issues/2021 to track -- thanks @Ted-Jiang !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan edited a comment on issue #2015: Eliminate some filter which was pushed down

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on issue #2015:
URL: https://github.com/apache/arrow-datafusion/issues/2015#issuecomment-1068016853


   > **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   > I found there are some filter was pushed down, but it's still in the logicplan.
   > 
   > ```sql
   >  explain select c1, c2 from test where c3 = true and c2 = 0.000001;
   > 
   > +---------------+-------------------------------------------------------------------------------------------------------------------------------------+
   > | plan_type     | plan                                                                                                                                |
   > +---------------+-------------------------------------------------------------------------------------------------------------------------------------+
   > | logical_plan  | Projection: #test.c1, #test.c2                                                                                                      |
   > |               |   Filter: #test.c3                                                                                                                  |
   > |               |     Filter: #test.c2 = Float64(0.000001)                                                                                            |
   > |               |       TableScan: test projection=Some([0, 1, 2]), filters=[#test.c3, #test.c2 = Float64(0.000001)]                                  |
   > | physical_plan | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                                                                                       |
   > |               |   CoalesceBatchesExec: target_batch_size=4096                                                                                       |
   > |               |     FilterExec: c3@2                                                                                                                |
   > |               |       CoalesceBatchesExec: target_batch_size=4096                                                                                   |
   > |               |         FilterExec: c2@1 = 0.000001                                                                                                 |
   > |               |           RepartitionExec: partitioning=RoundRobinBatch(8)                                                                          |
   > |               |             CsvExec: files=[/home/jakevin/code/arrow-datafusion/datafusion/tests/aggregate_simple.csv], has_header=true, limit=None |
   > |               |                                                                                                                                     |
   > +---------------+-------------------------------------------------------------------------------------------------------------------------------------+
   > ```
   > 
   > ` Filter: #test.c2 = Float64(0.000001)`   is included in the `TableScan`
   > 
   > Maybe we can eliminate it. How do you think about it ? @alamb @Dandandan 
   > 
   > **Describe the solution you'd like**
   > Correct the `push down filter` in the optimizer
   > 
   > **Describe alternatives you've considered**
   > 
   > **Additional context**
   > In PG
   > ```sql
   > create table scientist (id integer, firstname varchar(100), lastname varchar(100));
   > insert into scientist (id, firstname, lastname) values (1, 'albert', 'einstein');
   > insert into scientist (id, firstname, lastname) values (2, 'isaac', 'newton');
   > insert into scientist (id, firstname, lastname) values (3, 'marie', 'curie');
   > explain select id, firstname from scientist where lastname = 'einstein' and id > 2;
   > ```
   > 
   > ```
   >  explain select id, firstname from scientist where lastname = 'einstein' and id > 2;
   >                            QUERY PLAN
   > ----------------------------------------------------------------
   >  Seq Scan on scientist  (cost=0.00..12.55 rows=1 width=222)
   >    Filter: ((id > 2) AND ((lastname)::text = 'einstein'::text))
   > ```
   
   Pushing down the filter to table scan doesn't mean the filter can be removed from the plan. For example, parquet data doesn't support exact filtering: it returns more rows than those matching the predicates based onstatistics in the file. Another filter needs to filter out those rows not matching the predicate.
   
   PostgreSQL has a different way of scanning data: it will read rows one by one and apply the filter directly to the rows.
   
   What can be done I think in terms of optimizing is combining the two filters in one `FilterExec`: (`FilterExec: c2@1 = 0.000001 AND c3@2`).
    I guess that could be in some cases faster than having two filters.
   
   Another optimization could be to apply the most selective filter expressjon first and/or the most expensive expression (such as a complex regex or case-when expression) last.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on issue #2015: Eliminate some filter which was pushed down

Posted by GitBox <gi...@apache.org>.
xudong963 commented on issue #2015:
URL: https://github.com/apache/arrow-datafusion/issues/2015#issuecomment-1068022406


   I agree with @Dandandan. 
   
   Btw I noticed you have done some works in #2002 (sorry for not helping review). I want to point out if we can directly process the real redundant filter in the **filter push down** optimize rule. 
   Because as far as I remember, the filter push-down optimization rule has a data structure that stores the scalar exprs in the filter operator, and we can directly delete it in there.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jackwener commented on issue #2015: Eliminate some filter which was pushed down

Posted by GitBox <gi...@apache.org>.
jackwener commented on issue #2015:
URL: https://github.com/apache/arrow-datafusion/issues/2015#issuecomment-1068028891


   I got it. Thanks your reply. @xudong963 @Dandandan .
   
   >  combining the two filters 
   
   Yes, I'm try to add a `merge_adjacent_project` rule to enhance our optimizer. It was during my attempts that I discovered this problem.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org