You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/24 13:20:07 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue #410: Remove reundant filters (e.g. `c> 5 AND c>5` --> `c>5`)

alamb opened a new issue #410:
URL: https://github.com/apache/arrow-datafusion/issues/410


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   Sometimes something about how a query is constructed (e.g. via an automated tool or rewrite) results in  redundant filters (aka the same exact predicate twice). Datafusion could remove that redundancy
   
   Reproducer:
   ```
   echo "1" > /tmp/foo.csv
   cargo run -p datafusion-cli
   ```
   
   And then in the datafusion CLI:
   ```
   CREATE EXTERNAL TABLE foo(ts int)
   STORED AS CSV
   LOCATION '/tmp/foo.csv';
   
   explain verbose select * from foo where ts>5 AND ts>5;
   ```
   
   
   On master you can still see both predicates are still present:
   ```
   |                                         |   FilterExec: CAST(ts AS Int64) > 5 AND CAST(ts AS Int64) > 5            |
   ```
   
   Here is the entire plan:
   
   ```
   > explain verbose select * from foo where ts>5 AND ts>5;
   +-----------------------------------------+--------------------------------------------------------------------------+
   | plan_type                               | plan                                                                     |
   +-----------------------------------------+--------------------------------------------------------------------------+
   | logical_plan                            | Projection: #ts                                                          |
   |                                         |   Filter: #ts Gt Int64(5) And #ts Gt Int64(5)                            |
   |                                         |     TableScan: foo projection=None                                       |
   | logical_plan after projection_push_down | Projection: #ts                                                          |
   |                                         |   Filter: #ts Gt Int64(5) And #ts Gt Int64(5)                            |
   |                                         |     TableScan: foo projection=Some([0])                                  |
   | logical_plan after projection_push_down | Projection: #ts                                                          |
   |                                         |   Filter: #ts Gt Int64(5) And #ts Gt Int64(5)                            |
   |                                         |     TableScan: foo projection=Some([0])                                  |
   | physical_plan                           | ProjectionExec: expr=[ts]                                                |
   |                                         |   FilterExec: CAST(ts AS Int64) > 5 AND CAST(ts AS Int64) > 5            |
   |                                         |     CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false |
   +-----------------------------------------+--------------------------------------------------------------------------+
   ```
   
   
   
   **Describe the solution you'd like**
   I propose updating the filter pushdown rule we already have to remove redundant filters at the same time
   
   **Describe alternatives you've considered**
   We could also potentially add an entirely separate optimizer rule to remove the redundant filters
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan closed issue #410: Remove reundant filters (e.g. `c> 5 AND c>5` --> `c>5`)

Posted by GitBox <gi...@apache.org>.
Dandandan closed issue #410:
URL: https://github.com/apache/arrow-datafusion/issues/410


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jgoday commented on issue #410: Remove reundant filters (e.g. `c> 5 AND c>5` --> `c>5`)

Posted by GitBox <gi...@apache.org>.
jgoday commented on issue #410:
URL: https://github.com/apache/arrow-datafusion/issues/410#issuecomment-850603248


   @alamb I've just created the PR (#436), will try to fix unit tests as you mentioned.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jgoday commented on issue #410: Remove reundant filters (e.g. `c> 5 AND c>5` --> `c>5`)

Posted by GitBox <gi...@apache.org>.
jgoday commented on issue #410:
URL: https://github.com/apache/arrow-datafusion/issues/410#issuecomment-847310247


   @alamb Can I try this ?
   I was doing some experiments implementing a remove_duplicate_filters that transforms an Exp::BinaryExpr, if both left and right are equal, into only one of his inner expressions (a>5 AND a>5 => a>5).
   Could this be a valid solution ?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #410: Remove reundant filters (e.g. `c> 5 AND c>5` --> `c>5`)

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #410:
URL: https://github.com/apache/arrow-datafusion/issues/410#issuecomment-847318273


   Sure -- sounds great @jgoday 
   
   I think the transformation is valid for predicates connected by `AND` at any "depth"
   
   so like I would hope the transformation would work for both the following:
   ```
   (c > 5) AND (c > 5) --> (c > 5)
   ((c > 5) AND (d < 6)) AND (c > 5) --> (c > 5) AND (d < 6)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan closed issue #410: Remove reundant filters (e.g. `c> 5 AND c>5` --> `c>5`)

Posted by GitBox <gi...@apache.org>.
Dandandan closed issue #410:
URL: https://github.com/apache/arrow-datafusion/issues/410


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jgoday commented on issue #410: Remove reundant filters (e.g. `c> 5 AND c>5` --> `c>5`)

Posted by GitBox <gi...@apache.org>.
jgoday commented on issue #410:
URL: https://github.com/apache/arrow-datafusion/issues/410#issuecomment-849907865


   @alamb Would this be a valid implementation [#410](https://github.com/jgoday/arrow-datafusion/commit/37423dc2968e9e06b9bbd73331cec8cc7c3fae40) ?
   RemoveDuplicateFilters uses some kind of boolean algebra laws to simplify expressions.
   
   Of course, more tests and complete the filter doc is still pending.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #410: Remove reundant filters (e.g. `c> 5 AND c>5` --> `c>5`)

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #410:
URL: https://github.com/apache/arrow-datafusion/issues/410#issuecomment-850339651


   > @alamb Would this be a valid implementation #410 ?
   
   @jgoday  it looks like a good start to me. I'll leave some comments on the PR


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org