Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/28 12:30:27 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue #1694: Public Expr simplification API

alamb opened a new issue #1694:
URL: https://github.com/apache/arrow-datafusion/issues/1694


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   In IOx each table is broken up logically into chunks (like row groups in parquet files), but the chunks might be missing some columns and each chunk has its own statistics.
   
   When predicates are applied to scan / filter these chunks, they may refer to any column of the table. If a chunk is missing a column (or we know from statistics that it contains no nulls), expressions like `col IS NULL` and `col IS NOT NULL` can be replaced with `true` or `false`, and predicates like `col > 5` can be replaced with `null > 5` in some cases.
   
   Once this substitution is done, that may allow additional simplification of the predicate -- ideally all the way down to `true` or `false` 
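   
   The substitute-then-simplify idea above can be sketched roughly as follows. This is a minimal illustration, not DataFusion code: the toy `Expr` enum and the `substitute_missing` and `simplify` functions are hypothetical names invented for this example, and only the missing-column case is covered.
   
   ```rust
   use std::collections::HashSet;
   
   /// Toy expression type for illustration only; NOT DataFusion's `Expr`.
   #[derive(Debug, Clone, PartialEq)]
   enum Expr {
       Column(String),
       Null,
       Bool(bool),
       IsNull(Box<Expr>),
       IsNotNull(Box<Expr>),
       And(Box<Expr>, Box<Expr>),
   }
   
   /// Replace references to columns the chunk does not contain with NULL.
   fn substitute_missing(expr: Expr, present: &HashSet<&str>) -> Expr {
       match expr {
           Expr::Column(name) if !present.contains(name.as_str()) => Expr::Null,
           Expr::IsNull(e) => Expr::IsNull(Box::new(substitute_missing(*e, present))),
           Expr::IsNotNull(e) => Expr::IsNotNull(Box::new(substitute_missing(*e, present))),
           Expr::And(l, r) => Expr::And(
               Box::new(substitute_missing(*l, present)),
               Box::new(substitute_missing(*r, present)),
           ),
           other => other,
       }
   }
   
   /// Fold the expression bottom-up, ideally all the way to `true` or `false`.
   fn simplify(expr: Expr) -> Expr {
       match expr {
           Expr::IsNull(e) => match simplify(*e) {
               Expr::Null => Expr::Bool(true),
               Expr::Bool(_) => Expr::Bool(false),
               other => Expr::IsNull(Box::new(other)),
           },
           Expr::IsNotNull(e) => match simplify(*e) {
               Expr::Null => Expr::Bool(false),
               Expr::Bool(_) => Expr::Bool(true),
               other => Expr::IsNotNull(Box::new(other)),
           },
           Expr::And(l, r) => match (simplify(*l), simplify(*r)) {
               // `false AND x` is false even when x is NULL
               (Expr::Bool(false), _) | (_, Expr::Bool(false)) => Expr::Bool(false),
               // `true AND x` is just x
               (Expr::Bool(true), other) | (other, Expr::Bool(true)) => other,
               (l, r) => Expr::And(Box::new(l), Box::new(r)),
           },
           other => other,
       }
   }
   ```
   
   With a chunk that has column `b` but not `a`, the predicate `a IS NOT NULL AND b IS NOT NULL` folds to `false` under this sketch, so the chunk could be skipped, while `a IS NULL AND b IS NOT NULL` reduces to `b IS NOT NULL`.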
   
   One particular expression of this kind that we will use in IOx maps `null` to a `''` value, like this:
   
   ```sql
   CASE 
     WHEN col is NULL THEN '' 
     ELSE col 
   END
   ```
   
   The same general pattern likely holds for ParquetExec now that @thinkharderdev has added support to merge schemas for multiple files in https://github.com/apache/arrow-datafusion/pull/1622. Once DataFusion is able to push predicates down into the parquet scans, simplifying the predicates as much as possible beforehand would be ideal.
   
   The current API in https://github.com/apache/arrow-datafusion/blob/03075d5f4b3fdfd8f82144fcd409418832a4bf69/datafusion/src/optimizer/simplify_expressions.rs is 
   1. Private
   2. Requires `ExecutionProps` which is fairly entangled with the overall machinery of how plans are executed (and means we see issues like #1690 )
   
   **Describe the solution you'd like**
   I would like DataFusion to have a public API for simplifying expressions. The proposed API looks like
   
   ```rust
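   /// Describes what expression evaluation requires
   pub trait ExprEvalContext {
   }
   
   impl Expr {
     /// Returns a simplified version of this expression
     pub fn simplify(self, _context: &dyn ExprEvalContext) -> Self {
       todo!("simplification logic")
     }
   }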
   ```
   
   I am thinking of `ExprEvalContext` as a trait so that it is clear what expression evaluation actually requires, and so that `Expr`s can be simplified prior to execution or in the bowels of DataFusion's planner (and I will implement it for `ExecutionProps`).
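   
   To make the trait-based shape a bit more concrete, here is a rough sketch under stated assumptions. The issue leaves the trait body empty, so the `nullable` method, the `ChunkStats` struct, and the toy `Expr` enum below are all hypothetical and chosen only for illustration; none of this is the actual DataFusion API.
   
   ```rust
   /// Toy expression type for illustration only; NOT DataFusion's `Expr`.
   #[derive(Debug, Clone, PartialEq)]
   enum Expr {
       Column(String),
       Bool(bool),
       IsNull(Box<Expr>),
       IsNotNull(Box<Expr>),
   }
   
   /// Hypothetical context. The single method is an assumption about the
   /// kind of information simplification might need.
   trait ExprEvalContext {
       /// Returns false when the column is known to contain no nulls.
       fn nullable(&self, column: &str) -> bool;
   }
   
   impl Expr {
       /// Simplify using only what the context exposes.
       fn simplify(self, ctx: &dyn ExprEvalContext) -> Self {
           match self {
               Expr::IsNull(inner) => match (*inner).simplify(ctx) {
                   Expr::Column(name) if !ctx.nullable(&name) => Expr::Bool(false),
                   other => Expr::IsNull(Box::new(other)),
               },
               Expr::IsNotNull(inner) => match (*inner).simplify(ctx) {
                   Expr::Column(name) if !ctx.nullable(&name) => Expr::Bool(true),
                   other => Expr::IsNotNull(Box::new(other)),
               },
               other => other,
           }
       }
   }
   
   /// Hypothetical context backed by per-chunk statistics.
   struct ChunkStats {
       non_null_columns: Vec<String>,
   }
   
   impl ExprEvalContext for ChunkStats {
       fn nullable(&self, column: &str) -> bool {
           !self.non_null_columns.iter().any(|c| c.as_str() == column)
       }
   }
   ```
   
   Because `simplify` only sees `&dyn ExprEvalContext`, a caller such as IOx could pass its own per-chunk context, and an execution-time structure could implement the same trait, without either depending on the rest of the planning machinery.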
   
   **Describe alternatives you've considered**
   I am not fully sure about the API design -- I'll know more when I sketch one out
   
   **Additional context**
   https://github.com/apache/arrow-datafusion/issues/1693 
   https://github.com/influxdata/influxdb_iox/pull/3557


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb closed issue #1694: Public Expr simplification API

Posted by GitBox <gi...@apache.org>.
alamb closed issue #1694:
URL: https://github.com/apache/arrow-datafusion/issues/1694