You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/02 09:53:19 UTC

[GitHub] [arrow-datafusion] Dandandan opened a new issue #237: Extend & generalize constant folding / evaluation in logical optimizer

Dandandan opened a new issue #237:
URL: https://github.com/apache/arrow-datafusion/issues/237


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   The (logical) optimizer contains some support for folding (boolean) constants. This can help, especially with other optimization passes, to optimize queries. For example, `LIMIT (0 + 0)` could be optimized first to `LIMIT 0`, to enable eliminating the whole plan.
   
   We should try to extend this support to most datatypes & expressions.
   
   **Describe the solution you'd like**
   `Expr`s can already be evaluated against a `RecordBatch`, and there is code to evaluate scalar values without going through Arrow. To make sure that the constant evaluation is implemented correctly & the same as the evaluation code, we should be able to reuse the code from there.
   
   **Describe alternatives you've considered**
   Manually implement the constant folding support. Downside here is that we end up with two implementations, which has a higher maintenance burden.
   
   **Additional context**
   Not in scope: add it to physical optimizer too. Here it could help too, especially if we have support for partitions.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] pjmore commented on issue #237: Extend & generalize constant folding / evaluation in logical optimizer

Posted by GitBox <gi...@apache.org>.
pjmore commented on issue #237:
URL: https://github.com/apache/arrow-datafusion/issues/237#issuecomment-845477448


   So as far as evaluating literal expressions, such as ```abs(lit(12) - lit(3.5))``` goes another approach that can be taken is breaking out the core logic of pure physical expressions that don't read data from the ```RecordBatch``` from the evaluate function and putting it into a associated function that takes the ```ColumnarValue``` from sub expression evaluations and operates on that. For example for a ```BinaryExpr``` the signature would be:
     ``` fn evaluate_binop(left_value: ColumnarValue, op: &Operator, right_value: ColumnarValue, num_rows: usize)->Result<ColumnarValue>```
   
   The ```PhysicalExpr```'s evaluate function becomes:
   ``` 
   fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
           let left_value = self.left.evaluate(batch)?;
           let right_value = self.right.evaluate(batch)?;
           BinaryExpr::evaluate_binop(left_value, &self.op, right_value, batch.num_rows())
   }
   ```
   The split allows ```Expr```s that are made up of literal values to be evaluated directly using mostly the same logic.  There is still  some implementation divergence, mainly due to expressions that are inserted during the creation of a physical expression, such as ```BinaryExpr``` inserting a ```TryCastExpr``` to coerce types. I've implemented a proof of concept [here](https://github.com/pjmore/arrow-datafusion/tree/master/datafusion). Word of warning, the organization of the code and the expression rewriting are both quite messy, and I ignored all expression renaming issues.  There is also potentially an issue with how I handled array return values, which was to convert them to a scalar if it has one element and a scalar list otherwise.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #237: Extend & generalize constant folding / evaluation in logical optimizer

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #237:
URL: https://github.com/apache/arrow-datafusion/issues/237#issuecomment-844329264


   FYI while I was reviewing the code in https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/parquet.rs in the context of https://github.com/apache/arrow-datafusion/issues/363 I noticed there is already a way to do "partial evaluation" for expressions -- maybe we could fake the same to evaluate `Exprs` that have no inputs --  pass in a 1 row null array as input or something. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb edited a comment on issue #237: Extend & generalize constant folding / evaluation in logical optimizer

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on issue #237:
URL: https://github.com/apache/arrow-datafusion/issues/237#issuecomment-939930047


   Possibly related to https://github.com/apache/arrow-datafusion/issues/1070


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on issue #237: Extend & generalize constant folding / evaluation in logical optimizer

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #237:
URL: https://github.com/apache/arrow-datafusion/issues/237#issuecomment-833836462


   It is a problem already in the current constant folding! I am opening an issue for this.
   
   ```
   > SELECT TRUE = TRUE;
   +---------------+
   | Boolean(true) |
   +---------------+
   | true          |
   +---------------+
   1 rows in set. Query took 0 seconds.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #237: Extend & generalize constant folding / evaluation in logical optimizer

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #237:
URL: https://github.com/apache/arrow-datafusion/issues/237#issuecomment-834695012


   @Dandandan 
   
   > @alamb just picking your brain here - do you think this should be part of the logical optimizations or physical optimizations?
   I think the suggested route (re-use evaluation code) is only feasible for the physical optimization, not the logical optimization rules.
   
   I would imagine this to be done on `Exprs`, not `PhysicalExprs` to allow the rewritten expressions to be used as much as possible by other optimization passes (e.g. filter and projection pushdown, which is done at the  `LogicalPlan` level)
   
   > I am wondering here in general, whether we can/should unify LogicalPlan/PhysicalPlan Expr/PhysicalExpr a bit more in order to not have to write two versions of the same thing or being limited in the optimizations / optimization order.
   
   I think the LogicalPlan / PhysicalPlan distinction makes sense (b/c logically a Join is just a Join -- but physically maybe we would be using a CROSS JOIN w/ filter, or an Hash Inner Join, or a Merge Join, etc)
   
   I am not as sure about the distinction between `Expr` and `PhysicalExpr` -- I haven't looked carefully at the code to know what additional information a `PhysicalExpr` needs that an `Expr` doesn't have -- and you can make a `PhysicalExpr` from an `Expr` and a `Schema` [code link](https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/planner.rs#L431) 
   
   If we could directly evaluate `Exprs` without having to apply a transformation to them that would be pretty cool (and clean up a lot of duplication I think)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on issue #237: Extend & generalize constant folding / evaluation in logical optimizer

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #237:
URL: https://github.com/apache/arrow-datafusion/issues/237#issuecomment-833830342


   @jorgecarleitao that's a good one - I did also see something in the same order recently when looking at this https://github.com/apache/arrow-datafusion/pull/268 problem.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on issue #237: Extend & generalize constant folding / evaluation in logical optimizer

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #237:
URL: https://github.com/apache/arrow-datafusion/issues/237#issuecomment-833824075


   @alamb just picking your brain here - do you think this should be part of the logical optimizations or physical optimizations? 
   I think the suggested route (re-use evaluation code) is only feasible for the physical optimization, not the logical optimization rules.
   
   A way that could work within the current setup for `PhysicalExpr`:
   * Evaluation is implemented for `PhysicalExpr`s not `Exprs`. An empty `RecordBatch` could be used to call into the code and extract the scalar values (if any).
   
   This makes it a bit less useful (still useful nonetheless), as some other optimizations might benefit from constant folding
   
   I am wondering here in general, whether we can/should unify `LogicalPlan`/`PhysicalPlan` `Expr`/`PhysicalExpr` a bit more in order to not have to write two versions of the same thing or being limited in the optimizations / optimization order.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #237: Extend & generalize constant folding / evaluation in logical optimizer

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on issue #237:
URL: https://github.com/apache/arrow-datafusion/issues/237#issuecomment-833828498


   fwiw, when a logical optimization is applied, the expressions are re-written and the "column name" is consequently re-written. Thus, what was named `LIMIT (0 + 0)` becomes `LIMIT 0`.
   
   To apply it on the logical level, we may need to wrap the expression by an `.alias` for it to preserve the column name.
   
   I agree that the sooner in the optimization these are applied, the higher the likelihood of synergies between optimizers.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #237: Extend & generalize constant folding / evaluation in logical optimizer

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #237:
URL: https://github.com/apache/arrow-datafusion/issues/237#issuecomment-831533626


   This would be a neat feature. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] pjmore removed a comment on issue #237: Extend & generalize constant folding / evaluation in logical optimizer

Posted by GitBox <gi...@apache.org>.
pjmore removed a comment on issue #237:
URL: https://github.com/apache/arrow-datafusion/issues/237#issuecomment-845477448


   So as far as evaluating literal expressions, such as ```abs(lit(12) - lit(3.5))``` goes another approach that can be taken is breaking out the core logic of pure physical expressions that don't read data from the ```RecordBatch``` from the evaluate function and putting it into a associated function that takes the ```ColumnarValue``` from sub expression evaluations and operates on that. For example for a ```BinaryExpr``` the signature would be:
     ``` fn evaluate_binop(left_value: ColumnarValue, op: &Operator, right_value: ColumnarValue, num_rows: usize)->Result<ColumnarValue>```
   
   The ```PhysicalExpr```'s evaluate function becomes:
   ``` 
   fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
           let left_value = self.left.evaluate(batch)?;
           let right_value = self.right.evaluate(batch)?;
           BinaryExpr::evaluate_binop(left_value, &self.op, right_value, batch.num_rows())
   }
   ```
   The split allows ```Expr```s that are made up of literal values to be evaluated directly using mostly the same logic.  There is still  some implementation divergence, mainly due to expressions that are inserted during the creation of a physical expression, such as ```BinaryExpr``` inserting a ```TryCastExpr``` to coerce types. I've implemented a proof of concept [here](https://github.com/pjmore/arrow-datafusion/tree/master/datafusion). Word of warning, the organization of the code and the expression rewriting are both quite messy, and I ignored all expression renaming issues.  There is also potentially an issue with how I handled array return values, which was to convert them to a scalar if it has one element and a scalar list otherwise.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #237: Extend & generalize constant folding / evaluation in logical optimizer

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #237:
URL: https://github.com/apache/arrow-datafusion/issues/237#issuecomment-939930047


   Possibly related to https://github.com/apache/arrow-datafusion/issues/237


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org