You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "wjones127 (via GitHub)" <gi...@apache.org> on 2023/04/30 21:57:50 UTC

[GitHub] [arrow-datafusion] wjones127 opened a new issue, #6171: Implement Expr SimplifyWithGuarantee

wjones127 opened a new issue, #6171:
URL: https://github.com/apache/arrow-datafusion/issues/6171

   ### Is your feature request related to a problem or challenge?
   
   We are starting to look for more advanced methods for filter pushdown. I was starting to think of porting [SimplifyWithGuarantee](https://github.com/apache/arrow/blob/c95980f4bee406504a68ff3a3d39cd86c1d97ec0/cpp/src/arrow/compute/expression.h#L215-L221). The critical functionality we are looking for is being able to evaluate a predicate against some statistics and get the residual expression. For example, if I have the predicate `x = 1 AND y < 2`:
   
   * file 1 with stats `0 <= x <= 20` and `0 < y <= 1` => residual filter `x = 1` (`y < 2` is always satisfied) => scan this file with `x = 1` filter
   * file 2 with stats `3 < y <= 10` => residual filter `false` => don't scan this file since it will never satisfy the predicate
   
   ### Describe the solution you'd like
   
   I think a straightforward port of that function would be useful, but if there is a design that integrates better with existing functionality, I'm open to other designs.
   
   ### Describe alternatives you've considered
   
   It seems like the current solutions with `PruningPredicate` don't give you the residual expression.
   
   ### Additional context
   
   This is related to #5830


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ozankabak commented on issue #6171: Implement Expr SimplifyWithGuarantee

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on issue #6171:
URL: https://github.com/apache/arrow-datafusion/issues/6171#issuecomment-1530069105

   It will support all integral types (booleans, integers etc.), floats and dates/timestamps. Most of that support is already there. You can not do arithmetic with strings, so we haven't focused on those yet. But you can certainly analyze inequalities etc, and the code structure is amenable to that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] wjones127 commented on issue #6171: Implement Expr SimplifyWithGuarantee

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on issue #6171:
URL: https://github.com/apache/arrow-datafusion/issues/6171#issuecomment-1704411059

   > It would be amazing to incorporate bounds analysis into expr simplification, though I would like to request we don't use yet another representation of ranges / bounds
   
   I'm taking a look at this right now. Two issues I see right now:
   
   1. The `Interval` bounds doesn't include any information about nullability. I'd like to simplify expressions like `X IS NOT NULL` to `true` or `false` if null statistics support that simplification.
   2. The interval library operates on physical expressions, while the simplification operates on logical expressions.
   
   I'm working on a PR right now that will include a struct that is parallel to `Interval` for logical expressions that includes the null information. Once I have that working I'll see if it's worth consolidating that with the existing `Interval` struct.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] wjones127 commented on issue #6171: Implement Expr SimplifyWithGuarantee

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on issue #6171:
URL: https://github.com/apache/arrow-datafusion/issues/6171#issuecomment-1530012554

   Would the interval library support non-integer data types? Such as strings, booleans, dates, timestamps? When I was looking recently it seemed to mostly support integers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ozankabak commented on issue #6171: Implement Expr SimplifyWithGuarantee

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on issue #6171:
URL: https://github.com/apache/arrow-datafusion/issues/6171#issuecomment-1705069642

   I will also think about how we can add nullability support to the interval library without resulting in large changes or performance impacts. Moving it outside of physical-expr as a general purpose library is fairly trivial.
   
   Let's exchange ideas so we can support your use case with as little code/functionality duplication as possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] wjones127 commented on issue #6171: Implement Expr SimplifyWithGuarantee

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on issue #6171:
URL: https://github.com/apache/arrow-datafusion/issues/6171#issuecomment-1705896888

   > I will also think about how we can add nullability support to the interval library without resulting in large changes or performance impacts. Moving it outside of physical-expr as a general purpose library is fairly trivial.
   
   Could I extract the `Interval` struct into `datafusion_common`, add the nullability field, and then leave a note in the `cp_solver` module that nullability isn't handled yet?
   
   Here is the null tracking definition I created:
   
   https://github.com/apache/arrow-datafusion/blob/a6b57e38eb00da2a6c5396dca0b5f1772578ac78/datafusion/optimizer/src/simplify_expressions/guarantees.rs#L71-L84
   
   Does that seem good to you?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] wjones127 closed issue #6171: Implement Expr SimplifyWithGuarantee

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 closed issue #6171: Implement Expr SimplifyWithGuarantee
URL: https://github.com/apache/arrow-datafusion/issues/6171


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #6171: Implement Expr SimplifyWithGuarantee

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #6171:
URL: https://github.com/apache/arrow-datafusion/issues/6171#issuecomment-1529714108

   I think it would be really nice to integrate this idea with the range analysis that we already have in DataFusion - see https://github.com/apache/arrow-datafusion/issues/5535 (I think @ozankabak  and @mustafasrepo are thinking about this)
   
   It would be amazing to incorporate bounds analysis into expr simplification, though I would like to request we don't use yet another representation of ranges / bounds
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ozankabak commented on issue #6171: Implement Expr SimplifyWithGuarantee

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on issue #6171:
URL: https://github.com/apache/arrow-datafusion/issues/6171#issuecomment-1529838864

   This is totally doable with the interval library, we will be happy to help. BTW @alamb the interval/range unification is coming soon -- the interval library needs a couple more features and then we should be able to retire the old code


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] metesynnada commented on issue #6171: Implement Expr SimplifyWithGuarantee

Posted by "metesynnada (via GitHub)" <gi...@apache.org>.
metesynnada commented on issue #6171:
URL: https://github.com/apache/arrow-datafusion/issues/6171#issuecomment-1706117544

   Based on your requirements, it appears that you only need two intervals in a struct. One interval would be for range analysis, and the other would be a boolean interval for null status. By having (false, false) as the input, you could indicate that the value is never null. (false, true) would indicate that it may be null, and (true, true) would indicate that it is always null. 
   
   Therefore, there is no need to modify the interval library. Instead, you can create your own logic while still using the `cp_solver`'s helper functions. I also agree with your suggestion to move the interval library into `datafusion_common`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org