Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/10/20 16:23:42 UTC

[I] Consolidate interval analyses from `Interval` and `PruningPredicate` [arrow-datafusion]

alamb opened a new issue, #7887:
URL: https://github.com/apache/arrow-datafusion/issues/7887

   ### Is your feature request related to a problem or challenge?
   
   We now have two ways to do range / interval analysis in DataFusion. 
   
   # `Interval`-based analysis
   The [`Interval`](https://docs.rs/datafusion/32.0.0/datafusion/physical_expr/intervals/struct.Interval.html) library is used for cardinality estimation and has 
   
   # Pruning Predicate
   The [`PruningPredicate`](https://docs.rs/datafusion/32.0.0/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html) is used to prune row groups based on min/max values.
   
   
   
   
   
   ### Describe the solution you'd like
   
   I would like one interval analysis library (probably based on `Interval`) 
   
   Having two representations is challenging because:
   
   1. Interval analysis has a natural story for handling arbitrary predicates, while `PruningPredicate` does not, because it is implemented as a rewrite
   2. The types of expressions handled are different (to add support for `LIKE` we would have to change both `PruningPredicate` and `Interval`)
   3. There is no way to combine the BloomFilter support added in https://github.com/apache/arrow-datafusion/pull/7821 with statistics-based row group pruning (so it can't handle predicates like `col_a < 5 or col_b = <id>` if we had stats for `col_a` but a bloom filter for `col_b`)
   4. The `PruningPredicate` evaluation is vectorized, so it works well for 1000s of row groups
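
   To make point 3 concrete, here is a minimal sketch of how a three-valued result could combine min/max statistics with a bloom filter probe for `col_a < 5 or col_b = <id>`. The names `Truth`, `MinMax`, `lt`, `eq_via_bloom`, and `can_prune` are hypothetical and are not DataFusion APIs; nulls are ignored and the columns are assumed to be integers.

   ```rust
   /// Outcome of evaluating a predicate against summary information.
   #[derive(Debug, Clone, Copy, PartialEq, Eq)]
   enum Truth {
       CertainlyFalse,
       Uncertain,
       CertainlyTrue,
   }

   impl Truth {
       /// Three-valued OR: false only when both sides are certainly false.
       fn or(self, other: Truth) -> Truth {
           use Truth::*;
           match (self, other) {
               (CertainlyTrue, _) | (_, CertainlyTrue) => CertainlyTrue,
               (CertainlyFalse, CertainlyFalse) => CertainlyFalse,
               _ => Uncertain,
           }
       }
   }

   /// Hypothetical min/max statistics of one column in one row group.
   struct MinMax {
       min: i64,
       max: i64,
   }

   /// `col < literal`, judged from the column's [min, max] range.
   fn lt(stats: &MinMax, literal: i64) -> Truth {
       if stats.max < literal {
           Truth::CertainlyTrue
       } else if stats.min >= literal {
           Truth::CertainlyFalse
       } else {
           Truth::Uncertain
       }
   }

   /// `col = <id>`, judged from a bloom filter probe.
   /// A bloom filter can prove absence but never presence.
   fn eq_via_bloom(might_contain: bool) -> Truth {
       if might_contain {
           Truth::Uncertain
       } else {
           Truth::CertainlyFalse
       }
   }

   /// A row group can be skipped only if `col_a < 5 OR col_b = <id>`
   /// is certainly false for every row in it.
   fn can_prune(col_a: &MinMax, col_b_might_contain_id: bool) -> bool {
       lt(col_a, 5).or(eq_via_bloom(col_b_might_contain_id)) == Truth::CertainlyFalse
   }
   ```

   With this shape, a row group whose `col_a` range is `[10, 20]` can be skipped when the bloom filter rules out `<id>`, and must be kept when either side is uncertain.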
   
   ### Describe alternatives you've considered
   
   I propose unifying the range analysis on `Interval` and implementing `PruningPredicate` in terms of `Interval`. Here is an example of doing so for bloom filters, and I think we could extend the pattern to `PruningPredicate`: https://github.com/alamb/arrow-datafusion/pull/14
   
   Doing so would likely require extending the interval analysis arithmetic to support more operators (like `IN` lists).
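
   As a rough illustration of what `IN` list support over intervals might look like, the sketch below evaluates `col IN (v1, v2, ...)` against a closed integer range. `Truth`, `ClosedInterval`, and `in_list` are hypothetical names used only for this example, not part of the existing `Interval` library, and nulls are ignored.

   ```rust
   /// Outcome of evaluating a predicate against a range of possible values.
   #[derive(Debug, Clone, Copy, PartialEq, Eq)]
   enum Truth {
       CertainlyFalse,
       Uncertain,
       CertainlyTrue,
   }

   /// Hypothetical closed interval [lower, upper] of values a column may take.
   struct ClosedInterval {
       lower: i64,
       upper: i64,
   }

   /// Evaluate `col IN (list)` given only the range of `col`.
   fn in_list(col: &ClosedInterval, list: &[i64]) -> Truth {
       // Does any list value fall inside the column's range?
       let any_inside = list.iter().any(|v| col.lower <= *v && *v <= col.upper);
       if !any_inside {
           // No list value can match any value in the range.
           Truth::CertainlyFalse
       } else if col.lower == col.upper {
           // The column holds exactly one value and the list contains it.
           Truth::CertainlyTrue
       } else {
           Truth::Uncertain
       }
   }
   ```

   For example, a range of `[10, 20]` with the list `(1, 2, 30)` is certainly false, so the container could be pruned, while the list `(1, 15)` leaves the result uncertain.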
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Consolidate interval analyses from `Interval` and `PruningPredicate` [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #7887:
URL: https://github.com/apache/arrow-datafusion/issues/7887#issuecomment-1944213049

   I just updated this ticket's description with a more coherent story and examples that @appletreeisyellow and I have hit recently while working on https://github.com/apache/arrow-datafusion/issues/9171
   
   We were talking today and I think @appletreeisyellow  may try to prototype what this solution could look like, if she has time, to move the conversation forward. 




Re: [I] Consolidate interval analyses from `Interval` and `PruningPredicate` [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #7887:
URL: https://github.com/apache/arrow-datafusion/issues/7887#issuecomment-1773039963

   @ozankabak and @tustvold  I swear we have talked about this topic before but I could not find an existing ticket or discussion. Do you have any other pointers to past discussions?




Re: [I] Consolidate interval analyses from `Interval` and `PruningPredicate` [arrow-datafusion]

Posted by "appletreeisyellow (via GitHub)" <gi...@apache.org>.
appletreeisyellow commented on issue #7887:
URL: https://github.com/apache/arrow-datafusion/issues/7887#issuecomment-1944266937

   Thank you @alamb for updating the description




Re: [I] Consolidate interval analyses from `Interval` and `PruningPredicate` [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #7887:
URL: https://github.com/apache/arrow-datafusion/issues/7887#issuecomment-1773911548

   > I am not sure this is what you searched for but there was an issue https://github.com/apache/arrow-datafusion/issues/5535.
   
   Thank you -- this is exactly what I was looking for. 
   
   > I will again think about how to insert Interval library there without sacrificing performance.
   
   I may be able to help with this too. One way is to use `Datum` (from arrow) rather than `ScalarValue` in the `Interval` representation -- that way each bound can store a single value or multiple values.
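
   A minimal sketch of that idea, using only the standard library: the `Bound` enum below is a hypothetical stand-in for a `Datum`-like value (it is not the arrow `Datum` trait), holding either one value or one value per container. `Interval`, `may_be_lt`, and `always_lt` here are likewise illustrative names, and a real implementation would presumably use arrow compute kernels instead of an explicit loop.

   ```rust
   /// Hypothetical bound: a single value, or one value per container
   /// (for example, one per row group).
   #[derive(Debug, Clone, PartialEq)]
   enum Bound {
       Scalar(i64),
       Column(Vec<i64>),
   }

   impl Bound {
       /// Value for container `i`; a scalar is broadcast to every container.
       /// Panics if `i` is out of range for a `Column`.
       fn get(&self, i: usize) -> i64 {
           match self {
               Bound::Scalar(v) => *v,
               Bound::Column(vs) => vs[i],
           }
       }
   }

   /// Hypothetical interval whose bounds cover one or many containers.
   struct Interval {
       lower: Bound,
       upper: Bound,
   }

   /// For each of `n` containers: can `col < literal` hold for some row?
   /// A `false` entry means the container can be pruned.
   fn may_be_lt(col: &Interval, literal: i64, n: usize) -> Vec<bool> {
       (0..n).map(|i| col.lower.get(i) < literal).collect()
   }

   /// For each of `n` containers: does `col < literal` hold for every row?
   fn always_lt(col: &Interval, literal: i64, n: usize) -> Vec<bool> {
       (0..n).map(|i| col.upper.get(i) < literal).collect()
   }
   ```

   With per-container bounds of `[0, 3]`, `[10, 20]`, and `[4, 9]`, evaluating `col < 5` once over all three containers keeps the first and third and prunes the second.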




Re: [I] Consolidate interval analyses from `Interval` and `PruningPredicate` [arrow-datafusion]

Posted by "berkaysynnada (via GitHub)" <gi...@apache.org>.
berkaysynnada commented on issue #7887:
URL: https://github.com/apache/arrow-datafusion/issues/7887#issuecomment-1773792948

   I am not sure this is what you searched for but there was an issue https://github.com/apache/arrow-datafusion/issues/5535.
   
   Actually, I have tried applying the `cp_solver` strategy to prune row groups, but we observed a performance degradation: this method gives up vectorized computation, so the process has to be run separately for each set of statistics. As the number of sets increases, efficiency decreases.
   
   I will again think about how to insert Interval library there without sacrificing performance.

