Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/11/07 10:34:34 UTC

[I] Introduce a way to represent constrained statistics [arrow-datafusion]

alamb opened a new issue, #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078

   ### Is your feature request related to a problem or challenge?
   
   This has come up a few times, most recently in discussions with @berkaysynnada  on https://github.com/apache/arrow-rs/issues/5037#issuecomment-1796384939
   
   Use case 1 is that for large binary/string columns, formats like parquet allow storing a truncated value that does not actually appear in the data. Since these values are stored in the min/max metadata, storing truncated values keeps the size of the metadata down.
   
   For example, for a string column that has very long values, it requires much less space to store a short value slightly _lower_ than the actual minimum as the "minimum" statistics value, and one that is slightly _higher_ than the actual maximum as the "maximum" statistics value.
   
   For example:
   
   | actual min in data | actual max in data | "min" value in statistics | "max" value in statistics |
   |--------|--------|--------|--------|
   | `aaa......z` | `qqq......q` | `a` | `r` | 
   
   
   There is a similar use case when applying a Filter, as described by @korowa on https://github.com/apache/arrow-datafusion/issues/5646#issuecomment-1796178380, and we have a similar one in IOx, where an operator may remove values but will not decrease the minimum value or increase the maximum value in any column.
   
   Currently [`Precision`](https://github.com/apache/arrow-datafusion/blob/e95e3f89c97ae27149c1dd8093f91a5574210fe6/datafusion/common/src/stats.rs#L29-L36) only represents `Exact` and `Inexact`; there is no way to represent "inexact, but bounded above/below".
   
   ### Describe the solution you'd like
   
   Per @berkaysynnada, I propose changing `Precision::Inexact` to a new variant `Precision::Between`, which would store an [`Interval`](https://docs.rs/datafusion/latest/datafusion/physical_expr/intervals/struct.Interval.html) of the known min/max of the value.
    
   ```rust
   enum Precision {
     ...
     /// The value is known to be in the specified interval
     Between(Interval)
   }
   ```
   
   This is a quite general formulation, and it can describe "how" inexact the values are. 
   
   This would have the benefit of being very expressive (Intervals can represent open/closed bounds, etc)
   
   ### Describe alternatives you've considered
   
   There is also a possibility of introducing a simpler, but more limited version of these statistics, like:
   
   ```rust
   enum Precision {
     // The value is known to be within the range (it is at most this large for Max, or at least this large for Min)
     // but the actual values may be lower/higher. 
     Bounded(ScalarValue)
   }
   ```
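
   For illustration, here is a minimal, hypothetical sketch (not existing DataFusion code; the `ColumnStats` struct is made up for the example) of how the truncated parquet string statistics from the table above could be represented with this simpler alternative:

   ```rust
   /// Hypothetical, simplified stand-in for `Precision<T>`, extended with the
   /// proposed `Bounded` variant.
   #[derive(Debug, Clone, PartialEq)]
   enum Precision<T> {
       /// The exact value is known
       Exact(T),
       /// The value is unknown, but it is bounded by this value
       /// (an upper bound for a max statistic, a lower bound for a min statistic)
       Bounded(T),
       /// Nothing is known about the value
       Absent,
   }

   /// Hypothetical column statistics for the string example above: the actual
   /// min/max are long strings, but parquet stored the truncated values "a" and "r".
   struct ColumnStats {
       min_value: Precision<String>,
       max_value: Precision<String>,
   }

   fn truncated_string_stats() -> ColumnStats {
       ColumnStats {
           // "a" is not the real minimum, but no value in the data is smaller than "a"
           min_value: Precision::Bounded("a".to_string()),
           // "r" is not the real maximum, but no value in the data is larger than "r"
           max_value: Precision::Bounded("r".to_string()),
       }
   }

   fn main() {
       let stats = truncated_string_stats();
       assert_eq!(stats.min_value, Precision::Bounded("a".to_string()));
       assert_eq!(stats.max_value, Precision::Bounded("r".to_string()));
   }
   ```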
   
   ### Additional context
   
   _No response_




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1804639527

   > Any thoughts on @berkaysynnada's proposal; i.e.
   
   I really like it. 
   
   My plan was to spend time trying to code up some of these proposals and see if they can fit in the existing framework (and thus potentially reveal in what cases `Inexact` would be used), etc.
   
   However, I spent much more of my time this week on https://github.com/apache/arrow-rs/issues/5050 and related discussions around https://github.com/apache/arrow-datafusion/issues/7994 and https://github.com/apache/arrow-datafusion/pull/8017 than I had planned, so I am sadly behind
   
   I hope to make progress tomorrow




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1810699396

   In general I am strugg
   
   




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "berkaysynnada (via GitHub)" <gi...@apache.org>.
berkaysynnada commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1801750904

   > I think there is a mismatch between the current use of `Precision::Inexact` and its documentation
   > 
   > Specifically, `Precision::Inexact` appears to be treated as a "conservative estimate" in several places (which is actually what we need in IOx), so perhaps another option would be to rename `Precision::Inexact` to `Precision::Conservative` and document that it is a conservative estimate of the actual value 🀔
   > 
   
   I think the documentation is correct but you have found 2 bugs in the code:
   1) Here, we must ensure the values are exact.
   https://github.com/apache/arrow-datafusion/blob/87aeef5aa4d8f38a8328f8e51e530e6c9cd9afa9/datafusion/core/src/datasource/statistics.rs#L69-L73
   
   2) Singleton check must be for exact values only.
   https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L281-L286
   
   I will open a PR now fixing these issues.
   
   




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1810936320

   Thank you for investigating this. There are some parts I don't quite follow. Particularly,
   
   > I don't think the proposal (at least as I understand it) in https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1804546752, can capture the known min/max values
   
   I don't think this is entirely accurate. In addition to the two configurations you identified, it can also be:
   ```rust
   Precision::Range(
     PointEstimate::Exact(0), // min
     PointEstimate::Exact(N), // max
   )
   ```
   which captures the known min/max values. However, this loses our inexact guess `0.1 * N`. I think what you are trying to convey is that @berkaysynnada's proposal can not represent the triple of (*lower bound*, *guess*, *upper bound*). If this is what you mean, I agree with you.
   
   ## Example 2 `(x <= y)`
   ```
   x: Bounded { lower: 50, estimate: 0.75 * N, upper: 100 }, 
   y: Bounded { lower: 50, estimate: 0.75 * N, upper: 200 }
   ```
   I don't understand what you mean by this. In this snippet, lower/upper bounds refer to column bounds, while estimate refers to row count. Did you mean to write:
   ```
   x: Bounded { lower: 50, estimate: 75, upper: 100 }, 
   y: Bounded { lower: 50, estimate: 125, upper: 200 }
   row_count: Bounded { lower: 0, estimate: 62.5, upper: 100}
   ```
   where the 62.5 figure is derived from the factor:
   
   ![image](https://github.com/apache/arrow-datafusion/assets/2006258/dda3cb9d-538b-415e-9f06-62992f583943)
   
   which is 5/8.
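
   For reference, here is one way to reproduce the 5/8 figure (a sketch assuming uniform, independent distributions of `x` over `[50, 100]` and `y` over `[0, 200]`, which is presumably what the formula in the image expresses):

   ```rust
   // Derivation sketch (not from the issue itself): with x ~ Uniform(50, 100) and
   // y ~ Uniform(0, 200) independent, the selectivity of `x <= y` is
   //   P(x <= y) = integral over [50, 100] of ((200 - x) / 200) * (1 / 50) dx
   //             = 6250 / 10000 = 5/8 = 0.625
   fn main() {
       let (x_lo, x_hi) = (50.0_f64, 100.0_f64);
       let (y_lo, y_hi) = (0.0_f64, 200.0_f64);
       // Closed form of the integral of P(y >= x) over x's range
       let integral = (y_hi * (x_hi - x_lo) - (x_hi * x_hi - x_lo * x_lo) / 2.0) / (y_hi - y_lo);
       // Divide by x's range length to average over x's density
       let selectivity = integral / (x_hi - x_lo);
       assert!((selectivity - 0.625).abs() < 1e-12);
       // With N = 100 input rows, the estimated output row count is 62.5
       println!("row count estimate = {}", selectivity * 100.0);
   }
   ```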
   




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "berkaysynnada (via GitHub)" <gi...@apache.org>.
berkaysynnada commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1799688158

   > ```
   > enum Precision {
   >     Exact(T),
   >     Inexact(T),
   >     Absent,
   >     Between(Interval)
   > }
   > ```
   > 
   > 
   >     
   >   
   > 
   > @berkaysynnada under what circumstances would we use `Inexact`? All the cases I can think of `Inexact` now would be handled by `Between` (potentially with unbounded intervals)
   
   You are right, I cannot find any case to use Inexact. Any exact value can be relaxed to positive or negative side using intervals.




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1813372152

   Update here: I filed a larger Epic to track all the work that seems to be going on with statistics as it was getting pretty overwhelming to keep it all straight: https://github.com/apache/arrow-datafusion/issues/8227
   
   My current conclusion is that the current state of the `Statistics` code in DataFusion makes further work on this ticket somewhat intractable for a while.
   
   Thus, I plan to work on https://github.com/apache/arrow-datafusion/issues/8229 and  https://github.com/apache/arrow-datafusion/issues/8228 first and then come back here 
   




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1811098404

   >  I think what you are trying to convey is that @berkaysynnada's proposal can not represent the triple of (lower bound, guess, upper bound). If this is what you mean, I agree with you.
   
   Yes, this is what I was trying to say. Thank you
   
   > I don't understand what you mean by this. In this snippet, lower/upper bounds refer to column bounds, while estimate refers to row count. Did you mean to write:
   
   Yes, I think so (though I am not 100% sure about the 62.5 probability -- I need to double check my math)
   
   
   > Do we need the Exact variant anymore? It seems no. If lower, upper and estimate values are the same, it would be an exact estimate. This struct can implement a method is_exact to check this.
   
   That is a good point
   
   > Do we need the `Absent` variant anymore? I think yes. We may want to differentiate between "no information" and "all possible values" (i.e. [MIN, MAX]) in certain cases.
   
   I agree this is probably a useful special case. 
   
   > Should we use Interval in the bounded variant? I think yes, but we should probably start off without it and migrate when we move the interval library outside of the physical expression crate (this week hopefully).
   
   Yes, the more I play with the code, the more I think using Interval is important to avoid reimplementing the same thing over and over again
   
   > Should we use the name Precision still? IMO it is not a good name for this. We are making estimates after all. I think we should use the right name for what we are doing and use the name Estimate.
   
   I agree the term `Precision` is not a good match. I think `Estimate` is better, though I am still not entirely happy with that because in some cases it will be "known exactly" rather than an estimate. Though perhaps this case is technically just a very precise estimate (no error) 🀔




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1810386490

   I have been playing with and studying this code. While the suggestion from @ozankabak and @berkaysynnada in https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1804546752 is very general and can represent many types of uncertainty in statistics, I haven't found cases yet where that full generality is important.
   
   For example, I can't find (nor think of) an important case where the lower bound would be known with certainty but the upper bound was uncertain (as opposed to simply `TYPE::MAX`).
   
   Another example would be a use case where it is important to distinguish between ranges like:
   
   ```
   min: `PointEstimate::Absent`, max: `PointEstimate::Precise(value)`
   min: PointEstimate::Precise(TYPE::MIN), max: PointEstimate::Precise(value)
   ```
   
   Thus I am going to prototype what adding `Bounded` variant to `Precision` looks like.
   
   I also plan to encapsulate more of the checks into `Precision` so that if choose to go with a more general formulation we won't have to change as much of the rest of the code. 
   
   ```
   pub enum Precision<T: Debug + Clone + PartialEq + Eq + PartialOrd> {
       /// The exact value is known
       Exact(T),
       /// The exact value is not known, but the real value is known to be within
       /// the specified range: `lower <= value <= upper`. TODO: we could use
       /// `Interval` here instead, which could represent more complex cases (like
       /// open/closed bounds)
       Bounded { lower: T, upper: T },
       /// The value is not known exactly, but is likely close to this value.
       /// NOTHING can be assumed about the value for correctness purposes in this case.
       Inexact(T),
       /// Nothing is known about the value
       #[default]
       Absent,
   }
   
   ```
   
   I'll report back here with how it goes shortly.
   




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1810620494

   > Do we want to support something like this in the short term? Then we would need inexact bounds on both sides.
   
   Ok. Thank you for this point. I am still working through exactly what "inexact" means / when it would be used; I probably misunderstand.
   
   I plan to work on this ticket in several PRs. In the first few I plan to centralize the use of `Precision` a bit more (e.g. https://github.com/apache/arrow-datafusion/pull/8172) so it is clearer how it is used




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1810579445

   > When we decide to support slightly more complicated filters, say x <= y, we will probably need range estimations where both sides could be inexact. Therefore, I think it makes sense to keep things flexible and go with the first proposal. Obv. happy to discuss other perspectives.
   
   Do we want to support something like this in the short term? Then we would need inexact bounds on both sides.




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1810981010

   I like your proposal for the new type (reproduced below).
   ```rust
   pub enum Precision<T: Debug + Clone + PartialEq + Eq + PartialOrd> {
       /// The exact value is known
       Exact(T),
       /// The exact value is not known, but the real value is **known** to be within
       /// the specified range: `lower <= value <= upper` and furthermore is estimated
       /// to be `estimate`. However, the real value may fall anywhere in the range
       /// TODO: we could use `Interval` here instead, which could represent more complex cases (like
       /// open/closed bounds rather than using T::MIN/T::MAX)
       Bounded { lower: T, estimate: T, upper: T },
       /// Nothing is known about the value
       #[default]
       Absent,
   }
   ```
   A few questions and remarks:
   1. Do we need the `Exact` variant anymore? It seems no. If `lower`, `upper` and `estimate` values are the same, it would be an exact estimate. This struct can implement a method `is_exact` to check this.
   2. Do we need the `Absent` variant anymore? I think yes. We may want to differentiate between "no information" and "all possible values" (i.e. [MIN, MAX]) in certain cases.
   3. Should we use `Interval` in the bounded variant? I think yes, but we should probably start off without it and migrate when we move the interval library outside of the physical expression crate (this week hopefully).
   4. Should we use the name `Precision` still? IMO it is not a good name for this. We are making estimates after all. I think we should use the right name for what we are doing and use the name `Estimate`.
   
   Assuming these tweaks, the type would look like:
   ```rust
   pub enum Estimate<T: Debug + Clone + PartialEq + Eq + PartialOrd> {
       Range { lower: T, value: T, upper: T },
       /// Nothing is known about the value
       #[default]
       Absent,
   }
   ```
   or with `ScalarValue`, we'd have:
   ```rust
   pub enum Estimate {
       Range { lower: ScalarValue, value: ScalarValue, upper: ScalarValue },
       /// Nothing is known about the value
       #[default]
       Absent,
   }
   ```
   Once `Interval` moves out of `physical-expr`, we can use:
   ```rust
   pub enum Estimate {
       Range { bounds: Interval, value: ScalarValue },
       /// Nothing is known about the value
       #[default]
       Absent,
   }
   ```
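
   Regarding point 1 above, a minimal sketch (my code, just to make the idea concrete) of what the `is_exact` helper could look like on the generic form:

   ```rust
   use std::fmt::Debug;

   /// The generic form proposed above
   #[derive(Debug, Clone, PartialEq, Default)]
   pub enum Estimate<T: Debug + Clone + PartialEq + Eq + PartialOrd> {
       Range { lower: T, value: T, upper: T },
       /// Nothing is known about the value
       #[default]
       Absent,
   }

   impl<T: Debug + Clone + PartialEq + Eq + PartialOrd> Estimate<T> {
       /// The estimate is exact when the range collapses to a single point
       pub fn is_exact(&self) -> bool {
           matches!(self, Estimate::Range { lower, value, upper } if lower == value && value == upper)
       }
   }

   fn main() {
       assert!(Estimate::Range { lower: 5, value: 5, upper: 5 }.is_exact());
       assert!(!Estimate::Range { lower: 0, value: 5, upper: 10 }.is_exact());
       assert!(!Estimate::<i32>::Absent.is_exact());
   }
   ```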




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1806373051

   It is unlikely I will have time to work on this today -- I will pick it up next week
   




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "berkaysynnada (via GitHub)" <gi...@apache.org>.
berkaysynnada commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1801757513

   > Another challenge I found with using `Interval` as proposed in option 2 of [#8078 (comment)](https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1798813378) is more mechanical but real: `Interval` is in the `datafusion_physical_expr` crate but `Precision` is in `datafusion_common`, meaning I can't use `Interval` in `Precision` without moving code around. Not impossible to do, but a data point.
   > 
   > I will try prototyping option 1 suggested by @berkaysynnada [#8078 (comment)](https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1798813378) and see what I find
   
   Just so you know, I have completed the interval refactor and submitted it for our internal review. After that, interval arithmetic will be in `datafusion_expr`, and there will be no significant obstacles to moving it to `datafusion_common`.




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1799019396

   Actually, upon more thought I think the best next step here is for me to prototype what this would look like. I'll do that and report back




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1798262348

   A further wrinkle that may be worth thinking about, even if the solution is not to do anything, is that some formats, including parquet, special-case certain floating point values. There is some discussion on https://github.com/apache/parquet-format/pull/196 to try to make the statistics actually usable, and one potential outcome might be to just use totalOrder, so it may even be that no action is required. Just an FYI.




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1800094571

   Another challenge I found with using `Interval` as proposed in https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1798813378 is more mechanical but real: `Interval` is in the `datafusion_physical_expr` crate but `Precision` is in `datafusion_common`, meaning I can't use `Interval` in `Precision` without moving code around. Not impossible to do, but a data point.

   I will try prototyping option 1 suggested by @berkaysynnada https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1798813378 and see what I find




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "berkaysynnada (via GitHub)" <gi...@apache.org>.
berkaysynnada commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1798813378

   I have two proposals:
   
   1) If we have cases where the lower bound of a range might be inexact (or perhaps absent) and the upper bound is exact (or vice versa), we could define the following:
   
   ```
   enum Estimation {
       Exact(T),
       Inexact(T),
       Absent,
   }
   enum Precision {
       Estimation,
       // min - max
       Range(Estimation, Estimation)
   }
   ```
   
   2) If we don't need to consider the exactness of bounds and a simple interval suffices, this alternative could be used:
   
   ```
   enum Precision {
       Exact(T),
       Inexact(T),
       Absent,
       Between(Interval)
   }
   ```




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1798959468

   ```
   enum Precision {
       Exact(T),
       Inexact(T),
       Absent,
       Between(Interval)
   }
   ```
   
   @berkaysynnada under what circumstances would we use `Inexact`? All the cases I can think of `Inexact` now would be handled by `Between` (potentially with unbounded intervals)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1810844263

   After some thought, I think the key information we are trying to capture in statistics are:
   1. Known min/max values
   2. Estimated values that are imprecise (the result of applying some sort of estimate), but are better than just "somewhere between min/max"
   
   I don't think the proposal (at least as I understand it) in https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1804546752 can capture the known min/max values.
   
   However, perhaps something like this could:
   
   ```rust
   pub enum Precision<T: Debug + Clone + PartialEq + Eq + PartialOrd> {
       /// The exact value is known
       Exact(T),
       /// The exact value is not known, but the real value is **known** to be within
       /// the specified range: `lower <= value <= upper` and furthermore is estimated
       /// to be `estimate`. However, the real value may fall anywhere in the range
       /// TODO: we could use `Interval` here instead, which could represent more complex cases (like
       /// open/closed bounds rather than using T::MIN/T::MAX)
       Bounded { lower: T, estimate: T, upper: T },
       /// Nothing is known about the value
       #[default]
       Absent,
   }
   ```
   
   Maybe estimate should be `Option<T>` rather than just `T` 🀔
   
   # Example 1: Filter `x <= 10`
   Consider a filter where you know the column `x` has bounds `[1, 100]` and the row_count is `N`.
   
   After applying a filter of `x <= 10`, assuming an even distribution you get a selectivity of `0.1`, so we can conclude the following about the output `row_count`:
   1. It is still between 0 and N (the max value hasn't been changed by the filter)
   2. We estimate that the value is near `0.1 * N` given our cardinality estimation, but this is just an estimate
   
   As I understand @berkaysynnada's proposal https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1804546752, it would represent the output either as:
   
   
   ```rust
   Precision::PointEstimate(Inexact(0.1*N))
   ```
   
   or
   
   ```rust
   Precision::Range(
     PointEstimate::Exact(0), // min
     PointEstimate::Inexact(0.1 * N), // max
   )
   ```
   
   But neither captures the fact that we know for sure that row count lies between 0 and `N` (the max wasn't changed by the filter). Using a `Bounded` variant could capture this information:
   
   ```rust
   Precision::Bounded {
     lower: 0,
     estimate: 0.1 * N,
     upper: 100
   }
   ```
   
   Here is the setup in pictures for those who like that
   
   ```
        Output Statistics  for row_count
         value: estimated to be 0.1*N           
         min: Exactly 0           
         max: Exactly 100 (can not possibly be higher)         
                                  
                ▲
                │
   ┌────────────┴────────────┐
   │                         │
   │       FilterExec        │
   │         x <= 10         │
   │                         │
   └─────────────────────────┘
                ▲
                │
                │
                                  
       Input Statistics for row count:     
        value: Exactly N      
        min: Exactly 0            
        max: Exactly 100          
                                  
   ```
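
   To make the propagation concrete, here is a rough sketch (my code and names, not existing DataFusion APIs) of how a filter could map its input row-count statistic to the `Bounded` output above:

   ```rust
   /// Hypothetical stand-in for the proposed enum (not the real `Precision`)
   #[derive(Debug, Clone, PartialEq)]
   enum Precision {
       Exact(f64),
       Bounded { lower: f64, estimate: f64, upper: f64 },
       Absent,
   }

   /// Apply a selectivity estimate to an input row count: the bounds stay hard
   /// (a filter can only remove rows), while the point estimate is scaled.
   fn filter_row_count(input: &Precision, selectivity: f64) -> Precision {
       match input {
           // An exact input count N becomes "known to be in [0, N], estimated at selectivity * N"
           Precision::Exact(n) => Precision::Bounded {
               lower: 0.0,
               estimate: selectivity * n,
               upper: *n,
           },
           // An already-bounded input keeps its upper bound; only the estimate is scaled
           Precision::Bounded { estimate, upper, .. } => Precision::Bounded {
               lower: 0.0,
               estimate: selectivity * estimate,
               upper: *upper,
           },
           Precision::Absent => Precision::Absent,
       }
   }

   fn main() {
       let input = Precision::Exact(100.0); // N = 100 rows, known exactly
       let output = filter_row_count(&input, 0.1); // selectivity of `x <= 10`
       // Bounded { lower: 0.0, estimate: 10.0, upper: 100.0 }
       println!("{output:?}");
   }
   ```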
   
   
   # `x <= y` example
   
   The other example raised above is a predicate like `x <= y`. I assume the challenge is how we represent the column ranges with this information. Let's say we know that `x` was between `[50, 100]` and `y` was between `[0, 200]`, and there were `N` input rows.
   
   Assuming even and uncorrelated distributions, we would get a selectivity estimate of 0.75:
   
   ```
  │                                                      │
  │                                                      │
  │───────────────────────────────────────────────────────       y
  │                                                      │       min: 0
  │                                                      │       max: 200

  0             │                      │                200
                │                      │
                ├───────────────────────                         x
                │                      │                         min: 50
                │                      │                         max: 100

                0                     200
                                                                                   
                                                                                   
                  **     Only range that can be true      **                       
                   **                                    **                        
                    **************************************                         
                                                                                   
   ```
   
   We could again represent the output statistic with the proper range as well as the cardinality estimate like this, which captures both the fact that `y` can't be less than `50` and the `0.75` selectivity:
   
   ```
   x: Bounded { lower: 50, estimate: 0.75 * N, upper: 100 }, 
   y: Bounded { lower: 50, estimate: 0.75 * N, upper: 200 }
   ```




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1811250556

   FWIW what I am doing now is studying how the existing `Precision` struct is used, to see what a change might look like, and filing small cleanup PRs along the way (so that when we go to change the implementation it will be a small change).
   
   I will think on this more tonight and continue the work tomorrow.
   
   Proposed PRs so far:
   https://github.com/apache/arrow-datafusion/pull/8172
   https://github.com/apache/arrow-datafusion/pull/8174
   https://github.com/apache/arrow-datafusion/pull/8177




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1804546752

   Any thoughts on @berkaysynnada's proposal; i.e.
   
   ```
   enum PointEstimate {
       Exact(T),
       Inexact(T),
       Absent,
   }
   enum Precision {
       PointEstimate,
       Range(PointEstimate, PointEstimate)
   }
   ```
   
   I think extending the `Precision` struct like this, along with bugfixes like #8094, will address #8099.




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1799810503

   You still need `Inexact`; e.g. cases involving selectivity. Assume that you have a column whose bounds are `[1, 100]`. After applying a filter of `<= 10`, the row count estimate is an inexact estimate of `0.1 * N` where `N` was the estimate prior to the filter.
   
   On the contrary: If you have `Between`, you do not need `Exact` anymore -- An `Exact` value is simply a `Between` value with a singleton interval inside.
   
   @berkaysynnada's first proposal is the more general one: it allows range estimations where either bound can be an exact or inexact estimate. His second proposal is slightly simpler to implement (and leverages the interval library), but it constrains range estimations to exact bounds.




Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1800080462

   I think there is a mismatch between the current use of `Precision::Inexact` and its documentation 
   
   Specifically, `Precision::Inexact` appears to be treated as a "conservative estimate" in several places (which is actually what we need in IOx), so perhaps another option would be to rename `Precision::Inexact` to `Precision::Conservative` and document that it is a conservative estimate of the actual value 🀔
   
   For example the comments on `Precision::Inexact` say
   
   https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L32-L33
   
   However, it is used to skip processing files when the (inexact) row count is above the fetch/limit:
   
   https://github.com/apache/arrow-datafusion/blob/87aeef5aa4d8f38a8328f8e51e530e6c9cd9afa9/datafusion/core/src/datasource/statistics.rs#L69-L73
   
   I am pretty sure this is only valid if the estimate of the number of rows is conservative (aka the real value is at least as large as the statistics), but the code uses the value even for `Precision::Inexact`:
   
   
   https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L42-L46
   
   Another example is `ColumnStatistics::is_singleton`, which I think is also only correct if the statistics are conservative (aka the actual min is no lower than the reported min and the actual max is no larger than the reported max):
   
   https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L281-L286
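
   To illustrate the concern with a simplified sketch (my code, not the actual implementation linked above), a singleton conclusion is only sound when both bounds are exact:

   ```rust
   /// Simplified stand-ins for the real types, just to show the point
   #[derive(Debug, Clone, PartialEq)]
   enum Precision<T> {
       Exact(T),
       Inexact(T),
       Absent,
   }

   struct ColumnStatistics {
       min_value: Precision<i64>,
       max_value: Precision<i64>,
   }

   impl ColumnStatistics {
       /// Only exact, equal bounds prove the column holds a single value.
       /// With inexact statistics, min and max could both be reported as 5
       /// while the real data ranges over, say, 3..=7.
       fn is_singleton(&self) -> bool {
           match (&self.min_value, &self.max_value) {
               (Precision::Exact(min), Precision::Exact(max)) => min == max,
               _ => false,
           }
       }
   }

   fn main() {
       let exact = ColumnStatistics {
           min_value: Precision::Exact(5),
           max_value: Precision::Exact(5),
       };
       let inexact = ColumnStatistics {
           min_value: Precision::Inexact(5),
           max_value: Precision::Inexact(5),
       };
       assert!(exact.is_singleton());
       assert!(!inexact.is_singleton()); // cannot conclude "singleton" from estimates
   }
   ```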
   
   

