You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/21 23:59:14 UTC

[GitHub] [arrow-datafusion] freeformstu opened a new issue, #4314: Variance aggregation calculation does not work for a single item

freeformstu opened a new issue, #4314:
URL: https://github.com/apache/arrow-datafusion/issues/4314

   **Describe the bug**
   A clear and concise description of what the bug is.
   
   The variance aggregation calculation returns an error if there is only one item. But it should return `0`.
   
   If given the sequence `[1, 1, 1, 1]`, the variance will be zero. If given the sequence `[1]` the variance should also be zero.
   
   **To Reproduce**
   Steps to reproduce the behavior:
   
   Run the tests, there is a test which asserts that given a single item, variance will return an error.
   
   **Expected behavior**
   A clear and concise description of what you expected to happen.
   
   If given a sequence of length 1, the variance should calculate `0`.
   
   **Additional context**
   Add any other context about the problem here.
   
   The variance accumulator cites its reference implementation here:
   https://github.com/apache/arrow-datafusion/blob/master/datafusion/physical-expr/src/aggregate/variance.rs#L173-L179
   > /// The algrithm used is an online implementation and numerically stable. It is based on this paper:
   > /// Welford, B. P. (1962). "Note on a method for calculating corrected sums of squares and products".
   > /// Technometrics. 4 (3): 419–420. doi:10.2307/1266577. JSTOR 1266577.
   The summary of which can be found here https://www.tandfonline.com/doi/abs/10.1080/00401706.1962.10490022.
   
   As far as I can tell, this is just a bug not a feature of that algorithm.
   
   Additionally, there is another bug in the current logic:
   https://github.com/apache/arrow-datafusion/blob/master/datafusion/physical-expr/src/aggregate/variance.rs#L297-L307
   ```rust
           if count <= 1 {
               return Err(DataFusionError::Internal(
                   "At least two values are needed to calculate variance".to_string(),
               ));
           }
   
           if self.count == 0 {
               Ok(ScalarValue::Float64(None))
           } else {
               Ok(ScalarValue::Float64(Some(self.m2 / count as f64)))
           }
   ```
   If `count` is `0`, then we will always return an error even though that second `if` block suggests intent to return `None` in the case of zero.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] HaoYang670 commented on issue #4314: Variance aggregation calculation does not work for a single item

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #4314:
URL: https://github.com/apache/arrow-datafusion/issues/4314#issuecomment-1323083667

   cc @realno Could you please take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org